How to Set Up NAL (Nagios Alarm Handler) to monitor an EPICS network

Here’s how to install NAL using yum on a Red Hat Enterprise Linux 5 (x86) box.

Nagios default installation

Nagios is provided by the rpmforge repository, so you first have to enable that repository in yum.
See https://www.tecmint.com/enable-rpmforge-repository/ for instructions on how to do this.

After that, you can install Nagios. For the “server side”, you need the following packages:

  1) nagios: the main application package
  2) nagios-plugins: provides all the command scripts used to define Nagios services. In some cases there is also nagios-plugins-all (which is better)
  3) nagios-plugins-nrpe: provides the check_nrpe script used to communicate with Nagios clients and run remote services

    root> yum install -y nagios
    root> yum install -y nagios-plugins
    root> yum install -y nagios-plugins-nrpe
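
Before moving on, it can help to confirm that the three packages are really installed. The following is a minimal sketch, assuming an RPM-based system; the function name missing_packages is only illustrative and not part of Nagios or of the original notes.

```shell
# Print the name of every package in the argument list that is NOT installed.
# "rpm -q NAME" exits with a non-zero status when NAME is not installed.
missing_packages() {
    for pkg in "$@"; do
        rpm -q "$pkg" >/dev/null 2>&1 || echo "$pkg"
    done
}

# Usage (as root):
#   missing=$(missing_packages nagios nagios-plugins nagios-plugins-nrpe)
#   [ -z "$missing" ] || yum install -y $missing
```

An empty result means nothing is missing and the yum step can be skipped.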

With that I installed Nagios version 3.2.3.

Nagios: configuration

When you install Nagios with yum, the Apache configuration is done for you by default.

To access the web interface you must define the password for the nagiosadmin user (the default Nagios administrator). This password must be stored encrypted; you can use the htpasswd command to set it and save it in

         /etc/nagios/htpasswd.users
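
A minimal sketch of the htpasswd step follows. The helper name set_nagios_password is hypothetical. It relies on the fact that htpasswd's -c option creates the file (and overwrites an existing one), so -c is used only when the file is not there yet.

```shell
# Set or change the password of a web user in an htpasswd file.
# htpasswd prompts for the password and stores it encrypted.
set_nagios_password() {
    file=$1      # e.g. /etc/nagios/htpasswd.users
    user=$2      # e.g. nagiosadmin
    if [ -f "$file" ]; then
        htpasswd "$file" "$user"        # keep the other entries
    else
        htpasswd -c "$file" "$user"     # first user: create the file
    fi
}

# Usage (as root):
#   set_nagios_password /etc/nagios/htpasswd.users nagiosadmin
```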

Start the Apache and Nagios services (use restart instead of start for httpd if it is already running)

    root> service httpd start
    root> service nagios start

and check the Nagios web page at http://localhost.localdomain/nagios . If everything is correct, you should see the authentication popup. Once you are on the main page, you can monitor the localhost machine (Nagios provides some information about hosts and services); all the services should be OK, but in some cases you may have to check some permissions or configuration settings.

The main configuration file is /etc/nagios/nagios.cfg

In this file you can configure every feature of Nagios. We use most of the default options; the only parameter we enable is:

  • the servers directory: you can put all the servers’ cfg files in this directory, which keeps the setup tidy
 cfg_dir=/etc/nagios/servers

and create the folder:

 # mkdir /etc/nagios/servers
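
The two steps above can be combined in a small helper that is safe to run more than once. This is only a sketch; the function name enable_cfg_dir is an assumption, and it appends the cfg_dir line rather than uncommenting an existing one.

```shell
# Create the directory and make sure nagios.cfg contains exactly one
# matching cfg_dir line (nothing is appended if it is already there).
enable_cfg_dir() {
    cfg=$1       # e.g. /etc/nagios/nagios.cfg
    dir=$2       # e.g. /etc/nagios/servers
    mkdir -p "$dir"
    # -x: match the whole line, -F: treat the pattern as a fixed string
    grep -qxF "cfg_dir=$dir" "$cfg" || echo "cfg_dir=$dir" >> "$cfg"
}

# Usage (as root):
#   enable_cfg_dir /etc/nagios/nagios.cfg /etc/nagios/servers
```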

In the servers folder you define all the hosts you want to monitor. To keep things manageable, use two kinds of files:

  • HOST.cfg: defines how Nagios monitors the given host. You must create one file per host!
  • groups.cfg: lists the different groups of hosts. This is very useful for managing and monitoring a large number of machines

Example:

  • File servers/example.cfg:
define host{
       use                     linux-server            
       host_name               example
       alias                   example display in web interface
       address                 10.6.0.1
       notification_period     24x7
       icon_image              example.jpg
       }
define service{
       use                             local-service   
       host_name                       example
       service_description             PING
       check_command                   check_ping!100.0,20%!500.0,60%
       }
define service{
       use                             local-service   
       host_name                       example
       service_description             SSH
       check_command                   check_ssh
       }

In this code:

  1) notification_period is defined in /etc/nagios/objects/timeperiods.cfg; you can edit this file to add or change time periods
  2) icon_image is located in /usr/share/nagios/images/logos/; if you want to add new images you must save them there
  3) service_description is the service name displayed in the web interface
  4) check_command names the command to run; the plugins it relies on are located in /usr/lib/nagios/plugins

  • File servers/groups.cfg:
define hostgroup{
       hostgroup_name  example ; The name of the hostgroup
       alias           example @ MyLab ; Long name of the group
       members         localhost, example
       }

The members variable must list the host_name values defined earlier.
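
Since one file per host is required, writing them by hand gets tedious for many machines. Here is a minimal sketch that prints a host definition with a PING service, reusing the linux-server and local-service templates from the example above. The function name make_host_cfg and the host ioc01 in the usage line are purely illustrative.

```shell
# Print a host definition and a PING service for one machine.
make_host_cfg() {
    name=$1
    address=$2
    cat <<EOF
define host{
       use                     linux-server
       host_name               $name
       alias                   $name
       address                 $address
       notification_period     24x7
       }
define service{
       use                     local-service
       host_name               $name
       service_description     PING
       check_command           check_ping!100.0,20%!500.0,60%
       }
EOF
}

# Usage (as root):
#   make_host_cfg ioc01 10.6.0.2 > /etc/nagios/servers/ioc01.cfg
```

Remember to add each new host_name to the members line of groups.cfg as well.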

After these changes, verify the configuration files with

# nagios -v /etc/nagios/nagios.cfg

and then, if there are no errors, restart the service

# service nagios restart
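
The verify-then-restart sequence can be wrapped so that a broken configuration never reaches the running service. This is a sketch under the assumption that the nagios binary and the service command are available as in the rest of this page; the function name reload_nagios is made up for the example.

```shell
# Restart Nagios only if the configuration check succeeds.
reload_nagios() {
    cfg=${1:-/etc/nagios/nagios.cfg}
    if nagios -v "$cfg" >/dev/null; then
        service nagios restart
    else
        echo "Configuration check failed; Nagios was not restarted" >&2
        return 1
    fi
}

# Usage (as root):
#   reload_nagios
```

When the check fails, run nagios -v by hand to read the full error report.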

Nagios Default Folder Locations

With the default yum installation, Nagios uses the following locations on your hard disk

   * /etc/nagios/ - Nagios configuration folder
   * /var/log/nagios/nagios.log - Nagios log
   * /usr/share/nagios/ - Nagios docs, sounds, and images folder
   * /usr/lib/nagios/cgi/ - Nagios CGI folder
   * /usr/bin - Nagios binaries
   * /etc/httpd/conf.d/nagios.conf - Nagios Apache configuration file
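
A quick way to see whether your installation matches this layout is to test each path. The sketch below is only a convenience; the function name report_missing_paths is hypothetical.

```shell
# Print a line for every path in the argument list that does not exist.
report_missing_paths() {
    for p in "$@"; do
        [ -e "$p" ] || echo "missing: $p"
    done
}

# Usage:
#   report_missing_paths /etc/nagios /var/log/nagios/nagios.log \
#       /usr/share/nagios /usr/lib/nagios/cgi /etc/httpd/conf.d/nagios.conf
```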

Insert the EPICS Nagios Plugins

What you did in the chapters above was a generic Nagios installation/setup.

The Nagios plugin for EPICS is available here. Download the plugin and save it into

/usr/lib/nagios/plugins/

Make check_caget.sh executable

    root> chmod  +x check_caget.sh

Now verify that it is usable with:

    > ./check_caget.sh --help

Then verify with camonitor that a PV is reachable, e.g. for me giacchinHost:aiExample

    > camonitor giacchinHost:aiExample

Note: As of version 1.3 the plugin assumes that caget is present in /usr/bin. If that is not true at your site, fix it with a symbolic link, for example: ln -s /opt/epics/base-3.14.9/bin/linux-x86/caget /usr/bin/caget
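
The symbolic link from the note can be created defensively, so that an existing caget in /usr/bin is left alone. A minimal sketch; link_caget is an illustrative name and the source path in the usage line is the one from the note, which will differ at other sites.

```shell
# Link caget into place unless an executable is already there.
link_caget() {
    src=$1
    dst=${2:-/usr/bin/caget}
    if [ -x "$dst" ]; then
        echo "$dst already present"
    elif [ -x "$src" ]; then
        ln -s "$src" "$dst"
    else
        echo "caget not found at $src" >&2
        return 1
    fi
}

# Usage (as root):
#   link_caget /opt/epics/base-3.14.9/bin/linux-x86/caget
```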

Use the EPICS environment variables to avoid broadcasting to the network; for me the applicable values were:

    EPICS_CA_AUTO_ADDR_LIST=NO
    EPICS_CA_ADDR_LIST=127.0.0.1
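
If you prefer not to change your login environment, the two variables can be set for a single command only. This sketch uses the standard env utility; the wrapper name with_ca_addr_list is an assumption made for the example.

```shell
# Run one command with Channel Access restricted to the given address list.
# The variables are set only for that command, not for the calling shell.
with_ca_addr_list() {
    addr_list=$1
    shift
    env EPICS_CA_AUTO_ADDR_LIST=NO EPICS_CA_ADDR_LIST="$addr_list" "$@"
}

# Usage:
#   with_ca_addr_list 127.0.0.1 camonitor giacchinHost:aiExample
```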

I can therefore test the plugin with the following command (the second line is its output):

    > ./check_caget_dev_gw.sh -pv giacchinHost:aiExample -H 127.0.0.1
    STATE_OK: giacchinHost:aiExample 5 2007-11-16 15:23:18.560231  ; te: 0 sec.

If it correctly reports the status of your PV, you can continue the installation.
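
Nagios does not read the text of the reply to decide the service state; it uses the exit status of the plugin (0, 1, 2 or 3, as defined in the script at the end of this page). The following sketch translates that status into a name, which is handy when testing by hand. The function name state_name is illustrative.

```shell
# Translate a plugin exit status into the Nagios state name.
state_name() {
    case "$1" in
        0) echo OK ;;
        1) echo WARNING ;;
        2) echo CRITICAL ;;
        *) echo UNKNOWN ;;
    esac
}

# Usage:
#   ./check_caget_dev_gw.sh -pv giacchinHost:aiExample -H 127.0.0.1
#   echo "Nagios state: $(state_name $?)"
```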

Now install the EPICS logo image

Download the epics.gif image, available from the same place,

and install it:

   root> mv epics.gif /usr/share/nagios/images/logos/

Save the original Nagios setup and replace it

Go to the /etc folder and save the original setup

    root> tar cvf nagios.or.tar ./nagios/

Download the etc.nagios.tar archive, available at the same place, into that folder

and restore the nagios folder from it:

    root> tar xvf  ./etc.nagios.tar

Note: Now look through the files in /etc/nagios and adjust them to match your network setup. You will find an epicsExample.cfg that contains preset PV names; please adjust them to match yours.
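
Because the replacement overwrites /etc/nagios, it is worth making sure the backup really exists and is readable first. A minimal sketch of the backup step, assuming a tar that supports -C (GNU tar does); backup_dir is a hypothetical helper name. It refuses to overwrite an earlier backup.

```shell
# Archive a directory and check that the archive can be listed.
backup_dir() {
    dir=$1       # e.g. /etc/nagios
    out=$2       # e.g. /etc/nagios.or.tar
    if [ -e "$out" ]; then
        echo "$out already exists; not overwriting" >&2
        return 1
    fi
    parent=$(dirname "$dir")
    name=$(basename "$dir")
    # Store paths relative to the parent, like "tar cvf nagios.or.tar ./nagios/"
    tar -C "$parent" -cf "$out" "$name" && tar -tf "$out" >/dev/null
}

# Usage (as root):
#   backup_dir /etc/nagios /etc/nagios.or.tar
```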

NAGIOS check configuration file

For sanity checking, make sure you verify Nagios config files. This can be done like so

    root> nagios -v /etc/nagios/nagios.cfg

The above command shows any erroneous lines in the Nagios config files.

HTTPD configuration

Check that the line “include conf.d/*.conf” is present

in /etc/httpd/conf/httpd.conf

Check the paths in the file /etc/httpd/conf.d/nagios.conf

Create a file named .htaccess in both /usr/lib/nagios/cgi-bin/ and /usr/share/nagios/html/

containing:

   AuthName "Nagios Access"
   AuthType Basic
   AuthUserFile /etc/nagios/passwd
   require valid-user

Now create a nagios user with the following command:

    root> htpasswd -c /etc/nagios/passwd nagiosadmin
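
Since the same .htaccess content goes into two directories, a small helper avoids copy errors. This is a sketch only; write_htaccess is an illustrative name, and the content is exactly the four lines shown above.

```shell
# Write the same .htaccess file into every directory given after the
# password file path.
write_htaccess() {
    passwd_file=$1
    shift
    for dir in "$@"; do
        cat > "$dir/.htaccess" <<EOF
AuthName "Nagios Access"
AuthType Basic
AuthUserFile $passwd_file
require valid-user
EOF
    done
}

# Usage (as root):
#   write_htaccess /etc/nagios/passwd /usr/lib/nagios/cgi-bin /usr/share/nagios/html
```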

SELinux setup

For a first test, set SELinux to permissive mode with

   root> system-config-securitylevel
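
On a box without a graphical session, the same first test can be done from the command line. The sketch below uses getenforce and setenforce, which are not mentioned in the original notes; setenforce 0 switches to permissive mode only until the next reboot. The function name selinux_make_permissive is hypothetical.

```shell
# Switch SELinux to permissive mode if it is currently enforcing,
# then print the resulting mode.
selinux_make_permissive() {
    mode=$(getenforce)              # Enforcing, Permissive or Disabled
    if [ "$mode" = "Enforcing" ]; then
        setenforce 0                # temporary: lasts until reboot
    fi
    getenforce
}

# Usage (as root):
#   selinux_make_permissive
```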

NAGIOS as a Linux service

At this point of the basic Nagios configuration, restarting Nagios should succeed.

Restart your Apache service together with your Nagios service like so

    root> service httpd restart
    root> service nagios stop
    root> service nagios start
    root> service nagios status

Open your favorite web browser at http://localhost/nagios/

Log in as “nagiosadmin”, enter your password and enjoy!

NB. If you are using my etc.nagios.tar, the login password is “nagiosadmin”

See my Nagios screenshots in action:

Nagios Service Details

Nagios Alert Histogram

Nagios Status Map


Conclusions

Nagios comes with many other interesting features for free; look around and you will find plenty yourself. There is a cool Firefox plugin, https://exchange.nagios.org/directory/Addons/Frontends-%28GUIs-and-CLIs%29/Web-Interfaces/nagioschecker--2D-Firefox-Addon/details , which lets you monitor the PVs continuously during regular use of the browser.

Ralph Lange has by now carried out a test of NAL at BESSY. Many thanks to him: he has supported me since the idea of using Nagios first came to my mind. Thanks also to Maurizio Montis, who wrote a kickstart script to deploy a RHEL5 box equipped with Nagios and ready to use, and who adjusted and fixed the old notes for FC7 to suit the new OS, RHEL5.

More information about NAL can be found here. A special LivEPICS version (a fully equipped EPICS Linux live CD) with Nagios preset and ready to use is available here.

Thank you for your attention! Please give me your feedback, and feel free to drop me an email; I’ll be happy to continue working on this idea if someone is interested in using it.

Mauro Giacchini (INFN-LNL)

MauroGiacchini 15.54, 2 Dec 2011


The Plugin Script

/usr/lib/nagios/plugins/check_caget_dev_gw.sh script for Nagios

#!/bin/bash
#
#####################################################################################
#####################################################################################
##                           Nagios plugin to check EPICS PV Status                ##
#####################################################################################
#####################################################################################
#
# Script to retrieve EPICS PV Name status using the "caget" command.
# Written by Mauro Giacchini (mauro.giacchini@lnl.infn.it)
# Last Modified: 17-11-2007
#
# Usage: ./check_caget.sh -pv <PV name>
#
# Description:
#   	This script uses caget command to retrieve the PV status. 
#
# Limitations:
# 	This script has been tested on Linux Fedora Core 6.
#
# Output:
# 	The output contains the "te" (time elapsed), calculated as the difference
# between the PV's timestamp and the Linux "date" command (suggestion: use a common
# NTP server for the IOCs and the Nagios server box). The STATUS of the service
# (i.e. of the PV) follows these severity rules:
#
# Severity (none) >>>> STATE_OK		# OK = green
#
# Severity MINOR  >>>> STATE_WARNING	# WARNING = yellow
#
# Severity MAJOR  >>>> STATE_CRITICAL	# CRITICAL = red
#
# PV not found    >>>> STATE_UNKNOWN	# UNKNOWN = orange
#
# In case of Severity (none) it shows the stdout of "caget -a" with the "te" appended.
#
# Other notes:
#  Firefox Plugin : A Firefox extension is available to monitor the Nagios server.
#  https://exchange.nagios.org/directory/Addons/Frontends-%28GUIs-and-CLIs%29/Web-Interfaces/nagioschecker--2D-Firefox-Addon/details
#
# Nagios configuration setup: 
# 	You need to add the command to commands.cfg
# 
# define command{
# 	command_name	check_caget
# 	command_line	$USER1$/check_caget.sh -pv $ARG1$
# 	}
#
#	And, you need to add the service to services.cfg
#
# define service{
#        use         		generic-service	;
#        host_name		IOC_Example	;
#        service_description   	aiExample	;
#        is_volatile           	0		;
#        check_period		24x7		;
#        max_check_attempts    	3		;
#        normal_check_interval 	3		;
#        retry_check_interval  	1		;
#        contact_groups        	admins		;
#        notification_interval 	120		;
#        notification_period   	24x7		;
#        notification_options  	w,u,c,r		;
#        check_command         	check_caget!rootHost:aiExample	;
#        }
#
# then place this script in the /usr/lib/nagios/plugins/ on the Nagios box server.
# Don't forget to set the right execution permission to this file.
#
# Threshold and ranges: please, have a look at:
# http://nagios-plugins.org/doc/guidelines.html#THRESHOLDFORMAT
#
# Last: This script still needs debugging and fixups (exercise for reader) :-)
#
#####################################################################################
# DEBUGGING OPTION
# This option determines whether or not debugging messages are shown
# Values: 0=debugging off, 1=debugging on
DEBUG="0"
#####################################################################################
# CAGET LOCATION
# This option determines where the caget executable is located.
# The default /usr/bin/caget should be made with a symbolic link
# made by root (i.e.): ln -s /opt/epics/base-3.14.9/bin/linux-x86/caget /usr/bin/caget
CAGET_LOCATION=/usr/bin/caget
#####################################################################################
# Script exit status
STATE_OK=0		# OK = green
STATE_WARNING=1		# WARNING = yellow
STATE_CRITICAL=2	# CRITICAL = red
STATE_UNKNOWN=3  	# UNKNOWN = orange
VERSION="v1.3"
#####################################################################################
# print_revision() function
print_revision (){
echo "Check_caget (nagios-plugins 1.4 to nagios 2.9) (EPICS base 3.14.9) $VERSION"
}
#####################################################################################
# print_usage() function
print_usage() {
echo ""
echo "Usage: check_caget_dev_gw -pv <PV name> "
echo "Usage: check_caget_dev_gw -pv <PV name> -H <EPICS_CA_ADDR_LIST>"
echo "Usage: check_caget_dev_gw -pv <PV name> -p <EPICS_CA_SERVER_PORT>"
echo "Usage: check_caget_dev_gw -pv <PV name> -expval <EXPECTED VALUE>"
echo "Usage: check_caget_dev_gw [-h] [--help]"
echo "Usage: check_caget_dev_gw [-V]"
echo ""
}
#####################################################################################
# print_help() function
print_help() {
echo ""
print_usage
echo ""
echo "Script to retrieve the PV status for EPICS control systems."
echo ""
echo "This plugin is not developed by the Nagios Plugin group."
echo "Please do not e-mail them for support on this plugin, since"
echo "they won't know what you're talking about :P"
echo ""
echo "For contact info: mauro.giacchini@lnl.infn.it"
echo "Download : https://web.infn.it/epics/index.php/resources"
echo ""
}
#####################################################################################
# Check the caget presence.
verify_caget_presence() {
if ! type "$CAGET_LOCATION" >/dev/null 2>&1; then
echo "STATUS CRITICAL: caget not found (Did you set up the right Nagios USERn? _or_ caget not found!)"
exit $STATE_CRITICAL
fi
}
#####################################################################################
# Control caget plugin input parameters
EXPVAL=""
EPICS_CA_ADDR_LIST="" 	# Default YES
EPICS_CA_SERVER_PORT="" # Default 5064 _and_  	value > 5000
EPICS_CA_SERVER_PORT_MIN="5000"
while test -n "$1"; do
case "$1" in
--help)
print_help
exit $STATE_OK
;;
-h)
print_help
exit $STATE_OK
;;
-V)
print_revision
exit $STATE_OK
;;
-pv)
PVNAME=$2
shift
;;
-expval)
EXPVAL=$2
if [ -z "$EXPVAL" ]; then
echo "STATUS CRITICAL: Expected value absent"
exit $STATE_CRITICAL
fi
shift
;;
-H)
EPICS_CA_ADDR_LIST=$2
if [ -z "$EPICS_CA_ADDR_LIST" ]; then
echo "STATUS CRITICAL: Expected EPICS_CA_ADDR_LIST absent"
exit $STATE_CRITICAL
fi
export EPICS_CA_ADDR_LIST
EPICS_CA_AUTO_ADDR_LIST="NO"
export EPICS_CA_AUTO_ADDR_LIST
shift
;;
-p)
EPICS_CA_SERVER_PORT=$2
if [ -z "$EPICS_CA_SERVER_PORT" ]; then
echo "STATUS CRITICAL: Expected EPICS_CA_SERVER_PORT absent"
exit $STATE_CRITICAL
fi
if [ "$EPICS_CA_SERVER_PORT" -le "$EPICS_CA_SERVER_PORT_MIN" ]; then
echo "STATUS CRITICAL: EPICS_CA_SERVER_PORT lower than the minimum allowed (5001)"
exit $STATE_CRITICAL
fi
export EPICS_CA_SERVER_PORT
shift
;;
*)
echo ""
echo "Unknown argument: $1"
print_usage
exit $STATE_UNKNOWN
;;
esac
shift
done
verify_caget_presence
if [ -z "$PVNAME" ]; then
echo "STATUS CRITICAL: PV Name not specified"
exit $STATE_CRITICAL
fi
#####################################################################################
# FINALLY... RETRIEVING THE VALUES (caget)
#CAGET_REPLY=`caget -a $PVNAME`
CAGET_REPLY=$("$CAGET_LOCATION" -a "$PVNAME")
IFS=" "
read pvname date time value status severity<<END
$CAGET_REPLY
END
if [ -z "$pvname" ]; then
echo "STATE_UNKNOWN: $PVNAME not found"
exit $STATE_UNKNOWN
fi
#####################################################################################
# Absolute difference (in seconds) between the PV timestamp and the current time
dte1=$(date --date "$date $time" +%s)
dte2=$(date +%s)
diffSec=$((dte2-dte1))
te=${diffSec#-}		# strip a leading minus sign to get the absolute value
#    	echo "Time elapsed (sec.): $te"
#####################################################################################
# Output the NAGIOS status using an expected value
if [ -n "$EXPVAL" ]; then
if  [[ $value -eq $EXPVAL ]] ;
then echo "STATE_OK: Expected value ($EXPVAL) to $pvname match; te: $te sec."
exit $STATE_OK;
else  echo "STATUS CRITICAL: Expected value ($EXPVAL) to $pvname didn't match"
exit $STATE_CRITICAL; 
fi
fi
#####################################################################################
# Output the NAGIOS status using the Severity field
case $severity in
MAJOR)
echo "STATUS CRITICAL: $pvname in MAJOR severity status; te: $te sec."
exit $STATE_CRITICAL
;;
MINOR)
echo "STATE_WARNING: $pvname in MINOR severity status; te: $te sec."
exit $STATE_WARNING
;;
*)
echo "STATE_OK: $pvname $value $date $time $status ; te: $te sec."
exit $STATE_OK
;;
esac
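
To try the severity rules without a running IOC, the parsing and mapping part of the script can be exercised on a hand-written line. The sketch below mirrors the read and case statements above; the function name severity_to_state and the sample line in the usage note are made up for illustration, and the field layout (name, date, time, value, status, severity) is the one the script itself assumes for "caget -a".

```shell
# Map one "caget -a"-style line to the exit status the plugin would use.
severity_to_state() {
    # A here-document feeds the line to read in the current shell
    read -r pv date time value status severity <<END
$1
END
    [ -n "$pv" ] || return 3        # STATE_UNKNOWN: nothing returned
    case "$severity" in
        MAJOR) return 2 ;;          # STATE_CRITICAL
        MINOR) return 1 ;;          # STATE_WARNING
        *)     return 0 ;;          # STATE_OK
    esac
}

# Usage:
#   severity_to_state "rootHost:aiExample 2007-11-16 15:23:18.560231 5 HIHI MINOR"
#   echo $?
```

Note that, like the script, this treats any severity other than MINOR or MAJOR as OK.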