Title: How to monitor cluster servers using NRPE
Submitted By: Martin Mielke
Last Updated: 12 April 2005
Description: How to monitor cluster servers using NRPE

Solution:

HOWTO: NRPE + CLUSTERS

This solution has been proven to work under the following common scenario, although it should be possible to deploy it in an n-node environment:

* master-master cluster (2 nodes); that is, both nodes run applications
* load-balanced

Some definitions for such a scenario, among others:

* IP address of node A (IPa)
* IP address of node B (IPb)
* IP address of the cluster itself (IPc)
* cluster services (e.g. Oracle instances, exported filesystems, etc.)
* host services (e.g. system load, local filesystems such as /, /var, /opt, etc., depending on how you partitioned the hard disk or volume)

Remote checks that use TCP connections from the Nagios box, such as PING, check_http, check_ftp and check_tcp!port, don't represent a problem. In my case, I had to think of something for checking cluster services such as shared storage, Oracle instances, etc., because check_cluster segfaulted and check_cluster2 *always* returned "ok" (maybe this is a design philosophy).

Because you have multiple IP addresses on the cluster nodes, you need to tell NRPE to listen on all of them and not to bind to a single IP address. This has been taken from nrpe.cfg:

---
# SERVER ADDRESS
# Address that nrpe should bind to in case there are more than one interface
# and you do not want nrpe to bind on all interfaces.
# NOTE: This option is ignored if NRPE is running under either inetd or xinetd
#server_address=your.ip.address.here
---

Hint: keep 'server_address' commented out for this to work. If you leave it blank, NRPE won't even start.

Then define *both* cluster and host services in nrpe.cfg. Something like this:

---
# The following 6 lines are for host (node-specific) services:
#
# we want to monitor /, /var, zombie procs, system load, users on the system
# and the total number of procs.
#
command[check_disk_root]=/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /dev/vx/dsk/rootvol
command[check_disk_var]=/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /dev/vx/dsk/var
command[check_zombie_procs]=/usr/local/nagios/libexec/check_procs -w 5 -c 10 -s Z
command[check_load]=/usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20
command[check_users]=/usr/local/nagios/libexec/check_users -w 50 -c 75
command[check_total_procs]=/usr/local/nagios/libexec/check_procs -w 1000 -c 1200
#
# These lines define the cluster services
#
# a lot of check_disk stuff...
#
command[check_disk1]=/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /dev/vx/dsk/vgai1dg/pvgai101
command[check_disk2]=/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /dev/vx/dsk/vgai1dg/pvgai102
command[check_disk3]=/usr/local/nagios/libexec/check_disk -w 8% -c 2% -p /dev/vx/dsk/vgai1dg/pvgai103
.
.
.
# and some Oracle instances, for the sake of completeness
command[check_oracle_YOUR_ORACLESID1]=/usr/local/nagios/libexec/check_oracle --db YOUR_ORACLESID1
command[check_oracle_YOUR_ORACLESID2]=/usr/local/nagios/libexec/check_oracle --db YOUR_ORACLESID2
---

This part of nrpe.cfg must be the same on all cluster nodes. In this case, we have taken special care to configure both nodes identically; that is, even the device names and mount points are the same. If that is not your case, don't worry: it will still work, but you'll have to pay close attention to configuring everything correctly, otherwise you'll get a lot of false positives.
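Before touching the Nagios side, it can save debugging time to confirm that NRPE really answers on every address. This is only a quick sanity check, not part of the original procedure; the plugin path assumes a default source install under /usr/local/nagios, and IPa/IPb/IPc are the placeholder addresses defined above:

---
# Run from the Nagios server; each call should print the NRPE version string
# (e.g. "NRPE v2.0") if the daemon is reachable on that address.
/usr/local/nagios/libexec/check_nrpe -H IPa   # node A real address
/usr/local/nagios/libexec/check_nrpe -H IPb   # node B real address
/usr/local/nagios/libexec/check_nrpe -H IPc   # cluster virtual address
---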
Now it's time to define some things in hosts.cfg and services.cfg (and serviceextinfo.cfg if you use it):

* hosts.cfg:
------------

define host{
        use                     generic-host    ; Name of host template to use
        host_name               node-A
        alias                   MYCLUSTER (node 1)
        address                 IPa             ; <-- node A real IPv4 address
        check_command           check-host-alive
        max_check_attempts      10
        notification_interval   120
        notification_period     24x7
        notification_options    d,u,r
        }

define host{
        use                     generic-host    ; Name of host template to use
        host_name               node-B
        alias                   MYCLUSTER (node 2)
        address                 IPb             ; <-- node B real IPv4 address
        check_command           check-host-alive
        max_check_attempts      10
        notification_interval   120
        notification_period     24x7
        notification_options    d,u,r
        }

define host{
        use                     generic-host    ; Name of host template to use
        host_name               mycluster
        alias                   MYCLUSTER
        address                 IPc             ; <-- cluster virtual IPv4 address
        check_command           check-host-alive
        max_check_attempts      10
        notification_interval   120
        notification_period     24x7
        notification_options    d,u,r
        }

* services.cfg:
---------------

For node-A (just some services in this example):

define service{
        use                     generic-service ; Name of service template to use
        host_name               node-A
        service_description     SysLoad
        is_volatile             0
        check_period            24x7
        max_check_attempts      3
        normal_check_interval   5
        retry_check_interval    1
        contact_groups          admins
        notification_interval   120
        notification_period     24x7
        notification_options    c,r
        check_command           check_nrpe!check_load
        }

define service{
        use                     generic-service ; Name of service template to use
        host_name               node-A
        service_description     ROOT_DISK
        is_volatile             0
        check_period            24x7
        max_check_attempts      3
        normal_check_interval   5
        retry_check_interval    1
        contact_groups          admins
        notification_interval   120
        notification_period     24x7
        notification_options    c,r
        check_command           check_nrpe!check_disk_root
        }

Proceed the same way for node-B, or just define the services you want to check.

For the cluster (again, just some example services):

define service{
        use                     generic-service ; Name of service template to use
        host_name               mycluster
        service_description     SysLoad
        is_volatile             0
        check_period            24x7
        max_check_attempts      3
        normal_check_interval   5
        retry_check_interval    1
        contact_groups          admins
        notification_interval   120
        notification_period     24x7
        notification_options    c,r
        check_command           check_nrpe!check_load
        }

define service{
        use                     generic-service ; Name of service template to use
        host_name               mycluster
        service_description     ORACLE_MYORACLESID
        is_volatile             0
        check_period            24x7
        max_check_attempts      3
        normal_check_interval   5
        retry_check_interval    1
        contact_groups          admins
        notification_interval   120
        notification_period     24x7
        notification_options    c,r
        check_command           check_nrpe!check_oracle_myoraclesid
        }

After all these steps, you will end up with 3 (or more, depending on how many nodes your cluster is made of) machines on the web interface:

* node-A
    checking for SysLoad
    checking for ROOT_DISK

* node-B
    checking for AnotherService_1
    checking for AnotherService_2

* cluster
    checking for SysLoad
    checking for ORACLE_MYORACLESID

Please send any comments or corrections to martin@mielke.com__but_remove_this_crap_first_:-)
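Note that the service definitions above assume a check_nrpe command object already exists on the Nagios server (usually in checkcommands.cfg or commands.cfg); it is not shown in the excerpts above. If your installation doesn't have one yet, a minimal sketch would look like the following, with the plugin path again assuming a default /usr/local/nagios source install:

---
# Sketch of a check_nrpe command definition for the Nagios server side.
# $ARG1$ is the name of the command[] entry defined in nrpe.cfg on the node,
# e.g. check_load or check_disk_root from the examples above.
define command{
        command_name    check_nrpe
        command_line    /usr/local/nagios/libexec/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
        }
---

With this in place, check_nrpe!check_load on host node-A runs the check_load command from that node's nrpe.cfg, while the same service attached to host mycluster reaches whichever node currently holds the cluster address IPc.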