Title: How to monitor cluster servers using NRPE
Submitted By: Martin Mielke
Last Updated: 12 April 2005
Description: How to monitor cluster servers using NRPE

Solution:

HOWTO: NRPE + CLUSTERS

This solution has been proven to work under the following common scenario, although it should be possible to deploy it in an n-node environment:

* master-master cluster (2 nodes); that is, both nodes run applications
* load-balanced

Some definitions for such a scenario, among others:

* IP address of node A (IPa)
* IP address of node B (IPb)
* IP address of the cluster itself (IPc)
* cluster services (e.g. Oracle instances, exported filesystems, etc.)
* host services (e.g. system load, local filesystems such as /, /var, /opt, etc., depending on how you partitioned the hard disk or volume)

Remote checks that use TCP connections from the Nagios box, such as PING, check_http, check_ftp and check_tcp!port, don't represent a problem. In my case, I had to think of something for checking cluster services such as shared storage, Oracle instances, etc., because check_cluster segfaulted and check_cluster2 *always* returned "ok" (maybe this is a design philosophy).

Because you have multiple IP addresses on the cluster nodes, you need to tell NRPE to listen on all of them and not to bind to a single IP address. This has been taken from nrpe.cfg:

---
# SERVER ADDRESS
# Address that nrpe should bind to in case there are more than one interface
# and you do not want nrpe to bind on all interfaces.
# NOTE: This option is ignored if NRPE is running under either inetd or xinetd
#server_address=your.ip.address.here
---

Hint: keep 'server_address' commented out for this to work. If you leave it blank, NRPE won't even start.

Then define *both* cluster and host services in nrpe.cfg. Something like this:

---
# The following 6 lines are for host (node-specific) services:
#
# we want to monitor /, /var, zombie procs, system load, users on the system
# and the total number of procs.
#
command[check_disk_root]=/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /dev/vx/dsk/rootvol
command[check_disk_var]=/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /dev/vx/dsk/var
command[check_zombie_procs]=/usr/local/nagios/libexec/check_procs -w 5 -c 10 -s Z
command[check_load]=/usr/local/nagios/libexec/check_load -w 15,10,5 -c 30,25,20
command[check_users]=/usr/local/nagios/libexec/check_users -w 50 -c 75
command[check_total_procs]=/usr/local/nagios/libexec/check_procs -w 1000 -c 1200
#
# These lines define the cluster services
#
# a lot of check_disk stuff...
#
command[check_disk1]=/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /dev/vx/dsk/vgai1dg/pvgai101
command[check_disk2]=/usr/local/nagios/libexec/check_disk -w 20% -c 10% -p /dev/vx/dsk/vgai1dg/pvgai102
command[check_disk3]=/usr/local/nagios/libexec/check_disk -w 8% -c 2% -p /dev/vx/dsk/vgai1dg/pvgai103
.
.
.
# and some Oracle instances, for the sake of completeness
command[check_oracle_YOUR_ORACLESID1]=/usr/local/nagios/libexec/check_oracle --db YOUR_ORACLESID1
command[check_oracle_YOUR_ORACLESID2]=/usr/local/nagios/libexec/check_oracle --db YOUR_ORACLESID2
---

This part of nrpe.cfg must be the same on all cluster nodes. In this case, we have taken special care to configure both nodes identically; that is, even the device names and mount points are the same. If that is not your case, don't worry: it will still work, but you'll have to pay close attention to configuring everything correctly, otherwise you'll get a lot of false positives.
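Before touching the Nagios side, it can save debugging time to confirm that NRPE really answers on every address. This is only a quick sanity check, not part of the original procedure; the plugin path assumes a default source install under /usr/local/nagios, and IPa/IPb/IPc are the placeholder addresses defined above:

---
# Run from the Nagios server; each call should print the NRPE version string
# (e.g. "NRPE v2.0") if the daemon is reachable on that address.
/usr/local/nagios/libexec/check_nrpe -H IPa   # node A real address
/usr/local/nagios/libexec/check_nrpe -H IPb   # node B real address
/usr/local/nagios/libexec/check_nrpe -H IPc   # cluster virtual address
---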
Now it's time to define some things in hosts.cfg and services.cfg (and serviceextinfo.cfg if you use it):

* hosts.cfg:
------------

define host{
        use                     generic-host    ; Name of host template to use
        host_name               node-A
        alias                   MYCLUSTER (node 1)
        address                 IPa             ; <-- node A real IPv4 address
        check_command           check-host-alive
        max_check_attempts      10
        notification_interval   120
        notification_period     24x7
        notification_options    d,u,r
        }

define host{
        use                     generic-host    ; Name of host template to use
        host_name               node-B
        alias                   MYCLUSTER (node 2)
        address                 IPb             ; <-- node B real IPv4 address
        check_command           check-host-alive
        max_check_attempts      10
        notification_interval   120
        notification_period     24x7
        notification_options    d,u,r
        }

define host{
        use                     generic-host    ; Name of host template to use
        host_name               mycluster
        alias                   MYCLUSTER
        address                 IPc             ; <-- cluster virtual IPv4 address
        check_command           check-host-alive
        max_check_attempts      10
        notification_interval   120
        notification_period     24x7
        notification_options    d,u,r
        }

* services.cfg:
---------------

For node-A (just some services in this example):

define service{
        use                     generic-service ; Name of service template to use
        host_name               node-A
        service_description     SysLoad
        is_volatile             0
        check_period            24x7
        max_check_attempts      3
        normal_check_interval   5
        retry_check_interval    1
        contact_groups          admins
        notification_interval   120
        notification_period     24x7
        notification_options    c,r
        check_command           check_nrpe!check_load
        }

define service{
        use                     generic-service ; Name of service template to use
        host_name               node-A
        service_description     ROOT_DISK
        is_volatile             0
        check_period            24x7
        max_check_attempts      3
        normal_check_interval   5
        retry_check_interval    1
        contact_groups          admins
        notification_interval   120
        notification_period     24x7
        notification_options    c,r
        check_command           check_nrpe!check_disk_root
        }

Proceed the same way for node-B, or just define the services you want to check.

For the cluster (again, just some example services):

define service{
        use                     generic-service ; Name of service template to use
        host_name               mycluster
        service_description     SysLoad
        is_volatile             0
        check_period            24x7
        max_check_attempts      3
        normal_check_interval   5
        retry_check_interval    1
        contact_groups          admins
        notification_interval   120
        notification_period     24x7
        notification_options    c,r
        check_command           check_nrpe!check_load
        }

define service{
        use                     generic-service ; Name of service template to use
        host_name               mycluster
        service_description     ORACLE_MYORACLESID
        is_volatile             0
        check_period            24x7
        max_check_attempts      3
        normal_check_interval   5
        retry_check_interval    1
        contact_groups          admins
        notification_interval   120
        notification_period     24x7
        notification_options    c,r
        check_command           check_nrpe!check_oracle_myoraclesid
        }

After all these steps, you will end up with 3 (or more, depending on how many nodes your cluster is made of) machines on the web interface:

* node-A
    checking for SysLoad
    checking for ROOT_DISK

* node-B
    checking for AnotherService_1
    checking for AnotherService_2

* cluster
    checking for SysLoad
    checking for ORACLE_MYORACLESID

Please send any comments or corrections to martin@mielke.com__but_remove_this_crap_first_:-)
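Note that the service definitions above assume a check_nrpe command object already exists on the Nagios server (usually in checkcommands.cfg or commands.cfg); it is not shown in the excerpts above. If your installation doesn't have one yet, a minimal sketch would look like the following, with the plugin path again assuming a default /usr/local/nagios source install:

---
# Sketch of a check_nrpe command definition for the Nagios server side.
# $ARG1$ is the name of the command[] entry defined in nrpe.cfg on the node,
# e.g. check_load or check_disk_root from the examples above.
define command{
        command_name    check_nrpe
        command_line    /usr/local/nagios/libexec/check_nrpe -H $HOSTADDRESS$ -c $ARG1$
        }
---

With this in place, check_nrpe!check_load on host node-A runs the check_load command from that node's nrpe.cfg, while the same service attached to host mycluster reaches whichever node currently holds the cluster address IPc.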