Event handlers

Chris Wilson chris at aidworld.org
Wed Mar 23 19:22:30 CET 2005


Hi all,

I have an interesting problem with Nagios event handlers, and I hope
that someone here will be willing to help me with it.

I'm trying to use Nagios to monitor the realservers in our load balancer
setup, and an event handler to remove failing servers from the load
balancing pool, and add them back again when they come back up.

I have Nagios configured to monitor each server at the host level, a
number of services on the host, and a "virtual" service combining
various real services using the check_cluster plugin. 

The idea is that if too many of the services on the host go critical, or
the host itself goes down, it should be removed from the pool until it
recovers.

However, I'm having difficulty in making the script respond
appropriately to Nagios events. A typical sequence of Nagios events,
when I manually reboot one of the servers, goes like this:

* HOST DOWN
* HOST UP
* SERVICE CRITICAL (the host is "up", but the web server isn't running
yet)
* SERVICE OK (web server up and running)

At the moment, the script removes the host from the pool if either the
host is DOWN, or the service is not OK. However, in the above sequence,
there is a "false OK" period after the host is reported UP, and before
the service is reported CRITICAL. During this time, the host is in the
pool when it shouldn't be.
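
To make the "false OK" window concrete, here is a minimal sketch in
Python that replays the sequence above against the rule just described.
It is not the actual handler script: the names in_pool and replay are
hypothetical, and it assumes the handler only remembers the last host
state and the last service state it was told about.

```python
# Hypothetical model of the handler's rule: the server is in the pool
# only while the last known host state is UP and the last known
# service state is OK.
def in_pool(host_state, service_state):
    return host_state == "UP" and service_state == "OK"


def replay(events, host_state="UP", service_state="OK"):
    """Apply (kind, state) events in order and record pool membership."""
    timeline = []
    for kind, state in events:
        if kind == "HOST":
            host_state = state
        else:
            service_state = state
        timeline.append((kind, state, in_pool(host_state, service_state)))
    return timeline


# The event sequence seen when rebooting a server by hand.
REBOOT = [
    ("HOST", "DOWN"),
    ("HOST", "UP"),
    ("SERVICE", "CRITICAL"),
    ("SERVICE", "OK"),
]

timeline = replay(REBOOT)
for kind, state, pooled in timeline:
    print(f"{kind} {state}: {'in pool' if pooled else 'removed'}")
```

Under these assumptions the server goes back into the pool at HOST UP,
because the remembered service state is still the stale OK from before
the reboot, and is only removed again once SERVICE CRITICAL arrives.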

If I ignore either HOST or SERVICE events, then it doesn't work properly
either. If I reboot the host, then I don't get a SERVICE CRITICAL event
until after it comes back up, and maybe not even then, depending on
exactly when the next service check is scheduled.

It seems to me that although you don't want to send a notification about
services being down when the host is down, it would be useful for event
handlers to know about it. Then I could just write a script to pay
attention to the SERVICE and ignore the HOST state entirely.

Can anyone think of a way to improve the script or the event handling in
Nagios to make this work better? 

Perhaps I should have a fake host with check_dummy as the host
check_command, and copy the service state from the real host? That might
work, but it seems really ugly and hacky, not to mention difficult to
make efficient and quick to respond.

Perhaps an OCSP command which detects certain types of events, rewrites
them to a different host name and posts them back to Nagios as external
commands? Or maybe change my event handler script to an OCSP script?
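
A rough sketch of the rewriting idea, in Python. It assumes the standard
PROCESS_SERVICE_CHECK_RESULT external command format; the "-pool" suffix
for the shadow host and the function names are hypothetical, and actually
appending the line to the external command file (the command_file set in
nagios.cfg) is left out.

```python
import time

# Standard Nagios plugin return codes for service states.
STATE_CODES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


def passive_result(host, service, state, output, now=None):
    """Format a passive service check result as an external command line."""
    ts = int(time.time() if now is None else now)
    return (
        f"[{ts}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{STATE_CODES[state]};{output}"
    )


def rewrite_for_shadow(host, service, state, output, suffix="-pool", now=None):
    """Re-post a result under a hypothetical shadow host name."""
    return passive_result(host + suffix, service, state, output, now)


# Example: a CRITICAL result for HTTP on web1, rewritten to web1-pool.
line = rewrite_for_shadow(
    "web1", "HTTP", "CRITICAL", "Connection refused", now=1111602150
)
print(line)
```

The shadow host would need its own service definitions accepting passive
checks, so this only illustrates the shape of the command, not a complete
setup.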

Thanks in advance for your help.

Cheers, Chris.
-- 
(aidworld) chris wilson | chief engineer (chris at aidworld.org)


