service alert aggregation?

Joshua Barratt jbarratt at serialized.net
Tue Sep 30 04:43:00 CEST 2003


I just spent a very interesting afternoon reading through the last few 
months of list archives, but was unable to come up with an answer to my 
question. I apologize if this has been dealt with to death.

Basically, quite often if there is a problem with a host, many of its 
services will be down, but it will still be pingable. (The TCP/IP stack 
is a hardy beast.) Possible causes: disk filling up, RAM+swap filling 
up, very heavy load, etc. (even some kernel panics!) -- all of these can 
cause more than one service to become unreachable, and in many cases, 
*all* services unreachable -- but still the host check will not fail. 
This causes the admins to get a flurry of service down alerts, and, when 
the problem is corrected, a flurry of service up alerts.

I tried the service dependency route, but the basic problem remains: 
checks run whenever the Nagios scheduler gets to them, so it may decide 
that the SMTP server is critical, say, 2 minutes before deciding that 
the service SMTP depends on is critical, and thus you get paged for both.

Is it possible to configure things so you don't have that problem? I 
understand escalations, but that still doesn't really solve things, 
unless I'm missing something. I'll still get individual pages for every 
individual service that is experiencing a problem.

My idea (if simple configuration is not the solution) is to do something 
like this:
When a service alert is generated, instead of being emailed directly, it 
is emailed (or piped) to a script. That script then communicates with 
the Nagios daemon and schedules immediate checks for all the services on 
the affected server. It waits some suitable time period, and then 
packages all the alerts received within that window into a single 
message which it then sends to the admins. (The same process would 
happen with the service up alerts.)
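
A minimal sketch of the batching idea above, in Python. Everything here 
is hypothetical: the class name AlertAggregator, the 120-second window, 
and the digest layout are illustrative choices, not anything Nagios 
provides. The SCHEDULE_HOST_SVC_CHECKS line follows the Nagios external 
command format as I understand it, and should be checked against the 
documentation for the Nagios version in use before being written to the 
command file. Actually sending the digest (mail, pager) is left out.

```python
import time
from collections import defaultdict


def format_recheck_command(host, now=None):
    """Build one external-command line asking Nagios to recheck every
    service on `host` right away. The command name and field layout are
    an assumption about the installed Nagios; verify before use."""
    now = int(time.time() if now is None else now)
    return f"[{now}] SCHEDULE_HOST_SVC_CHECKS;{host};{now}\n"


class AlertAggregator:
    """Hypothetical helper: collect per-service alerts for a host and
    release them as a single digest once a time window has elapsed."""

    def __init__(self, window_seconds=120):
        self.window = window_seconds
        # host -> list of (timestamp, service, state)
        self.pending = defaultdict(list)

    def add(self, host, service, state, timestamp):
        """Record one service alert instead of paging immediately."""
        self.pending[host].append((timestamp, service, state))

    def flush_due(self, now):
        """Return one digest string per host whose first pending alert
        is at least `window` seconds old, and forget those alerts."""
        digests = []
        for host in list(self.pending):
            alerts = self.pending[host]
            first_seen = min(t for t, _, _ in alerts)
            if now - first_seen < self.window:
                continue
            # Keep only the most recent state reported for each service.
            latest = {}
            for _, service, state in sorted(alerts):
                latest[service] = state
            lines = [f"{svc}: {st}" for svc, st in sorted(latest.items())]
            header = f"{host}: {len(latest)} service(s) changed state"
            digests.append(header + "\n" + "\n".join(lines))
            del self.pending[host]
        return digests
```

The notification command would call add() for each alert and write the 
output of format_recheck_command() to the Nagios command file, while a 
periodic job calls flush_due() and sends whatever it returns. Recovery 
alerts can go through the same path, since only the latest state per 
service is kept in the digest.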

This might not be foolproof, but I think it would cut down on a lot of 
spurious paging.

Has anyone else solved this problem?

Thanks for any input,

Joshua Barratt



