nagios blocking on notifications?

Mike Lindsey mike-nagios at 5dninja.net
Fri Jan 15 03:22:31 CET 2010


It turns out Nagios doesn't fork before handling notifications, and it also 
waits for the children of any notification command to exit, so forking 
inside my notification script won't help.

I took the part of the script that was taking 5-6 seconds to complete 
and added a caching mechanism, which cut the notification cycle from 
90+ seconds to 6-8 seconds.
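
For anyone wanting to try the same trick, here is a rough sketch of an 
on-disk cache in Python. It is not my actual script: cached_call(), the 
cache path and the slow_lookup() function are all made-up names for 
illustration, and it assumes the slow step returns something that can be 
stored as JSON. The cache has to live on disk because every notification 
runs the script as a fresh process.

```python
import json
import os
import tempfile
import time


def cached_call(cache_path, ttl_seconds, compute):
    """Return compute()'s result, reusing a copy on disk if it is fresh enough."""
    try:
        age = time.time() - os.path.getmtime(cache_path)
        if age < ttl_seconds:
            with open(cache_path) as fh:
                return json.load(fh)
    except (OSError, ValueError):
        pass  # missing, unreadable or corrupt cache: fall through and recompute

    value = compute()

    # Write to a temp file in the same directory, then rename it into place,
    # so a concurrent reader never sees a half-written cache file.
    directory = os.path.dirname(os.path.abspath(cache_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        json.dump(value, fh)
    os.replace(tmp_path, cache_path)
    return value


if __name__ == "__main__":
    def slow_lookup():
        # Hypothetical stand-in for whatever takes 5-6 seconds in a real script.
        time.sleep(5)
        return {"status": "looked up"}

    path = os.path.join(tempfile.gettempdir(), "notify-cache.json")
    print(cached_call(path, 300, slow_lookup))
```

With something like this, only the first notification in a batch pays for 
the slow step; the rest read the cached copy until the TTL runs out.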

It might be overkill, but I've also wrapped some fork() logic around the 
service_notification() call inside handle_async_service_check_result().
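
The general shape of that idea, sketched in Python rather than Nagios's C 
so it can be run standalone. This is only an illustration of the 
fork-and-continue pattern, not the patch itself; run_detached() is a 
made-up helper name. The parent forks, the child does the slow work and 
exits, and the parent carries on immediately. Note that anything the child 
changes in memory is invisible to the parent after the fork.

```python
import os
import time


def run_detached(work, *args):
    """Run work(*args) in a forked child and return the child's PID at once."""
    pid = os.fork()
    if pid == 0:
        # Child process: do the slow work, then exit immediately with
        # os._exit() so none of the parent's cleanup code runs twice.
        status = 1
        try:
            work(*args)
            status = 0
        finally:
            os._exit(status)
    # Parent process: return right away without waiting for the child.
    return pid


if __name__ == "__main__":
    start = time.monotonic()
    child = run_detached(time.sleep, 2)  # stand-in for a slow notification
    print("parent resumed after %.3f seconds" % (time.monotonic() - start))

    # The parent still has to reap the child at some point, or it is left
    # behind as a zombie until the parent exits.
    _, raw_status = os.waitpid(child, 0)
    print("child exit code:", os.WEXITSTATUS(raw_status))
```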

It compiles and runs. I'll stress test it tonight and see how it does 
under real load tomorrow.

Also, if there's a better way to do this, I'm all ears.

Mike Lindsey wrote:
> I've got a high volume site.  Everything seems to keep up reasonably 
> well, unless there are a good number of state changes.  Once services 
> start changing state, and notifications start getting sent out, nagios 
> falls behind.
> 
> I did some digging in the logs, and it looks like while a batch of 
> notifications is being sent out, they are rate limited to about one 
> every five seconds.  Also, from the first notification for a service to 
> the last notification for that service, nothing else is written to the logs.
> 
> Since a typical notification goes out to 15+ people, that's over a 
> minute with no service check handling.
> 
> Is there something going on under the hood that I'm not aware of (like, 
> is it just not doing the log writing, but still doing the passive 
> service check handling, and there's something else causing my latency?)
> 
> Is that delay configurable?  I don't see anything in the docs for that.
> 
> I've even set my notification script to just call and background a 
> secondary script, to try and see if it wasn't a delay in the 
> notification script, but that seemed not to do anything at all.  Should 
> I be forking the notification script instead?
> 
> Here's a log snippet:
> [1263505735] EXTERNAL COMMAND: 
> PROCESS_SERVICE_CHECK_RESULT;<redacted>;System Check;0;OK load mem ntp 
> swap cfengine disk|
> [1263505735] EXTERNAL COMMAND: 
> PROCESS_SERVICE_CHECK_RESULT;<redacted>;System Check;0;OK load mem ntp 
> swap cfengine disk|
> [1263505735] EXTERNAL COMMAND: 
> PROCESS_SERVICE_CHECK_RESULT;<redacted>;System Check;1;WARNING [swap 
> utilization 25%] [/data/ at 77% (inodes 0%)]|
> [1263505735] PASSIVE SERVICE CHECK: 
> <redacted>;check_mtime-redlist.txt;0;OK - redlist.txt 102 seconds old
> [1263505735] PASSIVE SERVICE CHECK: <redacted>;pre_queuedepth;2;CRITICAL 
> - <redacted> pre_queuedepth status: 2159 > 500
> <There are close to 50 log entries with that time stamp>
> [1263505735] SERVICE NOTIFICATION: 
> <redacted>;<redacted>;pre_queuedepth;CRITICAL;notify-by-email;CRITICAL - 
> <redacted> pre_queuedepth status: 2159  500
> [1263505741] SERVICE NOTIFICATION: 
> <redacted>;<redacted>;pre_queuedepth;CRITICAL;notify-by-email;CRITICAL - 
> <redacted> pre_queuedepth status: 2159  500
> 
> 
> The SERVICE NOTIFICATION entries keep rolling in every 5-6 seconds for 
> the next minute+, then it goes back to its usual happy speed.
> 
> Is this an artifact of the way it logs, or is the whole system choking 
> while it sends email?  I've searched the list archives and not found 
> anything on this.
> 


-- 
Mike Lindsey
