distributed servers not updating master with host problems

Noel Platzke neufpas at gmail.com
Mon Nov 9 20:46:52 CET 2009


I've just started noticing a strange problem. If a host monitored by a
distributed server dies, the master server doesn't always receive the
information it needs to generate an email. This doesn't happen every time,
only on occasion, and I can't figure out why it's intermittent.

*This is an excerpt from the logs on the master:*
[11-09-2009 14:05:56] SERVICE ALERT: prodhv28;Alert 11003 - Check Local
Disks;CRITICAL;HARD;1;CRITICAL - Plugin timed out while executing system
call
[11-09-2009 14:05:26] SERVICE ALERT: prodhv28;Alert 11002 - Check
Load;CRITICAL;HARD;1;CRITICAL - Plugin timed out while executing system call
[11-09-2009 14:05:16] HOST ALERT: prodhv28;DOWN;SOFT;1;PING CRITICAL -
Packet loss = 100%
[11-09-2009 14:04:56] SERVICE ALERT: prodhv28;Alert 11002 - Check
Load;CRITICAL;SOFT;1;CRITICAL - Plugin timed out while executing system call

*But this is what I see on the slave:*
[11-09-2009 14:06:50] HOST ALERT: ashprdhv28;DOWN;HARD;3;PING CRITICAL -
Packet loss = 100%
[11-09-2009 14:05:29] SERVICE ALERT: ashprdhv28;Alert 11003 - Check Local
Disks;CRITICAL;HARD;1;CRITICAL - Plugin timed out while executing system
call
[11-09-2009 14:05:29] HOST ALERT: ashprdhv28;DOWN;SOFT;2;PING CRITICAL -
Packet loss = 100%
[11-09-2009 14:05:06] SERVICE ALERT: ashprdhv28;Alert 11002 - Check
Load;CRITICAL;HARD;1;CRITICAL - Plugin timed out while executing system call
[11-09-2009 14:04:24] HOST ALERT: ashprdhv28;DOWN;SOFT;1;PING CRITICAL -
Packet loss = 100%
[11-09-2009 14:04:04] SERVICE ALERT: ashprdhv28;Alert 11002 - Check
Load;CRITICAL;SOFT;1;CRITICAL - Plugin timed out while executing system call


The two subsequent host check attempts (SOFT;2 and HARD;3) never made it back
to the master. I'd chalk it up to a problem with NSCA, but I'm not missing any
service check results, only host check results.
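
One way to confirm exactly which alerts the master is missing is to diff the
two logs. The following is only a minimal sketch in Python, not part of any
Nagios tooling: the names parse_alerts and missing_on_master are made up for
illustration, the log lines are the excerpts above joined back onto single
lines, and the host name is deliberately ignored because the two excerpts show
different names (prodhv28 on the master, ashprdhv28 on the slave).

```python
import re

# Matches lines such as:
# [11-09-2009 14:05:16] HOST ALERT: prodhv28;DOWN;SOFT;1;PING CRITICAL - ...
ALERT_RE = re.compile(r"^\[[^\]]+\] (?P<kind>HOST|SERVICE) ALERT: (?P<body>.*)$")

MASTER_LOG = """\
[11-09-2009 14:05:56] SERVICE ALERT: prodhv28;Alert 11003 - Check Local Disks;CRITICAL;HARD;1;CRITICAL - Plugin timed out while executing system call
[11-09-2009 14:05:26] SERVICE ALERT: prodhv28;Alert 11002 - Check Load;CRITICAL;HARD;1;CRITICAL - Plugin timed out while executing system call
[11-09-2009 14:05:16] HOST ALERT: prodhv28;DOWN;SOFT;1;PING CRITICAL - Packet loss = 100%
[11-09-2009 14:04:56] SERVICE ALERT: prodhv28;Alert 11002 - Check Load;CRITICAL;SOFT;1;CRITICAL - Plugin timed out while executing system call
"""

SLAVE_LOG = """\
[11-09-2009 14:06:50] HOST ALERT: ashprdhv28;DOWN;HARD;3;PING CRITICAL - Packet loss = 100%
[11-09-2009 14:05:29] SERVICE ALERT: ashprdhv28;Alert 11003 - Check Local Disks;CRITICAL;HARD;1;CRITICAL - Plugin timed out while executing system call
[11-09-2009 14:05:29] HOST ALERT: ashprdhv28;DOWN;SOFT;2;PING CRITICAL - Packet loss = 100%
[11-09-2009 14:05:06] SERVICE ALERT: ashprdhv28;Alert 11002 - Check Load;CRITICAL;HARD;1;CRITICAL - Plugin timed out while executing system call
[11-09-2009 14:04:24] HOST ALERT: ashprdhv28;DOWN;SOFT;1;PING CRITICAL - Packet loss = 100%
[11-09-2009 14:04:04] SERVICE ALERT: ashprdhv28;Alert 11002 - Check Load;CRITICAL;SOFT;1;CRITICAL - Plugin timed out while executing system call
"""


def parse_alerts(log_text):
    """Return a set of (kind, service, state, state_type, attempt) tuples.

    The host name and timestamp are dropped on purpose (see the assumption
    stated above); service is "" for host alerts.
    """
    alerts = set()
    for line in log_text.splitlines():
        match = ALERT_RE.match(line.strip())
        if not match:
            continue  # not an alert line
        kind = match.group("kind")
        fields = match.group("body").split(";")
        if kind == "HOST" and len(fields) >= 4:
            # host;state;state_type;attempt;output
            service = ""
            state, state_type, attempt = fields[1:4]
        elif kind == "SERVICE" and len(fields) >= 5:
            # host;service;state;state_type;attempt;output
            service, state, state_type, attempt = fields[1:5]
        else:
            continue  # malformed line, skip it
        alerts.add((kind, service, state, state_type, int(attempt)))
    return alerts


def missing_on_master(master_log, slave_log):
    """Alerts logged by the slave that have no counterpart on the master."""
    return sorted(parse_alerts(slave_log) - parse_alerts(master_log))


for alert in missing_on_master(MASTER_LOG, SLAVE_LOG):
    print(alert)
```

For the excerpts above this reports only the two later host alerts (DOWN;HARD;3
and DOWN;SOFT;2); every service alert on the slave has a match on the master.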

I've currently got every host using the same template, which includes a
5-minute check interval with a 1-minute retry interval and a 600-second
freshness threshold. One of the servers should be kicking off these checks,
but for some reason the master isn't doing what it's configured to do. I'm
at a loss. Any ideas?
_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue. 
::: Messages without supporting info will risk being sent to /dev/null

