Nagios occasionally does not send notifications when a service goes down

Toby Kraft Toby_Kraft at KSAinc.com
Mon Feb 21 18:53:00 CET 2005


Hi all,

I've been using Nagios 1.2 (and Netsaint before) with some clients for a 
while.  One installation (on Fedora Core 2) has an issue where a service 
will go down, but Nagios does not send any notification.

The service check is a simple TCP port check.  The host_alive_check is
the default (ping), and the host can be pinged.  This host has one and
only one service.  It's a pretty vanilla install and everything works
fine most of the time.

This past weekend, a host went down.  No notifications were sent.  Monday 
morning the staff came in, saw the host was down and restarted it.  After 
they restarted the target host, Nagios then sent out a bunch of Host Down 
alerts followed by a Host Up alert.  Notifications for this service and
host were NOT disabled (nagios.log archives show they were enabled on 2/9/05).

Okay, now you're saying: it's your mail server.  But Nagios did not log
any notifications at the time of the problem!

The Host Alert History shows:
Sun Feb 20 00:00:00 CST 2005 to Mon Feb 21 00:00:00 CST 2005 

[02-20-2005 18:08:43] SERVICE ALERT: ucisvr5.champlabs.com;Sandbox - DB;CRITICAL;HARD;1;Connection refused or timed out
[02-20-2005 18:08:43] HOST ALERT: ucisvr5.champlabs.com;DOWN;HARD;3;/bin/ping -n -U -c 1 ucisvr5.champlabs.com
[02-20-2005 18:08:40] HOST ALERT: ucisvr5.champlabs.com;DOWN;SOFT;2;/bin/ping -n -U -c 1 ucisvr5.champlabs.com
[02-20-2005 18:08:37] HOST ALERT: ucisvr5.champlabs.com;DOWN;SOFT;1;/bin/ping -n -U -c 1 ucisvr5.champlabs.com

The Host Notification History shows:
Sun Feb 20 00:00:00 CST 2005 to Mon Feb 21 00:00:00 CST 2005 
No notifications have been recorded for this host in this archived log 
file 

The Service Alert History shows:
Sun Feb 20 00:00:00 CST 2005 to Mon Feb 21 00:00:00 CST 2005 
[02-20-2005 18:08:43] SERVICE ALERT: ucisvr5.champlabs.com;Sandbox - DB;CRITICAL;HARD;1;Connection refused or timed out

The Service Notification History shows:
Sun Feb 20 00:00:00 CST 2005 to Mon Feb 21 00:00:00 CST 2005 
No notifications have been recorded for this service in this archived log 
file 

It seems that this occurs after Nagios has been up and running for a
while.  The system and Nagios have been up for 11 days, which doesn't
seem like a long time.

Mainly just fishing for any ideas on what could cause this or how to 
troubleshoot the problem.  It would be nice if Nagios logged some info 
when it processes an event and then decides NOT to send a notification, 
like "Notification for event xxxx suppressed because yyyyy" or some such.
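Until something like that exists, one way to spot the gaps after the
fact might be to scan the log for HARD problem alerts that have no
matching notification entry shortly afterwards.  The following is only
a rough sketch under some assumptions: the function names and the
five-minute window are made up for illustration, the alert layout is
taken from the lines quoted above, and the notification layout
(contact;host;[service;]state;command;output) is assumed rather than
shown in this post.  It accepts the date format the CGIs display as
well as plain epoch-second timestamps.

```python
import re
from datetime import datetime, timedelta

# Lines look like: [02-20-2005 18:08:43] HOST ALERT: host;DOWN;HARD;3;output
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] (?P<kind>HOST|SERVICE) "
    r"(?P<what>ALERT|NOTIFICATION): (?P<rest>.*)$"
)

SAMPLE = [
    "[02-20-2005 18:08:37] HOST ALERT: ucisvr5.champlabs.com;DOWN;SOFT;1;/bin/ping -n -U -c 1 ucisvr5.champlabs.com",
    "[02-20-2005 18:08:40] HOST ALERT: ucisvr5.champlabs.com;DOWN;SOFT;2;/bin/ping -n -U -c 1 ucisvr5.champlabs.com",
    "[02-20-2005 18:08:43] HOST ALERT: ucisvr5.champlabs.com;DOWN;HARD;3;/bin/ping -n -U -c 1 ucisvr5.champlabs.com",
    "[02-20-2005 18:08:43] SERVICE ALERT: ucisvr5.champlabs.com;Sandbox - DB;CRITICAL;HARD;1;Connection refused or timed out",
]


def parse_timestamp(text):
    """Accept epoch seconds or the MM-DD-YYYY HH:MM:SS form shown above."""
    if text.isdigit():
        return datetime.fromtimestamp(int(text))
    return datetime.strptime(text, "%m-%d-%Y %H:%M:%S")


def parse_line(line):
    """Return a dict for an ALERT/NOTIFICATION line, or None otherwise."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    kind, what = m.group("kind"), m.group("what")
    want = 5 if kind == "HOST" else 6
    # Limit the split so semicolons inside the plugin output survive.
    fields = m.group("rest").split(";", want - 1)
    if len(fields) != want:
        return None
    if what == "NOTIFICATION":
        fields = fields[1:]  # assumed layout: drop the leading contact name
    host = fields.pop(0)
    service = fields.pop(0) if kind == "SERVICE" else None
    return {
        "time": parse_timestamp(m.group("ts")),
        "what": what,
        "key": (host, service),
        "state": fields[0],
        "state_type": fields[1] if what == "ALERT" else None,
    }


def unnotified_hard_alerts(lines, window=timedelta(minutes=5)):
    """List HARD problem alerts with no notification inside the window."""
    events = [e for e in map(parse_line, lines) if e is not None]
    notes = [e for e in events if e["what"] == "NOTIFICATION"]
    missing = []
    for alert in events:
        if alert["what"] != "ALERT" or alert["state_type"] != "HARD":
            continue
        if alert["state"] in ("OK", "UP"):
            continue  # recoveries are not problems
        notified = any(
            n["key"] == alert["key"]
            and alert["time"] <= n["time"] <= alert["time"] + window
            for n in notes
        )
        if not notified:
            missing.append(alert)
    return missing


if __name__ == "__main__":
    for alert in unnotified_hard_alerts(SAMPLE):
        print(alert["time"], alert["key"], alert["state"])
```

Run against the lines from this incident, it flags both the host and
the service HARD alerts.  A flagged service alert is not necessarily a
fault on its own, since the service's host being down at the same time
may be a legitimate reason for the service notification to be held
back; the missing host notification is the one that matters here.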

Thanks for listening.  I'll check into any debug and/or logging options.

Toby
