Various timing related issues with Nagios, critical impact on monitoring.

Steven Hajducko Steven.Hajducko at DigitalInsight.com
Tue Nov 16 20:19:29 CET 2004


1 - Are you going by the timestamps in the actual email or by the time you
received the email?  We often get an alert that seems strange, then check
the actual time IN the email, and it turns out to be just a delayed email
that finally got sent.  ( Especially when the email server you are using to
send the nagios emails goes down and nagios sends a CRITICAL about the mail
server.  Usually we end up getting the OK message before we get the
CRITICAL message, but the timestamps within the email are correct. )
 
2 - No, it won't be preceded by service failures.  If a service fails, I
believe it checks the host.  If the host works, then it sends a service
notification.  Otherwise, if the host doesn't work, it just sends a host
alert and doesn't bother with all the service checks.  After all, if the
host is down, you know all the services are going to be down too.
 
3 - Can you paste the stanza for your check_http service check?  What's your
retry_check_interval? max_check_attempts?  These all play into when a
notification is sent, because a notification is only sent upon change of
state and those parameters all concern when nagios declares a change of
state.  Recently, we know we had a problem with SMTP, so we only wanted to
be notified if the problem continued for 30 minutes, so our stanza ended up
looking like this.
 
define service {
        use                             generic-service
        hostgroup_name                  webfarm
        service_description             SMTP
        max_check_attempts              6
        retry_check_interval            5
        check_command                   check_smtp
}
 
It basically means that when the service fails for the first time, nagios
will try again 5 more times, at 5 minute intervals, before it declares SMTP
as critical and notifies us.  I really can't say what's wrong with your
config without seeing the stanzas, but I hope this helps.
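
To make the arithmetic behind that stanza concrete, here is a rough sketch in
Python.  It is not Nagios code, and the function name is made up for
illustration.  It assumes max_check_attempts counts the first failed check as
attempt 1, and that retry_check_interval is measured in units of
interval_length seconds (60 here, the value quoted further down in this
thread).  Under those assumptions the stanza above notifies about 25 minutes
after the first failed check.

```python
def minutes_until_notification(max_check_attempts, retry_check_interval,
                               interval_length=60):
    """Estimate minutes from the first failed check to the notification.

    Assumption: the first failure is attempt 1, so only
    (max_check_attempts - 1) retries remain, each spaced
    retry_check_interval * interval_length seconds apart.
    """
    retries = max_check_attempts - 1
    seconds = retries * retry_check_interval * interval_length
    return seconds / 60


# Values from the SMTP stanza: 6 attempts, retries 5 time units apart.
print(minutes_until_notification(6, 5))
```

The real delay can differ a little, since it also depends on when the last
normally scheduled check ran and on how long each check takes to finish.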
 
--
sh

-----Original Message-----
From: Jon Gefaell [mailto:jgefaell at netblue.com] 
Sent: Tuesday, November 16, 2004 10:46 AM
To: nagios-users at lists.sourceforge.net
Subject: [Nagios-users] Various timing related issues with Nagios, critical
impact on monitoring.



Not only am I having problems with the notification interval, other
strangeness is rearing its head. Last night there was a host down alarm, no
service check problems, just a host down. Then, 40 seconds (SECONDS!) later,
Host OK. What could possibly explain this? There's no check interval of 40
seconds... And shouldn't a host check be preceded by service failures?

Another host had a check_http service check time out. It generated a critical
alert notification. Nothing else until 20 minutes later when it went 'OK'.
This is confusing. Given that my notification interval seems stuck at 25
minutes it makes sense we only got one notification, but our alternate alarm
systems (remote services) never alarmed at all; the service was fully
available during these 20 minutes. The alarm was just a timeout on the
plugin and then it seems nothing happened for 20 minutes???

I set notification interval to 15 and still get notifications every 25
minutes. I can't figure it out.

 

Please do try to help address these issues, our use of Nagios is heavily
impacted by the inability to configure things like the notification interval
and the other behaviours described here.

 

---

Hello,

I am running Nagios 1.2 on Linux Redhat 9 and 7.3

I am seeing that if a service condition is, let's say, 'warning' and remains
that way, notifications are sent every 25 minutes.

Now I have interval_length=60 and notification_interval 120

This makes me think that I should be getting these notifications every 2
hours, which is the desired behaviour. Can anyone tell me what may be wrong
here? Maybe I don't understand this properly? Are these two parameters what
should be controlling this behaviour?
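
For reference, the expectation above can be checked with a small Python
sketch.  It is an illustration only, not Nagios code, and the function name is
hypothetical.  It assumes notification_interval is counted in units of
interval_length seconds.

```python
def renotification_gap_minutes(notification_interval, interval_length=60):
    """Minutes between repeat notifications for an unchanged problem state.

    Assumption: notification_interval is a count of time units, and each
    time unit lasts interval_length seconds.
    """
    return notification_interval * interval_length / 60


# interval_length=60 and notification_interval 120, as quoted above.
gap = renotification_gap_minutes(120)
print(gap, "minutes, i.e.", gap / 60, "hours")
```

With the values quoted, this works out to 120 minutes, so the 25 minutes
actually observed does not follow from these two settings alone.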

Thank you very much for your kind consideration.

Jon Gefaell
