Services get stuck in warning/critical state

Andreas Ericsson ae at op5.se
Fri Oct 3 10:54:24 CEST 2008


Bartlomiej Korupczynski wrote:
> Hello,
> 
> I've had a problem since upgrading Nagios to 3.0.3 from ports on FreeBSD
> 6.2-RELEASE. It's used to monitor approx. 300 services on 150 hosts. The
> problem is that random services sometimes get stuck in warning/critical
> state. I've checked the scheduling queue in the CGI and it seems right, but
> as far as I can see, it only shows hosts, not services.


Odd stuff.

> I should probably also mention that the Nagios host sometimes has quite a
> high load (up to 2.0 on a uniprocessor machine) as a result of the
> monitoring scripts. Also, the monitoring host has a large, constant clock
> skew that I can't get rid of (the clock runs fast by ca. 5s every 2 minutes;
> ntpdate corrects this every 2 minutes).
> 

And this is almost certainly the problem. Are you running Nagios in a
vmware system? If yes, what happens when you move it out to its own hardware?
Nagios relies on a reasonably accurate system clock. One that jumps backwards
and forwards will cause problems.
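
To see how large the jumps actually are, one option is to compare the wall
clock against a monotonic clock over short intervals. The following is only a
rough sketch in Python, not anything Nagios ships; the function names and the
1 second tolerance are my own assumptions. It can only show steps applied to
the wall clock (such as a periodic ntpdate correction), not a crystal that
runs fast, since both clocks tick from the same hardware.

```python
import time


def clock_drift(wall_start, mono_start, wall_now, mono_now):
    """Seconds the wall clock gained (+) or lost (-) relative to the
    monotonic clock between two samples."""
    return (wall_now - wall_start) - (mono_now - mono_start)


def is_clock_jump(drift_seconds, tolerance=1.0):
    """True if the drift exceeds the tolerance (an assumed threshold)."""
    return abs(drift_seconds) > tolerance


def watch(interval=10.0, tolerance=1.0, samples=6):
    """Sample both clocks every `interval` seconds and report jumps."""
    wall_prev, mono_prev = time.time(), time.monotonic()
    for _ in range(samples):
        time.sleep(interval)
        wall_now, mono_now = time.time(), time.monotonic()
        drift = clock_drift(wall_prev, mono_prev, wall_now, mono_now)
        if is_clock_jump(drift, tolerance):
            print("wall clock moved %+.3f s relative to monotonic time" % drift)
        wall_prev, mono_prev = wall_now, mono_now
```

Calling `watch(interval=10.0, samples=30)` across one of the two-minute
ntpdate runs should print a negative value of roughly the size of the
correction if the clock is being stepped backwards.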

You could try running the snapshot from the branch 'unplanned-checks'. Download
from http://www.op5.org/git/?p=nagios.git;a=shortlog;h=refs/heads/unplanned-checks
(click the "snapshot" link in the upper right).

> - example host status:
> Host Status: UP (for 1d 9h 35m 23s)
> Status Information: PING OK - Packet loss = 0%, RTA = 15.33 ms
> Performance Data: rta=15.333000ms;3000.000000;5000.000000;0.000000 pl=0%;80;100;0
> Current Attempt: 1/5  (HARD state)
> Last Check Time: 02-10-2008 10:32:44
> Check Type: ACTIVE
> Check Latency / Duration: 0.732 / 0.152 seconds
> Next Scheduled Active Check: 02-10-2008 10:37:54
> Last State Change: 01-10-2008 01:02:43
> Last Notification: N/A (notification 0)
> Is This Host Flapping? N/A
> In Scheduled Downtime? NO  
> Last Update: 02-10-2008 10:37:58  ( 0d 0h 0m 8s ago)
> 
> - service status on the host above:
> Current Status: CRITICAL (for 2d 2h 55m 9s)
> Status Information: PING CRITICAL - Packet loss = 0%, RTA = 313.92 ms
> Performance Data:
> Current Attempt: 3/5  (SOFT state)
> Last Check Time: 30-09-2008 07:44:36
> Check Type: ACTIVE
> Check Latency / Duration: 0.116 / 4.657 seconds
> Next Scheduled Check: 30-09-2008 07:45:36
> Last State Change: 30-09-2008 07:44:36
> Last Notification: N/A (notification 0)
> Is This Service Flapping? N/A
> In Scheduled Downtime? NO  
> Last Update: 02-10-2008 10:39:43  ( 0d 0h 0m 2s ago)
> 
> Notice the Last Check Time: the service status is two days old. The problem
> can be resolved by restarting Nagios, or by "Re-schedule the next check of
> this service" in the CGI. The parent host is up, and the service has no
> parent. Is there any configuration directive that may cause a service check
> to be dropped by the scheduler?
> 

No, but there's a bug in Nagios 3 that can cause checks to be marked as
"do not reschedule" and also causes them to be scheduled one year into the
future. By the looks of it, this is not the problem you're running into,
although it could be worth examining if the issue is related.
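
One way to examine it is to scan the status file for services whose last
check is stale, whose next check lies far in the future, or which are flagged
as not to be scheduled. The sketch below is an assumption-laden example in
Python: the block name `servicestatus` and the field names (`last_check`,
`next_check`, `should_be_scheduled`, `active_checks_enabled`) are what I
expect to find in a Nagios 3 status.dat, so check them against your own file.
The thresholds and function names are made up for illustration.

```python
def parse_service_blocks(text):
    """Collect key=value pairs from every 'servicestatus { ... }' block."""
    services = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("servicestatus") and line.endswith("{"):
            current = {}
        elif line == "}":
            # End of a block; only keep it if it was a servicestatus block.
            if current is not None:
                services.append(current)
            current = None
        elif current is not None and "=" in line:
            key, _, value = line.partition("=")
            current[key] = value
    return services


def suspicious(services, now, max_age=3600, max_ahead=7 * 24 * 3600):
    """Return (host, service, reasons) for services that look stuck.

    max_age and max_ahead are arbitrary example thresholds in seconds.
    """
    found = []
    for svc in services:
        if svc.get("active_checks_enabled", "1") != "1":
            continue  # passive-only services are expected to have no schedule
        last_check = int(svc.get("last_check", 0))
        next_check = int(svc.get("next_check", 0))
        reasons = []
        if now - last_check > max_age:
            reasons.append("stale last_check")
        if next_check - now > max_ahead:
            reasons.append("next_check far in the future")
        if svc.get("should_be_scheduled") == "0":
            reasons.append("marked as not to be scheduled")
        if reasons:
            found.append((svc.get("host_name"),
                          svc.get("service_description"),
                          reasons))
    return found
```

To use it, read the file named by the `status_file` directive in nagios.cfg
and pass its contents to `parse_service_blocks()`, then call
`suspicious(services, time.time())`. A service like the one quoted above would
show up with "stale last_check"; one hit by the scheduling bug would
presumably also show a far-future `next_check` or `should_be_scheduled=0`.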

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users