Services get stuck in warning/critical state

Bartlomiej Korupczynski bartek at net-serwis.pl
Thu Oct 2 11:26:40 CEST 2008


Hello,

Since upgrading to 3.0.3 from ports on FreeBSD 6.2-RELEASE I've run into a
problem. Nagios is used to monitor approx. 300 services on 150 hosts.
Sometimes random services get stuck in a warning/critical state. I've checked
the scheduling queue in the CGI and it looks right, but as far as I can see it
lists only hosts, not services. I should probably also mention that the Nagios
host sometimes has quite a high load (up to 2.0 on a uniprocessor machine) as
a result of the monitoring scripts. In addition, the monitoring host has a
large constant clock skew that I can't get rid of: the clock runs fast by
about 5 s every 2 minutes, and ntpdate corrects it every 2 minutes.
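
To put the clock skew in perspective, here is a rough calculation of the
drift rate from the figures above. The 500 ppm limit is the commonly
documented maximum slew rate of ntpd and is included only for comparison;
whether the resulting backward time steps affect the Nagios scheduler is an
open question, not something this sketch shows.

```python
# Figures taken from the message: the clock gains about 5 s every 120 s.
DRIFT_SECONDS = 5.0
INTERVAL_SECONDS = 120.0

# Assumption: ntpd's documented maximum slew rate, used only as a yardstick.
NTPD_MAX_SLEW_PPM = 500.0


def drift_ppm(gained: float, interval: float) -> float:
    """Return the clock drift in parts per million."""
    return gained / interval * 1_000_000


ppm = drift_ppm(DRIFT_SECONDS, INTERVAL_SECONDS)
print(f"drift: {ppm:.0f} ppm ({ppm / 10_000:.2f} %)")
print(f"exceeds ntpd slew limit: {ppm > NTPD_MAX_SLEW_PPM}")
```

A drift this large cannot be slewed away, so each ntpdate run most likely
steps the clock backwards by several seconds.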

- example host status:
Host Status: UP (for 1d 9h 35m 23s)
Status Information: PING OK - Packet loss = 0%, RTA = 15.33 ms
Performance Data: rta=15.333000ms;3000.000000;5000.000000;0.000000 pl=0%;80;100;0
Current Attempt: 1/5  (HARD state)
Last Check Time: 02-10-2008 10:32:44
Check Type: ACTIVE
Check Latency / Duration: 0.732 / 0.152 seconds
Next Scheduled Active Check: 02-10-2008 10:37:54
Last State Change: 01-10-2008 01:02:43
Last Notification: N/A (notification 0)
Is This Host Flapping? N/A
In Scheduled Downtime? NO  
Last Update: 02-10-2008 10:37:58  ( 0d 0h 0m 8s ago)

- service status on the host above:
Current Status: CRITICAL (for 2d 2h 55m 9s)
Status Information: PING CRITICAL - Packet loss = 0%, RTA = 313.92 ms
Performance Data:
Current Attempt: 3/5  (SOFT state)
Last Check Time: 30-09-2008 07:44:36
Check Type: ACTIVE
Check Latency / Duration: 0.116 / 4.657 seconds
Next Scheduled Check: 30-09-2008 07:45:36
Last State Change: 30-09-2008 07:44:36
Last Notification: N/A (notification 0)
Is This Service Flapping? N/A
In Scheduled Downtime? NO  
Last Update: 02-10-2008 10:39:43  ( 0d 0h 0m 2s ago)
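
A small sketch of the arithmetic on the timestamps above, assuming the CGI
prints dates as day-month-year. It shows how far the service is past its
scheduled check and how far apart the last and next checks are.

```python
from datetime import datetime

FMT = "%d-%m-%Y %H:%M:%S"  # assumed day-month-year, as in the CGI output above

# Values copied from the service status output.
last_check = datetime.strptime("30-09-2008 07:44:36", FMT)
next_check = datetime.strptime("30-09-2008 07:45:36", FMT)
last_update = datetime.strptime("02-10-2008 10:39:43", FMT)

retry_gap = next_check - last_check   # spacing the scheduler planned
overdue = last_update - next_check    # how long the check has been pending

print("gap between last and next check:", retry_gap)
print("check overdue by:", overdue)
```

The one-minute gap matches retry_check_interval 1 for a service in a SOFT
state, so the check was scheduled as expected but apparently never ran.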

Notice the Last Check Time: the service status is two days old. The problem
can be resolved by restarting Nagios or by using "Re-schedule the next check
of this service" in the CGI. The parent host is up, and the service has no
parent. Is there any configuration directive that could cause a service check
to be dropped by the scheduler?
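
Since the scheduling queue view did not help, one way to find every stuck
service at once is to scan the status file for active checks whose
next_check is well in the past. This is only a sketch: it assumes the
Nagios 3 status file layout ("servicestatus { key=value ... }" blocks with
epoch timestamps), and the sample data, host names and the 600 s grace
period are made up for illustration.

```python
import re
import time

# Assumption: Nagios 3 style blocks, "servicestatus {" ... "}" with key=value lines.
BLOCK_RE = re.compile(r"servicestatus\s*\{(.*?)\n\s*\}", re.DOTALL)


def parse_services(text):
    """Return one dict of key/value strings per servicestatus block."""
    services = []
    for body in BLOCK_RE.findall(text):
        fields = {}
        for line in body.splitlines():
            key, sep, value = line.strip().partition("=")
            if sep:
                fields[key] = value
        services.append(fields)
    return services


def overdue_services(text, now=None, grace=600):
    """List (host, service, seconds overdue) for actively checked services."""
    if now is None:
        now = int(time.time())
    result = []
    for svc in parse_services(text):
        if svc.get("active_checks_enabled") != "1":
            continue
        next_check = int(svc.get("next_check", "0"))
        if next_check and now - next_check > grace:
            result.append((svc.get("host_name"),
                           svc.get("service_description"),
                           now - next_check))
    return result


# Hypothetical sample data; real values come from the status file.
SAMPLE = """\
servicestatus {
    host_name=host-a
    service_description=PING-nrpe
    active_checks_enabled=1
    last_check=1000
    next_check=1060
    }

servicestatus {
    host_name=host-b
    service_description=PING-nrpe
    active_checks_enabled=1
    last_check=9900
    next_check=10200
    }
"""

stuck = overdue_services(SAMPLE, now=10000)
for host, description, seconds in stuck:
    print(f"{host} / {description}: overdue by {seconds} s")
```

To run it against a live installation, read the file named by the
status_file directive in nagios.cfg and pass its contents instead of SAMPLE.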

- configuration related to host:
define host {
        register 0
        name generic-host
        check_command check-host-alive
        notification_period 24x7
        notification_options d,u,r
        max_check_attempts 5
        notification_interval 240
        notifications_enabled 1
        event_handler_enabled 1
        flap_detection_enabled 1
        process_perf_data 1
        retain_status_information 1
        retain_nonstatus_information 1
}
define host {
        register 0
        use generic-host
        name generic-device-ext
        contact_groups noc,tech
}
define host {
        use generic-device-ext
        host_name ...
        alias ...
        address ...
        parents ...
}

- service:
define service {
        name generic-service
        active_checks_enabled 1
        passive_checks_enabled 0
        parallelize_check 1
        obsess_over_service 1
        check_freshness 0
        notifications_enabled 1
        event_handler_enabled 1
        flap_detection_enabled 1
        process_perf_data 1
        retain_status_information 1
        retain_nonstatus_information 1
        is_volatile 0
        check_period 24x7
        max_check_attempts 5
        normal_check_interval 5
        retry_check_interval 1
        notification_interval 120
        notification_period 24x7
        notification_options u,w,c,r
        register 0
}
define service {
        use generic-service
        name nrpe
        max_check_attempts 5
        normal_check_interval 5
        retry_check_interval 1
        register 0
}
define service {
        use nrpe
        name nrpe-ping
        service_description PING-nrpe
        check_command check_nrpe_ping!...
        contact_groups noc-prio
        host_name ...
}
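
To make the effective settings of the PING-nrpe service easier to see, here
is a simplified sketch of how the "use" chain above resolves. It is not how
Nagios itself implements inheritance: it ignores special cases such as
"register" and "name" handling, copies only the scheduling-related
directives, and assumes the default interval_length of 60 seconds.

```python
# Scheduling-related directives copied from the templates above.
TEMPLATES = {
    "generic-service": {
        "active_checks_enabled": "1",
        "check_period": "24x7",
        "max_check_attempts": "5",
        "normal_check_interval": "5",
        "retry_check_interval": "1",
    },
    "nrpe": {
        "use": "generic-service",
        "max_check_attempts": "5",
        "normal_check_interval": "5",
        "retry_check_interval": "1",
    },
}

SERVICE = {
    "use": "nrpe",
    "service_description": "PING-nrpe",
    "contact_groups": "noc-prio",
}

INTERVAL_LENGTH = 60  # assumption: default interval_length, in seconds


def resolve(obj, templates):
    """Merge an object with its template chain; local values win."""
    merged = {}
    parent = obj.get("use")
    if parent:
        merged.update(resolve(templates[parent], templates))
    merged.update({k: v for k, v in obj.items() if k != "use"})
    return merged


effective = resolve(SERVICE, TEMPLATES)
normal_s = int(effective["normal_check_interval"]) * INTERVAL_LENGTH
retry_s = int(effective["retry_check_interval"]) * INTERVAL_LENGTH

for key in sorted(effective):
    print(f"{key} = {effective[key]}")
print(f"normal check every {normal_s} s, retry every {retry_s} s")
```

Under these assumptions the service should be checked every 5 minutes, or
every minute while in a SOFT state, with active checks enabled around the
clock, so nothing in the templates looks like it would suppress the check.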


Thanks in advance for any clues.

Best regards,
Bartłomiej Korupczyński
