Service checks pending forever in distributed monitoring configuration

Fred f1216 at yahoo.com
Thu Sep 1 19:53:03 CEST 2005


I have a 1000+ node system, plus a number of switches etc., that are all
monitored by Nagios.  I'm running 2.0b3.

Our configuration is generated automatically from the cluster's
configuration and has no issues in smaller configurations.

Recently, Nagios started delaying execution of active service checks.  I
have 5 Nagios monitors reporting via NSCA to a 6th Nagios master (which also
monitors 1/6th of the cluster).  I removed all the retention caches on all
the monitor nodes and restarted.  Nagios then reports that the next service
check is scheduled for hours later (when it should be fairly soon).  Output
from nagiostats is included below.

There are quite a few services.  Most are passive checks; each monitor node
also runs some active checks that push data to the FIFO, where it is picked
up and reported on a per-node/service basis.  The pending checks do not
execute even after their scheduled time has passed.  The monitor nodes are
working just fine; the problem is on the master node, where obsessing is
disabled and freshness checking is enabled.  There is nothing in nagios.log
other than stale check messages.  Following is an example service definition
for a service that is not getting scheduled:

define service{
        use                             nagios
        host_name                       nh
        name                            slurmMonitor
        service_description             Slurm Monitor
 
        active_checks_enabled           1
        check_command                   check_slurm
        register                        1
 
        }

and the template:

# Generic template for services
                                                                               
define service{
        use                     generic-service         ; default service
        name                    nagios
                                                                               
        normal_check_interval   5
        retry_check_interval    2
        check_period            24x7
        is_volatile             0
        max_check_attempts      3
                                                                               
        notification_interval   240
        notification_period     24x7
        notification_options    w,u,c,r
                                                                               
        contact_groups          admins
        register                0
        }

and finally, the generic-service template:

# Generic service definition template
define service{
        name                            generic-service ; The 'name' of this service template, referenced in other service definitions
        active_checks_enabled           1       ; Active service checks are enabled
        passive_checks_enabled          1       ; Passive service checks are enabled/accepted
        parallelize_check               1       ; Active service checks should be parallelized (disabling this can lead to major performance problems)
        obsess_over_service             1       ; We should obsess over this service (if necessary)
        check_freshness                 0       ; Default is to NOT check service 'freshness'
        notifications_enabled           1       ; Service notifications are enabled
        event_handler_enabled           1       ; Service event handler is enabled
        flap_detection_enabled          1       ; Flap detection is enabled
        process_perf_data               1       ; Process performance data
        retain_status_information       1       ; Retain status information across program restarts
        retain_nonstatus_information    1       ; Retain non-status information across program restarts

        register                        0       ; DONT REGISTER THIS DEFINITION - ITS NOT A REAL SERVICE, JUST A TEMPLATE!
        }
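
For reference, here is a minimal Python sketch of how the three definitions
above would combine into one effective service.  It is a simplified model,
not Nagios code: it assumes single inheritance through 'use', that local
values override inherited ones, and that 'name' and 'register' are not
inherited.  Only a subset of the directives is included, and the resolve()
helper is hypothetical.

```python
# Simplified model of template resolution (assumptions: single inheritance
# via 'use', local values win, 'name' and 'register' are never inherited).
# Only a subset of the directives from the definitions above is listed.

TEMPLATES = {
    "generic-service": {
        "active_checks_enabled": "1",
        "passive_checks_enabled": "1",
        "obsess_over_service": "1",
        "check_freshness": "0",
        "register": "0",
    },
    "nagios": {
        "use": "generic-service",
        "normal_check_interval": "5",
        "retry_check_interval": "2",
        "check_period": "24x7",
        "max_check_attempts": "3",
        "register": "0",
    },
}

SERVICE = {
    "use": "nagios",
    "host_name": "nh",
    "name": "slurmMonitor",
    "service_description": "Slurm Monitor",
    "active_checks_enabled": "1",
    "check_command": "check_slurm",
    "register": "1",
}

# Directives assumed to stay with the object that defines them.
NOT_INHERITED = {"name", "register", "use"}


def resolve(obj, templates):
    """Merge obj with its template chain; local values take precedence."""
    merged = {}
    parent_name = obj.get("use")
    if parent_name is not None:
        parent = resolve(templates[parent_name], templates)
        merged.update(
            {k: v for k, v in parent.items() if k not in NOT_INHERITED}
        )
    merged.update({k: v for k, v in obj.items() if k != "use"})
    return merged


effective = resolve(SERVICE, TEMPLATES)
for key in sorted(effective):
    print(f"{key:28} {effective[key]}")
```

Under these assumptions the resolved service keeps register 1 and active
checks enabled, and picks up normal_check_interval 5 from the 'nagios'
template and check_freshness 0 from 'generic-service'.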

Clocks are correct and synchronized on the system.

Nagios Stats 2.0b3
Copyright (c) 2003-2005 Ethan Galstad (www.nagios.org)
Last Modified: 04-03-2005
License: GPL
                                                                               
CURRENT STATUS DATA
----------------------------------------------------
Status File:                          /opt/hptc/nagios/var/status.log
Status File Age:                      0d 0h 0m 1s
Status File Version:                  2.0b3
                                                                               
Program Running Time:                 0d 48h 0m 56s
                                                                               
Total Services:                       10388
Services Checked:                     8472
Services Scheduled:                   246
Active Service Checks:                4774
Passive Service Checks:               5614
Total Service State Change:           0.000 / 63.550 / 2.210 %
Active Service Latency:               0.000 / 2714.925 / 1220.973 %
Active Service Execution Time:        0.000 / 180.065 / 0.119 sec
Active Service State Change:          0.000 / 17.830 / 1.222 %
Active Services Last 1/5/15/60 min:   0 / 0 / 0 / 4
Passive Service State Change:         0.000 / 63.550 / 3.050 %
Passive Services Last 1/5/15/60 min:  0 / 440 / 2566 / 4724
Services Ok/Warn/Unk/Crit:            7420 / 2866 / 0 / 102
Services Flapping:                    0
Services In Downtime:                 0
 
Total Hosts:                          1094
Hosts Checked:                        1030
Hosts Scheduled:                      0
Active Host Checks:                   1094
Passive Host Checks:                  0
Total Host State Change:              0.000 / 0.000 / 0.000 %
Active Host Latency:                  0.000 / 0.000 / 0.000 %
Active Host Execution Time:           0.000 / 0.000 / 0.000 sec
Active Host State Change:             0.000 / 0.000 / 0.000 %
Active Hosts Last 1/5/15/60 min:      0 / 0 / 0 / 0
Passive Host State Change:            0.000 / 0.000 / 0.000 %
Passive Hosts Last 1/5/15/60 min:     0 / 0 / 0 / 0
Hosts Up/Down/Unreach:                1094 / 0 / 0
Hosts Flapping:                       0
Hosts In Downtime:                    0
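
A few of the service figures above are easier to compare after some
arithmetic.  The short Python sketch below only reworks numbers copied from
the nagiostats output; it assumes the latency values are in seconds even
though nagiostats prints them with a '%' sign, and the variable names are
made up for illustration.

```python
# Figures copied from the nagiostats output above.
total_services = 10388
services_checked = 8472
services_scheduled = 246
active_services = 4774
passive_services = 5614
# Printed with '%' by nagiostats; assumed here to be seconds.
avg_active_latency = 1220.973
max_active_latency = 2714.925

# Services that have not been checked at least once.
never_checked = total_services - services_checked
# Scheduled services as a percentage of the active service count.
scheduled_share = services_scheduled / active_services * 100

print("active + passive == total:",
      active_services + passive_services == total_services)
print("services never checked:", never_checked)
print(f"scheduled / active: {scheduled_share:.1f}%")
print(f"average active latency: {avg_active_latency / 60:.1f} min")
print(f"maximum active latency: {max_active_latency / 60:.1f} min")
```

The active and passive counts add up to the service total, and 1916 of the
10388 services have not been checked at all.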
 
Does anyone have suggestions as to what to look for next?
If I force scheduling of the service, it eventually gets scheduled and
runs, and the pending time in the web display is updated right away.

Thanks in advance for any insight.

-FredC





