host check strangeness - odd behavior in Nagios scheduling queue

Frater, Greg J GJFRATER at bechtel.com
Tue Jul 7 16:25:59 CEST 2009


Greetings All, 

I'm seeing a problem with our host check scheduling.  There are two
major issues, and I can't tell whether they are symptoms of the same
problem or two separate problems.  I've provided the configs and
information that I know to be applicable; if there's other pertinent
information, please let me know and I'll be happy to provide it.

First, here's my Nagios setup:
Single Nagios box (no distributed setup)
64-bit RHEL 5.3
Nagios 3.1.2 (I upgraded from 3.0.6 to see if that would fix the issues)


Problem 1. Some host checks are getting *stuck* in the scheduling queue.
When I look at the scheduling queue, these hosts are always listed with
a 'last check' time identical to their 'next check' time.  See attached
screenshot (problem 1).  They typically stay at the top of the queue
for an hour or two.
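
One way to list the affected hosts without relying on the web interface is to read Nagios's status file directly. The following is only a rough sketch: it assumes the status file contains `hoststatus { ... }` blocks with `host_name`, `last_check` and `next_check` lines in `key=value` form, and the function names are made up for illustration.

```python
def parse_hoststatus(text):
    """Return one dict of key/value strings per 'hoststatus { ... }' block."""
    hosts = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("hoststatus") and line.endswith("{"):
            current = {}                      # entering a host block
        elif line == "}" and current is not None:
            hosts.append(current)             # leaving a host block
            current = None
        elif current is not None and "=" in line:
            key, _, value = line.partition("=")
            current[key] = value
    return hosts


def stuck_hosts(hosts):
    """Names of hosts whose last_check equals next_check (Problem 1 symptom)."""
    return [
        h["host_name"]
        for h in hosts
        if h.get("last_check") not in (None, "0")
        and h.get("last_check") == h.get("next_check")
    ]
```

To use it, read the status file (its location is set by `status_file` in nagios.cfg) and pass the contents to `stuck_hosts(parse_hoststatus(contents))`.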

Host configuration for one of them:


define host {
        host_name		hostxxx
        alias			Oracle
        use			srvhost-os-2000,srvhost-physical,srvhost-oracle,srvhost-non-production,srvhost-all
        notification_period		aperture
        register			1
        }

Applicable Templates:

define host {
       name                                     generic-host
       check_period                             24x7
       event_handler_enabled                    1
       flap_detection_enabled                   1
       process_perf_data                        1
       retain_status_information                1
       retain_nonstatus_information             1
       notifications_enabled                    1
       register                                 0
}


define host {
       name                                     generic-pnp
       action_url                               /pnp/index.php?host=$HOSTNAME$' onmouseover="get_g('$HOSTNAME$','_HOST_')" onmouseout="clear_g()"
       register                                 0
}


define host {
       name                                     srvhost-all
       alias                                    All Servers
       check_command                            check-nt-alive
       use                                      generic-pnp,generic-host
       max_check_attempts                       3
       check_interval                           60
       retry_interval                           1
       active_checks_enabled                    1
       passive_checks_enabled                   1
       flap_detection_enabled                   1
       process_perf_data                        1
       retain_status_information                1
       retain_nonstatus_information             1
       contact_groups                           +servers
       notification_interval                    240
       notification_period                      24x7
       notification_options                     d,u,r
       notifications_enabled                    1
       register                                 0
}


define host {
       name                                     srvhost-non-production
       alias                                    Non production servers
       hostgroups                               +SRV_Cls-non-production
       check_interval                           120
       retry_interval                           20
       passive_checks_enabled                   1
       contact_groups                           +servers
       notification_interval                    480
       notification_period                      workhours
       notification_options                     d,u,r
       notifications_enabled                    1
       register                                 0
}


define host {
       name                                     srvhost-oracle
       alias                                    Oracle servers
       hostgroups                               +SRV_app-oracle
       contact_groups                           +oracle
       register                                 0
}


define host {
       name                                     srvhost-physical
       alias                                    Servers that are running on physical hardware
       hostgroups                               +SRV_platform-physical
       register                                 0
}


define host {
       name                                     srvhost-os-2000
       alias                                    Servers running Windows 2000 Server
       hostgroups                               +SRV_os-win2000
       check_command                            check-nt-alive
       register                                 0
}
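
Because hostxxx pulls in five templates, it is worth working out which values actually apply. The sketch below assumes the rule described in the Nagios object inheritance documentation: a directive set on the object itself wins, otherwise the first template listed in `use` that supplies it (searched depth-first) wins. It copies only a handful of directives from the definitions above, ignores the additive `+` syntax, and the `resolve` helper is hypothetical.

```python
# A few directives copied from the definitions above (not the full objects).
OBJECTS = {
    "hostxxx": {
        "use": "srvhost-os-2000,srvhost-physical,srvhost-oracle,"
               "srvhost-non-production,srvhost-all",
        "notification_period": "aperture",
    },
    "srvhost-os-2000": {"check_command": "check-nt-alive"},
    "srvhost-physical": {},
    "srvhost-oracle": {},
    "srvhost-non-production": {
        "check_interval": "120",
        "retry_interval": "20",
        "notification_period": "workhours",
    },
    "srvhost-all": {
        "use": "generic-pnp,generic-host",
        "check_command": "check-nt-alive",
        "max_check_attempts": "3",
        "check_interval": "60",
        "retry_interval": "1",
        "notification_period": "24x7",
    },
    "generic-pnp": {},
    "generic-host": {"check_period": "24x7"},
}


def resolve(name, directive):
    """Return the effective value of a directive, or None if nothing sets it."""
    obj = OBJECTS[name]
    if directive in obj:                       # local value always wins
        return obj[directive]
    for parent in filter(None, obj.get("use", "").split(",")):
        value = resolve(parent.strip(), directive)
        if value is not None:                  # first template listed wins
            return value
    return None


print(resolve("hostxxx", "check_interval"))    # 120
print(resolve("hostxxx", "retry_interval"))    # 20
```

Under that rule hostxxx gets check_interval 120 and retry_interval 20 from srvhost-non-production rather than 60 and 1 from srvhost-all. If interval_length is at its default of 60 seconds, that would mean one scheduled host check every two hours.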



Problem 2.  Many of our hosts are not running host checks: they are in
the scheduling queue but never execute.  Looking at the scheduling queue,
I can see many hosts whose host 'last check' times are from several
weeks ago.  They show up in the queue but never run their host checks
(or don't seem to).  These same hosts run service checks on time
without issue.  Screenshot attached (problem 2).
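
To put a number on "several weeks ago", the last-check times can be compared against the configured interval. This is a minimal sketch, assuming you already have a mapping from host name to last-check time as a Unix timestamp (for example taken from the status file) and that interval_length is at its default of 60 seconds; the function name and the `slack` factor are illustrative only.

```python
import time


def overdue_hosts(last_checks, check_interval_min, now=None, slack=2.0):
    """Return (host, hours_since_last_check) for hosts that are overdue.

    A host counts as overdue when its last check is older than
    `slack` times the configured check interval.
    """
    now = time.time() if now is None else now
    limit = check_interval_min * 60 * slack    # allowed age in seconds
    overdue = []
    for host, last in sorted(last_checks.items()):
        age = now - last
        if age > limit:
            overdue.append((host, round(age / 3600, 1)))
    return overdue
```

With check_interval 60, any host whose last check is more than two hours old would be reported, along with its age in hours.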

Host config for one of the hosts not running host checks:
define host {
        host_name                       hostxxxx
        alias                           media server
        use                             srvhost-production,srvhost-physical,srvhost-os-2003,srvhost-all
        register                        1
        }


define host {
       name                                     generic-host
       check_period                             24x7
       event_handler_enabled                    1
       flap_detection_enabled                   1
       process_perf_data                        1
       retain_status_information                1
       retain_nonstatus_information             1
       notifications_enabled                    1
       register                                 0
}


define host {
       name                                     generic-pnp
       action_url                               /pnp/index.php?host=$HOSTNAME$' onmouseover="get_g('$HOSTNAME$','_HOST_')" onmouseout="clear_g()"
       register                                 0
}


define host {
       name                                     srvhost-all
       alias                                    All Servers
       check_command                            check-nt-alive
       use                                      generic-pnp,generic-host
       max_check_attempts                       3
       check_interval                           60
       retry_interval                           1
       active_checks_enabled                    1
       passive_checks_enabled                   1
       flap_detection_enabled                   1
       process_perf_data                        1
       retain_status_information                1
       retain_nonstatus_information             1
       contact_groups                           +servers
       notification_interval                    240
       notification_period                      24x7
       notification_options                     d,u,r
       notifications_enabled                    1
       register                                 0

}

define host {
       name                                     srvhost-os-2003
       alias                                    Servers running Windows 2003
       hostgroups                               +SRV_os-win2003
       check_command                            check-nt-alive
       register                                 0

}

define host {
       name                                     srvhost-physical
       alias                                    Servers that are running on physical hardware
       hostgroups                               +SRV_platform-physical
       register                                 0

}

define host {
       name                                     srvhost-production
       alias                                    All servers in production mode
       hostgroups                               +SRV_Cls-production
       contact_groups                           +helpdesk,servers,servers-off-hours,thesolver
       register                                 0

}

define command {
       command_name                             check-nt-alive
       command_line                             $USER1$/check_tcp -H $HOSTADDRESS$ -p 135 -t 30
}
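
For anyone who wants to rule out the check itself, the host check boils down to a TCP connection attempt to port 135 with a 30-second timeout. The following is a rough stand-in for that behavior, not the plugin's actual implementation; `tcp_alive` is a made-up name.

```python
import socket


def tcp_alive(host, port=135, timeout=30.0):
    """True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

Running this by hand against one of the affected hosts can show whether the connection hangs for the full timeout, which would be worth knowing before looking at the scheduler.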


Any ideas or help in tracking this down would be appreciated.  I'm pretty
sure it's a bug in the code, but I suppose it's possible my configuration
is off somehow.  :-) 

Thanks Again, 

-greg