Confused on Nagios check queue processing of down host

Frater, Greg J gjfrater at bechtel.com
Mon Nov 25 17:39:39 CET 2002


Hello All,

I'm running Nagios 1.0b6 on RH 7.3 (kernel 2.4.18) on a Compaq server (dual
processor, 1.4 GHz, 512 MB RAM, RAID 5), with the expectation of checking 650
hosts using 1500-2000 service checks.  When I put the initial set of checks
on the server I noticed very large check latency.

After digging into the problem I found that it is caused by service checks
for one or more down hosts getting stuck in the scheduling queue.  When a
service check for a down host reaches the top of the queue it sits there for
about 5 minutes, give or take 30 seconds, and everything behind it backs up.
Even though Nagios is running parallelized checks, all of them stop
(according to the CGI) until that 5 minutes is up; then the down host's
service check clears the queue and the other checks resume.  After a 5 minute
late start Nagios may or may not ever catch up, depending on the number of
checks being done.

From the mailing list archive it looks like others are seeing similar
problems, showing up as high check latency.  The way I read the
documentation, service_check_timeout and host_check_timeout should prevent
this.  Surely this hang is not by design.  Would this be considered a bug, or
could I have something misconfigured?  Below are my configs; let me know if I
left out anything that would help figure this out.  I appreciate any help or
suggestions for fixing this problem.
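
On the question of whether Nagios can catch up after a 5 minute stall, here
is a minimal sketch of the arithmetic.  It assumes checks come due at a
steady rate and that, once unblocked, Nagios can start checks at some fixed
maximum rate.  The function name and the capacity figures are hypothetical,
not measurements from this server.

```python
import math


def catch_up_seconds(num_checks, interval_s, stall_s, capacity_per_s):
    """Seconds needed to drain the backlog built up during a stall.

    num_checks      -- service checks scheduled once per interval
    interval_s      -- normal check interval in seconds
    stall_s         -- how long the queue was blocked
    capacity_per_s  -- hypothetical max checks started per second once unblocked
    Returns math.inf when there is no spare capacity.
    """
    arrival_rate = num_checks / interval_s   # checks coming due per second
    backlog = arrival_rate * stall_s         # checks that piled up while stalled
    spare = capacity_per_s - arrival_rate    # capacity left over for the backlog
    if spare <= 0:
        return math.inf
    return backlog / spare


# 2000 checks every 5 minutes, queue stalled for 300 seconds.
print(catch_up_seconds(2000, 300, 300, 10.0))  # some headroom
print(catch_up_seconds(2000, 300, 300, 6.0))   # no headroom
```

With the hypothetical capacity of 10 checks per second the backlog drains in
about 600 seconds; at 6 per second it never drains, because checks come due
faster than they can be started.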

When the down host hit the queue, my vitals looked like this:

Time Frame            Checks Completed
<= 1 minute:          41 (15.5%)
<= 5 minutes:         252 (95.5%)
<= 15 minutes:        264 (100.0%)
<= 1 hour:            264 (100.0%)
Since program start:  264 (100.0%)

Metric                 Min.     Max.   Average
Check Execution Time:  2 sec    6 sec  2.648 sec
Check Latency:         < 1 sec  1 sec  0.004 sec

Process Status:        OK
Check Command Output:  Nagios ok: located 5 processes, status log updated 9 seconds ago


At about the 5 minute mark it looks like this:

Time Frame            Checks Completed
<= 1 minute:          26 (9.8%)
<= 5 minutes:         26 (9.8%)
<= 15 minutes:        264 (100.0%)
<= 1 hour:            264 (100.0%)
Since program start:  264 (100.0%)

Metric                 Min.   Max.     Average
Check Execution Time:  2 sec  10 sec   2.678 sec
Check Latency:         4 sec  301 sec  152.087 sec
Percent State Change:  0.00%  6.12%    0.02%

Process Status:        WARNING
Check Command Output:  Nagios problem: located 4 processes, status log updated 309 seconds ago

During this entire time there are no changes in the scheduling queue.
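
The latency figures above are about what a single stall of roughly 300
seconds would produce.  A small sketch, assuming the 264 checks came due at
evenly spaced times across the stall and all ran as soon as it cleared (an
assumption for illustration, not something read from the Nagios logs):

```python
def stalled_latencies(num_checks, stall_s):
    """Latency of each check if due times are evenly spread over a stall
    and every check runs the moment the stall ends."""
    step = stall_s / num_checks
    # Check i came due at i * step and had to wait until stall_s.
    return [stall_s - i * step for i in range(num_checks)]


lat = stalled_latencies(264, 300)
print(min(lat), max(lat), sum(lat) / len(lat))
```

Under those assumptions the average latency is about 150 seconds and the
maximum is 300, close to the 152.087 and 301 reported by the CGI.  That fits
one blocking event better than checks that are slow across the board.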


nagios.cfg:
#global_service_event_handler=somecommand
#	n	= None - don't use any delay between checks
#	d	= Use a "dumb" delay of 1 second between checks
#	s	= Use "smart" inter-check delay calculation
#       x.xx    = Use an inter-check delay of x.xx seconds
inter_check_delay_method=s
#       s       = Use "smart" interleave factor calculation
#       x       = Use an interleave factor of x, where x is a
#                 number greater than or equal to 1.
service_interleave_factor=s
max_concurrent_checks=0
service_reaper_frequency=10
sleep_time=1
service_check_timeout=20
host_check_timeout=30
event_handler_timeout=30
notification_timeout=30
ocsp_timeout=5
perfdata_timeout=5
retain_state_information=0
state_retention_file=/usr/local/nagios/var/status.sav
retention_update_interval=60
use_retained_program_state=1
interval_length=60
use_agressive_host_checking=0
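
One thing that could be checked against the timeouts above: if attempts of a
check are retried back to back and each attempt may run for the full timeout,
the total blocking time is the product of the two.  The sketch below only
shows that arithmetic.  hosts.cfg is not included in this post, so the
attempt count is a placeholder, and whether Nagios actually applies the
timeout per attempt is an assumption.

```python
def worst_case_block_seconds(timeout_s, max_attempts):
    """Upper bound on how long back-to-back attempts can run if every
    attempt hits its timeout."""
    return timeout_s * max_attempts


host_check_timeout = 30          # from nagios.cfg above
hypothetical_max_attempts = 10   # placeholder; not taken from hosts.cfg

print(worst_case_block_seconds(host_check_timeout, hypothetical_max_attempts))
```

With those two numbers the bound is 300 seconds, the same order as the
5 minute hang, so the host max_check_attempts value is worth comparing
against it.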


checkcommands.cfg:
# 'check_ping' command definition
define command{
	command_name	check_ping
	command_line	$USER1$/check_ping -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -p $ARG3$
	}

# 'check-nt-alive' command definition
define command{
	command_name	check-nt-alive
	command_line	$USER1$/check_tcp -H $HOSTADDRESS$ -p 135
	}

# 'check-cisco-alive' command definition
define command{
	command_name	check-cisco-alive
	command_line	$USER1$/check_tcp -H $HOSTADDRESS$ -p 23
	}


services.cfg:
define service{
   name					ping-templ
   service_description			PING
   is_volatile				0
   check_command			check_ping!100.0,60%!500.0,100%!3
   max_check_attempts			3
   normal_check_interval 		5
   retry_check_interval 		1
   active_checks_enabled  		1
   passive_checks_enabled 		0
   check_period  			24x7
   obsess_over_service  		1
   check_freshness 			0
   flap_detection_enabled		1
   process_perf_data			1
   retain_status_information		1
   retain_nonstatus_information		1
   notification_interval  		120
   notification_period  		24x7
   notification_options 		w,u,c,r
   notifications_enabled 		1
   stalking_options 			w

   register				0
   }

# Ping Servers definition
define service{
	use				ping-templ		; Name of service template to use

	host_name			SRV0001,SRV0002,SRV0003,SRV0004,SRV0005,SRV0006,SRV0007,SRV0009,SRV0010,SRV0011,SRV0012,SRV0013,SRV0014,SRV0015,SRV0016,SRV0017,SRV0018,SRV0019,SRV0020,SRV0021,SRV0022,SRV0023,SRV0024,SRV0025,SRV0026,SRV0027,SRV0028,SRV0029,SRV0030,SRV0031,SRV0032,SRV0033,SRV0034,SRV0035,SRV0036,SRV0037,SRV0038,SRV0039,SRV0040,SRV0041,SRV0042,SRV0043,SRV0044,SRV0045,SRV0046,SRV0047,SRV0048,SRV0049,SRV0050,SRV0051,SRV0052,SRV0053,SRV0054,SRV0055,SRV0056,SRV0057,SRV0058,SRV0059,SRV0060,SRV0061,SRV0062,SRV0063,SRV0064,SRV0065,SRV0066,SRV0068,SRV0069,SRV0070,SRV0071,SRV0072,SRV0073,SRV0074,SRV0075,SRV0076,SRV0077,SRV0078,SRV0079,SRV0080,SRV0081,SRV0082,SRV0083,SRV0084,SRV0085,SRV0086,SRV0087,SRV0088,SRV0089,SRV0090,SRV0091,SRV0092,SRV0093,SRV0094,SRV0095,SRV0096,SRV0098,SRV0099,SRV0100,SRV0102,SRV0103,SRV0104,SRV0105,SRV0106,WTPS16193
	contact_groups			nt-admins
	}




Thanks, 

Greg Frater
WTP IT dept.
509 371 3537
gjfrater at bechtel.com

