Slow scheduled service checks

Tedman Eng teng at dataway.com
Mon Sep 20 23:30:56 CEST 2004


Check latency is indeed very high on your system.  Latency is the time
between when a check is scheduled to run and when it actually runs.  By
comparison, it should normally be between 1 and 30 seconds, depending on
network conditions and Nagios load.
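
To make the definition concrete, here is a minimal Python sketch of how
latency is measured.  The function name and the two timestamps are made up
for illustration (they are not from your system); the gap is chosen to land
near the average latency you reported.

```python
from datetime import datetime


def check_latency(scheduled: datetime, started: datetime) -> float:
    """Return seconds between a check's scheduled time and its actual start."""
    return (started - scheduled).total_seconds()


# Hypothetical timestamps, chosen to land near the reported average latency.
scheduled = datetime(2004, 9, 20, 14, 0, 0)
started = datetime(2004, 9, 20, 14, 6, 55)

latency = check_latency(scheduled, started)
print(f"latency: {latency:.0f} sec")
```

A healthy system would show a gap of seconds here, not minutes.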

If you have a very large number of down hosts, this can also affect your
latency, since Nagios "pauses" to check a host and thus skews the scheduling
queue when this happens.  It can usually catch up though if the other checks
have enough headroom in the scheduling queue.

Look at your scheduling queue (best done right after a restart).  The checks
should be spaced out evenly.  If your normal check interval for most
services is 5 minutes, look to see that all of your services are scheduled
to complete before that 5 minutes is up.  
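
As a rough sanity check on the queue, the sketch below estimates how long one
full pass takes if checks start one inter-check delay apart.  This is a
simplification that ignores interleaving and concurrency limits, and the
function names and service counts are only examples.

```python
def first_pass_seconds(num_services: int, inter_check_delay: float) -> float:
    """Time to start every service once, one check per inter_check_delay."""
    return num_services * inter_check_delay


def fits_in_interval(num_services: int, inter_check_delay: float,
                     check_interval: float = 300.0) -> bool:
    """True if one full pass finishes within the normal check interval."""
    return first_pass_seconds(num_services, inter_check_delay) <= check_interval


# Example numbers only: 600 services on a 5-minute (300 second) interval.
print(fits_in_interval(600, 0.45))  # a delay just under half a second
print(fits_in_interval(600, 0.60))  # a delay that overruns the interval
```

If the second case describes your queue, checks are scheduled later than
their interval allows and latency will keep growing.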

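Try setting your inter-check delay manually (give inter_check_delay_method
a number of seconds instead of 's').  With a 5-minute check interval and 600
actively checked services, 300 seconds / 600 services = 0.5, so your value
should be just below .5 (one check every half second).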
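
Here is a small Python sketch of the hand calculation, assuming the formula
is the average check interval divided by the number of services (my reading
of the documentation, so please verify against your version).  The function
name, the one-hour cutoff, and the ten daily checks are hypothetical; they
only show how long-interval checks skew the result.

```python
def inter_check_delay(intervals, long_cutoff=3600.0):
    """Average interval of the regular checks divided by how many there are."""
    # Drop long-interval checks (assumed cutoff: anything over an hour).
    regular = [i for i in intervals if i <= long_cutoff]
    if not regular:
        raise ValueError("no checks at or below the long-interval cutoff")
    return (sum(regular) / len(regular)) / len(regular)


regular_checks = [300.0] * 600   # 600 services on a 5-minute interval
daily_checks = [86400.0] * 10    # hypothetical once-a-day checks

# An infinite cutoff keeps every check, which mimics the skewed calculation.
skewed = inter_check_delay(regular_checks + daily_checks,
                           long_cutoff=float("inf"))
trimmed = inter_check_delay(regular_checks + daily_checks)

print(f"with daily checks: {skewed:.2f} sec")
print(f"without them:      {trimmed:.2f} sec")
```

Pick a value slightly below the trimmed result so there is some headroom.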

-----Original Message-----
From: Jeff Engstrom [mailto:jeff.engstrom at fortix.net]
Sent: Monday, September 20, 2004 2:01 PM
To: Nagios-Users
Cc: teng at dataway.com
Subject: RE: [Nagios-users] Slow scheduled service checks


Here are the server's performance metrics...

Time Frame		Checks Completed 
<= 1 minute:		35 (5.3%)
<= 5 minutes:		249 (37.5%)
<= 15 minutes:		664 (100.0%)
<= 1 hour:		664 (100.0%)
Since program start:	664 (100.0%)

Metric			Min.		Max.		Average
Check Execution Time:	< 1 sec		5 sec		0.396 sec 
Check Latency:		359 sec		476 sec		415.349 sec 
Percent State Change:	0.00%		17.04%		0.03%

I don't have any excessively long check intervals as you might notice
from the data above.  The check latency seems high to me but I don't
have a complete understanding of what the value represents.

Thanks again!
Jeff


On Mon, 2004-09-20 at 13:24, Tedman Eng wrote:
> Please let us know your performance metrics
> 
> Check execution times and check latency (table in the top right).
> Would also be helpful to see active check completion rate (table in the
> top left)
> 
> These should help pinpoint where the slowdown is.
> 
> 
> Also to optimize, if you have some checks that are long-intervalled (run
> only once every day, etc), you should consider hand calculating the
> inter-check-delay rather than using the 's' method.  Use the formula from
> the documentation, but toss out any long-interval checks, since they'll
> adversely skew the calculations.
> 
> 
> -----Original Message-----
> From: Jeff Engstrom [mailto:jeff.engstrom at fortix.net]
> Sent: Monday, September 20, 2004 10:41 AM
> To: nagios-users at lists.sourceforge.net
> Subject: [Nagios-users] Slow scheduled service checks
> 
> 
> Hello all,
> 
> I have a server monitoring some 1500 points and it seems for the most
> part to run quite well. However, for one reason or another the "Last
> Check" times are off when a service is down. That is not the only
> problem actually... it appears that it can take some 15mins after the
> service is restored for the update to reach the interface.
> 
> The main cfg is detailed below...
> 
> check_external_commands=1
> command_check_interval=-1
> log_rotation_method=d
> use_syslog=1
> log_notifications=1
> log_service_retries=1
> log_host_retries=1
> log_event_handlers=1
> log_initial_states=1
> log_external_commands=1
> log_passive_service_checks=1
> inter_check_delay_method=s
> service_interleave_factor=s
> max_concurrent_checks=18
> service_reaper_frequency=3
> sleep_time=1
> service_check_timeout=60
> host_check_timeout=60
> event_handler_timeout=30
> notification_timeout=30
> ocsp_timeout=5
> perfdata_timeout=5
> retain_state_information=1
> retention_update_interval=60
> use_retained_program_state=0
> interval_length=60
> use_agressive_host_checking=0
> execute_service_checks=1
> accept_passive_service_checks=1
> enable_notifications=1
> enable_event_handlers=1
> process_performance_data=0
> obsess_over_services=1
> ocsp_command=submit_check_result
> check_for_orphaned_services=1
> check_service_freshness=1
> freshness_check_interval=60
> aggregate_status_updates=1
> status_update_interval=15
> enable_flap_detection=1
> low_service_flap_threshold=5.0
> high_service_flap_threshold=20.0
> low_host_flap_threshold=5.0
> high_host_flap_threshold=20.0
> 
> Thanks for any help on this!
> 
> 
> -------------------------------------------------------
> This SF.Net email is sponsored by: YOU BE THE JUDGE. Be one of 170
> Project Admins to receive an Apple iPod Mini FREE for your judgement on
> who ports your project to Linux PPC the best. Sponsored by IBM.
> Deadline: Sept. 24. Go here: http://sf.net/ppc_contest.php
> _______________________________________________
> Nagios-users mailing list
> Nagios-users at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-users
> ::: Please include Nagios version, plugin version (-v) and OS when reporting
> any issue.
> ::: Messages without supporting info will risk being sent to /dev/null

