Passive checks greatly delaying active checks

Fred f1216 at yahoo.com
Wed Sep 28 16:09:17 CEST 2005
Ludwig,

Look back through the archives at a few of my posts; we are having the
same problem.  I found that removing the host check solved it, but
that may simply have pushed the number of active checks below some
threshold.

I have not found the time to trace this down yet.  If anyone has any
ideas, they would be greatly appreciated.

I also found that changing the inter-service delay and the service
check scheduling values from "s" to "n" gives different results, but
you are absolutely correct in your observations about the number of passive
vs. active checks.  In my setup, somewhere around 5000 service definitions
is where things stop getting scheduled.
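
For reference, here is a rough sketch of what the "s" (smart) setting is
trying to do, going by the formula in the Nagios check scheduling
documentation: the spacing between checks is the average check interval
divided by the total number of services, whereas "n" uses no spacing at all.
The 5-minute average interval and the function name are assumptions for
illustration only, not values taken from either of our configurations.

```python
def smart_inter_check_delay(avg_check_interval_s: float, total_services: int) -> float:
    """Spacing (in seconds) between consecutive service checks under "smart" scheduling.

    Documented formula: average check interval / total number of services.
    """
    if total_services <= 0:
        raise ValueError("total_services must be positive")
    return avg_check_interval_s / total_services


if __name__ == "__main__":
    ASSUMED_INTERVAL_S = 300.0  # assumption: 5-minute average check interval
    # 686 and 805 are the active-check counts from Ludwig's post; 5000 is
    # roughly where scheduling stops working for us.
    for services in (686, 805, 5000):
        delay = smart_inter_check_delay(ASSUMED_INTERVAL_S, services)
        print(f"{services:5d} services -> {delay:.3f} s between checks")
```

The point is only that the spacing shrinks quickly as the service count
grows, so anything that stalls the scheduler for a few seconds backs up a
lot of checks.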

-FredC

--- Ludwig Pummer <Ludwig.Pummer at Copart.Com> wrote:

> Hello folks,
> 
> I'm experimenting with a distributed monitoring + failover configuration
> between 2 nagios servers, each actively monitoring its own group of
> hosts unless the other nagios server fails.
> 
> Nagios server #1 is a dual Xeon 2.4GHz (hyperthreading off) w/ 1.5GB RAM
> running RHES 3. Nagios server #2 is a dual Xeon 3.2GHz (hyperthreading
> on) w/ 3.0GB RAM running RHES 3 in 64-bit mode.
> 
> Both are running Nagios 1.2. They are running identical Nagios
> configurations with the exception of active/passive services. My nagios
> init script sends DISABLE_HOST_SVC_CHECKS,
> DISABLE_HOST_SVC_NOTIFICATIONS, and DISABLE_HOST_NOTIFICATIONS commands
> at nagios startup for those hosts which that particular nagios server is
> not supposed to actively monitor. I've got 472 hosts and 1487 services
> total. Server #1 has 686 active and 801 passive service checks. Server
> #2 has 805 active and 682 passive service checks. Both machines have an
> ocsp_command set up which will send_nsca to the other nagios server the
> results of any active checks.
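
To make the two mechanisms in the paragraph above concrete, here is a
minimal sketch of the text each one produces: an external command line
such as DISABLE_HOST_SVC_CHECKS as written to the Nagios command file, and
the tab-separated service result line that send_nsca reads on standard
input. The helper names and the host "web01" are hypothetical; this is not
Ludwig's actual init script or ocsp_command.

```python
import time
from typing import Optional


def external_command(name: str, *args: str, now: Optional[float] = None) -> str:
    """Format one external command line: "[timestamp] NAME;arg1;arg2"."""
    timestamp = int(time.time() if now is None else now)
    return f"[{timestamp}] {';'.join((name,) + args)}\n"


def nsca_service_line(host: str, service: str, return_code: int, output: str) -> str:
    """Format one passive service result as send_nsca expects it on stdin:
    host<TAB>service<TAB>return code<TAB>plugin output<NEWLINE>."""
    if return_code not in (0, 1, 2, 3):  # OK, WARNING, CRITICAL, UNKNOWN
        raise ValueError(f"invalid return code: {return_code}")
    # Keep the result on a single line; embedded newlines would split it.
    one_line = " ".join(output.splitlines())
    return f"{host}\t{service}\t{return_code}\t{one_line}\n"


if __name__ == "__main__":
    # Commands a server would issue for a host it should not actively check.
    for cmd in ("DISABLE_HOST_SVC_CHECKS",
                "DISABLE_HOST_SVC_NOTIFICATIONS",
                "DISABLE_HOST_NOTIFICATIONS"):
        print(external_command(cmd, "web01"), end="")
    # Result of one active check, as it would be handed to send_nsca.
    print(nsca_service_line("web01", "PING", 0, "PING OK - rta 0.5 ms"), end="")
```

With 686 and 805 active checks on the two servers, every completed check
produces one such line, so each server receives a steady stream of passive
results from the other.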
> 
> The issue I'm having is that when I have nsca running to receive passive
> checks from the other host, active checks are delayed a lot (from under
> 30 seconds without nsca to 15-25 minutes with nsca running). My
> command_check_interval is set to -1. I have log_passive_service_checks
> set to 1 for testing, so I can see the nsca results coming in. I don't
> see why receiving passive checks is causing such large delays in my
> active checks.
> 
> Below are numbers from the top two tables on the Performance Info page.
> 
> I start off nagios with the nsca daemon not running. Everything works
> fine, except all the passive checks on both machines keep reporting
> "pending".
> 
> This is the performance info after an hour or so of steady operation:
> Server #1:
> Time Frame	Checks Completed
> <= 1 minute:	294 (42.9%)
> <= 5 minutes:	686 (100.0%)
> <= 15 minutes:	686 (100.0%)
> <= 1 hour:	686 (100.0%)
> Since program start:  	686 (100.0%)
> 	
> Metric	Min.	Max.	Average
> Check Execution Time:  	< 1 sec	8 sec	1.819 sec
> Check Latency:	< 1 sec	11 sec	1.010 sec
> Percent State Change:	0.00%	8.95%	0.01%
> 
> Server #2:
> Time Frame	Checks Completed
> <= 1 minute:	446 (55.4%)
> <= 5 minutes:	805 (100.0%)
> <= 15 minutes:	805 (100.0%)
> <= 1 hour:	805 (100.0%)
> Since program start:  	805 (100.0%)
> 	
> Metric	Min.	Max.	Average
> Check Execution Time:  	< 1 sec	16 sec	2.933 sec
> Check Latency:	< 1 sec	17 sec	1.102 sec
> Percent State Change:	0.00%
> 
> A minute after the above numbers were taken, I started up the nsca
> daemon on both machines (single-process daemon mode).
> 
> 10 minutes later, the numbers look like this:
> Server #1:
> Time Frame	Checks Completed
> <= 1 minute:	0 (0.0%)
> <= 5 minutes:	90 (13.1%)
> <= 15 minutes:	686 (100.0%)
> <= 1 hour:	686 (100.0%)
> Since program start:  	686 (100.0%)
> 	
> Metric	Min.	Max.	Average
> Check Execution Time:  	< 1 sec	8 sec	1.840 sec
> Check Latency:	< 1 sec	277 sec	125.098 sec
> Percent State Change:	0.00%	11.32%	0.02%
> 
> Server #2:
> Time Frame	Checks Completed
> <= 1 minute:	0 (0.0%)
> <= 5 minutes:	187 (23.3%)
> <= 15 minutes:	803 (100.0%)
> <= 1 hour:	803 (100.0%)
> Since program start:  	803 (100.0%)
> 
> About 18 hours later, they look like this:
> 
> Server #1:
> Time Frame	Checks Completed
> <= 1 minute:	0 (0.0%)
> <= 5 minutes:	126 (18.4%)
> <= 15 minutes:	522 (76.2%)
> <= 1 hour:	685 (100.0%)
> Since program start:  	685 (100.0%)
> 	
> Metric	Min.	Max.	Average
> Check Execution Time:  	< 1 sec	8 sec	1.858 sec
> Check Latency:	502 sec	955 sec	774.826 sec
> Percent State Change:	0.00%	0.00%	0.00%
> 
> Server #2:
> Time Frame	Checks Completed
> <= 1 minute:	0 (0.0%)
> <= 5 minutes:	227 (28.2%)
> <= 15 minutes:	601 (74.8%)
> <= 1 hour:	804 (100.0%)
> Since program start:  	804 (100.0%)
> 	
> Metric	Min.	Max.	Average
> Check Execution Time:  	< 1 sec	16 sec	2.960 sec
> Check Latency:	23 sec	1084 sec	517.776 sec
> Percent State Change:	0.00%	4.14%	0.01%
> 
> If I kill the nsca daemon, 10 minutes later the numbers look like this:
> 
> Server #1:
> Time Frame	Checks Completed
> <= 1 minute:	2 (0.3%)
> <= 5 minutes:	686 (100.0%)
> <= 15 minutes:	686 (100.0%)
> <= 1 hour:	686 (100.0%)
> Since program start:  	686 (100.0%)
> 	
> Metric	Min.	Max.	Average
> Check Execution Time:  	< 1 sec	8 sec	1.838 sec
> Check Latency:	< 1 sec	19 sec	2.646 sec
> Percent State Change:	0.00%	5.99%	0.01%
> 
> Server #2:
> Time Frame	Checks Completed
> <= 1 minute:	2 (0.2%)
> <= 5 minutes:	805 (100.0%)
> <= 15 minutes:	805 (100.0%)
> <= 1 hour:	805 (100.0%)
> Since program start:  	805 (100.0%)
> 	
> Metric	Min.	Max.	Average
> Check Execution Time:  	< 1 sec	16 sec	2.934 sec
> Check Latency:	< 1 sec	38 sec	2.376 sec
> Percent State Change:	0.00%	12.37%	0.04%
> 
> And about 20 minutes after killing nsca, they look like this:
> 
> Server #1:
> Time Frame	Checks Completed
> <= 1 minute:	101 (14.7%)
> <= 5 minutes:	686 (100.0%)
> <= 15 minutes:	686 (100.0%)
> <= 1 hour:	686 (100.0%)
> Since program start:  	686 (100.0%)
> 	
> Metric	Min.	Max.	Average
> Check Execution Time:  	< 1 sec	8 sec	1.821 sec
> Check Latency:	< 1 sec	18 sec	3.213 sec
> Percent State Change:	0.00%	5.46%	0.01%
> 
> Server #2:
> Time Frame	Checks Completed
> <= 1 minute:	2 (0.2%)
> <= 5 minutes:	805 (100.0%)
> <= 15 minutes:	805 (100.0%)
> <= 1 hour:	805 (100.0%)
> Since program start:  	805 (100.0%)
> 	
> Metric	Min.	Max.	Average
> Check Execution Time:  	< 1 sec	16 sec	2.937 sec
> Check Latency:	< 1 sec	27 sec	0.840 sec
> Percent State Change:	0.00%	29.67%	0.12%
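
The tables above are easier to compare as ratios. The sketch below takes
only the average Check Latency figures quoted in this message and expresses
each phase as a multiple of the baseline taken before nsca was started. The
phase labels and the function name are my own; Server #2 has no latency
figure for the 10-minute mark, so that entry is left as None rather than
guessed.

```python
from typing import Dict, Optional

# Average check latency in seconds, copied from the tables in this message.
AVG_LATENCY_S: Dict[str, Dict[str, Optional[float]]] = {
    "Server #1": {
        "baseline (no nsca)": 1.010,
        "nsca +10 min": 125.098,
        "nsca +18 h": 774.826,
        "killed +10 min": 2.646,
        "killed +20 min": 3.213,
    },
    "Server #2": {
        "baseline (no nsca)": 1.102,
        "nsca +10 min": None,  # not reported in the original post
        "nsca +18 h": 517.776,
        "killed +10 min": 2.376,
        "killed +20 min": 0.840,
    },
}


def slowdown(server: str, phase: str) -> Optional[float]:
    """Return a phase's average latency as a multiple of that server's baseline."""
    value = AVG_LATENCY_S[server][phase]
    baseline = AVG_LATENCY_S[server]["baseline (no nsca)"]
    if value is None or baseline is None:
        return None
    return value / baseline


if __name__ == "__main__":
    for server, phases in AVG_LATENCY_S.items():
        print(server)
        for phase in phases:
            ratio = slowdown(server, phase)
            shown = "n/a" if ratio is None else f"{ratio:8.1f}x"
            print(f"  {phase:20s} {shown}")
```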
> 
> Any ideas?
> 
> --Ludwig Pummer
> 
> 
> -------------------------------------------------------
> This SF.Net email is sponsored by:
> Power Architecture Resource Center: Free content, downloads, discussions,
> and more. http://solutions.newsforge.com/ibmarch.tmpl
> _______________________________________________
> Nagios-users mailing list
> Nagios-users at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-users
> ::: Please include Nagios version, plugin version (-v) and OS when reporting
> any issue.
> ::: Messages without supporting info will risk being sent to /dev/null
> 






