Passive checks greatly delaying active checks

Ludwig Pummer Ludwig.Pummer at Copart.Com
Tue Sep 27 22:39:36 CEST 2005


Hello folks,

I'm experimenting with a distributed monitoring + failover configuration
between 2 nagios servers, each actively monitoring its own group of
hosts unless the other nagios server fails.

Nagios server #1 is a dual Xeon 2.4GHz (hyperthreading off) w/ 1.5GB RAM
running RHES 3. Nagios server #2 is a dual Xeon 3.2GHz (hyperthreading
on) w/ 3.0GB RAM running RHES 3 in 64-bit mode.

Both are running Nagios 1.2. They are running identical Nagios
configurations with the exception of active/passive services. My nagios
init script sends DISABLE_HOST_SVC_CHECKS,
DISABLE_HOST_SVC_NOTIFICATIONS, and DISABLE_HOST_NOTIFICATIONS commands
at nagios startup for those hosts which that particular nagios server is
not supposed to actively monitor. I've got 472 hosts and 1487 services
total. Server #1 has 686 active and 801 passive service checks. Server
#2 has 805 active and 682 passive service checks. Both machines have an
ocsp_command set up which will send_nsca to the other nagios server the
results of any active checks.
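
For readers who want to picture the two pieces described above, here is a minimal Python sketch of the text each one produces. It is not the actual init script or ocsp_command from this setup. It only formats the lines: one external command per line for the Nagios command file, and one tab-delimited check result per line as send_nsca reads by default on standard input. The host and service names are hypothetical, as are the helper function names.

```python
import time

# The three external commands named in the post, sent once per standby host.
STANDBY_COMMANDS = (
    "DISABLE_HOST_SVC_CHECKS",
    "DISABLE_HOST_SVC_NOTIFICATIONS",
    "DISABLE_HOST_NOTIFICATIONS",
)


def external_command(name, host, now=None):
    """Format one line for the Nagios external command file: [time] NAME;host."""
    ts = int(time.time() if now is None else now)
    return f"[{ts}] {name};{host}\n"


def standby_commands(hosts, now=None):
    """Return every command line needed to put the given hosts in passive mode."""
    return [external_command(c, h, now) for h in hosts for c in STANDBY_COMMANDS]


def nsca_line(host, service, return_code, output):
    """Format one service check result in send_nsca's default tab-delimited input."""
    return f"{host}\t{service}\t{return_code}\t{output}\n"


if __name__ == "__main__":
    # Hypothetical host and service names, for illustration only.
    for line in standby_commands(["web01", "db01"]):
        print(line, end="")
    print(nsca_line("web01", "HTTP", 0, "HTTP OK"), end="")
```

In a real setup the command lines would be appended to the file named by command_file in nagios.cfg, and each result line would be piped to send_nsca from the ocsp_command.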

The issue I'm having is that when I have nsca running to receive passive
checks from the other host, active checks are delayed a lot (from under
30 seconds without nsca to 15-25 minutes with nsca running). My
command_check_interval is set to -1. I have log_passive_service_checks
set to 1 for testing, so I can see the nsca results coming in. I don't
see why receiving passive checks is causing such large delays in my
active checks.

Below are numbers from the top two tables on the Performance Info page.

I start off nagios with the nsca daemon not running. Everything works
fine, except all the passive checks on both machines keep reporting
"pending".

This is the performance info after an hour or so of steady operation:
Server #1:
Time Frame	Checks Completed
<= 1 minute:	294 (42.9%)
<= 5 minutes:	686 (100.0%)
<= 15 minutes:	686 (100.0%)
<= 1 hour:	686 (100.0%)
Since program start:  	686 (100.0%)
	
Metric	Min.	Max.	Average
Check Execution Time:  	< 1 sec	8 sec	1.819 sec
Check Latency:	< 1 sec	11 sec	1.010 sec
Percent State Change:	0.00%	8.95%	0.01%

Server #2:
Time Frame	Checks Completed
<= 1 minute:	446 (55.4%)
<= 5 minutes:	805 (100.0%)
<= 15 minutes:	805 (100.0%)
<= 1 hour:	805 (100.0%)
Since program start:  	805 (100.0%)
	
Metric	Min.	Max.	Average
Check Execution Time:  	< 1 sec	16 sec	2.933 sec
Check Latency:	< 1 sec	17 sec	1.102 sec
Percent State Change:	0.00%

A minute after the above numbers were taken, I started up the nsca
daemon on both machines (single-process daemon mode).

10 minutes later, the numbers look like this:
Server #1:
Time Frame	Checks Completed
<= 1 minute:	0 (0.0%)
<= 5 minutes:	90 (13.1%)
<= 15 minutes:	686 (100.0%)
<= 1 hour:	686 (100.0%)
Since program start:  	686 (100.0%)
	
Metric	Min.	Max.	Average
Check Execution Time:  	< 1 sec	8 sec	1.840 sec
Check Latency:	< 1 sec	277 sec	125.098 sec
Percent State Change:	0.00%	11.32%	0.02%

Server #2:
Time Frame	Checks Completed
<= 1 minute:	0 (0.0%)
<= 5 minutes:	187 (23.3%)
<= 15 minutes:	803 (100.0%)
<= 1 hour:	803 (100.0%)
Since program start:  	803 (100.0%)

About 18 hours later, they look like this:

Server #1:
Time Frame	Checks Completed
<= 1 minute:	0 (0.0%)
<= 5 minutes:	126 (18.4%)
<= 15 minutes:	522 (76.2%)
<= 1 hour:	685 (100.0%)
Since program start:  	685 (100.0%)
	
Metric	Min.	Max.	Average
Check Execution Time:  	< 1 sec	8 sec	1.858 sec
Check Latency:	502 sec	955 sec	774.826 sec
Percent State Change:	0.00%	0.00%	0.00%

Server #2:
Time Frame	Checks Completed
<= 1 minute:	0 (0.0%)
<= 5 minutes:	227 (28.2%)
<= 15 minutes:	601 (74.8%)
<= 1 hour:	804 (100.0%)
Since program start:  	804 (100.0%)
	
Metric	Min.	Max.	Average
Check Execution Time:  	< 1 sec	16 sec	2.960 sec
Check Latency:	23 sec	1084 sec	517.776 sec
Percent State Change:	0.00%	4.14%	0.01%

If I kill the nsca daemon, 10 minutes later the numbers look like this:

Server #1:
Time Frame	Checks Completed
<= 1 minute:	2 (0.3%)
<= 5 minutes:	686 (100.0%)
<= 15 minutes:	686 (100.0%)
<= 1 hour:	686 (100.0%)
Since program start:  	686 (100.0%)
	
Metric	Min.	Max.	Average
Check Execution Time:  	< 1 sec	8 sec	1.838 sec
Check Latency:	< 1 sec	19 sec	2.646 sec
Percent State Change:	0.00%	5.99%	0.01%

Server #2:
Time Frame	Checks Completed
<= 1 minute:	2 (0.2%)
<= 5 minutes:	805 (100.0%)
<= 15 minutes:	805 (100.0%)
<= 1 hour:	805 (100.0%)
Since program start:  	805 (100.0%)
	
Metric	Min.	Max.	Average
Check Execution Time:  	< 1 sec	16 sec	2.934 sec
Check Latency:	< 1 sec	38 sec	2.376 sec
Percent State Change:	0.00%	12.37%	0.04%

And about 20 minutes after killing nsca, they look like this:

Server #1:
Time Frame	Checks Completed
<= 1 minute:	101 (14.7%)
<= 5 minutes:	686 (100.0%)
<= 15 minutes:	686 (100.0%)
<= 1 hour:	686 (100.0%)
Since program start:  	686 (100.0%)
	
Metric	Min.	Max.	Average
Check Execution Time:  	< 1 sec	8 sec	1.821 sec
Check Latency:	< 1 sec	18 sec	3.213 sec
Percent State Change:	0.00%	5.46%	0.01%

Server #2:
Time Frame	Checks Completed
<= 1 minute:	2 (0.2%)
<= 5 minutes:	805 (100.0%)
<= 15 minutes:	805 (100.0%)
<= 1 hour:	805 (100.0%)
Since program start:  	805 (100.0%)
	
Metric	Min.	Max.	Average
Check Execution Time:  	< 1 sec	16 sec	2.937 sec
Check Latency:	< 1 sec	27 sec	0.840 sec
Percent State Change:	0.00%	29.67%	0.12%
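
To make the trend in the tables above easier to see, here is a small Python sketch that compares the average check latency of each snapshot with the first one (nsca off, steady state). The figures are copied from the tables in this post. The snapshot labels and function names are my own, and None marks the one snapshot with no latency figure (server #2, 10 minutes after starting nsca).

```python
# Average check latency in seconds: (label, server #1, server #2).
SNAPSHOTS = [
    ("nsca off, steady state", 1.010, 1.102),
    ("nsca on, +10 min", 125.098, None),
    ("nsca on, +18 h", 774.826, 517.776),
    ("nsca killed, +10 min", 2.646, 2.376),
    ("nsca killed, +20 min", 3.213, 0.840),
]


def slowdown(baseline, value):
    """Return value / baseline, or None when the snapshot has no figure."""
    return None if value is None else value / baseline


def report(snapshots):
    """Compare every snapshot against the first one, per server."""
    base1, base2 = snapshots[0][1], snapshots[0][2]
    return [
        (label, slowdown(base1, s1), slowdown(base2, s2))
        for label, s1, s2 in snapshots
    ]


def fmt(ratio):
    """Render a ratio for the printed table."""
    return "n/a" if ratio is None else f"{ratio:.1f}x"


if __name__ == "__main__":
    for label, r1, r2 in report(SNAPSHOTS):
        print(f"{label:24s} server #1: {fmt(r1):>8s}  server #2: {fmt(r2):>8s}")
```

This only restates the numbers already given. It does not say anything about why latency grows while nsca is running.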

Any ideas?

--Ludwig Pummer


_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users