High check latency in a machine with low load

Javier Vela Diago jvela at s2grupo.es
Tue Oct 11 16:16:55 CEST 2011


I have a lot of custom checks, written mostly in Perl and Bash, with some 
in Python. Some of them take a lot of time.

Never mind, I think I found the solution, or at least part of it. I set 
use_large_installation_tweaks to 1. Six months ago this option almost 
crashed my system, so I discarded it. With bigger problems now, it was 
the last thing I wanted to try, but this afternoon I finally tested it.

When I restarted Nagios, the load started to grow to 6-8 and the latency 
problems disappeared. I was sceptical about the usefulness of this option, 
but when the load goes from 2.5 to 6, it means the machine is doing a lot 
of work that it wasn't doing before.

Now the problem is that NDOUtils is causing some latency because of 
MySQL, but at least I know what to optimize. Any tips would be 
appreciated :)

Thank you, and sorry for taking up your time.


From:    Daniel Wittenberg <daniel.wittenberg.r0ko at statefarm.com>
To:      Nagios Users List <nagios-users at lists.sourceforge.net>
Date:    11/10/2011 16:02
Subject: Re: [Nagios-users] High check latency in a machine with low load



I think you have the enable_high_latency option enabled :)  j/k
 
Do you have any particular checks that are taking a long time?  i.e. can 
you watch top and see checks taking a while?
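
One way to follow up on this question without watching top by hand is to 
rank services by the execution time Nagios already records in status.dat. 
The sketch below is only an illustration. It assumes the Nagios 3 
status.dat layout of "servicestatus { key=value }" blocks carrying 
check_execution_time and check_latency fields; the function name 
slowest_services and the sample data are made up for the example.

```python
import re


def slowest_services(status_text, top=5):
    """Return (execution_time, latency, host, service) tuples, slowest first."""
    rows = []
    # Assumption: each service is stored as a 'servicestatus { ... }' block
    # whose closing brace sits on its own line.
    for block in re.findall(r"servicestatus\s*\{(.*?)\n\s*\}", status_text, re.S):
        fields = {}
        for line in block.splitlines():
            key, sep, value = line.strip().partition("=")
            if sep:
                fields[key] = value
        rows.append((
            float(fields.get("check_execution_time", 0)),
            float(fields.get("check_latency", 0)),
            fields.get("host_name", "?"),
            fields.get("service_description", "?"),
        ))
    return sorted(rows, reverse=True)[:top]


# Hypothetical sample data, not taken from the real status file.
SAMPLE = """\
servicestatus {
\thost_name=web1
\tservice_description=HTTP
\tcheck_execution_time=0.051
\tcheck_latency=32.876
\t}

servicestatus {
\thost_name=db1
\tservice_description=Custom Perl check
\tcheck_execution_time=60.097
\tcheck_latency=442.138
\t}
"""

for exec_time, latency, host, service in slowest_services(SAMPLE):
    print(f"{exec_time:8.3f}s  latency {latency:8.3f}s  {host} / {service}")
```

Against a live system the same function could be fed the contents of the 
status file reported by nagiostats instead of SAMPLE. Checks that sit near 
the configured timeout are the first candidates to look at.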
 
Dan
 
 
From: Javier Vela Diago [mailto:jvela at s2grupo.es] 
Sent: Tuesday, October 11, 2011 6:23 AM
To: nagios-users at lists.sourceforge.net
Subject: [Nagios-users] High check latency in a machine with low load
 
Hi, 

I have a Nagios 3.2.3 deployment with 1000+ hosts and 3000+ services. This 
Nagios runs together with NDO and PNP (in bulk mode) on a server with 4 GB 
of RAM and 4 CPUs. 

One day I realized that the check latency in the performance CGI was very 
high (300-400 seconds). This was very strange, so I took the tuning guide 
from Nagios (http://nagios.sourceforge.net/docs/3_0/tuning.html) and 
applied all the points I could. In particular I set 
max_concurrent_checks to zero (no limit): 

max_concurrent_checks=0 

The reaper event: 

service_reaper_frequency=5 
max_check_result_reaper_time=15 

and checked that the host checks were not forced. In addition I 
configured a host check cache of 15 seconds: 

cached_host_check_horizon=15 
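
A quick way to confirm which of these values the main configuration file 
really contains is to read them back with a small script. This is a 
minimal sketch, assuming the usual nagios.cfg layout of key=value lines 
with '#' comments. The function name read_directives and the sample text 
are hypothetical; the sample values are the ones quoted above.

```python
TUNING_KEYS = {
    "max_concurrent_checks",
    "service_reaper_frequency",
    "max_check_result_reaper_time",
    "cached_host_check_horizon",
}


def read_directives(cfg_text, wanted=TUNING_KEYS):
    """Return the wanted key=value directives found in a main config text."""
    found = {}
    for line in cfg_text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() in wanted:
            # A later occurrence of the same key overwrites an earlier one here.
            found[key.strip()] = value.strip()
    return found


# Hypothetical excerpt standing in for the real nagios.cfg.
SAMPLE_CFG = """\
# performance settings
max_concurrent_checks=0
service_reaper_frequency=5
max_check_result_reaper_time=15
cached_host_check_horizon=15
log_file=/tmp/nagios.log
"""

for name, value in sorted(read_directives(SAMPLE_CFG).items()):
    print(f"{name} = {value}")
```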

But the problem remains, and the load on the server is not very high: a 
load of 2.5, 2 GB of free memory and an average disk utilization of 7%. I 
disabled NDO and PNP but it made no difference. After the first round of 
checks, the delay returns, while the load on the server doesn't grow. 

I have searched Google, but all the problems reported there are caused by 
load on the server, which is not the main problem here. So my question is: 
what can I do now? Is there some variable that shows me where to look? I'm 
a bit lost right now and I don't know how to find the problem. 

Or maybe the only way is to configure a master-slave Nagios setup in order 
to maximize server utilization? 

In addition, I have pretty big timeouts (60 seconds) because of the high 
latency on the network. All your help is appreciated. Thank you in 
advance. 

nagiostats 
Nagios Stats 3.2.3 
Copyright (c) 2003-2008 Ethan Galstad (www.nagios.org) 
Last Modified: 10-03-2010 
License: GPL 

CURRENT STATUS DATA 
------------------------------------------------------ 
Status File: /usr/local/argos/aplicaciones/nagios/var/status.dat 
Status File Age:                        0d 0h 0m 11s 
Status File Version:                    3.2.3 

Program Running Time:                   0d 20h 56m 7s 
Nagios PID:                             21834 
Used/High/Total Command Buffers:        0 / 0 / 4096 

Total Services:                         4032 
Services Checked:                       4032 
Services Scheduled:                     4030 
Services Actively Checked:              4032 
Services Passively Checked:             0 
Total Service State Change:             0.000 / 37.300 / 0.163 % 
Active Service Latency:                 32.876 / 442.138 / 415.816 sec 
Active Service Execution Time:          0.051 / 60.097 / 1.545 sec 
Active Service State Change:            0.000 / 37.300 / 0.163 % 
Active Services Last 1/5/15/60 min:     237 / 1530 / 4020 / 4020 
Passive Service Latency:                0.000 / 0.000 / 0.000 sec 
Passive Service State Change:           0.000 / 0.000 / 0.000 % 
Passive Services Last 1/5/15/60 min:    0 / 0 / 0 / 0 
Services Ok/Warn/Unk/Crit:              3766 / 38 / 44 / 184 
Services Flapping:                      0 
Services In Downtime:                   0 

Total Hosts:                            931 
Hosts Checked:                          931 
Hosts Scheduled:                        931 
Hosts Actively Checked:                 931 
Host Passively Checked:                 0 
Total Host State Change:                0.000 / 12.370 / 0.077 % 
Active Host Latency:                    0.000 / 441.308 / 416.063 sec 
Active Host Execution Time:             0.062 / 10.113 / 0.395 sec 
Active Host State Change:               0.000 / 12.370 / 0.077 % 
Active Hosts Last 1/5/15/60 min:        74 / 423 / 931 / 931 
Passive Host Latency:                   0.000 / 0.000 / 0.000 sec 
Passive Host State Change:              0.000 / 0.000 / 0.000 % 
Passive Hosts Last 1/5/15/60 min:       0 / 0 / 0 / 0 
Hosts Up/Down/Unreach:                  897 / 24 / 10 
Hosts Flapping:                         0 
Hosts In Downtime:                      1 

Active Host Checks Last 1/5/15 min:     109 / 535 / 1583 
   Scheduled:                           87 / 433 / 1300 
   On-demand:                           22 / 102 / 283 
   Parallel:                            87 / 438 / 1323 
   Serial:                              0 / 0 / 0 
   Cached:                              22 / 97 / 260 
Passive Host Checks Last 1/5/15 min:    0 / 0 / 0 
Active Service Checks Last 1/5/15 min:  304 / 1605 / 4924 
   Scheduled:                           304 / 1605 / 4923 
   On-demand:                           0 / 0 / 1 
   Cached:                              0 / 0 / 0 
Passive Service Checks Last 1/5/15 min: 0 / 0 / 0 

External Commands Last 1/5/15 min:      0 / 0 / 0 

nagios -s 

Nagios Core 3.2.3 
Copyright (c) 2009-2010 Nagios Core Development Team and Community 
Contributors 
Copyright (c) 1999-2009 Ethan Galstad 
Last Modified: 10-03-2010 
License: GPL 

Website: http://www.nagios.org 
Warning: aggregate_status_updates directive ignored.  All status file 
updates are now aggregated. 
Warning: downtime_file variable ignored.  Downtime entries are now stored 
in the status and retention files. 
Warning: comment_file variable ignored.  Comments are now stored in the 
status and retention files. 
Timing information on object configuration processing is listed 
below.  You can use this information to see if precaching your 
object configuration would be useful. 

Object Config Source: Config files (uncached) 

OBJECT CONFIG PROCESSING TIMES      (* = Potential for precache savings 
with -u option) 
---------------------------------- 
Read:                 0.080036 sec 
Resolve:              0.010660 sec  * 
Recomb Contactgroups: 0.002666 sec  * 
Recomb Hostgroups:    0.004086 sec  * 
Dup Services:         0.034632 sec  * 
Recomb Servicegroups: 0.001277 sec  * 
Duplicate:            0.010939 sec  * 
Inherit:              0.005594 sec  * 
Recomb Contacts:      0.000001 sec  * 
Sort:                 0.000000 sec  * 
Register:             0.074413 sec 
Free:                 0.008730 sec 
                      ============ 
TOTAL:                0.234920 sec  * = 0.071741 sec (30.54%) estimated 
savings 


RETENTION DATA TIMES 
---------------------------------- 
Read and Process:     0.495480 sec 
                      ============ 
TOTAL:                0.495480 sec 


Timing information on configuration verification is listed below. 

CONFIG VERIFICATION TIMES          (* = Potential for speedup with -x 
option) 
---------------------------------- 
Object Relationships: 0.060039 sec 
Circular Paths:       0.026557 sec  * 
Misc:                 0.005999 sec 
                      ============ 
TOTAL:                0.092595 sec  * = 0.026557 sec (28.7%) estimated 
savings 


EVENT SCHEDULING TIMES 
------------------------------------- 
Get service info:        0.014509 sec 
Get host info info:      0.002853 sec 
Get service params:      0.000078 sec 
Schedule service times:  0.039947 sec 
Schedule service events: 0.034656 sec 
Get host params:         0.000001 sec 
Schedule host times:     0.007519 sec 
Schedule host events:    0.029519 sec 
                         ============ 
TOTAL:                   0.129082 sec 


Projected scheduling information for host and service checks 
is listed below.  This information assumes that you are going 
to start running Nagios with your current config files. 

HOST SCHEDULING INFORMATION 
--------------------------- 
Total hosts:                     931 
Total scheduled hosts:           931 
Host inter-check delay method:   SMART 
Average host check interval:     259.01 sec 
Host inter-check delay:          0.28 sec 
Max host check spread:           30 min 
First scheduled check:           Tue Oct 11 13:14:08 2011 
Last scheduled check:            Tue Oct 11 13:18:26 2011 


SERVICE SCHEDULING INFORMATION 
------------------------------- 
Total services:                     4032 
Total scheduled services:           4030 
Service inter-check delay method:   SMART 
Average service check interval:     299.55 sec 
Inter-check delay:                  0.07 sec 
Interleave factor method:           SMART 
Average services per host:          4.33 
Service interleave factor:          5 
Max service check spread:           30 min 
First scheduled check:              Tue Oct 11 13:15:07 2011 
Last scheduled check:               Tue Oct 11 13:20:07 2011 


CHECK PROCESSING INFORMATION 
---------------------------- 
Check result reaper interval:       5 sec 
Max concurrent service checks:      Unlimited 


PERFORMANCE SUGGESTIONS 
----------------------- 
I have no suggestions - things look okay. 
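
The figures in the two outputs above can be combined into a rough 
throughput estimate. The sketch below is back-of-the-envelope arithmetic, 
not something Nagios reports itself. It assumes the SMART inter-check 
delay is simply the average check interval divided by the number of 
scheduled checks, and that the 15-minute check counter reflects steady 
state. The variable and function names are made up for the example.

```python
# Figures copied from the nagiostats and "nagios -s" output above.
SCHEDULED_SERVICES = 4030        # Total scheduled services
AVG_SERVICE_INTERVAL_S = 299.55  # Average service check interval
SERVICE_CHECKS_15_MIN = 4924     # Active Service Checks, last 15 min
SCHEDULED_HOSTS = 931            # Total scheduled hosts
AVG_HOST_INTERVAL_S = 259.01     # Average host check interval


def checks_per_second(count, seconds):
    """Average check rate over a time window."""
    return count / seconds


# Rate needed to run every service once per average interval.
required = checks_per_second(SCHEDULED_SERVICES, AVG_SERVICE_INTERVAL_S)
# Rate actually achieved over the last 15 minutes.
achieved = checks_per_second(SERVICE_CHECKS_15_MIN, 15 * 60)
# Time to get through all services once at the achieved rate.
cycle_s = SCHEDULED_SERVICES / achieved
# How far behind schedule a check ends up under that assumption.
estimated_lag_s = cycle_s - AVG_SERVICE_INTERVAL_S

# Assumed SMART inter-check delay: average interval / scheduled checks.
service_delay_s = AVG_SERVICE_INTERVAL_S / SCHEDULED_SERVICES
host_delay_s = AVG_HOST_INTERVAL_S / SCHEDULED_HOSTS

print(f"required rate:   {required:.2f} checks/s")
print(f"achieved rate:   {achieved:.2f} checks/s")
print(f"full cycle:      {cycle_s:.0f} s")
print(f"estimated lag:   {estimated_lag_s:.0f} s")
print(f"service delay:   {service_delay_s:.2f} s")
print(f"host delay:      {host_delay_s:.2f} s")
```

Under these assumptions the scheduler needs roughly 13 service checks per 
second but completes about 5, so one pass over all services takes around 
12 minutes instead of 5. That is in the same range as the average service 
latency of about 415 seconds reported by nagiostats, and the two computed 
delays round to the 0.07 and 0.28 seconds shown in the scheduling output.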
-- 
Javier Vela Diago
S2 GRUPO
Ramiro de Maeztu, 7 bajo. 46022 Valencia
Tel: 963.110.300 Fax: 963.106.086
e-mail : jvela at s2grupo dot es
http://www.s2grupo.es
_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when 
reporting any issue. 
::: Messages without supporting info will risk being sent to /dev/null

