High Service Check Latency

Simone Felici s.felici at mclink.eu
Tue May 22 09:46:31 CEST 2012


Hello!

Yes, it's a common problem, but I cannot figure out how to debug it.
I have a distributed setup with a master server collecting more than 9,000 passive services sent
from other servers, all of which have active latencies near 0. The master server actively checks
*only* itself: about 40 services, most of them every 5 minutes. AFAIK passive services should not
affect the "active service check latency" statistics. Looking into the retention.dat file, the high
latencies are all related to the locally executed active services. Current stats:

Nagios Stats 3.2.3
Copyright (c) 2003-2008 Ethan Galstad (www.nagios.org)
Last Modified: 10-03-2010
License: GPL

CURRENT STATUS DATA
------------------------------------------------------
Status File:                            /usr/local/nagios/var/status.dat
Status File Age:                        0d 0h 0m 7s
Status File Version:                    3.2.3

Program Running Time:                   0d 20h 40m 53s
Nagios PID:                             9360
Used/High/Total Command Buffers:        0 / 7 / 10000

Total Services:                         9098
Services Checked:                       9098
Services Scheduled:                     33
Services Actively Checked:              39
Services Passively Checked:             9059
Total Service State Change:             0.000 / 100.000 / 1.351 %
Active Service Latency:                 4.156 / 7943.743 / 6163.392 sec   <<<<<<<<
Active Service Execution Time:          0.010 / 2.485 / 0.319 sec
Active Service State Change:            0.000 / 22.890 / 2.443 %
Active Services Last 1/5/15/60 min:     0 / 0 / 0 / 0
Passive Service Latency:                0.088 / 7.914 / 1.997 sec
Passive Service State Change:           0.000 / 100.000 / 1.346 %
Passive Services Last 1/5/15/60 min:    1851 / 7501 / 8084 / 8392
Services Ok/Warn/Unk/Crit:              8784 / 78 / 76 / 160
Services Flapping:                      4
Services In Downtime:                   112

Total Hosts:                            1912
Hosts Checked:                          1912
Hosts Scheduled:                        0
Hosts Actively Checked:                 74
Host Passively Checked:                 1838
Total Host State Change:                0.000 / 46.910 / 0.135 %
Active Host Latency:                    0.000 / 1425.848 / 1104.205 sec
Active Host Execution Time:             0.012 / 0.402 / 0.096 sec
Active Host State Change:               0.000 / 0.000 / 0.000 %
Active Hosts Last 1/5/15/60 min:        0 / 0 / 0 / 0
Passive Host Latency:                   0.000 / 639.353 / 1.197 sec
Passive Host State Change:              0.000 / 46.910 / 0.140 %
Passive Hosts Last 1/5/15/60 min:       1 / 12 / 27 / 70
Hosts Up/Down/Unreach:                  1850 / 57 / 5
Hosts Flapping:                         0
Hosts In Downtime:                      35

Active Host Checks Last 1/5/15 min:     42 / 194 / 565
    Scheduled:                           0 / 0 / 0
    On-demand:                           42 / 194 / 565
    Parallel:                            0 / 0 / 0
    Serial:                              0 / 0 / 0
    Cached:                              42 / 194 / 565
Passive Host Checks Last 1/5/15 min:    1 / 14 / 45
Active Service Checks Last 1/5/15 min:  0 / 0 / 0
    Scheduled:                           0 / 0 / 0
    On-demand:                           0 / 0 / 0
    Cached:                              0 / 0 / 0
Passive Service Checks Last 1/5/15 min: 2311 / 9235 / 12988

External Commands Last 1/5/15 min:      0 / 1 / 1
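
To show which checks are behind the average above, a minimal Python sketch like the following could list the active services by latency. It assumes the usual "type { key=value }" block layout of status.dat and retention.dat, and that check_type=0 marks an active check; the sample data and the function names are made up for illustration.

```python
import re
from typing import Dict, Iterator, List, Tuple

# Matches the opening line of a block, e.g. "servicestatus {" or "service {".
BLOCK_START = re.compile(r"^\s*(\w+)\s*\{\s*$")


def iter_blocks(text: str) -> Iterator[Tuple[str, Dict[str, str]]]:
    """Yield (block_type, fields) for every "type { key=value ... }" block."""
    block_type = None
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        stripped = line.strip()
        if block_type is None:
            match = BLOCK_START.match(stripped)
            if match:
                block_type, fields = match.group(1), {}
        elif stripped == "}":
            yield block_type, fields
            block_type = None
        elif "=" in stripped:
            # Split on the first "=" only, so values may contain "=" themselves.
            key, _, value = stripped.partition("=")
            fields[key] = value


def active_latencies(text: str) -> List[Tuple[float, str, str]]:
    """Return (latency, host, service) for active service checks, worst first."""
    rows = []
    for block_type, fields in iter_blocks(text):
        # "servicestatus" is used in status.dat, "service" in retention.dat.
        if block_type not in ("service", "servicestatus"):
            continue
        # Assumption: check_type=0 is an active check, 1 a passive one.
        if fields.get("check_type") != "0":
            continue
        rows.append((
            float(fields.get("check_latency", "0")),
            fields.get("host_name", "?"),
            fields.get("service_description", "?"),
        ))
    return sorted(rows, reverse=True)


# Hypothetical sample data, only to show the expected input shape.
SAMPLE = """
servicestatus {
    host_name=master
    service_description=Load
    check_type=0
    check_latency=6163.392
    }
servicestatus {
    host_name=remote
    service_description=Disk
    check_type=1
    check_latency=1.997
    }
"""

for latency, host, service in active_latencies(SAMPLE):
    print(f"{latency:10.3f}  {host}  {service}")
```

On the real server the text would be read from /usr/local/nagios/var/status.dat (or retention.dat) instead of SAMPLE.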


I have some broker modules to handle SQL logging and the distributed setup. Other parameters that
could be of interest:

command_check_interval=-1
service_inter_check_delay_method=s
max_concurrent_checks=80
check_result_reaper_frequency=2
max_check_result_reaper_time=30
obsess_over_services=0
obsess_over_hosts=0

Looking at the suggestions given by the program itself:

Nagios Core 3.2.3
Copyright (c) 2009-2010 Nagios Core Development Team and Community Contributors
Copyright (c) 1999-2009 Ethan Galstad
Last Modified: 10-03-2010
License: GPL

Website: http://www.nagios.org
Timing information on object configuration processing is listed
below.  You can use this information to see if precaching your
object configuration would be useful.

Object Config Source: Config files (uncached)

OBJECT CONFIG PROCESSING TIMES      (* = Potential for precache savings with -u option)
----------------------------------
Read:                 0.703470 sec
Resolve:              0.018964 sec  *
Recomb Contactgroups: 0.454370 sec  *
Recomb Hostgroups:    0.010414 sec  *
Dup Services:         0.025101 sec  *
Recomb Servicegroups: 0.000211 sec  *
Duplicate:            0.003912 sec  *
Inherit:              0.008386 sec  *
Recomb Contacts:      0.000000 sec  *
Sort:                 0.000003 sec  *
Register:             0.050582 sec
Free:                 0.006160 sec
                       ============
TOTAL:                1.281574 sec  * = 0.521362 sec (40.68%) estimated savings


RETENTION DATA TIMES
----------------------------------
Read and Process:     0.514352 sec
                       ============
TOTAL:                0.514352 sec


Timing information on configuration verification is listed below.

CONFIG VERIFICATION TIMES          (* = Potential for speedup with -x option)
----------------------------------
Object Relationships: 0.185991 sec
Circular Paths:       0.020317 sec  *
Misc:                 0.009450 sec
                       ============
TOTAL:                0.215758 sec  * = 0.020317 sec (9.4%) estimated savings


EVENT SCHEDULING TIMES
-------------------------------------
Get service info:        0.014388 sec
Get host info info:      0.002899 sec
Get service params:      0.000010 sec
Schedule service times:  0.000679 sec
Schedule service events: 0.000231 sec
Get host params:         0.000000 sec
Schedule host times:     0.000102 sec
Schedule host events:    0.000051 sec
                          ============
TOTAL:                   0.018360 sec


Projected scheduling information for host and service checks
is listed below.  This information assumes that you are going
to start running Nagios with your current config files.

HOST SCHEDULING INFORMATION
---------------------------
Total hosts:                     1912
Total scheduled hosts:           0
Host inter-check delay method:   SMART
Average host check interval:     0.00 sec
Host inter-check delay:          0.00 sec
Max host check spread:           15 min
First scheduled check:           N/A
Last scheduled check:            N/A


SERVICE SCHEDULING INFORMATION
-------------------------------
Total services:                     9098
Total scheduled services:           33
Service inter-check delay method:   SMART
Average service check interval:     1770.91 sec
Inter-check delay:                  9.09 sec
Interleave factor method:           SMART
Average services per host:          4.76
Service interleave factor:          1
Max service check spread:           5 min
First scheduled check:              Tue May 22 09:41:22 2012
Last scheduled check:               Tue May 22 09:46:12 2012
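
As a sanity check on the numbers above, the 9.09 sec inter-check delay can be reproduced with a short Python sketch. It assumes, as far as I understand the SMART method, that the delay is the average check interval divided by the number of scheduled services, capped so that all checks fit within the max check spread. The function name is made up for illustration.

```python
def smart_inter_check_delay(avg_check_interval_s: float,
                            scheduled_checks: int,
                            max_spread_min: float) -> float:
    """Delay in seconds between consecutive initial checks (assumed SMART logic)."""
    if scheduled_checks <= 0:
        return 0.0
    # Evenly space the checks across one average check interval ...
    delay = avg_check_interval_s / scheduled_checks
    # ... but never spread them over more than the max check spread.
    max_delay = (max_spread_min * 60.0) / scheduled_checks
    return min(delay, max_delay)


# Values taken from the scheduling output above.
delay = smart_inter_check_delay(1770.91, 33, 5)
print(f"inter-check delay: {delay:.2f} sec")      # 9.09
print(f"first-to-last span: {32 * delay:.0f} sec")  # 291
```

The uncapped value would be 1770.91 / 33, about 53.7 sec, so the 5 min spread is what sets the delay here. The resulting span of about 291 sec agrees with the first and last scheduled check times (09:41:22 to 09:46:12), so the initial schedule itself looks normal.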


CHECK PROCESSING INFORMATION
----------------------------
Check result reaper interval:       2 sec
Max concurrent service checks:      80


PERFORMANCE SUGGESTIONS
-----------------------
I have no suggestions - things look okay.



If I force the scheduling of an active check, I can see that the request is logged immediately in
nagios.log, but the check is executed with the same high delay.
Is there a way to debug this, or a parameter I should tune? Would increasing logging help?
I have already looked at the Nagios tuning page, but it doesn't help me much. Any suggestions based
on the information provided?
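
To put a number on the forced-check delay, a rough Python sketch like this could pull the forced service checks out of nagios.log and compare the requested time with the time the check actually ran. It assumes external commands are logged in the "[timestamp] EXTERNAL COMMAND: SCHEDULE_FORCED_SVC_CHECK;host;service;time" form; the names and the sample values are made up for illustration.

```python
import re
from typing import List, NamedTuple

# Assumed log line shape:
# [1337672700] EXTERNAL COMMAND: SCHEDULE_FORCED_SVC_CHECK;host;service;1337672700
FORCED = re.compile(
    r"^\[(\d+)\] EXTERNAL COMMAND: "
    r"SCHEDULE_FORCED_SVC_CHECK;([^;]+);([^;]+);(\d+)\s*$"
)


class ForcedCheck(NamedTuple):
    logged_at: int      # when Nagios logged the command (epoch seconds)
    host: str
    service: str
    requested_at: int   # check time requested in the command (epoch seconds)


def forced_checks(log_text: str) -> List[ForcedCheck]:
    """Collect all forced service check requests found in the log text."""
    found = []
    for line in log_text.splitlines():
        match = FORCED.match(line)
        if match:
            logged, host, service, requested = match.groups()
            found.append(ForcedCheck(int(logged), host, service, int(requested)))
    return found


def execution_delay(forced: ForcedCheck, last_check: int) -> int:
    """Seconds between the requested time and the actual check time."""
    return last_check - forced.requested_at


# Hypothetical sample: one forced check, executed 6163 seconds late.
SAMPLE_LOG = (
    "[1337672700] EXTERNAL COMMAND: "
    "SCHEDULE_FORCED_SVC_CHECK;master;Load;1337672700\n"
)
for check in forced_checks(SAMPLE_LOG):
    print(check.host, check.service, execution_delay(check, 1337678863))
```

The last_check argument would come from the matching service entry in status.dat after the check has finally run.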

Thanks a lot!

Simon








