High Service Check Latency

Yu Watanabe yu.watanabe at jp.fujitsu.com
Tue May 22 10:05:40 CEST 2012


Hello.

In my case, the Nagios latency was caused by Java.
A little Java tuning fixed it.

# Java and Nagios were otherwise completely unrelated to each other.

Thanks,
Yu

>Hello!
>
>Yes, it's a common problem, but I cannot figure out how to debug it.
>I have a distributed setup with a master server collecting more than 9,000 passive services sent
>from other servers, all with active latencies near 0. The master server runs active checks *only*
>against itself: ~40 services, most of them every 5 minutes. AFAIK passive services should not affect
>the "active service check latency" statistics. Looking into the retention.dat file, the high
>latencies all belong to the locally executed active services. Current stats:
>
>Nagios Stats 3.2.3
>Copyright (c) 2003-2008 Ethan Galstad (www.nagios.org)
>Last Modified: 10-03-2010
>License: GPL
>
>CURRENT STATUS DATA
>------------------------------------------------------
>Status File:                            /usr/local/nagios/var/status.dat
>Status File Age:                        0d 0h 0m 7s
>Status File Version:                    3.2.3
>
>Program Running Time:                   0d 20h 40m 53s
>Nagios PID:                             9360
>Used/High/Total Command Buffers:        0 / 7 / 10000
>
>Total Services:                         9098
>Services Checked:                       9098
>Services Scheduled:                     33
>Services Actively Checked:              39
>Services Passively Checked:             9059
>Total Service State Change:             0.000 / 100.000 / 1.351 %
>Active Service Latency:                 4.156 / 7943.743 / 6163.392 sec   <<<<<<<<
>Active Service Execution Time:          0.010 / 2.485 / 0.319 sec
>Active Service State Change:            0.000 / 22.890 / 2.443 %
>Active Services Last 1/5/15/60 min:     0 / 0 / 0 / 0
>Passive Service Latency:                0.088 / 7.914 / 1.997 sec
>Passive Service State Change:           0.000 / 100.000 / 1.346 %
>Passive Services Last 1/5/15/60 min:    1851 / 7501 / 8084 / 8392
>Services Ok/Warn/Unk/Crit:              8784 / 78 / 76 / 160
>Services Flapping:                      4
>Services In Downtime:                   112
>
>Total Hosts:                            1912
>Hosts Checked:                          1912
>Hosts Scheduled:                        0
>Hosts Actively Checked:                 74
>Host Passively Checked:                 1838
>Total Host State Change:                0.000 / 46.910 / 0.135 %
>Active Host Latency:                    0.000 / 1425.848 / 1104.205 sec
>Active Host Execution Time:             0.012 / 0.402 / 0.096 sec
>Active Host State Change:               0.000 / 0.000 / 0.000 %
>Active Hosts Last 1/5/15/60 min:        0 / 0 / 0 / 0
>Passive Host Latency:                   0.000 / 639.353 / 1.197 sec
>Passive Host State Change:              0.000 / 46.910 / 0.140 %
>Passive Hosts Last 1/5/15/60 min:       1 / 12 / 27 / 70
>Hosts Up/Down/Unreach:                  1850 / 57 / 5
>Hosts Flapping:                         0
>Hosts In Downtime:                      35
>
>Active Host Checks Last 1/5/15 min:     42 / 194 / 565
>    Scheduled:                           0 / 0 / 0
>    On-demand:                           42 / 194 / 565
>    Parallel:                            0 / 0 / 0
>    Serial:                              0 / 0 / 0
>    Cached:                              42 / 194 / 565
>Passive Host Checks Last 1/5/15 min:    1 / 14 / 45
>Active Service Checks Last 1/5/15 min:  0 / 0 / 0
>    Scheduled:                           0 / 0 / 0
>    On-demand:                           0 / 0 / 0
>    Cached:                              0 / 0 / 0
>Passive Service Checks Last 1/5/15 min: 2311 / 9235 / 12988
>
>External Commands Last 1/5/15 min:      0 / 1 / 1
>
>
>I have some broker modules that handle SQL logging and the distributed setup. Other parameters that
>could be relevant:
>
>command_check_interval=-1
>service_inter_check_delay_method=s
>max_concurrent_checks=80
>check_result_reaper_frequency=2
>max_check_result_reaper_time=30
>obsess_over_services=0
>obsess_over_hosts=0
>
>Looking at the suggestions from the proc:
>
>Nagios Core 3.2.3
>Copyright (c) 2009-2010 Nagios Core Development Team and Community Contributors
>Copyright (c) 1999-2009 Ethan Galstad
>Last Modified: 10-03-2010
>License: GPL
>
>Website: http://www.nagios.org
>Timing information on object configuration processing is listed
>below.  You can use this information to see if precaching your
>object configuration would be useful.
>
>Object Config Source: Config files (uncached)
>
>OBJECT CONFIG PROCESSING TIMES      (* = Potential for precache savings with -u option)
>----------------------------------
>Read:                 0.703470 sec
>Resolve:              0.018964 sec  *
>Recomb Contactgroups: 0.454370 sec  *
>Recomb Hostgroups:    0.010414 sec  *
>Dup Services:         0.025101 sec  *
>Recomb Servicegroups: 0.000211 sec  *
>Duplicate:            0.003912 sec  *
>Inherit:              0.008386 sec  *
>Recomb Contacts:      0.000000 sec  *
>Sort:                 0.000003 sec  *
>Register:             0.050582 sec
>Free:                 0.006160 sec
>                       ============
>TOTAL:                1.281574 sec  * = 0.521362 sec (40.68%) estimated savings
>
>
>RETENTION DATA TIMES
>----------------------------------
>Read and Process:     0.514352 sec
>                       ============
>TOTAL:                0.514352 sec
>
>
>Timing information on configuration verification is listed below.
>
>CONFIG VERIFICATION TIMES          (* = Potential for speedup with -x option)
>----------------------------------
>Object Relationships: 0.185991 sec
>Circular Paths:       0.020317 sec  *
>Misc:                 0.009450 sec
>                       ============
>TOTAL:                0.215758 sec  * = 0.020317 sec (9.4%) estimated savings
>
>
>EVENT SCHEDULING TIMES
>-------------------------------------
>Get service info:        0.014388 sec
>Get host info info:      0.002899 sec
>Get service params:      0.000010 sec
>Schedule service times:  0.000679 sec
>Schedule service events: 0.000231 sec
>Get host params:         0.000000 sec
>Schedule host times:     0.000102 sec
>Schedule host events:    0.000051 sec
>                          ============
>TOTAL:                   0.018360 sec
>
>
>Projected scheduling information for host and service checks
>is listed below.  This information assumes that you are going
>to start running Nagios with your current config files.
>
>HOST SCHEDULING INFORMATION
>---------------------------
>Total hosts:                     1912
>Total scheduled hosts:           0
>Host inter-check delay method:   SMART
>Average host check interval:     0.00 sec
>Host inter-check delay:          0.00 sec
>Max host check spread:           15 min
>First scheduled check:           N/A
>Last scheduled check:            N/A
>
>
>SERVICE SCHEDULING INFORMATION
>-------------------------------
>Total services:                     9098
>Total scheduled services:           33
>Service inter-check delay method:   SMART
>Average service check interval:     1770.91 sec
>Inter-check delay:                  9.09 sec
>Interleave factor method:           SMART
>Average services per host:          4.76
>Service interleave factor:          1
>Max service check spread:           5 min
>First scheduled check:              Tue May 22 09:41:22 2012
>Last scheduled check:               Tue May 22 09:46:12 2012
>
>
>CHECK PROCESSING INFORMATION
>----------------------------
>Check result reaper interval:       2 sec
>Max concurrent service checks:      80
>
>
>PERFORMANCE SUGGESTIONS
>-----------------------
>I have no suggestions - things look okay.
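The numbers in the service scheduling information above are consistent with each other, which suggests the scheduler's plan itself is sane and the delay arises later. A minimal sketch of the arithmetic, assuming the SMART method divides the average check interval by the number of scheduled services and then caps the result so that all checks fit inside the max check spread (the function name is made up for this illustration):

```python
def smart_inter_check_delay(avg_interval_s: float, scheduled: int, max_spread_min: float) -> float:
    """Seconds between consecutive scheduled checks under the assumed SMART rule."""
    if scheduled <= 0:
        return 0.0
    delay = avg_interval_s / scheduled            # spread checks over one average interval
    max_delay = max_spread_min * 60.0 / scheduled  # cap imposed by the max check spread
    return min(delay, max_delay)


if __name__ == "__main__":
    # Values taken from the scheduling output above.
    delay = smart_inter_check_delay(avg_interval_s=1770.91, scheduled=33, max_spread_min=5)
    print(f"inter-check delay: {delay:.2f} s")

    # 33 checks are separated by 32 gaps.
    print(f"first-to-last spread: {32 * delay:.1f} s")
```

With 33 scheduled services and a 5 minute spread, the cap applies (300 / 33 is about 9.09 s), matching the reported inter-check delay. The resulting spread of roughly 291 s is also close to the 4 min 50 s between the first and last scheduled checks (09:41:22 to 09:46:12).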
>
>
>
>If I force a schedule of an active check, I can see that the request is logged immediately in
>nagios.log, but the check itself is executed only after the high delay.
>Is there a way to debug this, or a parameter I should tune? Would increasing logging help?
>I've already looked at the Nagios tuning page, but it doesn't help me much. Any suggestions based on
>the information provided?
>
>Thanks a lot!
>
>Simon
>
>------------------------------------------------------------------------------
>Live Security Virtual Conference
>Exclusive live event will cover all the ways today's security and 
>threat landscape has changed and how IT managers can respond. Discussions 
>will include endpoint security, mobile security and the latest in malware 
>threats. http://www.accelacomm.com/jaw/sfrnl04242012/114/50122263/
>_______________________________________________
>Nagios-users mailing list
>Nagios-users at lists.sourceforge.net
>https://lists.sourceforge.net/lists/listinfo/nagios-users
>::: Please include Nagios version, plugin version (-v) and OS when reporting any issue. 
>::: Messages without supporting info will risk being sent to /dev/null
>

