Scheduled checks falling far behind

Litwin, Matthew mlitwin at stubhub.com
Sun Oct 24 22:14:58 CEST 2010


Hi Mathieu (and anyone else who might want to throw their hat into the ring):

So after identifying that my latency is around 500-600 seconds, I tried the tuning tips from the Nagios docs. After each restart latency drops briefly, then climbs back up to the same high level. So far I have only adjusted check_result_reaper_frequency and max_check_result_reaper_time, doubling and halving them from their default values. max_concurrent_checks remains at 0. Load on the server is very low: it is an 8-core machine and load averages a measly 1.5, so I really wish I could make better use of it.

Finally, I tried enable_environment_macros=0, which actually made things worse once they quiesced after startup. use_large_installation_tweaks=1 did improve latency by maybe 30%, and RRD data came in solid for about 15 minutes, but then it went back to being sparse. So it is a modest improvement, but the RRD data still has too many gaps to be useful.

Any other tuning suggestions? I think I have done everything in the performance tweaks section that seems relevant, including all of those that have been suggested here.
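For anyone comparing settings, here is a minimal sketch that lists the tuning options discussed in this thread as they appear in a nagios.cfg. The helper name read_tuning_options is hypothetical, and the parser only assumes the plain "key=value" lines (with optional spaces around "=") and "#" comments seen in a stock config.

```python
# Hypothetical helper (not part of Nagios): pull the performance-related
# options mentioned in this thread out of the text of a nagios.cfg file.
TUNING_OPTIONS = (
    "check_result_reaper_frequency",
    "max_check_result_reaper_time",
    "max_concurrent_checks",
    "enable_environment_macros",
    "use_large_installation_tweaks",
    "max_host_check_spread",
    "max_service_check_spread",
)


def read_tuning_options(cfg_text):
    """Return {option: value} for the tuning options found in cfg_text."""
    found = {}
    for raw in cfg_text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip()
        if key in TUNING_OPTIONS:
            found[key] = value.strip()
    return found


sample = """
# performance settings
enable_environment_macros = 0
use_large_installation_tweaks=1
max_concurrent_checks=0
"""
for name, value in sorted(read_tuning_options(sample).items()):
    print(f"{name}={value}")
```

Against a real install, the same function can be fed the contents of the config file (here /usr/local/nagios/etc/nagios.cfg). Options missing from the file are simply absent from the result, which means Nagios is using its built-in default for them.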

In summary, I am looking for some way to make Nagios "do more" with the system resources, as the host is barely working at all. I really wish there were some way to make Nagios do more in parallel when a system has plenty of horsepower and RAM. If I have to resort to compiling with different settings I am open to trying it, but I feel like I am grasping at straws now.

Here is a typical nagiostats output:

srwp01mon001:bin$ date; nagiostats
Sun Oct 24 17:22:41 UTC 2010

Nagios Stats 3.2.1
Copyright (c) 2003-2008 Ethan Galstad (www.nagios.org)
Last Modified: 03-09-2010
License: GPL

CURRENT STATUS DATA
------------------------------------------------------
Status File:                            /usr/local/nagios/var/status.dat
Status File Age:                        0d 0h 0m 16s
Status File Version:                    3.2.1

Program Running Time:                   0d 0h 21m 54s
Nagios PID:                             9792
Used/High/Total Command Buffers:        0 / 0 / 4096

Total Services:                         4987
Services Checked:                       4987
Services Scheduled:                     4970
Services Actively Checked:              4987
Services Passively Checked:             0
Total Service State Change:             0.000 / 15.990 / 0.006 %
Active Service Latency:                 0.236 / 683.782 / 536.494 sec
Active Service Execution Time:          0.013 / 11.525 / 0.378 sec
Active Service State Change:            0.000 / 15.990 / 0.006 %
Active Services Last 1/5/15/60 min:     0 / 1565 / 4970 / 4970
Passive Service Latency:                0.000 / 0.000 / 0.000 sec
Passive Service State Change:           0.000 / 0.000 / 0.000 %
Passive Services Last 1/5/15/60 min:    0 / 0 / 0 / 0
Services Ok/Warn/Unk/Crit:              4972 / 10 / 1 / 4
Services Flapping:                      0
Services In Downtime:                   0

Total Hosts:                            241
Hosts Checked:                          241
Hosts Scheduled:                        241
Hosts Actively Checked:                 241
Host Passively Checked:                 0
Total Host State Change:                0.000 / 0.000 / 0.000 %
Active Host Latency:                    362.793 / 679.309 / 523.157 sec
Active Host Execution Time:             0.172 / 4.065 / 3.780 sec
Active Host State Change:               0.000 / 0.000 / 0.000 %
Active Hosts Last 1/5/15/60 min:        0 / 97 / 241 / 241
Passive Host Latency:                   0.000 / 0.000 / 0.000 sec
Passive Host State Change:              0.000 / 0.000 / 0.000 %
Passive Hosts Last 1/5/15/60 min:       0 / 0 / 0 / 0
Hosts Up/Down/Unreach:                  241 / 0 / 0
Hosts Flapping:                         0
Hosts In Downtime:                      0

Active Host Checks Last 1/5/15 min:     22 / 100 / 257
   Scheduled:                           22 / 97 / 242
   On-demand:                           0 / 3 / 15
   Parallel:                            22 / 97 / 242
   Serial:                              0 / 0 / 0
   Cached:                              0 / 3 / 15
Passive Host Checks Last 1/5/15 min:    0 / 0 / 0
Active Service Checks Last 1/5/15 min:  262 / 1779 / 5436
   Scheduled:                           262 / 1779 / 5436
   On-demand:                           0 / 0 / 0
   Cached:                              0 / 0 / 0
Passive Service Checks Last 1/5/15 min: 0 / 0 / 0

External Commands Last 1/5/15 min:      0 / 0 / 0
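To put the latency figures in proportion, the "min / max / avg sec" lines in the output above can be pulled out and compared. This is only a sketch that parses the human-readable text as pasted here; the function names are made up for illustration and nothing below is a nagiostats feature.

```python
import re

# Matches lines such as
# "Active Service Latency:                 0.236 / 683.782 / 536.494 sec"
TRIPLET = re.compile(
    r"^(?P<label>[^:]+):\s+"
    r"(?P<min>[\d.]+) / (?P<max>[\d.]+) / (?P<avg>[\d.]+) sec\s*$"
)


def parse_timings(text):
    """Return {label: (min, max, avg)} for every 'min / max / avg sec' line."""
    timings = {}
    for line in text.splitlines():
        m = TRIPLET.match(line.strip())
        if m:
            timings[m.group("label")] = (
                float(m.group("min")),
                float(m.group("max")),
                float(m.group("avg")),
            )
    return timings


def latency_to_execution_ratio(timings, kind="Service"):
    """Average latency divided by average execution time for active checks."""
    latency_avg = timings[f"Active {kind} Latency"][2]
    execution_avg = timings[f"Active {kind} Execution Time"][2]
    return latency_avg / execution_avg


sample = """
Active Service Latency:                 0.236 / 683.782 / 536.494 sec
Active Service Execution Time:          0.013 / 11.525 / 0.378 sec
"""
print(round(latency_to_execution_ratio(parse_timings(sample))))
```

With the service figures above, the average check waits on the order of 1400 times longer to be started than it takes to run, which suggests the plugins themselves are not the bottleneck.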

Here is nagios -s:

# /usr/local/nagios/bin/nagios -s /usr/local/nagios/etc/nagios.cfg

Nagios Core 3.2.1
Copyright (c) 2009-2010 Nagios Core Development Team and Community Contributors
Copyright (c) 1999-2009 Ethan Galstad
Last Modified: 03-09-2010
License: GPL

Website: http://www.nagios.org
Timing information on object configuration processing is listed
below.  You can use this information to see if precaching your
object configuration would be useful.

Object Config Source: Config files (uncached)

OBJECT CONFIG PROCESSING TIMES      (* = Potential for precache savings with -u option)
----------------------------------
Read:                 0.008987 sec
Resolve:              0.000533 sec  *
Recomb Contactgroups: 0.000075 sec  *
Recomb Hostgroups:    0.003513 sec  *
Dup Services:         0.025789 sec  *
Recomb Servicegroups: 0.048340 sec  *
Duplicate:            0.037513 sec  *
Inherit:              0.003420 sec  *
Recomb Contacts:      0.000000 sec  *
Sort:                 0.000000 sec  *
Register:             0.038780 sec
Free:                 0.003135 sec
                      ============
TOTAL:                0.170086 sec  * = 0.119184 sec (70.07%) estimated savings


RETENTION DATA TIMES
----------------------------------
Read and Process:     0.352939 sec
                      ============
TOTAL:                0.352939 sec


Timing information on configuration verification is listed below.

CONFIG VERIFICATION TIMES          (* = Potential for speedup with -x option)
----------------------------------
Object Relationships: 0.063209 sec
Circular Paths:       5.735947 sec  *
Misc:                 0.003824 sec
                      ============
TOTAL:                5.802980 sec  * = 5.735947 sec (98.8%) estimated savings


EVENT SCHEDULING TIMES
-------------------------------------
Get service info:        0.007308 sec
Get host info info:      0.000356 sec
Get service params:      0.000011 sec
Schedule service times:  0.016611 sec
Schedule service events: 0.053224 sec
Get host params:         0.000002 sec
Schedule host times:     0.000752 sec
Schedule host events:    0.009029 sec
                         ============
TOTAL:                   0.087293 sec


Projected scheduling information for host and service checks
is listed below.  This information assumes that you are going
to start running Nagios with your current config files.

HOST SCHEDULING INFORMATION
---------------------------
Total hosts:                     241
Total scheduled hosts:           241
Host inter-check delay method:   SMART
Average host check interval:     199.92 sec
Host inter-check delay:          0.83 sec
Max host check spread:           30 min
First scheduled check:           Sun Oct 24 17:26:17 2010
Last scheduled check:            Sun Oct 24 17:28:46 2010


SERVICE SCHEDULING INFORMATION
-------------------------------
Total services:                     4987
Total scheduled services:           4970
Service inter-check delay method:   SMART
Average service check interval:     179.98 sec
Inter-check delay:                  0.04 sec
Interleave factor method:           SMART
Average services per host:          20.69
Service interleave factor:          21
Max service check spread:           30 min
First scheduled check:              Sun Oct 24 17:26:25 2010
Last scheduled check:               Sun Oct 24 17:29:24 2010
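The inter-check delays reported above are consistent with the SMART method as the Nagios scheduling documentation describes it: the average check interval divided by the number of scheduled checks, capped so that the initial spread does not exceed the max check spread. The sketch below is an approximation based on that description, not the actual Nagios code, and the function names are invented for illustration.

```python
def smart_inter_check_delay(avg_check_interval_sec, total_scheduled, max_spread_min):
    """Approximate SMART inter-check delay in seconds.

    The delay is the average check interval divided by the number of
    scheduled checks, but never more than what the max check spread
    (given in minutes) allows.
    """
    delay = avg_check_interval_sec / total_scheduled
    max_delay = (max_spread_min * 60.0) / total_scheduled
    return min(delay, max_delay)


def initial_spread_sec(delay, total_scheduled):
    """Time needed to schedule the first round of checks after startup."""
    return delay * total_scheduled


# Figures taken from the nagios -s output above.
svc_delay = smart_inter_check_delay(179.98, 4970, 30)
host_delay = smart_inter_check_delay(199.92, 241, 30)
print(f"service delay: {svc_delay:.2f} sec")
print(f"host delay:    {host_delay:.2f} sec")
print(f"service spread: {initial_spread_sec(svc_delay, 4970):.0f} sec")
```

Under this reading, the 30 minute cap is not the limiting factor here: the first round of service checks is spread over about 180 seconds, which agrees with the first and last scheduled check times shown above. If that is right, changing max_service_check_spread would only matter for the initial schedule after a restart, not for steady-state latency.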


CHECK PROCESSING INFORMATION
----------------------------
Check result reaper interval:       30 sec
Max concurrent service checks:      Unlimited


PERFORMANCE SUGGESTIONS
-----------------------
I have no suggestions - things look okay.

Well, I hate to say it, but I think not!
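A back-of-the-envelope comparison of the two outputs above makes the gap concrete. It assumes the average service check interval from nagios -s applies evenly to all scheduled services, and that the 15 minute counter in nagiostats reflects steady-state behavior; both are simplifications.

```python
def required_rate(total_scheduled, avg_interval_sec):
    """Checks per second needed to keep every check on its interval."""
    return total_scheduled / avg_interval_sec


def observed_rate(checks_in_window, window_min):
    """Checks per second actually completed over a nagiostats window."""
    return checks_in_window / (window_min * 60.0)


# 4970 scheduled services at an average interval of 179.98 sec (nagios -s),
# versus 5436 active service checks in the last 15 minutes (nagiostats).
needed = required_rate(4970, 179.98)
actual = observed_rate(5436, 15)

print(f"needed: {needed:.1f} checks/sec")
print(f"actual: {actual:.1f} checks/sec")
print(f"actual/needed: {actual / needed:.0%}")
```

By this estimate Nagios is running a bit over a fifth of the service checks it would need to stay on schedule, so latency has nowhere to go but up until the backlog settles at several hundred seconds.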

On Oct 24, 2010, at 10:58 AM, Mathieu Gagné wrote:

> On 2010-10-24 03:54, Litwin, Matthew wrote:
>> You hit the nail on the head. Changing MaxBytes to a very large number made latency totally dwarf execution time.
>>
>> So now what do I do?
>
> Try disabling environment variables in nagios.cfg:
> enable_environment_macros = 0

This didn't help at all, and may have made latency increase!

> Our latency dropped from 20 minutes to 10 seconds after this change.
>
> This guy had a similar issue back then:
> http://marc.info/?l=nagios-devel&m=120393376922635
>
> You should also try to enable large installation tweaks:
> use_large_installation_tweaks=1
>
> Documentation here:
> http://nagios.sourceforge.net/docs/3_0/largeinstalltweaks.html
>
> And adjust those configurations based on your installation:
> check_result_reaper_frequency
> max_concurrent_checks

As I mentioned, I have tried all sorts of permutations of this to no real effect. I have max_concurrent_checks=0 (no limit), which is the default.

> max_host_check_spread
> max_service_check_spread

What does this do exactly that might affect latency? This seems only relevant to behavior after Nagios starts up, correct?

> --
> Mathieu

Thanks again for your and everyone else's advice up to this point,

