Performance issues, too

Andreas Ericsson ae at op5.se
Tue Dec 19 12:15:03 CET 2006


Thanks for an excellently detailed problem report, missing only the 
Nagios version and system type/version info. I've got some comments and 
followup questions. See below.

Tobias Klausmann wrote:
> Hi! 
> 
> Recently I have run into the very same performance issues 
> as Daniel Meyer (or so it seems). However, I'm not quite sure
> about it. Here's the gist of it.
> 
> Currently, service check latency slowly creeps up. As it is now,
> it starts out at a little over 1s and after about 12 hours it's
> in the area of about 90s. It keeps climbing after that. 
> 
> Here's the output of nagios -s:
> 
> HOST SCHEDULING INFORMATION
> ---------------------------
> Total hosts:                     330
> Total scheduled hosts:           0

No scheduled host checks. That's good, because they interfere with normal 
operations in Nagios.

> Host inter-check delay method:   SMART
> Average host check interval:     0.00 sec
> Host inter-check delay:          0.00 sec
> Max host check spread:           10 min
> First scheduled check:           N/A
> Last scheduled check:            N/A
> 
> 
> SERVICE SCHEDULING INFORMATION
> -------------------------------
> Total services:                     2836
> Total scheduled services:           2836
> Service inter-check delay method:   SMART
> Average service check interval:     2225.56 sec


This is, as you point out below, quite odd. What's your _longest_ 
normal_check_interval for services?


> Inter-check delay:                  0.21 sec
> Interleave factor method:           SMART
> Average services per host:          8.59
> Service interleave factor:          9
> Max service check spread:           10 min
> First scheduled check:              Tue Dec 19 11:21:45 2006
> Last scheduled check:               Tue Dec 19 11:31:47 2006
> 
> 
> CHECK PROCESSING INFORMATION
> ----------------------------
> Service check reaper interval:      5 sec

You could lower this to 2 seconds. I've done so on any number of 
installations and it has no negative impact whatsoever, but seems to 
make Nagios a bit more responsive.

> Max concurrent service checks:      Unlimited
> 

I assume you aren't running into hardware limits on this machine. 
What's the normal load when you're running Nagios? If it's > NUM_CPUS 
then you most likely don't have beefy enough hardware. That's hardly 
ever the case though, so don't bother looking into it unless all else fails.

Nvm, question answered below. Hardware resources should be no problem 
whatsoever.
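
The load > NUM_CPUS rule of thumb above can be checked with a few lines 
of Python. This is only a minimal sketch: the function names are made up 
for illustration, os.getloadavg() is Unix-only, and os.cpu_count() counts 
logical CPUs, so hyperthreading will inflate the number.

```python
import os


def load_exceeds_cpus(load_1min, num_cpus):
    """Return True when the 1-minute load average is above the CPU count."""
    return load_1min > num_cpus


def check_this_machine():
    """Apply the rule of thumb to the machine this script runs on."""
    # os.getloadavg() returns the 1, 5 and 15 minute load averages (Unix only).
    load_1min, _, _ = os.getloadavg()
    # os.cpu_count() may return None; fall back to 1 in that case.
    cpus = os.cpu_count() or 1
    return load_1min, cpus, load_exceeds_cpus(load_1min, cpus)
```

With the figures quoted further down (load around 1.6 to 1.9 on a dual-CPU 
box), load_exceeds_cpus(1.9, 2) is False, which agrees with the conclusion 
that hardware is not the bottleneck.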

> 
> This all looks peachy - I think. What I don't get is this line:
> 
> Average service check interval:     2225.56 sec
> 
> It seems to me that this is either a skewed value, stemming from
> my history of looong latencies (at one point we were beyond
> 9000 seconds).

Nopes. Nagios doesn't bother reading logfiles when it calculates the 
scheduling numbers.

> *Or* it is indicative of a misconfiguration on my
> part. If the latter is the case, I'd be eager, nay ecstatic to
> hear what I did wrong. Here are a few of the config vars that
> might influence this:
> 

There has been a slight thinko in Nagios. I don't know if it's still 
there in recent CVS versions. The thinko is that it (used to?) calculate 
average service check interval by adding up all normal_check_interval 
values and dividing the sum by the number of services configured (or 
something along those lines), which leads to long latencies. This 
normally didn't make those latencies increase over time though. Humm...
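
To illustrate how a plain average behaves, here is a rough sketch in 
Python. It is not taken from the Nagios source. The function names are 
hypothetical, and the capping rule (inter-check delay = average interval / 
scheduled services, limited by max check spread / scheduled services) is an 
assumption on my part. The mix of intervals in the usage note is invented 
purely for illustration.

```python
def plain_average_interval(intervals_sec):
    """Sum of all check intervals divided by the number of services."""
    return sum(intervals_sec) / len(intervals_sec)


def inter_check_delay(avg_interval_sec, scheduled, max_spread_min):
    """Assumed rule: average interval spread over all scheduled services,
    but never more than the max check spread allows."""
    uncapped = avg_interval_sec / scheduled
    cap = (max_spread_min * 60.0) / scheduled
    return min(uncapped, cap)
```

For example, 99 services checked every 300 seconds plus a single one 
checked once a day (86400 seconds) already give 
plain_average_interval([300] * 99 + [86400]) == 1161.0, so a handful of 
long intervals is enough to produce an odd-looking average. With the 
numbers from the report, inter_check_delay(2225.56, 2836, 10) comes out at 
roughly 0.21, which matches the reported inter-check delay and equals 
600 / 2836, i.e. the value the 10 minute spread alone would give.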


> sleep_time=0.25
> service_reaper_frequency=5
> max_concurrent_checks=0
> max_host_check_spread=10
> host_inter_check_delay_method=s
> service_interleave_factor=s
> command_check_interval=1
> obsess_over_services=0
> aggregate_status_updates=1
> status_update_interval=20
> 
> Also, here's the output from nagiostats:
> Nagios Stats 2.6
> Copyright (c) 2003-2005 Ethan Galstad (www.nagios.org)
> Last Modified: 11-27-2006
> License: GPL
> 
> CURRENT STATUS DATA
> ----------------------------------------------------
> Status File:                          /var/nagios/status.dat
> Status File Age:                      0d 0h 0m 3s
> Status File Version:                  2.6
> 
> Program Running Time:                 0d 1h 59m 5s
> 
> Total Services:                       2836
> Services Checked:                     2836
> Services Scheduled:                   2758
> Active Service Checks:                2836
> Passive Service Checks:               0


Not all services are being scheduled, but you have no passive service 
checks. Have you disabled checks of 78 services?
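
The 78 comes from Total Services minus Services Scheduled (2836 - 2758). 
A small sketch that pulls the same figure out of pasted nagiostats output 
follows. It assumes the "Label:   value" layout shown in this mail and 
nothing else about nagiostats. The function name is hypothetical.

```python
def unscheduled_services(nagiostats_text):
    """Return Total Services minus Services Scheduled from nagiostats text."""
    fields = {}
    for line in nagiostats_text.splitlines():
        # Drop any leading "> " quoting left over from a mail reply.
        line = line.lstrip("> ").strip()
        if ":" in line:
            # Split on the first colon only; values may contain colons.
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return int(fields["Total Services"]) - int(fields["Services Scheduled"])
```

Fed the two relevant lines from the output above, it returns 78.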


> Total Service State Change:           0.000 / 12.370 / 0.007 %
> Active Service Latency:               0.006 / 10.237 / 0.906 sec
> Active Service Execution Time:        0.047 / 10.159 / 0.180 sec
> Active Service State Change:          0.000 / 12.370 / 0.007 %
> Active Services Last 1/5/15/60 min:   477 / 2678 / 2745 / 2754
> Passive Service State Change:         0.000 / 0.000 / 0.000 %
> Passive Services Last 1/5/15/60 min:  0 / 0 / 0 / 0
> Services Ok/Warn/Unk/Crit:            2814 / 6 / 0 / 16
> Services Flapping:                    0
> Services In Downtime:                 0
> 
> Total Hosts:                          330
> Hosts Checked:                        330
> Hosts Scheduled:                      0
> Active Host Checks:                   330
> Passive Host Checks:                  0
> Total Host State Change:              0.000 / 0.000 / 0.000 %
> Active Host Latency:                  0.000 / 1.000 / 0.888 sec
> Active Host Execution Time:           0.030 / 4.059 / 0.112 sec
> Active Host State Change:             0.000 / 0.000 / 0.000 %
> Active Hosts Last 1/5/15/60 min:      0 / 12 / 12 / 12
> Passive Host State Change:            0.000 / 0.000 / 0.000 %
> Passive Hosts Last 1/5/15/60 min:     0 / 0 / 0 / 0
> Hosts Up/Down/Unreach:                329 / 1 / 0
> Hosts Flapping:                       0
> Hosts In Downtime:                    0
> 
> Hardware is a dual-2.8GHz Xeon, 2G RAM and a 100 FDX interface.
> LoadAvg is around 1.6, sometimes gets to 1.9. CPUs are both
> around 40% idle most of the time. I see about 300 context
> switches and 500 interrupts per second. The network load is
> negligible, ditto the packet rate.
> 
> The way these figures look I don't see a performance problem per
> se, but maybe I have overlooked a metric that describes the
> "usual" bottleneck of installations.
> 

Are the CPUs 64-bit ones running in 32-bit emulation mode? For Intel 
CPUs, that causes up to 60% performance loss (yes, it really is that bad).
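
One way to answer the question above on Linux is to compare the CPU flags 
with what the kernel reports. This is a sketch only: it assumes an x86 
machine, where the "lm" (long mode) flag in /proc/cpuinfo marks a 
64-bit capable CPU, and it treats an i386 to i686 machine string as a 
32-bit kernel. The function names are made up for illustration.

```python
import platform


def cpu_supports_64bit(cpuinfo_text):
    """Return True if the first 'flags' line lists 'lm' (x86 long mode)."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            flags = line.partition(":")[2].split()
            return "lm" in flags
    return False


def runs_32bit_on_64bit_cpu(cpuinfo_text, machine):
    """True when a 64-bit capable CPU is driven by a 32-bit x86 kernel."""
    return cpu_supports_64bit(cpuinfo_text) and machine in {
        "i386", "i486", "i586", "i686"
    }


def check_this_machine(path="/proc/cpuinfo"):
    """Read the local cpuinfo file and apply the check (Linux only)."""
    with open(path) as handle:
        return runs_32bit_on_64bit_cpu(handle.read(), platform.machine())
```

It only tells you which mode the kernel runs in. It says nothing about how 
much performance that costs.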

I'm puzzled. Please let me know if you find the answer to this problem. 
I'll help you debug it as best I can, but please continue posting 
on-list. Thanks.

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231
