Performance issues, too

Tobias Klausmann klausman at schwarzvogel.de
Tue Dec 19 12:42:43 CET 2006


Hi! 

On Tue, 19 Dec 2006, Andreas Ericsson wrote:
> Thanks for an excellently detailed problem report, missing only the 
> Nagios version and system type/version info. I've got some comments and 
> followup questions. See below.

I'm running Nagios 2.6 now, but the trouble started with 2.5.
The OS is Gentoo Linux, kernel 2.6.15.5 initially, upgraded to
2.6.19 today.

> > ---------------------------
> > Total hosts:                     330
> > Total scheduled hosts:           0
> 
> No scheduled host-checks. That's good, cause they interfere with normal 
> operations in Nagios.

I've read as much. In my separate mail I had a few questions
about it, let's keep them (and the answers) there ;)

> > Host inter-check delay method:   SMART
> > Average host check interval:     0.00 sec
> > Host inter-check delay:          0.00 sec
> > Max host check spread:           10 min
> > First scheduled check:           N/A
> > Last scheduled check:            N/A
> > 
> > 
> > SERVICE SCHEDULING INFORMATION
> > -------------------------------
> > Total services:                     2836
> > Total scheduled services:           2836
> > Service inter-check delay method:   SMART
> > Average service check interval:     2225.56 sec
> 
> This is, as you point out below, quite odd. What's your _longest_ 
> normal_check_interval for services?

The longest check_interval is 86400 seconds. It's an SSL cert
freshness check. I figured it wasn't necessary to check that
more often than once a day. I also have check_intervals of 3, 5,
15, 20, 30 and 1440 seconds. The latter is also a cert freshness
check, which is lower because the customer wanted it to be that
short.

> > CHECK PROCESSING INFORMATION
> > ----------------------------
> > Service check reaper interval:      5 sec
> 
> You could lower this to 2 seconds. I've done so on any number of 
> installations and it has no negative impact whatsoever, but seems to 
> make Nagios a bit more responsive.

I'll give that a try.

> > Max concurrent service checks:      Unlimited
> 
> I assume you aren't running into hardware limits on this machine. 
> What's the normal load when you're running Nagios? If it's > NUM_CPUS 
> then you most likely don't have beefy enough hardware. That's hardly 
> ever the case though, so don't bother looking into it unless all else fails.
> 
> Nvm, question answered below. Hardware resources should be no problem 
> whatsoever.

I also noticed that HT was disabled on the machine. I've changed
that (and added support for it to the kernel) when I did the
kernel upgrade today. I'll keep an eye on check latency.
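
For reference, the load > NUM_CPUS rule of thumb quoted above can be
sketched in a few lines of Python. This is only an illustration: the
function name load_exceeds_cpus is made up here, and the script assumes
a Unix-like system where os.getloadavg() is available. Note that with
HT enabled, os.cpu_count() reports logical CPUs, so the threshold
doubles on this box.

```python
import os


def load_exceeds_cpus(load_avg: float, num_cpus: int) -> bool:
    """Return True if the load average is above the number of CPUs.

    Hypothetical helper illustrating the rule of thumb from the thread;
    it is not part of Nagios.
    """
    return load_avg > num_cpus


if __name__ == "__main__":
    # 1-minute load average as reported by the kernel (Unix only).
    one_minute_load, _, _ = os.getloadavg()
    # Logical CPU count; fall back to 1 if it cannot be determined.
    cpus = os.cpu_count() or 1
    if load_exceeds_cpus(one_minute_load, cpus):
        print(f"load {one_minute_load:.2f} exceeds {cpus} CPUs")
    else:
        print(f"load {one_minute_load:.2f} is within {cpus} CPUs")
```

With the figures from my report (load of 1.6 to 1.9 on two CPUs) the
check comes out negative, which matches the conclusion that hardware
is not the bottleneck.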

> > *Or* it is indicative of a misconfiguration on my
> > part. If the latter is the case, I'd be eager, nay ecstatic to
> > hear what I did wrong. Here are a few of the config vars that
> > might influence this:
> 
> There has been a slight thinko in Nagios. I don't know if it's still 
> there in recent CVS versions. The thinko is that it (used to?) calculate 
> average service check interval by adding up all normal_check_interval 
> values and dividing it by the number of services configured (or 
> something along those lines), which leads to long latencies. This 
> normally didn't make those latencies increase though. Humm...

Well, the numbers sure do get wacky after a restart: first the
latency skyrockets for about five minutes, then plummets to 1s. From
there it works its way up the way I described.
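
To illustrate the averaging described in the quote (add up all
normal_check_interval values, divide by the number of services), here
is a small Python sketch. The mix of services below is invented purely
for illustration and is not my real configuration, and whether Nagios
computes the figure exactly this way is only what the quote suggests
("or something along those lines").

```python
from statistics import median


def average_interval(intervals):
    """Plain arithmetic mean: sum of all intervals / number of services."""
    return sum(intervals) / len(intervals)


# Hypothetical mix: 95 services checked every 5 units and
# 5 cert checks at 1440 units. Not taken from a real config.
intervals = [5] * 95 + [1440] * 5

avg = average_interval(intervals)   # (95*5 + 5*1440) / 100
med = median(intervals)             # the "typical" service interval

print(f"average interval: {avg}")
print(f"median interval:  {med}")
```

A handful of long-interval services is enough to pull the mean far
above what most services actually use, which would explain an "average
service check interval" that looks odd next to the common 3 to 30
values.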

> > Total Services:                       2836
> > Services Checked:                     2836
> > Services Scheduled:                   2758
> > Active Service Checks:                2836
> > Passive Service Checks:               0
> 
> Not all services are being scheduled, but you have no passive service 
> checks. Have you disabled checks of 78 services?

Oops, forgot to mention that. Yes, a server farm is being rebuilt
currently. As I didn't want all the host check timeouts to make
matters much, much worse, I disabled those checks entirely.

> > Hardware is a dual-2.8GHz Xeon, 2G RAM and a 100 FDX interface.
> > LoadAvg is around 1.6, sometimes gets to 1.9. CPUs are both
> > around 40% idle most of the time. I see about 300 context
> > switches and 500 interrupts per second. The network load is
> > negligible, ditto the packet rate.
> > 
> > The way these figures look I don't see a performance problem per
> > se, but maybe I have overlooked a metric that describes the
> > "usual" bottleneck of installations.
> > 
> 
> Are the CPUs 64-bit ones running in 32-bit emulation mode? For Intel 
> CPUs, that causes up to 60% performance loss (yes, it really is that bad).

Sheesh. Yes, it is a 32-bit installation. I only ever bothered
with 64-bit installs on Opteron hardware. I might look into
migrating to 64 bits, then.

> I'm puzzled. Please let me know if you find the answer to this problem. 
> I'll help you debug it as best I can, but please continue posting 
> on-list. Thanks.

Sure. I'll first check whether the "processor upgrade" and kernel
update helped at all, then try lowering the reaper interval to
2 seconds. I'll post the results as soon as I have them.

Regards & Thanks,
Tobias
-- 
Never touch a burning system.

_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue. 
::: Messages without supporting info will risk being sent to /dev/null




