High CPU utilization at random times

Dan Wilson drw at adnc.com
Wed Oct 5 06:40:09 CEST 2005


I've been looking into a problem for quite some time now and have come 
up stumped.  Every time I think I know what the problem is I turn out to 
be wrong.

Sorry, this is LONG but has lots of detail, hopefully all the detail you 
guys need to make a diagnosis or point me in the right direction :-)

PROBLEM:
Randomly, and for no good reason, the CPU load (load average) on this 
machine will go up to anywhere from 0.7 to 1.5!?!?

HARDWARE:
PIII 677
384MB ram
Software RAID 1 with IDE (all partitions except swap; yes, I boot from it 
too... I already took crap for booting from software RAID, but it works 
fine, really)
Extra drive for swap and nightly "snapshots" of /usr/local/ and /etc and 
a few other things.

SOFTWARE:
Mandrake Linux 10.1 (last updated 45 days ago)
Nagios 1.2 (no perl interpreter, with perl cache)
Plugins 1.3.1
Optional/custom plugins...
check_icmp instead of check_ping
custom check_ink script/plugin - this plugin is written in Perl and uses 
the netsnmp module for Perl.  This isn't the problem either: I stopped all 
service checks that used it for a few hours and the problem was still 
there....  FYI: This script checks supply levels in network printers. I 
could have used the check_snmp plugin for this but that was too messy (I 
tried!). This way the output is cleaner (ex. Levels OK - C-34% Y-75% 
M-12% K-90%) and there is only one check per printer instead of one for 
each supply :-)  [My programming skills suck, really, they do.  You have 
to specify the type of printer, which has to be put in the script so it 
can correctly read the supplies...  I should have written it to 
"explore" the printer to see what kind of supplies it had and what could 
be checked so it would in theory work with any printer... but it works 
the way it is, and I couldn't figure out how to get everything to 
work... I'm learning and will some day get it to work the way I want.]
check_smart - checks HDD SMART values... not the trouble either; it was 
added recently after a HDD went bad and the box crashed 2 nights in a 
row (the extra drive was bad and failed during the "snapshot").
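
To illustrate the one-line output format described for check_ink above, 
here is a minimal sketch in Python. The real plugin is written in Perl and 
reads the levels over SNMP; this sketch does neither. The function name, 
the dict of levels, and the warn/crit thresholds are assumptions made for 
illustration and are not taken from the actual script.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def summarize_supplies(levels, warn=10, crit=5):
    """Collapse per-supply percentages into one plugin result.

    levels: dict mapping a supply name (e.g. "C") to percent remaining.
    warn/crit: hypothetical thresholds, not from the original script.
    Returns (exit_code, one_line_message).
    """
    worst = min(levels.values())
    if worst <= crit:
        code, label = CRITICAL, "CRITICAL"
    elif worst <= warn:
        code, label = WARNING, "WARNING"
    else:
        code, label = OK, "OK"
    detail = " ".join(f"{name}-{pct}%" for name, pct in levels.items())
    return code, f"Levels {label} - {detail}"


# Example values copied from the sample output quoted above.
code, message = summarize_supplies({"C": 34, "Y": 75, "M": 12, "K": 90})
print(message)  # a real plugin would then exit with `code`
```

Reporting the worst supply as the overall state is one possible design; 
it keeps a single service check per printer while still listing every 
level in the status text.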

The following were the latest stable versions as of about Feb-2005
Apache
MRTG
NetSNMP
PERL
PHP
MySQL


THINGS I HAVE DONE/LOOKED AT TO TRY AND FIX THIS ISSUE:

Recompiled the kernel... no change, went back to the standard kernel.

Restarted like a MS machine... uptime makes no difference, plenty of 
memory available (150+MB) all the time

Nagios - stopped the service, no issue; start the service and let it run 
a while, the problem appears...  I recompiled (twice), adjusted a few 
options, no luck with the issue though Nagios ran a tiny bit faster, maybe 
1-2%, not worth the wait to recompile IMHO

MRTG - checking interfaces on 2 routers. It is using RRD and the 
MRTG-RRD.CGI fast CGI script, so the load from this every 5 minutes isn't 
even worth mentioning.  Tried removing access from users to stop 
MRTG-RRD.CGI from generating graphs on demand. I even tried stopping 
MRTG and lost 4 hours of data but still had the problem.

Apache - stopped the service, problem still continues.

PERL - recompiled and removed a few options that the documentation said 
could cause trouble, no change.  Even ran Nagios without any perl 
scripts/plugins, problem still there.

PHP - nothing is using this at the moment... was only installed for 
testing a Nagios config utility with a web interface...

MySQL - not being used, makes no difference if it is running or not.

I only run X while downloading updates, otherwise it stays off and I 
just SSH in.


MORE INFO:

At first I only noticed it when I would SSH in and look at the load 
because it took 15+ seconds to log in.  I thought it was SSH, so I started 
having Nagios check the CPU load; now I can look from time to time and 
catch it up nice and high.

It is NOT logs being rotated, excessive swapping, bad hardware (second 
machine it's happened on), too many people accessing the box, or too many 
services/hosts down. (I'm checking about 90 hosts and 180+ services; 
after I delete the retention data and start Nagios fresh everything is 
checked and fine in 2 minutes or less.)

It's not to the point where the box is unusable; it clears up in a 
minute or two (always, every time, and that makes it hard to track down).
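
Since the spike clears within a minute or two, one way to catch it in the 
act is to poll the load average and record the busiest processes whenever 
it crosses a threshold. The following is only a sketch under stated 
assumptions: the 0.7 threshold comes from the figures in this post, while 
the function names, the 5-second polling interval, and the log file name 
are hypothetical. It relies on the procps `ps` found on Linux (the 
`--sort` option is procps-specific).

```python
import os
import subprocess
import time


def is_spike(loadavg, threshold=0.7):
    """True when the 1-minute load average reaches the threshold."""
    return loadavg[0] >= threshold


def top_cpu_processes(limit=10):
    """Return the ps header plus the `limit` busiest processes.

    The STAT column is included because processes in state D
    (uninterruptible I/O wait) also count toward the Linux load average.
    """
    out = subprocess.run(
        ["ps", "-eo", "pid,pcpu,pmem,stat,etime,args", "--sort=-pcpu"],
        capture_output=True, text=True, check=True,
    ).stdout
    return "\n".join(out.splitlines()[: limit + 1])


def format_snapshot(timestamp, loadavg, process_table):
    """Build one log entry: a timestamped load line, then the process table."""
    loads = " ".join(f"{value:.2f}" for value in loadavg)
    return f"{timestamp} load: {loads}\n{process_table}\n\n"


def watch(threshold=0.7, interval=5, logfile="load_spikes.log"):
    """Poll forever; append a snapshot whenever the load is high."""
    while True:
        loadavg = os.getloadavg()
        if is_spike(loadavg, threshold):
            entry = format_snapshot(time.ctime(), loadavg, top_cpu_processes())
            with open(logfile, "a") as fh:
                fh.write(entry)
        time.sleep(interval)
```

Calling `watch()` from a long-running shell session would leave a log to 
read after the next spike. The 1-minute load average lags behind the 
actual activity, so the snapshot shows what is running shortly after the 
spike begins rather than at its exact start.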

It is NOT (at least not that I can tell) Nagios making excessive retries 
on problems; it happens when there are no problems and I have the max 
retries set to 3 for all but a few things.  Timeouts are 10 seconds or 
less on all but one check.  I'm not using obsessive checks, processing 
perf data or anything like that.

When I first installed Nagios 2 years ago I tinkered with getting it to 
respond faster. I set the interval length to 15 seconds (default is 60?) 
so I could get a few things running every 15 or 30 seconds... works great 
and with little increased overhead....  I just have to remember that a 
1 minute check interval is now 4 and not 1... ;-)  Nagios responds like a 
champ now, forced checks don't take a minute or longer... 20 seconds at 
the longest.  I HATE WAITING! LOL
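
The "1 minute is now 4" arithmetic can be written down as a small helper. 
This is just an illustrative sketch: the function name is made up, and it 
assumes the setting in question is Nagios's `interval_length`, the number 
of seconds in one unit that check and retry intervals are expressed in.

```python
import math


def seconds_to_intervals(seconds, interval_length=15):
    """Convert wall-clock seconds to interval units, rounding up.

    interval_length is the number of seconds per unit: 15 on this box,
    60 in a default Nagios configuration.
    """
    return math.ceil(seconds / interval_length)


for label, secs in [("15 s", 15), ("30 s", 30), ("1 min", 60), ("5 min", 300)]:
    print(f"{label:>6} -> {seconds_to_intervals(secs)} unit(s)")
```

With a 15-second unit, a check meant to run once a minute needs an 
interval of 4, and one meant to run every 5 minutes needs 20.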




Any ideas?  Or should I just live with it until I upgrade to 2.0?  I'll 
be moving to faster hardware then anyway, dual PIII 700 with 2GB ram and 
hardware RAID1... It's not much but it is better :-)





_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users