Strange load average with Nagios 3

Thomas Guyot-Sionnest dermoth at aei.ca
Tue Apr 21 16:57:58 CEST 2009


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Andreas Ericsson wrote:
> Yann Jouanin wrote:
>>>> Has anyone the same observation ?
>>> I've never seen anything like it before. Does the 4-hour interval coincide
>>> with your check-interval? Does it align with some performance-data
>>> processing?
>>
>> No, we process performance data every 15 seconds (and the pattern was still
>> reproduced even when processing was stopped).
>> Our check intervals are 1 min and 5 min, which do not seem to be correlated
>> with the 4-hour slot.
>>
>>>> Can something in Nagios behavior explain
>>>> this load ?
>>> Not really, no. I suppose a database logging application could display a
>>> load pattern such as this if it manages its tables really poorly and then
>>> vacuums them at the peak of the load, but since you mentioned nothing about
>>> NDOUtils or anything similar I'll just assume you have no such things
>>> installed.
>>
>> There is no mysql nor NDOUtils running on these servers.
>> Only Nagios and PNP (NPCD + BULK)
>>
>>>> Our servers are running different Linux distributions and we spot out the
>>>> fact that the pattern is certainly due to Nagios.
>>>>
>>> How did you ascertain this? Sorry for being skeptical, but I've seen
>>> "Oh I'm really, really sure" followed by "oops turned out I was wrong"
>>> too many times to trust others' eyes ;-)
>> I can understand your skepticism, let's say we strongly guess (instead of
>> certainly!) this is due to nagios: 
>> 	-  stopping NPCD doesn't change the pattern. 
>> 	-  we check cronjob, nothing was running with a 4 hour periodicity
>> 	-  the different servers don't run the same services (E.G : some
>> have backup with bacula, some not)
>> 	-  We can unfortunately not stop the nagios process (because it's
>> production!) but, the amplitude of lobes seems to be quite correlated with
>> the number of services.
>>
> 
> That seems to rule out the basics and some of the esoterics at least.
> 
> Does this state persist if you restart Nagios, or does it sort of grow
> into place after it's been running a while?
> 
> It would be nice to be able to see average run-time of plugins over the
> time of that graph. I could imagine long-running checks to sort of pile
> up until they spill over and miss one of their check-windows, but that
> *should* mean load slowly increased and then stayed at a small plateau.

For the record, I noticed this on the very first day I switched to
3.0.1-cvs (close to 3.0.2) on two Nagios servers, and this behavior has
persisted since then (I sent an email to the mailing list back then and can
retrieve it if you like). It has absolutely nothing to do with
check_interval, CPU usage, IO usage or anything, and is very consistent
across restarts, server reboots, etc. Everything is running fine though
(and CPU usage was consistent between the two versions); I just had to
adjust the load thresholds on these servers to cope with it.

> In short; I have no idea what causes this behaviour.

Me neither. Worth noting, though, is that my temporary directories (status
files, check results, temp files) are all on tmpfs ramdisks.

Some other specs:

Servers are dual-proc x86
Running Slackware 11.0.0 with custom 2.6.20.1 kernel

On the busiest server:
  Current Running time: 3d 21h 43m 41s
  High Command Buffers: 31 / 4096
  Services Actively Checked: 1726
  Active Service Latency: 0.000 / 3.538 / 0.643 sec
  Active Services Last 1/5/15/60 min: 1253 / 1450 / 1452 / 1707

- --
Thomas
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (MingW32)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFJ7d7n6dZ+Kt5BchYRAvNYAJ0ZXfv+yTFh7E1xVoZWPdSUjGrTKwCgkJZU
B2FNRkESWzK44zwyLD6nXiU=
=kG0L
-----END PGP SIGNATURE-----
