One cause of the 'Internal Server Errors' with nagios 2.3

Bill Ryder bill.ryder.nz at gmail.com
Wed May 17 23:21:49 CEST 2006


Hi,

I've been asked this a few times, so I'll respond to the mailing list
instead of to individual people.

The questions are typically

  * What machine do you run nagios on?
  * Any hints about monitoring this many machines?

So ..

What machine do you run nagios on?
==================================

We use a dual-processor (single-core) 2.8GHz Xeon IBM blade server with
4GB of RAM (I thought it was 6GB - sorry about that Patrick) for the
renderwall monitoring. Typically, though, we only use about 3GB of the
4GB.

Reloading or restarting nagios takes about 5 minutes with approximately
1500 hosts and 9000 services. During that time a single CPU sits at
100%.

We run Debian GNU/Linux on this machine with a 2.6.15 kernel. I tried
a 2.4 kernel but dropped it early on while debugging.

I'm currently running nagios 2.3.

All services except the host pings are checked using passive checks.
We check various Weta-specific things related to rendering on each
machine. It's all Perl run from cron.

Once nagios is up and running it is rare to see more than one CPU
being used. In other words, the blade server is not sweating under the
load. That one CPU hits 100% about a quarter of the time.

Any hints about monitoring that many machines?
==============================================

Do not process perfdata.
------------------------

This was a killer. I discovered to my surprise that nagios runs
/usr/bin/printf to record performance data. Every time a passive check
sent a result back to nagios, it would start a subshell and run
/usr/bin/printf to record some data. That is far too expensive with
9000 services being checked every 5 minutes.

If I needed the performance data I think I would hack nagios to
recognise /usr/bin/printf in a checkcommand and replace it with the
library call - obviously this would have to be done carefully to
handle various shell things (like >> etc).
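
To put a rough number on the cost described above, here is a
back-of-the-envelope sketch. It assumes the 9000 results arrive evenly
across the 5-minute interval, which is an assumption (in practice they
can bunch up), and the function name is made up for illustration.

```python
def launches_per_second(services, interval_minutes):
    """Average perfdata command launches per second, one per check result."""
    return services / (interval_minutes * 60)


# Figures from this post: 9000 services, checked every 5 minutes.
rate = launches_per_second(9000, 5)
print(rate)  # 30.0
```

That is about 30 subshell-plus-printf launches every second, on average,
just to write perfdata. If the data is not needed, setting
process_performance_data=0 in nagios.cfg turns the processing off.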

Spread out your passive checks.
-------------------------------

We run our passive checks every five minutes. Perhaps a little
aggressive. We had to spread out the passive check start times using a
random sleep before starting them. At the moment the passive check
scripts will sleep between 0 and 3 minutes before starting up. This
spreads the load nicely.
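
The random sleep described above can be sketched as follows. Our real
scripts are Perl; this Python version is only an illustration, and the
names pick_delay, run_with_jitter and check are hypothetical stand-ins,
not anything from our setup.

```python
import random
import time

MAX_DELAY_SECONDS = 180  # "between 0 and 3 minutes", per the text above


def pick_delay(max_seconds=MAX_DELAY_SECONDS, rng=random):
    """Return a uniformly random delay in [0, max_seconds] seconds."""
    return rng.uniform(0, max_seconds)


def run_with_jitter(check, max_seconds=MAX_DELAY_SECONDS, sleep=time.sleep):
    """Sleep for a random delay, then run the check and return its result.

    `check` is a hypothetical callable standing in for the real check
    script; `sleep` is injectable so the function can be tried out
    without actually waiting.
    """
    delay = pick_delay(max_seconds)
    sleep(delay)
    return check()
```

Cron would start the script on every host at the same minute; the
random delay is what keeps the results from all landing on nagios at
once.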

Adding more hosts will just be a matter of spreading these out more,
until I have to change to checking every 10 minutes instead of every 5
minutes. For now, though, that's not necessary.
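
As a rough illustration of where that limit sits, the sketch below
works out how wide the sleep window would need to be to hold the
average arrival rate at today's level. It assumes results arrive evenly
across the window and that today's rate is the one worth preserving;
the 15000-service figure is purely hypothetical.

```python
def window_needed(services, results_per_second):
    """Seconds of spread needed to keep arrivals at the given average rate."""
    return services / results_per_second


# Today: 9000 services spread over a 3-minute (180 s) sleep window.
current_rate = 9000 / 180
print(current_rate)  # 50.0

# Hypothetical growth to 15000 services at the same average rate.
print(window_needed(15000, current_rate))  # 300.0
```

At 300 seconds the sleep window would fill the entire 5-minute interval,
leaving no time for the checks themselves to run, which is the point at
which a 10-minute interval becomes necessary.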


Caveats
=======

We do no notification from nagios for the renderwall. The wranglers
use the web pages to monitor the machines. (Obviously we notify for
our production nagios instance, but that's a much smaller problem.)



Hope this helps,

---------
Bill Ryder
System Engineer
Weta Digital.

