Nagios performance issue for 411 nodes and 1218 services

Sean Dilda agrajag at dragaera.net
Wed Mar 17 16:43:29 CET 2004


On Wed, 2004-03-17 at 08:06, Frank Pikelner wrote:
> We're running Nagios 1.1 on a Dual Xeon 2.4GHz box with 512MB RAM on FreeBSD
> 4.8. At present we monitor 411 nodes and 1218 service checks every 5
> minutes. As the number of nodes and service checks passed about 350 nodes
> and 1000 service checks we seem to get continuously from 0-30 ICMP timeouts
> for various servers (random). From Nagios I can manually PING the hosts and
> no timeouts occur (ping times are 40-100ms over WAN). Nagios is configured
> to do Smart Scheduling. If I look in Performance info for Nagios I get about
> 99-100% of the checks completing in less than 5 minutes (though does improve
> every once in a while). The CPU and the level of traffic does not appear to
> be great.

How long has nagios been running since it last reloaded its config or
was restarted?  The smart scheduling isn't perfect.  It sets things up
nice and neat to begin with, but it never corrects after that.  So, once
you get one service that isn't in the OK state, your scheduling is no
longer optimal.  If nagios runs for a long time with various errors
detected during that time, things can become way out of whack.

One thing I'd suggest is checking the current scheduling.  From the
nagios web interface, there should be a link on the left that says
'Scheduling Queue'.  This will tell you when the last check for each
service was, and when the next one is scheduled.  It's sorted by time of
next check.  How does it look?  Is everything spread out evenly?  Or are
there a bunch of checks at about the same time, then wide gaps?
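
If eyeballing the queue is hard with over a thousand entries, a rough
sketch like the following can flag the wide gaps.  It assumes you have
already copied the next-check times out of the Scheduling Queue page
into a list of seconds; the function name, the sample data, and the
"3x the mean gap" threshold are all made up for illustration and are
not anything nagios provides.

```python
from statistics import mean


def find_gaps(next_checks, factor=3.0):
    """Return (start, end) pairs where the gap between two consecutive
    scheduled checks is more than `factor` times the mean gap."""
    times = sorted(next_checks)
    pairs = list(zip(times, times[1:]))
    if not pairs:
        return []
    threshold = factor * mean(b - a for a, b in pairs)
    return [(a, b) for a, b in pairs if b - a > threshold]


# Hypothetical next-check times, in seconds from now.
evenly_spread = list(range(0, 300, 10))
bunched = [0, 1, 2, 3, 4, 5, 200, 201, 202, 203]

print(find_gaps(evenly_spread))
print(find_gaps(bunched))
```

An empty result suggests the checks are spread fairly evenly; any pairs
returned mark the start and end of an unusually wide gap.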

If a bunch of checks are squeezed together, and there are some huge
gaps, I'd recommend sending the nagios process a SIGHUP (kill -HUP
<pid>).  In my experience that will cause the nagios process to
reschedule everything with the smart scheduling, like it would do when
it first starts.
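
If you'd rather script that than look up the pid by hand, here is a
minimal sketch.  It assumes the nagios pid is stored as a plain number
in the file named by the lock_file setting in your main config; the
path you pass in is whatever your install uses, and both function names
are hypothetical.

```python
import os
import signal


def read_pid(lock_file):
    """Read the process id that the daemon wrote to its lock file."""
    with open(lock_file) as f:
        return int(f.read().strip())


def reload_nagios(lock_file):
    """Send SIGHUP to the process named in lock_file and return its pid.

    Same effect as `kill -HUP <pid>`; it needs to run as a user that is
    allowed to signal the nagios process.
    """
    pid = read_pid(lock_file)
    os.kill(pid, signal.SIGHUP)
    return pid
```

Call reload_nagios() with the lock file path from your own nagios.cfg.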

If the checks are evenly spaced for the most part, you may be hitting
your limits.  One thing I've found that helps to reduce the load on the
machine is to increase the normal_check_interval, if your site's
policies allow that.
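
To put rough numbers on that, here is a back-of-the-envelope sketch
using the figures from your message (1218 services, checked every 5
minutes).  It assumes every service uses the same interval, that the
interval is in minutes, and it ignores host checks and retries, so
treat the result as an estimate only.  The function name is made up.

```python
def checks_per_second(num_services, interval_minutes):
    """Average number of service checks nagios has to start each second."""
    return num_services / (interval_minutes * 60)


current = checks_per_second(1218, 5)    # every 5 minutes, as now
relaxed = checks_per_second(1218, 10)   # if the interval were doubled

print(f"5 minute interval:  {current:.2f} checks/sec")
print(f"10 minute interval: {relaxed:.2f} checks/sec")
```

Doubling the interval halves the average check rate, from roughly 4
checks a second to roughly 2.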

Another consideration: check the load on the nagios box when it's
getting the timeouts.  If the load isn't that high, you might want to
consider just getting better networking for the machine.  It's also
possible that those ping timeouts are occurring because some parts of
your network are just bogged down at those times.
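
One way to capture the load at the moment a timeout shows up is a small
snapshot helper like the sketch below.  It only uses the standard
library; the function names are made up, and what counts as "high" for
the per-CPU figure is a judgment call for your own box.

```python
import os


def load_per_cpu(load_1min, cpus):
    """1-minute load average divided by the number of CPUs."""
    return load_1min / cpus


def snapshot():
    """Return the current 1-minute load, CPU count, and load per CPU."""
    one, _five, _fifteen = os.getloadavg()   # Unix only
    cpus = os.cpu_count() or 1
    return {"load_1min": one, "cpus": cpus,
            "per_cpu": load_per_cpu(one, cpus)}
```

Running snapshot() from cron and comparing the timestamps against the
timeout alerts would show whether the two line up.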


