Nagios performance issue for 411 nodes and 1218 services

Marc Powell marc at ena.com
Wed Mar 17 16:53:34 CET 2004


On Wednesday, March 17, 2004 7:06 AM, Frank Pikelner shared with us:

> We're running Nagios 1.1 on a Dual Xeon 2.4GHz box with 512MB RAM on
> FreeBSD 4.8. At present we monitor 411 nodes and 1218 service checks
> every 5 minutes. As the number of nodes and service checks passed
> about 350 nodes and 1000 service checks we seem to get continuously
> from 0-30 ICMP timeouts for various servers (random). From Nagios I
> can manually PING the hosts and no timeouts occur (ping times are
> 40-100ms over WAN). Nagios is configured to do Smart Scheduling. If I
> look in Performance info for Nagios I get about 99-100% of the checks
> completing in less than 5 minutes (though does improve every once in
> a while). The CPU load and the level of traffic do not appear to be
> great.
> 
> My question is have we reached a monitoring limit (would doubt this)?
> Is there a bug in Nagios 1.1 that may be affecting our ability to
> monitor this number of hosts? 
> Any suggestions on what other troubleshooting can be done in Nagios?

I doubt it in both cases. I have a PIII 800 with 512MB RAM running Nagios
and Cricket at 5 minute intervals for 751 hosts and 1189 services with
no problems (adding SmokePing put it over the edge though, primarily for
disk I/O reasons). Some things to check --

. Verify speed and duplex are hard-coded and match on the machine and
whatever it's connected to.
. Reduce your service_reaper_frequency. I have mine set at 2 seconds.
. Enable aggregate_status_updates.
. Increase your status_update_interval. If you have it set low (30
seconds for example), Nagios is going to spend a lot of time just
writing the status file to disk. I update mine every minute at least,
and on some machines at 5 minute intervals. *NOTE* This will affect the
service checks completed percentage in the performance info. The checks
are still executing as scheduled, but since the status file isn't being
updated as frequently, the CGIs don't know that. It's all a matter of
timing. If you look at performance info right after the status file has
been written then it'll be at or near 100%; if you look later, it'll be
less.
. Try using check_fping instead of check_ping. It's a bit friendlier.
. Run /path/to/nagios -v /path/to/nagios.cfg and make sure your
max_concurrent_checks is at or above the recommended value (I typically
add half again the recommended number).
. If you have hosts that are down a lot on your network, make sure that
your host check_command completes quickly (i.e. a single ping) and that
you retry only the minimum number of times needed to satisfy you that
it's down. Nagios stops doing everything else until a host check
completes, which can delay service check processing.
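
To put rough numbers on a few of the points above, here is a small
back-of-the-envelope sketch in Python. It is only arithmetic, not
anything Nagios itself computes. The 1218 services and the 5 minute
interval come from Frank's message; the function names and the
"recommended" value of 40 are made up for illustration, so substitute
whatever nagios -v actually reports on your box.

```python
import math


def checks_per_second(num_services: int, interval_seconds: float) -> float:
    """Average check rate if every service is checked once per interval."""
    return num_services / interval_seconds


def with_headroom(recommended: int, factor: float = 1.5) -> int:
    """'Half again' the recommended max_concurrent_checks, rounded up."""
    return math.ceil(recommended * factor)


def status_writes_per_hour(status_update_interval: float) -> float:
    """How often the status file is rewritten per hour at a given interval."""
    return 3600 / status_update_interval


if __name__ == "__main__":
    # 1218 services every 5 minutes (300 seconds), from the original question.
    print(f"{checks_per_second(1218, 300):.2f} checks/second on average")

    # 40 is a hypothetical value standing in for what nagios -v recommends.
    print("max_concurrent_checks with headroom:", with_headroom(40))

    # Compare status file write frequency at 30 s, 60 s and 300 s intervals.
    for interval in (30, 60, 300):
        print(f"{interval:>3} s interval:",
              status_writes_per_hour(interval), "writes/hour")
```

The point of the last loop is just that going from a 30 second to a 5
minute status_update_interval cuts status file writes by a factor of
ten, which is where the disk savings come from.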

HTH,

Marc





