Reducing Load on a Distributed Nagios Installation

Fred f1216 at yahoo.com
Mon Oct 24 20:35:27 CEST 2005


Yes, thank you, this was very helpful.  Our large clusters were having a
similar problem; it turned out that the host checks were taking
hours to complete (1090 nodes x 10 passive services each, 6 distributed
monitors).  I changed the host check to a sudo check_icmp, which takes
only .006ms vs. about 2.5 seconds for a check_ping (and the default is to
do 3 of these checks!).
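
As a rough sanity check on the "hours" figure, here is a minimal sketch of
the arithmetic. It assumes the host checks run strictly one after another
and that every host goes through all 3 attempts; both are assumptions, and
the function name is made up for illustration.

```python
# Back-of-the-envelope estimate using the figures quoted above.
# Assumption: host checks run back to back, with no parallelism.

NODES = 1090
PING_SECONDS = 2.5   # approximate duration of one check_ping
PING_ATTEMPTS = 3    # "the default is to do 3 of these checks"


def serial_seconds(nodes, seconds_per_check, attempts=1):
    """Total wall-clock time if every check runs one after another."""
    return nodes * seconds_per_check * attempts


total = serial_seconds(NODES, PING_SECONDS, PING_ATTEMPTS)
print(f"check_ping, {PING_ATTEMPTS} attempts per host: {total / 3600:.2f} hours")
```

Under those assumptions a single pass over all hosts takes a bit more than
two hours, which is in line with what we saw.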

Things proceed reasonably with these changes.  However, I will ultimately
implement a database lookup solution where the host status is saved via
some other mechanism.

The other problem I run into is stale checks.  I'm thinking that a new
notification state might be justified in Nagios.  Currently you can
enable or suppress alerts based on w,u,c,r etc., but with distributed monitoring
we are actually overloading a service definition if we want a master
server to report or even act on stale data.  For example, if I configure
a passive service on the master Nagios and supply the data normally from
the distributed monitors, things work well.  If, however, the distributed
Nagios monitor goes away and the master decides to execute the passive check
command, which typically says "warning: service is stale", you can wind up
with a lot of alerts.  The problem is that if you suppress the alerts for that
service, then you also suppress the real warning states when they are sent
from the distributed monitor.  If Nagios had a "stale" state, then it might be
possible to deal better with getting lots of stale alerts.
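
To make the idea concrete, here is a minimal sketch of how a separate
"stale" flag could be filtered independently of the real states. This is
purely hypothetical: Nagios has no "stale" state, and the "s" option and
the function name below are made up for illustration.

```python
# Hypothetical sketch: Nagios has no "stale" state or "s" option.
# This only illustrates filtering stale results separately from real ones.

def should_notify(state, is_stale, options):
    """Return True if an alert should go out.

    state    -- one of "w", "u", "c", "r", as in the options listed above
    is_stale -- True if the result came from the master's freshness check
                rather than from a distributed monitor
    options  -- set of enabled notification options; "s" is the made-up
                stale flag
    """
    if is_stale:
        return "s" in options
    return state in options


# Real warnings from the distributed monitor still alert; stale ones do not.
options = {"w", "c", "r"}
print(should_notify("w", False, options))  # True
print(should_notify("w", True, options))   # False
```

With something like this, stale results could be silenced (or routed
elsewhere) without also losing the genuine warnings.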

-FredC

--- Jan-Piet Mens <jpm at retail-sc.com> wrote:

> Hello Marcel,
> 
> thank you for your comments. The guys in charge of Nagios need or want
> the host alive status, so we have to go that way.
> 
> Regards,
> 	-JP
> 
> On Mon Oct 24 2005 at 16:34:42 CEST, Marcel Mitsuto Fucatu Sugano wrote:
> 
> > Hi JP,
> > 
> > On Sat, 2005-10-15 at 11:53 +0200, Jan-Piet Mens wrote: 
> > > We've experienced quite a bit of load on a distributed Nagios
> > > installation with several thousand passive service checks which
> > > are supplied to a central Nagios server via NSCA. Our central
> > > Nagios 1.2 server started swapping and subsequently thrashed
> > > itself to death. After a bit of debugging, we've come up with a
> > > solution which may be interesting to those in a similar position.
> > 
> > I'm dealing with distributed monitoring with a central server as you do,
> > but in my case, we have 11 monitoring agents that send their check
> > results to nsca on the central server. I'm using nagios2.0b4 for the
> > central server and nagios1.X on the agents. Counting all checks that are
> > passively sent to the central server, it comes to over 10000 passive
> > checks being received by one piece of commodity hardware, highly available,
> > a Pentium4-HT, running SuSE9.3, very simple. But it works. No thrashing
> > experienced so far. But we do not check_icmp over stale check results.
> > We simply show this as an Unknown alert with an output of stale, and try
> > to find reasonable freshness thresholds.
> > 
> > In your situation, I would think about upgrading the central nagios
> > server to 2.0b4.
> > 
> > > 
> > > We've documented the proceedings as well as the solution we 
> > > implemented at http://wiki.fupps.com/nagios/icmp
> > > 
> > > Regards,
> > > 	-JP
> > 
> > Nice solution there. It shows that an installation with a big passive
> > nagios configuration may thrash the central server if
> > freshness_threshold and freshness checking report stale results from
> > the distributed monitoring agents so frequently that the command
> > associated with each stale passive service forks too many children,
> > all waiting to write to the pipe.
> > 
> > But have you thought about _not_ checking host-alive whenever a stale
> > result checks in? Anyway, it was very nice and clear reading about the
> > workaround for your problem. Thanks.
> > 
> > -- 
> > Marcel Mitsuto Fucatu Sugano <msugano at uolinc.com>
> > Universo Online S.A. -- http://www.uol.com.br
> > 
> 
> 
> -------------------------------------------------------
> This SF.Net email is sponsored by the JBoss Inc.
> Get Certified Today * Register for a JBoss Training Course
> Free Certification Exam for All Training Attendees Through End of 2005
> Visit http://www.jboss.com/services/certification for more information
> _______________________________________________
> Nagios-users mailing list
> Nagios-users at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-users
> ::: Please include Nagios version, plugin version (-v) and OS when reporting
> any issue. 
> ::: Messages without supporting info will risk being sent to /dev/null
> 






