Nagios2 process overwhelmed by NSCA daemon?

Marcel mitsuto at gmail.com
Thu Dec 10 15:29:15 CET 2009


In my last job, I was dealing with a Nagios install a fair bit larger than
yours.

On Wed, Dec 9, 2009 at 9:06 PM, Jonathan Call <jcall at verio.net> wrote:

> I recently added two new slaves to a distributed Nagios system. The
> central server now passively processes 17,000+ service checks on 3000+
> servers.
>
> It's been over an hour and a half since I brought those new slaves
> online and I have about 150 hosts still stuck in 'Pending' and about
> 1300 services in the same state. In addition to that it seems that the
> service check results from the other slaves that were working normally
> are now arbitrarily disappearing. For example, on one host three of the
> service checks have been updated relatively recently (i.e. 5-30 minutes
> ago) but three other service checks haven't been updated for almost an
> hour. The slaves all appear operational and the hosts are being checked
> on time. Is it possible I've overwhelmed Nagios' ability to process data
> from the NSCA daemon or struck some internal Nagios bottleneck? Any
> suggestions would be appreciated.
>

That setup had about 4,000 servers and just over 24,000 service checks,
spread across 12 or 13 distributed servers.

Well, I ran into many kinds of problems because of Nagios' poor design for
distributed monitoring. The distributed setup feels like it was bolted on as
an afterthought, just to work around a limitation.

We ended up writing some custom passive plugins. They were built to send
status updates only on state changes, which greatly reduced the load on the
NSCA side (NSCA was load-balanced behind a virtual IP and we batched the
updates, but problems would still occur). This set of plugins was a little
hard to maintain, because each server's check configuration had to live on
the monitored server itself; Puppet FTW. All check results were logged and
later synchronized through NDO, so we still kept the full check history. A
rough sketch of the state-change idea follows below.
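
Here's a minimal sketch of that wrapper idea, in shell. It assumes you run
it from cron or similar; the cache path, central hostname and send_nsca
config path are placeholders, not our actual setup:

    #!/bin/sh
    # send_on_change.sh <host> <service> <plugin> [args...]
    # Run a normal plugin, but forward the result via send_nsca only
    # when the state differs from the previous run.
    HOST="$1"; SERVICE="$2"; shift 2
    CACHE="/var/tmp/nsca_state.$HOST.$SERVICE"   # last known state

    OUTPUT=$("$@")        # run the actual check plugin
    STATE=$?              # 0=OK 1=WARNING 2=CRITICAL 3=UNKNOWN

    LAST=$(cat "$CACHE" 2>/dev/null)
    if [ "$STATE" != "$LAST" ]; then
        # state changed: push one passive result to the central server
        printf '%s\t%s\t%s\t%s\n' "$HOST" "$SERVICE" "$STATE" "$OUTPUT" \
            | send_nsca -H central.example.com -c /etc/send_nsca.cfg
        echo "$STATE" > "$CACHE"
    fi

In real life you'd still want a periodic full resend, so the central side
can tell "no change" apart from "host dead"; that's exactly what the
freshness trick further down is for.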

NDO and the database schema had to be modified too. The volume of INSERTs
was far too high to be handled correctly in a timely manner: recurring
restarts of the database caused stale results, and we hit every sort of
problem in managing those systems, even after thorough tuning of the
database. After adding logic to update rows only when a state change
occurred, plus a separate batch job to update the last-check timestamp and
the related fields, database load returned to normal and the setup could
finally be shown to scale. A sketch of that batching follows below.
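
To illustrate the batching, here's a sketch assuming a MySQL backend. The
staging table and column names are made up for the example; they are *not*
the real NDO schema:

    #!/bin/sh
    # batch_lastcheck.sh -- fold thousands of per-check UPDATEs into one
    # periodic statement. Run from cron every minute or so.
    mysql ndoutils <<'SQL'
    -- the collector appends raw results to incoming_results instead of
    -- touching servicestatus directly; apply the fresh ones in bulk
    -- (a real version would dedupe per service and delete only the
    -- rows it actually consumed)
    UPDATE servicestatus s
    JOIN   incoming_results r ON r.service_object_id = s.service_object_id
    SET    s.last_check = r.check_time,
           s.output     = r.output
    WHERE  r.check_time > s.last_check;
    TRUNCATE TABLE incoming_results;
    SQL

The point is that the hot table takes one big write per interval instead of
one write per check result.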

So what I'd suggest is to first apply the large-installation tweaks: put
status.dat, objects.cache and retention.dat on tmpfs, set up batch jobs
that pipe send_nsca output to the central/master Nagios instance, and so
on. You can also do some Nagios setup magic, having the distributed nodes
check at a different frequency (normal_check_interval) than the central
Nagios expects: say, configure the central Nagios to expect status
information every 30 minutes, but have the distributed nodes send it every
15 minutes, something like that. A sketch of both tweaks is below.
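
A minimal sketch of both tweaks. The directive names are the stock ones
from nagios.cfg and the object configs; the paths, host/service names and
numbers are only examples, and a matching check_dummy command definition is
assumed to exist:

    # nagios.cfg on the central server: keep the hot files on tmpfs
    status_file=/dev/shm/nagios/status.dat
    object_cache_file=/dev/shm/nagios/objects.cache
    state_retention_file=/dev/shm/nagios/retention.dat
    check_service_freshness=1

    # passive service on the central server: alert only when no result
    # has arrived within the 30-minute window
    define service {
        use                     generic-service
        host_name               web01
        service_description     Disk Usage
        active_checks_enabled   0
        passive_checks_enabled  1
        check_freshness         1
        freshness_threshold     1800   ; seconds
        check_command           check_dummy!2!Stale passive result
    }

    # matching active check on the distributed node, running twice as
    # often as the central side expects
    define service {
        use                     generic-service
        host_name               web01
        service_description     Disk Usage
        normal_check_interval   15     ; minutes, with the default
                                       ; interval_length of 60
    }

That way a single lost NSCA packet doesn't immediately show up as a stale
service on the central side.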

As far as I know, getting a Nagios configuration to enterprise scale is a
really cumbersome job. For tiny, trivial installs it's comparable to using
Zenoss or Zabbix, just with a lot of extra configuration-file pain. I don't
think any competitor's tool (Z*bbnn*ssxx) would scale any better for huge
enterprise installs, so Nagios is a little ahead and gives you flexibility,
but with an associated cost that scares everyone (some end up buying
another tool that does much less for much more).

That's why I like Jean Gabès' Shinken approach to scalability and to easier
interoperability with Puppet. That would be the über-super-mega-ultra tool.
With nginx and asynchronous front-end, back-end, and checks, it would end
up being the most robust, easy, enterprise NMS.

So, Jean, keep going down the path of making Shinken backward-compatible
with Nagios setups, but also think ahead in the design about integrating
Puppet to handle configuration convergence (maybe event handlers too?).

Cheers,
M