Nagios2 process overwhelmed by NSCA daemon?

Jonathan Call jcall at verio.net
Mon Dec 14 19:41:08 CET 2009


See responses inline:

> -----Original Message-----
> From: Thomas Guyot-Sionnest [mailto:dermoth at aei.ca]
> Sent: Sunday, December 13, 2009 9:23 PM
> To: Jonathan Call
> Cc: nagios-user Mailinglist
> Subject: Re: [Nagios-users] Nagios2 process overwhelmed by NSCA daemon?
> 
> On 09/12/09 06:06 PM, Jonathan Call wrote:
> > I recently added two new slaves to a distributed Nagios system. The
> > central server now passively processes 17,000+ service checks on
> > 3000+ servers.
> >
> > It's been over an hour and a half since I brought those new slaves
> > online and I have about 150 hosts still stuck in 'Pending' and about
> > 1300 services in the same state. In addition to that, it seems that
> > the service check results from the other slaves that were working
> > normally are now arbitrarily disappearing. For example, on one host
> > three of the service checks have been updated relatively recently
> > (i.e. 5-30 minutes ago) but three other service checks haven't been
> > updated for almost an hour. The slaves all appear operational and
> > the hosts are being checked on time. Is it possible I've overwhelmed
> > Nagios' ability to process data from the NSCA daemon or struck some
> > internal Nagios bottleneck? Any suggestions would be appreciated.
> 
> Hmmm, very interesting. Which Nagios version are you using?

Nagios 2.12 (May 19, 2008) on FreeBSD 6.3
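
(For anyone following along: the slaves push each result back to the
central server over NSCA, roughly like the sketch below. The hostname
and paths are placeholders, not our exact setup.)

# run on a slave for each service check result, e.g. from an
# ocsp_command; fields are host, service, return code, plugin output
printf "%s\t%s\t%s\t%s\n" "slave-web01" "HTTP" 0 "OK - 200 in 0.12s" | \
  /usr/local/bin/send_nsca -H central.example.com \
  -c /usr/local/etc/nagios/send_nsca.cfg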

> 
> This sounds a lot like a problem I encountered a few years ago with
> passive checks. I had about 50-60 servers returning cron-scheduled
> check results to the Nagios server. 120 results ain't that much, but
> it seemed that with all the servers fully time-synced (using NTP),
> out of these ~120 results I was often missing some, which would
> eventually cause false alarms due to stale services.
> 
> I could easily reproduce the problem by feeding lots of results to
> Nagios right when I was expecting a batch of passive results - this
> would cause random results to be dropped. I spent some time trying to
> debug this but I couldn't figure out where commands were dropped. My
> primary suspect was the ring buffer used by the command reaper. As
> far as I can remember I tested with versions of Nagios ranging from
> 2.3 to 2.5; I never tried with recent versions.
> 
> If you're running a recent version of Nagios, what do you get for
> "Used/High/Total Command Buffers" in the "nagiostats" command output?
> (You can also get these numbers from the web interface, under
> "Performance Info" in the left bar.) If it seems to be maxed out, you
> may try setting "command_check_interval" to "-1" and raising the
> "external_command_buffer_slots" option in nagios.cfg.
>

Buffer report from Nagiostats:
Used/High/Total Command Buffers:      25 / 4096 / 4096
Used/High/Total Check Result Buffers: 0 / 4096 / 4096

Nagios config:
command_check_interval=-1
external_command_buffer_slots=4096
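
A single nagiostats snapshot can miss short spikes, so it may be worth
sampling the buffer counters in a loop, along these lines (the --mrtg
variable names here are from memory and may not match 2.12 exactly):

# poll command buffer usage every 10 seconds
while true; do
    /usr/local/bin/nagiostats --mrtg \
        --data=USEDCMDBUF,HIGHCMDBUF,TOTCMDBUF
    sleep 10
done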

 
> 
> If you're still having this problem with Nagios v3 and up, I might
> try to reproduce this as well, and maybe I'll be able to figure out
> what's wrong this time.

Upgrading to Nagios v3 is being considered but isn't possible at this
time.

As I mentioned to someone else on this thread, a large number of
status.cgi queries against the web interface seems to provoke poor
performance from the central server, even after we switched the main
objects.cache and status.dat files to a memory disk.
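
For anyone else trying the memory disk approach on FreeBSD, it boils
down to something like this (size and paths are examples rather than
our exact values):

# create and mount a small memory-backed filesystem
mdmfs -s 64m md /usr/local/nagios/var/ramdisk

# then point nagios.cfg at it
status_file=/usr/local/nagios/var/ramdisk/status.dat
object_cache_file=/usr/local/nagios/var/ramdisk/objects.cache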

Jonathan



