Nagios2 process overwhelmed by NSCA daemon?

Thomas Guyot-Sionnest dermoth at aei.ca
Mon Dec 14 05:22:48 CET 2009


On 09/12/09 06:06 PM, Jonathan Call wrote:
> I recently added two new slaves to a distributed Nagios system. The
> central server now passively processes 17,000+ service checks on 3000+
> servers. 
> 
> It's been over an hour and a half since I brought those new slaves
> online and I have about 150 hosts still stuck in 'Pending' and about
> 1300 services in the same state. In addition to that it seems that the
> service check results from the other slaves that were working normally
> are now arbitrarily disappearing. For example, on one host three of the
> service checks have been updated relatively recently (i.e. 5-30 minutes
> ago) but three other service checks haven't been updated for almost an
> hour. The slaves all appear operational and the hosts are being checked
> on time. Is it possible I've overwhelmed Nagios' ability to process data
> from the NSCA daemon or struck some internal Nagios bottleneck? Any
> suggestions would be appreciated.

Hmm, very interesting. Which Nagios version are you using?

This sounds a lot like a problem I encountered a few years ago with 
passive checks. I had about 50-60 servers returning cron-scheduled check 
results to the Nagios server. 120 results ain't that much, but it seemed 
that with all the servers fully time-synced (using NTP), some of these 
~120 results often went missing, which would eventually cause false 
alarms due to stale services.

I could easily reproduce the problem by feeding lots of results to 
Nagios right when I was expecting a batch of passive results - this 
would cause random results to be dropped. I spent some time trying to 
debug this but I couldn't figure out where commands were dropped. My 
primary suspect was the ring buffer used by the command reaper. As far as 
I can remember I tested with versions of Nagios ranging from 2.3 to 2.5; 
I never tried with a recent version.
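
For anyone who wants to try the same kind of reproduction, here is a
rough sketch of how a burst of passive results could be generated. It is
not the script I used back then. It only relies on the standard
PROCESS_SERVICE_CHECK_RESULT external command format; the host names,
service descriptions and the command file path are made up and must
match objects and settings in your own configuration.

```python
import time


def build_passive_results(hosts, services, return_code=0,
                          output="OK - synthetic result", now=None):
    """Return one external-command line per (host, service) pair.

    Line format:
    [timestamp] PROCESS_SERVICE_CHECK_RESULT;host;service;return_code;output
    """
    ts = int(time.time()) if now is None else int(now)
    return [
        f"[{ts}] PROCESS_SERVICE_CHECK_RESULT;{host};{svc};{return_code};{output}\n"
        for host in hosts
        for svc in services
    ]


def submit(lines, command_file):
    """Write all lines to the Nagios external command file (a named pipe)."""
    with open(command_file, "w") as pipe:
        pipe.writelines(lines)


if __name__ == "__main__":
    # Hypothetical names: replace with hosts/services defined in Nagios.
    hosts = [f"testhost{i:03d}" for i in range(60)]
    services = ["cron-check-a", "cron-check-b"]
    burst = build_passive_results(hosts, services)
    print(f"built {len(burst)} results")
    # The path below is an assumption; use the command_file value
    # from your nagios.cfg before uncommenting.
    # submit(burst, "/usr/local/nagios/var/rw/nagios.cmd")
```

Submitting such a burst at the moment the cron-scheduled results arrive,
then comparing the last-check times of the affected services, should
show whether anything was dropped.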

If you're running a recent version of Nagios, what do you get for 
"Used/High/Total Command Buffers" in the "nagiostats" command output? 
(You can also get these numbers from the web interface, under 
"Performance Info" in the left bar.) If the buffers seem to be maxed 
out, you may try setting "command_check_interval" to "-1" and raising 
the "external_command_buffer_slots" option in nagios.cfg.
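
If you want to watch these numbers over time, something along these
lines could pull them out of the nagiostats text output. This is only a
sketch: the exact spacing of the "Used/High/Total Command Buffers" line
and the 90% threshold are my assumptions, so check them against what
your nagiostats actually prints.

```python
import re

# Matches e.g. "Used/High/Total Command Buffers:   12 / 4096 / 4096"
# (the spacing is assumed; the regex tolerates any amount of whitespace).
_BUFFER_RE = re.compile(
    r"Used/High/Total Command Buffers:\s*(\d+)\s*/\s*(\d+)\s*/\s*(\d+)"
)


def parse_command_buffers(nagiostats_output):
    """Return used/high/total buffer counts, or None if the line is absent."""
    match = _BUFFER_RE.search(nagiostats_output)
    if match is None:
        return None
    used, high, total = (int(group) for group in match.groups())
    return {"used": used, "high": high, "total": total}


def buffers_maxed_out(stats, threshold=0.9):
    """True if the high-water mark reached `threshold` of the total slots.

    The 0.9 default is an arbitrary illustration, not a Nagios setting.
    """
    return stats["total"] > 0 and stats["high"] >= threshold * stats["total"]


if __name__ == "__main__":
    # Made-up sample line; in practice feed in the real nagiostats output.
    sample = "Used/High/Total Command Buffers:        12 / 4096 / 4096\n"
    stats = parse_command_buffers(sample)
    print(stats, buffers_maxed_out(stats))
```

A "High" value equal to "Total" means the buffer has filled up at least
once, which is the case where raising "external_command_buffer_slots"
is worth trying.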


If you're still having this problem with Nagios v3 and up I might try to 
reproduce this as well, and maybe I'll be able to figure out what's 
wrong this time.

-- 
Thomas

_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue. 
::: Messages without supporting info will risk being sent to /dev/null