Nagios and Gearman - huge environment performance problem

Paul M. Dubuc work at paul.dubuc.org
Mon Aug 22 17:02:07 CEST 2011


Rodney Ramos wrote:
> Thanks, Daniel, but I don't think that my problem is hardware. I
> created the ramdisk and the problem is the same:
>   - nagios eating 100% of CPU all the time;
>   - nagios does not distribute the active checks smoothly. It
> waits a long time and then runs the active checks in a burst. I can see
> this with gearman_top. The gearmand job queue is empty
> almost all the time, but sometimes there is a burst of jobs in the
> queue. I can't understand this behavior.
>
> Any help would be great. Thanks everybody.

I ran into a problem with Nagios "bunching up" the active service check 
schedule at the beginning of a check period or after a number of checks have 
been inhibited for a while.  See the discussion here:

http://sourceforge.net/mailarchive/forum.php?thread_name=4CF54C4F.2000500%40paul.dubuc.org&forum_name=nagios-users

We use check_periods with a time period that reflects our regularly scheduled
downtimes.  As a downtime approaches, Nagios schedules all of the affected
checks for the same moment, the time the downtime ends.  The nagios.cfg
auto-rescheduling settings mentioned in the discussion referenced above help,
but not much.  What seems to help more is to set
use_retained_scheduling_info=0 and schedule a Nagios restart at the end of our
downtime.  That forces Nagios to reschedule all the checks and spread them out
again.
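
To make the difference concrete, here is a rough Python sketch comparing the
two situations.  It is not Nagios's scheduler code, only an illustration: the
function names, the 600 checks and the 5-minute interval are all made-up
assumptions, and the even spacing is just the simplest way to model a
schedule that has been spread out again after a restart.

```python
from collections import Counter


def bunched_schedule(num_checks, downtime_end):
    """All checks deferred to one instant: the end of the excluded period."""
    return [downtime_end] * num_checks


def spread_schedule(num_checks, start, check_interval):
    """First runs spaced evenly across one check interval (assumed model)."""
    delay = check_interval / num_checks
    return [start + i * delay for i in range(num_checks)]


def peak_per_second(times):
    """Largest number of checks that start within the same whole second."""
    return max(Counter(int(t) for t in times).values())


# Hypothetical numbers, chosen only for illustration.
checks = 600
interval = 300.0        # seconds between runs of each check
downtime_end = 3600.0   # seconds since some arbitrary reference time

bunched = bunched_schedule(checks, downtime_end)
spread = spread_schedule(checks, downtime_end, interval)

print("bunched peak:", peak_per_second(bunched))
print("spread peak: ", peak_per_second(spread))
```

With these assumed numbers the bunched schedule starts every check in the same
second, while the spread schedule starts only a couple per second.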

It would be nice if Nagios could maintain the schedule spread when checks are
disabled this way.  Trying to run hundreds of checks at the same instant can't
be a good thing.  As for why we use check_periods instead of scheduled
downtime (aside from the fact that Nagios doesn't natively support recurring
scheduled downtime), I'll repeat what I wrote before:


On 11/23/2010 03:50 AM, Andreas Ericsson wrote:

> You really should be using scheduled downtime for regular downtime though.
> There are pre-hacked solutions to automagically reschedule re-occurring
> downtime. Ninja supports it out of the box as of the latest version (or
> possibly latest git).

There are some cases where we really should not be running the checks during
downtime because of the extra load they put on our system when they fail.
(Checks are still run during scheduled downtime, if I'm not mistaken; only
notifications are inhibited.)  Many of our checks fail in this case by timing
out, and they use relatively scarce, shared, resource-intensive processes
(web browser sessions run under SeleniumRC).  Timeouts tend to be long for
these checks, so contention for these processes increases when all the checks
using them start failing, and the checks are run more often until they all
reach a 'hard' failure state.  Maybe we can live with this, but it would
be easier on the system to just inhibit checks we know are going to fail
during certain regularly scheduled downtimes.





