Possible bug in Nagios 2.12?

eponymous alias eponymousalias at yahoo.com
Thu Apr 9 19:30:03 CEST 2009


>>>> I'm not seeing anywhere that
>>>> (event_list_low = event_list_low->next)
>>>> unless the event actually runs.
>>>
>>> That's correct, although the loop is broken out of if:
>>> * The check shouldn't be run right now due to global
>>>   options
>>> * The check shouldn't be run right now due to temporary
>>>   setting
>>>
>>> However, if the check can't be run immediately
>>> due to too many checks running at that moment,
>>> or due to the check not being parallelizable
>>> and *any* other check is running, the only
>>> sensible thing to do is to sleep 1 second
>>> and then try again. This is what Nagios does.
>>
>> Ah, no.  The really sensible thing to do would
>> be to wait only until all the blocking checks
>> are done (either just one of "too many", or
>> all other checks in the parallelization case).
>
> How? Using no delay at all between attempts would be
> rather devastating, since spinlocks eat CPU like mad.

Did I say "spinlock"?  No, I did not.

sigtimedwait() would be a better choice, waiting for
SIGCHLD to show up, while allowing a controlled timeout
to cover the case when no child ever returns.
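
As a rough illustration of that idea (a sketch only, not code from
Nagios): SIGCHLD is blocked so it stays pending, and the scheduler
waits on it with a bounded timeout instead of sleeping a fixed second.
The helper names block_sigchld() and wait_for_child() are hypothetical.
A threaded program would use pthread_sigmask() rather than
sigprocmask().

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* No-op handler: installed only so that SIGCHLD is not in the
 * "default action is ignore" state, where POSIX lets an
 * implementation discard it even while blocked. */
static void sigchld_noop(int signo)
{
    (void)signo;
}

/* Hypothetical helper: block SIGCHLD so it is held pending rather
 * than delivered asynchronously. Call once, before forking checks.
 * Returns 0 on success, -1 on error. */
static int block_sigchld(void)
{
    struct sigaction sa;
    sigset_t set;

    sa.sa_handler = sigchld_noop;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGCHLD, &sa, NULL) == -1)
        return -1;

    sigemptyset(&set);
    sigaddset(&set, SIGCHLD);
    return sigprocmask(SIG_BLOCK, &set, NULL);
}

/* Hypothetical helper: wait until a child changes state or
 * timeout_ms elapses. Returns 1 if SIGCHLD was received, 0 on
 * timeout, -1 on any other error. For simplicity the full timeout
 * is restarted if the call is interrupted by an unrelated signal. */
static int wait_for_child(long timeout_ms)
{
    sigset_t set;
    struct timespec ts;
    int sig;

    sigemptyset(&set);
    sigaddset(&set, SIGCHLD);
    ts.tv_sec = timeout_ms / 1000;
    ts.tv_nsec = (timeout_ms % 1000) * 1000000L;

    do {
        sig = sigtimedwait(&set, NULL, &ts);
    } while (sig == -1 && errno == EINTR);

    if (sig == SIGCHLD)
        return 1;
    if (sig == -1 && errno == EAGAIN)
        return 0;   /* timed out with no child activity */
    return -1;
}
```

With this shape, the wait ends as soon as a blocking check finishes,
and the timeout only matters when nothing returns at all.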

> Nagios doesn't catch SIGCHLD in that thread (and nor
> can it, or the reaper process wouldn't know when it
> should reap child results).

Huh?  There is no SIGCHLD handling anywhere else in the
code right now.  And properly-constructed receipt of
SIGCHLD does not interfere with any waitpid() activity.
Besides, even in Nagios 3.0.6, this call in base/events.c:
    while((wait_result=waitpid(-1,NULL,WNOHANG))>0);
isn't really dependent on "knowing when to reap";
all it does is grab whatever is ready at that time.

>> Sleeping for a full second regardless of when
>> the blocking checks complete can waste time
>> between when the next plugin could run and
>> when it actually does.  And with enough checks
>> introducing these extra arbitrary delays, the
>> overall latency for the full set of checks can
>> easily creep up.
>
> So sleep some less then, but I'm not sure what you're
> hoping to achieve by doing that since you'd be decreasing
> the maximum latency of a single check by slightly less
> than one second, and less than 0.5 second on average.

Stop thinking small.  When you have many thousands of checks
to run, tiny delays persist and add up.  A second here, a
second there, and pretty soon you're talking real time.

> Hardly worth bothering with imo, unless none of your checks
> are parallelizable or you've managed to horribly misconfigure
> your max_parallel_service_checks (or whatever it's called).

max_concurrent_checks

>> Whether it would be simple to make that happen
>> in a particular software architecture is a
>> separate discussion; I'm just pointing out
>> the design issue here.
>
> There's no real design issue.

The design issue is that delays build up and become very
observable.

My "Whether" comment can be read as "there may be a gap
between what is sensible and possible, and what can be done
with only modest changes to the existing code".  There is
nothing sacred about the current structure of the code.
The tricky part of any revised structure would be to handle
signal blocking correctly, and to account for possible race
conditions.  Some alternative structural choices might
involve using extra threads and synchronization primitives
between the threads.  These are solvable problems, though.

> Nagios could sleep a little less, but it would provide
> such a microscopic correction for the average service
> that it really isn't worth it.

The end-users think differently.



      
