Possible bug in Nagios 2.12?

Steven D. Morrey smorrey at ldschurch.org
Fri Apr 17 22:32:37 CEST 2009


>> Steven D. Morrey wrote:
>> <snip>
>>> How? Using no delay at all between attempts would be
>>> rather devastating, since spinlocks eat CPU like mad.
>> </snip>
>>
>> <snip>
>> Stop thinking small.  When you have many thousands of checks
>> to run, tiny delays persist and add up.  A second here, a
>> second there, and pretty soon you're talking real time.
>> </snip>
>>
>
>Well, stop thinking small yourself and use a distributed solution ;-)

Why distribute when a single box is more than capable of handling the load? That brings me back to the reason I came here, seeking your wisdom ;-)
Besides, we are already distributed using DNX.

>> <snip>
>>> There's no real design issue.
>> </snip>
>>
>> <snip>
>> The design issue is that delays build up and become very
>> observable.
>> </snip>
>>
>
>That's not a design issue, it's just a fact of life.

It's both.
Under the existing design, which IS on the whole a good one, delays can build up.
The best fix at the moment is to reduce the amount of time spent in sleep. As you said earlier, sched_yield() does appear to be the best option under the current design.

>> I've removed the sleep in my version of Nagios and the throughput difference is DRAMATIC.

>Do you have a lot of unparallelizable checks?

No, it turns out we don't have any. But we do have 28,000 checks, an average check latency of around 130 seconds, and very low CPU usage. We saw the sleep and thought it might explain the high latency. It turns out that the dramatic throughput increase we saw when we removed the sleep was very short-lived: after about an hour, the latency began to increase again.

>> That said other things are having a hard time running on the same machine.

>Including plugins and the reaper threads of Nagios ;-)

Plugins seem to run just fine when they are actually run. The problem is that they aren't being run often enough. As far as we know, we aren't overloading the reaper threads; at least we aren't getting any "Warning: Overflow detected in service check result buffer - %ul message(s) lost." messages.

>> I'm going to sprinkle some yields where the sleeps are at and see if that helps, I'll keep you apprised.
>>

>That's a very good idea (replacing sleep(1) with sched_yield()). Just
>make sure it keeps on working on AIX and Solaris and stuff like that,
>where Nagios compiles and runs just fine today.

>I'd prefer if you did it with a helper function to make it easier to
>support various operating systems that need it without duplicating a
>lot of code.

I've attached a patch and am seeking comments.
It won't cure cancer but if you do have non-parallel checks it may reduce your overall latency :)
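
For anyone reading without the attachment, here is a minimal sketch of the kind of helper function discussed above. It is not the code in thread_yield.patch. The name yield_cpu() and the 1 ms nanosleep() fallback are assumptions made for illustration, and the feature test is only one plausible way to decide whether sched_yield() is available.

```c
#include <time.h>
#include <unistd.h>

#if defined(_POSIX_PRIORITY_SCHEDULING) && (_POSIX_PRIORITY_SCHEDULING > 0)
#include <sched.h>
#define HAVE_SCHED_YIELD 1
#endif

/*
 * Hypothetical helper (not taken from the patch): give up the CPU
 * briefly instead of calling sleep(1) between attempts.
 * Returns 0 on success, -1 on error, like the calls it wraps.
 */
static int yield_cpu(void)
{
#ifdef HAVE_SCHED_YIELD
    /* Let other runnable threads or processes go first. */
    return sched_yield();
#else
    /* Assumed fallback for platforms without sched_yield(): sleep 1 ms. */
    struct timespec ts = { 0, 1000000L };
    return nanosleep(&ts, NULL);
#endif
}
```

A wait loop that currently calls sleep(1) between attempts would call yield_cpu() instead. Keeping the platform check in one place is what makes it easier to support AIX, Solaris and the like without duplicating code.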

Sincerely,
Steve


-------------- next part --------------
A non-text attachment was scrubbed...
Name: thread_yield.patch
Type: text/x-patch
Size: 5099 bytes
Desc: thread_yield.patch
URL: <https://www.monitoring-lists.org/archive/developers/attachments/20090417/6c47fe12/attachment.bin>
-------------- next part --------------
_______________________________________________
Nagios-devel mailing list
Nagios-devel at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-devel

