Problems with many hanging Nagios processes (Nagios spawning rogue nagios processes eventually crashing Nagios server)

Hendrik Bäcker andurin at process-zero.de
Fri Dec 22 13:21:38 CET 2006


Hi all,

As mentioned in Ethan's thread about testing the current branch version, I
am afraid the problems do not lie only in the buffers.

I talked to a colleague of mine and we looked at the sources, especially
event.c around line 1079:

####
        if(run_event==TRUE){

                /* remove the first event from the timing loop */
                temp_event=event_list_low;
                event_list_low=event_list_low->next;

                /* handle the event */
                handle_timed_event(temp_event);   /* <-- this is line 1079 */

                /* reschedule the event if necessary */
                if(temp_event->recurring==TRUE)
                        reschedule_event(temp_event,&event_list_low);

                /* else free memory associated with the event */
                else
                        free(temp_event);
                }
####

The function handle_timed_event() itself starts further down, at line 1154.

If I am right, this is the worker part that does everything for Nagios:
it starts checks, collects check results (the reaper), runs freshness
checks, and so on.

Is this part executed serially (one event after another), or is it
threaded somewhere earlier?
If it is serialized, wouldn't it be possible to parallelize it?
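
To illustrate what "serialized" would mean here, the following is a minimal
sketch and not the Nagios code. All names (sketch_event, sketch_run_serial,
the simulated clock) are made up for the example, and the "cost" field is an
assumed stand-in for however long a real handler runs. It only shows the
effect: one slow event delays every event queued behind it.

```c
#include <stddef.h>

#define SKETCH_MAX_LOG 16

/* Hypothetical, simplified stand-in for a timed event. */
typedef struct sketch_event {
    int id;
    int cost;                     /* pretend run time of the handler, in seconds */
    struct sketch_event *next;
} sketch_event;

static int sketch_now = 0;                        /* simulated clock */
static int sketch_start_time[SKETCH_MAX_LOG];     /* when each handler started */
static int sketch_handled = 0;                    /* number of events handled */

/* "Handle" one event: record its start time, then advance the clock. */
static void sketch_handle_event(const sketch_event *ev)
{
    if (sketch_handled < SKETCH_MAX_LOG)
        sketch_start_time[sketch_handled] = sketch_now;
    sketch_handled++;
    sketch_now += ev->cost;
}

/* Drain the list strictly one event after another, as a serial loop would.
 * Returns the simulated time at which the last event finished. */
static int sketch_run_serial(sketch_event *list)
{
    while (list != NULL) {
        sketch_event *ev = list;
        list = list->next;        /* remove the first event from the list */
        sketch_handle_event(ev);  /* nothing else runs until this returns */
    }
    return sketch_now;
}
```

With three queued events costing 1, 5 and 1 seconds, the third one cannot
start before second 6, although it only needs one second itself.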

Does anyone know how long a call to handle_timed_event() typically takes?
(Just asking up front; after sending this mail I will test it myself by
compiling with debug3.)
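
One possible way to measure this is to take a timestamp before and after the
call. The following is only a sketch under assumptions: the helper names
(sketch_elapsed_usec, sketch_time_call, sketch_count_handler) are invented
for the example and are not part of Nagios, and it uses the POSIX
gettimeofday() call, which measures wall-clock time and can be skewed if the
system clock is adjusted during the measurement.

```c
#include <stddef.h>
#include <sys/time.h>

/* Difference between two timestamps in microseconds. */
static long sketch_elapsed_usec(const struct timeval *start,
                                const struct timeval *end)
{
    return (long)(end->tv_sec - start->tv_sec) * 1000000L
         + (long)(end->tv_usec - start->tv_usec);
}

/* Run fn(arg) and return its wall-clock run time in microseconds. */
static long sketch_time_call(void (*fn)(void *), void *arg)
{
    struct timeval start, end;

    gettimeofday(&start, NULL);
    fn(arg);
    gettimeofday(&end, NULL);

    return sketch_elapsed_usec(&start, &end);
}

/* Trivial example handler: counts how often it was called. */
static void sketch_count_handler(void *arg)
{
    int *counter = arg;
    (*counter)++;
}
```

The same before/after pattern could be placed around the call on line 1079
and the result written to the debug log, which would show whether single
events block the loop for a long time.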

Just my 2 cents.

Best wishes
Hendrik


Ton Voon wrote:
> Hi Mahesh,
>
> On 19 Dec 2006, at 00:42, Mahesh Kunjal wrote:
>
>> Here is what we did to resolve.
>>
>> 1. Edit the include/nagios.h.in
>> change
>> #define COMMAND_BUFFER_SLOTS 1024
>> to
>> #define COMMAND_BUFFER_SLOTS 60000
>>
>> And change
>> #define SERVICE_BUFFER_SLOTS 1024
>> to
>> #define SERVICE_BUFFER_SLOTS 60000
>>
>
> I was intrigued by this as we have a performance issue, but not with
> the same symptoms. Our problem is that NSCA processes increase when
> the nagios server is under load. They appear to be blocking on writing
> to the command pipe. Switching NSCA to single daemon mitigates the
> problem (slaves will timeout their passive results), but we wanted to
> know where any slow downs could be.
>
> From your findings, we've created a performance statistics patch,
> attached. It collects the maximum and current values for the command
> and service buffer slots and writes them to status.dat (by default
> every 10 seconds). What I found with a fake slave sending 128 results
> every 5 seconds was that the maximum values were fairly low (under
> 100), but when I put the server under load, the
> maximum_command_buffer_items shot up to 1969 and the
> maximum_service_buffer_items shot up to 2156 (I had changed the slots
> from the defaults to your 60000).
>
> This could show if the buffer is filled at various points or if there
> is not enough data ready for Nagios to process further down the chain.
>
> I'd be interested in figures from other systems.
>
> Warning: the patch is not thread safe, so there are no guarantees that
> the statistics data will not be corrupted (but it should not affect
> usual Nagios operation). Applies onto Nagios 2.5. Tested on Debian with
> a 2.6 kernel.
>
> Ton
>
> http://www.altinity.com
> T: +44 (0)870 787 9243
> F: +44 (0)845 280 1725
> Skype: tonvoon
>
> _______________________________________________
> Nagios-devel mailing list
> Nagios-devel at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-devel
>   

