Nagios 3.1.1 eats cpu like mad

Ethan Galstad egalstad at nagios.org
Tue Aug 11 18:52:56 CEST 2009


Hiren Patel wrote:
> Ricardo Maraschini wrote:
>> I couldn't simulate the problem with a static configuration, so let me
>> explain how I reproduce it by changing the timeperiod
>> configuration:
>>
>> 0. Create a service with active checks enabled scheduled to check
>> every 5 minutes
>>
>> 1. Associate this service with a timeperiod (initially it can be 24x7)
>>
>> 2. Wait until the service check and its reschedule occur.
>>    Let's say that the check occurs at 10:00AM and the next check gets
>> scheduled for 10:05AM
>>
>> 3. Stop nagios
>>
>> 4. Change your timeperiod configuration to invalidate the next service
>> check:
>>    Using the above example, you change the service timeperiod
>> configuration to check only from 10:07AM to 24:00. The important thing
>> for simulating the problem is that the following scheduled service
>> check (10:10AM) remains valid.
>>
>> 5. Start nagios
>>
>> 6. Wait until the previously scheduled service check (10:05AM) occurs.
>>
>> The behaviour will change according to your nagios version. On previous
>> versions the service is scheduled for next year; on the latest stable
>> release it is scheduled for next week and a message is printed in the
>> log files.
>>
>> Below you can see an email I sent on April 2nd about the same
>> issue; it may be useful.
>> Good luck, if you need any other info, please let me know.
>>
> 
> thank you kindly for the explanation above on how to simulate the issue,
> I was able to simulate it using exactly the steps you mentioned.
> for me the problem is again the function that gets the next valid time,
> it returns void so there's no chance of getting an error return value
> from it, but it also sets the next valid time to the preferred time on
> two conditions, one being the preferred time is valid, the other being
> it can't find a good next valid time. I think this function needs to
> return int, and either OK or ERROR separating the two conditions above.
> in any case, the changes you suggested were problematic in one way, the
> run_async_service_check function can return error on a few occasions,
> not limited to the time being invalid. one such condition could be
> dependency constraints, now if we used current_time to get the next
> valid time for such a case, it would return current_time right back, so
> nagios will schedule that check right away, and when run again, loop in
> the same manner over and over. this I suspect caused the cpu eating seen
> with that diff.
> please test the attached diff if you don't mind. anyone else with
> better/bigger test environments than me could also try this, to see that
> it does not eat cpu like it was.
> I'd consider this a workaround; the function should be fixed long term.
> 

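For readers following the thread, the suggestion quoted above (have the
next-valid-time function return OK or ERROR instead of void) can be sketched
roughly as below. This is an illustration only, not the Nagios source: the
names find_next_valid_time and simple_timeperiod, and the single daily
window measured in seconds, are assumptions made for the example.

```c
#include <stddef.h>

#define OK     0
#define ERROR -1

#define SECONDS_PER_DAY 86400L

/* Hypothetical, simplified timeperiod: one daily window [start, end),
 * expressed in seconds since midnight. Real timeperiods are richer. */
typedef struct {
    long start;
    long end;
} simple_timeperiod;

/* Returns OK and stores the next valid time in *valid, or returns ERROR
 * when no valid time can be found. On ERROR, *valid falls back to the
 * preferred time, mirroring the fallback described in the quoted text,
 * but the caller can now tell the two conditions apart. */
int find_next_valid_time(long preferred, long *valid,
                         const simple_timeperiod *tp)
{
    long day, offset;

    if (valid == NULL)
        return ERROR;

    /* No usable window: nothing can ever be valid. */
    if (tp == NULL || tp->start >= tp->end || preferred < 0) {
        *valid = preferred;
        return ERROR;
    }

    day = preferred / SECONDS_PER_DAY;
    offset = preferred % SECONDS_PER_DAY;

    if (offset < tp->start)
        *valid = day * SECONDS_PER_DAY + tp->start;       /* later today */
    else if (offset < tp->end)
        *valid = preferred;                               /* already valid */
    else
        *valid = (day + 1) * SECONDS_PER_DAY + tp->start; /* tomorrow */

    return OK;
}
```

With a window of 10:07 to 24:00, a preferred time of 10:05 is moved to
10:07 and a preferred time of 10:10 is returned unchanged, both with OK;
only an unusable window yields ERROR.
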
I replicated the bug and have just posted a fix to CVS.  The logic was
broken, either because of the recent timeperiod check logic changes or
ever since the 3.x check logic redesign.

I wasn't able to replicate any CPU hogging, so I'm not sure if that is a
separate issue that needs to be fixed elsewhere.
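
The CPU-eating loop described in the quoted text can be pictured with a
small simulation. This is a sketch under stated assumptions, not Nagios
code: run_check_stub and count_attempts are hypothetical names, and a
retry_delay of 0 stands in for the case where asking for the next valid
time from current_time hands current_time right back.

```c
#define OK     0
#define ERROR -1

/* Hypothetical stand-in for a check that is refused for a reason other
 * than the timeperiod (for example a dependency constraint). */
static int run_check_stub(long now)
{
    (void)now;
    return ERROR;
}

/* Counts how many times the check is attempted between `start` and
 * `start + horizon` (simulated seconds). The cap keeps the simulation
 * finite when the schedule never advances. */
int count_attempts(long start, long horizon, long retry_delay)
{
    const int cap = 1000;
    long next = start;
    int attempts = 0;

    while (next < start + horizon && attempts < cap) {
        long now = next;

        attempts++;
        if (run_check_stub(now) == ERROR) {
            /* retry_delay == 0: the check is rescheduled for "now" and
             * runs again immediately, over and over. */
            next = now + retry_delay;
        }
    }
    return attempts;
}
```

Over a simulated hour, a 300 second retry delay gives 12 attempts, while
a delay of 0 only stops when it hits the cap of 1000.
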


- Ethan Galstad
