Nagios 3.1.1 eats cpu like mad

Hiren Patel hir3npatel at gmail.com
Mon Aug 10 16:54:43 CEST 2009


Ricardo Maraschini wrote:
> I couldn't simulate the problem with a static configuration, so let me try to explain how I simulate the problem by changing the timeperiod configuration:
> 
> 0. Create a service with active checks enabled scheduled to check every 5 minutes
> 
> 1. Associate this service with a timeperiod (initially it can be 24x7)
> 
> 2. Wait until the service check and reschedule occur
>    Let's say that the check occurs at 10:00AM and the next check gets scheduled for 10:05AM
> 
> 3. Stop nagios
> 
> 4. Change your timeperiod configuration to invalidate the next service check:
>    Using the above example, you change the service timeperiod configuration to check only from 10:07AM to 24:00. The important thing for simulating the problem is that the following scheduled service check (10:10AM) remains valid.
> 
> 5. Start nagios
> 
> 6. Wait until the previously scheduled service check (10:05AM) occurs.
> 
> The behaviour will change according to your nagios version. On previous versions the service is scheduled for next year; on the latest stable release it is scheduled for next week and a message is printed in the log files.
> 
> Below you can see an email I sent on April 2nd about the same issue; it may be useful.
> Good luck, if you need any other info, please let me know.
> 

Thank you kindly for the explanation above of how to simulate the issue; 
I was able to simulate it using exactly the steps you mentioned.
For me the problem is again the function that gets the next valid time. 
It returns void, so there is no chance of getting an error return value 
from it, and it sets the next valid time to the preferred time in two 
different cases: when the preferred time is itself valid, and when it 
cannot find a good next valid time at all. I think this function needs 
to return int, either OK or ERROR, so that the two cases can be told 
apart.
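
As a rough illustration of that idea, here is a minimal sketch of a 
lookup that reports a status instead of returning void. It is not Nagios 
code: toy_timeperiod, toy_next_valid_time, TOY_OK and TOY_ERROR are all 
hypothetical names, and the timeperiod is reduced to a single window of 
absolute times purely to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical status codes, standing in for OK / ERROR. */
#define TOY_OK     0
#define TOY_ERROR (-1)

/* Hypothetical, heavily simplified timeperiod: one window of valid times. */
typedef struct {
    time_t start; /* first valid second */
    time_t end;   /* first second that is no longer valid */
} toy_timeperiod;

/*
 * Store the next valid time at or after `preferred` in *valid.
 * Returns TOY_OK when *valid lies inside the period, TOY_ERROR when no
 * valid time could be found. In the error case *valid is still set to
 * `preferred`, so the fallback value is the same as in the "preferred
 * time is valid" case, but the return value now tells the two apart.
 */
int toy_next_valid_time(time_t preferred, const toy_timeperiod *tp,
                        time_t *valid)
{
    assert(valid != NULL);

    if (tp == NULL) {              /* no period: every time is valid */
        *valid = preferred;
        return TOY_OK;
    }
    if (preferred < tp->start) {   /* before the window: jump to its start */
        *valid = tp->start;
        return TOY_OK;
    }
    if (preferred < tp->end) {     /* inside the window: keep preferred */
        *valid = preferred;
        return TOY_OK;
    }
    *valid = preferred;            /* past the window: nothing found */
    return TOY_ERROR;
}
```

With a status like this, a caller could reschedule normally on TOY_OK 
and handle TOY_ERROR explicitly, instead of treating the fallback value 
as a usable check time.
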
In any case, the changes you suggested were problematic in one way: the 
run_async_service_check function can return an error on a few occasions, 
not only when the time is invalid. One such condition could be 
dependency constraints. If we used current_time to get the next valid 
time in such a case, it would return current_time right back, so Nagios 
would schedule that check right away, fail again when it ran, and loop 
in the same manner over and over. This, I suspect, caused the CPU eating 
seen with that diff.
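
The loop described above can be avoided in more than one way. The sketch 
below shows one possibility and is not necessarily what the attached 
diff does: toy_reschedule_after_failure and its parameters are 
hypothetical, and it assumes the caller knows whether the failure was 
caused by an invalid time.

```c
#include <assert.h>
#include <time.h>

/*
 * Hypothetical helper: choose when to retry a check that could not run.
 *
 *   now            - current time
 *   preferred      - next valid time suggested by the timeperiod lookup
 *   time_is_valid  - nonzero if the failure was NOT caused by the time
 *   check_interval - normal check interval in seconds (must be > 0)
 */
time_t toy_reschedule_after_failure(time_t now, time_t preferred,
                                    int time_is_valid,
                                    time_t check_interval)
{
    assert(check_interval > 0);

    if (!time_is_valid) {
        /* Time-related failure: trust the suggested time only if it is
         * really in the future; otherwise step forward one interval. */
        return (preferred > now) ? preferred : now + check_interval;
    }

    /* Failure for another reason (for example a dependency). Asking for
     * the next valid time from `now` would hand back `now` and cause an
     * immediate re-run, so always move forward by one interval. */
    return now + check_interval;
}
```

The point is only that every failure path returns a time strictly later 
than now, so a check that keeps failing cannot be rescheduled into a 
tight loop.
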
Please test the attached diff if you don't mind. Anyone else with a 
better/bigger test environment than mine could also try it, to see that 
it no longer eats CPU like it did.
I'd consider this a workaround; the function itself should be fixed in 
the long term.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: checks.diff
Type: text/x-patch
Size: 661 bytes
Desc: not available
URL: <https://www.monitoring-lists.org/archive/developers/attachments/20090810/d8b812ac/attachment.bin>