Nagios 3.1.1 eats cpu like mad

Andreas Ericsson ae at op5.se
Tue Jun 23 18:29:13 CEST 2009


There's a bug in Nagios 3.1.1 that makes it eat all available CPU,
even with a very small configuration (5 hosts, 12 service checks).

I sort of introduced it, as I didn't fully test the impact of a patch
sent in before accepting it. Mea culpa, so I'll make sure to fix it.

The patch shown inline below makes Nagios consume 100% CPU on my
system. I don't know the exact reason yet, but I'll investigate it
and see how it can be fixed. I *think* it happens because Nagios sees
that "current_time" is valid and therefore returns precisely that
from get_next_valid_time(), which means it keeps pushing all the
scheduled checks in front of it until enough time has passed since
the check was last *run* before actually executing it. Obviously,
that sucks major donkeyballs, so we really shouldn't do that. I'll
need to look into it a bit more closely before I can say with 100%
certainty that that's what's happening, though.
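
To make the suspicion above concrete, here is a toy model of it. This
is not Nagios code: toy_timeperiod, toy_next_valid_time() and
count_wakeups() are made-up stand-ins, and the one-second tick loop
is an assumption for illustration only. It just shows that asking for
the next valid time after "now" hands back "now" whenever the
timeperiod is currently valid, so the check is due again on every
pass through the scheduler.

```c
#include <time.h>

/* Hypothetical stand-in for a timeperiod: valid from `start` onwards. */
typedef struct {
    time_t start;
} toy_timeperiod;

/* Toy version of the lookup: if `wanted` is already valid it is
 * returned unchanged, otherwise the first valid time is returned. */
time_t toy_next_valid_time(time_t wanted, const toy_timeperiod *tp)
{
    return (wanted < tp->start) ? tp->start : wanted;
}

/* Pre-patch behaviour: look for a valid time at or after now + interval. */
time_t next_check_preferred(time_t now, time_t interval, const toy_timeperiod *tp)
{
    return toy_next_valid_time(now + interval, tp);
}

/* Post-patch behaviour: look for a valid time at or after now. */
time_t next_check_current(time_t now, time_t interval, const toy_timeperiod *tp)
{
    (void)interval; /* the interval no longer influences the result */
    return toy_next_valid_time(now, tp);
}

typedef time_t (*reschedule_fn)(time_t, time_t, const toy_timeperiod *);

/* Count how often the check comes due in `window` one-second ticks.
 * A wake-up is a trip through the scheduler, not necessarily an
 * executed check. */
long count_wakeups(reschedule_fn reschedule, time_t interval,
                   time_t window, const toy_timeperiod *tp)
{
    time_t next_check = 0;
    long wakeups = 0;

    for (time_t now = 0; now < window; now++) {
        if (next_check <= now) {
            wakeups++;
            next_check = reschedule(now, interval, tp);
        }
    }
    return wakeups;
}
```

With a 300 second interval, a 600 second window and a timeperiod
that is always valid, count_wakeups() gives 2 for
next_check_preferred and 600 for next_check_current in this model.
Whether the real scheduler behaves the same way is exactly what still
needs confirming.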

-8<--8<--8<-
commit 523e8c516df323a0bafe98ecb9222384fde62d6e
Author: Andreas Ericsson <ae at op5.se>
Date:   Fri May 22 01:38:28 2009 +0000

    Fix service rescheduling on clock skew/timeperiod change
    
    This patch ensures that services and hosts are never scheduled one
    year into the future and set to never be rescheduled again.
    
    Previously, this could happen if the next preferred time happened
    to already be valid, but stops being so because of clock skew or
    someone changing the timeperiod definition between two Nagios
    restarts while retaining scheduling info.
    
    Patch-sent-by: Ricardo Maraschini <ricardo.maraschini at opservices.com.br>
    Signed-off-by: Andreas Ericsson <ae at op5.se>

diff --git a/base/checks.c b/base/checks.c
index 9d5c497..ef50a20 100644
--- a/base/checks.c
+++ b/base/checks.c
@@ -277,7 +277,7 @@ int run_scheduled_service_check(service *svc, int check_options, double latency)
 				preferred_time=current_time+((svc->check_interval<=0)?300:(svc->check_interval*interval_length));
 
 			/* make sure we rescheduled the next service check at a valid time */
-			get_next_valid_time(preferred_time,&next_valid_time,svc->check_period_ptr);
+			get_next_valid_time(current_time,&next_valid_time,svc->check_period_ptr);
 
 			/* the service could not be rescheduled properly - set the next check time for next year, but don't actually reschedule it */
 			if(time_is_valid==FALSE && next_valid_time==preferred_time){
@@ -2792,7 +2792,7 @@ int run_scheduled_host_check_3x(host *hst, int check_options, double latency){
 				preferred_time=current_time+((hst->check_interval<=0)?300:(hst->check_interval*interval_length));
 
 			/* make sure we rescheduled the next host check at a valid time */
-			get_next_valid_time(preferred_time,&next_valid_time,hst->check_period_ptr);
+			get_next_valid_time(current_time,&next_valid_time,hst->check_period_ptr);
 
 			/* the host could not be rescheduled properly - set the next check time for next year, but don't actually reschedule it */
 			if(time_is_valid==FALSE && next_valid_time==preferred_time){
-8<--8<--8<-


-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.
