BUG? Segfault & coredump with scheduled downtime, downtime scheduled horked

karl.kornel at mindspeed.com
Fri Aug 18 02:14:53 CEST 2006


Dear fellow Nagios users,

        Ever since downloading my first Nagios tarball (2.0rc2), and 
continuing to version 2.5, I have been noticing a big problem with 
downtimes.  It appears that if there are more than a couple of downtimes 
scheduled, Nagios will crash partway through the list.  This is a really 
big problem, for several reasons:  I've written a tool that automatically 
schedules recurring downtime.  It is integrated into the Nagios web site, and 
anyone with access to schedule downtime for a host/service can schedule 
recurring downtime, and see the other recurring downtimes that have been 
scheduled by their fellow contacts.  I want to release it publicly, but I 
can't finish testing because of Nagios crashing (several downtimes a day 
== one crash per day).  On a different note, this (Nagios) was the first 
open-source tool to become widespread in our IT department.  I'd like 
Nagios to continue to gain acceptance in our group (most of whom are 
SAP/Oracle/Windows/etc.), and this problem doesn't help.

        I don't want to look like I'm barging in and saying "Fix this!", 
so I brought some stuff along with me, and I did what I could to diagnose 
the problem.  First of all, I have two core files.  When I installed 
Nagios, I used the 'install-unstripped' target, so these core files, and 
my copy of Nagios, include debugging symbols.  Here's what the backtrace 
looks like from the most recent coredump:

(gdb) bt
#0  0x00002aaaab20dd20 in strlen () from /lib/libc.so.6
#1  0x000000000042866f in hashfunc2 (
    name1=0x44e4f697 <Address 0x44e4f697 out of bounds>,
    name2=0x4e202c6900000000 <Address 0x4e202c6900000000 out of bounds>,
    hashslots=1024) at utils.c:4285
#2  0x0000000000437d15 in find_service (
    host_name=0x44e4f697 <Address 0x44e4f697 out of bounds>,
    svc_desc=0x4e202c6900000000 <Address 0x4e202c6900000000 out of bounds>)
    at ../common/objects.c:5016
#3  0x00000000004518cf in handle_scheduled_downtime (temp_downtime=0xfe6500)
    at ../common/downtime.c:311
#4  0x000000000042130e in handle_timed_event (event=0x722320) at events.c:1289
#5  0x0000000000421893 in event_execution_loop () at events.c:964
#6  0x000000000040eeb2 in main (argc=Variable "argc" is not available.
) at nagios.c:710
(gdb) 

        I tried to look through the code, and the coredump, and the most I 
could determine is this:  It looks like the scheduled downtime event 
struct was corrupted at some point during its life in the high-priority 
event queue (for one thing, between the time Nagios was started and the 
time it crashed, no more than 10 downtimes had ever been scheduled, yet 
the downtime ID is 81, and no downtime had ever been scheduled that was 
2072 hours long):

(gdb) frame 3
#3  0x00000000004518cf in handle_scheduled_downtime (temp_downtime=0xfe6500)
    at ../common/downtime.c:311
311     svc=find_service(temp_downtime->host_name,temp_downtime->service_description);
(gdb) print *temp_downtime
$1 = {type = 0, host_name = 0x44e4f697 <Address 0x44e4f697 out of bounds>,
  service_description = 0x4e202c6900000000 <Address 0x4e202c6900000000 out of bounds>,
  entry_time = 0, start_time = 2334111869775642625, end_time = 0,
  fixed = 6488400, triggered_by = 0, duration = 7459712, downtime_id = 81,
  author = 0x2aaa00000000 <Address 0x2aaa00000000 out of bounds>,
  comment = 0x44e4f6bf <Address 0x44e4f6bf out of bounds>, comment_id = 0,
  is_in_effect = 0, start_flex_downtime = 0, incremented_pending_downtime = 1,
  next = 0x0}
(gdb) 
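
        One way to probe a dump like this is to reinterpret the odd-looking 
integers as raw bytes or as epoch seconds, since a struct whose memory was 
reused tends to hold fragments of whatever was written over it.  What 
follows is a minimal Python sketch, not anything from Nagios itself:  the 
helper names are made up for illustration, and a little-endian x86-64 
layout is assumed from the addresses in the backtrace.  The constants are 
copied from the gdb output above.

```python
from datetime import datetime, timezone

# Values copied from the `print *temp_downtime` output above.
FIELDS = {
    "host_name": 0x44E4F697,
    "service_description": 0x4E202C6900000000,
    "start_time": 2334111869775642625,
    "comment": 0x44E4F6BF,
}
DURATION_SECONDS = 7459712


def as_le_bytes(value, width=8):
    """Return the value as it would sit in memory on a little-endian host (assumed)."""
    return value.to_bytes(width, "little")


def as_utc(value):
    """Interpret the value as seconds since the Unix epoch."""
    return datetime.fromtimestamp(value, tz=timezone.utc)


def duration_hours(seconds):
    """Convert a duration in seconds to hours."""
    return seconds / 3600.0


if __name__ == "__main__":
    for name, value in FIELDS.items():
        print(f"{name:20s} {as_le_bytes(value)!r}")
    print("host_name as time:", as_utc(FIELDS["host_name"]).isoformat())
    print("duration in hours: %.1f" % duration_hours(DURATION_SECONDS))
```

        With these inputs, service_description and start_time contain 
printable ASCII ("i, N" and "and "), host_name and comment fall in the 
range of epoch timestamps from August 2006, and the duration works out to 
about 2072 hours.  That would be consistent with the struct's memory having 
been reused for other data, though it does not show which code path did it.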

        So, I've got two coredumps.  When the second coredump took place, 
and before restarting Nagios, I tarballed the entire Nagios directory, 
including all log files, cache files, etc.  I don't know if the object 
cache or downtimes data files would be of any help, but I've got them in 
storage.

        So, what else?  Well, I've looked at the event log for today, and 
I did notice something weird:  My recurring downtime scheduler schedules 
the day's downtimes every day at midnight, writing commands out to the 
Nagios command socket.  The event logs record receiving 6 
SCHEDULE_SVC_DOWNTIME commands, which is correct.  The first downtime 
started correctly, and ended correctly.  However (here's the weird part), 
the other downtimes started at the exact same moment the first downtime 
ended.  Even more weird, the second, third, and fourth downtimes ended 
when they should have started.  Here are all of the downtime-related entries 
from the event log, with the time values converted into readable 
dates/times:

[2006-08-15 00:11:29] EXTERNAL COMMAND: SCHEDULE_SVC_DOWNTIME;szlnmail1.shenzhen;CPU;[Tue Aug 15 17:55:00 2006];[Tue Aug 15 19:30:00 2006];1;0;0;kornelak;Weekday backup.
[2006-08-15 00:11:29] EXTERNAL COMMAND: SCHEDULE_SVC_DOWNTIME;shenzhendc1.shenzhen;CPU;[Tue Aug 15 14:55:00 2006];[Tue Aug 15 16:00:00 2006];1;0;0;kornelak;Daily backup
[2006-08-15 00:11:29] EXTERNAL COMMAND: SCHEDULE_SVC_DOWNTIME;westborod2.westboro;CPU;[Tue Aug 15 14:55:00 2006];[Tue Aug 15 16:00:00 2006];1;0;0;kornelak;Daily Backup
[2006-08-15 00:11:29] EXTERNAL COMMAND: SCHEDULE_SVC_DOWNTIME;westborom2.westboro;CPU;[Tue Aug 15 15:55:00 2006];[Tue Aug 15 18:30:00 2006];1;0;0;kornelak;Daily backup
[2006-08-15 00:11:29] EXTERNAL COMMAND: SCHEDULE_SVC_DOWNTIME;hillsborom1.hillsboro;CPU;[Tue Aug 15 22:55:00 2006];[Tue Aug 15 23:15:00 2006];1;0;0;kornelak;Daily Backup
[2006-08-15 00:11:29] EXTERNAL COMMAND: SCHEDULE_SVC_DOWNTIME;sophiad1.nice;CPU;[Tue Aug 15 07:55:00 2006];[Tue Aug 15 09:30:00 2006];1;0;0;kornelak;Daily Backup
[2006-08-15 07:55:03] SERVICE DOWNTIME ALERT: sophiad1.nice;CPU;STARTED; Service has entered a period of scheduled downtime
[2006-08-15 09:30:09] SERVICE DOWNTIME ALERT: sophiad1.nice;CPU;STOPPED; Service has exited from a period of scheduled downtime
[2006-08-15 09:30:09] SERVICE DOWNTIME ALERT: shenzhendc1.shenzhen;CPU;STARTED; Service has entered a period of scheduled downtime
[2006-08-15 09:30:09] SERVICE DOWNTIME ALERT: westborod2.westboro;CPU;STARTED; Service has entered a period of scheduled downtime
[2006-08-15 09:30:09] SERVICE DOWNTIME ALERT: westborom2.westboro;CPU;STARTED; Service has entered a period of scheduled downtime
[2006-08-15 09:30:09] SERVICE DOWNTIME ALERT: szlnmail1.shenzhen;CPU;STARTED; Service has entered a period of scheduled downtime
[2006-08-15 09:30:09] SERVICE DOWNTIME ALERT: hillsborom1.hillsboro;CPU;STARTED; Service has entered a period of scheduled downtime
[2006-08-15 14:55:00] SERVICE DOWNTIME ALERT: shenzhendc1.shenzhen;CPU;STOPPED; Service has exited from a period of scheduled downtime
[2006-08-15 14:55:00] SERVICE DOWNTIME ALERT: westborod2.westboro;CPU;STOPPED; Service has exited from a period of scheduled downtime
[2006-08-15 15:55:02] SERVICE DOWNTIME ALERT: westborom2.westboro;CPU;STOPPED; Service has exited from a period of scheduled downtime

Nagios crashed at 2006-08-15 16:00, which happens to be the time that the 
shenzhendc1.shenzhen->CPU and westborod2.westboro->CPU downtimes were 
supposed to end.
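
        For anyone who wants to reproduce this, the commands in the log 
above follow the documented external command layout:  a bracketed 
submission time, the command name, then semicolon-separated fields for 
host, service, start, end, fixed, trigger ID, duration, author, and 
comment.  The sketch below builds one such line in Python.  It is not the 
recurring-downtime tool itself; the function name is invented here, and 
real commands carry epoch seconds where the log excerpt shows readable 
dates.

```python
def svc_downtime_command(now, host, service, start, end, author, comment,
                         fixed=1, trigger_id=0, duration=0):
    """Format one SCHEDULE_SVC_DOWNTIME line; all times are epoch seconds."""
    if end <= start:
        raise ValueError("downtime must end after it starts")
    fields = [host, service, start, end, fixed, trigger_id, duration,
              author, comment]
    return "[%d] SCHEDULE_SVC_DOWNTIME;%s\n" % (
        now, ";".join(str(f) for f in fields))


if __name__ == "__main__":
    # Hypothetical epoch values, chosen only to show the layout.
    print(svc_downtime_command(1155600689, "sophiad1.nice", "CPU",
                               1155628500, 1155634200,
                               "kornelak", "Daily Backup"), end="")
```

        Writing the resulting string to the Nagios command file (the path 
set by command_file in nagios.cfg) should be enough to schedule one fixed 
downtime, so a short script can replay the six commands against a test 
instance.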

        Notice how Nagios was fine until sophiad1.nice came out of 
downtime, and suddenly everything else went into downtime!
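
        The mismatch can be made explicit by comparing each scheduled 
window against the STARTED and STOPPED alerts.  The sketch below does that 
for three of the services, with the times hand-copied from the log above; 
the function and variable names are illustrative only, and the 60-second 
tolerance is an arbitrary choice.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"


def t(text):
    """Parse a timestamp in the format used by the log excerpt."""
    return datetime.strptime(text, FMT)


# service -> (scheduled start, scheduled end), from the commands above
SCHEDULED = {
    "sophiad1.nice": (t("2006-08-15 07:55:00"), t("2006-08-15 09:30:00")),
    "shenzhendc1.shenzhen": (t("2006-08-15 14:55:00"), t("2006-08-15 16:00:00")),
    "westborom2.westboro": (t("2006-08-15 15:55:00"), t("2006-08-15 18:30:00")),
}
# service -> (observed STARTED, observed STOPPED), from the alerts above
OBSERVED = {
    "sophiad1.nice": (t("2006-08-15 07:55:03"), t("2006-08-15 09:30:09")),
    "shenzhendc1.shenzhen": (t("2006-08-15 09:30:09"), t("2006-08-15 14:55:00")),
    "westborom2.westboro": (t("2006-08-15 09:30:09"), t("2006-08-15 15:55:02")),
}


def mismatches(scheduled, observed, tolerance=timedelta(seconds=60)):
    """Return (service, event, scheduled, observed) for events off by more than tolerance."""
    out = []
    for svc, (want_start, want_end) in scheduled.items():
        got_start, got_end = observed[svc]
        if abs(got_start - want_start) > tolerance:
            out.append((svc, "STARTED", want_start, got_start))
        if abs(got_end - want_end) > tolerance:
            out.append((svc, "STOPPED", want_end, got_end))
    return out


if __name__ == "__main__":
    for svc, event, want, got in mismatches(SCHEDULED, OBSERVED):
        print(f"{svc}: {event} at {got:%H:%M:%S}, expected {want:%H:%M:%S}")
```

        Only sophiad1.nice passes.  For the other two, the observed start 
equals the moment sophiad1.nice stopped, and the observed stop equals the 
scheduled start.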

        So, that's all I've got.  Hopefully it's enough for someone to run 
with it and figure out what's going on.  Up to now I've been running 
Nagios 2.5.  At the time this email goes out, I'll be running the version 
of Nagios in CVS (copied from the daily tarball).  I'll let you know if 
the version in CVS works, but for now I'm going to assume that it does 
not.  Hopefully this is the right place to ask for help (and to ask if 
anyone else has seen this behavior).  I'd be happy to resubmit this info 
somewhere else, if needed.  Thanks in advance for your help!

-- A. Karl Kornel, Mindspeed Technologies, Inc.
karl.kornel at mindspeed.com -- (949) 579-3503
"Remember the Rules: Separation & Optimization"
_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users