SIGSEGV on reload - incorrect freeing of string?
Ton Voon
ton.voon at altinity.com
Tue Feb 20 22:24:35 CET 2007
Hi!
I wanted to check what people thought about this, because my
knowledge of string handling in C is poor and this problem is hard to
recreate.
We had a situation where nagios had a segfault. Here are the
pertinent entries in nagios.log:
[1171976003] Caught SIGHUP, restarting...
[1171976010] HOST ALERT: hostB;DOWN;SOFT;1;CRITICAL - Plugin timed
out after 10 seconds
[1171976010] SERVICE ALERT: hostB;TCP 2226;CRITICAL;SOFT;1;CRITICAL -
Socket timeout after 10 seconds
[1171976010] SERVICE ALERT: hostA;TCP 2222;CRITICAL;SOFT;1;CRITICAL -
Socket timeout after 10 seconds
[1171976011] Caught SIGSEGV, shutting down...
[1171976073] Nagios 2.5 starting... (PID=19542)
Note that there is a 7 second delay between the signal being caught
and the host alerts, which is reasonable given the check_ping plugin
was timing out after 10 seconds (there were actual network/host
problems). So it is fair to assume that Nagios was in the host
reachability logic.
Ethan made a change to Nagios in 2.5 where nagios exited out of the
host check logic earlier (as this was slowing the restart). This
includes the following code in checks.c:
-----
*** 2086,2089 ****
--- 2094,2107 ----
for(hst->current_attempt=1;hst-
>current_attempt<=max_check_attempts;hst->current_attempt++){
+ /* ADDED 06/20/2006 EG */
+ /* bail out if signal encountered - use old state */
+ if(sigrestart==TRUE || sigshutdown==TRUE){
+ hst->current_attempt=1;
+ hst->current_state=old_state;
+ free(hst->plugin_output);
+ hst->plugin_output=(char *)old_plugin_output;
+ return hst->current_state;
+ }
+
/* check the host */
result=run_host_check(hst,check_options);
***************
*** 2172,2175 ****
--- 2190,2203 ----
for(hst->current_attempt=1;hst->current_attempt<=hst-
>max_attempts;hst->current_attempt++){
+ /* ADDED 06/20/2006 EG */
+ /* bail out if signal encountered - use old state */
+ if(sigrestart==TRUE || sigshutdown==TRUE){
+ hst->current_attempt=1;
+ hst->current_state=old_state;
+ free(hst->plugin_output);
+ hst->plugin_output=(char *)old_plugin_output;
+ return hst->current_state;
+ }
+
/* run the host check */
result=run_host_check(hst,check_options);
-----
I'm wondering if the "free(hst->plugin_output)" is the problem.
Shouldn't this be a:
strcpy(hst->plugin_output, old_plugin_output);
instead?
When trying to recreate this, I created a host with an IP address
that is not pingable, using check_ping as the host check command. I
then submit a passive OK to this host and then schedule a check for
all services immediately. After a few seconds (to allow nagios to get
into the host check logic portion), I send a HUP signal to nagios.
The logs show a Caught SIGHUP and a delay before the HOST ALERT.
I can't get nagios to segfault, but I get the plugin output set to
funny characters, which suggests that plugin_output is not being
correctly set in those routines. These corrupt characters do not
appear if I change to a strcpy call. So it is possible that the
segfault is happening somewhere else.
Any thoughts?
Ton
http://www.altinity.com
T: +44 (0)870 787 9243
F: +44 (0)845 280 1725
Skype: tonvoon
-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
More information about the Developers
mailing list