Bug report: nagios shutdown removing lock file too early

Ton Voon ton.voon at altinity.com
Tue Jun 20 16:56:44 CEST 2006


On 19 Jun 2006, at 21:46, Ethan Galstad wrote:

> Ton Voon wrote:
>> Ethan,
>>
>> I think I've seen a problem with the nagios shutdown routine. If
>> nagios is doing a host check and an INT signal is sent, it seems to
>> take a long time before the nagios daemon dies. It looks like the
>> child nagios process is trying to complete all the retries for a host
>> check before going back into the main loop.
>>
>> Also, it appears that the lockfile is being removed before the main
>> process dies. Below is the output of
>> 'while true; do ps -p 728; ls -l /usr/local/nagios/var/nagios.lock; sleep 1; done'
>> during a 'kill 728'.
>>
>> [snipped]
>>    PID  TT  STAT      TIME COMMAND
>>    728  ??  Ss     0:01.95 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
>> -rw-r--r--   1 nagios  nagios  4 Jun 13 17:20 /usr/local/nagios/var/nagios.lock
>>    PID  TT  STAT      TIME COMMAND
>>    728  ??  Ss     0:01.95 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
>> -rw-r--r--   1 nagios  nagios  4 Jun 13 17:20 /usr/local/nagios/var/nagios.lock
>>    PID  TT  STAT      TIME COMMAND
>>    728  ??  Ss     0:01.95 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
>> ls: /usr/local/nagios/var/nagios.lock: No such file or directory
>>    PID  TT  STAT      TIME COMMAND
>>    728  ??  Ss     0:01.95 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
>> ls: /usr/local/nagios/var/nagios.lock: No such file or directory
>>
>> This shows the lockfile gets removed before the main daemon dies.
>> (This is from a kill 728, not using any init scripts.) Eventually the
>> daemon dies.
>>
>> I've tested this on Nagios 2.2 on MacOSX 10.4, Nagios 2.0 on Debian
>> and Nagios 2.4 on Debian.
>>
>> Sorry, not had time to delve into the source code.
>
> Yep, this is a bug.  It's been present for several years now, so I
> suppose we could get around to fixing it.  :-)  Is the early lockfile
> removal causing noticeable problems with anything?

I think the early lockfile removal is the source of the "multiple Nagios
processes running" problem. The example daemon-init script uses the
lockfile as the status of the process. If you were to do a restart, the
stop would complete because the signal was sent, but Nagios would
actually still be in the process of shutting down. Meanwhile the start
would run, so another Nagios process is kicked off. Then, as both Nagios
processes are trying to access the same files, mayhem can ensue :)

We've got our own startup script and we've changed the stop routine to
wait until nagios has actually stopped before moving out of the stop  
function. Much more stable, but there's a long delay if Nagios is in  
the middle of a host check.
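
For illustration, a stop routine along those lines might look like the
sketch below. This is not our actual script: the function names
wait_for_exit and stop_nagios, the lockfile argument and the 60 second
default timeout are all made up for the example, and it assumes the
lock file contains the daemon's pid.

```shell
#!/bin/sh
# Hypothetical helper: poll until PID is gone or TIMEOUT seconds pass.
# Returns 0 once the process has exited, 1 if it is still there.
wait_for_exit() {
    pid=$1
    timeout=${2:-60}
    waited=0
    # kill -0 sends no signal; it only tests whether the process exists
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Hypothetical stop routine: read the pid first, then signal and wait.
stop_nagios() {
    lockfile=$1
    timeout=${2:-60}
    # Assumption: no lock file means there is nothing to stop
    [ -f "$lockfile" ] || return 0
    pid=$(cat "$lockfile")
    kill "$pid" 2>/dev/null
    wait_for_exit "$pid" "$timeout"
}
```

A restart would then call something like
'stop_nagios /usr/local/nagios/var/nagios.lock 120' and only go on to
the start once it returns 0. The pid is read before the signal is sent,
so it does not matter that the daemon removes the lock file early.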

> The file gets
> deleted immediately upon receiving a SIGHUP/etc. to prevent it from
> staying around if Nagios has problems shutting down.

I see why, but I think it is probably better to leave the lock file  
around if there was a problem shutting down, and handle the existence  
of the lock file on startup.
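
As a rough sketch of what I mean by handling it on startup (the
function name lock_is_live is invented for this example, and it again
assumes the lock file holds the pid of the daemon), the start routine
could check whether the recorded pid is still running and clear the
file only if it is not:

```shell
#!/bin/sh
# Hypothetical startup check.
# Returns 0 if the lock file points at a running process.
# Returns 1 otherwise, removing a stale or unreadable lock file.
lock_is_live() {
    lockfile=$1
    [ -f "$lockfile" ] || return 1
    pid=$(cat "$lockfile" 2>/dev/null)
    case "$pid" in
        # empty or non-numeric content: treat the file as stale
        ''|*[!0-9]*) rm -f "$lockfile"; return 1 ;;
    esac
    # kill -0 only tests for existence. Assumption: the script runs as
    # root or as the daemon's user, otherwise a live process owned by
    # someone else would be reported as dead.
    if kill -0 "$pid" 2>/dev/null; then
        return 0
    fi
    rm -f "$lockfile"
    return 1
}
```

The start routine would refuse to launch a second daemon when
lock_is_live returns 0, and carry on as normal when it returns 1.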

Ton


http://www.altinity.com
T: +44 (0)870 787 9243
F: +44 (0)845 280 1725
Skype: tonvoon



