Multiple Nagios processes running.

Chris Wilson chris at aidworld.org
Thu Aug 11 17:16:26 CEST 2005


Hi Andreas,

On Thu, 2005-08-11 at 14:24, Andreas Ericsson wrote:
> Chris Wilson wrote:
> > You're right that we can't identify what the other process is, but
> > killing it sounds much worse than just aborting! What if the user is
> > running several daemons as the same UID (e.g. nobody, daemon) and
> > another one gets the PID that Nagios was using before?

> True. For a proper fix, the lockfile would be locked against writing by 
> the old process. If there is no such process *AND* the file isn't 
> locked, it's fairly safe to assume the process isn't another nagios 
> daemon. If the lock is held, but the pid is wrong, some process is 
> running but has failed to update the pid in the file (a bug, by its own 
> means), and if a process exists but no lock is held, it's safe to assume 
> that the process running is another nagios daemon. However, that leaves 
> us with the old checking system pretty much in place, and your patch 
> becoming something of an extra clarification. filelock held = nagios 
> running, no filelock = nagios possibly not running, or running with some 
> weird permissions, or some such.

Sorry, I don't quite follow your logic here. It's probably just me being
stupid, but I parse this as:

* Lockfile exists, PID in lockfile doesn't exist (any more), file not
locked in kernel => other process is not a Nagios daemon (indeed, as
there is no other process :-)

* Lockfile exists, file locked in kernel by a different PID than the one
in lockfile => some process is running but has failed to update the pid
in the file (true, but very unlikely)

* Lockfile exists, file not locked in kernel, PID exists => safe to
assume that the process running is another nagios daemon (why? if it
failed to lock the file it should have aborted, and it could not lose
the lock afterwards. Sounds to me that this would mean the other process
was NOT nagios).
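
The "PID exists" test in the cases above is commonly done by sending signal 0 with kill(2). A minimal sketch, assuming a hypothetical helper name pid_exists (it is not taken from the Nagios source):

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>

/* Hypothetical helper: returns 1 if some process has this PID, 0 otherwise. */
int pid_exists(pid_t pid)
{
    if (pid <= 0)
        return 0;           /* 0 and negative values address process groups */
    if (kill(pid, 0) == 0)
        return 1;           /* signal 0 only checks existence and permission */
    return errno == EPERM;  /* EPERM: the process exists but belongs to another user */
}
```

This only says that some process currently holds the PID. It cannot tell whether that process is Nagios, which is exactly the PID-reuse concern raised earlier in the thread.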

Under Linux at least, we cannot check the PID of the locking process
using F_SETLK, as Nagios tries to do at the moment. From fcntl(2):

             pid_t l_pid;     /* PID of process blocking our lock
                                 (F_GETLK only) */

Nagios always reports that the PID of the other process (locking the
lockfile) is 0. I could include an additional F_GETLK in my patch to get
the real PID, if that seems like a good idea.
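
For illustration, the additional F_GETLK query could look roughly like this. It is a sketch only, not the patch itself, and lock_holder_pid is a hypothetical name:

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns the PID holding a conflicting lock on fd,
 * 0 if a write lock could be placed, or -1 if fcntl() itself failed.
 */
pid_t lock_holder_pid(int fd)
{
    struct flock fl = {0};

    fl.l_type = F_WRLCK;    /* ask whether a write lock would be granted */
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;           /* 0 means "to end of file", so the whole file */

    if (fcntl(fd, F_GETLK, &fl) == -1)
        return -1;
    if (fl.l_type == F_UNLCK)
        return 0;           /* nobody holds a conflicting lock */
    return fl.l_pid;        /* filled in by the kernel for F_GETLK */
}
```

Note that a process never sees its own locks as conflicting, so this has to be called by the newly started copy, not by the process that holds the lock.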

My suggested course of action would be to kill the other process ONLY if
the lockfile exists and is locked by the kernel, by the same PID. If
kill -KILL fails to kill the other Nagios process, or if the file is
locked by a different PID or contains the PID of a running process, we
should abort.
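
The decision described above can be written down as a small pure function. This is a sketch under assumptions: the enum and function names are hypothetical, lock_holder is taken to be 0 when the file is unlocked and negative when the lock query failed, and the "start normally" branch for a stale lockfile is my reading of the first case listed earlier.

```c
#include <sys/types.h>

enum startup_action {
    START_NORMALLY,       /* stale lockfile: no lock held, PID not running */
    KILL_OLD_THEN_START,  /* lock held by the PID recorded in the lockfile */
    ABORT_STARTUP         /* anything ambiguous */
};

/* Hypothetical helper implementing the rules from the paragraph above. */
enum startup_action decide_startup(pid_t pid_in_file, pid_t lock_holder,
                                   int pid_in_file_alive)
{
    if (lock_holder < 0)
        return ABORT_STARTUP;           /* could not query the lock */
    if (lock_holder > 0)                /* file is locked in the kernel */
        return lock_holder == pid_in_file ? KILL_OLD_THEN_START
                                          : ABORT_STARTUP;
    if (pid_in_file_alive)
        return ABORT_STARTUP;           /* unlocked, but the PID is running */
    return START_NORMALLY;
}
```

If the kill step then fails, the caller should still abort, as stated above.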

> However, in this scenario the filelock should always be attempted as 
> root (or at least as the most privileged user nagios starts as), because 
> root can sometimes (always, but sometimes silently) override filelocks 
> held by processes with lesser privileges.

Doesn't that imply that we should never lock as root? Otherwise we might
override the lock without realising it?

> > Surely it's safer to abort so that the user finds out something is
> > wrong, checks for and removes the old Nagios process, and then deletes
> > the lockfile?
> 
> This assumes user intervention, which I assumed was what you were trying 
> to move away from.

Not necessarily. I wanted to at least improve the current behaviour
which results in weird and unexpected Nagios behaviour after a restart,
and potential corruption and loss of important state data. Manual
intervention may be necessary in any case if a zombie Nagios process is
hanging around, refusing to release the lock or delete the lockfile. 

Avoiding starting a new copy and writing a clear error to the logs is a
big improvement in my book. I would only go further (kill the old
process) if it was clear that it could be done reliably and safely and
in a way that doesn't defy the user's expectations.
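
As a rough illustration of "refuse to start and say why", here is a sketch of taking the lock with F_SETLK. The function name and the message text are hypothetical, and real code would write to the Nagios log, not to stderr:

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: opens (creating if needed) and write-locks the
 * lockfile. Returns the open descriptor on success, or -1 if the file
 * cannot be opened or the lock cannot be taken.
 */
int acquire_lockfile(const char *path)
{
    struct flock fl = {0};
    int fd = open(path, O_RDWR | O_CREAT, 0644);

    if (fd == -1) {
        perror(path);
        return -1;
    }

    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;   /* l_start = 0 and l_len = 0: whole file */

    if (fcntl(fd, F_SETLK, &fl) == -1) {
        fprintf(stderr, "%s: could not lock (held by another process?); "
                        "refusing to start\n", path);
        close(fd);
        return -1;
    }
    return fd;                /* keep open: closing it drops the lock */
}
```

The caller has to keep the returned descriptor open for the lifetime of the daemon, because the kernel releases the lock when the process closes the file or exits.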

> > It's at least better than the current behaviour (on Linux
> > at least) of silently carrying on :-)
> > 
> Indeed, but that behaviour is flawed on its own merit.

If that means that my patch is not flawed on its own merit, I will take
it as a compliment, thanks :-)

Seriously, I would regard the current patch as an improvement, even
though I know it does not solve the problem with the init script failing
to kill nagios properly.

> I don't. It only is if Nagios is running as a dedicated pseudo-user, 
> which it won't necessarily be. One could of course in such cases submit a 
> RELOAD command to the external pipe. I'm not sure how many hoops one 
> should jump through though, or even if it's the right one to jump next.

I think that would be excessive. How would we know that the command
succeeded? At least if we TERM and KILL the other process we can find
out whether it died or not, and take appropriate action to avoid having
two fully working (but conflicting) Nagios processes running at once.
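
A sketch of what "TERM and KILL, then find out whether it died" could look like. The names stop_process and wait_until_gone and the 100 ms poll interval are assumptions made for the example:

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Polls up to `polls` times, 100 ms apart; returns 1 once the PID is gone. */
static int wait_until_gone(pid_t pid, int polls)
{
    struct timespec delay = {0, 100 * 1000 * 1000};

    for (int i = 0; i < polls; i++) {
        if (kill(pid, 0) == -1 && errno == ESRCH)
            return 1;
        nanosleep(&delay, NULL);
    }
    return kill(pid, 0) == -1 && errno == ESRCH;
}

/*
 * Hypothetical helper: SIGTERM first, SIGKILL if that is not enough.
 * Returns 0 if the process is gone, -1 if it survived or could not be
 * signalled (in which case the caller should abort startup).
 */
int stop_process(pid_t pid, int polls)
{
    if (pid <= 1)
        return -1;                      /* never signal groups or init */
    if (kill(pid, SIGTERM) == -1)
        return errno == ESRCH ? 0 : -1; /* ESRCH: already gone */
    if (wait_until_gone(pid, polls))
        return 0;
    if (kill(pid, SIGKILL) == -1)
        return errno == ESRCH ? 0 : -1;
    return wait_until_gone(pid, polls) ? 0 : -1;
}
```

One caveat: a process that has exited but has not yet been reaped by its parent still answers kill(pid, 0), so a zombie looks alive to this check and the function reports failure, which is the conservative outcome.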

If the user asks us to restart nagios, I think we should actually
restart it, not just tell it to reload. Isn't there a separate command
line option for reload anyway?

Cheers, Chris.
-- 
(aidworld) chris wilson | chief engineer (chris at aidworld.org)



_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users