Nagios ignores broken file descriptor?

Andreas Ericsson ae at op5.se
Thu Nov 20 10:27:12 CET 2008


Please don't top-post. It makes my teeth itch.

Steven D. Morrey wrote:
> Maybe have it email someone if that happens?

That would be very unreliable too, since there are no guarantees about
what a flaky kernel/hardware combo will do when Nagios tries to run
the command, or that the other command doesn't bunk out.

> Honestly though, if it's not logging then it's not performing its
> critical role in the network, since the monitoring results are what we
> care about, not the act of monitoring itself.

It's still monitoring and it will still try to send notifications for
problems that arise in the network. Logging is primarily for reports.
Crashing means it won't even try to send notifications, so that's one
step *worse*.

> Having it go down completely might be the best option in such a case.
> Maybe this could be a config file option in future releases.
> 

If you send a good patch, I'll make sure to bring it up with the rest
of the developers and open this issue for discussion again.

> FYI we found the source of the scsi error.
> Turns out it was a bad driver that shipped with SLES 9.
> 

So neither email nor SMS notifications would have worked then, as
both rely on spooling systems to do their job. In short, Nagios had
absolutely no way out that would have worked at all, but you still
think it should have done something different. You'll need more than
that to convince me this is something worth investing effort in, but,
like I said, I'll happily take patches.


NOTICE: This email can be spread far and wide. I don't care one way
or another to whom you give this email, or the email(s) I'm responding
to. Neither does anyone else, so print it out and put it up next to
the coffee-machine at work. There it might give some unimportant but
anal-retentive mid-level nobody boss of nothing important something
to think about when he next sits down to formulate a policy on what
kind of retarded text everyone in your organization should include
in their emails.


> Sincerely,
> Steven D. Morrey
> 
> On Wed, 2008-11-19 at 01:13 -0700, Andreas Ericsson wrote:
>> Steven D. Morrey wrote:
>>> Here is an strace on the same box from just a few minutes ago.
>>> As you can see, what's happening is that Nagios does not appear to be
>>> catching the error about trying to write to a read-only file system.
>>>
>>> nanosleep({1, 0},{1, 0})               = 0
>>> kill(-7799, SIGKILL)                    = -1 ESRCH (No such process)
>>> gettimeofday({1227029613, 509530}, NULL) = 0
>>> close(10)                               = 0
>>> open("/usr/local/nagios/var/nagios.log", O_RDWR|O_APPEND|O_CREAT, 0666) = -1 EROFS (Read-only file system)
>> Actually, it does catch it (which is why it doesn't try to write to it),
>> but since it's the logging API, there's not much Nagios can do about it
>> except crashing out. Given Nagios' role in the network, it's considered
>> better to keep running with logging disabled than to silently die without
>> leaving a core-dump or some other entry-point for debugging.
>>
>> In other words, this is unfortunate, but by design. If you have a solution,
>> I'm all ears.
>>
> 
> 
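
For readers of the archive, the behaviour described in the quoted reply
(catch the failed open(), then keep running with logging disabled rather
than crashing out) can be sketched roughly as below. This is only an
illustration under stated assumptions, not the actual Nagios logging code:
open_log(), write_log(), log_fd and logging_enabled are hypothetical names
made up for this example. Only the open() flags and mode are taken from
the strace output above.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical state, not taken from the Nagios source. */
static int log_fd = -1;          /* -1 means no log file is open */
static int logging_enabled = 0;  /* 0 means log writes are skipped */

/*
 * Try to open the log file with the same flags seen in the strace.
 * On failure (for example EROFS), report the problem on stderr,
 * disable logging and return the errno value instead of aborting.
 * Returns 0 on success.
 */
int open_log(const char *path)
{
    log_fd = open(path, O_RDWR | O_APPEND | O_CREAT, 0666);
    if (log_fd < 0) {
        int err = errno;
        logging_enabled = 0;
        fprintf(stderr, "Cannot open log file '%s': %s; continuing without logging\n",
                path, strerror(err));
        return err;
    }
    logging_enabled = 1;
    return 0;
}

/*
 * Append one line to the log.
 * Returns 0 if the line was written, -1 if logging is disabled
 * or the write failed (in which case logging is switched off).
 */
int write_log(const char *msg)
{
    size_t len;

    if (!logging_enabled)
        return -1;

    len = strlen(msg);
    if (write(log_fd, msg, len) != (ssize_t)len ||
        write(log_fd, "\n", 1) != 1) {
        /* The file system may have gone read-only after the open. */
        close(log_fd);
        log_fd = -1;
        logging_enabled = 0;
        return -1;
    }
    return 0;
}
```

In this sketch a caller would invoke open_log() once at startup and then
call write_log() from the monitoring loop, ignoring a -1 return so that
checks and notification attempts carry on even when nothing can be logged.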

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231




