Services on down hosts

Andreas Ericsson ae at op5.se
Wed Jun 11 17:54:48 CEST 2008


Jay R. Ashworth wrote:
> On Wed, Jun 11, 2008 at 08:19:14AM +0200, Andreas Ericsson wrote:
>> Jay R. Ashworth wrote:
>>> Stupid question of the week: 
>>>
>>> I see that Nagios 3 represents as CRITICAL services on hosts which are
>>> down.
>> Those CRITICAL states are actual test results. Since host checks
>> sometimes fail for reasons other than getting an answer that
>> crosses some sort of threshold, Nagios will
>> keep running service checks for the host the same way it always has,
>> but notifications for the services are suppressed.
> 
> Aha.  So they show up as CRIT, but if the host was already notified as
> DOWN, then I won't get paged for the services as well.  Ok; that's not
> too bad... (though I haven't actually set up notification yet :-).
> 

Yes.
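
The suppression rule confirmed above can be sketched roughly as follows. This is only an illustration of the behaviour described in this thread, not Nagios source code; the function name and the string state labels are assumptions made for the example.

```python
# Hypothetical helper: mirrors the rule discussed in the thread, not Nagios internals.
def should_notify_service(host_state: str, service_state: str) -> bool:
    """Return True if a service problem should generate its own notification."""
    if service_state == "OK":
        return False  # nothing to report
    if host_state != "UP":
        return False  # the host problem already covers it, so suppress
    return True


if __name__ == "__main__":
    print(should_notify_service("DOWN", "CRITICAL"))  # False
    print(should_notify_service("UP", "CRITICAL"))    # True
```

The service still shows up as CRITICAL in the status view; only the page is held back.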

>>> Since most services are OK-WARNING-CRITICAL; ie: ascending points on a
>>> numeric continuum which *cannot be measured if the host is down*,
>>> wouldn't it make more sense if those services went UNKNOWN if the host
>>> wasn't running?
>> No. UNKNOWN is reserved for when plugins get fed nonsense arguments,
>> or can't, due to some really weird errors, complete the check they're
>> supposed to do (such as getting IO errors when trying to write to
>> a network socket).
> 
> Well, since (to take an example), CRITICAL load means "a load average
> over 8" (on my 8-core Opteron), and we don't *know* the load average if
> the machine isn't reachable to return a value...  then the NRPE checker
> on the console in fact *is* getting an IO error when trying to, ok,
> read from a network socket.
> 

I was thinking more along the lines of errno being set to EIO when
attempting to read(2) from an already connected network socket, although
there are two schools of thought about that too (some want all failures
to always alert, while some want a lot of things to be in UNKNOWN state).

Not being able to connect clearly signals there is something wrong
with the service though, while an EIO signals that there's something
wrong with the Nagios host's kernel or hardware.
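
As a minimal sketch of that distinction, assuming a plugin written in Python: the exit codes 0 to 3 are the standard Nagios plugin return values, while `classify_failure` and `check_tcp` are hypothetical names invented for this example. A failed connection is treated as CRITICAL, and EIO or a non-socket error (such as nonsense arguments) as UNKNOWN.

```python
import errno
import socket

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify_failure(exc: BaseException) -> int:
    """Map an exception raised during a TCP check to a plugin exit code."""
    if isinstance(exc, OSError):
        if exc.errno == errno.EIO:
            return UNKNOWN  # local I/O trouble on the monitoring host
        return CRITICAL     # refused, unreachable, timed out, and so on
    return UNKNOWN          # anything else, e.g. bad arguments


def check_tcp(host: str, port: int, timeout: float = 5.0) -> int:
    """Hypothetical check: OK if a TCP connection can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return OK
    except Exception as exc:
        return classify_failure(exc)
```

A real plugin would pass the returned value to `sys.exit()` after printing a one-line status message.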


>> Yes. Write an event-broker-module that upon a hard host-state change
>> alters the state of the services on that host to whatever you like.
>> You also need to prevent the status from changing in later service
>> checks, or you'd only have a very short period of time when the services
>> are actually in the state you pick.
> 
> IOW: this is not something lots of people are clamoring for.  Got it.
> 

They are, but there are more people clamoring for the opposite, and since
that's what we've already got, we're not about to change it ;-)
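
The logic such a module would need, as described in the quoted suggestion, can be simulated in a few lines. Real event broker modules are shared objects written in C against the Nagios headers; the sketch below does not use that API at all. The class and its method names are hypothetical and only model the two steps: pin the services on a hard host-state change, then keep later check results from overwriting the pin.

```python
class ServiceStatePinner:
    """Pins services to a chosen state while their host is hard DOWN."""

    def __init__(self, pinned_state: str = "UNKNOWN"):
        self.pinned_state = pinned_state
        self.down_hosts = set()
        self.service_states = {}  # (host, service) -> state

    def on_host_state_change(self, host: str, state: str, hard: bool) -> None:
        if not hard:
            return  # ignore soft state changes
        if state == "UP":
            self.down_hosts.discard(host)
            return
        self.down_hosts.add(host)
        # Step 1: force every known service on this host into the pinned state.
        for key in self.service_states:
            if key[0] == host:
                self.service_states[key] = self.pinned_state

    def on_service_check_result(self, host: str, service: str, state: str) -> str:
        # Step 2: while the host is down, later results must not undo the pin.
        if host in self.down_hosts:
            state = self.pinned_state
        self.service_states[(host, service)] = state
        return state
```

Without the second step the pinned state would last only until the next scheduled service check, which is the short window mentioned in the quoted text.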

> I think if I'm going to invest a lot of work into code, I'll spend it
> reskinning the clunky-looking CGIs instead.  :-)
> 

That could well be a wasted effort. Several UIs already exist, and more
are brewing. I'd suggest having a look at op5.org within a week or
so instead, and checking nagios.org and nagios-community.org for news about
GUIs (op5.org will only have a reports GUI though, while nagios.org
will primarily take care of the equivalent of status.cgi et al).

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231
