Services on down hosts

Andreas Ericsson ae at op5.se
Thu Jun 12 21:18:42 CEST 2008


Jay R. Ashworth wrote:
>>> On Wed, Jun 11, 2008 at 05:54:48PM +0200, Andreas Ericsson wrote:
>>>>> Well, since (to take an example), CRITICAL load means "a load average
>>>>> over 8" (on my 8-core Opteron), and we don't *know* the load average if
>>>>> the machine isn't reachable to return a value...  then the nrpe checker
>>>>> on the console in fact *is* getting an IO error when trying to, ok,
>>>>> read from a network socket.
>>>> I was more thinking along the lines of errno being set to EIO when
>>>> attempting to read(2) from an already connected network socket, although
>>>> there are two schools of thought about that too (some want all failures
>>>> to always alert, while others want a lot of things to be in UNKNOWN state).
>>>>
>>>> Not being able to connect clearly signals there is something wrong
>>>> with the service though, while an EIO signals that there's something
>>>> wrong with the Nagios host's kernel or hardware.
>>> My problem with that is that not all of what Nagios monitors is
>>> "services", in the meaning we usually give to that term.  Much of it is
>>> "attributes" -- load average and diskspace on a machine being great
>>> examples.
>> True that, but the service of storing a file on disk (or, for some
>> retarded filesystems, reading one from a disk) requires there to be a
>> minimum of free space available. It's what makes up the platform on
>> which the *real* services rest. Hence servicegroups (which together
>> make up what a service-provider would call a service).
> 
> Could you expand on that?  Do you mean to imply that a good use for a
> servicegroup is "all the physical services upon which my public website
> rests", as I think I read in your reply there?
> 

Yes, that's what I mean. Groups are first and foremost a visual aid
(never mind configuration, as that can be scripted). Having some random
point-and-click monkey on duty watching the servicegroup summary will
give you a quick warning of what the users will claim has broken down
when they call to complain.
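
To make the "servicegroup summary" idea concrete, here is a minimal sketch
of the roll-up such a view performs: take the worst state among the members
of each group. This is not Nagios code. The function name, the data layout
and the host/service names are all made up for illustration, and the
severity order used for the roll-up (OK < UNKNOWN < WARNING < CRITICAL) is
an assumption. Only the numeric plugin return codes are the standard ones.

```python
# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}

# Assumed severity order for the roll-up: OK < UNKNOWN < WARNING < CRITICAL.
SEVERITY = {OK: 0, UNKNOWN: 1, WARNING: 2, CRITICAL: 3}


def summarize(groups, states):
    """Return {group name: worst state name} for each servicegroup.

    groups: {group name: [(host, service description), ...]}
    states: {(host, service description): plugin return code}
    A member with no recorded state is counted as UNKNOWN.
    """
    summary = {}
    for group, members in groups.items():
        worst = OK
        for member in members:
            state = states.get(member, UNKNOWN)
            if SEVERITY[state] > SEVERITY[worst]:
                worst = state
        summary[group] = STATE_NAMES[worst]
    return summary


# Hypothetical data: everything the public mail service rests on.
groups = {"mail": [("mx1", "SMTP"), ("mx1", "Disk /var/spool"), ("imap1", "IMAP")]}
states = {
    ("mx1", "SMTP"): OK,
    ("mx1", "Disk /var/spool"): CRITICAL,
    ("imap1", "IMAP"): OK,
}

if __name__ == "__main__":
    for group, state in summarize(groups, states).items():
        print(f"{group}: {state}")
```

The person on duty only needs the one line for "mail" to know what the
users are about to call in about. The per-host, per-service states are
still there for whoever has to find where it broke.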

>>> IMHO, anything you're trying to monitor that's actually a "service" --
>>> IE: a public facing website -- shouldn't be directly attached to a host,
>>> anyway...
>>>
>>> What if you're Google?  Which host do you attach "http://www.google.com" 
>>> to?
>> All the query distributors (Google works by having several front-end servers
>> distributing the incoming queries to quite a large army of query responders,
>> which have access to GFS (the Google File System) for doing the
>> actual lookups). Since a monitoring tool is only worth something if it tells
>> you *where* things break rather than only that things are broken, that
>> makes perfect sense for a monitoring system even if that's not the case for
>> the service provider or its sales people.
> 
> Ok, Google was a poor choice.
> 
> Conversely, though, there may be cases where... or maybe there aren't.
> Let me muse on this some more.
> 

Muse away. :)

I'm fairly convinced the net ops team will still demand a system that
shows them where the problem is, though, while the IT support team
just wants to know what to say when the customers/users/whatever call
in and claim "the mail isn't working" (servicegroups help there).
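
Coming back to the EIO point quoted at the top: the distinction between
"could not connect" and "read(2) failed with EIO" can be sketched as a
small mapping from errors to plugin return codes. This is only an
illustration of the two schools of thought, not how check_nrpe or any
existing plugin behaves. The function name and the alert_on_everything
switch are made up, and treating every other error as UNKNOWN is an
assumption.

```python
import errno
import socket

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(exc, alert_on_everything=False):
    """Map an exception raised by a check's network I/O to a return code.

    alert_on_everything=True models the school that wants every failure
    to alert; the default models the school that prefers UNKNOWN for
    problems on the monitoring side.
    """
    if isinstance(exc, OSError) and exc.errno == errno.EIO:
        # EIO points at the monitoring host's own kernel or hardware,
        # not at the service being checked.
        return CRITICAL if alert_on_everything else UNKNOWN
    if isinstance(exc, (ConnectionError, socket.timeout, socket.gaierror)):
        # Not being able to connect says something is wrong with the service.
        return CRITICAL
    # Assumption: anything else is reported as UNKNOWN unless told otherwise.
    return CRITICAL if alert_on_everything else UNKNOWN
```

A check wrapper would call classify() in its except clause and pass the
result to sys.exit().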

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231
