Why separate hosts and services

Andreas Ericsson ae at op5.se
Thu Apr 15 22:07:26 CEST 2004


Chris Wilson wrote:
> Hi Andreas,
> 
> 
>>I can think of at least two good reasons.
>>
>>1) Problem localisation. When a service fails, someone has to fix it. If 
>>they don't know what machine it's on, the purpose of a monitoring system 
>>is soundly defeated.
>>
>>Of course, you could type in the host_address and host_alias in every 
>>service-description, but keeping things the way they are really saves a 
>>lot of typing compared to that.
> 
> 
> OK, that's a good point, but it could also be handled by inheriting 
> hostname from service to dependent service, unless overridden by the 
> dependent service.
> 
Not a very good idea, since many service dependencies involve relations 
between several hosts (switch interface operability connects to the db 
load balancer, which connects to the database servers).

> Another way would be to report the "path" through the "service tree" to 
> the failed service in the notification message. This might actually help 
> fault diagnosis. For example, if you receive separate notifications that 4 
> machines behind the same router have gone down at the same time, then you 
> might assume that the router might be at fault.
> 
Great idea. By simply adding a $PARENTS$ macro, this can easily be 
accomplished without modifying any core logic.
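
A minimal sketch of what such a macro might expand to, assuming the 
parent relations are available as a simple host-to-parents map. The 
map, the host names and both function names are hypothetical; this is 
not existing Nagios code, only an illustration of walking the 'parents' 
relation to build the "path" for a notification message.

```python
# Hypothetical host -> parents map, mirroring the 'parents' variable of
# host object definitions. All host names are made up for illustration.
PARENTS = {
    "web1": ["router1"],
    "web2": ["router1"],
    "router1": ["core-switch"],
    "core-switch": [],
}


def parent_paths(host, parents=PARENTS):
    """Return every path from `host` up to a host that has no parents.

    Assumes the parent relations contain no cycles.
    """
    ups = parents.get(host, [])
    if not ups:
        return [[host]]
    paths = []
    for parent in ups:
        for tail in parent_paths(parent, parents):
            paths.append([host] + tail)
    return paths


def format_parents(host, parents=PARENTS):
    """Render the paths as text that a notification message could carry."""
    return "; ".join(" -> ".join(path) for path in parent_paths(host, parents))
```

With the map above, format_parents("web1") and format_parents("web2") 
both end in "router1 -> core-switch", which is the hint a recipient 
would need to suspect the shared router.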

> At the moment, with the current notification architecture, I don't think
> you can have enough information to do that, without looking at the status
> CGIs or knowing from memory that the hosts are all behind the same router
> (which doesn't scale well :-)
> 
In larger networks there are usually different people handling different 
parts of it, and with a proper naming standard (with a little help from 
the 'alias' variable in the host object definition), this has never been 
a problem for any of our customers. Some of them have really huge networks.

> 
>>2) Notification suppression. If a service fails, Nagios immediately 
>>checks if the host is down. If it is, no more service checks will be 
>>scheduled until the host pops back up.
> 
> 
> But we already do the same thing for dependent services, don't we? I don't
> understand why the logic is different, and why they can't be combined into
> a single, simple if-down-then-check-parent-service algorithm.
> 
Check out the 'parents' variable in the host object definition.
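
As a rough illustration of how the 'parents' relation can drive this 
kind of suppression, here is a simplified sketch. It assumes a host 
counts as unreachable when it fails its check and none of its parents 
is up. The function names and the is_alive callback are hypothetical 
stand-ins, not the actual core logic.

```python
UP, DOWN, UNREACHABLE = "UP", "DOWN", "UNREACHABLE"


def classify_host(host, parents, is_alive):
    """Classify a host using its parents (simplified, assumes no cycles).

    parents:  dict mapping host name -> list of parent host names
    is_alive: callable(host) -> bool, standing in for the host check
    """
    if is_alive(host):
        return UP
    ups = parents.get(host, [])
    # No parents: the monitoring server reaches the host directly.
    if not ups:
        return DOWN
    # If at least one parent is up, the path works and the host is at fault.
    if any(classify_host(p, parents, is_alive) == UP for p in ups):
        return DOWN
    # Every parent is down or unreachable, so the fault lies further up.
    return UNREACHABLE


def should_notify_service(host, parents, is_alive):
    """Service problems are only worth a notification while the host is up."""
    return classify_host(host, parents, is_alive) == UP
```

With a router as the only parent of four machines, a router failure 
classifies the router as DOWN and the machines behind it as UNREACHABLE, 
so only the router needs to produce a notification.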

> 
>>Check out (host- and service-) dependencies. It's all properly documented.
> 
> 
> To my mind, service dependency is not the same as meta-services (which is
> what I'm talking about).
> 
> For example, let's assume we have three services, A, B and C. A is a 
> meta-service, and B and C "depend" on it. A does not have any check of its 
> own; its state is entirely determined from the states of its dependent 
> services. If B and C both fail, then A is determined to have failed, and 
> not otherwise. 
> 
This can be done today, using service dependencies.

> This is not the same as B and C both depending on A, because if B and C
> both fail, then how does one make A fail automatically in Nagios? I don't
> think it's possible, do you? 

Yes, I think it is possible. What you're talking about is a modification 
to the core logic. Having plugins check this would be 'the long way around'.
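
For reference, the aggregation rule itself is small. The sketch below 
assumes that "failed" means the CRITICAL plugin return code and that the 
child states have already been collected somehow; the function name is 
hypothetical, and an empty child list is arbitrarily reported as UNKNOWN.

```python
# Standard plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def meta_service_state(child_states):
    """State of a meta-service derived purely from its children.

    The meta-service fails only if every child has failed; otherwise it
    is OK. An empty child list is reported as UNKNOWN (an assumption).
    """
    if not child_states:
        return UNKNOWN
    if all(state == CRITICAL for state in child_states):
        return CRITICAL
    return OK
```

For the A, B, C example: meta_service_state([CRITICAL, CRITICAL]) gives 
CRITICAL for A, while meta_service_state([CRITICAL, OK]) leaves A at OK.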

> I guess it might involve writing a plugin to
> check the status of all children, and I don't know if Nagios would update
> the status.sav quickly enough that we would be able to determine this
> reliably in the parent check. Do you know if it does?
> 
status.sav is the default state retention file, so we can't even count 
on it being there. status.log gets updated about 1 second after a state 
change and should be more interesting for something like this.

> Besides which, we would have to parse both the configuration files and
> status.sav to determine this, and neither of those is easy to do.

Not a problem, really. Especially considering the fact that all the code 
to do both is right under the nose of anybody who cares to download the 
sources.

> 
> Cheers, Chris.

-- 
Mvh / Best Regards
Sourcerer / Andreas Ericsson
OP5 AB
+46 (0)733 709032
andreas.ericsson at op5.se

