Why distinguish hosts from services?

Andreas Ericsson ae at op5.se
Sat Aug 9 14:35:22 CEST 2008
Holger Weiss wrote:
> * Andreas Ericsson <ae at op5.se> [2008-08-07 10:32]:
>> Holger Weiss wrote:
>>> Nagios implements a basic design decision I never quite understood: the
>>> distinction between hosts and services.  This distinction seems to add
>>> quite a lot of complexity, such as duplicated code, four different types
>>> of dependencies (parents, host dependencies, service dependencies, and
>>> the implicit service->host dependencies), and so on.  I don't really see
>>> the gain over simply dealing with arbitrary "objects" and dependencies
>>> between them, which would reduce complexity and provide more flexibility
>>> (such as the possibility to let some service depend on a host it's not
>>> running on, or the other way round).
>>>
>>> Note that I don't doubt the usefulness of syntactic configuration sugar,
>>> such as the implicit service->host dependencies or the nice and simple
>>> way of mapping the network topology using the "parents" directive.  The
>>> thing I don't really understand is why Nagios distinguishes hosts from
>>> services internally (outside the configuration parser).  However, I may
>>> well be overlooking something, so I figured I'd ask what it is :-)
>>>
>>> In any case, giving up such a basic distinction would of course require
>>> dramatic changes to Nagios' core, and I'm not seriously suggesting doing
>>> something like that anytime soon (so this posting probably isn't very
>>> constructive, sorry).  I'm just asking out of curiosity.
>> I believe it originated from the fact that object dependencies originally
>> consisted almost solely of the implicit service->host dependencies, which
>> came naturally from just thinking about the network in the first place.
>>
>> Anyway, I'm not convinced that re-arranging the dependency stuff will make
>> things any easier. It's not exactly hard to do it properly in the nagios
>> core today, and I'm having trouble imagining a simple enough config syntax
>> without the host->parent dependency stuff. Have you thought anything about
>> that? If so, what's your suggestion?
> 
> I wouldn't want to give up stuff like the "parents" directive or
> implicit service->host dependencies.  While I can imagine a syntax which
> would give the user more control over their semantics, increasing the
> user's flexibility isn't really my main point.  My question is why
> hosts, services, and the various dependency types are handled separately
> in Nagios' core, as opposed to them just being syntactic sugar which is
> resolved into generic objects and dependencies by the configuration
> parser.
> 
> This question first came to my mind while stumbling over the issues with
> Nagios 2.x's host check logic and some problems with host dependencies.
> While thinking about how they should be fixed, I thought that the
> service check and dependency logic already works quite well, and as I
> couldn't really see the inherent difference of host and service objects,
> I thought about whether the separate logic for hosts could maybe just be
> dropped in favour of a generic logic for all monitored objects and their
> dependencies.  (IIRC, there even existed some project which suggested to
> avoid host checks by replacing them with service checks/dependencies
> entirely?)  Anyway, with Nagios 3.x, these issues are mostly solved, so
> if that had been the first Nagios release I used, I might never have
> thought about it :-)
> 
> However, my (naive?) thought would still be that dealing with generic
> objects and dependencies between them could significantly reduce
> complexity and duplication of code.  Nagios' core includes loads of
> host_foo() and service_foo() functions which do similar stuff (or
> different stuff, but I've yet to see a case where I really understand
> why the difference is necessary), and it includes separate code for the
> different dependency types.
> 
> To give a concrete example of a problem I still have with Nagios 3.x
> which gives me the feeling that these distinctions sometimes complicate
> things unnecessarily:
> 
> We use separate host definitions for separate interfaces (so for us, the
> "host" keyword should really be named "interface" ;-]).  For each host,
> there's a "primary" interface which all other interfaces depend on using
> host dependencies.  Now, for example, if we upgrade a system, we'd like
> to just specify a downtime for the primary interface to make sure that
> no host or service notifications will be generated whatsoever.  If we
> just reboot the host, things work as expected.  But during an upgrade,
> some services will usually go into a hard problem state while the system
> is still UP.  In this case, only the notifications for the services
> running on the primary interface will be suppressed, because Nagios does
> suppress service notifications if the host the service runs on is in a
> downtime, but not if only a host this host depends on is in a downtime.
> 
> Similar problems can occur with parents: if a parent is in a downtime,
> but the parent's host check returns UP because the parent still pings
> although it has already stopped routing, notifications for the children
> won't be suppressed.  Or for service dependencies (though maybe less
> likely): if the dependent-upon service is in a downtime and the
> dependent service is stopped before the dependent-upon service is
> stopped, notifications for the dependent service won't be suppressed.
> 
> Apart from that, it would be nice if objects which directly or
> indirectly depend on an object which is in a downtime would also have
> some "downtime" status flag set, so that tools such as the web interface
> could easily mark them as such.  But that's just cosmetic.
> 
> To fix such problems once and forever, I'd have to implement various
> logics at different places in the code: (1) don't notify on a host if a
> directly or indirectly dependent-upon host is in a downtime; (2) don't
> notify on the services running on this host; (3) don't notify on a
> service if a directly or indirectly dependent-upon service is in a
> downtime; (4) don't notify on a host if a direct or indirect parent is
> in a downtime (with redundant paths accounted for); (5) maybe don't
> notify on the services running on this host, either, just to make sure.
> My dream is that with generic object types and dependencies, I could
> implement a recursive check for downtimes of dependent-upon objects at a
> single place in the code and be done with it, which would be much
> simpler and less error-prone.
> 
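
For reference, the single recursive downtime check described above could be
sketched roughly as follows. This is only an illustration under assumptions:
"struct object", its fields, and downtime_in_effect() are hypothetical names
for a generic monitored object and are not taken from the Nagios source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical generic monitored object; not an actual Nagios type. */
struct object {
	const char *name;
	bool in_downtime;            /* a scheduled downtime is active */
	bool visiting;               /* guards against dependency cycles */
	struct object **depends_on;  /* direct dependencies of any kind */
	size_t n_depends_on;
};

/* True if obj, or anything it depends on directly or indirectly,
 * is currently in a downtime. */
static bool downtime_in_effect(struct object *obj)
{
	bool found = false;

	assert(obj != NULL);
	if (obj->in_downtime)
		return true;
	if (obj->visiting)           /* already on the current path: a cycle */
		return false;

	obj->visiting = true;
	for (size_t i = 0; i < obj->n_depends_on && !found; i++)
		found = downtime_in_effect(obj->depends_on[i]);
	obj->visiting = false;

	return found;
}
```

Note that this sketch treats every dependency as mandatory, so one
dependency in a downtime is enough to suppress. Redundant parents, as
mentioned in point (4), would need "all paths are in a downtime" semantics
for that kind of edge instead.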

A much simpler way of doing it is to store the "notification_options" field
of both host and service objects as bit flags (well, everything that could be
flags should be flags, really). Then it becomes a matter of doing bitfield
comparisons to see whether a notification should be suppressed, regardless
of which type of object it is.

One trouble is that making this fully generic, regardless of which type of
object you're checking, means both hosts and services would need to
understand the same sort of check results, as well as the same kind of
notification options and everything that gets affected by such things. The
data structs for both types of objects would then need to be identical, which
would waste memory on an O(n) scale, rather than paying the fixed-price
overhead of almost duplicating some of the code.

Now consider this instead:
if ((host->notification_options & contact->notification_options) & (1 << host->status))
	send_notification();

And then imagine you've got a macro for it, which goes like this:
#define should_notify(obj, contact) \
	(((obj)->notification_options & (contact)->notification_options) & (1 << (obj)->status))

which means you can get the best of both worlds for the things that are
actually the same (or at least similar enough), while maintaining the
implicit dependencies without wasting memory in such a horrible
non-scalable way.
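
To make that concrete, here is a minimal sketch of the macro applied to two
differently shaped structs. The struct layouts, the state numbering in the
comments, and the state_bit() helper are assumptions made up for this
example, not the real Nagios definitions. The only requirement is that both
object types carry fields named "notification_options" and "status".

```c
#include <assert.h>

/* Hypothetical, trimmed-down object types. Everything apart from the
 * two fields the macro touches is free to differ between them. */
struct host {
	const char *name;
	unsigned notification_options;  /* bit n set = notify on state n */
	int status;                     /* e.g. 0 = UP, 1 = DOWN */
};

struct service {
	const char *description;
	struct host *host;              /* implicit service->host dependency */
	unsigned notification_options;
	int status;                     /* e.g. 0 = OK, 1 = WARNING, 2 = CRITICAL */
};

struct contact {
	unsigned notification_options;
};

/* Map a state number to its flag bit; guards against an out-of-range shift. */
static inline unsigned state_bit(int status)
{
	assert(status >= 0 && status < 32);
	return 1u << status;
}

/* Works for any struct pointer that has the two fields above. */
#define should_notify(obj, contact) \
	(((obj)->notification_options & (contact)->notification_options \
	  & state_bit((obj)->status)) != 0)
```

With a host in state 1 that has bits 1 and 2 set, and a contact with the
same bits set, should_notify(&host, &contact) is true. A service in state 1
that only has bit 2 set gives false with the same contact. Neither struct
had to grow to match the other.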

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231
