Why distinguish hosts from services?

Andreas Ericsson ae at op5.se
Mon Aug 11 10:05:56 CEST 2008


Holger Weiss wrote:
> * Andreas Ericsson <ae at op5.se> [2008-08-09 14:35]:
>> Holger Weiss wrote:
>>> We use separate host definitions for separate interfaces (so for us, the
>>> "host" keyword should really be named "interface" ;-]).  For each host,
>>> there's a "primary" interface which all other interfaces depend on using
>>> host dependencies.  Now, for example, if we upgrade a system, we'd like
>>> to just specify a downtime for the primary interface to make sure that
>>> no host or service notifications will be generated whatsoever.  If we
>>> just reboot the host, things work as expected.  But during an upgrade,
>>> some services will usually go into a hard problem state while the system
>>> is still UP.  In this case, only the notifications for the services
>>> running on the primary interface will be suppressed, because Nagios does
>>> suppress service notifications if the host the service runs on is in a
>>> downtime, but not if only a host this host depends on is in a downtime.
>>>
>>> Similar problems can occur with parents: if a parent is in a downtime,
>>> but the parent's host check returns an UP because the parent still pings
>>> although it stopped routing already, notifications for the child(s)
>>> won't be suppressed.  Or for service dependencies (though maybe less
>>> likely): if the dependent-upon service is in a downtime and the
>>> dependent service is stopped before the dependent-upon service is
>>> stopped, notifications for the dependent service won't be suppressed.
>>>
>>> Apart from that, it would be nice if objects which directly or
>>> indirectly depend on an object which is in a downtime would also have
>>> some "downtime" status flag set, so that tools such as the web interface
>>> could easily mark them as such.  But that's just cosmetic.
>>>
>>> To fix such problems once and forever, I'd have to implement various
>>> logics at different places in the code: (1) don't notify on a host if a
>>> directly or indirectly dependent-upon host is in a downtime; (2) don't
>>> notify on the services running on this host; (3) don't notify on a
>>> service if a directly or indirectly dependent-upon service is in a
>>> downtime; (4) don't notify on a host if a direct or indirect parent is
>>> in a downtime (with redundant paths accounted for); (5) maybe don't
>>> notify on the services running on this host, either, just to make sure.
>>> My dream is that with generic object types and dependencies, I could
>>> implement a recursive check for downtimes of dependent-upon objects at a
>>> single place in the code and be done with it, which would be much
>>> simpler and less error-prone.
>> A much simpler way of doing it is to set the "notification_options" field
>> in the host and service-objects to flags (well, everything that could be
>> flags should be flags, really), then it becomes a matter of doing bitfield
>> comparisons to see if a notification should be suppressed or not,
>> regardless of which type of object it is.
> 
> If it were done this way, I'd still have to implement the various checks
> I mentioned in order to set the "dependent-upon object is in a downtime"
> flag.  So, while your suggestion would save some memory and allow for
> using generic macros to compare the current state of an object with the
> configured notification_options, it wouldn't really solve my problem.
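
For what it's worth, the single recursive downtime check over generic
objects could be sketched roughly as below. Everything here is hypothetical:
the `object` struct, its fields, `MAX_DEPS` and `downtime_in_effect()` are
made up for illustration and are not Nagios code.

```c
#include <stddef.h>

#define MAX_DEPS 8	/* arbitrary limit, for the sketch only */

struct object {				/* hypothetical generic object */
	int in_downtime;		/* non-zero while a downtime is active */
	int visiting;			/* guards against dependency loops */
	size_t num_deps;
	struct object *deps[MAX_DEPS];	/* objects this one depends on */
};

/*
 * Returns 1 if obj itself, or anything it directly or indirectly
 * depends on, is in a downtime. Returns 0 otherwise.
 */
static int downtime_in_effect(struct object *obj)
{
	size_t i;
	int result = 0;

	if (obj->in_downtime)
		return 1;
	if (obj->visiting)	/* already on the current path: a loop */
		return 0;

	obj->visiting = 1;
	for (i = 0; i < obj->num_deps && !result; i++)
		result = downtime_in_effect(obj->deps[i]);
	obj->visiting = 0;

	return result;
}
```

Note that this treats a downtime on any dependency as sufficient. Redundant
parent paths would need "all paths" rather than "any path" semantics, which
the sketch does not attempt.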
> 
>> One trouble is that to make this generic regardless of which type of object
>> you're checking it against means both hosts and services would need to
>> understand the same sort of check results
> 
> Yes, I just fail to see the trouble.
> 

HOST_DOWN and SERVICE_WARNING (among others) share the same numeric value,
as do several other host and service constants. In other words, code handling
a generic object currently can't interpret its state without knowing which
kind of object it is.

It *would* be possible if you pulled the rug out from under the feet of
all NEB-module authors (as well as being willing to discard all retention
data in the world for at least one restart) and simply re-define those values,
but that's hardly a stellar solution when the net gain is slightly simpler
code.
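
To make the overlap concrete, here is a minimal sketch. The constant names
follow the ones used in this mail and the values assume the usual numbering
(hosts 0..2, services 0..3). The `generic_obj` struct, the `obj_type` tag and
`state_name()` are hypothetical and exist only to show why a type tag is
needed.

```c
#include <stddef.h>

/* Assumed numbering: host and service states are counted independently. */
#define HOST_UP			0
#define HOST_DOWN		1
#define HOST_UNREACHABLE	2

#define SERVICE_OK		0
#define SERVICE_WARNING		1
#define SERVICE_CRITICAL	2
#define SERVICE_UNKNOWN		3

enum obj_type { OBJ_HOST, OBJ_SERVICE };	/* hypothetical type tag */

struct generic_obj {				/* hypothetical generic object */
	enum obj_type type;
	int status;
};

/* The same status value maps to a different name depending on the type. */
static const char *state_name(const struct generic_obj *obj)
{
	static const char *host_names[] = { "UP", "DOWN", "UNREACHABLE" };
	static const char *svc_names[] = { "OK", "WARNING", "CRITICAL", "UNKNOWN" };

	if (obj->type == OBJ_HOST)
		return (obj->status >= 0 && obj->status <= 2) ?
			host_names[obj->status] : NULL;
	return (obj->status >= 0 && obj->status <= 3) ?
		svc_names[obj->status] : NULL;
}
```

A host in HOST_DOWN and a service in SERVICE_WARNING both carry status 1, so
without the type tag there is no way to tell the two apart.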

>> as well as the same kind of notification options and everything that
>> gets affected by such things
> 
> Same here.
> 
>> the data structs for both types of objects would need to be identical, which
>> would waste memory on an O(n) scale, rather than the fixed-price overhead of
>> almost duplicating some of the code.
>>
>> Now consider this instead:
>> if ((host->notification_options & contact->notification_options) & (1 << host->status))
>> 	send_notification;
>>
>> And then think you've got a macro for it, which goes like this:
>> #define should_notify(obj, contact) \
>> 	(((obj)->notification_options & (contact)->notification_options) & (1 << (obj)->status))
>>
>> which means you can get the best of both worlds for the things that are
>> actually the same (or at least similar enough), while maintaining the
>> implicit dependencies without wasting memory in such a horrible
>> non-scalable way.
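
Spelled out as a self-contained sketch, the flag-based check could look like
this. The struct layouts and the `NOTIFY_ON()` helper are assumptions made
for illustration (bit N stands for status value N), not the actual Nagios
definitions.

```c
/* Hypothetical helper: bit N corresponds to status value N. */
#define NOTIFY_ON(status) (1u << (status))

struct contact {
	unsigned notification_options;
};

struct host {
	unsigned notification_options;
	int status;
};

struct service {
	unsigned notification_options;
	int status;
};

/*
 * Works on any struct that has notification_options and status members,
 * so hosts and services can stay separate types.
 */
#define should_notify(obj, contact) \
	(((obj)->notification_options & (contact)->notification_options) \
	 & (1u << (obj)->status))
```

With a contact that wants states 1 and 2, a host configured for state 1 and
currently in state 1 yields a non-zero result, while a service configured
only for state 2 and currently in state 1 yields zero.
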
> 
> Your argument depends on the assumption that there's some inherent
> difference between host and service objects.  If this is true, then
> memory would be wasted by including object type specific data into
> generic data structures.  However, my question was specifically whether
> this assumption actually holds.  I know that Nagios currently believes
> that only a service can be in a WARNING or UNKNOWN state and that only a
> host can be in an UNREACHABLE state.  So far, I'm not convinced these
> dogmas are true :-)
> 

*shrug* I haven't thought a lot about it, but before you start hacking,
please consider the implications I listed above. Those implications make
changes like this 4.0 material, imo.

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231
