About a active plugin in local machine

Paul L. Allen pla at softflare.com
Wed Jun 22 19:03:22 CEST 2005


Harlan Richard C writes: 

> I know about distributed monitoring, and we have run it before, the
> question the way I read it was had nothing to do with distributed
> monitoring but sending a change of state of the service.

It was a mixture of both.  He wants to monitor machines on remote networks
but (I assume) has found that active checks put too much load on his
Nagios box or he can't get the checks through firewalls.  His solution
is to use what is effectively *very* distributed monitoring by putting
NC_Net on every server he wants to monitor and having each monitored
server submit its own passive checks.  I would guess he's doing that
because he can't install a Linux box with Nagios in each of the
remote sites.  So he is doing distributed monitoring, although whether
for sensible reasons or not we do not know for sure. 

What he also wanted to do was submit passive results only when a service
changed state, in order to reduce the load.  That would be reasonable IF
he had a remote Nagios box doing the checks and IF he either performed
active checks on the remote Nagios box or had staleness checks on the
passive results the remote Nagios box submitted for its own services.
In that situation if a service on the monitored server goes down the
remote Nagios box spots the problem and submits a passive result.  If
the remote Nagios box goes down then the master Nagios box spots the
problem. 

But with each server monitoring itself and submitting results only if a
state changes there are failure modes he cannot detect.  He can't enable
staleness checking because the results don't come in on a regular basis.
That means if he only submits state changes and the server itself dies
or goes up in flames, or loses power, or whatever, his Nagios server will
continue to think the service is up. 
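
To make that failure mode concrete, here is a minimal sketch of a
staleness test in Python.  It is not Nagios code; the is_stale helper,
the 10-minute threshold and the timestamps are all assumptions chosen
for illustration.  It only shows why silence carries no information
when results are sent on state changes alone.

```python
def is_stale(last_result_time, now, freshness_threshold):
    """Return True if no passive result arrived within the threshold (seconds)."""
    return (now - last_result_time) > freshness_threshold


# Hypothetical threshold: a result is expected at least every 10 minutes.
THRESHOLD = 600

# Regular submission: a result 5 minutes old is fresh, an hour of silence is not.
regular_healthy = is_stale(last_result_time=0, now=300,
                           freshness_threshold=THRESHOLD)
regular_dead = is_stale(last_result_time=0, now=3600,
                        freshness_threshold=THRESHOLD)

# State-change-only submission: a perfectly healthy server may send nothing
# for a week, so the same test flags it exactly as it would flag a dead one.
ONE_WEEK = 7 * 24 * 3600
quiet_but_healthy = is_stale(last_result_time=0, now=ONE_WEEK,
                             freshness_threshold=THRESHOLD)
```

With regular submission the test separates the healthy case from the
dead one.  With state-change-only submission a healthy but quiet server
also looks stale, so the threshold would have to be disabled, and then
a dead server goes unnoticed.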

As somebody else suggested, you can add check_ping as a service check
to the monitored servers, but you can get ping responses from machines
that have got themselves into a state where nothing else is working.  So
the web server, mail server, and NC_Net could all be locked up, but ping
is OK so you think everything is working. 

> Witch is a valid thing to want to do. The metrics over all will not be
> off, if the service goes down Nagios will get the passive results sent
> to it, if the box goes down you have another check to allow Nagios to
> down the box.

Which ought to be better than a ping check.  It needs to be a check of
one of the essential services.  Except it could well be that each of the
servers he's monitoring has only one or two essential network services
(we have a client that has several computers that are each dedicated to
being web servers and nothing else, other computers dedicated to being
mail servers and nothing else, other computers that are dedicated to
being MS SQL servers and nothing else, etc.)  If his servers are like
that then he's back in the situation of active checks.  And even if
his servers run multiple services and he actively checks only one, there's
a small possibility that one or more of the passively checked services
could fail along with NC_Net but the service he actively checks continues
to give good results.  I admit it's a very small chance, but it's a real
one.  For instance, running out of disk space might kill SMTP and NC_Net
(if he's having it write to a log) but DNS could continue to work.  DNS
is a lightweight protocol so would be the obvious choice to reduce the
load placed on the main Nagios box. 

> If the Nagios is setup with passive check with a time out Nagios will 
> force the check it self get an unknown and then ping the box.

That is the normal way of doing it.  But he wanted to only submit state
changes, which rules out staleness checking - the service could be up
for days or weeks (or even longer with a non-Microsoft OS) without a
single result being submitted. 

> But over all I still think that if you are waiting for the host to
> update Nagios about the state of the service then it is a non critical
> check.

I don't think it is unsuitable for checking critical services provided
you have staleness checking enabled or at least one active service check
for each server.  Without staleness checks or the overhead of at least one
active service check per server you have too many failure modes that
will result in you not knowing services have gone down.  And that was the
point I was trying to get through to him, that submitting only state
changes would mean either that the monitoring was unreliable or that he'd
have to do other things in order to get anything he could trust. 

> That is not the same a down stream Nagios box running a active check
> then sending the data to the Main Nagios server. 

There's no real difference provided you do it right (regular submission
of passive check results whether there has been a state change or not
together with staleness checking).  It's just very, very distributed
monitoring. 
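
As a rough sketch of "regular submission whether there has been a state
change or not", the Python below formats a passive service result in
the Nagios external command format (PROCESS_SERVICE_CHECK_RESULT) and
writes it to a file-like object.  The function names, the host "web01"
and the service "HTTP" are hypothetical.  In the setup discussed here
the result would travel over the network from the monitored server
rather than being written locally, so treat this as an illustration of
the line format only.

```python
import io
import time


def format_passive_result(host, service, return_code, output, timestamp=None):
    """Build one PROCESS_SERVICE_CHECK_RESULT external command line."""
    if timestamp is None:
        timestamp = int(time.time())
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n" % (
        timestamp, host, service, return_code, output)


def submit_heartbeat(command_file, host, service, return_code, output,
                     timestamp=None):
    """Write the current state every time it is called, changed or not."""
    line = format_passive_result(host, service, return_code, output, timestamp)
    command_file.write(line)
    return line


# Stand-in for the real destination (assumption: any writable text stream).
buf = io.StringIO()
submit_heartbeat(buf, "web01", "HTTP", 0, "OK - still up",
                 timestamp=1119459802)
```

Calling submit_heartbeat on a fixed schedule, shorter than the
staleness threshold on the master, is what lets silence be treated as a
failure.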

-- 
Paul Allen
Softflare Support 



