Nagios dependency question

Carroll, Jim P [Contractor] jcarro10 at sprintspectrum.com
Thu Dec 26 23:05:06 CET 2002


[CC'ing to the list for the sake of continuity.]

Scott Whitney wrote:
> Thanks for the help, but parents only solve part of the equation.  Here's
> the actual dependency structure:
> 
> nagios --> NAT Router --> Cisco Router --> INTERNET --> coloc router -->
> ping machine --> httpd machine --> check_site (mine)

That's what I'd figured (for the most part).  The only part I'm unfamiliar
with is check_site.  From what you're now saying, check_site can fail on its
own, but if check_http fails, you'll get a notification for the failed
check_http *as well as* a failed check_site (times 55)?  If you can think of
even one scenario where check_http will fail but check_site will come back
OK, you may not want to define that dependency.

> Now, check_site is a service similar to httpd, but it looks for certain
> things that the web app throws back in the headers.

I take it that your check_site plugin does something that you weren't able
to do with either the -e or the -s option in check_http...?  Just curious;
sometimes the finer details of the plugins are overlooked.
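
For reference, this is the sort of thing I have in mind, as a sketch only.
The command name, URL path and match strings are made up for illustration,
and you should check `check_http --help` on your version for exactly what
-e and -s match against:

```cfg
# checkcommands.cfg -- illustrative only; 'check_http_app' is a made-up name.
# -e: string expected in the first (status) line of the response
# -s: string expected in the returned content
define command{
        command_name    check_http_app
        command_line    $USER1$/check_http -H $HOSTADDRESS$ -u $ARG1$ -e "$ARG2$" -s "$ARG3$"
        }

# Used from a service definition as, for example:
#       check_command   check_http_app!/app1/login!HTTP/1.1 200!Welcome
```

If what you need really does live in the response headers and neither option
reaches it, that would explain the custom plugin.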

If check_site is basically an enhancement to check_http, I'm wondering why
you're using check_http at all.

Having said that, it's worth noting that (in a proper config) a service on a
given host depends on the host itself being available.  If a service check
fails, Nagios checks whether the host itself is still up, and if it isn't,
it alerts only on the failed host, not on the N failed services.

Taking that approach, once you define parents (as per my previous e-mail),
then if the remote router fails, everything behind that will be
indeterminate (you have no way of knowing whether they're up or down), and
if the local router fails, then everything behind that (including the remote
router) will be indeterminate.
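
As a rough sketch of what I mean, the chain could be declared along these
lines.  The host names, the 192.0.2.x addresses and the 'generic-host'
template are placeholders I've made up, not your actual config; the template
is assumed to supply the required directives (host check command,
max_check_attempts, notification settings):

```cfg
# hosts.cfg -- illustrative only; names, addresses and the
# 'generic-host' template are assumptions, adjust to your setup.

# First hop from the Nagios box: no 'parents' directive needed.
define host{
        use             generic-host
        host_name       nat-router
        alias           NAT Router (here)
        address         192.0.2.1
        }

define host{
        use             generic-host
        host_name       cisco-router
        alias           Cisco Router (here)
        address         192.0.2.2
        parents         nat-router
        }

define host{
        use             generic-host
        host_name       coloc-router
        alias           Coloc Router (there)
        address         192.0.2.3
        parents         cisco-router
        }

define host{
        use             generic-host
        host_name       machine
        alias           Web app box at the colo
        address         192.0.2.4
        parents         coloc-router
        }
```

Each host names only its immediate parent; Nagios follows the chain from
there.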

> So, what I've got, then is:
> 
> NAT Router (host) --> Cisco Router (host) --> Coloc Router (host) -->
> machine (host) --> httpd (service) --> check_site (service)
> 
> If I'm checking 55 sites (service checks) per minute, and let's say that
> Coloc Router is down, I want the logic to work as such:
> 
> Well, Coloc Router is down, which means that everyone who has Coloc Router
> as a parent is down, and therefore you can't possibly check the services on
> those machines.  Nagios should not, in this case, log the services as down,
> but rather the machines.  The docs are unclear on the status of checking a
> service with parent hosts down.  Do you know how this works?

If a router is down, you can't flag the hosts *or* the services behind that
router as down.  Or up.  How could you possibly know?  They're indeterminate
(unreachable).  You would need an out-of-band connection to check whether a
remote host/service is down/up.

> Does that make sense?

Sure does.  Just try not to get caught up in the excitement of defining
dependencies.  I've gone through that as well; you'll find yourself doing a
*lot* more keystrokes for effectively the same thing to meet your particular
needs.  (I did it so that for the N NRPE checks on a given host, there are
N-1 dependencies on a very basic "echo OK - NRPE is working" test.  Take
those definitions and try rolling them out across all your hosts... not
fun.)
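
To give a feel for the typing involved, here is roughly what one such
dependency looks like.  The host name and service descriptions are invented
for the example; you'd need one stanza like this for each of the other N-1
NRPE services, on every host:

```cfg
# dependencies.cfg -- illustrative only; 'machine', 'NRPE Alive' and
# 'NRPE Disk' are made-up names.
# If 'NRPE Alive' is in a non-OK state, don't run 'NRPE Disk' and
# don't send notifications about it.
define servicedependency{
        host_name                       machine
        service_description             NRPE Alive
        dependent_host_name             machine
        dependent_service_description   NRPE Disk
        execution_failure_criteria      w,u,c
        notification_failure_criteria   w,u,c
        }
```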

Try using the parents approach.  You may be pleasantly surprised.

jc

> Thanks, again.
> 
> Scott
> ----- Original Message -----
> From: "Carroll, Jim P [Contractor]" <jcarro10 at sprintspectrum.com>
> To: "'Scott Whitney'" <swhitney at journyx.com>;
> <nagios-users at lists.sourceforge.net>
> Sent: Thursday, December 26, 2002 1:46 PM
> Subject: RE: [Nagios-users] Nagios dependency question
> 
> 
> > Scott Whitney wrote:
> > > Background:
> > > a) Nagios runs "here"
> > > b) There is a router "here"
> > > c) It goes across the Internet to my coloc site (call it "there")
> > > d) There is a router "there"
> > > e) For the purposes of this example, there is 1 "machine" there
> > > f) "machine" runs httpd
> > > g) this httpd is shared for all web apps on the box, of which
> > > there are 55
> > > h) I have a script which checks the status of this web app.
> > >
> > > Here's my problem.  When the router, here, is down, I get 59
> > > messages.  That
> > > is, router "here", router "there", machine ping, machine
> > > httpd + 55 sites.
> >
> > Heh.  I'm not aware of a check_sites plugin.  ;-)  How are you actually
> > checking whether a 'site' is up or down?  (This might be moot, but it might
> > be useful info.)
> >
> > > I can solve this using dependencies, but here's my question.
> >
> > You *can*, but I would recommend you define parents in your hosts.cfg.  Much
> > gentler on the ol' grey matter.  Read:
> >
> > http://your_nagios_server/nagios/docs/xodtemplate.html#host
> >
> > and go straight to the 'parents' directive.  Worth noting:  You don't need
> > to define *all* the parent nodes in a given hosts config definition.  In
> > your case, 'router there' would be the parent for all the hosts at the colo,
> > 'router here' would be the parent to 'router there'.  (I'm not entirely
> > certain how/when you would want to define multiple parents, quite honestly,
> > but I'm taking the simple approach, and it works quite well.)
> >
> > > For the dependencies to work properly, each of the sites must
> > > be dependent
> > > on:
> > >     a) httpd
> > >     b) ping machine
> > >     c) ping router "there"
> > >     d) ping router "here"
> > >
> > > Let's assume I check this every minute.  My math says that
> > > this is roughly
> > > 280 hits on httpd per minute (55 * 5 + 5), 280 pings to the
> > > machine per
> > > minute, 280 pings to the router there per minute and 280
> > > pings to the router
> > > here per minute.
> >
> > I'm not sure how you arrive at those values.  According to what you've told
> > us, you have 3 pingable IP addresses, therefore you would get 1 ping per
> > node per minute.  As for httpd hits, you've so far stated that you have 1
> > httpd daemon, so we can only extrapolate that only 1 socket is being
> > listened on.  Perhaps once you've clarified how you define a 'site' and how
> > you're checking each site, that'll become clearer.  (If you mean that you've
> > defined 55 IP aliases and you're pinging each one, etc...?)
> >
> > > This gets a little worse when you realize I actually have
> > > over 200 sites,
> > > not 55.  Also on 7 boxes, not one, so we're looking at more
> > > like 1005 per
> > > minute, spread unevenly across several boxes.
> >
> > Not quite that bad, given the logic I've been following.
> >
> > > The question, then, is whether anyone has run into this
> > > and/or does Nagios
> > > take this into consideration via any caching mechanism?  The
> > > documentation
> > > says
> >
> > Do you mean caching a ping or an httpd check?  That somewhat defeats the
> > purpose of doing the check to begin with, I'd think, even assuming that you
> > could do so.
> >
> > > "Before Nagios executes a service check or sends
> > > notifications out for a
> > > service, it will check to see if the service has any
> > > dependencies. If it
> > > doesn't have any dependencies, the check is executed or the
> > > notification is
> > > sent out as it normally would be. If the service does have one or more
> > > dependencies, Nagios will check each dependency entry as follows:
> > > Nagios gets the current status* of the service that is being
> > > depended upon.
> > > "
> > >
> > > * by default this is the current HARD state
> > >
> > > So...from where is it getting this information?  Further
> > > perusal through the
> > > theory section helps me not at all...
> > >
> > > Anyone have ideas on this?
> >
> > Yes.  Just skip the whole dependencies bit for now, and focus on parents.
> > Unless you're really really in the mood for visualizing a 4-D Klein bottle.
> > ;)
> >
> > jc
> >
> > > Thanks,
> > >
> > > Scott Whitney
> > > swhitney at Journyx.com
> > >
> > >
> > >
> > > -------------------------------------------------------
> > > This sf.net email is sponsored by:ThinkGeek
> > > Welcome to geek heaven.
> > > http://thinkgeek.com/sf
> > > _______________________________________________
> > > Nagios-users mailing list
> > > Nagios-users at lists.sourceforge.net
> > > https://lists.sourceforge.net/lists/listinfo/nagios-users
> > >
> >
> 

