Nagios Strategies

John P. Rouillard rouilj at cs.umb.edu
Wed Dec 27 20:58:52 CET 2006


In message <3c4611bc0612271118m6da6a868o34b269d31583cd57 at mail.gmail.com>,
"Brian Loe" writes:
>One of the things Tivoli had promised to provide our company was
>end-to-end insight of our Java applications. It doesn't work. Any
>product that does this is going to be difficult to create, implement
>or administer. In our case, the easiest method for verifying the
>uptime and performance of our applications is to have our developers
>build it into the application.

There are a few frameworks out there that support adding stats for
external consumption.

You can also get some coverage in the same way your QA testers do. I
have re-purposed some QA tests that provide cross coverage (so if test
1 and test 3 fail in a certain way, but test 2 passes I know what
component has failed). Sadly most QA tests take a while to run and
place a significant load on the system, but it is a way of getting
reasonable code/monitoring coverage.
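
As a rough illustration of the cross-coverage idea, the sketch below maps
the set of failing tests to a suspected component. The test names, the
component names and the SIGNATURES table are all made up for the example;
a real table would come from knowing what each QA test exercises.

```python
# Hypothetical signature table: set of failing tests -> suspected component.
# All names here are illustrative assumptions, not taken from a real system.
SIGNATURES = {
    frozenset({"test1", "test3"}): "component A",
    frozenset({"test2"}): "component B",
    frozenset({"test1", "test2", "test3"}): "shared dependency",
}


def diagnose_from_tests(results):
    """Map a dict of test name -> passed (bool) to a suspected component."""
    failed = frozenset(name for name, passed in results.items() if not passed)
    if not failed:
        return "all tests passed"
    # Patterns that are not in the table are reported as unknown, not guessed.
    return SIGNATURES.get(failed, "unknown failure pattern")


if __name__ == "__main__":
    # test1 and test3 fail, test2 passes: the pattern from the text above.
    print(diagnose_from_tests({"test1": False, "test2": True, "test3": False}))
```

Matching on the whole set of failures (rather than on each test alone) is
what lets overlapping tests point at a single component.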

>If you only want to know about the interactions between services on
>various servers then I would guess you'll need to use those same
>communication lines (whether that be a protocol, port, or physical
>network) and do some check writing.

We have a VPN that links my employer's sites. I use check_by_ssh to
ping the other remote gateways from each gateway for just this reason.
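
A sketch of how such a gateway-to-gateway check matrix might be
generated is below. The plugin directory, the gateway names and the
warning/critical thresholds are assumptions for the example; only the
check_by_ssh -H/-C options and the check_ping -H/-w/-c options are
standard plugin flags.

```python
from itertools import permutations

# Assumption: plugin location varies by distribution and install method.
PLUGIN_DIR = "/usr/lib/nagios/plugins"


def gateway_ping_commands(gateways, warn="100.0,20%", crit="500.0,60%"):
    """Build one check_by_ssh command per ordered (source, target) pair.

    Each command logs in to the source gateway and runs check_ping
    against the target, so the ping crosses the VPN link itself.
    """
    commands = {}
    for src, dst in permutations(gateways, 2):
        remote = f"{PLUGIN_DIR}/check_ping -H {dst} -w {warn} -c {crit}"
        commands[(src, dst)] = [
            f"{PLUGIN_DIR}/check_by_ssh", "-H", src, "-C", remote,
        ]
    return commands


if __name__ == "__main__":
    # Hypothetical gateway names.
    for pair, argv in gateway_ping_commands(["gw-a", "gw-b", "gw-c"]).items():
        print(pair, " ".join(argv))
```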

>Obviously simply checking that the service is up on the individual
>server can only tell you that the service is running - not whether it
>is communicating or having a communication problem.

I do a lot of log analysis here looking for expected sequences of log
entries meeting timing constraints.
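
A minimal sketch of that kind of check, assuming the log has already
been parsed into (timestamp, message) pairs in time order. The marker
strings and the 5 second limit are invented for the example, and it
assumes requests do not overlap.

```python
from datetime import datetime, timedelta


def sequence_ok(entries, start_marker, end_marker, max_gap):
    """Return True if each start entry is followed by an end entry in time.

    entries: iterable of (datetime, message) pairs in time order.
    A start entry still unanswered at the end of the log counts as a failure.
    """
    pending = None  # timestamp of the start entry still awaiting its end
    for ts, msg in entries:
        if pending is not None and ts - pending > max_gap:
            return False  # the expected follow-up is late or missing
        if start_marker in msg:
            if pending is None:
                pending = ts
        elif end_marker in msg and pending is not None:
            pending = None
    return pending is None


if __name__ == "__main__":
    t = datetime(2006, 12, 27, 12, 0, 0)
    log = [
        (t, "request sent id=1"),
        (t + timedelta(seconds=3), "reply received id=1"),
    ]
    print(sequence_ok(log, "request sent", "reply received", timedelta(seconds=5)))
```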

Also, if you have queries that trigger the communication and those
queries fail to return properly, that is a good indicator of the root
cause of the failure, provided you have other tests that target the
other failure modes.

>I'm not sure how to represent this in Nagios, though.

The best I have been able to do is to create servicegroups for the
services that make up an application, then document the relationships
between them in TWiki (which is reached via the service and host
links in Nagios).

I am working on a rule based mechanism that allows me to say:

  service1 fails in a particular way
  service2 exhibits a particular error 5 minutes later

then the mechanism will notify the user of the problem diagnosis
(service2 failed because of the service1 failure) and that no action is
needed unless service2 doesn't recover within 10 minutes.

Note that this is a different reaction than if service2 had failed
without the preceding service1 failure.

This encodes multiple components of the diagnostic process into an
analysis engine whose output can point out likely root causes to the
operators. So it's represented in Nagios as an overall application
health/analysis service that is fed from the rule mechanism.
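
To make the rule concrete, here is a small sketch of that one
correlation. It is not the actual mechanism. It assumes "5 minutes
later" means "within 5 minutes after the service1 failure" and that the
10 minute recovery window starts when service2 fails. The function name
and the status strings are invented for the example.

```python
from datetime import datetime, timedelta

CAUSE_WINDOW = timedelta(minutes=5)     # assumed: service2 fails within this of service1
RECOVERY_GRACE = timedelta(minutes=10)  # assumed: measured from the service2 failure


def correlate(service1_failed_at, service2_failed_at, service2_recovered_at, now):
    """Return a status line for an application health/analysis service.

    Each *_at argument is a datetime, or None if the event has not happened.
    """
    if service2_failed_at is None:
        return "OK: service2 healthy"
    if service2_recovered_at is not None and service2_recovered_at >= service2_failed_at:
        return "OK: service2 recovered"
    caused_by_service1 = (
        service1_failed_at is not None
        and timedelta(0) <= service2_failed_at - service1_failed_at <= CAUSE_WINDOW
    )
    if not caused_by_service1:
        # No preceding service1 failure, so the reaction is different.
        return "CRITICAL: service2 failed on its own; investigate service2"
    if now - service2_failed_at <= RECOVERY_GRACE:
        return "WARNING: service2 failed because of service1; no action needed yet"
    return "CRITICAL: service2 has not recovered after the service1 failure"


if __name__ == "__main__":
    t0 = datetime(2006, 12, 27, 20, 0)
    print(correlate(t0, t0 + timedelta(minutes=4), None, t0 + timedelta(minutes=6)))
```

The returned line could be fed to Nagios as the result of a single
application-level service, which is where the operator would read the
diagnosis.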

You don't get a picture of the relationships, but...

				-- rouilj
John Rouillard
===========================================================================
My employers don't acknowledge my existence much less my opinions.
