AW: Cascading Services/Service hierarchy

Julian Hein jhein at netways.de
Mon Sep 27 22:21:14 CEST 2004


Hi,

> If we have to sit down at the end of each month and manually add the
> outages to a spreadsheet, then TCO goes waaaaaay up for us. So having a
> mechanism that allows service stati to automatically propagate up a
> hierarchy is extremely beneficial to service providers. In VPO, I know
> the top-level is red, thus I know that I am not fulfilling my service
> obligation. I can then quickly drill down and see which services
> are affected by the outage and what the root cause is.

In Nagios it is pretty easy to write your own plugins, so if you need something special in your reports, we would do the following:

1. Write a plugin that actually measures what you need in your report. This might be a simple check_http check command (retrieve the basket page from your webshop, grep for a certain string, etc.) or even a full-blown application check (select a random product, put it in the basket, check out, fill in dummy credit card data, look whether there is an order in the billing system, etc.). Let's call it check_myshop.

2. Define one service check that uses the plugin "check_myshop". If the shop runs on a cluster, you can also assign the check to a virtual host. Let's call this service "OnlineShop Availability".

3. Define all the other checks you would like to have as an admin to help you track down a problem: DiskSpace, CPULoad, IfState_eth1, whatever.

4. Define a dependency making your check "OnlineShop Availability" dependent on all the technical checks, e.g. DiskSpace, CPULoad, Firewall.

 a) With this, you will not get a separate notification for "OnlineShop Availability" if there is a technical problem, because the dependency will notice that the shop is offline for some other reason, e.g. the disk is full.

 b) If a problem occurs that you did not define a service for, you will still get a notification telling you that "OnlineShop Availability" has a problem, but without the real reason. After fixing the problem, you could implement an additional check that will tell you the real reason next time.
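
As an illustration of the simple variant from step 1, here is a minimal sketch of what check_myshop could look like in Python. It is only an assumption of how one might write it, not an existing plugin: the function names, the "SHOP" prefix in the output and the URL/string arguments are made up for this example. The only fixed convention it relies on is the usual Nagios plugin contract of one status line on stdout plus an exit code of 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN).

```python
"""check_myshop: sketch of a plugin that fetches a shop page and greps for a string."""
import urllib.error
import urllib.request

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(status, body, needle):
    """Map an HTTP status and page body to (exit_code, status_line)."""
    if status != 200:
        return CRITICAL, f"SHOP CRITICAL - HTTP status {status}"
    if needle not in body:
        return CRITICAL, f"SHOP CRITICAL - '{needle}' not found in page"
    return OK, f"SHOP OK - '{needle}' found in page"


def fetch(url, timeout=10):
    """Return (status, body) for url; HTTP error statuses are returned, not raised."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        return err.code, ""


def main(argv, fetcher=fetch):
    """Run the check for argv = [url, needle]; print one status line, return the exit code."""
    if len(argv) != 2:
        print("SHOP UNKNOWN - usage: check_myshop URL STRING")
        return UNKNOWN
    url, needle = argv
    try:
        status, body = fetcher(url)
    except (OSError, ValueError) as err:  # connection refused, DNS failure, timeout, bad URL
        print(f"SHOP CRITICAL - cannot reach {url}: {err}")
        return CRITICAL
    code, line = evaluate(status, body, needle)
    print(line)
    return code
```

To run it as a real plugin, the script would additionally need `import sys` and a final `sys.exit(main(sys.argv[1:]))`, so that Nagios sees the exit code. The full-blown application check (ordering a product and looking it up in the billing system) would replace `fetch` and `evaluate` with your own logic but keep the same status line and exit code convention.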

When you need to report on your SLA, you just run a report on "OnlineShop Availability" and it will give you the correct figures. It is absolutely impossible to think of all the reasons why your app might fail in the first place, and therefore it is impossible to implement all the checks up front. This strategy will help you to get better and better in your monitoring effort.
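
To make the SLA figure concrete, here is a small sketch of the arithmetic behind an availability percentage for one service. It is not how Nagios computes its availability report internally; the function name and the outage numbers are invented for the example, and it assumes you already know the outage durations for the reporting period.

```python
def availability_percent(outage_minutes, period_minutes):
    """Return the share of the period (in percent) during which the service was up."""
    if period_minutes <= 0:
        raise ValueError("period_minutes must be positive")
    downtime = sum(outage_minutes)
    if downtime > period_minutes:
        raise ValueError("total downtime exceeds the reporting period")
    return 100.0 * (period_minutes - downtime) / period_minutes


# Hypothetical month: 30 days, three outages of "OnlineShop Availability".
month = 30 * 24 * 60          # 43200 minutes
outages = [12, 45, 3]         # 60 minutes of downtime in total
print(round(availability_percent(outages, month), 2))  # prints 99.86
```

Because only the one business-level service enters the calculation, outages of purely technical checks that never took the shop down do not count against the SLA.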

Julian

P.S.: If you like, you can send me a PM in German or we could talk on the phone as well.

-- 
Julian Hein                   NETWAYS GmbH
Managing Director             Deutschherrnstr. 47a
Fon.0911/92885-0              D-90429 Nürnberg
Fax.0911/92885-31                                        
jhein at netways.de              www.netways.de      


_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users