AW: Cascading Services/Service hierarchy

Mohr James james.mohr at elaxy.com
Tue Sep 28 09:17:51 CEST 2004


> > 
> > I still have problems with the term "device". Being able to access a
> > web shop is neither a "device" nor a "host". However, it is a service.
> 
> *sigh* From a marketing point of view: yes. From a computer point of
> view, it's a chain of processes co-operating to deliver a certain output
> to a variety of sources using a single source as input. Any idiot can
> tell you that the webshop is down. Nagios lets you know WHY it is down,
> by checking each service (as a computer would see it, meaning
> database-server, web-server, loadbalancer and what-not).

It seems to me that you are basing your comments on experiences using
Nagios for your internal IT department. While most of what you say is
fairly valid in that environment, it often does not apply to
environments where you are providing the same thing for external
customers. 

We are a **service** provider. We do not provide computers or network
components. We provide a service. It is incumbent on all of our
employees to see the system in terms of what we are providing and not in
the terms forced upon us by some software. I would assume that the
developers of Nagios chose the word "service" instead of "device"
because that is the common term used in businesses and that's what they
represent. Not "device". Service providers provide services and must
view their systems accordingly. The *chain* is also a logical
entity. Each component may be physical, but the chain is logical. A
service provider often provides either the chain or the end result. The
physical entities are secondary.

> By all means, if all you want is a system that tells you "Hey, there's
> something wrong in your network", then by all means, configure Nagios to
> tell you just that. It's really easy. You'd be using about 5% of Nagios'
> capability, and your netadmins still wouldn't be a bit wiser in their
> trouble-shooting. That's not really my problem though, so just go ahead
> with it if you like.

*sigh* Although I appreciate the time people take to reply to posts, it
does little good when they address the wrong issue. In many real world
situations you generally don't install software simply because it
addresses one specific issue. Telling us what's wrong is one thing, but
also immediately knowing what effects a particular outage has on the
**entire** system is extremely important to service providers.  

Nothing personal, but I run into this problem with administrators and
developers all of the time. They work with blinders and only see that
one component that they are working on and do not see the system as a
whole. They are only interested in fixing that one specific problem,
regardless of the effects it has on other systems. When we have an
outage, we need to know which other systems are affected. And no, you
cannot expect every single administrator to know every single
inter-connection in the system. When we have an outage, we have to see
which customers are affected and then inform them of the outage. If the
outage takes 2 minutes longer to solve because we stopped to inform them,
then we have fulfilled our contractual obligations. If we solve it in 18
minutes and have not informed them, then we have problems. If something goes
down, my operators need to know which of our customers to inform. That's
the real-world and is very common for service providers. HP VantagePoint
provides this perspective. 

Also, if all we know is that the bottom level component is down, we have
no immediate way of calculating the downtime for all of our customers.
If we cannot mark the service the customer sees as "down", then we need
to manually do the calculations. As I said, TCO goes waaaay up.
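
To illustrate the calculation I mean, here is a minimal Python sketch. It is not something Nagios or VPO provides; the names `SERVICE_COMPONENTS`, `OUTAGES`, `merge` and `downtime_minutes` and all of the numbers are made up for illustration. It assumes there is no redundancy, so any component outage takes the customer-facing service down, and it counts overlapping outages only once.

```python
# Hypothetical example data: which low-level components each customer-facing
# service depends on, and outage windows as (start, end) in minutes.
SERVICE_COMPONENTS = {
    "Shop1": ["Application server1", "Database1"],
    "Shop2": ["Application server2", "Database1"],
}

OUTAGES = {
    "Database1": [(100, 118)],
    "Application server1": [(110, 130)],
}


def merge(intervals):
    """Merge overlapping (start, end) intervals so shared time is counted once."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous window: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def downtime_minutes(service, components=SERVICE_COMPONENTS, outages=OUTAGES):
    """Total minutes the service was down, assuming any component outage takes it down."""
    windows = [w for c in components[service] for w in outages.get(c, [])]
    return sum(end - start for start, end in merge(windows))
```

With this sample data, Shop1 is charged for the combined window of the database and application server outages, while Shop2 is charged only for the database outage. That is the sum we would otherwise do by hand in a spreadsheet.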

> > In most contexts "device" refers to something **physical**; a web shop
> > is not physical, but still a service.
> 
> Only in a marketing context. See above for explanation.

That's not correct. There is no physical entity that represents the web
shop as compared to a physical disk. Also, as a service provider you
*must* be familiar with the services you are selling or *marketing*.
Again, it seems to me that you are coming from an environment where
Nagios is used for internal IT and not for customers. 

As I said before, we are in the process of seeing if Nagios can replace HP
VantagePoint. We don't want to simply convert everything from VPO-ese
to Nagios-ese. We want to see what we need to do with Nagios to fulfill
our obligations. Since this is not internal IT, one of those
obligations is notifying the customer of outages and reporting them. So
from a business perspective these are extremely important as **they**
are what we are selling and not the physical machine. 

> > This is the service we need to guarantee for our customers. It
> > consists of the web server service and database services,
>
> Then monitor those services, and group them together in a servicegroup
> named "webshop-services". It's really quite simple, you know.

Thank you. However, although the Nagios documentation is very extensive,
you need to know what to look for and where to look before you can
determine if that's what you need. 

> 
> > which then consist of other services and eventually
> > *physical* hosts and *physical* devices.
> > 
> 
> Your admins will be happier if they know which physical host is causing
> the ruckus, so let them know that by monitoring the details. This will
> also let you keep availability high on the webshop.

But management and the customers won't. Besides, I hope that Nagios is
not so limited that I can only monitor the specific machine OR monitor
the end services. We are obligated by our contracts to notify our
customers within a certain amount of time, report outages at the end of
the month, as well as fix the problems. Not every company is the same.
We have specific time limits to report the problem and *begin* working
on the solution. If we go over the notification time limit, that is
often worse for our SLAs than taking 2 minutes longer to solve the
problem. That's the real world. 

> > As far as I see, 2.0 is only available from the CVS repository and as
> > alpha/unstable from a couple of web sites. Is that correct?
>
> It's considered alpha but I'd gladly call it stable so long as you don't
> enable the embedded perl (which has been causing trouble all the time).

Thanks. That's really good. It would be nice to see what is coming. Do
you have it deployed in a customer environment or only internally?

> > It's nice to know what is coming up. Particularly in our case where we
> > are still in the testing and planning stages. I just want to know is it
> > something that I can implement now.
> >
> > If I understand you correctly, with 2.0, you will be able to have a
> > hierarchy with multiple levels. Will the stati propagate, as well?
> >
>
> Multiple levels of what?

Multiple levels of services. Assume you have two web shops that both
access the same database and for simplicity's sake there is no
redundancy:

Shop1 -> Application server1 -> Database1 -> physical machine
Shop2 -> Application server2 -> Database1 -> physical machine

HP VPO allows us to graphically represent this hierarchy. Like the tree
one has with something like Windows Explorer, you can drill down the
various layers. If the physical machine goes down an operator can
immediately see that both web shops are affected and contact the system
administrators (if necessary) **and** contact the customer. 
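
To make the "impacted services" lookup concrete, here is a minimal Python sketch of what the operator needs: given a failed component, find everything above it in the hierarchy. This is only an illustration under my own assumptions, not a feature of Nagios or VPO. The names `DEPENDS_ON`, `impacted_by` and `top_level` are hypothetical, and the data is the two-shop example above.

```python
from collections import deque

# Hypothetical dependency map for the two-shop example: each entry lists
# what a service directly depends on. None of these names come from Nagios.
DEPENDS_ON = {
    "Shop1": ["Application server1"],
    "Shop2": ["Application server2"],
    "Application server1": ["Database1"],
    "Application server2": ["Database1"],
    "Database1": ["physical machine"],
    "physical machine": [],
}


def impacted_by(failed, depends_on=DEPENDS_ON):
    """Return every service that directly or indirectly depends on `failed`."""
    # Invert the map so we can walk upwards from the failed component.
    dependents = {name: [] for name in depends_on}
    for service, needs in depends_on.items():
        for need in needs:
            dependents.setdefault(need, []).append(service)

    impacted, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, []):
            if parent not in impacted:
                impacted.add(parent)
                queue.append(parent)
    return impacted


def top_level(impacted, depends_on=DEPENDS_ON):
    """Keep only impacted services that nothing else depends on (what the customer sees)."""
    needed = {need for needs in depends_on.values() for need in needs}
    return sorted(s for s in impacted if s not in needed)
```

Calling `top_level(impacted_by("physical machine"))` on this data yields both shops, which is the list of customers the operator has to call. The same walk run for "Application server1" yields only Shop1.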

The way I have seen so far, and no one has yet contradicted it, is that
Nagios would **display** the services in the "Service Detail" like this:

Shop1 -> Application server1 
      -> Database1 
      -> physical machine

Shop2 -> Application server2
      -> Database1 
      -> physical machine

If you are dealing with so many services that it fills up four screens,
then you have a problem figuring out which services are affected. These
are the "impacted services".  The operator knows with the first one
which low-level component is down and can immediately inform the admins
(or solve the problem themselves in some cases). However, the other two
aspects of our obligations are not as easily fulfilled (notification and
reporting). 

> Define 'stati'.

Sorry, my German slipped in. The English plural is "statuses".


> > Although a simple list of all of the services would provide an
> > "overview", it is not necessarily the best representation of the
> > services.
> 
> For an admin it is, since it's vital for finding the cause of the 
> problem. Are you an admin yourself, or are you a project-leader 
> implementing Nagios (or are you possibly from management)??
> 
> > Being able to represent/depict the services as a hierarchy is
> > more accurate than a simple list of what is "included in a
> > service-flow". 
> > 
> 
> It's a matter of what you're used to, I suppose.

And it is also a matter of what you are selling. We are selling services and
not physical machines. We need to know what services are affected at the
same time as which machine is down. In some cases, knowing which
services are affected is more important.

> > True, but for reporting it *is* important to know when the "customer
> > support FAQ" doesn't work.
>
> So get your availability reports of the 'customer-faq' supportgroup and
> just print the totals.

And which **other** services are affected by the outage of the load
balancer?

> > As you say, as an admin, I am not interested
> > in the fact that the web shop is not accessible. Instead, I want to know
> > that it is the port on the switch connecting the web server to the DB
> > server. However, neither my customer nor my boss care that it was a
> > switch port. They want to know whether we have reached our service
> > levels or not.
> > 
> 
> Naturally. Show them the availability report of the servicegroup and 
> explain to them that "this database-server can't really cope with the 
> load" (for example) if you don't meet the demands.

That's not the point. You have a report at the end of the month saying
you fulfilled the availability SLA, but failed to fulfill the
notification SLA because you cannot easily find the impacted services.
 

> > If we have to sit down at the end of each month and manually add the
> > outages to a spreadsheet, then TCO goes waaaaaay up for us. So having a
> > mechanism that allows service stati to automatically propagate up a
> > hierarchy is extremely beneficial to service providers. In VPO, I know
> > the top-level is red, thus I know that I am not fulfilling my service
> > obligation. I can then quickly drill down and see which services
> > are affected by the outage and what the root cause is.
> > 
> 
> I suggest you stick to VPO then. You don't have to use Nagios if you 
> don't want to.

Yes, I do have to.

Look, if you don't want to help me, then I have to accept that. But
please don't waste your time with obnoxious comments like that. I **have
to** use it. In the real world you often don't have choices. Often
decisions are forced upon you from higher up. If Nagios does not have
the advanced features that VPO does, then it would be really helpful to
know that without spending days of trial and error. I can read the
manual from cover to cover, but it's not perfect and is missing
information. Instead of spending days trying to "figure it out", I came
to a place where there are knowledgeable people who I assumed are
willing to help. I'm really sorry if that assumption was incorrect.

I would bet that 90+% of all forum and mailing list posts could be solved
by the poster by reading the documentation cover to cover and trying out
every possible combination. I was hoping to avoid that. That's why I
came here, to get some help. If you don't want to help me, then
simply create a filter that tosses my email into the trash bin.  

Regards,

Jim Mohr

