AW: Cascading Services/Service hierarchy

Andreas Ericsson ae at op5.se
Tue Sep 28 12:38:42 CEST 2004


Mohr James wrote:
>>>I still have problems with the term "device". Being able to access a
>>>web shop is neither a "device" nor a "host". However, it is a service.
>>
>>
>>*sigh* From a marketing point of view: yes. From a computer point of
>>view, it's a chain of processes co-operating to deliver a certain output
>>to a variety of sources using a single source as input. Any idiot can
>>tell you that the webshop is down. Nagios lets you know WHY it is down,
>>by checking each service (as a computer would see it, meaning
>>database-server, web-server, loadbalancer and what-not).
> 
> 
> It seems to me that you are basing your comments on experiences using
> Nagios for your internal IT department. While most of what you say is
> fairly valid in that environment, it often does not apply to
> environments where you are providing the same thing for external
> customers. 
> 
> We are a **service** provider. We do not provide computers or network
> components. We provide a service. It is incumbent on all of our
> employees

All your employees aren't network administrators. Nagios is a network 
admin tool, so it wasn't designed with service providers in mind. It is 
ridiculously easy to check the functionality of such a service, so I 
suggest you get going on writing a plugin to do it. Possibly check_http 
could be used to do it for you. The web-server and database services 
(the network services) can be made dependent on the webshop login 
service, for instance.
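
A minimal sketch of such a plugin, assuming the shop has a login page that
contains some known marker text. The script name, URL and marker are
hypothetical; the exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) are
the usual Nagios plugin convention.

```python
"""Sketch of a Nagios-style check for a webshop login page.

The URL and marker text are passed in by the caller and are hypothetical.
"""
import urllib.error
import urllib.request

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(status, body, marker):
    """Map an HTTP status and page body to (exit_code, message)."""
    if status != 200:
        return CRITICAL, f"WEBSHOP CRITICAL - HTTP status {status}"
    if marker not in body:
        return WARNING, f"WEBSHOP WARNING - '{marker}' not found in page"
    return OK, "WEBSHOP OK - login page served"


def check(url, marker, timeout=10):
    """Fetch the page and classify the result."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return classify(resp.status, body, marker)
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status.
        return classify(exc.code, "", marker)
    except OSError as exc:
        # Covers URLError, refused connections and timeouts.
        return CRITICAL, f"WEBSHOP CRITICAL - {exc}"


def main(argv):
    """argv is [URL, MARKER]; returns the plugin exit code."""
    if len(argv) != 2:
        print("WEBSHOP UNKNOWN - usage: check_webshop.py URL MARKER")
        return UNKNOWN
    code, message = check(argv[0], argv[1])
    print(message)
    return code
```

To use it as a real plugin, end the file with sys.exit(main(sys.argv[1:]))
(after importing sys) and point a command definition at it.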

> to see the system in terms of what we are providing and not in
> the terms forced upon us by some software.

You really need to get your terminology in order. Run
'cat /etc/services' on any unix system.

> I would assume that the
> developers of Nagios chose the word "service" instead of "device"

You're confusing service with host (in nagios).

> because that is the common term used in businesses and that's what they
> represent.

They represent services provided by something to something. The fact 
that the webshop consists of several (network layer) services doesn't 
really make it a service from a computer point of view, even though it 
most definitely is from a customer point of view.

What you need to understand is that Nagios does a little more than just 
telling you "webshop not working". To provide details, it needs to know 
what (network layer) services are involved in providing a (customer) 
service.

> Not "device". Service providers provide services and must
> view their systems accordingly.

An extremely small percent of all computer networks belong to companies 
that provide external services.

> The *chain* is also a logical entity. Each component may be physical, but the
> chain is logical. A service provider often provides either the chain or the end
> result. The physical entities are secondary.
> 

If you want to fix the chain, you go mend the broken link. Again you 
fail to notice the difference between a tool designed for management to 
show off at conferences, and a tool for admins to help them troubleshoot.

> 
>>By all means, if all you want is a system that tells you "Hey, there's
>>something wrong in your network", then by all means, configure Nagios to
>>tell you just that. It's really easy. You'd be using about 5% of Nagios'
>>capability, and your netadmins still wouldn't be a bit wiser in their
>>trouble-shooting. That's not really my problem though, so just go ahead
>>with it if you like.
> 
> 
> *sigh* Although I appreciate the time people take to reply to posts, it
> does little good when they address the wrong issue. In many real world
> situations you don't generally install software simply because it
> addresses one specific issue. Telling us what's wrong is one thing, but
> also immediately knowing what effects a particular outage has on the
> **entire** system is extremely important to service providers.  
> 

Hence servicegroups. Are you even reading reply emails?

> Nothing personal, but I run into this problem with administrators

I take it you aren't an admin yourself then.

> and
> developers all of the time. They work with blinders and only see that
> one component that they are working on and do not see the system as a
> whole. They are only interested in fixing that one specific problem,
> regardless of the effects it has on other systems.

No chain is stronger than its weakest point. It's their job to work to 
fix specific problems. I thought you might have missed that.

> When we have an
> outage, then we need to know what other systems are affected. And no, you
> cannot expect every single administrator to know every single
> inter-connection in the system.

Hence servicegroups. RTFM.

> When we have an outage, we have to see
> which customers are affected and then inform them of the outage.

Nagios is not a CRM or customer database tool, so you wouldn't be able 
to get that without making the connections anyway.

> If the
> outage takes 2 minutes longer to solve because we informed them, then we have
> fulfilled our contractual obligations. If we solve it in 18 minutes
> and have not informed them, then we have problems. If something goes
> down, my operators need to know which of our customers to inform. That's
> the real-world and is very common for service providers. HP VantagePoint
> provides this perspective. 
> 

Then HP VantagePoint can probably keep track of your customers as well. 
It sounds like you've already decided.

> Also, if all we know is that the bottom level component is down, we have
> no immediate way of calculating the downtime for all of our customers.

Yes, there is. Get a CSV report of the webshop servicegroup and put it 
in an excel-sheet, then just add up the columns and you're done. Or even
better, ask someone who knows anything about network administration to do
it for you. They'll know what columns to add.
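
The same adding-up can be sketched in a few lines of Python. The column
names and sample rows below are made up for illustration; a real export
will have its own headers, so adjust them to match.

```python
import csv
import io

# Hypothetical export for one month; the headers are assumptions,
# not the actual layout of a Nagios CSV report.
SAMPLE = """service,seconds_ok,seconds_critical
web-server,2591000,1000
database-server,2590200,1800
loadbalancer,2592000,0
"""


def downtime_by_service(csv_text, column="seconds_critical"):
    """Return {service name: downtime in seconds} from a CSV export."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["service"]: int(row[column]) for row in reader}


def total_downtime(csv_text, column="seconds_critical"):
    """Add up the downtime column over every service in the export."""
    return sum(downtime_by_service(csv_text, column).values())


print(downtime_by_service(SAMPLE))
print(total_downtime(SAMPLE))
```

Note that a plain sum counts overlapping outages twice, so it is an upper
bound on the downtime the customer actually saw.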

> If we cannot mark the service the customer sees as "down", then we need
> to manually do the calculations. As I said, TCO goes waaaay up.
> 

I could hack up the script for you right here in about 15 minutes (25 if 
you want graphs and a pretty little logo), but you probably wouldn't pay 
me for it so I guess I'll just skip that.

> 
>>>In most contexts "device" refers to something **physical**; a web shop is
>>>not physical, but still a service.
>>
>>Only in a marketing context. See above for explanation.
> 
> 
> That's not correct. There is no physical entity that represents the web
> shop as compared to a physical disk. Also, as a service provider you
> *must* be familiar with the services you are selling or *marketing*.
> Again, it seems to me that you are coming from an environment where
> Nagios is used for internal IT and not for customers. 
> 

It's a network admin tool. Customers regularly aren't admins. If you 
want to provision it, you can do so with amazing ease by simply hacking
up your own scripts. TCO goes WAY down and you get to put your nice 
little trinkets wherever you want.

> As I said before, we are in the process of seeing if Nagios can replace HP
> VantagePoint. We don't want to simply convert everything we have from VPO-ese
> to Nagios-ese. We want to see what we need to do with Nagios to fulfill
> our obligations. Since this is not an internal IT, one of those
> obligations is notifying the customer of outages and reporting them. So
> from a business perspective these are extremely important as **they**
> are what we are selling and not the physical machine. 
> 

Here's an idea. Send me an email off-list and tell me what you need. I
charge $190 an hour, and I can fix everything you need in about a week.
You cover travelling and accommodations, of course.

> 
>>>This is the service we need to guarantee for our customers. It
>>>consists of the web server service and database services,
>>
>>Then monitor those services, and group them together in a servicegroup
>>named "webshop-services". It's really quite simple, you know.
> 
> 
> Thank you. However, although the Nagios documentation is very extensive,
> you need to know what to look for and where to look before you can
> determine if that's what you need. 
> 

So look at it then. Surely you didn't intend to use something with just 
a hunch that it works?

> 
>>>which then consist of other services and eventually
>>>*physical* hosts and *physical* devices.
>>>
>>
>>Your admins will be happier if they know which physical host is causing
>>the ruckus, so let them know that by monitoring the details. This will
>>also let you keep availability high on the webshop.
> 
> 
> But management and the customers won't. Besides, I hope that Nagios is
> not so limited that I can only monitor the specific machine OR monitor
> the end services.

You can monitor anything you can imagine. If you had read the docs you 
would have known this.

> We are obligated by our contracts to notify our
> customers within a certain amount of time, report outages at the end of
> the month, as well as fix the problems. Not every company is the same.
> We have specific time limits to report the problem and *begin* working
> on the solution. If we go over the notification time limit, that is
> often worse for our SLAs than taking 2 minutes longer to solve the
> problem. That's the real world. 
> 

I'm not really interested in your obligations. I can only tell you that 
Nagios will do what you want, but only if you take the time to learn 
about it. You seem very reluctant to do that, and it seems you're 
leeching public efforts to satisfy your own curiosity. RTFM.

> 
>>>As far as I see, 2.0 is only available from the CVS repository and as
>>>alpha/unstable from a couple of web sites. Is that correct?
>>
>>It's considered alpha but I'd gladly call it stable so long as you don't
>>enable the embedded perl (which has been causing trouble all the time).
> 
> 
> Thanks. That's really good. It would be nice to see what is coming. Do
> you have it employed in a customer environment or only internally?
> 

Both.

> 
>>>It's nice to know what is coming up. Particularly in our case where we
>>>are still in the testing and planning stages. I just want to know whether
>>>it is something that I can implement now.
>>>
>>>If I understand you correctly, with 2.0, you will be able to have a
>>>hierarchy with multiple levels. Will the stati propagate, as well?
>>>
>>
>>Multiple levels of what?
> 
> 
> Multiple levels of services. Assume you have two web shops that both
> access the same database and for simplicity's sake there is no
> redundancy:
> 
> Shop1 -> Application server1 -> Database1 -> physical machine
> Shop2 -> Application server2 -> Database1 -> physical machine
> 
> HP VPO allows us to graphically represent this hierarchy. Like the tree
> one has with something like Windows Explorer, you can drill down the
> various layers. If the physical machine goes down an operator can
> immediately see that both web shops are affected and contact the system
> administrators (if necessary) **and** contact the customer. 
> 

*sigh* So use VPO then, but stop pestering the list with questions you 
could have gotten from documentation. RTFM.

> The way I have seen so far, and no one has yet contradicted it, is that
> Nagios would **display** the services in the "Service Detail" like this:
> 
> Shop1 -> Application server1 
>       -> Database1 
>       -> physical machine
> 
> Shop2 -> Application server2
>       -> Database1 
>       -> physical machine
> 
> If you are dealing with so many services that it fills up four screens,
> then you have a problem figuring out which services are affected. These
> are the "impacted services".

That's what the service problems and servicegroup overview are for. 
No one really looks at the service detail if they have a large enough
network (four pages would be a rather small one, so it might still be 
useful).

>  The operator knows with the first one
> which low-level component is down and can immediately inform the admins
> (or solve the problem themselves in some cases).

You need new operators if they only in some cases can solve the problems 
in their own network.

> However, the other two
> aspects of our obligations are not as easily fulfilled (notification and
> reporting). 
> 

Reporting is, notifications can be if you hack up your own notification 
script. It's trivial.
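
As a rough sketch of what such a notification script could look like:
it assumes the Nagios command definition passes $HOSTNAME$,
$SERVICEDESC$ and $SERVICESTATE$ as arguments, and that you keep your own
table of which customers sit behind which service. The table, addresses
and script name are invented for the example.

```python
# Hypothetical mapping from Nagios service descriptions to the customers
# that must be told when that service fails. Maintained by hand.
CUSTOMERS_BY_SERVICE = {
    "Database1": ["shop1-owner@example.com", "shop2-owner@example.com"],
    "Application server1": ["shop1-owner@example.com"],
    "Application server2": ["shop2-owner@example.com"],
}


def build_notifications(host, service, state):
    """Return (recipient, message) pairs for one service state change."""
    text = f"{service} on {host} is {state}"
    return [(addr, text) for addr in CUSTOMERS_BY_SERVICE.get(service, [])]


def main(argv):
    """argv is [HOST, SERVICE, STATE] as handed over by Nagios."""
    if len(argv) != 3:
        print("usage: notify_customers.py HOST SERVICE STATE")
        return 1
    for recipient, message in build_notifications(*argv):
        # Printing stands in for real delivery (mail, SMS, ticket system).
        print(f"to {recipient}: {message}")
    return 0
```

Swap the print for whatever delivery mechanism you use; the lookup table
is the part Nagios cannot provide for you.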

> 
>>>Although a simple list of all of the services would provide an
>>>"overview", it is not necessarily the best representation of the
>>>services.
>>
>>For an admin it is, since it's vital for finding the cause of the 
>>problem. Are you an admin yourself, or are you a project-leader 
>>implementing Nagios (or are you possibly from management)?
>>
>>
>>>Being able to represent/depict the services as a hierarchy is
>>>more accurate than a simple list of what is "included in a
>>>service-flow". 
>>>
>>
>>It's a matter of what you're used to, I suppose.
> 
> 
> And it is also a matter of what you are selling. We are selling service and
> not physical machines. We need to know what services are affected at the
> same time as which machine is down. In some cases, knowing which
> services are affected is more important.
> 
> 
>>>True, but for reporting it *is* important to know when the "customer
>>>support FAQ" doesn't work.
>>
>>So get your availability reports of the 'customer-faq' servicegroup and
>>just print the totals.
> 
> 
> And which **other** services are affected by the outage of the load
> balancer?
> 

Service dependencies will handle that.
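
To illustrate the idea (this is not how Nagios implements dependencies
internally, just a sketch of the bookkeeping), here is the Shop1/Shop2
chain from earlier in this mail as a small dependency table, with a
function that lists everything sitting on top of a failed component.

```python
from collections import deque

# Each key depends directly on the items in its list. Names are taken
# from the Shop1/Shop2 example; the table itself is hand-written.
DEPENDS_ON = {
    "Shop1": ["Application server1"],
    "Shop2": ["Application server2"],
    "Application server1": ["Database1"],
    "Application server2": ["Database1"],
    "Database1": ["physical machine"],
}


def impacted_by(failed, depends_on=DEPENDS_ON):
    """Return everything that directly or indirectly depends on `failed`."""
    # Invert the edges: for each component, who depends on it?
    dependents = {}
    for item, parents in depends_on.items():
        for parent in parents:
            dependents.setdefault(parent, []).append(item)

    # Breadth-first walk upwards from the failed component.
    seen = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for child in dependents.get(current, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)


print(impacted_by("physical machine"))
print(impacted_by("Application server1"))
```

With the physical machine down, both shops show up in the result; with
only Application server1 down, just Shop1 does.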

> 
>>>As you say, as an admin, I am not interested
>>>in the fact that the web shop is not accessible. Instead, I want to know
>>>that it is the port on the switch connecting the web server to the DB
>>>server. However, neither my customer nor my boss care that it was a
>>>switch port. They want to know whether we have reached our service
>>>levels or not.
>>>
>>
>>Naturally. Show them the availability report of the servicegroup and 
>>explain to them that "this database-server can't really cope with the 
>>load" (for example) if you don't meet the demands.
> 
> 
> That's not the point. You have a report at the end of the month saying
> you fulfilled the availability SLA, but failed to fulfill the
> notification SLA because you cannot easily find the impacted services.
>  
> 
> 
>>>If we have to sit down at the end of each month and manually add the
>>>outages to a spreadsheet, then TCO goes waaaaaay up for us. So having a
>>>mechanism that allows service stati to automatically propagate up a
>>>hierarchy is extremely beneficial to service providers. In VPO, I know
>>>the top-level is red, thus I know that I am not fulfilling my service
>>>obligation. I can then quickly drill down and see which services
>>>are affected by the outage and what the root cause is.
>>>
>>
>>I suggest you stick to VPO then. You don't have to use Nagios if you 
>>don't want to.
> 
> 
> Yes, I do have to.
> 
> Look, if you don't want to help me, then I have to accept that. But
> please don't waste your time with obnoxious comments like that. I **have
> to** use it.

Then sit down and RTFM. All of us on this list do this for free in our
spare time, and I'm frankly quite fed up with people who ask questions 
they could have read the answers to in the documentation.

> In the real world you often don't have choices. Often
> decisions are forced upon you from higher up.

Nobody's forcing you to like it.

> If Nagios does not have
> the advanced features that VPO does, then it would be really helpful to
> know about it without spending days of trial and error.

I've already told you how to solve this, but you're not listening. It's 
awfully tiring, really.

> I can read the
> manual from cover to cover, but it's not perfect and is missing
> information. Instead of spending days trying to "figure it out", I came
> to a place where there are knowledgeable people who I assumed are
> willing to help. I'm really sorry if that assumption was incorrect.
> 

It wasn't, but as I've said before, we do this for free in our spare
time and by re-asking questions you've already been given an answer to 
you're not making any of us more willing to help.

> I would bet that 90+% of all forum and mailinglist posts could be solved
> by the poster by reading the documentation cover to cover and trying out
> every possible combination. I was hoping to avoid that.

I noticed. There are companies offering paid support for Nagios. Many of 
them have advanced additional features. I suggest you contact one of them.

> That's why I
> came here, to get some help. If you don't want to help me, then
> simply create a filter that tosses my email into the trash bin.  
> 

Have a look at this;
http://www.findopensourcesupport.com/services/support/search.php?advanced_search=1&software_type_match=0&software[]=1&software_lock=1

There'll be knowledgeable and helpful people there who are more than 
willing to answer any question you might have. They do it for a fee, so the
less you want to read the docs, the better for them.

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Lead Developer


_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue. 
::: Messages without supporting info will risk being sent to /dev/null