Monitoring large (ish) numbers of servers with exceptions to the rules...

Wheeler, JF (Jonathan) J.F.Wheeler at rl.ac.uk
Tue Jun 17 14:22:58 CEST 2008


> -----Original Message-----
> From: nagios-users On Behalf Of Matthew Macdonald-Wallace
> Sent: 17 June 2008 13:14
> 
> I currently help maintain and monitor around 50 servers across various
> parts of the UK using Nagios 2.  At the moment, we have a configuration
> file for each host (%hostname%.cfg) and in that file we specify all the
> services for the named host.
> 
> We are trying to reduce the number of configuration files as we take on
> more and more servers, because there are a large number of checks that
> need to be rolled out to all servers and we feel that we are
> duplicating our workload.
> 
> I'm open to ideas on how to achieve this however my thoughts were a
> setup along the lines of the following:
> 
>  - A "master" host template is created in which all services are
>    defined for a host.
> 
>  - If a check does not need to be run for a given host (for example it
>    is not a web server), a stanza is added to that particular host's
>    config file that effectively tells nagios "don't check for this
>    service on this host"
> 
> I've tried defining all the services in a master templates file and
> this works perfectly; however, when I come to exclude certain services,
> I am at a loss on how to do it.
> 
> Initially I tried adding a stanza with the same service name and
> "register 0" as one of the options, however this didn't work.
> 
> We have used HostGroups in the past to achieve a similar goal; however,
> we ran into the issue that whilst we need to check the CPU usage on all
> of the servers, a few of the servers that we monitor can take a lot
> more of a beating than the majority.  This led to us defining the CPU
> checks on a per-host basis, because if we defined it separately from
> the hostgroup for the more powerful servers we were presented with a
> load of errors regarding duplicate service names.
> 
> I hope I've made myself clear on what we're after and I look forward to
> receiving your input on this.

One pattern that I use in the configuration that I maintain looks
something like this:

define service{
        use                     generic-hung-mounts
        hostgroup_name          experiments
        hosts                   !lcg0448
        contact_groups          experiments
}

where "lcg0448" is a host in host group "experiments" and I want to
apply the "generic-hung-mounts" check to all hosts in that group except
for "lcg0448".
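
For completeness, the "generic-hung-mounts" template that the service
above inherits from could look roughly like the sketch below. This is
not the real definition from my configuration: the base template
"generic-service", the description text and the command name
"check_hung_mounts" are placeholders assumed for illustration. The
point is that "register 0" belongs on the template itself (so that
Nagios does not treat it as a real service), not on a per-host stanza
meant to switch a check off.

```
# Sketch only: every name below except generic-hung-mounts is assumed.
define service{
        name                    generic-hung-mounts ; referenced by "use"
        use                     generic-service     ; assumed base template
        service_description     Hung Mounts         ; placeholder text
        check_command           check_hung_mounts   ; hypothetical command
        register                0                   ; template, not a real service
}
```

The service definition that uses the template then only has to say
which hosts it applies to and who gets notified.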

This can lead to configuration like this:

define service{
        use                     check-pbs-offline
        hostgroup_name          workers
        hosts                   !lcg0614,!lcg0617,!lcg0618,!lcg0626
        contact_groups          tier1a
}
define service{
        use                     check-pbs-offline
        hosts                   lcg0614,lcg0617,lcg0618,lcg0626
        contact_groups          tier1a,grid-team
}

where the only difference is that the hosts in the second definition
have a second contact group.
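
The same split should cover the CPU case in the original question. The
following is a sketch under assumptions rather than tested
configuration: the hostgroup "all-servers", the hosts "bigbox1" and
"bigbox2", the base template "generic-service" and the command
"check_remote_load" with its threshold arguments are all made-up names.
It uses "host_name", which is the directive name given in the Nagios
documentation.

```
# Sketch only: hostgroup, host, template and command names are assumed.

# Normal thresholds for every server in the group except the big ones.
define service{
        use                     generic-service
        hostgroup_name          all-servers
        host_name               !bigbox1,!bigbox2
        service_description     CPU Load
        check_command           check_remote_load!5,4,3!10,8,6
}

# Looser thresholds for the servers that can take more of a beating.
define service{
        use                     generic-service
        host_name               bigbox1,bigbox2
        service_description     CPU Load
        check_command           check_remote_load!15,10,5!30,25,20
}
```

Because the excluded hosts never receive the first definition, each
host ends up with a single "CPU Load" service, which is what should
avoid the duplicate service name errors.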

HTH

Jonathan Wheeler
e-Science Centre
Rutherford Appleton Laboratory
