Feedback on Nagios

Marcus Vogt mgvogt at bigpond.com.au
Sat Dec 13 13:55:11 CET 2003
Hi guys,

Just a bit of feedback on Nagios and the issues we have run into, and are
still running into, with it. The issues I raise may be because I do not
understand Nagios or its plugins sufficiently, or because I am trying to use
it in a way it was not intended. I am more than happy with any constructive
feedback or criticism on this.

First of all, I'd like to say thanks for a really good product. It does a
pretty darn good job of presenting the status of a service-type network.

We previously used HP OpenView Network Node Manager (NNM) exclusively.
However, it is designed specifically for monitoring network-type devices and
has no real concept of services and service dependencies.

We are in the initial phases of trying Nagios out (about 1-2 months). 
 It is important to note
that I have not tinkered with the internals other than a small change 
for permissions.  I have
been focused on getting it to monitor and report on things.

We continue to use NNM for network discovery and also network display 
(drill down and
containers) as this is an area where Nagios is weak in comparison.

** Nagios has poor network discovery facilities - particularly Layer 2.
** Nagios does not have a network "drill down" type map a la NNM.
** Nagios cannot handle really large layouts - it does not fully draw the map.
     I have not investigated this issue yet though.
** I have not yet found a way to have Nagios understand complex network
     dependencies. We have a large number of redundant paths, and we cannot
     draw these correctly. This is very probably a lack of understanding on
     my part though.

I know that there is nmap discovery, however it does not give you the Layer 2
type dependency information that is important for correctly configuring
dependencies. Additionally, it does not discover SNMP variables for CPU,
memory, storage, or networking. Now I understand that there are 1001
different SNMP MIBs out there, but there are some rather obvious ones that
hold a fair amount of market share:

Cisco and HP for Networking equipment.
HOST MIBs (Covers MS Win2k, NetSNMP and others)
IF MIBs for network interfaces (IP & Layer 2) and routing.

** Nagios has poor SNMP discovery of common services on MIBs.  Is this 
really a problem though?
      Perhaps this is the responsibility of individual deployments.

Now this isn't a major problem for us - I wrote discovery scripts in Perl
that, given a list of hosts, will interrogate their SNMP services and provide
all the goodies - services, service dependencies, service extended
information (discussion later) - but that led us to the next problem.

Currently, with the first sweep of discovery (excluding networking type
queries), we ended up with around 300 hosts and 1500 services, with services
checked every 5 minutes. This absolutely hammered the CPU of the box it was
on (Sun E250, dual CPU and 2 GB memory). This was okay: we used the embedded
Perl option and this got us to just under 100% utilisation. Yes, this is an
issue with the plugins and I'll discuss this later.

One of the problems with the embedded Perl is that it has a rather large
memory leak. It has to be reset every four to five days as it creeps up to
300-400 MB of RAM. I know this is being addressed in the next version, but I
do point it out. NB: we use caching as well.

** Nagios embedded Perl (with caching) leaks memory a lot - a workaround is
     in the next version.

To compound this, we use performance monitoring to feed data into RRDtool for
further processing. We use RRD (RRDcgi is really neat) to provide historical
trends and also to handle non-gauge type collections such as counters.
Admittedly we run this at maximum nice levels to ensure it does not impact
primary data collection work.

** Nagios does not natively deal with counters - not really a Nagios problem,
     just an observation, i.e. write your own plugins (we have).
** Nagios does not natively collect data that can be graphically displayed
     "out of the box" - again not really a problem, just an observation.
     Everyone can roll their own, but it would be nice if something was
     provided. (I can provide my simple prototype Perl scripts if you like,
     but I think Perl is a bit of a problem as it is not going to scale well.)
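
To make the counter point concrete, here is a minimal sketch (in Python, not
one of the Perl prototypes mentioned above) of turning two samples of an SNMP
counter into a rate and a %utilisation figure. The function names, the 32-bit
default and the use of an interface speed in bits per second are assumptions
for illustration. A counter reset on the device looks the same as a wrap here,
so a real plugin would probably want a sanity check on the result.

```python
def counter_rate(prev: int, curr: int, interval_s: float, bits: int = 32) -> float:
    """Per-second rate between two samples of a wrapping counter."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    modulus = 1 << bits
    # Modular subtraction handles a single wrap past the counter maximum.
    delta = (curr - prev) % modulus
    return delta / interval_s


def utilisation_percent(prev_octets: int, curr_octets: int, interval_s: float,
                        if_speed_bps: int, bits: int = 32) -> float:
    """Percentage of link capacity used, from two octet counter samples."""
    if if_speed_bps <= 0:
        raise ValueError("if_speed_bps must be positive")
    bits_per_second = counter_rate(prev_octets, curr_octets, interval_s, bits) * 8
    return 100.0 * bits_per_second / if_speed_bps


if __name__ == "__main__":
    # Two samples taken 300 seconds apart on a 10 Mbit/s interface.
    print(utilisation_percent(0, 3_750_000, 300, 10_000_000))
```

A plugin built on this has to keep the previous sample somewhere between runs,
which is one reason counters sit awkwardly with stateless active checks.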

Now I mentioned that I wrote my own discovery scripts for SNMP. These are
targeted at HOST and vendor MIBs for Win2K and HOST MIBs for Unix hosts. This
works very nicely, giving us the 1500-odd services. The problem came when I
went to deploy the network discovery tools that monitor interfaces via the IF
MIBs. I discovered in excess of 7000 services on network devices alone. Given
that I monitor %utilisation, %errors, %drops, and one other for both in and
out, this gives you an idea of the number of interfaces.
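
As a rough back-of-the-envelope check on those numbers, sketched in Python:
four metrics in each of two directions is eight services per interface. The
5-minute interval for the interface checks is an assumption carried over from
the host checks above, and 7000 is a lower bound, so treat the results as
approximate.

```python
# Figures taken from the text above; constant names are illustrative only.
HOST_SERVICES = 1500
INTERFACE_SERVICES = 7000      # "in excess of", so a lower bound
METRICS_PER_DIRECTION = 4      # %utilisation, %errors, %drops, one other
DIRECTIONS = 2                 # in and out
CHECK_INTERVAL_S = 5 * 60      # assumed to match the host check interval


def estimate_interfaces(services: int) -> float:
    """Interfaces implied by a service count at 8 services per interface."""
    return services / (METRICS_PER_DIRECTION * DIRECTIONS)


def checks_per_second(services: int, interval_s: int = CHECK_INTERVAL_S) -> float:
    """Average plugin executions per second if every service is active."""
    return services / interval_s


if __name__ == "__main__":
    print(estimate_interfaces(INTERFACE_SERVICES))          # 875.0
    print(checks_per_second(HOST_SERVICES))                 # 5.0
    print(round(checks_per_second(HOST_SERVICES + INTERFACE_SERVICES), 1))  # 28.3
```

So the interface sweep implies roughly 875 interfaces, and running everything
as active checks would mean starting around 28 plugin processes every second
on average.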

Suffice to say, this caused the box (already heavily loaded) to have kittens.
Things ran very, very slowly. The thing I found really interesting was that,
because each service pretty much had its own dependency back to SNMP, running
the Nagios config check (with nothing else running) would take 40+ seconds.
This is the real kicker. It means that Nagios will not scale well to even
medium sites (I think we fall under small/medium).

This is a real concern as NNM can do this without even breaking a sweat 
- admittedly it does not have the dependency type
information included.

** Nagios reading of configuration files appears to be expensive.

Given that each CGI reads the config file every time it runs (refreshes) 
it means that there is this huge delay - to the point
where stupid IE will show a previously cached page because it timed out.

I have seen patches to improve performance on this (I have not yet
implemented or tested them) and I think this is improved in the next version.

** Nagios CGIs re-read configs on every execution - this leads to poor
     scaling.

If the interface CGI could be run as some sort of daemon along with 
Nagios, it could drastically improve performance by removing
that need to re-read config on every connection.  Otherwise, this will 
not be able to scale well to a larger number of end
users.
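
The idea can be sketched as follows (Python, purely illustrative and not how
the Nagios CGIs are written): a long-running process keeps the parsed
configuration in memory and only re-parses a file when its modification time
changes. ConfigCache and parse_key_values are hypothetical names, and the
key=value parser is a stand-in for whatever parsing the real files need.

```python
import os
from typing import Any, Callable, Dict, Tuple


def parse_key_values(text: str) -> Dict[str, str]:
    """Hypothetical stand-in parser: key=value lines, '#' starts a comment."""
    result: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        result[key.strip()] = value.strip()
    return result


class ConfigCache:
    """Re-parse a config file only when its mtime has changed."""

    def __init__(self, parse: Callable[[str], Any]) -> None:
        self._parse = parse
        self._cache: Dict[str, Tuple[int, Any]] = {}
        self.parse_count = 0  # exposed so the saving can be observed

    def get(self, path: str) -> Any:
        mtime = os.stat(path).st_mtime_ns
        cached = self._cache.get(path)
        if cached is not None and cached[0] == mtime:
            return cached[1]          # unchanged on disk: skip the parse
        with open(path, "r", encoding="utf-8") as fh:
            value = self._parse(fh.read())
        self.parse_count += 1
        self._cache[path] = (mtime, value)
        return value
```

Each page request then costs one stat() call per file instead of a full
parse, which is where the 40+ seconds would otherwise go.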

** I really like the interface - particularly how you can do customised views
     per user. This is a really big benefit of Nagios - gets over information
     overload when people only need to see one thing. A good example of this
     for us is facilities management - they do not need to see all the
     details, but they do want to see any environmental information (temp,
     humidity, voltage, etc.) from any device.


The joy of Plugins.

I think the plugin concept, in conjunction with passive monitoring, makes
Nagios a really powerful tool.

I have prototyped all my plugins and discovery tools in Perl. The reason is
that I am comfortable with it and find it a really good tool to knock up
quick prototypes with. I usually then write this up in C after I am happy
with the workings of, and lessons from, the prototype.

One of the reasons my plugins are slowish (aside from the fact that they are
Perl) is that I do SNMP gets based on labels. This means that I will ask for
disk utilisation on the filesystem /var or C:\ etc. This is particularly
important to me as the same filesystem may have different instance numbers
depending on what machine you are on. Plus you don't want to have N service
definitions for just one type of collection. I'll fix this by caching
instance numbers in the production version.
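
A minimal sketch of that kind of instance-number caching, in Python rather
than the Perl or C mentioned above. The SNMP walk itself is passed in as a
function, so no particular SNMP library is assumed; InstanceCache and the
walk signature are hypothetical names for illustration.

```python
from typing import Callable, Dict, Tuple

# A walk function takes a host and returns {instance number: label}, e.g. the
# contents of a description column such as hrStorageDescr. Supplying it is
# left to the caller.
Walk = Callable[[str], Dict[int, str]]


class InstanceCache:
    """Remember which table instance a label (e.g. '/var') maps to per host."""

    def __init__(self, walk: Walk) -> None:
        self._walk = walk
        self._index: Dict[Tuple[str, str], int] = {}
        self.walks = 0  # number of table walks actually performed

    def lookup(self, host: str, label: str) -> int:
        key = (host, label)
        if key not in self._index:
            self._refresh(host)       # only walk the table on a cache miss
        return self._index[key]       # KeyError if the label does not exist

    def invalidate(self, host: str) -> None:
        """Forget a host, e.g. after a reboot renumbers its instances."""
        for key in [k for k in self._index if k[0] == host]:
            del self._index[key]

    def _refresh(self, host: str) -> None:
        self.walks += 1
        self.invalidate(host)
        for instance, label in self._walk(host).items():
            self._index[(host, label)] = instance
```

With this in place one service definition per collection type still works,
and the expensive walk happens once per host rather than once per check. A
label that does not exist still triggers a walk on every lookup, so a
production version would probably cache misses as well.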

Because we also have varied SNMP communities all over the place - don't you
love security? - I also have to handle this on the fly, again to limit the
number of service definitions.

** I am not sure that active checks scale at all well under Nagios.

I am planning to convert all the active SNMP checks into passive ones 
and run a daemon to schedule
and collect the data and then feed it to Nagios.  This will resolve the 
issue of having lots of processes
being kicked off.  I'm looking at snmpp, but it is not quite what I am 
after.
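
For reference, the feeding side of such a daemon can be sketched like this in
Python. Passive results go to Nagios through its external command file using
the PROCESS_SERVICE_CHECK_RESULT command. The helper names are hypothetical,
and the command file path is left to the caller since it depends on the
installation.

```python
import time
from typing import Iterable, Optional

VALID_RETURN_CODES = (0, 1, 2, 3)  # OK, WARNING, CRITICAL, UNKNOWN


def passive_result_line(host: str, service: str, return_code: int,
                        output: str, timestamp: Optional[int] = None) -> str:
    """Format one passive service check result as an external command."""
    if return_code not in VALID_RETURN_CODES:
        raise ValueError("return_code must be 0, 1, 2 or 3")
    if ";" in host or ";" in service:
        raise ValueError("host and service must not contain ';'")
    if timestamp is None:
        timestamp = int(time.time())
    single_line = " ".join(output.splitlines())  # command is one line only
    return (f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host};{service};{return_code};{single_line}\n")


def submit(command_file: str, lines: Iterable[str]) -> None:
    """Write a batch of result lines to the command file in one open/close."""
    with open(command_file, "w", encoding="utf-8") as fh:
        fh.write("".join(lines))
```

A collector daemon would poll its devices on its own schedule, build one line
per service with passive_result_line, and call submit with the batch, so
Nagios never has to start a process per check.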

I'll probably use active checks as a backup, as a confirmation when the
passive check fails - though this has inherent scaling risks also. I'd have
to check how Nagios handles things like dependencies and the like before I
comment sensibly on this one.

Anyhow, in all this rambling I'm trying to say I think it is a mighty 
fine product with a couple of things
that could be improved to meet our needs (possibly others as well). 
 I'll be working on fixing the things
I see as issues for us and I'll see if I can get permission to release 
those back to the community.  Admittedly
all I have at the moment are poorly performing Perl prototypes :)


Cheers,



Marcus.


