Large scale network monitoring limits with nagios

Paul L. Allen pla at softflare.com
Thu Mar 11 17:41:55 CET 2004


Daniel Henninger writes: 

> If it runs slow with 8000+ hosts, you might want to run a job that
> generates a static page, that gets auto-regenerated "as often as
> possible".  That way, at least there's only one connection that has to
> wait a long time, and the folk connecting to your interface would only
> have to wait on the brief static page load.

How about something like this? 

Set up two Nagios processes.  They could be on different servers or on
the same one.  One has every useful trap for every device of interest
defined as a service (pretty much what he has now), and traps are fed to
it as passive check results.  However, instead of the standard
submit_check_result script, he uses something else that has some smarts. 
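For concreteness, here is a minimal sketch of what the submission end of
that looks like in Perl.  The command-file path is an assumption and has
to match the command_file setting in nagios.cfg; the
PROCESS_SERVICE_CHECK_RESULT external command is the standard way to
inject a passive result: 

#!/usr/bin/perl
# Minimal sketch: push one passive service check result into Nagios via
# its external command file.  The path and argument handling are
# illustrative only.
use strict;
use warnings;

my $cmd_file = '/usr/local/nagios/var/rw/nagios.cmd';   # assumed command_file
my ($host, $service, $status, $output) = @ARGV;         # status: 0=OK 1=WARN 2=CRIT

my $now = time();
open(my $fh, '>>', $cmd_file) or die "can't open $cmd_file: $!";
print $fh "[$now] PROCESS_SERVICE_CHECK_RESULT;$host;$service;$status;$output\n";
close($fh);

The "something with smarts" would end up doing the same write, but only
after updating and consulting its retained state. 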

That something else would have to retain the state of every service
somehow.  When called, it would update the relevant bit of state info
(recording host/service/status) and then check the retained statuses of
all the services for the host it just received a trap for: if any of
those statuses is critical, it treats the host as critical; if none is
critical but at least one is a warning, it treats the host as being in a
warning state; if all of them are OK, the host is OK.  Finally, it passes
that host status on to the other Nagios process as a pseudo service
result (say "health check"). 

That way you'd only have one service per device on the Nagios that
receives the rolled-up "health check" results, but you'd have all the
individual services on the other Nagios.  The summary Nagios would be the
one you'd normally look at; if there is a problem you could go to the
other Nagios for the details (there might be more than one service with
problems). 
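On the summary Nagios the pseudo service is just a passive-only service
definition along these lines; the host name, the template and the
check_dummy command are illustrative and assume your config already
defines them: 

define service{
        use                      generic-service   ; assumed local template
        host_name                router1           ; illustrative device
        service_description      health check
        active_checks_enabled    0
        passive_checks_enabled   1
        check_command            check_dummy!3     ; never actively run
        max_check_attempts       1
        }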

Implementation is, of course, left as an exercise for the reader.
Figuring out a way of retaining all that state info efficiently, with
low access overhead, is the hard bit, especially if you want it to scale
well: ideally you want a database, but the overhead of connecting to a
database for each update/query would be high.  I would be tempted to do
the state retention with some sort of daemon that the submission script
talks to over a named pipe; the daemon could be passed host/service/etc.,
communicate with the database (to which it would hold a persistent
connection), and return the current overall status, the name of the most
recent service to enter that state, and the reason it did so. 
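As a rough sketch of that daemon, assuming a named pipe at
/var/run/trapstate.fifo, a MySQL table service_state(host, service,
status) and a pipe-separated wire format (none of which come from the
discussion above): 

#!/usr/bin/perl
# Sketch of the state-retention daemon: read "host|service|status" lines
# from a FIFO, keep the state in a database over one persistent
# connection, and work out the per-host roll-up with a single query.
use strict;
use warnings;
use DBI;
use POSIX qw(mkfifo);

my $fifo = '/var/run/trapstate.fifo';              # assumed pipe location
mkfifo($fifo, 0600) unless -p $fifo;

# Connect once at startup instead of once per update/query.
my $dbh = DBI->connect('DBI:mysql:database=nagstate', 'naguser', 'secret',
                       { RaiseError => 1, AutoCommit => 1 });
my $upsert = $dbh->prepare(
    'REPLACE INTO service_state (host, service, status) VALUES (?, ?, ?)');
my $worst  = $dbh->prepare(
    'SELECT MAX(status) FROM service_state WHERE host = ?');

while (1) {
    # Opening the FIFO blocks until a submission script writes to it.
    open(my $fh, '<', $fifo) or die "can't open $fifo: $!";
    while (my $line = <$fh>) {
        chomp $line;
        my ($host, $service, $status) = split /\|/, $line;
        next unless defined $status;
        $upsert->execute($host, $service, $status);
        my ($host_status) = $dbh->selectrow_array($worst, undef, $host);
        # Forwarding $host_status to the summary Nagios as the "health
        # check" passive result for $host is omitted for brevity.
    }
    close $fh;
}

Returning the overall status, last-changed service and reason back to the
caller needs a reply channel (a second pipe or a Unix socket), which I've
glossed over here. 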

If you write the daemon in Perl, you could make use of its hashes to
create an in-memory database.  The upside is that it's likely to be
faster.  The downside is that it loses all its state if you have to
restart it. 
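For what it's worth, a hash of hashes keyed by host then service is all
the in-memory version needs, and a periodic Storable snapshot (core Perl)
is one way to soften the restart problem; the snapshot path here is an
assumption: 

use Storable qw(store retrieve);

my $snapshot = '/var/lib/trapstate/state.stor';    # assumed snapshot file
my %state = -e $snapshot ? %{ retrieve($snapshot) } : ();

# Update: $state{$host}{$service} = $status, then roll up as before.
$state{'router1'}{'linkDown trap'} = 2;            # example update
store(\%state, $snapshot);                         # e.g. every N updates or on exit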

-- 
Paul Allen
Softflare Support 


