Large Distributed Nagios w/ NSCA

Mooney, Ryan ryan.mooney at pnl.gov
Thu May 13 00:36:39 CEST 2004


I was one of the people having some problems.  I have ~1000 hosts and
almost 15000 services.  Some of the issues I saw were:

  - No messages over the kernel pipe size (512 bytes), or you will have
    problems with messages overwriting each other.  (NSCA may enforce
    this?  I don't know, since we use a local replacement.)
  - The Linux 2.4 scheduler has a problem whereby it sometimes doesn't
    schedule processes very well.  This leads to a lot of nagios
    processes hanging around reading a pipe and never exiting (or
    taking a very long time to do so).  Note that this is not THE
    nagios pipe; it's an IPC pipe that the parent uses to communicate
    with a child that reads THE nagios pipe.
    Some people have recompiled the kernel with a larger pipe size to
    fix this (there is a good explanation of why this works in the
    archives; unfortunately I lost my pointer to it).  I "fixed" it by
    putting an alarm around the pipe call and aborting the process when
    it failed.  This means I lose some small number of messages, but
    the box doesn't crash any more (definitely suboptimal, but...).
    FreeBSD users don't have this problem.
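
To illustrate the "alarm around the pipe call" idea, here is a minimal
sketch in Python rather than the C that nagios itself is written in.  It
is not the actual patch: the function name read_with_alarm, the
exception class, and the timeout values are made up for the example.  It
arms a SIGALRM timer before a blocking read and gives up if the timer
fires first.

```python
import os
import signal


class PipeReadTimeout(Exception):
    """Raised by the SIGALRM handler when the read takes too long."""


def _on_alarm(signum, frame):
    # Raising here interrupts the blocking os.read() below.
    raise PipeReadTimeout()


def read_with_alarm(fd, max_bytes, timeout_s):
    """Read from fd, giving up after timeout_s seconds.

    Hypothetical helper: returns the bytes read, or None if the alarm
    fired first.  Must be called from the main thread (signal handlers
    can only be installed there).
    """
    old_handler = signal.signal(signal.SIGALRM, _on_alarm)
    # ITIMER_REAL delivers SIGALRM, like alarm(), but allows fractions.
    signal.setitimer(signal.ITIMER_REAL, timeout_s)
    try:
        return os.read(fd, max_bytes)
    except PipeReadTimeout:
        return None
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)      # cancel the timer
        signal.signal(signal.SIGALRM, old_handler)   # restore old handler
```

A caller that gets None back would drop that message and exit, which
matches the trade-off described above: a few lost results instead of a
pile of stuck processes.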

I started out with Postgres as the storage, but I don't think I can
recommend that as a solution for large sites.  Not because of any
deficiency in Postgres, but because of how nagios updates the data.
Essentially, from what I can tell, it simply does a delete * followed
by a loop to insert every item at the update interval.  This isn't very
efficient and tends to thrash the DB (MySQL may handle this better;
again I don't know, I didn't test).  I ended up using a memory
filesystem as the place to put the status file, and this seems to work
quite well.  If the DB code were redone to only do updates on changes
it would probably perform fine (although that introduces some other
maintenance issues that you need to be careful of).  Also, the DB code
is going away in a future release, so setting up to use it now is
probably somewhat of a waste of time.
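
The difference between the two update strategies can be sketched with an
in-memory SQLite database in Python.  This is only an illustration: the
table name servicestatus, its columns, and both function names are
invented for the example and are not the real nagios schema or code.

```python
import sqlite3


def make_db():
    """Create a throwaway in-memory DB with a hypothetical status table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE servicestatus ("
        " host TEXT, service TEXT, state TEXT,"
        " PRIMARY KEY (host, service))"
    )
    return db


def rewrite_everything(db, current):
    """Wipe the table and re-insert every row, as described in the post."""
    with db:  # one transaction, committed on success
        db.execute("DELETE FROM servicestatus")
        db.executemany(
            "INSERT INTO servicestatus (host, service, state)"
            " VALUES (?, ?, ?)",
            [(h, s, st) for (h, s), st in current.items()],
        )
    return len(current)  # rows written this cycle


def update_changes_only(db, previous, current):
    """Write only the rows whose state changed since the last cycle.

    Rows for services that were removed from the configuration are NOT
    cleaned up here; that is one of the maintenance issues to watch for.
    """
    changed = [
        (h, s, st)
        for (h, s), st in current.items()
        if previous.get((h, s)) != st
    ]
    with db:
        for h, s, st in changed:
            cur = db.execute(
                "UPDATE servicestatus SET state = ?"
                " WHERE host = ? AND service = ?",
                (st, h, s),
            )
            if cur.rowcount == 0:  # new service: no row to update yet
                db.execute(
                    "INSERT INTO servicestatus (host, service, state)"
                    " VALUES (?, ?, ?)",
                    (h, s, st),
                )
    return len(changed)  # rows written this cycle
```

With 15000 services of which only a handful change per interval, the
first function writes 15000 rows every cycle while the second writes
only the handful.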

The other major problem you will find is that the CGI scripts are really
slow (as in taking minutes to load) for larger service counts.  This is
due to an N depth traversal of several linked lists in the code that
builds the service/host lists.  This is supposed to be fixed in 2.0 and
may be an argument for using a CVS snapshot.  There is also a patch for
1.2 floating around that fixes the problem as well, but you will have
to apply it, etc...
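
As a general illustration of why nested list traversal hurts at this
scale (I have not checked how the 2.0 code or the 1.2 patch actually
solves it), here is a Python sketch.  The record layout and both
function names are assumptions made up for the example.

```python
def join_by_scan(hosts, services):
    """For each service, walk the host list until the name matches.

    Cost grows with len(hosts) * len(services), which is the kind of
    nested traversal that gets slow with thousands of entries.
    """
    out = []
    for svc in services:
        for host in hosts:
            if host["name"] == svc["host"]:
                out.append((host["name"], host["address"], svc["desc"]))
                break
    return out


def join_by_index(hosts, services):
    """Same result, but build a name -> host lookup table once."""
    by_name = {host["name"]: host for host in hosts}
    out = []
    for svc in services:
        host = by_name.get(svc["host"])
        if host is not None:
            out.append((host["name"], host["address"], svc["desc"]))
    return out
```

With ~1000 hosts and ~15000 services the scanning version can do on the
order of millions of comparisons per page load, while the indexed
version does one lookup per service.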

> -----Original Message-----
> From: Radcliffe, Jerome (ISS Southfield) [mailto:JRadcliffe at iss.net] 
> Sent: Wednesday, May 12, 2004 2:30 PM
> To: nagios-users at lists.sourceforge.net
> Subject: [Nagios-users] Large Distributed Nagios w/ NSCA
> 
> 
> Hello, 
> 
> I am currently heading a project to monitor roughly 1500
> hosts with 4500 total service checks (3-4 checks per system)
> using Nagios 1.2 with a Postgres DB.  There will be
> "collectors" that will perform the service checks and report
> them back with NSCA to the main server. My question is: has
> anyone done a similar setup? If so, were there any problems
> with NSCA or the named pipe?  I have read
> through the newsgroup and there seem to be some issues for
> some people, while some have gone way over that with no issues.
> 
> Thanks, 
> Jay
> 
> Jerome Radcliffe, CISSP
> Security Engineer Principal
> MSS Engineering 
> Internet Security Systems
>  
> 
> 
> -------------------------------------------------------
> This SF.Net email is sponsored by Sleepycat Software
> Learn developer strategies Cisco, Motorola, Ericsson & Lucent 
> use to deliver higher performing products faster, at low TCO. 
> http://www.sleepycat.com/telcomwpreg.php?From=osdnemail3
> 
> 
> _______________________________________________
> Nagios-users mailing list
> Nagios-users at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-users
> ::: Please include Nagios version, plugin version (-v) and OS 
> when reporting any issue. 
> ::: Messages without supporting info will risk being sent to /dev/null
> 

