Distributed monitoring setup

Andrew Kemp andrew_kemp at pacific.net.au
Tue Aug 13 03:44:28 CEST 2002

Greetings,

I would like to compare notes with others on running Nagios in a
distributed environment.

We have three servers running Nagios: one external and two internal. The
external server gives outside parties limited views of their
hosts/connections into our network. The two internal servers are
currently set up in a distributed environment, with one server sending
results to the other (via nsca) because of its geographic location on
our network. The 'central server' not only collects the results from the
distributed server, but also actively checks approximately half of the
total number of hosts and services.
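
For readers less familiar with the nsca path described above, here is a
minimal sketch of how a distributed server might format one passive
service result before piping it to send_nsca. The tab-separated layout
(host, service description, return code, plugin output) is the usual
send_nsca input format; the function name and the host and service names
are hypothetical and only for illustration.

```python
def format_nsca_service_result(host, service, return_code, output):
    """Build one tab-delimited line in the layout send_nsca reads on stdin.

    Hypothetical helper: the name and the validation rules are assumptions
    made for this sketch, not part of Nagios or nsca.
    """
    # Plugin return codes: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN
    if return_code not in (0, 1, 2, 3):
        raise ValueError("return_code must be 0, 1, 2 or 3")
    fields = [host, service, str(return_code), output]
    for field in fields:
        # A stray tab or newline would corrupt the field layout
        if "\t" in field or "\n" in field:
            raise ValueError("fields must not contain tabs or newlines")
    return "\t".join(fields) + "\n"


if __name__ == "__main__":
    # Hypothetical host and service names
    line = format_nsca_service_result("remote-host-1", "PING", 0, "PING OK")
    print(repr(line))
```

The resulting line would be written to the standard input of send_nsca
on the distributed server; the nsca daemon on the central server then
hands it to Nagios as a passive check result.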

My reading of the Nagios documentation is that it assumes the central
server only accepts results from the distributed servers rather than
actively checking hosts and services itself. However, I see no reason
why the central server cannot also run active checks; there is no
design issue that I am aware of.

How do others run their distributed setups?

Also, we believe that there are scheduling issues with Nagios under
different Linux kernels. With a 2.2.20 kernel in a distributed setup, we
found that the number of Nagios processes continued to grow, i.e. the
child processes were not being reaped. An strace of a child process
showed that it was waiting to write to the external command file.
However, whenever we ran strace against the parent process, it showed
no errors and the reaping worked as expected.
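
To make "reaping" concrete: a parent process has to collect the exit
status of each finished child, otherwise the children linger in the
process table. The sketch below shows the general POSIX technique with a
non-blocking waitpid loop. It is not the Nagios implementation (Nagios is
written in C), and all function names are hypothetical.

```python
import os
import time


def spawn_children(count):
    """Fork `count` children that exit immediately; return their PIDs."""
    pids = []
    for _ in range(count):
        pid = os.fork()
        if pid == 0:
            os._exit(0)  # child: exit at once; it stays a zombie until reaped
        pids.append(pid)
    return pids


def reap_finished():
    """Collect every child that has already exited, without blocking."""
    reaped = []
    while True:
        try:
            pid, _status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # this process has no children left
        if pid == 0:
            break  # children exist, but none has exited yet
        reaped.append(pid)
    return reaped


def reap_until(expected, timeout=5.0):
    """Poll reap_finished() until all `expected` PIDs are seen or time runs out."""
    seen = set()
    deadline = time.time() + timeout
    while not set(expected) <= seen and time.time() < deadline:
        seen.update(reap_finished())
        time.sleep(0.01)
    return seen


if __name__ == "__main__":
    pids = spawn_children(3)
    print(sorted(reap_until(pids)) == sorted(pids))
```

If the parent never reaches its equivalent of reap_finished(), for
example because it is stuck elsewhere, the process count keeps growing in
the way described above.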

Therefore, we modified the Nagios start script to attach strace to the
parent process, and Nagios ran fine this way for many months. This was
with Nagios 1.0a7 through 1.0b3 and the previously undocumented
'command_check_interval=-1'.

Recently we upgraded the monitoring hosts to a 2.4 kernel and
discovered an entirely different problem. The number of Nagios processes
grows exponentially until the load on the box is so high that a hard
reset is required. Again, the child processes do not appear to be
reaped as would be expected. An strace of the child processes shows
that they are waiting on a write to the internal pipe (to the Nagios
parent process) after reading the results from the external command
file.
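
As background on why a child can sit in a write to a pipe: a pipe has a
finite kernel buffer, and once it is full a writer has to wait until the
reader drains it. The sketch below demonstrates this general behaviour
with a non-blocking writer. It is an illustration only, not a diagnosis
of the Nagios problem; the function names are hypothetical and the buffer
size depends on the kernel.

```python
import os


def fill_pipe(write_fd, chunk_size=4096):
    """Write until the pipe buffer is full; return the number of bytes accepted."""
    os.set_blocking(write_fd, False)  # a blocking writer would simply hang here
    chunk = b"x" * chunk_size
    total = 0
    while True:
        try:
            total += os.write(write_fd, chunk)
        except BlockingIOError:
            return total  # buffer full: the write would have blocked


def drain(read_fd, nbytes):
    """Read up to `nbytes` from the pipe; return how many bytes were read."""
    remaining = nbytes
    while remaining > 0:
        data = os.read(read_fd, min(remaining, 65536))
        if not data:
            break  # writer closed the pipe
        remaining -= len(data)
    return nbytes - remaining


if __name__ == "__main__":
    r, w = os.pipe()
    filled = fill_pipe(w)
    print("pipe accepted", filled, "bytes before a write would block")
    drain(r, filled)
    print("after draining, wrote", os.write(w, b"ok"), "bytes")
    os.close(r)
    os.close(w)
```

In other words, children blocked on a pipe write are consistent with a
parent that is not reading its end of the pipe quickly enough.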

We have tried numerous ways of correcting this problem, including
upgrading to Nagios 1.0b4 and pulling in the latest base/checks.c
from CVS, but cannot get Nagios to reap the child processes quickly
enough. So, until we can resolve this problem, we have been forced to
downgrade to the 2.2 kernel, where Nagios 1.0b4 with the CVS
base/checks.c works fine (though still with the strace on the parent
process).

So, I would be interested in hearing from others who are running
Nagios in a distributed setup under Linux, and whether they are
experiencing similar issues.

Regards,

Andrew
