Distributed Servers reporting false errors

Mark Parker Mark.Parker at beacon.com.au
Wed Nov 12 08:09:39 CET 2003


Hi all,

We have a distributed monitoring environment with a single central server and 2 distributed servers. The central server runs Nagios 1.1 and the two distributed servers both run Netsaint 0.0.7. The environment monitors roughly 45 hosts.

On two occasions now we have experienced problems with the distributed servers. After running uninterrupted for around 3 weeks, they begin experiencing timeouts on ping, http, and nrpep/nrpe checks, so critical alerts are reported at all hours of the day.

A full check of the service itself shows it to be fine. However, manually running the netsaint/nagios check for that service from the distributed server shows very slow response times and frequent timeouts.

Restarting the Netsaint process on the distributed servers appears to clear these timeout issues, and the service checks begin responding as expected.  Immediately after a stop and restart of netsaint, performing the checks manually shows response times close to the baseline again.
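
For anyone wanting to reproduce the before/after comparison, the manual timing could be scripted roughly as below. This is only a minimal sketch, not something taken from our setup: the names time_check and summarize are made up for illustration, and the command you pass in (a plugin path and its arguments) is whatever you would normally run by hand.

```python
import subprocess
import sys
import time


def time_check(cmd, timeout=10.0):
    """Run one check command and return (elapsed_seconds, exit_code).

    exit_code is None when the command did not finish within `timeout`.
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
        code = proc.returncode
    except subprocess.TimeoutExpired:
        code = None
    return time.monotonic() - start, code


def summarize(cmd, runs=5, timeout=10.0):
    """Run the command several times and summarise timings and timeouts."""
    results = [time_check(cmd, timeout) for _ in range(runs)]
    times = [elapsed for elapsed, _ in results]
    return {
        "min": min(times),
        "max": max(times),
        "mean": sum(times) / len(times),
        "timeouts": sum(1 for _, code in results if code is None),
    }


if __name__ == "__main__":
    # Pass the plugin command line as arguments. With no arguments, a
    # harmless placeholder command is run so the script still works.
    command = sys.argv[1:] or [sys.executable, "-c", "pass"]
    print(summarize(command))
```

Running this once while the timeouts are occurring and again straight after a restart would give comparable figures for the same check from the same host.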

Is there any known issue with running Netsaint for a prolonged period of time, or with running distributed servers beyond a certain period? Would the number of hosts the environment is checking influence the frequency of such an issue?

Any insights would be greatly appreciated.

Many thanks,

Mark Parker

