Distributed monitoring problem

Marcel Mitsuto Fucatu Sugano msugano at uolinc.com
Wed Dec 21 20:23:41 CET 2005


On Wed, 2005-12-21 at 12:08 +0100, Rob Hassing wrote:
> Hello all,

Hi Rob,

> I'm trying to set up a distributed monitoring system.
> At the start all looked fine to me, but now I'm having some problems with
> not receiving all passive checks from other hosts.

Distributed monitoring is waaay cool. :) The only thing that could lead
to an issue is that the CGIs that come with the web interface don't scale
very well. Here we ended up storing status in MySQL via a NEB module, and
we are now testing GroundWork's framework, which appears to fit our needs.
Only the config file generator was developed in-house, to properly set up
all the distributed agents, with all of the configuration kept in a
database.

> The machine is an Intel(R) Xeon(TM) CPU 2.40GHz system with 512 MB RAM.

> The process info tells me this:
> Time Frame	Checks Completed
> <= 1 minute:	51 (16.6%)
> <= 5 minutes:	221 (71.8%)
> <= 15 minutes:	255 (82.8%)
> <= 1 hour:	260 (84.4%)
> Since program start:  	261 (84.7%)

Here is what we have:
<= 1 minute:	2383 (21.3%)
<= 5 minutes:	6138 (54.7%)
<= 15 minutes:	8321 (74.2%)
<= 1 hour:	10138 (90.4%)
Since program start:  	10711 (95.5%)

> So it's receiving less than 85% of all checks :(
> There will be more passive checks to be sent to this nagios server.
> Do we need other hardware ?
> Where do I need to look to solve this problem ?

To avoid stale services, you need to set freshness_threshold properly
for your services. Here is your hint: tuning freshness_threshold is a
little tricky, because you have to allow enough time for the packet
carrying the check result to arrive, and the fewer services you set it on
explicitly, letting Nagios calculate it, the better. But it is the only
thing you can configure to keep service results from going stale. We
decided to make stale results appear in an UNKNOWN state, because
staleness could be nothing more than a traffic issue along the packet's
path, caused by backup/restore routines, high traffic load, or other
such things.
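
As a rough sketch (the directives are standard Nagios ones, but the
425-second threshold and the stale_service command name are only taken
from the log and discussion below, so adjust them to your environment),
a freshness setup on the central server could look something like this:

  define service{
          use                     generic-service
          host_name               cat29-w11-backup
          service_description     PING
          active_checks_enabled   0    ; results arrive passively via NSCA
          passive_checks_enabled  1
          check_freshness         1    ; let Nagios watch for stale results
          freshness_threshold     425  ; seconds to wait before forcing a check
          check_command           stale_service  ; run only when the result is stale
          }

  define command{
          command_name    stale_service
          command_line    /usr/local/nagios/libexec/stale_service.sh
          }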

> The machines sending the passive check info are not too busy doing this,
> the checks are separated over three different servers.

Here we have 11 distributed servers sending check results via send_nsca,
with around 2k services configured on each one. They are all SPARC
servers sending to a SuSE 9.3 box on commodity hardware. That Linux
machine is a P4-HT with 2 GB of RAM and some SATA disks.
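
For reference, a distributed agent hands a result to the central box by
piping a tab-separated line into send_nsca, roughly like this (the host
name, paths and example output are just placeholders here):

  # fields: host name, service description, return code, plugin output
  printf "cat29-w11-backup\tPING\t0\tPING OK - Packet loss = 0%%, RTA = 0.89 ms\n" | \
      /usr/local/nagios/bin/send_nsca -H central-nagios.example.com \
      -c /usr/local/nagios/etc/send_nsca.cfg

The nsca daemon on the central server then writes the matching
PROCESS_SERVICE_CHECK_RESULT external command into the command file,
which is the first entry you see in the log below.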

> One example...
> This is /var/log/nagios/nagios.log:
> [1135162484] EXTERNAL COMMAND: PROCESS_SERVICE_CHECK_RESULT;cat29-w11-backup;PING;0;PING OK - Packet loss = 0%, RTA = 0.89 ms
> [1135162491] SERVICE ALERT: cat29-w11-backup;PING;OK;HARD;3;PING OK - Packet loss = 0%, RTA = 0.89 ms
> [1135162491] SERVICE NOTIFICATION: nagios;cat29-w11-backup;PING;OK;notify-by-epager;PING OK - Packet loss = 0%, RTA = 0.89 ms
> [1135162491] SERVICE NOTIFICATION: nagios;cat29-w11-backup;PING;OK;notify-by-email;PING OK - Packet loss = 0%, RTA = 0.89 ms
> [1135162941] Warning: The results of service 'PING' on host 'cat29-w11-backup' are stale by 32 seconds (threshold=425 seconds).  I'm forcing an immediate check of the service.
> [1135162951] SERVICE ALERT: cat29-w11-backup;PING;CRITICAL;SOFT;1;CRITICAL: Service results are stale!
> 
> It looks like it's going stale again too fast?

Well, those last two lines don't indicate two stale services. The first
of them, the warning that mentions the freshness threshold, means the
central Nagios waited 425 seconds and the passive check result arrived
32 seconds after that. The last line shows the forced active check being
processed by the central Nagios, and that is what appears as a critical
alert on the web interface. The active check, stale_service.sh or
whatever command line you place there, is what gets run. (It can be the
real check, so the central Nagios would actively re-check stale services
itself, but that will cause some load trouble :)
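
A minimal sketch of such a stale_service.sh (the path and wording are my
assumptions, matching the CRITICAL message in the log above) is just:

  #!/bin/sh
  # Run by the central Nagios only when a passive result has gone stale.
  # Exit code 2 maps to CRITICAL; use exit 3 instead if you prefer stale
  # results to show up as UNKNOWN, as we do here.
  echo "CRITICAL: Service results are stale!"
  exit 2

That is enough to flag the service on the web interface without putting
any real checking load on the central box.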


HTH && Regards,
-- 
Marcel Mitsuto Fucatu Sugano <msugano at uolinc.com>
Universo Online S.A. -- http://www.uol.com.br

