Nagios and Gearman - huge environment performance problem

Rodney Ramos rodneyra at gmail.com
Wed Aug 24 18:35:53 CEST 2011


Hi Daniel. In my environment I have a lot of hosts that have been down for a
long time, and that is something I can't change. One thing that should be
clear is that I'm using Gearman and mod_gearman to run the checks. I have 9
workers (virtual machines) doing the job. The central server, running Nagios
3.2.3, does not execute any plugins. It is a physical machine with 8 CPUs and
4 GB of RAM, running 64-bit RHEL 5.4. Thanks.
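
For reference, the rate arithmetic discussed in the quoted messages below can
be written out as a small sketch. It uses the totals and average execution
times from the quoted nagiostats output, and it assumes interval_length is
still at the Nagios default of 60 seconds and that checks are spread evenly
over the interval (which is exactly what is not happening here). The function
names are made up for illustration.

```python
# Back-of-the-envelope estimate, not a measurement.
# Assumption: interval_length is the Nagios default (60 s) and unchanged.
INTERVAL_LENGTH_S = 60
# check_interval / normal_check_interval from the quoted templates.cfg
CHECK_INTERVAL = 15


def checks_per_second(n_checks, check_interval=CHECK_INTERVAL,
                      interval_length_s=INTERVAL_LENGTH_S):
    """Average check rate if n_checks are spread evenly over one interval."""
    return n_checks / (check_interval * interval_length_s)


def avg_in_flight(rate_per_s, avg_exec_time_s):
    """Little's law: average concurrency = arrival rate * time in system."""
    return rate_per_s * avg_exec_time_s


# Totals and average execution times copied from the nagiostats output.
SERVICES, HOSTS = 68206, 34103
SVC_EXEC_AVG_S, HOST_EXEC_AVG_S = 2.527, 2.033

svc_rate = checks_per_second(SERVICES)
host_rate = checks_per_second(HOSTS)
total_rate = svc_rate + host_rate
in_flight = (avg_in_flight(svc_rate, SVC_EXEC_AVG_S)
             + avg_in_flight(host_rate, HOST_EXEC_AVG_S))

print(f"total rate: {total_rate:.1f} checks/sec")
print(f"average checks in flight: {in_flight:.0f}")
```

Under those assumptions this gives about 114 checks/sec and about 269 checks
in flight on average, which is already above a max_concurrent_checks of 200
even before any burstiness in the scheduling.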

On Wed, Aug 24, 2011 at 11:37 AM, Daniel Wittenberg <
daniel.wittenberg.r0ko at statefarm.com> wrote:

> I noticed from the output that you have a high number of unknown and
> critical services. Are those taking a long time to time out? What you might
> try, which I know isn't ideal, is removing certain checks that might be
> failing: start with just the host checks, and when those look good, add a
> few more services, then a few more, and so on, until you notice the time
> going through the roof again. That might help figure out where your
> threshold is, and whether there are certain checks that are causing issues.
> Is this a physical or virtual server?
>
>
> Dan
>
> *From:* Rodney Ramos [mailto:rodneyra at gmail.com]
> *Sent:* Wednesday, August 24, 2011 9:26 AM
>
> *To:* Nagios Developers List
> *Subject:* Re: [Nagios-devel] Nagios and Gearman - huge environment
> performance problem
>
> Hi Sven. Thank you again. I'm pretty sure that my check interval is 15 min
> for both hosts and services. I've set this in the templates.cfg file (see
> below). I'm also sending the nagiostats output. I agree with you that
> 100k checks / 15 min ~ 111 checks/sec, but the problem is that Nagios does
> not spread these checks evenly over time. That's the problem.
>
>
> ==========
> templates.cfg
> ==========
> define host{
>         name                            generic-host
>         ...
>         check_interval          15
>         ....
> }
>
> define service{
>         name                            generic-service
>         ...
>         normal_check_interval           15
>         ....
> }
>
> ==============
> nagiostats output
> ==============
> Nagios Stats 3.2.3
> Copyright (c) 2003-2008 Ethan Galstad (www.nagios.org)
> Last Modified: 10-03-2010
> License: GPL
>
> CURRENT STATUS DATA
> ------------------------------------------------------
> Status File:                            /usr/local/nagios/var/status.dat
> Status File Age:                        0d 0h 0m 17s
> Status File Version:                    3.2.3
>
> Program Running Time:                   0d 17h 43m 2s
> Nagios PID:                             18854
> Used/High/Total Command Buffers:        0 / 0 / 4096
>
> Total Services:                         68206
> Services Checked:                       68206
> Services Scheduled:                     68206
> Services Actively Checked:              68206
> Services Passively Checked:             0
> Total Service State Change:             0.000 / 43.880 / 2.774 %
> Active Service Latency:                 40.671 / 503.137 / 234.919 sec
> Active Service Execution Time:          0.003 / 24.737 / 2.527 sec
> Active Service State Change:            0.000 / 43.880 / 2.774 %
> Active Services Last 1/5/15/60 min:     0 / 2897 / 35932 / 68206
> Passive Service Latency:                0.000 / 0.000 / 0.000 sec
> Passive Service State Change:           0.000 / 0.000 / 0.000 %
> Passive Services Last 1/5/15/60 min:    0 / 0 / 0 / 0
> Services Ok/Warn/Unk/Crit:              46943 / 56 / 7660 / 13547
> Services Flapping:                      980
> Services In Downtime:                   0
>
> Total Hosts:                            34103
> Hosts Checked:                          34103
> Hosts Scheduled:                        34103
> Hosts Actively Checked:                 34103
> Host Passively Checked:                 0
> Total Host State Change:                0.000 / 63.820 / 2.598 %
> Active Host Latency:                    0.000 / 474.337 / 247.944 sec
> Active Host Execution Time:             0.000 / 20.354 / 2.033 sec
> Active Host State Change:               0.000 / 63.820 / 2.598 %
> Active Hosts Last 1/5/15/60 min:        0 / 5936 / 29437 / 34103
> Passive Host Latency:                   0.000 / 0.000 / 0.000 sec
> Passive Host State Change:              0.000 / 0.000 / 0.000 %
> Passive Hosts Last 1/5/15/60 min:       0 / 0 / 0 / 0
> Hosts Up/Down/Unreach:                  23591 / 10512 / 0
> Hosts Flapping:                         597
> Hosts In Downtime:                      0
>
> Active Host Checks Last 1/5/15 min:     3 / 89 / 209
>    Scheduled:                           0 / 0 / 0
>    On-demand:                           3 / 89 / 209
>    Parallel:                            0 / 0 / 0
>    Serial:                              0 / 0 / 0
>    Cached:                              3 / 89 / 209
> Passive Host Checks Last 1/5/15 min:    0 / 0 / 0
> Active Service Checks Last 1/5/15 min:  0 / 0 / 0
>    Scheduled:                           0 / 0 / 0
>    On-demand:                           0 / 0 / 0
>    Cached:                              0 / 0 / 0
> Passive Service Checks Last 1/5/15 min: 0 / 0 / 0
>
> External Commands Last 1/5/15 min:      0 / 0 / 0
>
> On Tue, Aug 23, 2011 at 6:14 PM, Sven Nierlein <Sven.Nierlein at consol.de>
> wrote:
>
> On 8/23/11 22:21, Rodney Ramos wrote:
> > When I changed max_concurrent_checks from "0" to "200", the nagios
> > process fell to 30/50%. However, the latency increased a lot, going to
> > more than 1000 sec!!
>
> Which means you usually have more than 200 concurrent checks. Maybe
> 400-500. When I compare that to your initial mail, which mentioned 60k
> services + 30k hosts at a 15 min interval, I get only 100 checks/second.
> Are you sure about the 15 min interval? How many checks do you have per
> second? Did you change your interval_length?
>
>  Sven
>
>
>
> ------------------------------------------------------------------------------
> EMC VNX: the world's simplest storage, starting under $10K
> The only unified storage solution that offers unified management
> Up to 160% more powerful than alternatives and 25% more efficient.
> Guaranteed. http://p.sf.net/sfu/emc-vnx-dev2dev
>
> _______________________________________________
> Nagios-devel mailing list
> Nagios-devel at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-devel
>