Nagios and Gearman - huge environment performance problem

Daniel Wittenberg daniel.wittenberg.r0ko at statefarm.com
Wed Aug 24 19:50:39 CEST 2011


Makes me wonder, then: what latency does Gearman introduce?  Keep in mind I have no knowledge of the inner workings of Gearman, but it would be interesting to see what logging/debugging at that level can bring.  I just wonder what your server is so busy doing that it can't get checks out the door.

Dan

From: Rodney Ramos [mailto:rodneyra at gmail.com]
Sent: Wednesday, August 24, 2011 11:36 AM
To: Nagios Developers List
Subject: Re: [Nagios-devel] Nagios and Gearman - huge environment performance problem

Hi Daniel. In my environment I have a lot of hosts that have been down for a long time, and I can't do anything about this. To be clear, I'm using Gearman and mod_gearman to run the checks. I have 9 workers (virtual machines) doing the job. The central server, running Nagios 3.2.3, does not execute any plugins. It is a physical machine with 8 CPUs and 4 GB RAM, running 64-bit RHEL 5.4. Thanks.

On Wed, Aug 24, 2011 at 11:37 AM, Daniel Wittenberg <daniel.wittenberg.r0ko at statefarm.com> wrote:
I noticed from the output that you have a large number of unknown and critical services.  Are those taking a long time to time out?  What you might try, which I know isn't ideal, is removing certain checks that might be failing: start with just the host checks, and when those look good, add a few more services, then a few more, etc., until you notice the time going through the roof again.  That might help figure out where your threshold is, and whether certain checks are causing issues.  Is this a physical or virtual server?

Dan

From: Rodney Ramos [mailto:rodneyra at gmail.com]
Sent: Wednesday, August 24, 2011 9:26 AM
To: Nagios Developers List
Subject: Re: [Nagios-devel] Nagios and Gearman - huge environment performance problem

Hi Sven. Thank you again. I'm pretty sure that my check interval is 15 min for both hosts and services. I set this in the templates.cfg file (see below). I'm also sending the nagiostats output. I agree with you that 100k checks / 15 min ~ 111 checks/sec, but the problem is that Nagios does not spread these checks evenly over the interval. That's the problem.


==========
templates.cfg
==========
define host{
        name                            generic-host
        ...
        check_interval          15
        ....
}

define service{
        name                            generic-service
        ...
        normal_check_interval           15
        ....
}

==============
nagiostats output
==============
Nagios Stats 3.2.3
Copyright (c) 2003-2008 Ethan Galstad (www.nagios.org)
Last Modified: 10-03-2010
License: GPL

CURRENT STATUS DATA
------------------------------------------------------
Status File:                            /usr/local/nagios/var/status.dat
Status File Age:                        0d 0h 0m 17s
Status File Version:                    3.2.3

Program Running Time:                   0d 17h 43m 2s
Nagios PID:                             18854
Used/High/Total Command Buffers:        0 / 0 / 4096

Total Services:                         68206
Services Checked:                       68206
Services Scheduled:                     68206
Services Actively Checked:              68206
Services Passively Checked:             0
Total Service State Change:             0.000 / 43.880 / 2.774 %
Active Service Latency:                 40.671 / 503.137 / 234.919 sec
Active Service Execution Time:          0.003 / 24.737 / 2.527 sec
Active Service State Change:            0.000 / 43.880 / 2.774 %
Active Services Last 1/5/15/60 min:     0 / 2897 / 35932 / 68206
Passive Service Latency:                0.000 / 0.000 / 0.000 sec
Passive Service State Change:           0.000 / 0.000 / 0.000 %
Passive Services Last 1/5/15/60 min:    0 / 0 / 0 / 0
Services Ok/Warn/Unk/Crit:              46943 / 56 / 7660 / 13547
Services Flapping:                      980
Services In Downtime:                   0

Total Hosts:                            34103
Hosts Checked:                          34103
Hosts Scheduled:                        34103
Hosts Actively Checked:                 34103
Host Passively Checked:                 0
Total Host State Change:                0.000 / 63.820 / 2.598 %
Active Host Latency:                    0.000 / 474.337 / 247.944 sec
Active Host Execution Time:             0.000 / 20.354 / 2.033 sec
Active Host State Change:               0.000 / 63.820 / 2.598 %
Active Hosts Last 1/5/15/60 min:        0 / 5936 / 29437 / 34103
Passive Host Latency:                   0.000 / 0.000 / 0.000 sec
Passive Host State Change:              0.000 / 0.000 / 0.000 %
Passive Hosts Last 1/5/15/60 min:       0 / 0 / 0 / 0
Hosts Up/Down/Unreach:                  23591 / 10512 / 0
Hosts Flapping:                         597
Hosts In Downtime:                      0

Active Host Checks Last 1/5/15 min:     3 / 89 / 209
   Scheduled:                           0 / 0 / 0
   On-demand:                           3 / 89 / 209
   Parallel:                            0 / 0 / 0
   Serial:                              0 / 0 / 0
   Cached:                              3 / 89 / 209
Passive Host Checks Last 1/5/15 min:    0 / 0 / 0
Active Service Checks Last 1/5/15 min:  0 / 0 / 0
   Scheduled:                           0 / 0 / 0
   On-demand:                           0 / 0 / 0
   Cached:                              0 / 0 / 0
Passive Service Checks Last 1/5/15 min: 0 / 0 / 0

External Commands Last 1/5/15 min:      0 / 0 / 0
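
[Editor's note] The "100k checks / 15 min" estimate can be checked against the figures in the nagiostats output above. The following is a minimal sketch, not part of the original message. It assumes interval_length is left at the nagios.cfg default of 60 seconds, and it treats the "Last 15 min" counters (objects checked at least once in that window) as a rough proxy for completed checks. The function and variable names are illustrative only.

```python
# Figures copied from the nagiostats output above.
TOTAL_SERVICES = 68206
TOTAL_HOSTS = 34103
SERVICES_CHECKED_LAST_15_MIN = 35932  # "Active Services Last 1/5/15/60 min"
HOSTS_CHECKED_LAST_15_MIN = 29437     # "Active Hosts Last 1/5/15/60 min"

CHECK_INTERVAL = 15   # from templates.cfg
INTERVAL_LENGTH = 60  # ASSUMPTION: nagios.cfg default (seconds per interval unit)


def checks_per_second(checks: int, window_seconds: int) -> float:
    """Average check rate over a window of the given length."""
    return checks / window_seconds


window = CHECK_INTERVAL * INTERVAL_LENGTH  # 900 seconds

# Rate needed to check every host and service once per interval.
required = checks_per_second(TOTAL_SERVICES + TOTAL_HOSTS, window)

# Rate implied by the objects that were actually checked in the last 15 min.
observed = checks_per_second(
    SERVICES_CHECKED_LAST_15_MIN + HOSTS_CHECKED_LAST_15_MIN, window
)

print(f"required: {required:.1f} checks/sec")
print(f"observed: {observed:.1f} checks/sec")
print(f"share of objects checked within one interval: {observed / required:.0%}")
```

Under these assumptions the required rate is about 114 checks/sec, while only about 64% of all objects were checked within the last interval, which is consistent with the high latency figures reported.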

On Tue, Aug 23, 2011 at 6:14 PM, Sven Nierlein <Sven.Nierlein at consol.de> wrote:
On 8/23/11 22:21, Rodney Ramos wrote:
> When I changed max_concurrent_checks from "0" to "200", the nagios process fell to 30-50%. However, the latency increased a lot, going to more than 1000 sec!!
Which means you usually have more than 200 concurrent checks, maybe 400-500. When I compare that to your initial mail, which mentioned 60k services + 30k hosts at a 15-minute interval, I get only 100 checks/second. Are you sure about the 15-minute interval? How many checks do you have per second? Did you change your interval_length?

 Sven
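
[Editor's note] The claim that more than 200 checks are usually in flight can be sanity-checked with Little's law (average concurrency = arrival rate x average time in system). This is a rough sketch, not something from the thread. It uses the totals and average execution times from the nagiostats output quoted in the reply, assumes a 900-second interval (check_interval 15 with the default interval_length of 60), and ignores bursts, retries, and Gearman queueing time, so the real peak may well be higher.

```python
# Figures from the nagiostats output in the reply (average execution times
# are the third value of the min / max / avg triples).
TOTAL_SERVICES = 68206
TOTAL_HOSTS = 34103
AVG_SERVICE_EXEC_SEC = 2.527
AVG_HOST_EXEC_SEC = 2.033

INTERVAL_SEC = 15 * 60           # ASSUMPTION: interval_length left at 60
MAX_CONCURRENT_CHECKS = 200      # the limit that was tried


def avg_in_flight(checks: int, interval_sec: float, exec_sec: float) -> float:
    """Little's law: L = lambda * W, with lambda = checks / interval."""
    return (checks / interval_sec) * exec_sec


services_in_flight = avg_in_flight(TOTAL_SERVICES, INTERVAL_SEC, AVG_SERVICE_EXEC_SEC)
hosts_in_flight = avg_in_flight(TOTAL_HOSTS, INTERVAL_SEC, AVG_HOST_EXEC_SEC)
total_in_flight = services_in_flight + hosts_in_flight

print(f"services in flight (avg): {services_in_flight:.0f}")
print(f"hosts in flight (avg):    {hosts_in_flight:.0f}")
print(f"total in flight (avg):    {total_in_flight:.0f}")
print(f"exceeds limit of {MAX_CONCURRENT_CHECKS}: {total_in_flight > MAX_CONCURRENT_CHECKS}")
```

Even this steady-state average comes out above 200, so a cap of 200 concurrent checks would be expected to make the backlog, and therefore the latency, grow.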

------------------------------------------------------------------------------
EMC VNX: the world's simplest storage, starting under $10K
The only unified storage solution that offers unified management
Up to 160% more powerful than alternatives and 25% more efficient.
Guaranteed. http://p.sf.net/sfu/emc-vnx-dev2dev
_______________________________________________
Nagios-devel mailing list
Nagios-devel at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-devel

