High check latency in a machine with low load

Javier Vela Diago jvela at s2grupo.es
Tue Oct 11 16:50:51 CEST 2011


Thank you for the advice, but due to some problems in the past, I already 
have the MySQL database on another machine with 2 CPUs and 2 GB of RAM. 

Also, because of the problems I suffered, I have a script that optimizes 
and repairs the ndoutils database every night. My goal now is to change 
the engine from MyISAM to InnoDB and apply some tuning to the database. 
I want to change the engine because, when problems start with MyISAM, I 
have to truncate the database because OPTIMIZE hangs, whereas with InnoDB 
it has worked fine in the tests I've made.
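
As a rough sketch of the engine change, the conversion statements could be generated like this. The two table names are only illustrative assumptions (the real list should come from SHOW TABLES on the ndoutils database), the helper name is hypothetical, and the script only prints SQL without connecting to any server.

```python
# Sketch: build the SQL needed to convert MyISAM tables to InnoDB.
# The table names used below are illustrative assumptions, not a
# complete list of the ndoutils schema.

def engine_change_statements(tables, engine="InnoDB"):
    """Return one ALTER TABLE statement per table name."""
    statements = []
    for table in tables:
        # Backticks guard against names that clash with SQL keywords.
        statements.append("ALTER TABLE `%s` ENGINE=%s;" % (table, engine))
    return statements


if __name__ == "__main__":
    example_tables = ["nagios_servicechecks", "nagios_hostchecks"]
    for statement in engine_change_statements(example_tables):
        print(statement)
```

The printed statements can be reviewed first and then run on a copy of the database, since converting a large table may take a while.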

Javi



From:    Mike Guthrie <mguthrie at nagios.com>
To:      Nagios Users List <nagios-users at lists.sourceforge.net>
Date:    11/10/2011 16:39
Subject: Re: [Nagios-users] High check latency in a machine with low load



If ndoutils starts to create a heavy burden on the system you can also 
offload ndoutils/mysql to a second machine.  We wrote the below document 
for Nagios XI, but the doc has the info you'd need to make it work for 
Nagios Core as well. 

http://library.nagios.com/library/products/nagiosxi/documentation/462-offloading-mysql-to-remote-server




Javier Vela Diago wrote:
> I have a lot of custom checks, written mostly in Perl and bash and some 
> in Python, and some of them take a lot of time.
>
> Never mind, I think I found the solution, or at least part of it. I 
> set use_large_installation_tweaks to 1. Six months ago this option 
> almost crashed my system, so I discarded it. Now, with bigger 
> problems, it was the last thing I wanted to test, but this afternoon 
> I finally tested it.
>
> When I restarted Nagios, the load started to grow until it reached 6-8, 
> and the latency problems disappeared. I was sceptical about the 
> usefulness of this option, but when the load changes from 2.5 to 6, it 
> means that the machine is doing a lot of work that it wasn't doing 
> before.
>
> Now the problem is that NDOUtils is causing some latency because of 
> MySQL, but at least I know what to optimize. Any tips will be 
> appreciated :)
>
> Thank you, and sorry for taking your time.
>
>
> From:    Daniel Wittenberg <daniel.wittenberg.r0ko at statefarm.com>
> To:      Nagios Users List <nagios-users at lists.sourceforge.net>
> Date:    11/10/2011 16:02
> Subject: Re: [Nagios-users] High check latency in a machine with 
> low load
> ------------------------------------------------------------------------
>
>
>
> I think you have the enable_high_latency option enabled :)  j/k
> 
> Do you have any particular checks that are taking a long time?  i.e. 
> can you watch top and see checks taking a while?
> 
> Dan
> 
> 
> From: Javier Vela Diago [mailto:jvela at s2grupo.es]
> Sent: Tuesday, October 11, 2011 6:23 AM
> To: nagios-users at lists.sourceforge.net
> Subject: [Nagios-users] High check latency in a machine with low load
> 
> Hi,
>
> I have a Nagios 3.2.3 deployment with 1000+ hosts and 3000+ services. 
> This Nagios runs together with NDO and PNP (in bulk mode) on a server 
> with 4 GB of RAM and 4 CPUs.
>
> One day I realized that the check latency shown in the performance CGI 
> was very high (300-400 seconds). That seemed very strange, so I took 
> the tuning guide from Nagios 
> (http://nagios.sourceforge.net/docs/3_0/tuning.html) and applied all 
> the points I could. In particular, I set max_concurrent_checks 
> to zero (no limit):
>
> max_concurrent_checks=0
>
> I also adjusted the check result reaper settings:
>
> service_reaper_frequency=5
> max_check_result_reaper_time=15
>
> and checked that host checks were not forced. In addition, I 
> configured a 15-second host check cache horizon:
>
> cached_host_check_horizon=15
>
> But the problem remains, and the load on the server is not very high: 
> a load of 2.5, 2 GB of free memory and an average disk utilization of 
> 7%. I disabled NDO and PNP, but it made no difference. After the first 
> round of checks the latency returns, while the server load doesn't grow.
>
> I have searched on Google, but all the problems I found were caused by 
> server load, which is not the main problem here. So my question is: 
> what can I do now? Is there some variable that shows me where to look? 
> I'm a bit lost right now and I don't know how to find the problem.
>
> Or maybe the only way is to configure a master-slave Nagios setup in 
> order to maximize server utilization?
>
> In addition, I have pretty long timeouts (60 seconds) because of the 
> high latency on the network. All your help is appreciated. Thank you 
> in advance.
>
> nagiostats
> Nagios Stats 3.2.3
> Copyright (c) 2003-2008 Ethan Galstad (www.nagios.org)
> Last Modified: 10-03-2010
> License: GPL
>
> CURRENT STATUS DATA
> ------------------------------------------------------
> Status File:                            /usr/local/argos/aplicaciones/nagios/var/status.dat
> Status File Age:                        0d 0h 0m 11s
> Status File Version:                    3.2.3
>
> Program Running Time:                   0d 20h 56m 7s
> Nagios PID:                             21834
> Used/High/Total Command Buffers:        0 / 0 / 4096
>
> Total Services:                         4032
> Services Checked:                       4032
> Services Scheduled:                     4030
> Services Actively Checked:              4032
> Services Passively Checked:             0
> Total Service State Change:             0.000 / 37.300 / 0.163 %
> Active Service Latency:                 32.876 / 442.138 / 415.816 sec
> Active Service Execution Time:          0.051 / 60.097 / 1.545 sec
> Active Service State Change:            0.000 / 37.300 / 0.163 %
> Active Services Last 1/5/15/60 min:     237 / 1530 / 4020 / 4020
> Passive Service Latency:                0.000 / 0.000 / 0.000 sec
> Passive Service State Change:           0.000 / 0.000 / 0.000 %
> Passive Services Last 1/5/15/60 min:    0 / 0 / 0 / 0
> Services Ok/Warn/Unk/Crit:              3766 / 38 / 44 / 184
> Services Flapping:                      0
> Services In Downtime:                   0
>
> Total Hosts:                            931
> Hosts Checked:                          931
> Hosts Scheduled:                        931
> Hosts Actively Checked:                 931
> Host Passively Checked:                 0
> Total Host State Change:                0.000 / 12.370 / 0.077 %
> Active Host Latency:                    0.000 / 441.308 / 416.063 sec
> Active Host Execution Time:             0.062 / 10.113 / 0.395 sec
> Active Host State Change:               0.000 / 12.370 / 0.077 %
> Active Hosts Last 1/5/15/60 min:        74 / 423 / 931 / 931
> Passive Host Latency:                   0.000 / 0.000 / 0.000 sec
> Passive Host State Change:              0.000 / 0.000 / 0.000 %
> Passive Hosts Last 1/5/15/60 min:       0 / 0 / 0 / 0
> Hosts Up/Down/Unreach:                  897 / 24 / 10
> Hosts Flapping:                         0
> Hosts In Downtime:                      1
>
> Active Host Checks Last 1/5/15 min:     109 / 535 / 1583
>   Scheduled:                           87 / 433 / 1300
>   On-demand:                           22 / 102 / 283
>   Parallel:                            87 / 438 / 1323
>   Serial:                              0 / 0 / 0
>   Cached:                              22 / 97 / 260
> Passive Host Checks Last 1/5/15 min:    0 / 0 / 0
> Active Service Checks Last 1/5/15 min:  304 / 1605 / 4924
>   Scheduled:                           304 / 1605 / 4923
>   On-demand:                           0 / 0 / 1
>   Cached:                              0 / 0 / 0
> Passive Service Checks Last 1/5/15 min: 0 / 0 / 0
>
> External Commands Last 1/5/15 min:      0 / 0 / 0
>
> nagios -s
>
> Nagios Core 3.2.3
> Copyright (c) 2009-2010 Nagios Core Development Team and Community 
> Contributors
> Copyright (c) 1999-2009 Ethan Galstad
> Last Modified: 10-03-2010
> License: GPL
>
> Website: http://www.nagios.org
> Warning: aggregate_status_updates directive ignored.  All status file 
> updates are now aggregated.
> Warning: downtime_file variable ignored.  Downtime entries are now 
> stored in the status and retention files.
> Warning: comment_file variable ignored.  Comments are now stored in 
> the status and retention files.
> Timing information on object configuration processing is listed
> below.  You can use this information to see if precaching your
> object configuration would be useful.
>
> Object Config Source: Config files (uncached)
>
> OBJECT CONFIG PROCESSING TIMES      (* = Potential for precache 
> savings with -u option)
> ----------------------------------
> Read:                 0.080036 sec
> Resolve:              0.010660 sec  *
> Recomb Contactgroups: 0.002666 sec  *
> Recomb Hostgroups:    0.004086 sec  *
> Dup Services:         0.034632 sec  *
> Recomb Servicegroups: 0.001277 sec  *
> Duplicate:            0.010939 sec  *
> Inherit:              0.005594 sec  *
> Recomb Contacts:      0.000001 sec  *
> Sort:                 0.000000 sec  *
> Register:             0.074413 sec
> Free:                 0.008730 sec
>                      ============
> TOTAL:                0.234920 sec  * = 0.071741 sec (30.54%) 
> estimated savings
>
>
> RETENTION DATA TIMES
> ----------------------------------
> Read and Process:     0.495480 sec
>                      ============
> TOTAL:                0.495480 sec
>
>
> Timing information on configuration verification is listed below.
>
> CONFIG VERIFICATION TIMES          (* = Potential for speedup with -x 
> option)
> ----------------------------------
> Object Relationships: 0.060039 sec
> Circular Paths:       0.026557 sec  *
> Misc:                 0.005999 sec
>                      ============
> TOTAL:                0.092595 sec  * = 0.026557 sec (28.7%) estimated 
> savings
>
>
> EVENT SCHEDULING TIMES
> -------------------------------------
> Get service info:        0.014509 sec
> Get host info info:      0.002853 sec
> Get service params:      0.000078 sec
> Schedule service times:  0.039947 sec
> Schedule service events: 0.034656 sec
> Get host params:         0.000001 sec
> Schedule host times:     0.007519 sec
> Schedule host events:    0.029519 sec
>                         ============
> TOTAL:                   0.129082 sec
>
>
> Projected scheduling information for host and service checks
> is listed below.  This information assumes that you are going
> to start running Nagios with your current config files.
>
> HOST SCHEDULING INFORMATION
> ---------------------------
> Total hosts:                     931
> Total scheduled hosts:           931
> Host inter-check delay method:   SMART
> Average host check interval:     259.01 sec
> Host inter-check delay:          0.28 sec
> Max host check spread:           30 min
> First scheduled check:           Tue Oct 11 13:14:08 2011
> Last scheduled check:            Tue Oct 11 13:18:26 2011
>
>
> SERVICE SCHEDULING INFORMATION
> -------------------------------
> Total services:                     4032
> Total scheduled services:           4030
> Service inter-check delay method:   SMART
> Average service check interval:     299.55 sec
> Inter-check delay:                  0.07 sec
> Interleave factor method:           SMART
> Average services per host:          4.33
> Service interleave factor:          5
> Max service check spread:           30 min
> First scheduled check:              Tue Oct 11 13:15:07 2011
> Last scheduled check:               Tue Oct 11 13:20:07 2011
>
>
> CHECK PROCESSING INFORMATION
> ----------------------------
> Check result reaper interval:       5 sec
> Max concurrent service checks:      Unlimited
>
>
> PERFORMANCE SUGGESTIONS
> -----------------------
> I have no suggestions - things look okay.
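
One way to read the figures above is to compare the check rate the schedule needs with the rate actually achieved. The sketch below is only back-of-the-envelope arithmetic on numbers copied from the two outputs (4030 scheduled services, an average interval of 299.55 sec, and 4924 active service checks in the last 15 minutes); the function names are hypothetical, and the result suggests, but does not prove, where the bottleneck is.

```python
# Sketch: compare the check rate Nagios needs with the rate it achieves.
# All figures are copied from the nagiostats and "nagios -s" output above.

def required_rate(total_checks, avg_interval_sec):
    """Checks per second needed to run every check once per interval."""
    return total_checks / avg_interval_sec


def observed_rate(checks_done, window_min):
    """Checks per second actually executed during the window."""
    return checks_done / (window_min * 60.0)


if __name__ == "__main__":
    needed = required_rate(4030, 299.55)  # scheduled services, avg interval
    actual = observed_rate(4924, 15)      # active service checks, last 15 min
    print("needed: %.2f checks/sec" % needed)
    print("actual: %.2f checks/sec" % actual)
    print("ratio:  %.2f" % (actual / needed))
```

With these inputs the schedule needs roughly 13.5 service checks per second while only about 5.5 per second are being run, which fits a picture where checks queue up even though the machine itself is mostly idle.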
> -- 
> Javier Vela Diago
> S2 GRUPO
> Ramiro de Maeztu, 7 bajo. 46022 Valencia
> Tel: 963.110.300 Fax: 963.106.086
> e-mail: jvela at s2grupo dot es
> http://www.s2grupo.es
> ------------------------------------------------------------------------------
> All the data continuously generated in your IT infrastructure contains a
> definitive record of customers, application performance, security
> threats, fraudulent activity and more. Splunk takes this data and makes
> sense of it. Business sense. IT sense. Common sense.
> http://p.sf.net/sfu/splunk-d2d-oct
> _______________________________________________
> Nagios-users mailing list
> Nagios-users at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-users
> ::: Please include Nagios version, plugin version (-v) and OS when 
> reporting any issue.
> ::: Messages without supporting info will risk being sent to /dev/null


-- 


Mike Guthrie
Technical Team
___
Nagios Enterprises, LLC
Email:  mguthrie at nagios.com
Web:    www.nagios.com




