Average Check latency and execution time growth - 3.2.3

Max Schubert maxs at webwizarddesign.com
Sat Oct 8 17:19:29 CEST 2011


Which RHEL minor release are you running?  We had one poller on RHEL
5.3, a Compaq / AMD-based host, whose latency increased constantly.
None of the optimizations or configuration changes we made to the
other pollers we ran at the time seemed to help this one.  We updated
the poller in-box from 5.3 to 5.4 and voila - issue gone.

As Joerge mentioned, it was probably a memory leak or bug in a library
the parent Nagios poller process was using.  We never determined which
one, and we haven't hit that issue since then with any 5.4 or 5.5
pollers.

Even with stable software we end up bouncing our pollers every 2-3
days: 1) because we have an active customer base that makes config
changes often, and 2) because we take the metrics from the checks and
put them in a time series data warehouse that is sensitive to interval
skew.  Any poller that hits 10 seconds of latency has to be bounced.
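
For what it's worth, here is a rough sketch of how the 10-second rule
could be checked automatically.  It is not our actual tooling.  It
assumes nagiostats is on the PATH and that the AVGACTSVCLAT variable
(average active service check latency, in milliseconds) is the number
you care about; the function names and the choice of variable are mine.
It only reports whether a bounce is due and leaves the restart itself
to whatever you normally use.

```python
import subprocess

# Assumption: the 10-second limit from above, expressed in milliseconds,
# because nagiostats reports latency variables in ms.
LATENCY_LIMIT_MS = 10_000


def parse_latency_ms(mrtg_output: str) -> int:
    """Return the first purely numeric line of nagiostats MRTG output."""
    for line in mrtg_output.splitlines():
        line = line.strip()
        if line.isdigit():
            return int(line)
    raise ValueError("no numeric latency value found")


def needs_bounce(latency_ms: int, limit_ms: int = LATENCY_LIMIT_MS) -> bool:
    """True once the poller has reached the latency limit."""
    return latency_ms >= limit_ms


def read_avg_service_latency_ms(nagiostats: str = "nagiostats") -> int:
    """Ask nagiostats for the average active service check latency.

    The binary name/path is an assumption; adjust it for your install.
    """
    result = subprocess.run(
        [nagiostats, "--mrtg", "--data=AVGACTSVCLAT"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_latency_ms(result.stdout)
```

Run from cron, something like
`needs_bounce(read_avg_service_latency_ms())` would tell you when a
poller has crossed the line.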

We are at 12 pollers or so right now and we will be up to almost 20 by
next year at this time.

Max

On 10/2/11, Stuart Browne <stuart.browne at ausregistry.com.au> wrote:
> Hi,
>
> I know this topic has been covered many times, but I've tried those tweaks
> and I have the remaining issue.
>
> After a few days, the latency on checks explodes.  It goes along quite
> happily with small values, then after (about) 3 days, the values rise quite
> sharply.  I've recently been graphing performance statistics (nagiostats,
> mrtg) and as you can see by the two attachments (day, week), it's rather
> surprising.
>
> We restart Nagios every few days (for other reasons) so thankfully the issue
> never gets completely out of control, but as you can see, it gets a bit
> crazy.
>
> I can't think of any combination of settings that would cause such growth
> after such a long period of time.  Does anybody have any knowledge as to why
> it would suddenly increase after running for days without issue?
>
> Basic Nagios system stats:
> 	2 x dual-core Xeon 5160 (3 GHz)
> 	6GB Memory
> 	4 x SAS, RAID1 (hardware, BBU, LVM over RAID1)
> 	RHEL5, fully patched
> 	Load average between 0.5 and 3.2
>
> 'nagios -s /etc/nagios/nagios.cfg' output (trimmed):
>
> HOST SCHEDULING INFORMATION
> ---------------------------
> Total hosts:                     252
> Total scheduled hosts:           252
> Host inter-check delay method:   SMART
> Average host check interval:     300.00 sec
> Host inter-check delay:          1.19 sec
> Max host check spread:           30 min
> First scheduled check:           Mon Oct  3 14:31:17 2011
> Last scheduled check:            Mon Oct  3 14:36:15 2011
>
>
> SERVICE SCHEDULING INFORMATION
> -------------------------------
> Total services:                     1575
> Total scheduled services:           1386
> Service inter-check delay method:   SMART
> Average service check interval:     878.40 sec
> Inter-check delay:                  0.63 sec
> Interleave factor method:           SMART
> Average services per host:          6.25
> Service interleave factor:          6
> Max service check spread:           30 min
> First scheduled check:              Mon Oct  3 14:33:43 2011
> Last scheduled check:               Mon Oct  3 14:48:21 2011
>
> CHECK PROCESSING INFORMATION
> ----------------------------
> Check result reaper interval:       5 sec
> Max concurrent service checks:      Unlimited
>
>
> PERFORMANCE SUGGESTIONS
> -----------------------
> I have no suggestions - things look okay.
>
> Stuart J. Browne
> Senior Linux Administrator
>
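
As a sanity check on the scheduling output quoted above: the SMART
inter-check delay figures are consistent with dividing the average
check interval by the number of scheduled checks.  The small sketch
below only redoes that arithmetic with the quoted numbers.  The helper
name is made up for illustration and this is not Nagios code.

```python
def smart_inter_check_delay(avg_interval_sec: float, scheduled_checks: int) -> float:
    """Average check interval spread evenly over the scheduled checks."""
    return avg_interval_sec / scheduled_checks


# Figures taken from the 'nagios -s' output quoted above.
host_delay = smart_inter_check_delay(300.00, 252)
service_delay = smart_inter_check_delay(878.40, 1386)

print(f"host inter-check delay:    {host_delay:.2f} sec")     # 1.19
print(f"service inter-check delay: {service_delay:.2f} sec")  # 0.63
```

Both values match the reported 1.19 sec and 0.63 sec, so the initial
schedule itself looks evenly spread.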
