Nagios 2.6 still not draining command pipe fast enough (update with nagios 2.7)

John P. Rouillard rouilj+nagiosdev at cs.umb.edu
Wed Feb 28 21:11:01 CET 2007


In message <45DB09B8.2010609 at nagios.org>,
Ethan Galstad writes:

>John P. Rouillard wrote:
>> In message <45D9C0C6.8030204 at nagios.org>,
>> Ethan Galstad writes:
>> 
>>> John P. Rouillard wrote:
>>>> Hi all:
>>>>
>>>> I am trying to get my external correlation engine working with nagios
>>>> 2.x <http://www.cs.umb.edu/~rouilj/#secnagios>, and I just can't get
>>>> nagios to drain the command pipe fast enough. I see approx. 5% failure
>>>> rate on writing to the command pipe with an EAGAIN error.
>>>>
>>>> I have increased:
>>>>
>>>>   nagios.h:#define COMMAND_BUFFER_SLOTS              20480
>>>>   nagios.h:#define SERVICE_BUFFER_SLOTS             20480
>>>>
>>>> from the original 1024. Increasing the settings from 10240 to 20480
>>>> may give a slight decrease (maybe .5%), but I think I just want to see
>>>> it; I don't think it's statistically significant.
>>> John -  Does this problem still occur with Nagios 2.7 or the latest 2.x 
>>> CVS code?  A separate command file worker thread should be reading 
>>> entries from the external command file as fast as it can read them (as 
>>> long as there are free buffer slots).
>>>
>>> If there aren't any external commands, the thread waits 0.5 seconds 
>>> before checking for new commands in the file.  If you have occasional 
>>> bursts of check results, this could be too long to wait.  You could try 
>>> experimenting with decreasing the 0.5 second delay.  Around line 4948 of 
>>> base/utils.c, you'll find...
>>>
>>> /* wait a bit */
>>> tv.tv_sec=0;
>>> tv.tv_usec=500000;
>>> select(0,NULL,NULL,NULL,&tv);
>>>
>>> You could try decreasing the value of tv.tv_usec to 100000 (0.1 seconds) 
>>> and see if that helps at all.

I installed Nagios 2.7 last Thursday. The EAGAIN failure rate on writes to
the command pipe has dropped from 5% to somewhere in the neighborhood of
.7%, but that may not be the stable point: it is still growing, and it was
.5% a couple of days ago. I haven't tried changing the sleep time mentioned
above yet, because I am also seeing a dramatic increase in average latency.
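
For reference, the failure mode in question looks roughly like the sketch
below: a non-blocking open of the external command pipe followed by a
write, counting the writes that fail with EAGAIN. This is only an
illustration, not my actual correlator code; the pipe path, host name, and
loop count are made up:

  /* Rough illustration only: open the external command pipe non-blocking,
   * write one passive check result, and count how often the write fails
   * with EAGAIN because the pipe buffer is full. */
  #include <errno.h>
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <time.h>
  #include <unistd.h>

  int main(void)
  {
      /* assumed pipe location; adjust to the local command_file setting */
      const char *pipe_path = "/var/log/nagios/rw/nagios.cmd";
      char cmd[512];
      long attempts = 0, eagain = 0;
      int i;

      for (i = 0; i < 1000; i++) {
          int fd = open(pipe_path, O_WRONLY | O_NONBLOCK);
          if (fd < 0)
              continue;           /* pipe missing or no reader yet */

          snprintf(cmd, sizeof(cmd),
                   "[%ld] PROCESS_SERVICE_CHECK_RESULT;somehost;SecAliveCheck;0;alive\n",
                   (long)time(NULL));

          attempts++;
          if (write(fd, cmd, strlen(cmd)) < 0 && errno == EAGAIN)
              eagain++;           /* Nagios isn't draining the pipe fast enough */
          close(fd);
      }

      printf("EAGAIN on %ld of %ld writes (%.2f%%)\n", eagain, attempts,
             attempts ? 100.0 * eagain / attempts : 0.0);
      return 0;
  }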

I am now seeing average latency in the 20 second range, rather than the 1
second I was getting with my Nagios 2.6 install. What is odd is that the
GUI is showing:

   Check Latency: 0.00 sec 109.37 sec 34.685 sec

which doesn't agree with what nagiostats reports. The max latency is
understandable, as we have been having some network drops, but even in a
freshly started Nagios with no network issues the latency is in the same
range after a couple of hours. A 5-day-old Nagios process was reporting
the following from nagiostats:

  Nagios Stats 2.7
  Copyright (c) 2003-2007 Ethan Galstad (www.nagios.org)
  Last Modified: 01-19-2007
  License: GPL

  CURRENT STATUS DATA
  ----------------------------------------------------
  Status File:                          /var/log/nagios/status.dat
  Status File Age:                      0d 0h 0m 1s
  Status File Version:                  2.7

  Program Running Time:                 5d 21h 28m 58s
  Nagios PID:                           29914
  Used/High/Total Command Buffers:      0 / 45 / 4096
  Used/High/Total Check Result Buffers: 96 / 441 / 4096

  Total Services:                       1876
  Services Checked:                     1696
  Services Scheduled:                   1627
  Active Service Checks:                1692
  Passive Service Checks:               184
  Total Service State Change:           0.000 / 73.420 / 2.913 %
  Active Service Latency:               0.000 / 90.954 / 19.948 sec
  Active Service Execution Time:        0.000 / 55.244 / 4.032 sec
  Active Service State Change:          0.000 / 73.420 / 3.188 %
  Active Services Last 1/5/15/60 min:   870 / 1353 / 1414 / 1450
  Passive Service State Change:         0.000 / 16.780 / 0.381 %
  Passive Services Last 1/5/15/60 min:  123 / 175 / 176 / 177
  Services Ok/Warn/Unk/Crit:            1400 / 24 / 274 / 178
  Services Flapping:                    0
  Services In Downtime:                 0

  Total Hosts:                          118
  Hosts Checked:                        118
  Hosts Scheduled:                      0
  Active Host Checks:                   118
  Passive Host Checks:                  0
  Total Host State Change:              0.000 / 57.630 / 3.628 %
  Active Host Latency:                  0.000 / 0.000 / 0.000 sec
  Active Host Execution Time:           0.016 / 3.029 / 0.532 sec
  Active Host State Change:             0.000 / 57.630 / 3.628 %
  Active Hosts Last 1/5/15/60 min:      42 / 56 / 60 / 64
  Passive Host State Change:            0.000 / 0.000 / 0.000 %
  Passive Hosts Last 1/5/15/60 min:     0 / 0 / 0 / 0
  Hosts Up/Down/Unreach:                96 / 22 / 0
  Hosts Flapping:                       0
  Hosts In Downtime:                    0


From these stats it doesn't look like I am exceeding the ring buffers.
top shows the nagios process using only a few percent of the CPU; it is
not running at 100% by any means. A sample from a restarted nagios
(running for 4 hours and 38 minutes) is:
 
  top - 19:55:41 up 153 days, 20:57,  3 users,  load average: 0.66, 1.03, 1.11
  Tasks:  84 total,   1 running,  82 sleeping,   1 stopped,   0 zombie
  Cpu0  :  1.7% us,  1.0% sy,  0.0% ni, 96.3% id,  1.0% wa,  0.0% hi,  0.0% si
  Cpu1  :  0.0% us,  0.3% sy,  0.0% ni, 99.3% id,  0.3% wa,  0.0% hi,  0.0% si
  Mem:   4151276k total,  3064692k used,  1086584k free,   153636k buffers
  Swap:  8191992k total,      328k used,  8191664k free,  2779684k cached

    PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  18076 nagios    17   0 31748 7700  724 S    2  0.2   9:37.20 nagios


where 18076 is the main nagios process at this point (I restarted it to
see if the latency would creep back up to 30 seconds; sadly, I forgot to
measure the original 5+ day nagios). So I claim that the nagios process
has plenty of cycles available to process the increased number of passive
checks before it should start bogging down and falling behind. Also, is
there any way to tell what the command pipe thread's PID is (under Linux)?
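
One way to at least see the thread IDs, assuming a 2.6-series kernel with
NPTL threading (an assumption; the top output above doesn't show the
kernel version), is that each thread of the nagios process appears as a
directory under /proc/<pid>/task. A minimal sketch that just lists those
thread IDs follows; it can't by itself say which one is the command file
worker thread:

  #include <dirent.h>
  #include <stdio.h>

  /* List the thread IDs of a process by reading /proc/<pid>/task.
   * Usage: ./lsthreads <pid>   (e.g. the nagios PID shown by top) */
  int main(int argc, char **argv)
  {
      char path[64];
      struct dirent *e;
      DIR *d;

      snprintf(path, sizeof(path), "/proc/%s/task",
               argc > 1 ? argv[1] : "self");
      d = opendir(path);
      if (d == NULL) {
          perror(path);
          return 1;
      }
      while ((e = readdir(d)) != NULL)
          if (e->d_name[0] != '.')
              printf("thread id: %s\n", e->d_name);
      closedir(d);
      return 0;
  }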

I believe that the scheduling really is falling behind as I have two
services defined:

  SecReport - active service, runs every minute

  SecAliveCheck - passive service, receives output from SecReport
                  via external correlator (SEC). Has a 2
                  minute stale timer.
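
For concreteness, the two services might look roughly like the sketch
below. Only the one minute check interval and the two minute freshness
threshold come from the description above; the host name, check commands,
and the remaining directives are placeholders, not my actual configuration:

  define service{
          host_name               monhost          ; placeholder host
          service_description     SecReport
          check_command           check_secreport  ; placeholder command
          normal_check_interval   1                ; active check every minute
          retry_check_interval    1
          max_check_attempts      1
          active_checks_enabled   1
          check_period            24x7
          notification_interval   0
          notification_period     24x7
          contact_groups          admins
          }

  define service{
          host_name               monhost
          service_description     SecAliveCheck
          active_checks_enabled   0                ; results arrive passively via SEC
          passive_checks_enabled  1
          check_freshness         1
          freshness_threshold     120              ; the 2 minute stale timer
          check_command           check_dummy!2    ; forced when the result goes stale
          normal_check_interval   1
          retry_check_interval    1
          max_check_attempts      1
          check_period            24x7
          notification_interval   0
          notification_period     24x7
          contact_groups          admins
          }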

I am seeing a lot of stale checks being forced on SecAliveCheck. I
have added some additional rules to the SEC ruleset to detect and try
to characterize this.

So, does anybody else see higher latency using 2.7 compared to earlier
versions? Would changing the sleep time affect things (I can't see how it
would, but...)?

				-- rouilj
John Rouillard
===========================================================================
My employers don't acknowledge my existence much less my opinions.

