high latency

Andreas Ericsson ae at op5.se
Tue Dec 7 15:43:48 CET 2010


On 12/06/2010 09:12 PM, Frost, Mark {PBC} wrote:
> 
>> -----Original Message-----
>> From: Andreas Ericsson [mailto:ae at op5.se]
>> Sent: Monday, December 06, 2010 6:06 AM
>> To: Nagios Users List
>> Cc: Frost, Mark {PBC}
>> Subject: Re: [Nagios-users] high latency
>>
>> On 12/03/2010 08:14 PM, Frost, Mark {PBC} wrote:
>>>
>>> I too struggle with them and I'm running on lightly-loaded physical hardware.
>>> We have 2 servers doing the checks sending back to a central server.  Both
>>> distributed nodes use ocsp/ochp, but they do nothing more than append results
>>> to a file (i.e. it exits quickly).  Results are handled outside of Nagios.
>>>
>>
>> Try getting rid of the oc[sh]p commands and use Merlin or google for "pnsca" or
>> "persistent nsca". There's one available from op5's repositories that may or may
>> not work, and there's one from somewhere else that they're apparently using to
>> great effect.
>>
>> Even if it exits quickly, it's still executed serially, so checking halts a
>> small period of time for each and every check that runs.
> 
> Hmm.  So then I'd be curious why the 2 distservers, which are both using
> oc[sh]p commands the same way, have such radically different latencies.
> 

Agreed. There must be other differences too. Perhaps there's trouble resolving
hostnames from one of the nodes? That usually makes checks run a helluva lot
longer than they normally have to.
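
A rough way to test the name-resolution theory is to time lookups on each
node and compare. The sketch below is only an illustration: the function name
time_lookups and the example hostnames are made up, and it goes through the
system resolver via socket.getaddrinfo(), which may or may not be the path
your plugins take.

```python
import socket
import time


def time_lookups(hostnames, resolve=socket.getaddrinfo):
    """Return {hostname: (seconds_taken, error_text_or_None)}.

    `resolve` is injectable so the function can be exercised without a network;
    by default it uses the system resolver.
    """
    results = {}
    for name in hostnames:
        start = time.perf_counter()
        try:
            resolve(name, None)
            error = None
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            error = str(exc)
        results[name] = (time.perf_counter() - start, error)
    return results


if __name__ == "__main__":
    # Hypothetical names; substitute hosts that the slow node actually checks.
    for host, (seconds, error) in time_lookups(["localhost", "example.com"]).items():
        status = error or "ok"
        print(f"{host:30s} {seconds * 1000:8.1f} ms  {status}")
```

Run it on both distservers with the same host list. Lookups that take seconds
on one node and milliseconds on the other would point at the resolver rather
than at Nagios.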

> Either way, you're suggesting that having a NEB module handle the
> post-check work will eliminate the serialization.
> 

Yes. Sneaking a peek at what's needed in order for an event to get sent to
master via an eventbroker compared to running an oc[sh]p command renders
this, more or less:

broker module (nagios halts while this happens):
  * Run a chain of 3-4 functions (increasing/decreasing stack size, pushing
    and popping registers etc).
  * Copy 500-1000 bytes of memory from the process to the kernel.

OC?P command:
  * fork() nagios, copying the complete stack and generating page tables for
    the heap (usually 1-2M).
  * possibly fork() again, redoing the last step, unless large_install_...
  * execve() the shell, loading a 4M binary and all its linked dependencies
    from disk. The kernel wipes the pages used by the fork()'ed and doubly
    fork()'ed Nagios and sets up new stack and heap tables for the shell.
  * shell parses command-line (this is quite quick though)
  * shell execve()'s the command you set as oc?p command, possibly searching
    through all files in all directories in your $PATH (which will be hot in
    the cache, but still), causing the kernel to once again destroy and set
    up all the memory tables.
  * The command opens a file and puts the testresult there, issuing an fsync()
    and thus waiting for data to actually hit the disk before returning.
  * The command exits, causing the kernel to destroy its allocated memory.
  * Nagios reaps the command and moves on.


In terms of effort, the difference is sort of like hopping on one leg along
the entire Great Wall of China (the oc?p command) versus walking to the kitchen
to grab a beer (the broker module).
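
To get a feel for the gap on your own hardware, the two paths can be
approximated in a few lines of Python. This is a rough sketch, not a
measurement of Nagios itself: the function names are invented for
illustration, a pipe stands in for the broker module's copy to the kernel,
and the shell command only appends a line without calling fsync(), so the
real oc?p cost is likely higher than what this reports. It needs /bin/sh.

```python
import os
import subprocess
import tempfile
import time


def time_in_process_send(iterations=200, payload_size=800):
    """Average seconds to copy one small payload from the process to the kernel."""
    read_fd, write_fd = os.pipe()
    payload = b"x" * payload_size
    try:
        start = time.perf_counter()
        for _ in range(iterations):
            os.write(write_fd, payload)     # userspace -> kernel copy
            os.read(read_fd, payload_size)  # drain so the pipe never fills up
        return (time.perf_counter() - start) / iterations
    finally:
        os.close(read_fd)
        os.close(write_fd)


def time_shell_command(result_file, iterations=200):
    """Average seconds to fork, exec a shell, and append one result line."""
    # $1 is the (made-up) result line, $2 the file to append to.
    command = ["/bin/sh", "-c", 'echo "$1" >> "$2"', "sh",
               "host;svc;0;OK", result_file]
    start = time.perf_counter()
    for _ in range(iterations):
        subprocess.run(command, check=True)  # waits for the child, like a serial oc?p
    return (time.perf_counter() - start) / iterations


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        shell_avg = time_shell_command(os.path.join(tmp, "results.txt"))
    pipe_avg = time_in_process_send()
    print(f"in-process copy: {pipe_avg * 1e6:10.1f} us per event")
    print(f"shell command:   {shell_avg * 1e6:10.1f} us per event")
```

Multiply the per-event shell figure by the number of checks per second to see
how much wall-clock time the serial oc?p step eats on each node.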

>>> What's odd is that distserver 1 and distserver 2 are configured the same
>>>
>>> distserver1:
>>> Hosts Checked:      675
>>> Services Checked:  4179
>>> Active Service Latency:         0.000 / 3.155 / 0.382 sec
>>> Active Service Execution Time:  0.000 / 60.038 / 0.145 sec
>>>
>>> distserver2:
>>> Hosts Checked:      261
>>> Services Checked:  4289
>>> Active Service Latency:         0.000 / 169.977 / 81.300 sec
>>> Active Service Execution Time:  0.000 / 15.270 / 0.211 sec
>>>
>>> yet as you can see, distserver2's latency is much higher and always has been.
>>> I tried turning off EPN yesterday on distserver2 and it had no discernable effect.
>>> We added 400 new service checks yesterday on distserver2 (just more of the same
>>> checks we already do but on 26 new hosts) and the latency went from 35 to over 80.
>>>
>>
>> What kind of checks are you running? Some plugins draw a lot of cpu.
>> Are any of the checks set to run in serial (grep for parallelize_check in your
>> objects.cache file).
> 
> parallelize_check is set to 1 everywhere.
> 
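
If a plain grep gets noisy, a short script can list only the services that
have parallelize_check set to 0. This is a sketch under assumptions: it
expects objects.cache to consist of "define service {" blocks with one
whitespace-separated key/value pair per line and a closing "}" on its own
line, and the path in the usage part is a guess that depends on your install.
The function name find_serial_services is made up.

```python
import re

SERVICE_START = re.compile(r"define\s+service\s*\{")


def find_serial_services(lines):
    """Return (host_name, service_description) for services with parallelize_check 0."""
    serial = []
    current = None  # attributes of the service block being read, if any
    for raw in lines:
        line = raw.strip()
        if SERVICE_START.match(line):
            current = {}
        elif line == "}" and current is not None:
            if current.get("parallelize_check") == "0":
                serial.append((current.get("host_name", "?"),
                               current.get("service_description", "?")))
            current = None
        elif current is not None and line:
            parts = line.split(None, 1)  # key, then the rest of the line as value
            if len(parts) == 2:
                current[parts[0]] = parts[1]
    return serial


if __name__ == "__main__":
    # Assumed location; adjust to wherever your objects.cache lives.
    with open("/usr/local/nagios/var/objects.cache") as cache:
        for host, service in find_serial_services(cache):
            print(f"{host}: {service}")
```

An empty result means no service is forced to run serially, which matches the
answer above.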

Does one server have a lot of random service failures? On-demand hostchecks are
still run in serial.

> Most things are NRPE checks (also NRPE to NSClient++).  Some are locally
> running perl scripts and others are locally running things like check_http.
> 

Shouldn't be all that much work for it though.

> 
>> What version of Nagios are you running?
>>
> 
> 3.2.1
> 

I take it upgrading makes no difference?

>>> The checks we do are very different (Windows, Linux, Unix, many are app-centric) so
>>> it's difficult to compare exactly what runs on distserver1 and distserver2, but given
>>> the jump we saw yesterday, I'm wondering whether it has something to do with the
>>> fact that the checks on these new hosts are all built on dependencies.  These hosts
>>> (Windows) have a basic check for NRPE and all other checks on the host are dependent
>>> on the NRPE check succeeding.
>>>
>>> I have to move to all new Nagios servers very soon.  I'm interested in Merlin, but
>>> given its non-production nature just yet, I'm hesitant to commit and I'm not sure if
>>> it will help me here.
>>>
>> It's been running at our 400+ customers with very few problems for the past month.
>> 0.9.1, released just yesterday, solves the known issues our customers have
>> encountered. You might want to take a look at it again. There are some issues on
>> FreeBSD though (was that you reporting them?). I just recently got a new laptop
>> with better support for running virtual systems, so I'm downloading a FreeBSD 8.1
>> install dvd as we speak. Hopefully I'll have those issues sorted out before the
>> end of the week.
>>
>> -- 
>> Andreas Ericsson                   andreas.ericsson at op5.se
> 
> Thanks, Andreas.  I'm hoping to allocate sufficient resources on the new servers
> to be able to play with Merlin more there.

It's quite resource-friendly actually. Well, compared to what you're running now
it's positively feather-light.

>  Will I be able to have the performance
> data from a poller be sent up to a NOC for digestion by pnp4nagios?

Yes, but you'll need the threadsafe version of Nagios you can obtain from either
CVS or git://git.op5.org/nagios.git for performance-data to work. Actually, you
need that for Merlin to work.

>  It may have
> been a long time ago, but I thought I remembered seeing that performance data was
> not yet implemented.
> 

That was then. This is now :)

> No, we'd be using some flavor of SLES.
> 

Should work marvellously then.

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.





