[naemon-dev] CPU spike + hang when thruk attempts to connect using 1.0.6 and 1.0.7 in a docker container

Terence Kent terencekent at gmail.com
Wed May 17 02:04:13 CEST 2017


Hey Jelle,

Looks like you've got the same symptom with a different underlying problem :-(. I can
say for certain that the symptom in my case was caused by the double livestatus
loading, so we know you're running into something different.
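In case it helps anyone else who lands here: a quick way to see whether livestatus is being loaded twice is to count the broker_module lines across naemon.cfg and the module-conf.d drop-ins. A minimal sketch (the files below are a throwaway copy of the stock layout, not the real /etc/naemon paths):

```python
import os
import tempfile

# The directive that appears both in naemon.cfg (if uncommented) and in
# the packaged module-conf.d/livestatus.cfg drop-in.
DIRECTIVE = ("broker_module=/usr/lib/naemon/naemon-livestatus/livestatus.so"
             " /var/cache/naemon/live")

def count_broker_modules(paths):
    """Map each broker_module line to how often it appears across files."""
    counts = {}
    for path in paths:
        with open(path) as fh:
            for line in fh:
                line = line.strip()
                if line.startswith("broker_module"):
                    counts[line] = counts.get(line, 0) + 1
    return counts

# Recreate the two files that together caused the double load.
tmp = tempfile.mkdtemp()
main_cfg = os.path.join(tmp, "naemon.cfg")
dropin_cfg = os.path.join(tmp, "livestatus.cfg")
for p in (main_cfg, dropin_cfg):
    with open(p, "w") as fh:
        fh.write(DIRECTIVE + "\n")

counts = count_broker_modules([main_cfg, dropin_cfg])
# Any count > 1 means livestatus.so is loaded, and its socket
# initialized, twice.
print(counts[DIRECTIVE])  # → 2
```

On a real system you'd point the function at /etc/naemon/naemon.cfg and every .cfg under /etc/naemon/module-conf.d/ instead of the temp files.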

Best,
Terence


On Tue, May 16, 2017 at 8:12 AM, jesm <crap8 at smetj.net> wrote:

> Hi all,
>
> We experience exactly the same problem only that we don't have the
> situation of loading Livestatus twice.
> We're not using Docker and we're using the latest stable... We could try
> the nightly build but unfortunately we cannot reproduce the problem ...
> It comes and goes and we have no idea why ...
>
> The symptoms we are seeing:
>
>    - All Naemon-related threads are consuming 100% CPU on every core.
>    - Thruk is not able to connect to the Unix domain socket, and therefore
>    each incoming request starts an fcgi process, exhausting the pool in no time.
>    - Weirdly enough, during this state it is possible to manually query
>    livestatus using unixcat or socat.
>    - Restarting Naemon does not help.
>    - Rebooting the server does not help.
>    - Removing retention.dat solves the problem.
>    - Restoring the retention.dat that was removed during the
>    outage does NOT trigger the problem again.
>    - Stracing the threads shows a continuous barrage of entries like (I
>    have no more detailed extraction of this output):
>
>       <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
>
>
>
> After the retention.dat file was deleted and Naemon restarted we were not
> able to trigger the same problem.
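> For reference, "manually query livestatus" above means speaking LQL over
> the Unix domain socket, the way unixcat or socat does. A minimal
> stand-in sketch (a throwaway socket and a canned reply instead of the
> real /var/cache/naemon/live, just to make the round trip visible):

```python
import os
import socket
import tempfile
import threading

# A real check would target /var/cache/naemon/live; here a tiny stand-in
# server answers on a throwaway socket so the sketch runs anywhere.
QUERY = b"GET status\nColumns: program_version\n\n"
sock_path = os.path.join(tempfile.mkdtemp(), "live")

srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
srv.bind(sock_path)
srv.listen(1)

def fake_livestatus():
    conn, _ = srv.accept()
    conn.recv(4096)           # read the LQL query
    conn.sendall(b"1.0.6\n")  # canned single-column reply
    conn.close()

t = threading.Thread(target=fake_livestatus)
t.start()

# Client side -- roughly what `unixcat /var/cache/naemon/live` does.
cli = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
cli.connect(sock_path)
cli.sendall(QUERY)
cli.shutdown(socket.SHUT_WR)
reply = cli.recv(4096).decode().strip()
cli.close()
t.join()
srv.close()
print(reply)  # → 1.0.6
```

> If a round trip like this succeeds against the real socket while the
> threads spin at 100%, that matches what we saw: the listener itself is
> alive and the hang is elsewhere.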
>
>
> Any ideas?
>
> Cheers,
>
> Jelle
>
>
> May 12, 2017 4:27 AM, "Terence Kent" <terencekent at gmail.com> wrote:
>
> Hey Sven,
> Thanks for getting back to me so quickly; this was particularly
> challenging to chase down. Using strace and livestatus debugging didn't
> actually give me more information on this one. I also confirmed I had the
> issue with the nightly build as well as 1.0.6.
> Anyway, I found the cause of the issue. It's configuration-related and
> pretty subtle. If you uncomment the following directive in the
> /etc/naemon/naemon.cfg file...
>
> broker_module=/usr/lib/naemon/naemon-livestatus/livestatus.so
> /var/cache/naemon/live
>
> ...then the livestatus socket gets initialized twice during Naemon's
> startup, causing the issue I described earlier. The duplicate
> initialization happens because /etc/naemon/module-conf.d/livestatus.cfg
> also includes the same directive. There's a hint of the duplicate
> initialization in the naemon log, in the form of repeated livestatus
> initialization messages, but that's it.
> It seems the only issue here is that the configuration is confusing
> (/etc/naemon/naemon.cfg includes an example of how to use livestatus,
> making you think you should just be able to uncomment it) and that
> repeating a configuration directive doesn't produce an obvious error.
> Would you like me to file an issue for this? While it's easy to resolve,
> it's really hard to chase down.
> Thanks!
> Terence
> On Tue, May 9, 2017 at 12:12 AM, Sven Nierlein <Sven.Nierlein at consol.de>
> wrote:
>
> Hi Terence,
>
> Could you try the latest nightly build, just to be sure to not hunt
> already fixed bugs. If that doesn't help, you could increase the
> livestatus loglevel as well. Naemon has a debug log which could be
> enabled, and of course strace often gives a good idea on whats
> happening as well.
>
> Cheers,
> Sven
>
>
> On 09.05.2017 02:05, Terence Kent wrote:
> > Hello!
> >
> > We're trying to update our naemon docker image to 1.0.6 and we're
> running into a fairly difficult-to-debug issue. Here's what we're
> seeing:
> >
> > 1. Naemon + Apache start as expected and will run indefinitely if Thruk
> is not accessed.
> > 2. Upon signing in to Thruk, the Naemon process's CPU consumption jumps to
> 100% and stays there indefinitely.
> >
> > We've been trying to get at some logging messages to see if we can
> diagnose the behavior, but that's been a bit more trouble than we expected.
> So far, we've just done the obvious thing of increasing the debugging levels
> found in /etc/naemon/naemon.cfg. However, this seems to produce no additional
> information when the issue is hit.
> >
> > Anyway, here's some information about the container environment:
> >
> > *Base image:* phusion 0.9.21 (which is Ubuntu 16.04)
> > *Naemon primary log file entries:* These always look like the following. Not much
> to go off of.
> > ––––
> >
> > [1494286706] Naemon 1.0.6-pkg starting... (PID=51)
> >
> > [1494286706] Local time is Mon May 08 23:38:26 UTC 2017
> >
> > [1494286706] LOG VERSION: 2.0
> >
> > [1494286706] qh: Socket '/var/lib/naemon/naemon.qh' successfully
> initialized
> >
> > [1494286706] nerd: Channel hostchecks registered successfully
> >
> > [1494286706] nerd: Channel servicechecks registered successfully
> >
> > [1494286706] nerd: Fully initialized and ready to rock!
> >
> > [1494286706] wproc: Successfully registered manager as @wproc with query
> handler
> >
> > [1494286706] wproc: Registry request: name=Core Worker 55;pid=55
> >
> > [1494286706] wproc: Registry request: name=Core Worker 57;pid=57
> >
> > [1494286706] wproc: Registry request: name=Core Worker 59;pid=59
> >
> > [1494286706] wproc: Registry request: name=Core Worker 61;pid=61
> >
> > [1494286706] wproc: Registry request: name=Core Worker 58;pid=58
> >
> > [1494286706] wproc: Registry request: name=Core Worker 60;pid=60
> >
> > ––––
> > *Naemon livestatus log:* (Blank)
> > *Thruk logs:* Nothing comes out here until I kill the naemon service;
> then it's just:
> > ––––––––
> >
> > [2017/05/08 19:34:00][nameon][ERROR][Thruk] No Backend available
> >
> > [2017/05/08 19:34:00][nameon][ERROR][Thruk] on page:
> http://10.13.30.200/thruk/cgi-bin/minemap.cgi?_=1494272037931
> >
> > [2017/05/08 19:34:00][nameon][ERROR][Thruk] Naemon: ERROR: failed to
> connect - Connection refused. (/var/cache/naemon/live)
> >
> > –––––––––
> >
> >
> >
> > From tracing around, we're pretty confident the issue occurs when Thruk
> attempts to connect to the naemon live socket. However, pinpointing the
> cause has been tough; we know the filesystem permissions are correct, we
> believe the socket is working based on the log messages, and Thruk works as
> expected when we stop naemon (it shows its interface and reports that it
> cannot connect to naemon). We can keep at this, of course, but I was hoping
> we could get pointed in the right direction.
> >
> >
> > Thanks!
> >
> > Terence
> >
> >
>
> --
> Sven Nierlein Sven.Nierlein at consol.de
> ConSol* GmbH http://www.consol.de
> Franziskanerstrasse 38 Tel.:089/45841-439
> 81669 Muenchen Fax.:089/45841-111
>
>
>
>