[naemon-dev] CPU spike + hang when Thruk attempts to connect using 1.0.6 and 1.0.7 in a Docker container

Terence Kent terencekent at gmail.com
Fri May 12 04:26:45 CEST 2017


Hey Sven,

Thanks for getting back to me so quickly. This one was particularly
challenging to chase down: strace and livestatus debugging didn't actually
give me any more information. I also confirmed I had the issue with the
nightly build as well as 1.0.6.

Anyway, I found the cause of the issue. It's configuration related and
pretty subtle. If you uncomment the following directive in the
/etc/naemon/naemon.cfg file...

broker_module=/usr/lib/naemon/naemon-livestatus/livestatus.so /var/cache/naemon/live

...then the livestatus socket gets initialized twice during Naemon's
startup, causing the issue I described earlier. The duplicate
initialization happens because /etc/naemon/module-conf.d/livestatus.cfg
also includes the same directive. There's a hint of the duplicate
initialization in the naemon log, in the form of multiple log messages for
livestatus initialization, but that's it.
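
For anyone who hits the same thing, here is a rough sketch of a check that
would have caught it. It scans a list of config files for uncommented
broker_module directives and reports any module that is loaded more than
once. The function names (find_broker_modules, duplicated, demo) are made up
for this example, and the script only assumes the plain key=value directive
format shown above; it does not follow include directives or know anything
else about how Naemon parses its configuration.

```python
import os
import tempfile
from collections import defaultdict


def find_broker_modules(paths):
    """Map each broker module path to the (file, line number) pairs that load it."""
    seen = defaultdict(list)
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for lineno, raw in enumerate(handle, start=1):
                line = raw.strip()
                # Skip blank lines and commented-out directives.
                if not line or line.startswith("#"):
                    continue
                key, sep, value = line.partition("=")
                if sep and key.strip() == "broker_module" and value.strip():
                    # The first token is the module; the rest are its arguments.
                    module = value.split()[0]
                    seen[module].append((path, lineno))
    return dict(seen)


def duplicated(modules):
    """Keep only the modules that are loaded more than once."""
    return {m: locs for m, locs in modules.items() if len(locs) > 1}


def demo():
    """Recreate the two-file situation from this thread in a temp directory."""
    directive = ("broker_module=/usr/lib/naemon/naemon-livestatus/livestatus.so "
                 "/var/cache/naemon/live\n")
    with tempfile.TemporaryDirectory() as tmp:
        main_cfg = os.path.join(tmp, "naemon.cfg")
        module_cfg = os.path.join(tmp, "livestatus.cfg")
        with open(main_cfg, "w", encoding="utf-8") as handle:
            handle.write("# main config\n" + directive)
        with open(module_cfg, "w", encoding="utf-8") as handle:
            handle.write(directive)
        dups = duplicated(find_broker_modules([main_cfg, module_cfg]))
        # Report base names only so the result does not depend on the temp path.
        return {m: [(os.path.basename(p), n) for p, n in locs]
                for m, locs in dups.items()}


if __name__ == "__main__":
    for module, locations in demo().items():
        print(module, "is loaded", len(locations), "times:", locations)
```

On a real system you would pass the actual files instead, for example
/etc/naemon/naemon.cfg plus everything under /etc/naemon/module-conf.d/.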

It seems the only issue here is that the configuration is very confusing
(/etc/naemon/naemon.cfg gives you an example of how to use livestatus,
making you think you should just be able to uncomment it) and that
repeating a configuration directive doesn't produce an obvious error.

Would you like me to file an issue for this? While it's easy to resolve,
it's really hard to chase down.

Thanks!
Terence

On Tue, May 9, 2017 at 12:12 AM, Sven Nierlein <Sven.Nierlein at consol.de>
wrote:

> Hi Terence,
>
> Could you try the latest nightly build, just to be sure we're not hunting
> already fixed bugs? If that doesn't help, you could increase the
> livestatus loglevel as well. Naemon has a debug log which could be
> enabled, and of course strace often gives a good idea of what's
> happening as well.
>
> Cheers,
>  Sven
>
>
> On 09.05.2017 02:05, Terence Kent wrote:
> > Hello!
> >
> > We're trying to update our naemon docker image to 1.0.6 and we're
> > running into a fairly difficult-to-debug issue. Here's the issue we're
> > seeing:
> >
> > 1. Naemon + Apache start as expected and will run indefinitely, if Thruk
> >    is not accessed.
> > 2. Upon signin to Thruk, the Naemon process's CPU consumption jumps to
> >    100% and will stay there indefinitely.
> >
> > We've been trying to get at some logging messages to see if we can
> > diagnose the behavior, but that's been a bit more trouble than we expected.
> > So far, we've just done the obvious thing of increasing the debugging levels
> > found in /etc/naemon/naemon.cfg. However, this seems to produce no additional
> > information when the issue is hit.
> >
> > Anyway, here's some information about the container environment:
> >
> > *Base image:* phusion 0.9.21 (which is Ubuntu 16.04)
> > *Naemon primary log file entries:* These always look like this. Not much
> > to go off of.
> > ––––
> > [1494286706] Naemon 1.0.6-pkg starting... (PID=51)
> > [1494286706] Local time is Mon May 08 23:38:26 UTC 2017
> > [1494286706] LOG VERSION: 2.0
> > [1494286706] qh: Socket '/var/lib/naemon/naemon.qh' successfully initialized
> > [1494286706] nerd: Channel hostchecks registered successfully
> > [1494286706] nerd: Channel servicechecks registered successfully
> > [1494286706] nerd: Fully initialized and ready to rock!
> > [1494286706] wproc: Successfully registered manager as @wproc with query handler
> > [1494286706] wproc: Registry request: name=Core Worker 55;pid=55
> > [1494286706] wproc: Registry request: name=Core Worker 57;pid=57
> > [1494286706] wproc: Registry request: name=Core Worker 59;pid=59
> > [1494286706] wproc: Registry request: name=Core Worker 61;pid=61
> > [1494286706] wproc: Registry request: name=Core Worker 58;pid=58
> > [1494286706] wproc: Registry request: name=Core Worker 60;pid=60
> > ––––
> > *Naemon livestatus log:* (Blank)
> > *Thruk Logs:* Nothing comes out here until I kill the naemon service,
> > then it's just:
> > ––––––––
> > [2017/05/08 19:34:00][nameon][ERROR][Thruk] No Backend available
> > [2017/05/08 19:34:00][nameon][ERROR][Thruk] on page: http://10.13.30.200/thruk/cgi-bin/minemap.cgi?_=1494272037931
> > [2017/05/08 19:34:00][nameon][ERROR][Thruk] Naemon: ERROR: failed to connect - Connection refused. (/var/cache/naemon/live)
> > –––––––––
> >
> >
> >
> > From tracing around, we're pretty confident the issue occurs when Thruk
> > attempts to connect to the naemon live socket. However, finding the cause
> > has been tough: we know the fs permissions are correct, we believe the
> > socket is working from the log messages, and Thruk works as expected when
> > we stop naemon (it shows its interfaces and errors that it cannot connect
> > to naemon). We can keep at this, of course, but I was hoping we could get
> > pointed in the right direction.
> >
> >
> > Thanks!
> >
> > Terence
> >
> >
>
>
> --
> Sven Nierlein             Sven.Nierlein at consol.de
> ConSol* GmbH              http://www.consol.de
> Franziskanerstrasse 38    Tel.:089/45841-439
> 81669 Muenchen            Fax.:089/45841-111
>