[naemon-dev] CPU spike + hang when thruk attempts to connect using 1.0.6 and 1.0.7 in a docker container

Terence Kent terencekent at gmail.com
Tue May 9 02:05:37 CEST 2017


Hello!

We're trying to update our naemon docker image to 1.0.6 and we're running
into a fairly difficult-to-debug issue. Here's the issue we're seeing:

1. Naemon + Apache start as expected and will run indefinitely, if Thruk is
not accessed.
2. Upon sign-in to Thruk, the Naemon process's CPU consumption jumps to 100%
and stays there indefinitely.

We've been trying to get at some log messages to see if we can diagnose
the behavior, but that's been a bit more trouble than we expected. So far,
we've just done the obvious thing of increasing the debugging levels found
in /etc/naemon/naemon.cfg. However, this seems to produce no additional
information when the issue is hit.
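
One quick sanity check is to confirm which debug directives are actually set in the config file. The Python sketch below is only an illustration: it assumes the usual `key=value` format of naemon.cfg and the `debug_level`, `debug_verbosity`, and `debug_file` directive names, and `read_debug_settings` is a hypothetical helper, not something shipped with Naemon.

```python
def read_debug_settings(cfg_text):
    """Return the debug_* directives found in naemon.cfg-style text.

    Assumes one `key=value` directive per line and `#` for comments.
    """
    settings = {}
    for raw in cfg_text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        if key.startswith("debug_"):
            # If a directive is repeated, the last occurrence wins here.
            settings[key] = value.strip()
    return settings
```

Calling it on the contents of /etc/naemon/naemon.cfg and printing the result shows whether the changed values are really in the file the running process was started with, and where `debug_file` points.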

Anyway, here's some information about the container environment:

*Base image:* phusion 0.9.21 (Which is Ubuntu 16.04)
*Naemon primary log file entries:* These always look like this. Not much to
go on.
––––
[1494286706] Naemon 1.0.6-pkg starting... (PID=51)
[1494286706] Local time is Mon May 08 23:38:26 UTC 2017
[1494286706] LOG VERSION: 2.0
[1494286706] qh: Socket '/var/lib/naemon/naemon.qh' successfully initialized
[1494286706] nerd: Channel hostchecks registered successfully
[1494286706] nerd: Channel servicechecks registered successfully
[1494286706] nerd: Fully initialized and ready to rock!
[1494286706] wproc: Successfully registered manager as @wproc with query handler
[1494286706] wproc: Registry request: name=Core Worker 55;pid=55
[1494286706] wproc: Registry request: name=Core Worker 57;pid=57
[1494286706] wproc: Registry request: name=Core Worker 59;pid=59
[1494286706] wproc: Registry request: name=Core Worker 61;pid=61
[1494286706] wproc: Registry request: name=Core Worker 58;pid=58
[1494286706] wproc: Registry request: name=Core Worker 60;pid=60
––––
*Naemon livestatus log:* (Blank)
*Thruk logs:* Nothing appears here until I kill the naemon service; then
it's just:
––––––––
[2017/05/08 19:34:00][nameon][ERROR][Thruk] No Backend available
[2017/05/08 19:34:00][nameon][ERROR][Thruk] on page: http://10.13.30.200/thruk/cgi-bin/minemap.cgi?_=1494272037931
[2017/05/08 19:34:00][nameon][ERROR][Thruk] Naemon: ERROR: failed to connect - Connection refused. (/var/cache/naemon/live)
–––––––––



From tracing around, we're pretty confident the issue occurs when Thruk
attempts to connect to the naemon live socket. However, pinning down the
cause has been tough: we know the fs permissions are correct, we
believe the socket is working from the log messages, and Thruk works as
expected when we stop naemon (it shows its interfaces and reports that it
cannot connect to naemon). We can keep at this, of course, but I was hoping
we could get pointed in the right direction.
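
One way to take Thruk out of the picture is to query the live socket directly with a timeout and see whether that alone triggers the hang. The Python sketch below is a rough illustration under a few assumptions: the socket speaks the plain livestatus text protocol (a `GET` query ending in a blank line), the path is the one from the Thruk error above, and `query_livestatus` is a hypothetical helper name, not part of Naemon or Thruk.

```python
import socket


def query_livestatus(path, query="GET status\nColumns: program_version\n\n", timeout=5.0):
    """Send one livestatus query over a Unix socket and return the raw reply.

    Raises socket.timeout if the peer does not answer within `timeout`
    seconds, which is what a hung core would be expected to look like.
    """
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        sock.connect(path)
        sock.sendall(query.encode("utf-8"))
        # Close the write side so the server knows the request is complete.
        sock.shutdown(socket.SHUT_WR)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closed the connection
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")
```

For example, `query_livestatus("/var/cache/naemon/live")` run inside the container as the same user as Apache would either return a version string or time out. A timeout accompanied by the CPU spike would suggest the problem is in the core or the livestatus module rather than in Thruk.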


Thanks!

Terence