distributed host checks: freshness checking issues

Pascal Vandeputte nagios at asmodeus.be
Tue May 31 17:24:38 CEST 2011


Hi fellow Nagios users,

On a new Nagios 3.2.3 monitoring platform we're deploying at work (complete with a custom database & provisioning scripts tailored to our company activities) we've been running distributed service checks for over half a year now, without 
any issues whatsoever. I'm really happy with the results of that testing so far.

But because the "master" server, receiving all the passive service checks by means of NSCA, still performed its own host checking, some of the hosts which are in a private network (together with one of the distributed slave) were always 
"down" because obviously the master could never ping them.

This is where the passive host checking comes in. It didn't seem too difficult to get going, but once it was up and running I noticed that the master was having wild mood swings. Seemingly random hosts were going up/down/unreachable 
and the picture changed completely multiple times per hour. In other words, each host going "down" usually went "up" within the next few minutes, and others start going "down" at that time.

We have 2 masters running (the submit_check_result scripts on the slaves do a send_nsca to both of them), and the affected hosts are almost identical at any time... So while the weird behaviour looks random, it's apparently reproducible. 
So there must be some bug in my config or something else going on, but at least it's not intermittent.

Long troubleshooting story short: I've got the impression that for some reason, the master is executing the "host_results_are_stale" command while it shouldn't be doing it (because the last check results are still fresh). This causes it to go 
down anyway, and depending on the timing of the next host check result arriving, it flips up and down all the time.


I've picked out just one of the 300 hosts which are currently defined, there's usually about 15 to 90 hosts "down" at all times, but which ones are "down" changes constantly.

On the master, the host is defined as in attachment "master.txt". I've also included the host_results_are_stale command definition.
The same host definition on the slave is in "slave.txt". I've also included the contents of the submit_host_check OCHP script. The slave side of things seems to work well.

Some other relevant parameters in nagios.cfg on the master:

    check_external_commands=1
    command_check_interval=-1
    command_file=/usr/local/nagios/var/rw/nagios.cmd
    external_command_buffer_slots=32768
    check_result_reaper_frequency=10
    check_result_path=/usr/local/nagios/var/spool/checkresults
    interval_length=15
    execute_service_checks=0
    accept_passive_service_checks=1
    execute_host_checks=0
    accept_passive_host_checks=1
    check_service_freshness=1
    service_freshness_check_interval=60
    check_host_freshness=1
    host_freshness_check_interval=60


The interval_length is set to 15 seconds, but I've also tried the default of 60 seconds for a couple of hours today and it doesn't make any difference.
As you'll see in the configs, the slave checks the host every 2 minutes, the master should alert if the last check result is older than 3 minutes.


If I filter the master's nagios.log for just the NSCA host checks arriving, it looks like this (I've replaced the logged timestamp with the human-readable date using some perl, the original log is attached as well):

[15:01:52] EXTERNAL COMMAND: PROCESS_HOST_CHECK_RESULT;rtr001;0;PING OK - Packet loss = 0%, RTA = 1.14 ms (from nagslave002)
[15:03:38] EXTERNAL COMMAND: PROCESS_HOST_CHECK_RESULT;rtr001;0;PING OK - Packet loss = 0%, RTA = 1.37 ms (from nagslave002)
[15:03:58] EXTERNAL COMMAND: PROCESS_HOST_CHECK_RESULT;rtr001;0;PING OK - Packet loss = 0%, RTA = 49.99 ms (from nagslave002)
[15:06:08] EXTERNAL COMMAND: PROCESS_HOST_CHECK_RESULT;rtr001;0;PING OK - Packet loss = 0%, RTA = 9.70 ms (from nagslave002)
[15:08:19] EXTERNAL COMMAND: PROCESS_HOST_CHECK_RESULT;rtr001;0;PING OK - Packet loss = 0%, RTA = 3.16 ms (from nagslave002)
[15:08:39] EXTERNAL COMMAND: PROCESS_HOST_CHECK_RESULT;rtr001;0;PING OK - Packet loss = 0%, RTA = 1.23 ms (from nagslave002)
[15:10:29] EXTERNAL COMMAND: PROCESS_HOST_CHECK_RESULT;rtr001;0;PING OK - Packet loss = 0%, RTA = 15.64 ms (from nagslave002)


So the slave is regularly sending its results every 130 seconds according to the log above. 130 seconds is way below the freshness_threshold of 180 seconds set on the master.

If I leave in the rest of the log for this host, this is the result:

[15:01:52] EXTERNAL COMMAND: PROCESS_HOST_CHECK_RESULT;rtr001;0;PING OK - Packet loss = 0%, RTA = 1.14 ms (from nagslave002)
[15:02:02] PASSIVE HOST CHECK: rtr001;0;PING OK - Packet loss = 0%, RTA = 1.14 ms (from nagslave002)
[15:03:38] EXTERNAL COMMAND: PROCESS_HOST_CHECK_RESULT;rtr001;0;PING OK - Packet loss = 0%, RTA = 1.37 ms (from nagslave002)
[15:03:47] HOST ALERT: rtr001;DOWN;SOFT;1;CRITICAL: rtr001 host results are stale, nagios slave problem (NSCA, connectivity)?
[15:03:47] PASSIVE HOST CHECK: rtr001;0;PING OK - Packet loss = 0%, RTA = 1.37 ms (from nagslave002)
[15:03:47] HOST ALERT: rtr001;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 1.37 ms (from nagslave002)
[15:03:58] EXTERNAL COMMAND: PROCESS_HOST_CHECK_RESULT;rtr001;0;PING OK - Packet loss = 0%, RTA = 49.99 ms (from nagslave002)
[15:04:07] HOST ALERT: rtr001;DOWN;SOFT;1;CRITICAL: rtr001 host results are stale, nagios slave problem (NSCA, connectivity)?
[15:04:07] PASSIVE HOST CHECK: rtr001;0;PING OK - Packet loss = 0%, RTA = 49.99 ms (from nagslave002)
[15:04:07] HOST ALERT: rtr001;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 49.99 ms (from nagslave002)
[15:06:08] EXTERNAL COMMAND: PROCESS_HOST_CHECK_RESULT;rtr001;0;PING OK - Packet loss = 0%, RTA = 9.70 ms (from nagslave002)
[15:06:09] PASSIVE HOST CHECK: rtr001;0;PING OK - Packet loss = 0%, RTA = 9.70 ms (from nagslave002)
[15:08:19] EXTERNAL COMMAND: PROCESS_HOST_CHECK_RESULT;rtr001;0;PING OK - Packet loss = 0%, RTA = 3.16 ms (from nagslave002)
[15:08:20] PASSIVE HOST CHECK: rtr001;0;PING OK - Packet loss = 0%, RTA = 3.16 ms (from nagslave002)
[15:08:39] EXTERNAL COMMAND: PROCESS_HOST_CHECK_RESULT;rtr001;0;PING OK - Packet loss = 0%, RTA = 1.23 ms (from nagslave002)
[15:08:49] PASSIVE HOST CHECK: rtr001;0;PING OK - Packet loss = 0%, RTA = 1.23 ms (from nagslave002)
[15:08:49] HOST ALERT: rtr001;DOWN;SOFT;1;CRITICAL: rtr001 host results are stale, nagios slave problem (NSCA, connectivity)?
[15:09:19] HOST ALERT: rtr001;DOWN;HARD;2;CRITICAL: rtr001 host results are stale, nagios slave problem (NSCA, connectivity)?
[15:10:29] EXTERNAL COMMAND: PROCESS_HOST_CHECK_RESULT;rtr001;0;PING OK - Packet loss = 0%, RTA = 15.64 ms (from nagslave002)
[15:10:40] PASSIVE HOST CHECK: rtr001;0;PING OK - Packet loss = 0%, RTA = 15.64 ms (from nagslave002)
[15:10:40] HOST ALERT: rtr001;UP;HARD;1;PING OK - Packet loss = 0%, RTA = 15.64 ms (from nagslave002)
[15:10:50] HOST ALERT: rtr001;DOWN;SOFT;1;CRITICAL: rtr001 host results are stale, nagios slave problem (NSCA, connectivity)?
[15:11:29] HOST ALERT: rtr001;DOWN;HARD;2;CRITICAL: rtr001 host results are stale, nagios slave problem (NSCA, connectivity)?


For example, where is the "stale" result at 15:03:47 coming from? The last result was from 15:01:52 , which is less than two minutes old at that time.


I've also replaced the check_dummy by a small wrapper so I could log the exact time when check_dummy was called by nagios:

Tue May 31 15:03:37 CEST 2011 args: 2 rtr001 host results are stale, nagios slave problem (NSCA, connectivity)?
Tue May 31 15:03:57 CEST 2011 args: 2 rtr001 host results are stale, nagios slave problem (NSCA, connectivity)?
Tue May 31 15:08:39 CEST 2011 args: 2 rtr001 host results are stale, nagios slave problem (NSCA, connectivity)?
Tue May 31 15:09:09 CEST 2011 args: 2 rtr001 host results are stale, nagios slave problem (NSCA, connectivity)?
Tue May 31 15:09:39 CEST 2011 args: 2 rtr001 host results are stale, nagios slave problem (NSCA, connectivity)?
Tue May 31 15:10:39 CEST 2011 args: 2 rtr001 host results are stale, nagios slave problem (NSCA, connectivity)?
Tue May 31 15:11:20 CEST 2011 args: 2 rtr001 host results are stale, nagios slave problem (NSCA, connectivity)?

So the first "stale" command was already executed just 105 seconds after the previous host check result was received.

What am I missing?


Best regards,

Pascal
-------------- next part --------------
Host & host_results_are_stale definition on the distributed master
(which only processes passive host & service results):

    define host {
        address 1.2.3.4
        alias somerouter
        check_command host_results_are_stale
        check_freshness 1
        check_interval 0
        check_period 24x7
        contact_groups network-admins
        freshness_threshold 180
        host_name rtr001
        hostgroups ios,router
        icon_image vendor-logos/cisco.png
        max_check_attempts 2
        notification_interval 8
        notification_period 24x7
        retry_interval 4
        statusmap_image vendor-logos/cisco.gd2
        use host-pnp
    }

    define command {
        command_line /usr/local/nagios/libexec/check_dummy 2 "$HOSTNAME$ host results are stale, nagios slave problem (NSCA, connectivity)?"
        command_name host_results_are_stale
    }
-------------- next part --------------
Host definition on the slave:

    define host {
        address 1.2.3.4
        alias somerouter
        check_command check-host-alive!5!-!-!-!-!-!2000!80!3000!100!5
        check_interval 8
        check_period 24x7
        contact_groups network-admins
        host_name rtr001
        hostgroups ios,router
        icon_image vendor-logos/cisco.png
        max_check_attempts 2
        notification_interval 8
        notification_period 24x7
        retry_interval 4
        statusmap_image vendor-logos/cisco.gd2
    }


And my submit_host_check ochp script (sending results to 2 masters):

    #!/bin/sh
    host_name=$1
    shift
    state_id=$1
    shift
    plugin_output="$*"
    myhostname=`hostname`

    /usr/bin/printf "%s\t%s\t%s\n" "$host_name" "$state_id" "$plugin_output (from $myhostname)" | /usr/local/nagios/libexec/send_nsca -H 3.4.5.6 -c /usr/local/nagios/etc/send_nsca.cfg &
    /usr/bin/printf "%s\t%s\t%s\n" "$host_name" "$state_id" "$plugin_output (from $myhostname)" | /usr/local/nagios/libexec/send_nsca -H 2.3.4.5  -c /usr/local/nagios/etc/send_nsca.cfg
-------------- next part --------------
A non-text attachment was scrubbed...
Name: nagios.log
Type: text/x-log
Size: 2692 bytes
Desc: not available
URL: <https://www.monitoring-lists.org/archive/users/attachments/20110531/6085693a/attachment.bin>
-------------- next part --------------
------------------------------------------------------------------------------
Simplify data backup and recovery for your virtual environment with vRanger. 
Installation's a snap, and flexible recovery options mean your data is safe,
secure and there when you need it. Data protection magic?
Nope - It's vRanger. Get your free trial download today. 
http://p.sf.net/sfu/quest-sfdev2dev
-------------- next part --------------
_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users
::: Please include Nagios version, plugin version (-v) and OS when reporting any issue. 
::: Messages without supporting info will risk being sent to /dev/null


More information about the Users mailing list