[naemon-users] Host checks occurring too fast

Robert Brockway robert at timetraveller.org
Tue Jul 13 08:41:40 CEST 2021


Hi all.  I looked at this for a while.  Naturally I solved it soon after 
mailing the list.  Apparently I didn't understand the host check logic as 
well as I thought I did.  It's right there in the doco.

Hosts are checked by the Naemon daemon:

* At regular intervals, as defined by the check_interval and retry_interval options in your host definitions.

* On-demand when a service associated with the host changes state.

* On-demand as needed as part of the host reachability logic.

* On-demand as needed for predictive host dependency checks.
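As a rough check on the arithmetic, here is a minimal Python sketch of how long the first bullet alone (regular interval checks) would take to reach a hard DOWN state with the template settings quoted below. The function name `seconds_to_hard` is hypothetical, and the model deliberately ignores the three on-demand cases; it assumes the outage is first seen by a regular check and that the remaining attempts are spaced exactly retry_interval apart.

```python
# Settings quoted from the prod-linux-server host template in this thread.
INTERVAL_LENGTH = 60      # seconds per "time unit" (Naemon default)
CHECK_INTERVAL = 3        # time units between regular checks while UP
RETRY_INTERVAL = 3        # time units between re-checks in a SOFT state
MAX_CHECK_ATTEMPTS = 5    # attempts before the state becomes HARD


def seconds_to_hard(check_interval, retry_interval, max_attempts, interval_length=60):
    """Return (earliest, latest) seconds from the host going down until it
    reaches a HARD state, in a simplified model with ONLY scheduled checks.

    The outage is detected by the next regular check (0 to check_interval
    time units later); max_attempts - 1 retries then follow, each
    retry_interval time units apart. On-demand checks are ignored.
    """
    retries = (max_attempts - 1) * retry_interval * interval_length
    detection_latest = check_interval * interval_length
    return retries, retries + detection_latest


earliest, latest = seconds_to_hard(
    CHECK_INTERVAL, RETRY_INTERVAL, MAX_CHECK_ATTEMPTS, INTERVAL_LENGTH
)
print(f"HARD DOWN after {earliest // 60} to {latest // 60} minutes")
```

Under these assumptions the scheduled checks alone would take 12 to 15 minutes to reach a hard state, which is roughly the window the template was meant to provide.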

These hosts have a lot of service checks, and each service state change can 
trigger an on-demand host check.  Moving to a hard down state after five 
checks in about two and a half minutes makes sense now.
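The timestamps from the Thruk log in the original message show how far apart the attempts actually were. A minimal Python sketch that computes the gaps (the helper name `attempt_gaps` is hypothetical; the timestamps are copied from the quoted log):

```python
from datetime import datetime

# Timestamps of the five HOST ALERT lines quoted from Thruk in this thread,
# oldest first (attempts 1 to 5).
ALERT_TIMES = [
    "2021-07-11 18:50:12",
    "2021-07-11 18:50:43",
    "2021-07-11 18:50:59",
    "2021-07-11 18:51:14",
    "2021-07-11 18:52:44",
]


def attempt_gaps(stamps):
    """Return the gaps in whole seconds between consecutive check attempts."""
    times = [datetime.strptime(s, "%Y-%m-%d %H:%M:%S") for s in stamps]
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]


gaps = attempt_gaps(ALERT_TIMES)
print("gaps between attempts (s):", gaps)
print("SOFT 1 to HARD 5 took", sum(gaps), "seconds")
```

Every gap is well under the 180 seconds a retry_interval of 3 would give, which fits the explanation that most of these attempts were on-demand checks rather than scheduled retries.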

https://www.naemon.org/documentation/usersguide/hostchecks.html

I've used Nagios/Icinga a lot over the years and now I'm using Naemon. 
In fact, when I first used Nagios it was called Netsaint[1].  I don't 
remember running into this problem before.  Perhaps the host check logic 
has changed over the years.  Either that or I ran into this a decade or 
two ago and just forgot.

So the solution is first_notification_delay.
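first_notification_delay is set per host (for example in the prod-linux-server template) and, like the other intervals, is counted in interval_length units, so with the default of 60 a value of 15 means 15 minutes. The sketch below is a deliberately simplified model of the effect, not Naemon's actual notification code: it assumes the delay is counted from when the outage began and ignores the fact that real notifications are evaluated only when check results come in. The function name `would_notify` and the value 15 are hypothetical.

```python
INTERVAL_LENGTH = 60  # seconds per time unit (Naemon default)


def would_notify(outage_seconds, first_notification_delay, interval_length=INTERVAL_LENGTH):
    """Simplified model: a problem notification goes out only if the host
    stays down for at least first_notification_delay time units.

    ASSUMPTION: the delay is measured from the start of the outage, and the
    host is still in a hard non-UP state when the delay expires.
    """
    return outage_seconds >= first_notification_delay * interval_length


# 152 s is the SOFT 1 to HARD 5 span from the Thruk log in this thread;
# 15 is a hypothetical delay of 15 time units (15 minutes).
print(would_notify(152, 15))      # transient failure: nobody is paged
print(would_notify(20 * 60, 15))  # host down for 20 minutes: page goes out
```

The design choice here is to leave max_check_attempts and the intervals alone, so the host still goes hard DOWN quickly in the UI, and to suppress only the page for short outages.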

Cheers,

Rob

[1] Before Netsaint I used Big Brother.  Let us never speak of Big Brother 
again.

On Tue, 13 Jul 2021, Robert Brockway wrote:

> Hi all.  I have the following settings specified in the prod-linux-server 
> host template:
>
> check_interval		3
> max_check_attempts	5
> retry_interval		3
>
> In addition, interval_length is at the default value of 60, so both intervals 
> above are measured in minutes.
>
> Despite this, host checks are occurring too fast.  An example is below, but 
> this has happened many times.  The problem is that Naemon is waking up staff 
> during transient network failures.  Our infrastructure has redundancy and 
> hosts are configured to reboot as a result of various failure modes.  The 
> application is robust and copes with all this fine.
>
> As a result I don't want anyone woken up until the host has been down for 
> about 15 minutes.
>
> Example checks on a down host as presented by Thruk:
>
> [2021-07-11 18:52:44] HOST ALERT: bob;DOWN;HARD;5;CRITICAL - Socket timeout 
> after 10 seconds
> [2021-07-11 18:51:14] HOST ALERT: bob;DOWN;SOFT;4;CRITICAL - Socket timeout 
> after 10 seconds
> [2021-07-11 18:50:59] HOST ALERT: bob;DOWN;SOFT;3;CRITICAL - Socket timeout 
> after 10 seconds
> [2021-07-11 18:50:43] HOST ALERT: bob;DOWN;SOFT;2;CRITICAL - Socket timeout 
> after 10 seconds
> [2021-07-11 18:50:12] HOST ALERT: bob;DOWN;SOFT;1;CRITICAL - Socket timeout 
> after 10 seconds
>
> NB: The hostname isn't really called 'bob'.
>
> I thought perhaps host freshness checking was the problem, so I turned it off, 
> but it hasn't made a difference.  We don't currently have any passive checks, 
> so I think it is safe to turn off host freshness.
>
> I'm going to set first_notification_delay to 10 minutes as a workaround. 
> Even a 10 minute delay will be a lot better than what is happening now.
>
> Any help greatly appreciated.
>
> Cheers,
>
> Rob
>

