freshness_threshold bug - big problem

Jochen Bern Jochen.Bern at LINworks.de
Thu Dec 16 21:59:40 CET 2010


On 12/16/2010 12:03 PM, Rodney Ramos wrote:
> As I've said before I think that it is a Nagios Core bug. I've tested it
> with Nagios 3.2.1 and I found the same problem.
> I think it's a serious problem.


Oh, wow. 8-O I can confirm the effect on my 3.2.3, but there seems to be
*more* of a problem with host freshness checks. Test run with
check_interval 15, retry_interval 2, max_check_attempts 4; log excerpt:


18:23:55 Warning: Host 'Unfresh' has no services associated with it!
18:24:28 EXTERNAL COMMAND: PROCESS_HOST_CHECK_RESULT;Unfresh;0;Manual Init to UP|
18:24:35 PASSIVE HOST CHECK: Unfresh;0;Manual Init to UP

18:39:55 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 12s
   (threshold=0d 0h 15m 16s). I'm forcing an immediate check of the host.
18:40:05 HOST ALERT: Unfresh;DOWN;SOFT;1;(null)

18:51:12 Warning: Host 'Unfresh' has no services associated with it!

18:56:13 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 59s
   (threshold=0d 0h 15m 17s). I'm forcing an immediate check of the host.
18:56:23 HOST ALERT: Unfresh;DOWN;SOFT;2;(null)
19:00:12 Warning: Host 'Unfresh' has no services associated with it!
19:12:13 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 45s
   (threshold=0d 0h 15m 15s). I'm forcing an immediate check of the host.
19:12:23 HOST ALERT: Unfresh;DOWN;SOFT;2;CRITICAL: All life functions terminated
19:28:13 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 39s
   (threshold=0d 0h 15m 18s). I'm forcing an immediate check of the host.
19:28:23 HOST ALERT: Unfresh;DOWN;SOFT;3;CRITICAL: All life functions terminated
19:44:13 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 39s
   (threshold=0d 0h 15m 18s). I'm forcing an immediate check of the host.
19:44:23 HOST ALERT: Unfresh;DOWN;HARD;4;CRITICAL: All life functions terminated
20:00:13 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 39s
   (threshold=0d 0h 15m 18s). I'm forcing an immediate check of the host.
20:16:13 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 41s
   (threshold=0d 0h 15m 17s). I'm forcing an immediate check of the host.
20:32:13 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 39s
   (threshold=0d 0h 15m 18s). I'm forcing an immediate check of the host.
20:48:13 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 45s
   (threshold=0d 0h 15m 15s). I'm forcing an immediate check of the host.
21:04:13 Warning: The results of host 'Unfresh' are stale by 0d 0h 0m 45s
   (threshold=0d 0h 15m 15s). I'm forcing an immediate check of the host.


(The additional "no services" crud stems from my not getting the check
command right the first time 'round, and having to re-reload the config.)
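
A rough way to pull the check spacing and the attempt counter out of a
log excerpt like the one above. This is only a sketch and not part of the
attached tarball: the function names stale_gaps and alert_attempts are
made up, and it assumes lines start with HH:MM:SS as in the excerpt (the
real nagios.log carries [epoch] timestamps instead) with no midnight in
between.

```shell
# Print the seconds between consecutive "stale" warnings, i.e. between
# forced freshness checks. Reads the named files, or stdin if none given.
stale_gaps() {
  awk '/are stale by/ {
         split($1, t, ":")                       # HH:MM:SS -> t[1..3]
         now = t[1] * 3600 + t[2] * 60 + t[3]
         if (seen) print now - prev
         prev = now; seen = 1
       }' "$@"
}

# Print "STATETYPE;ATTEMPT" for every HOST ALERT line.
alert_attempts() {
  awk -F';' '/HOST ALERT:/ { print $3 ";" $4 }' "$@"
}
```

Fed the excerpt above, stale_gaps prints 978 once and then 960 eight
times, and alert_attempts prints SOFT;1, SOFT;2, SOFT;2, SOFT;3, HARD;4,
which makes the repeated attempt 2 easy to spot.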


I took excerpts of status.dat and retention.dat initially and after each
of the first nine active checks. Look at these current_attempt numbers:


# for FIL in *.dat* ; do echo -n "${FIL}:  " | \
> sed -e 's/_[a-z]*-/-/' -e 's/\.[a-z]*: */:/' ; \
> egrep '(current_attempt|state_type|(current|last_hard)_state=)' \
> $FIL | sed -e 's/\([a-z][a-z][a-z]\)[a-z]*\([_=]\)/\1\2/g' | \
> tr '\n\t' '  ' ; echo "" ; done
retention.dat-OK:  cur_sta=0  las_har_sta=0  cur_att=1  sta_typ=1
retention.dat-1:   cur_sta=0  las_har_sta=0  cur_att=1  sta_typ=1
retention.dat-2:   cur_sta=1  las_har_sta=0  cur_att=1  sta_typ=0
retention.dat-3:   cur_sta=1  las_har_sta=0  cur_att=2  sta_typ=0
retention.dat-4:   cur_sta=1  las_har_sta=0  cur_att=2  sta_typ=0
retention.dat-5:   cur_sta=1  las_har_sta=0  cur_att=2  sta_typ=0
retention.dat-6:   cur_sta=1  las_har_sta=0  cur_att=4  sta_typ=1
retention.dat-7:   cur_sta=1  las_har_sta=0  cur_att=4  sta_typ=1
retention.dat-8:   cur_sta=1  las_har_sta=0  cur_att=4  sta_typ=1
retention.dat-9:   cur_sta=1  las_har_sta=0  cur_att=4  sta_typ=1
status.dat-OK:     cur_sta=0  las_har_sta=0  cur_att=1  sta_typ=1
status.dat-1:      cur_sta=1  las_har_sta=0  cur_att=1  sta_typ=0
status.dat-2:      cur_sta=1  las_har_sta=0  cur_att=2  sta_typ=0
status.dat-3:      cur_sta=1  las_har_sta=0  cur_att=2  sta_typ=0
status.dat-4:      cur_sta=1  las_har_sta=0  cur_att=3  sta_typ=0
status.dat-5:      cur_sta=1  las_har_sta=0  cur_att=4  sta_typ=1
status.dat-6:      cur_sta=1  las_har_sta=1  cur_att=1  sta_typ=1
status.dat-7:      cur_sta=1  las_har_sta=1  cur_att=1  sta_typ=1
status.dat-8:      cur_sta=1  las_har_sta=1  cur_att=1  sta_typ=1
status.dat-9:      cur_sta=1  las_har_sta=1  cur_att=1  sta_typ=1


extinfo.cgi told me "1/4 (SOFT state)" at 19:03 (after the *2nd* active
check, i.e., matching the data in retention.dat) but tells me "1/4 (HARD
state)" right now (matching status.dat instead) ...
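
For anyone who wants to repeat the comparison on full status.dat and
retention.dat files rather than hand-cut excerpts, something along these
lines should do. It is a sketch under assumptions: host_fields is a
made-up name, and it relies on the usual block layout ("hoststatus {" in
status.dat, "host {" in retention.dat, one key=value per line, closed by
a "}" line).

```shell
# Print current_state, last_hard_state, current_attempt and state_type
# for one host, in the order they appear in the file.
# Usage: host_fields HOSTNAME FILE
host_fields() {
  awk -v host="$1" '
    /^(hoststatus|host) [{]/ { inblk = 1; name = ""; out = ""; next }
    inblk && /^[ \t]*[}]/    { if (name == host) print out; inblk = 0; next }
    inblk {
      line = $0
      sub(/^[ \t]+/, "", line)          # status.dat indents with a tab
      eq = index(line, "=")
      if (eq == 0) next
      key = substr(line, 1, eq - 1)
      val = substr(line, eq + 1)
      if (key == "host_name") name = val
      if (key ~ /^(current_state|last_hard_state|current_attempt|state_type)$/)
        out = out (out == "" ? "" : " ") key "=" val
    }' "$2"
}
```

Running host_fields Unfresh against both files right after each check
gives one line per file, so a current_attempt mismatch between the two
shows up directly.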


Kind regards,
								J. Bern
-- 
Jochen Bern, Systemingenieur --- LINworks GmbH <http://www.LINworks.de/>
Postfach 100121, 64201 Darmstadt | Robert-Koch-Str. 9, 64331 Weiterstadt
PGP (1024D/4096g) FP = D18B 41B1 16C0 11BA 7F8C DCF7 E1D5 FAF4 444E 1C27
Tel. +49 6151 9067-231, Zentr. -0, Fax -299 - Amtsg. Darmstadt HRB 85202
Unternehmenssitz Weiterstadt, Geschäftsführer Metin Dogan, Oliver Michel
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Unfresh.tgz
Type: application/x-compressed-tar
Size: 20593 bytes
Desc: not available
URL: <https://www.monitoring-lists.org/archive/developers/attachments/20101216/786f493d/attachment.bin>