Strange service checks behavior

Sandro Vaz - UOL sandromergvaz at uol.com.br
Thu Dec 2 03:29:07 CET 2004


1) Nagios 1-x-cvs (almost 1.3).

2) Compiled with --with-nagios-user=nagios --with-nagios-grp=nagios 
--with-file-perfdata

3) Cut and pasted from the GUI, because it shows the date and time more 
readably, but I also checked nagios.log and it contains the same 
information. Host and service names were replaced for privacy reasons.

4) I don't know specifically, but my server is synchronized with other 
NTP servers.

5) Just one instance.

This is the same problem described by Bastian Brinkmann on the Nagios 
users list on Wed, 16 Apr 2003 04:49:12 -0300, which never received a 
single answer.

I don't know why, but these questions about service checks almost never 
get a response (thank you very much for your help, Andreas). Yesterday 
(I guess) a listmate asked why service checks happened one second after 
a host had gone into a hard down state (which I believe is the opposite 
of what we read in the manuals), and I've also found this in my log 
files.

Once again, thanks.

SMV


Andreas Ericsson wrote:

> Sandro Vaz - UOL wrote:
>
>> Folks:
>>
>> I've read the f... manual, "State Types" section, but I can't 
>> understand why there is no hard recovery after a hard problem, 
>> generating wrong availability reports. Let me show you what's in my 
>> log files...
>>
>> Example 1) After a hard problem (6:38:00) we have a weird soft 
>> problem (6:42:32) and then a soft recovery (06:52:18). I can't find 
>> the following hard recovery in the logs. Is this correct?
>>
>
> Not by a long shot, but a little more info is needed before anyone can 
> correctly answer your question (at least without guessing).
>
> What version of nagios are you using? How did you compile it?
>
> Are you sure that this is the way the logs are or did you cut and 
> paste from the GUI?
>
> Did you replace the hostname and service description with other values 
> before posting? If so, how did you do that?
>
> Do you regularly run ntpdate from cron? If so, how do you sync the 
> server you run ntpdate against?
>
> Are you sure you don't have several instances of Nagios running (it's 
> supposed to fork, so don't get spooked if there are several processes)?
>
>>    November 30, 2004 06:00  [30-11-2004 06:52:18] SERVICE ALERT: 
>> Client-A-Host-2;Service-X;OK;SOFT;5;OK - 10 enviados, 10 recebidos, 
>> 0% pacotes perdidos
>> [30-11-2004 06:51:32] SERVICE ALERT: 
>> Client-A-Host-2;Service-X;WARNING;SOFT;4;CRITIAL - 10 enviados, 7 
>> recebidos, 30% pacotes perdidos
>> [30-11-2004 06:51:10] SERVICE ALERT: 
>> Client-A-Host-2;Service-X;CRITICAL;SOFT;3;CRITICAL - 10 enviados, 0 
>> recebidos, 100% pacotes perdidos
>> [30-11-2004 06:50:04] SERVICE ALERT: 
>> Client-A-Host-2;Service-X;CRITICAL;SOFT;2;CRITICAL - 10 enviados, 0 
>> recebidos, 100% pacotes perdidos
>> [30-11-2004 06:49:04] SERVICE ALERT: 
>> Client-A-Host-2;Service-X;CRITICAL;SOFT;1;CRITICAL - 10 enviados, 0 
>> recebidos, 100% pacotes perdidos
>> [30-11-2004 06:43:04] SERVICE ALERT: 
>> Client-A-Host-2;Service-X;OK;SOFT;2;OK - 10 enviados, 10 recebidos, 
>> 0% pacotes perdidos
>> [30-11-2004 06:42:32] SERVICE ALERT: 
>> Client-A-Host-2;Service-X;CRITICAL;SOFT;1;CRITICAL - 10 enviados, 4 
>> recebidos, 60% pacotes perdidos
>> [30-11-2004 06:38:00] SERVICE ALERT: 
>> Client-A-Host-2;Service-X;CRITICAL;HARD;1;(Service Check Timed Out)
>>
>> Example 2) From 8:19:42 through 8:24:02 we have a hard problem and 
>> a hard recovery, which is correct. After that we had a hard problem 
>> (8:41:54) and then a bizarre soft critical (8:53:26), which I can't 
>> explain. At 8:57:32 we have a soft recovery. Again I can't find the 
>> hard recovery in the log files...
>>
>>    November 30, 2004 08:00  [30-11-2004 08:57:32] SERVICE ALERT: 
>> Client-A-Host-2;Service-X;OK;SOFT;6;OK - 10 enviados, 10 recebidos, 
>> 0% pacotes perdidos
>> [30-11-2004 08:57:24] SERVICE ALERT: 
>> Client-A-Host-2;Service-X;CRITICAL;SOFT;5;CRITICAL - 10 enviados, 0 
>> recebidos, 100% pacotes perdidos
>> [30-11-2004 08:56:22] SERVICE ALERT: 
>> Client-A-Host-2;Service-X;CRITICAL;SOFT;4;CRITICAL - 10 enviados, 0 
>> recebidos, 100% pacotes perdidos
>> [30-11-2004 08:55:22] SERVICE ALERT: 
>> Client-A-Host-2;Service-X;CRITICAL;SOFT;3;CRITICAL - 10 enviados, 0 
>> recebidos, 100% pacotes perdidos
>> [30-11-2004 08:54:22] SERVICE ALERT: 
>> Client-A-Host-2;Service-X;CRITICAL;SOFT;2;CRITICAL - 10 enviados, 0 
>> recebidos, 100% pacotes perdidos
>> [30-11-2004 08:53:26] SERVICE ALERT: 
>> Client-A-Host-2;Service-X;CRITICAL;SOFT;1;(Service Check Timed Out)
>> [30-11-2004 08:41:54] SERVICE ALERT: 
>> Client-A-Host-2;Service-X;CRITICAL;HARD;1;(Service Check Timed Out)
>> [30-11-2004 08:24:02] SERVICE ALERT: 
>> Client-A-Host-2;Service-X;OK;HARD;1;OK - 10 enviados, 9 recebidos, 
>> 10% pacotes perdidos
>> [30-11-2004 08:19:42] SERVICE ALERT: 
>> Client-A-Host-2;Service-X;CRITICAL;HARD;1;(Service Check Timed Out)
>>
>> Analyzing these 2 situations, we have a wrong critical period 
>> (8:41:54 through 13:57:43, where we finally have a hard recovery). 
>> Could some good soul explain this behavior? Without correct logs, 
>> Nagios will generate unreliable availability reports, because it 
>> uses only hard states to produce them.
>>
>> TIA,
>>
>> SMV
>>
>>
>>
>
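
For anyone who wants to check their own logs for the same pattern, here 
is a minimal Python sketch, not anything that ships with Nagios. It 
assumes GUI-style lines in the format quoted above, each alert on a 
single unwrapped line, and a single host/service pair. The names 
parse_alerts and hard_problem_periods are made up for this example. It 
looks only at HARD entries, following the claim above that availability 
reports use hard states only, and lists each hard problem together with 
the hard recovery that follows it, if there is one.

```python
import re
from datetime import datetime

# GUI-style alert line, e.g.
# [30-11-2004 08:41:54] SERVICE ALERT: host;service;CRITICAL;HARD;1;output
ALERT_RE = re.compile(
    r"\[(\d{2}-\d{2}-\d{4} \d{2}:\d{2}:\d{2})\] SERVICE ALERT: "
    r"([^;]+);([^;]+);([^;]+);(SOFT|HARD);(\d+);(.*)"
)

# Excerpt from Example 2 in this thread, rejoined onto single lines.
SAMPLE = """\
[30-11-2004 08:57:32] SERVICE ALERT: Client-A-Host-2;Service-X;OK;SOFT;6;OK - 10 enviados, 10 recebidos, 0% pacotes perdidos
[30-11-2004 08:53:26] SERVICE ALERT: Client-A-Host-2;Service-X;CRITICAL;SOFT;1;(Service Check Timed Out)
[30-11-2004 08:41:54] SERVICE ALERT: Client-A-Host-2;Service-X;CRITICAL;HARD;1;(Service Check Timed Out)
[30-11-2004 08:24:02] SERVICE ALERT: Client-A-Host-2;Service-X;OK;HARD;1;OK - 10 enviados, 9 recebidos, 10% pacotes perdidos
[30-11-2004 08:19:42] SERVICE ALERT: Client-A-Host-2;Service-X;CRITICAL;HARD;1;(Service Check Timed Out)
"""


def parse_alerts(text):
    """Parse SERVICE ALERT lines into dicts, sorted oldest first."""
    alerts = []
    for m in ALERT_RE.finditer(text):
        stamp, host, service, state, state_type, attempt, output = m.groups()
        alerts.append({
            "time": datetime.strptime(stamp, "%d-%m-%Y %H:%M:%S"),
            "host": host,
            "service": service,
            "state": state,
            "type": state_type,
            "attempt": int(attempt),
            "output": output,
        })
    alerts.sort(key=lambda a: a["time"])
    return alerts


def hard_problem_periods(alerts):
    """Return (start, end) pairs built from HARD entries only.

    end is None when no HARD OK entry follows the hard problem.
    Assumes all alerts belong to one host/service pair.
    """
    periods = []
    start = None
    for a in alerts:
        if a["type"] != "HARD":
            continue  # soft states are ignored, as in the reports
        if a["state"] != "OK" and start is None:
            start = a["time"]
        elif a["state"] == "OK" and start is not None:
            periods.append((start, a["time"]))
            start = None
    if start is not None:
        periods.append((start, None))
    return periods


for start, end in hard_problem_periods(parse_alerts(SAMPLE)):
    if end is None:
        print(f"{start}: hard problem, no hard recovery in this excerpt")
    else:
        print(f"{start}: hard problem lasted {end - start}")
```

On the excerpt above it reports a hard problem of 4 minutes 20 seconds 
starting at 08:19:42 and an open one starting at 08:41:54, which is the 
period that the soft recovery at 08:57:32 never closes.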


_______________________________________________
Nagios-users mailing list
Nagios-users at lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nagios-users