Could someone explain the correct behaviour for freshness checks, please?

Artur D'Assumpção artur.dassumpcao at di.com.pt
Mon Apr 11 13:32:26 CEST 2005


Hi,

Could anyone explain one thing related to this thread? I've been
trying a few different options over the weekend, and I've tried this:

Both the central monitoring server and the distributed server have

service_freshness_check_interval=60

and all services defined in the main configuration have a freshness
threshold of 999999 (the idea being "near" infinite). What I was expecting
was that each time the freshness check was triggered on the central server
(every 60s), the last service status received would always pass the
threshold, because it never expires. What I am seeing is a little
different: each time the 60s check triggers, the services are set to
UNKNOWN (the check_command does this because the result is considered
stale), and later the services return to their real status when the
distributed server's 2-minute submissions arrive. This behaviour repeats
at every freshness check.
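
To make that expectation concrete, here is a minimal sketch in Python of the
rule as I understand it. It is not Nagios code; the function name is_stale
and the simulated timeline are my own assumptions, with results submitted
every 120s and freshness tested every 60s:

```python
def is_stale(last_check: float, now: float, freshness_threshold: int) -> bool:
    """Return True when the last result is older than the threshold (seconds)."""
    return (now - last_check) > freshness_threshold


# Hypothetical timeline, not taken from a real Nagios run.
THRESHOLD = 999999        # the "near infinite" threshold from my configuration
SUBMIT_EVERY = 120        # distributed server submits a result every 120s
FRESHNESS_EVERY = 60      # service_freshness_check_interval=60

last_check = 0.0
stale_ticks = []
for now in range(0, 601, FRESHNESS_EVERY):
    if now % SUBMIT_EVERY == 0:
        last_check = float(now)      # a passive result arrives
    if is_stale(last_check, now, THRESHOLD):
        stale_ticks.append(now)      # this is where service-is-stale would run

print(stale_ticks)
```

With a threshold of 999999 the list of stale ticks should stay empty, which
is why the UNKNOWN states surprise me.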

Maybe I am misinterpreting the freshness behaviour, or maybe I'm
configuring it wrong... can anyone give me a tip here, please?

I've already upgraded to the latest version, 2.0b3.

Thanks,

AD


Artur D'Assumpção wrote:

> Could someone explain the correct behavior for freshness
> checks, please? It's driving me crazy.
>
> The main configuration has:
>
> service_freshness_check_interval=60
>
> So I suppose that this defines the rate at which service
> freshness is checked.
>
> Then, for every service I use the same template, where I have the 
> following configurations:
>
>
>    check_command            service-is-stale
>    check_freshness          1
>    freshness_threshold      300
>    parallelize_check        1
>    max_check_attempts       2
>    normal_check_interval    2
>    retry_check_interval     2
>
> So the logical behavior, to me, is that every time Nagios triggers
> a freshness check (every 60s in this case), if the last submitted check
> result for a given service is more than 300s old, it will declare
> that service stale and run service-is-stale. Now, I'm pretty sure
> that results are being fed in about every 120s, and yet I'm seeing a
> lot of status changes from OK to UNKNOWN (returned by
> service-is-stale)! Here are some interesting logs:
>
> Apr 10 16:26:17 sr-0 nsca[13018]: SERVICE CHECK -> Host Name: 
> 'compal.pt_sfci-dr-1', Service Description: '[SYS] System Load', 
> Return Code: '1', Output: 'WARNING - load average: 1.00, 1.00, 1.00'
> Apr 10 16:26:47 sr-0 nsca[30732]: SERVICE CHECK -> Host Name: 
> 'compal.pt_sfci-dr-1', Service Description: '[SYS] Disk Usage', Return 
> Code: '0', Output: 'DISK OK - free space: / 3692 MB (64%):'
> Apr 10 16:27:07 sr-0 nsca[23332]: SERVICE CHECK -> Host Name: 
> 'compal.pt_sfci-dr-1', Service Description: '[SYS] Swap Usage', Return 
> Code: '0', Output: 'SWAP OK: 100% free (494 MB out of 494 MB)'
> Apr 10 16:27:10 sr-0 nagios: EXTERNAL COMMAND: 
> PROCESS_SERVICE_CHECK_RESULT;compal.pt_sfci-dr-1;[SYS] System 
> Load;1;WARNING - load average: 1.00, 1.00, 1.00
> Apr 10 16:27:10 sr-0 nagios: EXTERNAL COMMAND: 
> PROCESS_SERVICE_CHECK_RESULT;compal.pt_sfci-dr-1;[SYS] Disk 
> Usage;0;DISK OK - free space: / 3692 MB (64%):
> Apr 10 16:27:10 sr-0 nagios: EXTERNAL COMMAND: 
> PROCESS_SERVICE_CHECK_RESULT;compal.pt_sfci-dr-1;[SYS] Swap 
> Usage;0;SWAP OK: 100% free (494 MB out of 494 MB)
> Apr 10 16:27:37 sr-0 nsca[29813]: SERVICE CHECK -> Host Name: 
> 'compal.pt_sfci-dr-1', Service Description: '[SRV] SSH', Return Code: 
> '0', Output: 'SSH OK - OpenSSH_3.9p1 (protocol 2.0)'
> Apr 10 16:28:08 sr-0 nsca[17504]: SERVICE CHECK -> Host Name: 
> 'compal.pt_sfci-dr-1', Service Description: '[SYS] Interfaces', Return 
> Code: '0', Output: 'OK - interfaces lo eth0 tun0 are up'
> Apr 10 16:28:10 sr-0 nagios: EXTERNAL COMMAND: 
> PROCESS_SERVICE_CHECK_RESULT;compal.pt_sfci-dr-1;[SRV] SSH;0;SSH OK - 
> OpenSSH_3.9p1 (protocol 2.0)
> Apr 10 16:28:10 sr-0 nagios: EXTERNAL COMMAND: 
> PROCESS_SERVICE_CHECK_RESULT;compal.pt_sfci-dr-1;[SYS] Interfaces;0;OK 
> - interfaces lo eth0 tun0 are up
> Apr 10 16:28:17 sr-0 nsca[13184]: SERVICE CHECK -> Host Name: 
> 'compal.pt_sfci-dr-1', Service Description: '[SYS] System Load', 
> Return Code: '1', Output: 'WARNING - load average: 1.00, 1.00, 1.00'
>
> ---- SERVICES WERE OK UP TO THIS POINT ----
> ---- SERVICES CHANGED TO UNKNOWN AFTER THIS NEXT BLOCK ----
>
> Apr 10 16:28:17 sr-0 nagios: Warning: The results of service '[SYS] 
> Disk Usage' on host 'compal.pt_sfci-dr-1' are stale by 40 seconds 
> (threshold=500 seconds).  I'm forcing an immediate check of the service.
> Apr 10 16:28:17 sr-0 nagios: Warning: The results of service '[SYS] 
> Swap Usage' on host 'compal.pt_sfci-dr-1' are stale by 40 seconds 
> (threshold=500 seconds).  I'm forcing an immediate check of the service.
> Apr 10 16:28:47 sr-0 nsca[10633]: SERVICE CHECK -> Host Name: 
> 'compal.pt_sfci-dr-1', Service Description: '[SYS] Disk Usage', Return 
> Code: '0', Output: 'DISK OK - free space: / 3692 MB (64%):'
> Apr 10 16:29:07 sr-0 nsca[6978]: SERVICE CHECK -> Host Name: 
> 'compal.pt_sfci-dr-1', Service Description: '[SYS] Swap Usage', Return 
> Code: '0', Output: 'SWAP OK: 100% free (494 MB out of 494 MB)'
> Apr 10 16:29:10 sr-0 nagios: EXTERNAL COMMAND: 
> PROCESS_SERVICE_CHECK_RESULT;compal.pt_sfci-dr-1;[SYS] System 
> Load;1;WARNING - load average: 1.00, 1.00, 1.00
> Apr 10 16:29:10 sr-0 nagios: EXTERNAL COMMAND: 
> PROCESS_SERVICE_CHECK_RESULT;compal.pt_sfci-dr-1;[SYS] Disk 
> Usage;0;DISK OK - free space: / 3692 MB (64%):
> Apr 10 16:29:10 sr-0 nagios: EXTERNAL COMMAND: 
> PROCESS_SERVICE_CHECK_RESULT;compal.pt_sfci-dr-1;[SYS] Swap 
> Usage;0;SWAP OK: 100% free (494 MB out of 494 MB)
>
> The second block of logs, and correct me if I'm wrong, shows me that
> something is not OK here. First of all, services are being considered
> stale less than 2 minutes after a submitted check:
>
> Apr 10 16:27:07 sr-0 nsca[23332]: SERVICE CHECK -> Host Name: 
> 'compal.pt_sfci-dr-1', Service Description: '[SYS] Swap Usage', Return 
> Code: '0', Output: 'SWAP OK: 100% free (494 MB out of 494 MB)'
>
> Apr 10 16:28:17 sr-0 nagios: Warning: The results of service '[SYS] 
> Swap Usage' on host 'compal.pt_sfci-dr-1' are stale by 40 seconds 
> (threshold=500 seconds).  I'm forcing an immediate check of the service.
>
> Then I'm looking at a 500s threshold and a 40s staleness that I've never
> defined, and I'm sure of this, because all my objects, and there
> are very few in this testing environment, use the same template that
> I've shown above. Could this be some default value that is not being
> overridden? If it is, I can't find any reference to it in the
> documentation.
>
> I'm using Nagios 2.0b.
>
> I'd be very thankful for some help on this subject, please.
>
> AD
>
> _______________________________________________
> Nagios-users mailing list
> Nagios-users at lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/nagios-users
> ::: Please include Nagios version, plugin version (-v) and OS when 
> reporting any issue. ::: Messages without supporting info will risk 
> being sent to /dev/null



