Nagios sometimes shows wrong status

Jon Angliss jon at netdork.net
Thu May 28 03:04:55 CEST 2009


On Wed, 27 May 2009 10:52:32 +0200 (CEST), "Michael Prochaska"
<michael at prochas.net> wrote:

>Hi!
>
>I've seen a strange behavior of nagios with a very simple check script.
>
>the relevant part of the script:
>#########################################################################
>MAINTCNT="`/usr/sbin/metastat |grep -i maint |wc -l`"
>RESYNCNT="`/usr/sbin/metastat |grep -i resync |wc -l`"


Whilst I'm not sure it has anything to do with your issue, nagios
usually executes scripts with little or no environment defined, which
means PATH may not be set the way it is in your login shell, and a bare
"grep" or "wc" may not be found.  You should use the full path to
executables whenever possible.  Of course, you said "relevant part of
the script", which could imply you've defined the path earlier on.
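
If it helps, here is a minimal sketch of what I mean by defining the
path up front.  The directory list and the require_tools helper are my
own assumptions for illustration, not something from your script, so
adjust them to wherever grep, wc and metastat live on your box.

```shell
#!/bin/sh
# Assumption: these directories hold the tools the check needs.
# They are placed first; anything already in PATH is kept after them.
PATH="/usr/sbin:/usr/bin:/bin${PATH:+:$PATH}"
export PATH

# require_tools (hypothetical helper): return 3 (UNKNOWN in the nagios
# plugin convention) if any named command cannot be resolved.
require_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "UNKNOWN - $tool not found in PATH=$PATH"
            return 3
        fi
    done
    return 0
}

# Typical use near the top of the check script:
#   require_tools metastat grep wc || exit 3
```

That way a missing tool shows up as an explicit UNKNOWN with a reason,
instead of an empty variable further down the script.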

>NOTOK=0
>status=$STATE_UNKNOWN
>
>if [ $RESYNCNT -gt 0 ]; then
>        NOTOK=1
>        TEXT="WARNING - One or more disks are in resync state. "
>        status=$STATE_WARNING
>fi
>
>if [ $MAINTCNT -gt 0 ]; then
>        NOTOK=1
>        TEXT="CRITICAL - One or more disks are in maintenance state."
>        status=$STATE_CRITICAL
>fi
>
>
>if [ $NOTOK -eq 1 ]; then
>        echo $TEXT
>        datum=`date`
>        echo $datum $status >> /tmp/svm.debug
>        exit $status
>fi
>
>echo "OK - There is no maintenance necessary!"
>exit $STATE_OK
>
>#########################################################################
>
>when executing the script from command line, the return code always is 2
>and the output always is "CRITICAL - One or more disks are in maintenance
>state." (because there is one dead disk) => that's ok
>
>when nagios executes the script, the output always is "CRITICAL - One or
>more disks are in maintenance state." but the return code sometimes is 0
>and sometimes is 2 => that's not good
>
>snippet from nagios.log:
>[1243410051] SERVICE ALERT: acgweb1;BASIC_SVM;CRITICAL;SOFT;1;CRITICAL -
>One or more disks are in maintenance state.
>[1243410063] EXTERNAL COMMAND:
>SCHEDULE_SVC_CHECK;acgweb1;BASIC_SVM;1243410061
>[1243410071] SERVICE ALERT: acgweb1;BASIC_SVM;OK;SOFT;2;CRITICAL - One or
>more disks are in maintenance state.
>[1243410083] EXTERNAL COMMAND:
>SCHEDULE_SVC_CHECK;acgweb1;BASIC_SVM;1243410081
>[1243410091] SERVICE ALERT: acgweb1;BASIC_SVM;CRITICAL;SOFT;1;CRITICAL -
>One or more disks are in maintenance state.
>[1243410124] EXTERNAL COMMAND:
>SCHEDULE_SVC_CHECK;acgweb1;BASIC_SVM;1243410122
>[1243410131] SERVICE ALERT: acgweb1;BASIC_SVM;OK;SOFT;2;CRITICAL - One or
>more disks are in maintenance state.
>[1243411031] SERVICE ALERT: acgweb1;BASIC_SVM;CRITICAL;SOFT;1;CRITICAL -
>One or more disks are in maintenance state.
>[1243411316] SERVICE ALERT: acgweb1;BASIC_SVM;OK;SOFT;2;CRITICAL - One or
>more disks are in maintenance state.
>[1243411323] EXTERNAL COMMAND:
>SCHEDULE_SVC_CHECK;acgweb1;BASIC_SVM;1243411320
>[1243411326] SERVICE ALERT: acgweb1;BASIC_SVM;CRITICAL;SOFT;1;CRITICAL -
>One or more disks are in maintenance state.
>[1243411363] EXTERNAL COMMAND:
>SCHEDULE_SVC_CHECK;acgweb1;BASIC_SVM;1243411361
>[1243411366] SERVICE ALERT: acgweb1;BASIC_SVM;OK;SOFT;2;CRITICAL - One or
>more disks are in maintenance state.
>[1243411370] EXTERNAL COMMAND:
>SCHEDULE_SVC_CHECK;acgweb1;BASIC_SVM;1243411368
>[1243411376] SERVICE ALERT: acgweb1;BASIC_SVM;CRITICAL;SOFT;1;CRITICAL -
>One or more disks are in maintenance state.
>[1243411391] EXTERNAL COMMAND:
>SCHEDULE_SVC_CHECK;acgweb1;BASIC_SVM;1243411389
>[1243411396] SERVICE ALERT: acgweb1;BASIC_SVM;CRITICAL;SOFT;2;CRITICAL -
>One or more disks are in maintenance state.
>[1243411398] EXTERNAL COMMAND:
>SCHEDULE_SVC_CHECK;acgweb1;BASIC_SVM;1243411396
>[1243411406] SERVICE ALERT: acgweb1;BASIC_SVM;CRITICAL;SOFT;3;CRITICAL -
>One or more disks are in maintenance state.
>[1243411407] EXTERNAL COMMAND:
>SCHEDULE_SVC_CHECK;acgweb1;BASIC_SVM;1243411405
>
>
>
>/tmp/svm.debug confirms the command line result:
>> cat /tmp/svm.debug
>Wed May 27 08:21:33 GMT 2009 2
>Wed May 27 08:22:28 GMT 2009 2
>Wed May 27 08:22:39 GMT 2009 2
>Wed May 27 08:22:46 GMT 2009 2
>Wed May 27 08:23:00 GMT 2009 2
>Wed May 27 08:23:11 GMT 2009 2
>Wed May 27 08:23:46 GMT 2009 2
>Wed May 27 08:24:01 GMT 2009 2
>Wed May 27 08:27:09 GMT 2009 2
>Wed May 27 08:27:19 GMT 2009 2
>Wed May 27 08:27:35 GMT 2009 2
>Wed May 27 08:27:50 GMT 2009 2
>Wed May 27 08:27:56 GMT 2009 2
>Wed May 27 08:29:01 GMT 2009 2
>Wed May 27 08:32:55 GMT 2009 2
>Wed May 27 08:34:01 GMT 2009 2
>Wed May 27 08:37:55 GMT 2009 2
>Wed May 27 08:39:01 GMT 2009 2
>Wed May 27 08:39:55 GMT 2009 2
>Wed May 27 08:44:01 GMT 2009 2
>Wed May 27 08:44:55 GMT 2009 2
>
>and so on.....
>
>any ideas what's going wrong here?

Definitely an odd outcome.  What about dumping the output of the
metastat calls, and the variables you've assigned?  That way you can
see what nagios is actually seeing, i.e.:

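echo "====" >> /tmp/svm.debug
datum=`date`
# Capture the raw metastat output (stderr too) as nagios sees it
META=`/usr/sbin/metastat 2>&1`
echo "${datum} metastat output:" >> /tmp/svm.debug
echo "${META}" >> /tmp/svm.debug
MAINTCNT="`/usr/sbin/metastat |grep -i maint |wc -l`"
RESYNCNT="`/usr/sbin/metastat |grep -i resync |wc -l`"
# Label each counter so the log lines can be told apart
echo "${datum} MAINTCNT=${MAINTCNT}" >> /tmp/svm.debug
echo "${datum} RESYNCNT=${RESYNCNT}" >> /tmp/svm.debug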

[.. rest of script ..]
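
A variation you could try, purely as a sketch: run metastat once, keep
the output, and derive both counters from that same text, so the debug
dump and the counts cannot disagree.  The count_matches helper and the
SAMPLE text below are made up for illustration; SAMPLE is not real
metastat output.

```shell
#!/bin/sh
# count_matches PATTERN TEXT (hypothetical helper): print the number of
# lines in TEXT that contain PATTERN, ignoring case.
count_matches() {
    # grep -c exits 1 when the count is 0; ignore that so the printed
    # count is the only result callers need to look at.
    printf '%s\n' "$2" | grep -ci -- "$1" || true
}

# Made-up sample text standing in for the captured metastat output.
SAMPLE='d10: Mirror
    State: Needs maintenance
    State: Okay'

MAINTCNT=`count_matches maint "$SAMPLE"`
RESYNCNT=`count_matches resync "$SAMPLE"`
echo "MAINTCNT=${MAINTCNT} RESYNCNT=${RESYNCNT}"
```

In the real script you would pass the captured ${META} instead of
SAMPLE.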

-- 
Jonathan Angliss
<jon at netdork.net>

