Nagios false positive returns critical but service is really OK

Bryn Smith bryn at aas.duke.edu
Mon Feb 12 20:01:41 CET 2007


Morris, Patrick wrote:
>>>> Hi all,
>>>> I've got a setup of CentOS 4.4 and nagios-2.3.1, and I have one 
>>>> server that started reporting critical errors over the weekend.  I 
>>>> logged in and checked the server, and it was fine.  I restarted 
>>>> nagios, and the problem state persisted,
>>>>     
>>>>         
>>> I'd take a look at the Nagios logs to see why it was reporting them 
>>> down. The service output should also give you an idea.
>>>
>>>   
>>>       
>> I'd love to, but I can't find them.  Where do they usually 
>> live?  And ditto for the service output.
>>     
>
> Your logging options are set in your Nagios config. On mine, they're in
> /var/log/nagios.log and syslog. You may want to consider reading through
> the documentation sections on logging for details on where to look.
>
> The service output will be there, and in the web interface.
>
>   
I did find it in /usr/local/nagios/var, which I guess I could have 
expected.
This is all it says:
[1171170000] CURRENT SERVICE STATE: aas4.aas.duke.edu;NonBackupLoad;CRITICAL;HARD;3;<A href="https://bb.aas.duke.edu/nagios/mon_data/load/loadmsg152.3.56.13.txt">Cannot Obtain Load Info, please check on 152.3.56.13
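
For readers unfamiliar with the log format, the entry above is a bracketed epoch timestamp followed by semicolon-separated fields: host, service description, state, state type, attempt number, and plugin output. A minimal sketch of splitting such a line in plain sh follows; `parse_state_line` is a hypothetical helper name, not something shipped with Nagios, and the plugin output in the sample call is shortened.

```shell
#!/bin/sh
# Split a Nagios state log entry into its fields.
# parse_state_line is a hypothetical helper, not part of Nagios.
parse_state_line() {
    line=$1
    # The epoch timestamp sits between the leading brackets.
    epoch=${line#\[}
    epoch=${epoch%%\]*}
    # Everything after "STATE: " is host;service;state;type;attempt;output.
    rest=${line#*STATE: }
    old_ifs=$IFS
    IFS=';'
    set -- $rest
    IFS=$old_ifs
    echo "epoch=$epoch host=$1 service=$2 state=$3 type=$4 attempt=$5"
}

parse_state_line '[1171170000] CURRENT SERVICE STATE: aas4.aas.duke.edu;NonBackupLoad;CRITICAL;HARD;3;Cannot Obtain Load Info'
```

The state type (HARD) and attempt number (3) match max_check_attempts in the service definition, meaning the check failed on all three tries before Nagios recorded the problem.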

I'm seeing this type of error on Disk Usage, Load (backup instance and 
regular instance), Procs, Swap, xinetd, and Logins.  All of the plug-ins 
were written in-house long before I started working here, and this 
machine is in the same services.cfg grouping as every other machine, 
which is to say that the grouping looks like this:

define service{
        use                             generic-service         ; Name of service template to use
        host_name                       1stmachine.duke.edu,2ndmachine.duke.edu,3rdmachine.duke.edu,<snip about 30 more machine names>
        service_description             DiskUsage
        is_volatile                     0
        check_period                    24x7
        max_check_attempts              3
        normal_check_interval           10
        retry_check_interval            1
        contact_groups                  admins
        notification_interval           60
        notification_period             24x7
        notification_options            w,u,c,r
        check_command                   snmp_check_load_external_url
        }

It's almost like the check_command is failing, but it isn't - when I run 
it from the command line, it works just fine.  The homegrown script just 
does an *ssh -l nagiosuser at 1stmachine.duke.edu df* and then parses the 
output for the percentage of disk space used.
The key is fine, I've checked.  I can run the command above from the 
command line and it returns an answer, and I can even run the shell 
script that snmp_check_load_external_url calls (disk.sh), and THAT 
returns an answer too.  It's like it just doesn't make it into the web 
interface, even though it does for every other machine in that service 
definition (and, of course, worked for this one until yesterday).
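
The actual disk.sh is not shown in this thread, so the following is only a rough sketch of what a check of this kind might look like. The function names and the 80/90 percent thresholds are assumptions made for illustration; the exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) follow the standard Nagios plugin convention.

```shell
#!/bin/sh
# Hypothetical stand-in for disk.sh; the real script is not shown here.

# Read `df -P` output on stdin and print the highest Capacity value seen.
# Prints nothing if there were no data lines.
max_disk_pct() {
    awk 'NR > 1 { sub(/%/, "", $5); if ($5 + 0 > max) max = $5 + 0 }
         END { if (NR > 1) print max + 0 }'
}

# Map a percentage to a Nagios plugin status line and return code.
# The 80/90 thresholds are assumed, not taken from the original script.
disk_status() {
    pct=$1
    if [ -z "$pct" ]; then
        echo "UNKNOWN: no df output"; return 3
    elif [ "$pct" -ge 90 ]; then
        echo "CRITICAL: ${pct}% used"; return 2
    elif [ "$pct" -ge 80 ]; then
        echo "WARNING: ${pct}% used"; return 1
    fi
    echo "OK: ${pct}% used"; return 0
}

# Usage (user and host taken from the example above):
#   pct=$(ssh -l nagiosuser 1stmachine.duke.edu df -P | max_disk_pct)
#   disk_status "$pct"; exit $?
```

In a script shaped like this, an empty result from the remote command is the only path that yields a "cannot obtain" style message, so it is worth checking whether the ssh call returns anything when it is run the same way Nagios runs it.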

-- 
Bryn Smith (Ms) A&SIST 660-2434 jabber IM: bryn at jabber.duke.edu
