Problem with check_openmanage plugin and storage

Nic Bernstein nic at onlight.com
Wed Jun 19 14:55:05 CEST 2013


On 06/18/2013 07:55 PM, nagios-users-request at lists.sourceforge.net wrote:
> Date: Wed, 19 Jun 2013 02:41:02 +0200
> From: Trond Hasle Amundsen <t.h.amundsen at usit.uio.no>
> Subject: Re: [Nagios-users] Problem with check_openmanage plugin and storage
> To: Nagios Users List <nagios-users at lists.sourceforge.net>
> Message-ID: <15tk3lrrkyp.fsf at tux.uio.no>
> Content-Type: text/plain; charset=utf-8
>
> Nic Bernstein <nic at onlight.com> writes:
>> > We've recently been experimenting with Trond Hasle Amundsen's check_openmanage
>> > on a large network with about a hundred Dell servers of various ages,
>> > capabilities, etc.  Mostly PE-2950, R210, R410 and R720.  Much thanks to Trond
>> > for all his great work on Nagios plugins and other projects, by the way.
>> >
>> > We've hit a wall, however, with the storage monitoring aspects of this plugin.
>> >
>> > For example, here's a quite specific case.  This is a new PE R720, in debug:
>> >
>> >     onlight@monitor:~$ check_openmanage -H host -C secret -d
>> >        System:      PowerEdge R720           OMSA version:    7.1.0
>> >        ServiceTag:  #######                  Plugin version:  3.7.9
>> >        BIOS/date:   1.2.6 05/10/2012         Checking mode:   SNMPv2c UDP/IPv4
>> >     -----------------------------------------------------------------------------
>> >        Storage Components
>> >     =============================================================================
>> >       STATE  |    ID    |  MESSAGE TEXT
>> >     ---------+----------+--------------------------------------------------------
>> >           OK |        0 | Controller 0 [PERC H310 Mini] is Ready
>> >      WARNING |  0:0:1:0 | Physical Disk 0:1:0 [Ata ST2000DM001-9YN164, 2.0TB] on ctrl 0 is Online, Not Certified
>> >      WARNING |  0:0:1:1 | Physical Disk 0:1:1 [Ata ST2000DM001-9YN164, 2.0TB] on ctrl 0 is Online, Not Certified
>> >           OK |      0:0 | Logical Drive '/dev/sda' [RAID-1, 1862.50 GB] is Ready
>> >           OK |      0:0 | Connector 0 [SAS] on controller 0 is Ready
>> >           OK |      0:1 | Connector 1 [SAS] on controller 0 is Ready
>> >           OK |    0:0:1 | Enclosure 0:0:1 [Backplane] on controller 0 is Ready
>> [...]
>> > This run exits with 1 (WARNING).
>> >
>> > We're not sure we agree with the decision to make the fact that a disk is not
>> > Dell Certified a Warning, but we can at least understand that.  So, what if we
>> > exclude storage, with --no-storage?
> The decision to create a warning for non-certified disks belongs to
> Dell. I've tried to let the plugin simply relay the warning level from
> Openmanage, unless it's outright wrong (such as reporting disks in
> predictive failure as OK).

Yes, we completely understand that, and the use of the global status
flag.  I should have been clearer that we get that it wasn't your choice.

>> >     onlight@monitor:~$ check_openmanage -H host -C secret -d --no-storage
>> >        System:      PowerEdge R720           OMSA version:    7.1.0
>> >        ServiceTag:  #######                  Plugin version:  3.7.9
>> >        BIOS/date:   1.2.6 05/10/2012         Checking mode:   SNMPv2c UDP/IPv4
>> >     -----------------------------------------------------------------------------
>> >
>> [...]
>> >     OOPS! Something is wrong with this server, but I don't know what. The global
>> >     system health status is WARNING, but every component check is OK. This may
>> >     be a bug in the Nagios plugin, please file a bug report.
>> >
>> > This yields exit code 3 (UNKNOWN).
> This is a bug. Using blacklisting or check manipulation (such as
> --no-storage) should disable the global health check.

Okay, that's what we'd expect.
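
To make the expected behaviour concrete, here is a minimal Python sketch of the reconciliation described above. It is not the plugin's actual code (check_openmanage is written in Perl): the function name `final_status`, its parameters, and the severity ranking are assumptions made for illustration. Only the exit codes themselves follow the standard Nagios plugin convention.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Assumed severity ranking used to pick the "worst" component state.
SEVERITY = {OK: 0, UNKNOWN: 1, WARNING: 2, CRITICAL: 3}


def final_status(component_states, global_state, checks_restricted):
    """Combine per-component states with the global health status.

    component_states: exit codes of the component checks that actually ran.
    global_state: the global system health status reported by OpenManage.
    checks_restricted: True when blacklisting or an option such as
        --no-storage hid some components from the plugin.
    """
    worst = max(component_states, key=SEVERITY.get, default=OK)
    if worst != OK:
        return worst
    if checks_restricted:
        # Components were skipped on purpose, so the global status may
        # reflect something we chose not to look at: ignore it.
        return OK
    if global_state != OK:
        # Every check ran and passed, yet the global status disagrees.
        return UNKNOWN
    return OK
```

Under these assumptions, `final_status([OK, OK], WARNING, False)` gives UNKNOWN (the "OOPS!" case quoted above), while the same call with `checks_restricted=True` gives OK, which is the result one would expect from `--no-storage`.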

>> > Now, just for argument's sake, let's say we obviate the check for certified
>> > drives, by commenting out the "workaround for OMSA 7.1.0 bug" code (just
>> > a handy little short-cut).  Here's what we get then:
>> >
>> [...]
>> > Again, as with the original case, exit code is 1 (WARNING).
>> >
>> > Is there any way around this?  Should I be disabling global health checks?
> Openmanage contains a bug that flips the reported warning level
> wrt. certified disks. Any certified disks are reported as non-certified
> and vice versa. The output above is expected when you remove the
> workaround in the code.
>
>> > Here's a run to test that, and it works:
>> >
>> >     onlight@monitor:~$ check_openmanage -H host -C secret -b pdisk=all
>> >     OK - System: 'PowerEdge R720', SN: '#######', 16 GB ram (4 dimms), 1 logical drives, 2 physical drives
> Here, the physical disks aren't checked at all, and the global check is
> correctly disabled, so this is an expected result.
>
>> > Interestingly, when combining the blacklist with debug ("-d -b pdisk=all"), the
>> > exit code is 3 (UNKNOWN), but with debug off, it's 0 (OK).
> Sounds like a bug, perhaps related to the one discussed earlier.
>
>> > So, I guess what I'm wondering is why we need to blacklist the physical disks
>> > (pdisk) instead of using --no-storage?  Shouldn't --no-storage also cause
>> > globalstatus to be ignored?
> Yes it should, I'll look into that, thanks for the report :)

Great, thanks!

> Regarding the non-certified disks problem... There is a special
> blacklisting keyword to suppress the message about non-certified disks:
>
>   check_openmanage -b pdisk_cert=all
>
> Please try this and see if it resolves your issue. Using blacklisting
> should also disable the global health check.

Ah, that's just what we need.  Much appreciated...

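On trying it, though: no, that doesn't seem to work in my version (3.7.9,
downloaded yesterday).  The "Not Certified" text is gone from the output,
but the exit code is still 1 (WARNING):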

    onlight@monitor:~$ perl check_openmanage -H host -C secret -b pdisk_cert=all
    Physical Disk 0:1:0 [Ata ST2000DM001-9YN164, 2.0TB] on ctrl 0 is Online
    Physical Disk 0:1:1 [Ata ST2000DM001-9YN164, 2.0TB] on ctrl 0 is Online
    onlight@monitor:~$ echo $?
    1

I guess I'll wait for a patch.
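
For readers unfamiliar with the `-b` option used above, the following Python sketch shows one way a blacklist string such as `pdisk=all` or `pdisk_cert=all` could be interpreted. It is an illustration only, not the plugin's Perl implementation. Only the `component=all` form appears in this thread; the `/` separator between entries, the `,` separator between IDs, and the helper names `parse_blacklist` and `is_blacklisted` are assumptions.

```python
def parse_blacklist(spec):
    """Parse a string such as 'pdisk=all' into {'pdisk': {'all'}}.

    Assumed syntax: entries separated by '/', each 'component=id1,id2'.
    """
    blacklist = {}
    for entry in spec.split("/"):
        if not entry:
            continue  # tolerate empty segments such as a trailing '/'
        component, sep, ids = entry.partition("=")
        if not sep:
            raise ValueError(f"missing '=' in blacklist entry: {entry!r}")
        blacklist.setdefault(component, set()).update(ids.split(","))
    return blacklist


def is_blacklisted(blacklist, component, item_id):
    """Return True if item_id of the given component should be skipped."""
    ids = blacklist.get(component, set())
    return "all" in ids or item_id in ids
```

With this sketch, `parse_blacklist("pdisk_cert=all")` marks every disk ID as blacklisted for the hypothetical `pdisk_cert` check while leaving the ordinary `pdisk` check untouched, which matches the intent of the suggestion quoted above.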

Say Trond, I sent you some notes last week about enhancements we made to
your check_linux_bonding plugin.  Would you prefer I re-post those to
the list instead?

Thanks again!
    -nic

-- 
Nic Bernstein                             nic at onlight.com
Onlight, Inc.                             www.onlight.com
219 N. Milwaukee St., Suite 2a            v. 414.272.4477
Milwaukee, Wisconsin  53202
