Problem with check_openmanage plugin and storage

Trond Hasle Amundsen t.h.amundsen at usit.uio.no
Wed Jun 19 02:41:02 CEST 2013


Nic Bernstein <nic at onlight.com> writes:

> We've recently been experimenting with Trond Hasle Amundsen's check_openmanage
> on a large network with about a hundred Dell servers of various ages,
> capabilities, etc.  Mostly PE-2950, R210, R410 and R720.  Much thanks to Trond
> for all his great work on Nagios plugins and other projects, by the way.
>
> We've hit a wall, however, with the storage monitoring aspects of this plugin.
>
> For example, here's a quite specific case.  This is a new PE R720, in debug:
>
>     onlight@monitor:~$ check_openmanage -H host -C secret -d
>        System:      PowerEdge R720           OMSA version:    7.1.0
>        ServiceTag:  #######                  Plugin version:  3.7.9
>        BIOS/date:   1.2.6 05/10/2012         Checking mode:   SNMPv2c UDP/IPv4
>     -----------------------------------------------------------------------------
>        Storage Components
>     =============================================================================
>       STATE  |    ID    |  MESSAGE TEXT
>     ---------+----------+--------------------------------------------------------
>           OK |        0 | Controller 0 [PERC H310 Mini] is Ready
>      WARNING |  0:0:1:0 | Physical Disk 0:1:0 [Ata ST2000DM001-9YN164, 2.0TB] on ctrl 0 is Online, Not Certified
>      WARNING |  0:0:1:1 | Physical Disk 0:1:1 [Ata ST2000DM001-9YN164, 2.0TB] on ctrl 0 is Online, Not Certified
>           OK |      0:0 | Logical Drive '/dev/sda' [RAID-1, 1862.50 GB] is Ready
>           OK |      0:0 | Connector 0 [SAS] on controller 0 is Ready
>           OK |      0:1 | Connector 1 [SAS] on controller 0 is Ready
>           OK |    0:0:1 | Enclosure 0:0:1 [Backplane] on controller 0 is Ready
>     -----------------------------------------------------------------------------
>        Chassis Components
>     =============================================================================
>       STATE  |  ID  |  MESSAGE TEXT
>     ---------+------+------------------------------------------------------------
>           OK |    0 | Memory module 0 [DIMM_A1, 4096 MB] is Ok
>           OK |    1 | Memory module 1 [DIMM_A2, 4096 MB] is Ok
>           OK |    2 | Memory module 2 [DIMM_A3, 4096 MB] is Ok
>           OK |    3 | Memory module 3 [DIMM_A4, 4096 MB] is Ok
>           OK |    0 | Chassis fan 0 [System Board Fan1 RPM] reading: 1200 RPM
>           OK |    1 | Chassis fan 1 [System Board Fan2 RPM] reading: 1080 RPM
>           OK |    2 | Chassis fan 2 [System Board Fan3 RPM] reading: 1200 RPM
>           OK |    3 | Chassis fan 3 [System Board Fan4 RPM] reading: 1080 RPM
>           OK |    4 | Chassis fan 4 [System Board Fan5 RPM] reading: 1080 RPM
>           OK |    5 | Chassis fan 5 [System Board Fan6 RPM] reading: 1080 RPM
>           OK |    0 | Power Supply 0 [AC]: Presence detected
>           OK |    0 | Temperature Probe 0 [System Board Inlet Temp] reads 26 C (min=3/-7, max=42/47)
>           OK |    1 | Temperature Probe 1 [System Board Exhaust Temp] reads 33 C (min=8/3, max=70/75)
>           OK |    2 | Temperature Probe 2 [CPU1 Temp] reads 49 C (min=8/3, max=83/88)
>           OK |    0 | Processor 0 [Intel Xeon E5-2603 0 1.80GHz] is Present
>           OK |    0 | Voltage sensor 0 [CPU1 VCORE PG] is Good
>           OK |    1 | Voltage sensor 1 [System Board 3.3V PG] is Good
>           OK |    2 | Voltage sensor 2 [System Board 5V PG] is Good
>           OK |    3 | Voltage sensor 3 [CPU1 PLL PG] is Good
>           OK |    4 | Voltage sensor 4 [System Board 1.1V PG] is Good
>           OK |    5 | Voltage sensor 5 [CPU1 M23 VDDQ PG] is Good
>           OK |    6 | Voltage sensor 6 [CPU1 M23 VTT PG] is Good
>           OK |    7 | Voltage sensor 7 [System Board FETDRV PG] is Good
>           OK |    8 | Voltage sensor 8 [CPU1 VSA PG] is Good
>           OK |    9 | Voltage sensor 9 [CPU1 M01 VDDQ PG] is Good
>           OK |   10 | Voltage sensor 10 [System Board NDC PG] is Good
>           OK |   11 | Voltage sensor 11 [CPU1 VTT PG] is Good
>           OK |   12 | Voltage sensor 12 [System Board 1.5V PG] is Good
>           OK |   13 | Voltage sensor 13 [PS2 PG Fail] is Good
>           OK |   14 | Voltage sensor 14 [System Board PS1 PG Fail] is Good
>           OK |   15 | Voltage sensor 15 [System Board BP1 5V PG] is Good
>           OK |   16 | Voltage sensor 16 [CPU1 M01 VTT PG] is Good
>           OK |   17 | Voltage sensor 17 [PS1 Voltage 1] reads 114 V
>           OK |    0 | Battery probe 0 [System Board CMOS Battery] is Presence Detected
>           OK |    0 | Amperage probe 0 [PS1 Current 1] reads 0.6 A
>           OK |    1 | Amperage probe 1 [System Board Pwr Consumption] reads 56 W
>           OK |    0 | Chassis intrusion 0 detection: Ok (Not Breached)
>           OK |    0 | SD Card 0 [vFlash] is Absent
>     -----------------------------------------------------------------------------
>        Other messages
>     =============================================================================
>       STATE  |  MESSAGE TEXT
>     ---------+-------------------------------------------------------------------
>           OK | ESM log health is Ok (less than 80% full)
>           OK | Chassis Service Tag is sane
>
> This run exits with 1 (WARNING).
>
> We're not sure we agree with the decision to make the fact that a disk is not
> Dell Certified a Warning, but we can at least understand that.  So, what if we
> exclude storage, with --no-storage?

The decision to create a warning for non-certified disks belongs to
Dell. I've tried to let the plugin simply relay the warning level from
Openmanage, unless it's outright wrong (such as reporting disks in
predictive failure as OK).
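
The relay policy described above can be illustrated with a minimal sketch. This is not the plugin's own code (the plugin is not written in Python); the function name `relay_state` and the choice of WARNING as the override level are assumptions made purely for illustration.

```python
# Nagios plugin states with their conventional exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def relay_state(reported_state, predictive_failure):
    """Pass through the state reported by the agent, unless it is known to be wrong.

    reported_state     -- state the management agent reports for the disk
    predictive_failure -- True if the disk is flagged as likely to fail
    """
    # Hypothetical override: a disk in predictive failure must not be
    # relayed as OK. Raising it to WARNING is an assumption of this sketch.
    if predictive_failure and reported_state == OK:
        return WARNING
    # Otherwise the agent's own level is relayed unchanged.
    return reported_state
```

With this shape, a warning that the agent raises for a non-certified disk is passed on as is, and only the one known-wrong case is corrected.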

>     onlight@monitor:~$ check_openmanage -H host -C secret -d --no-storage
>        System:      PowerEdge R720           OMSA version:    7.1.0
>        ServiceTag:  #######                  Plugin version:  3.7.9
>        BIOS/date:   1.2.6 05/10/2012         Checking mode:   SNMPv2c UDP/IPv4
>     -----------------------------------------------------------------------------
>        Chassis Components
>     =============================================================================
>       STATE  |  ID  |  MESSAGE TEXT
>     ---------+------+------------------------------------------------------------
>           OK |    0 | Memory module 0 [DIMM_A1, 4096 MB] is Ok
>           OK |    1 | Memory module 1 [DIMM_A2, 4096 MB] is Ok
>           OK |    2 | Memory module 2 [DIMM_A3, 4096 MB] is Ok
>           OK |    3 | Memory module 3 [DIMM_A4, 4096 MB] is Ok
>           OK |    0 | Chassis fan 0 [System Board Fan1 RPM] reading: 1080 RPM
>           OK |    1 | Chassis fan 1 [System Board Fan2 RPM] reading: 1080 RPM
>           OK |    2 | Chassis fan 2 [System Board Fan3 RPM] reading: 1200 RPM
>           OK |    3 | Chassis fan 3 [System Board Fan4 RPM] reading: 1080 RPM
>           OK |    4 | Chassis fan 4 [System Board Fan5 RPM] reading: 1080 RPM
>           OK |    5 | Chassis fan 5 [System Board Fan6 RPM] reading: 1080 RPM
>           OK |    0 | Power Supply 0 [AC]: Presence detected
>           OK |    0 | Temperature Probe 0 [System Board Inlet Temp] reads 26 C (min=3/-7, max=42/47)
>           OK |    1 | Temperature Probe 1 [System Board Exhaust Temp] reads 33 C (min=8/3, max=70/75)
>           OK |    2 | Temperature Probe 2 [CPU1 Temp] reads 49 C (min=8/3, max=83/88)
>           OK |    0 | Processor 0 [Intel Xeon E5-2603 0 1.80GHz] is Present
>           OK |    0 | Voltage sensor 0 [CPU1 VCORE PG] is Good
>           OK |    1 | Voltage sensor 1 [System Board 3.3V PG] is Good
>           OK |    2 | Voltage sensor 2 [System Board 5V PG] is Good
>           OK |    3 | Voltage sensor 3 [CPU1 PLL PG] is Good
>           OK |    4 | Voltage sensor 4 [System Board 1.1V PG] is Good
>           OK |    5 | Voltage sensor 5 [CPU1 M23 VDDQ PG] is Good
>           OK |    6 | Voltage sensor 6 [CPU1 M23 VTT PG] is Good
>           OK |    7 | Voltage sensor 7 [System Board FETDRV PG] is Good
>           OK |    8 | Voltage sensor 8 [CPU1 VSA PG] is Good
>           OK |    9 | Voltage sensor 9 [CPU1 M01 VDDQ PG] is Good
>           OK |   10 | Voltage sensor 10 [System Board NDC PG] is Good
>           OK |   11 | Voltage sensor 11 [CPU1 VTT PG] is Good
>           OK |   12 | Voltage sensor 12 [System Board 1.5V PG] is Good
>           OK |   13 | Voltage sensor 13 [PS2 PG Fail] is Good
>           OK |   14 | Voltage sensor 14 [System Board PS1 PG Fail] is Good
>           OK |   15 | Voltage sensor 15 [System Board BP1 5V PG] is Good
>           OK |   16 | Voltage sensor 16 [CPU1 M01 VTT PG] is Good
>           OK |   17 | Voltage sensor 17 [PS1 Voltage 1] reads 112 V
>           OK |    0 | Battery probe 0 [System Board CMOS Battery] is Presence Detected
>           OK |    0 | Amperage probe 0 [PS1 Current 1] reads 0.6 A
>           OK |    1 | Amperage probe 1 [System Board Pwr Consumption] reads 56 W
>           OK |    0 | Chassis intrusion 0 detection: Ok (Not Breached)
>           OK |    0 | SD Card 0 [vFlash] is Absent
>     -----------------------------------------------------------------------------
>        Other messages
>     =============================================================================
>       STATE  |  MESSAGE TEXT
>     ---------+-------------------------------------------------------------------
>           OK | ESM log health is Ok (less than 80% full)
>           OK | Chassis Service Tag is sane
>     OOPS! Something is wrong with this server, but I don't know what. The global
>     system health status is WARNING, but every component check is OK. This may
>     be a bug in the Nagios plugin, please file a bug report.
>
> This yields exit code 3 (UNKNOWN).

This is a bug. Using blacklisting or check manipulation (such as
--no-storage) should disable the global health check.
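
To make the intended behaviour concrete, here is a rough sketch of how the global health check might be gated. It is an illustration only, not the plugin's implementation; `overall_state` and the `checks_restricted` flag are hypothetical names, and component states are assumed to be OK, WARNING or CRITICAL.

```python
# Nagios plugin states with their conventional exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def overall_state(component_states, global_state, checks_restricted):
    """Combine per-component states with the global health status.

    component_states  -- states of the components that were actually checked
    global_state      -- system-wide health status reported by the agent
    checks_restricted -- True if blacklisting or an option such as
                         --no-storage left some components unchecked
    """
    worst = max(component_states, default=OK)
    if worst != OK:
        # A checked component explains the problem; report it directly.
        return worst
    # Every checked component is OK. The global status is only a useful
    # cross-check when nothing was excluded from checking.
    if not checks_restricted and global_state != OK:
        return UNKNOWN
    return OK
```

In the --no-storage run quoted above, all checked components are OK while the global status is WARNING. With the gate in place that case would return OK rather than UNKNOWN, because the unchecked storage components may well account for the global status.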

> Now, just for argument's sake, let's say we obviate the check for certified
> drives, by commenting out the "workaround for OMSA 7.1.0 bug" code (just
> a handy little short-cut).  Here's what we get then:
>
>     onlight@monitor:~$ check_openmanage -H host -C secret -d
>        System:      PowerEdge R720           OMSA version:    7.1.0
>        ServiceTag:  #######                  Plugin version:  3.7.9
>        BIOS/date:   1.2.6 05/10/2012         Checking mode:   SNMPv2c UDP/IPv4
>     -----------------------------------------------------------------------------
>        Storage Components
>     =============================================================================
>       STATE  |    ID    |  MESSAGE TEXT
>     ---------+----------+--------------------------------------------------------
>           OK |        0 | Controller 0 [PERC H310 Mini] is Ready
>      WARNING |  0:0:1:0 | Physical Disk 0:1:0 [Ata ST2000DM001-9YN164, 2.0TB] on ctrl 0 is Online
>      WARNING |  0:0:1:1 | Physical Disk 0:1:1 [Ata ST2000DM001-9YN164, 2.0TB] on ctrl 0 is Online
>           OK |      0:0 | Logical Drive '/dev/sda' [RAID-1, 1862.50 GB] is Ready
>           OK |      0:0 | Connector 0 [SAS] on controller 0 is Ready
>           OK |      0:1 | Connector 1 [SAS] on controller 0 is Ready
>           OK |    0:0:1 | Enclosure 0:0:1 [Backplane] on controller 0 is Ready
>     -----------------------------------------------------------------------------
>        Chassis Components
>     =============================================================================
>       STATE  |  ID  |  MESSAGE TEXT
>     ---------+------+------------------------------------------------------------
>           OK |    0 | Memory module 0 [DIMM_A1, 4096 MB] is Ok
>           OK |    1 | Memory module 1 [DIMM_A2, 4096 MB] is Ok
>           OK |    2 | Memory module 2 [DIMM_A3, 4096 MB] is Ok
>           OK |    3 | Memory module 3 [DIMM_A4, 4096 MB] is Ok
>           OK |    0 | Chassis fan 0 [System Board Fan1 RPM] reading: 1080 RPM
>           OK |    1 | Chassis fan 1 [System Board Fan2 RPM] reading: 1200 RPM
>           OK |    2 | Chassis fan 2 [System Board Fan3 RPM] reading: 1200 RPM
>           OK |    3 | Chassis fan 3 [System Board Fan4 RPM] reading: 1080 RPM
>           OK |    4 | Chassis fan 4 [System Board Fan5 RPM] reading: 1080 RPM
>           OK |    5 | Chassis fan 5 [System Board Fan6 RPM] reading: 1200 RPM
>           OK |    0 | Power Supply 0 [AC]: Presence detected
>           OK |    0 | Temperature Probe 0 [System Board Inlet Temp] reads 26 C (min=3/-7, max=42/47)
>           OK |    1 | Temperature Probe 1 [System Board Exhaust Temp] reads 33 C (min=8/3, max=70/75)
>           OK |    2 | Temperature Probe 2 [CPU1 Temp] reads 48 C (min=8/3, max=83/88)
>           OK |    0 | Processor 0 [Intel Xeon E5-2603 0 1.80GHz] is Present
>           OK |    0 | Voltage sensor 0 [CPU1 VCORE PG] is Good
>           OK |    1 | Voltage sensor 1 [System Board 3.3V PG] is Good
>           OK |    2 | Voltage sensor 2 [System Board 5V PG] is Good
>           OK |    3 | Voltage sensor 3 [CPU1 PLL PG] is Good
>           OK |    4 | Voltage sensor 4 [System Board 1.1V PG] is Good
>           OK |    5 | Voltage sensor 5 [CPU1 M23 VDDQ PG] is Good
>           OK |    6 | Voltage sensor 6 [CPU1 M23 VTT PG] is Good
>           OK |    7 | Voltage sensor 7 [System Board FETDRV PG] is Good
>           OK |    8 | Voltage sensor 8 [CPU1 VSA PG] is Good
>           OK |    9 | Voltage sensor 9 [CPU1 M01 VDDQ PG] is Good
>           OK |   10 | Voltage sensor 10 [System Board NDC PG] is Good
>           OK |   11 | Voltage sensor 11 [CPU1 VTT PG] is Good
>           OK |   12 | Voltage sensor 12 [System Board 1.5V PG] is Good
>           OK |   13 | Voltage sensor 13 [PS2 PG Fail] is Good
>           OK |   14 | Voltage sensor 14 [System Board PS1 PG Fail] is Good
>           OK |   15 | Voltage sensor 15 [System Board BP1 5V PG] is Good
>           OK |   16 | Voltage sensor 16 [CPU1 M01 VTT PG] is Good
>           OK |   17 | Voltage sensor 17 [PS1 Voltage 1] reads 114 V
>           OK |    0 | Battery probe 0 [System Board CMOS Battery] is Presence Detected
>           OK |    0 | Amperage probe 0 [PS1 Current 1] reads 0.6 A
>           OK |    1 | Amperage probe 1 [System Board Pwr Consumption] reads 56 W
>           OK |    0 | Chassis intrusion 0 detection: Ok (Not Breached)
>           OK |    0 | SD Card 0 [vFlash] is Absent
>     -----------------------------------------------------------------------------
>        Other messages
>     =============================================================================
>       STATE  |  MESSAGE TEXT
>     ---------+-------------------------------------------------------------------
>           OK | ESM log health is Ok (less than 80% full)
>           OK | Chassis Service Tag is sane
>
> Again, as with the original case, exit code is 1 (WARNING).
>
> Is there any way around this?  Should I be disabling global health checks? 

Openmanage contains a bug that flips the reported warning level with
regard to certified disks: certified disks are reported as
non-certified, and vice versa. The output above is expected when you
remove the workaround in the code.
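
As a rough illustration of what such a workaround amounts to, the sketch below undoes the inversion for an affected version. It is not the plugin's actual code. The function name `certified_flag` is hypothetical, and treating 7.1.0 as the only affected version is an assumption taken from the comment name quoted above.

```python
# Assumption: only the version named in the quoted workaround comment
# is treated as affected in this sketch.
AFFECTED_OMSA_VERSIONS = {"7.1.0"}


def certified_flag(reported_certified, omsa_version):
    """Return the certification flag with the inversion undone where needed.

    reported_certified -- flag as reported by the agent (True = certified)
    omsa_version       -- agent version string, e.g. "7.1.0"
    """
    if omsa_version in AFFECTED_OMSA_VERSIONS:
        # The affected version reports the opposite of the truth,
        # so invert the flag once more to recover it.
        return not reported_certified
    return reported_certified
```

Removing the inversion, as in the modified run above, leaves the flag as the agent reported it.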

> Here's a run to test that, and it works:
>
>     onlight@monitor:~$ check_openmanage -H host -C secret -b pdisk=all
>     OK - System: 'PowerEdge R720', SN: '#######', 16 GB ram (4 dimms), 1 logical drives, 2 physical drives

Here, the physical disks aren't checked at all, and the global check is
correctly disabled, so this is an expected result.

> Interestingly, when combining the blacklist with debug ("-d -b pdisk=all"), the
> exit code is 3 (UNKNOWN), but with debug off, it's 0 (OK).

Sounds like a bug, perhaps related to the one discussed earlier.
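
For anyone who wants to reproduce the discrepancy, a small helper along these lines could compare the exit codes of two invocations. It is only a sketch: `exit_state` and `compare` are hypothetical names, and the helper relies on nothing beyond the standard Nagios exit code convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import subprocess

# Standard Nagios plugin exit codes and their state names.
STATE_NAMES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def exit_state(argv):
    """Run a command and return (exit code, Nagios state name)."""
    result = subprocess.run(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    # Codes outside 0-3 are conventionally treated as UNKNOWN.
    return result.returncode, STATE_NAMES.get(result.returncode, "UNKNOWN")


def compare(argv_a, argv_b):
    """Run two commands; return both results and whether they agree."""
    a = exit_state(argv_a)
    b = exit_state(argv_b)
    return a, b, a == b


# Example call, using the same options as the runs quoted above:
# compare(
#     ["check_openmanage", "-H", "host", "-C", "secret", "-b", "pdisk=all"],
#     ["check_openmanage", "-H", "host", "-C", "secret", "-d", "-b", "pdisk=all"],
# )
```

If the third element of the result is False, the two invocations disagree, which is the behaviour reported here for the run with and without -d.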

> So, I guess what I'm wondering is why we need to blacklist the physical disks
> (pdisk) instead of using --no-storage?  Shouldn't --no-storage also cause
> globalstatus to be ignored?

Yes, it should. I'll look into that. Thanks for the report :)

Regarding the non-certified disks problem... There is a special
blacklisting keyword to suppress the message about non-certified disks:

  check_openmanage -b pdisk_cert=all

Please try this and see if it resolves your issue. Using blacklisting
should also disable the global health check.

Regards,
-- 
Trond H. Amundsen <t.h.amundsen at usit.uio.no>
Center for Information Technology Services, University of Oslo
