Decode SEL Errors

Examples of decoding SEL error Event Data codes.

The Baseboard Management Controller (BMC) on S2600 motherboards stores the System Event Log (SEL). Sensors within the server or blade report operating conditions, including temperature, fan speeds, and power modes, to the BMC. The BMC monitors the system and logs events to the SEL, ranging in severity from informational to critical. The SEL can be displayed with the ipmitool command.

This section provides three examples that show how to display and decode an SEL event.

Prerequisite

How to Decode SEL Events

Example 1: Memory Error

  1. Display the contents of the System Event Log. Both the sel list and sel elist subcommands display the contents of the SEL; elist (extended list) also displays the name of the sensor that caused the event:
    [root@node1 ~]# ipmitool sel elist
    ...
      7c | 10/29/2015 | 16:56:58 | System Event BIOS Evt Sensor | Timestamp Clock Sync | Asserted
      7d | 10/29/2015 | 16:56:58 | System Event BIOS Evt Sensor | Timestamp Clock Sync | Asserted
      7e | 10/29/2015 | 16:59:13 | System Event BIOS Evt Sensor | OEM System boot event | Asserted
      7f | 11/07/2015 | 14:58:10 | Power Unit Pwr Unit Status | Power off/down | Asserted
      80 | 11/07/2015 | 14:58:20 | Power Unit Pwr Unit Status | Power off/down | Deasserted
      81 | 11/07/2015 | 14:58:30 | System Event BIOS Evt Sensor | Timestamp Clock Sync | Asserted
      82 | 11/07/2015 | 14:58:48 | System Event BIOS Evt Sensor | Timestamp Clock Sync | Asserted
      83 | 11/07/2015 | 15:01:04 | System Event BIOS Evt Sensor | OEM System boot event | Asserted
      84 | 11/11/2015 | 09:07:31 | Memory Mmry ECC Sensor | Correctable ECC | Asserted
  2. Display the details of a specific SEL entry by passing its entry number (84) with the hexadecimal 0x prefix:
    [root@node1 ~]# ipmitool sel get 0x84
    SEL Record ID : 0084
    Record Type : 02 
    Timestamp : 01/11/2016 21:42:09 
    Generator ID : 0033 
    EvM Revision : 04 
    Sensor Type : Memory 
    Sensor Number : 02 
    Event Type : Sensor-specific Discrete 
    Event Direction : Assertion Event 
    Event Data (RAW) : a00239 
    Event Interpretation : Missing 
    Description : Correctable ECC 
    
    Sensor ID : Mmry ECC Sensor (0x2) 
    Entity ID : 32.2 (Memory Device) 
    Sensor Type : Memory (0x0c)
    
  3. In the output above, locate the following fields:
    Generator ID : 0033 
    Sensor Number : 02 
    Event Data (RAW) : a00239
  4. Convert the hexadecimal Event Data code to binary format:
    a00239 = 1010 0000 0000 0010 0011 1001
  5. Search the S2600 System Event Log Troubleshooting Guide for the Generator ID using the abbreviation and hex format (GID = 0033h) to find the following section of the guide:

  6. Find the sensor number and corresponding sensor name:

    02h = Memory Correctable and Uncorrectable ECC Error.

  7. Click the corresponding Details Section link to find the following table.

    Memory Correctable and Uncorrectable ECC Error

  8. The Event Data (RAW) part of the Event contains three bytes of data. The decoding of these bytes, based on the table definitions for the specific error category, is shown below:

  9. Look for the corresponding DIMM ID stenciled on the motherboard, or look up the CPU, channel, and DIMM information in a DIMM slot identification chart for the S2600 motherboard, as shown in Figure 1.
  10. For an S2600KP board, a00108 decodes to CPU 1, DIMM B1.
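
Steps 3 and 4 can also be scripted. The following is a minimal Python sketch, not part of ipmitool or the troubleshooting guide. The helper names parse_sel_get and to_nibbles are hypothetical, and the sample text is a shortened copy of the ipmitool sel get output from step 2.

```python
def parse_sel_get(text):
    """Return the 'Field : value' pairs from `ipmitool sel get` output as a dict."""
    fields = {}
    for line in text.splitlines():
        name, sep, value = line.partition(" : ")
        if sep:
            # Keep the first occurrence; "Sensor Type" appears twice in the output.
            fields.setdefault(name.strip(), value.strip())
    return fields


def to_nibbles(hex_code):
    """Render a hex string as space-separated 4-bit binary groups."""
    return " ".join(format(int(digit, 16), "04b") for digit in hex_code)


# Shortened copy of the output shown in step 2.
sample = """\
SEL Record ID : 0084
Generator ID : 0033
Sensor Number : 02
Event Data (RAW) : a00239
"""

fields = parse_sel_get(sample)
print(fields["Generator ID"], fields["Sensor Number"])  # 0033 02
print(to_nibbles(fields["Event Data (RAW)"]))           # 1010 0000 0000 0010 0011 1001
```

The sketch only extracts and reformats the fields. The meaning of each bit still has to be read from the table for the specific sensor in the troubleshooting guide.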

Example 2: Memory Error

This example shows the same type of error as Example 1, but with an Event Data code of a1003a from an S2600WT board.
  1. Convert the hexadecimal Event Data code to binary format and decode it as shown below:

  2. For an S2600WT board, a1003a decodes to CPU 2, DIMM H3.
Figure: S2600 DIMM Slot Identification Charts
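
The memory decoding in Examples 1 and 2 can be sketched in Python as follows. The bit layout used here is an assumption, not taken from the guide: socket in bits 7:5 of the third Event Data byte, channel in bits 4:3, and DIMM in bits 2:0, with channels A through D on CPU 1 and E through H on CPU 2. It reproduces the two decoded results stated in this section, but verify it against the table in the S2600 System Event Log Troubleshooting Guide and the slot chart for the specific board before relying on it. The function name decode_dimm is hypothetical.

```python
def decode_dimm(event_data_hex):
    """Decode the third Event Data byte into (CPU number, DIMM label).

    ASSUMED layout, inferred from the decoded results in this section:
    bits 7:5 = socket, bits 4:3 = channel, bits 2:0 = DIMM on the channel.
    """
    data3 = bytes.fromhex(event_data_hex)[2]
    socket = (data3 >> 5) & 0b111
    channel = (data3 >> 3) & 0b11
    dimm = data3 & 0b111
    if socket > 1:
        raise ValueError("this sketch only covers two-socket boards")
    # ASSUMED lettering: channels A-D belong to CPU 1, E-H to CPU 2.
    letter = "ABCDEFGH"[socket * 4 + channel]
    return socket + 1, f"{letter}{dimm + 1}"


print(decode_dimm("a00108"))  # (1, 'B1')
print(decode_dimm("a1003a"))  # (2, 'H3')
```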

Example 3: PCI Express Critical Error

PCIe error events are either correctable (informational) or fatal. In both cases, information is logged to help identify the source of the PCIe error, and the bus, device, and function are included in the extended data fields.

  1. For this example, the following SEL entry is displayed:
    [root@node1 ~]# ipmitool sel elist
    ...
      23 | 08/07/2016 | 14:58:30 | System Event BIOS Evt Sensor | Timestamp Clock Sync | Asserted
      24 | 08/07/2016 | 14:58:48 | System Event BIOS Evt Sensor | Timestamp Clock Sync | Asserted
      25 | 08/07/2016 | 14:58:51 | System Firmware Error #0x06 | Unknown Error | Asserted
      26 | 08/07/2016 | 14:58:52 | Critical Interrupt #0x04 | 
  2. Display the details of SEL entry 26:
    [root@node1 ~]# ipmitool sel get 0x26
    SEL Record ID : 0026
     Record Type : 02
     Timestamp : 08/07/2016 14:58:52
     Generator ID : 0033
     EvM Revision : 04
     Sensor Type : Critical Interrupt
     Sensor Number : 04
     Event Type : OEM
     Event Direction : Assertion Event
     Event Data (RAW) : a60018
     Description :
    
    Sensor ID : PCIe Fat Sensor (0x4) 
    Entity ID : 49.1 (PCI Express Bus) 
    Sensor Type : Critical Interrupt 
    
  3. In the GID = 0033h table in the SEL Troubleshooting Guide, Sensor Number 04 indicates:

    04h = PCI Express Fatal Error.

  4. Click the corresponding Details Section link to find the following table.

  5. Convert the hexadecimal Event Data code to binary format:
    a60018 = 1010 0110 0000 0000 0001 1000
  6. Decode the Event Data code using the table definitions as shown below:

  7. Next, use lspci to find the name and physical location of the PCIe device using the decoded information:

    lspci -s <PCI Bus ##>:<PCI Device ##>.<PCI Function #> -v

    [root@node1 ~]# lspci -s 03:00.0 -v  
    03:00.0 3D controller: nVidia Corporation Device 1091 (rev a1)  
    Subsystem: nVidia Corporation Device 0887 
    Physical Slot: 2  
    Flags: bus master, fast devsel, latency 0, IRQ 32  
    ...
  8. Now that the identity of the PCIe device, the cause of the error, and the location (slot 2) are known, complete any "Next Steps" that are defined in the SEL Troubleshooting Guide. For fatal PCIe errors:
    1. Reseat the card and verify it is inserted properly. Check if the error persists.
    2. Install the card in another slot and check if the error follows the card or stays with the current slot.
    3. Verify and update all firmware and drivers.
    4. Replace the card with a known-good spare.
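
The command lines used in steps 1, 2, and 7 can be assembled with a short Python sketch like the one below. The helper names are hypothetical. The sketch does not decode the Event Data code: the bus, device, and function values passed to lspci_command are the ones used in the lspci call in step 7, and the elist line is copied from step 1.

```python
def parse_elist_line(line):
    """Split one `ipmitool sel elist` line into its pipe-separated columns."""
    return [column.strip() for column in line.split("|")]


def sel_get_command(record_id):
    """Build the `ipmitool sel get` call for a record ID taken from elist."""
    return f"ipmitool sel get 0x{record_id}"


def lspci_command(bus, device, function):
    """Build the lspci call in <bus>:<device>.<function> form (hex digits)."""
    return f"lspci -s {bus:02x}:{device:02x}.{function:x} -v"


# Line copied from the elist output in step 1.
entry = parse_elist_line("  26 | 08/07/2016 | 14:58:52 | Critical Interrupt #0x04 | ")
print(sel_get_command(entry[0]))  # ipmitool sel get 0x26
print(lspci_command(3, 0, 0))     # lspci -s 03:00.0 -v
```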