Troubleshoot Node that is Unable to Boot

Process-of-elimination boot technique

This section provides a troubleshooting technique for Cray® GreenBlade™ and 2U rackmount servers that are unable to POST and boot. This technique applies only to Intel® S2600 motherboards. Events for blades/servers that are unable to boot range from major to critical. Analysis of log files may not indicate any failing hardware component.

If a compute node is unable to boot, users will experience degraded performance due to loss of capacity. If a management node is unable to boot, users will experience a downed cluster (non-HA) or degraded cluster (HA).

Process-of-Elimination Technique (Boot with minimum configuration)

The Intel S2600 family of motherboards is able to boot with only CPU 1 and a single DIMM. This capability is useful for process-of-elimination troubleshooting to identify the failing part. Cray recommends that customers attempt to boot the system in this "validation configuration" as part of the troubleshooting process:

When removing CPUs, take care to inspect the socket for any bent pins.

  1. Remove connections to all external devices from the problem blade/server, reseat all DIMMs, and attempt to boot the blade/server again.
  2. If the problem persists, begin the process-of-elimination to identify the faulty CPU, DIMM, or motherboard.
  3. Remove all parts except the minimum.
    1. Remove all peripheral devices (HCAs, network cards, GPUs, etc.).
    2. Remove CPU 2 and install a socket cover. (CPU 2 is farthest from the motherboard I/O.)
    3. Remove all DIMMs except for the DIMM in slot A1 (see below).
  4. Attempt to boot the blade/server.

    Able to boot: the cause is isolated to one or more of the components that were removed (CPU 2, DIMMs, peripherals).

    Unable to boot: install CPU 2 in the CPU 1 socket, replace the DIMM in slot A1 with a different DIMM, and attempt to boot again. If the node is able to boot, the problem is isolated to the CPU and/or DIMM just removed. If the node is still unable to boot, the problem is likely caused by the motherboard, bridgeboard, or IFB board.

  5. Once the faulty part has been identified, open a Parts Replacement case through CrayPort to have the part replaced and/or spares replenished.
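
The decision logic in step 4 can be summarized as a small lookup. The following Python sketch is illustrative only: the names `Suspect` and `isolate_fault` are hypothetical and not part of any Cray or Intel tool, and the function simply maps the outcomes of the two boot attempts to the group of parts to suspect.

```python
from enum import Enum
from typing import Optional


class Suspect(Enum):
    """Groups of parts named in the procedure (labels are illustrative)."""
    REMOVED_PARTS = "CPU 2, removed DIMMs, or peripherals"
    ORIGINAL_CPU1_OR_A1_DIMM = "original CPU 1 and/or original slot A1 DIMM"
    BOARD = "motherboard, bridgeboard, or IFB board"


def isolate_fault(boots_minimal: bool,
                  boots_after_swap: Optional[bool] = None) -> Suspect:
    """Map the boot-attempt outcomes to the parts to suspect.

    boots_minimal: result of booting with only CPU 1 and the slot A1 DIMM.
    boots_after_swap: result after installing CPU 2 in the CPU 1 socket and
        replacing the slot A1 DIMM; required only if boots_minimal is False.
    """
    if boots_minimal:
        # Minimum configuration works, so a removed part is at fault.
        return Suspect.REMOVED_PARTS
    if boots_after_swap is None:
        raise ValueError("result of the second boot attempt is required")
    if boots_after_swap:
        # The node boots with a different CPU and DIMM, so the parts
        # just taken out are suspect.
        return Suspect.ORIGINAL_CPU1_OR_A1_DIMM
    # Neither configuration boots: the board-level parts are likely at fault.
    return Suspect.BOARD


if __name__ == "__main__":
    print(isolate_fault(False, True).value)
```

When the result is `REMOVED_PARTS`, the removed components still have to be reinstalled and tested one at a time to find the specific faulty part; the sketch does not model that step.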

Invalid DIMM Platform Errors

This type of error can indicate a problem with the memory controller onboard the CPU.

A few troubleshooting steps:
  1. If you can access the BIOS, please verify the displayed "Effective Memory" and "Total Memory" of the system.
  2. Attempt to boot the system with a minimal memory configuration: DIMM A1 (for CPU 1) and DIMM E1 (for CPU 2).
  3. As described above, remove CPU 2 and attempt to boot. If the boot fails, swap CPU 2 into the CPU 1 socket and attempt another single-CPU boot. In each case, please remove the corresponding banks of DIMMs.