Troubleshoot Node that is Unable to Boot
Process-of-elimination boot technique
This section provides a troubleshooting technique for Cray® GreenBlade™ and 2U rackmount servers that are unable to POST and boot. This technique applies only to Intel® S2600 motherboards. Events for blades/servers that are unable to boot range from major to critical. Analysis of log files may not indicate any failing hardware component.
If a compute node is unable to boot, users will experience degraded performance due to loss of capacity. If a management node is unable to boot, users will experience a downed cluster (non-HA) or a degraded cluster (HA).
Process-of-Elimination Technique (Boot with minimum configuration)
The Intel S2600 family of motherboards is able to boot with only CPU 1 and a single DIMM. This capability is useful for process-of-elimination troubleshooting to identify the failing part. Cray recommends that customers attempt to boot the system in this minimum "validation configuration" as part of the troubleshooting process:
When removing CPUs, take care to inspect the socket for any bent pins.
- Remove connections to all external devices from the problem blade/server, reseat all DIMMs, and attempt to boot the blade/server again.
- If the problem persists, begin the process-of-elimination to identify the faulty CPU, DIMM, or motherboard.
- Remove all parts except the minimum:
  - Remove all peripheral devices (HCAs, network cards, GPUs, etc.).
  - Remove CPU 2 and install a socket cover. (CPU 2 is farthest from the motherboard I/O.)
  - Remove all DIMMs except for the DIMM in slot A1 (see below).
- Attempt to boot the blade/server.
  - If the node boots, the cause is isolated to one or more of the components that were removed (CPU 2, DIMMs, peripherals).
  - If the node does not boot, swap CPU 2 into the CPU 1 socket, replace the DIMM in slot A1 with a different DIMM, and attempt to boot again. If the node then boots, the problem is isolated to the CPU and/or DIMM just removed. If it still does not boot, the problem is likely caused by the motherboard, bridgeboard, or IFB board.
- Once the faulty part has been identified, open a Parts Replacement case through CrayPort to have the part replaced and/or spares replenished.
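The decision logic of the procedure above can be summarized in a short Python sketch. This is illustrative only and not a Cray tool: the names `Verdict` and `isolate_fault` are hypothetical, and the two boolean inputs are the results a technician records by hand after each boot attempt.

```python
from enum import Enum
from typing import Optional


class Verdict(Enum):
    """Possible conclusions of the minimum-configuration boot test."""
    REMOVED_PARTS = "one or more removed components (CPU 2, DIMMs, peripherals)"
    SWAPPED_CPU_OR_DIMM = "the CPU and/or DIMM just removed"
    BOARD = "motherboard, bridgeboard, or IFB board"


def isolate_fault(boots_minimal: bool,
                  boots_after_swap: Optional[bool] = None) -> Verdict:
    """Map recorded boot results to the likely faulty component group.

    boots_minimal:    result of booting with CPU 1 and the A1 DIMM only.
    boots_after_swap: result after moving CPU 2 into the CPU 1 socket and
                      changing the A1 DIMM (needed only if the first
                      attempt failed).
    """
    if boots_minimal:
        return Verdict.REMOVED_PARTS
    if boots_after_swap is None:
        raise ValueError("swap CPU 2 into the CPU 1 socket, change the "
                         "A1 DIMM, and record a second boot attempt")
    if boots_after_swap:
        return Verdict.SWAPPED_CPU_OR_DIMM
    return Verdict.BOARD


if __name__ == "__main__":
    # Example: the minimum configuration failed, and so did the swap.
    print(isolate_fault(False, False).value)
```

Recording each attempt in this form makes it easier to state in the Parts Replacement case which component group was implicated.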
Invalid DIMM Platform Errors
This type of error can indicate a problem with the memory controller onboard the CPU.
- If you can access the BIOS, please verify the displayed "Effective Memory" and "Total Memory" of the system.
- Attempt to boot the system with a minimal memory configuration: DIMM A1 (for CPU 1) and DIMM E1 (for CPU 2).
- As described above, remove CPU 2 and attempt to boot. If the boot fails, swap CPU 2 into the CPU 1 socket and attempt another single-CPU boot. In each case, please remove the corresponding banks of DIMMs.
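As a planning aid, the minimal memory configuration can be expressed as a small Python sketch that lists which DIMMs to pull for a given test. The helper name `dimms_to_remove` and the example slot labels other than A1 and E1 are hypothetical; the only facts taken from this section are that A1 is the minimal DIMM for CPU 1 and E1 is the minimal DIMM for CPU 2.

```python
# Minimal DIMM slot to keep for each installed CPU (from this section).
MINIMAL_SLOT = {1: "A1", 2: "E1"}


def dimms_to_remove(populated_slots, installed_cpus):
    """Return the populated DIMM slots to empty for a minimal-memory boot.

    populated_slots: slot labels that currently hold a DIMM.
    installed_cpus:  CPU numbers left installed for this test, e.g. [1]
                     for a single-CPU boot or [1, 2] for both CPUs.
    """
    keep = {MINIMAL_SLOT[cpu] for cpu in installed_cpus}
    return sorted(slot for slot in populated_slots if slot not in keep)


if __name__ == "__main__":
    populated = ["A1", "B1", "E1", "F1"]  # hypothetical population
    print(dimms_to_remove(populated, [1, 2]))  # both CPUs installed
    print(dimms_to_remove(populated, [1]))     # single-CPU boot
```

For the single-CPU case the list includes every DIMM other than A1, which covers the banks that correspond to the removed CPU.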