Part Swap Troubleshooting
Basic Part Swapping with Known-Good Spares.
Known-Good Spares
When troubleshooting a suspected hardware failure, the simplest first step is usually to replace the suspect part with a known-good spare. This technique requires no special tools or sophisticated diagnostics and can often identify the failing part quickly. Known-good parts are parts that have previously worked, or are currently working, in another location, rack, or subrack. They can also come from an onsite spares pool.
A failing part, for example a DIMM, can present symptoms ranging from a completely dead blade or server to one that boots normally but crashes when running an application. The simplest way to find the problem may be to place a known-good DIMM in the slot and see whether that resolves it. Alternatively, exchange the locations of the suspect part and a known-good part to see whether the problem moves with the suspect part to its new location.
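The reasoning behind a location swap can be written down as a small decision table. The sketch below is illustrative only: the function name, the `Verdict` labels, and the two boolean observations are hypothetical and are not part of any Cray tooling.

```python
from enum import Enum


class Verdict(Enum):
    PART_FAULTY = "fault followed the suspect part; replace the part"
    LOCATION_FAULTY = "fault stayed at the original location; suspect the slot, socket, or board"
    INCONCLUSIVE = "fault seen in both locations; the cause may lie elsewhere"
    NOT_REPRODUCED = "fault not seen after the swap; it may be intermittent or cured by reseating"


def interpret_swap(fault_at_original_location: bool,
                   fault_at_new_location: bool) -> Verdict:
    """Interpret a swap of a suspect part with a known-good part.

    fault_at_original_location: the fault is still seen where the suspect
        part used to be (that location now holds the known-good part).
    fault_at_new_location: the fault is seen where the suspect part now sits.
    """
    if fault_at_new_location and not fault_at_original_location:
        return Verdict.PART_FAULTY
    if fault_at_original_location and not fault_at_new_location:
        return Verdict.LOCATION_FAULTY
    if fault_at_original_location and fault_at_new_location:
        return Verdict.INCONCLUSIVE
    return Verdict.NOT_REPRODUCED


if __name__ == "__main__":
    # Example: after the swap, the fault appears only where the suspect DIMM now sits.
    print(interpret_swap(False, True).value)
```

The last two outcomes are worth recording in the case notes, since neither points cleanly at the part or at the location.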
Customers should open a Part Replacement case via CrayPort to request a replacement part to replenish spares. Parts that are not covered in this document should be handled on a case-by-case basis by opening a Hardware Support case via CrayPort.
Swapping Techniques for Cray CS Parts
A high-level overview of swapping techniques for Cray® CS™ parts is listed below. Due to the variety of parts/components in CS systems, specific troubleshooting steps and documentation will vary from site to site and from system to system.
- DIMMs
  Move a suspect DIMM to a different slot to determine whether the problem follows the DIMM. On multiple-socket motherboards where the memory controllers are on the CPUs, move the DIMM to a slot assigned to the other CPU. This identifies the root cause as either the CPU/socket or the DIMM itself.
- Processors/CPUs
  Faulty CPUs can be identified by swapping CPUs between sockets. Most multiple-socket motherboards support single-CPU operation from a specific, primary socket. In the case of the Intel® S2600 motherboard family, the primary socket is CPU1. Remove all non-primary CPUs and their corresponding DIMMs, then attempt to boot the system. If the system boots, replace the CPU in the primary socket with the CPU from the other socket, and repeat until the faulty CPU has been identified.
- Power supplies (PSU)
  Power supplies with noticeable damage should be immediately replaced. Power supplies with fault-indicating diagnostic lights should be reseated or swapped to another slot. If the problem continues, replace the PSU with a known-good unit.
- Hard drives
  Hard drives or other internal storage that experience predictable or unpredictable failures should be diagnosed using utilities such as Seagate's SeaTools. Storage devices that fail these diagnostics or report SMART errors should be replaced.
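As one way to check for SMART errors from a script, the sketch below classifies the text printed by `smartctl -H`. Note the assumptions: `smartctl` comes from the smartmontools package, which this document does not mention (it names SeaTools), and the helper name and its return values are hypothetical. The parser looks only for the overall health lines that smartctl prints for ATA and SCSI devices and reports anything else as unknown.

```python
import re


def smart_health_verdict(smartctl_output: str) -> str:
    """Classify `smartctl -H` output as 'ok', 'replace', or 'unknown'.

    Assumes the overall health line looks like one of:
      SMART overall-health self-assessment test result: PASSED   (ATA)
      SMART Health Status: OK                                    (SCSI/SAS)
    """
    match = re.search(
        r"SMART overall-health self-assessment test result:\s*(\S+)",
        smartctl_output,
    )
    if match is None:
        match = re.search(r"SMART Health Status:\s*(\S+)", smartctl_output)
    if match is None:
        # No recognizable health line: do not guess, fall back to other diagnostics.
        return "unknown"
    return "ok" if match.group(1) in ("PASSED", "OK") else "replace"


if __name__ == "__main__":
    sample = "SMART overall-health self-assessment test result: FAILED!\n"
    print(smart_health_verdict(sample))
```

The text to classify could be captured with something like `subprocess.run(["smartctl", "-H", "/dev/sda"], capture_output=True, text=True).stdout`, where the device path is a placeholder. An "unknown" result should not be read as a pass.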