Origin2000/Onyx2 Diagnostics: Node Board LEDs
Origin2000/Onyx2 Node Board LEDs
There is a series of LEDs on the bulkhead of Origin2000/Onyx2 processor node boards. As the system powers up, the LEDs display information while the system executes boot-time PROM code. If the system hangs or cannot complete execution of the PROM code, the LED display can be used as a diagnostic tool.
Each Origin2000 or Onyx2 node board has two columns of LEDs. Each column has eight LEDs; the LEDs in the left column are assigned to CPU A, the right column to CPU B.
The following diagram represents the position and function of the node board LEDs. The descriptions "Binary Group 1 and 2", "Processor Activity" and "Processor Heartbeat" apply to the LEDs in either column:
             CPU A  CPU B
          ---  X      X  --
Binary  _ |    X      X    |
Group 1   |    X      X    |
          ---  X      X    |
                           |-- Processor Activity
          ---  X      X    |
Binary  _ |    X      X    |
Group 2   |    X      X  --
          ---  X      X  ----- Processor Heartbeat (after IRIX loads)
The first seven pairs of LEDs indicate processor activity during PROM code execution and after the operating system (IRIX) loads. The eighth pair of LEDs indicates the processor heartbeat and is only active after IRIX loads.
Decoding the Node Board LED Display
If the system does not successfully execute PROM code, the coded pattern of LEDs illuminated on the node board can be used as a diagnostic tool to determine the failure point.
The LEDs should be read one column (CPU) at a time. Begin at the top of the column of LEDs, and read the first four LEDs as a group (marked as "Binary Group 1" in the diagram). Assign a binary numeric value to each LED, where an illuminated LED receives a value of "0", and an inactive LED receives a value of "1".
Note that the assignment of binary values is the reverse of typical "on"/"off" value assignments.
So if the first four LEDs were off, this pattern would be assigned the value 1111. Conversely, if all four LEDs were on, the value would be 0000. If the LED illumination pattern was on-off-off-on, the value would be 0110.
Move to the second set of four LEDs in the same processor column (marked as "Binary Group 2" in the diagram) and apply the same binary values used with the first group of four.
Assign each group of binary numbers a hexadecimal value using the following list of equivalents:
Binary Value    Hexadecimal Value
0000            0
0001            1
0010            2
0011            3
0100            4
0101            5
0110            6
0111            7
1000            8
1001            9
1010            a
1011            b
1100            c
1101            d
1110            e
1111            f
As an example, if the first group of LEDs displayed as on-on-on-on, the binary value would be 0000 and the first hexadecimal value would be 0. If the second group of LEDs displayed off-off-on-off, the binary value would be 1101 and the second hexadecimal value would be d. The first and second hexadecimal values are read in sequence to determine the PROM code execution point. In this example, the LED code would be read as 0d.
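The decoding procedure described above can be sketched in Python. This is a minimal illustration rather than any SGI-provided tool: the function name `decode_led_column` and the choice to represent a column as a list of eight booleans (True meaning illuminated, listed top to bottom) are assumptions made for the example.

```python
def decode_led_column(leds):
    """Decode one CPU's column of eight LEDs into a two-digit hex code.

    `leds` lists the LED states from top to bottom; True means illuminated.
    Following the article, an illuminated LED reads as binary 0 and a dark
    LED reads as binary 1 (the reverse of the usual on/off convention).
    """
    if len(leds) != 8:
        raise ValueError("a node board column has exactly eight LEDs")
    bits = "".join("0" if lit else "1" for lit in leds)
    group1, group2 = bits[:4], bits[4:]  # Binary Group 1, Binary Group 2
    return format(int(group1, 2), "x") + format(int(group2, 2), "x")


ON, OFF = True, False
# Example from the text: on-on-on-on followed by off-off-on-off reads as 0d.
print(decode_led_column([ON, ON, ON, ON, OFF, OFF, ON, OFF]))  # prints 0d
```

Each column is decoded separately, so a two-CPU node board yields two codes, one for CPU A and one for CPU B.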
Knowing the location of certain system components is helpful during diagnosis. For instance, the CPU, Scache and Hub chip are all located on the node board. The Bridge chip is on the Base I/O board (IO6 or IO6G) or other I/O board. The Crossbow (XBOW) chip is on the midplane.
Diagnosing PROM Code Boot Progress
If a system, node board or CPU does not successfully complete PROM code initialization, the static value displayed on the respective LEDs can be used to determine the point at which the node board or CPU failed.
During the initialization process there are also times when the node boards execute certain parts of the PROM code sequentially. As each node board executes the code in turn, the remaining node boards wait for that board to signal completion. If the node board hangs while executing that part of the PROM code, the remaining boards continue to wait for the completion signal, so the failure might appear as a complete system hang. The LEDs on the node boards can be read to determine which boards have actually failed and which are merely waiting.
Hexadecimal values between 00 and 7f are used to indicate the progress of PROM code execution (if hexadecimal values within the specified range are not listed they are unused).
If a suspected point of failure is not listed, the cause cannot be isolated to a specific component without additional proprietary diagnostics.
LED Code  Boot Phase                              Suspected Point of Failure
00        System Reset                            CPU
01        Init CPU                                CPU
02        Test CPU                                CPU
03        Run TLB                                 CPU
04        Test Pri Instruction Cache              CPU
05        Test Pri Data Cache                     CPU
06        Test Secondary Cache                    CPU
07        Flush all Caches                        CPU
0a        Invalidate Pri Inst Cache               CPU
0b        Invalidate Pri Data Cache               CPU
0c        Invalidate Secondary Cache              Scache
0d        Succeed - Jump to Main
0e        About to increase PROM Access Speed     PROM
0f        Increased PROM Access Speed             PROM
10        Init Pri Data Cache                     CPU
11        Init Pri Instruction Cache              CPU
12        Init CPU COP0 Registers                 CPU
13        Flush TLB                               CPU
1a        Probe for MSC                           MSC, nodeboard, midplane
1b        Probe for Junk UART                     MSC, nodeboard, midplane
1c        Done with MSC Probe                     MSC, UART
1d        About to Init UART                      MSC
1e        Done with UART Init                     UART
20        Start Power on Diagnostics (POD)
21        About to enter POD mode C portion
22        About to enter POD prompt loop
23        About to enter POD mode (assembler)
24        Local CPU (A/B) Arbitration             CPU
25        Init Secondary Cache                    Scache
28        About to perform 1st Local Barrier      CPU, nodeboard hub
2a        Config DEX mode - Stack and Data        CPU, Scache
2b        Reached Main                            CPU
38        1st Local Barrier Succeeded
3c        About to Jump to UALIAS Space           RAM
3d        Jumped to UALIAS Space                  RAM
3e        About to Jump to Cached Space           Scache
3f        Jumped to Cached Space                  Scache
40        About to Test Stack Area                RAM
41        Done Testing Stack Area                 Scache, RAM
45        About to enter Slave Launch Loop        Master CPU
46        Received Launch Interrupt
47        Calling Launched Function               RAM, Scache
48        Launched Function Returned
4a        About to Init Hub MD & SIMM Controls    Hub
4b        About to Probe & Config Memory Size     RAM
4c        About to Init PCF8512C Chip             MSC, Midplane
4d        Done Init - PCF8512C Chip               MSC
4f        About to Discover Hub I/O               Hub
50        About to Write Hub Config info
51        About to Write Router Config info
52        About to Init Hub I/O                   Hub
53        About to Probe I/O for Console          Base I/O, Bridge, XBow
54        Probe I/O for Console Success           Base I/O
56        Hub I/O Init Done                       Midplane, I/O card, Base I/O
57        Saved Errors Stored from Reset          Hub
58        Cleared all Error Registers             Hub
59        Enabled Error Checking                  Hub
5a        Done Discovering Hub I/O                Hub
5b        About to Init NMI Handler Area          RAM, Scache
5c        About to Test Hub Interrupts            Hub
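As a rough illustration of how the table above might be consulted once a static code has been read, the Python sketch below holds a handful of its rows in a dictionary. The names `BOOT_PHASES` and `describe_boot_code` are hypothetical, and only a few entries are included for brevity; a complete lookup would carry every row of the table.

```python
# A few rows copied from the boot-progress table:
# code -> (boot phase, suspected point of failure).
# None means the table lists no suspected point of failure for that code.
BOOT_PHASES = {
    "00": ("System Reset", "CPU"),
    "0c": ("Invalidate Secondary Cache", "Scache"),
    "0d": ("Succeed - Jump to Main", None),
    "4b": ("About to Probe & Config Memory Size", "RAM"),
    "5c": ("About to Test Hub Interrupts", "Hub"),
}


def describe_boot_code(code):
    """Return a one-line description of a static boot-progress LED code."""
    code = code.lower()
    if code not in BOOT_PHASES:
        return f"{code}: not in this partial table"
    phase, suspect = BOOT_PHASES[code]
    if suspect is None:
        # Per the article, such failures need additional diagnostics to isolate.
        return f"{code}: {phase} (no specific component isolated)"
    return f"{code}: {phase} (suspect: {suspect})"


print(describe_boot_code("0d"))
print(describe_boot_code("4B"))
```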
Fatal Node Board Error Codes
If a node board suffers a fatal error during PROM diagnostics and is disabled, the LEDs of each CPU on that node board record the failure code. The failure codes displayed on the node board LEDs can be read using the information in the "Decoding the Node Board LED Display" section of this article.
LED Code  Reason for Failure                    Suspected Point of Failure
81        CoProcessor Failed Register Test      CPU
83        Pri Instruction Cache Failed Test     CPU
84        Pri Data Cache Failed Test            CPU
85        Secondary Cache Failed Test           Scache
86        CPU Disabled by Another CPU           CPU
87        Real-Time Counter Broken              CPU
8c        General Exception
91        Hub Local Failed                      CPU, Hub
93        Some Node Not Premium (>32 CPUs)      Directory Memory
98        Node has no Local memory              No RAM or Disabled RAM
9a        CPU is Disabled                       CPU Disabled
9b        Memory Download Failed                RAM, PROM
9e        Failed Writing Hub Config Info        RAM
9f        Failed Writing Router Config Info     RAM
a0        Hub I/O Init Failed                   Hub
a1        Node failed Init                      RAM
a4        Hub Chip Failed                       Hub
a5        Router Chip Failed                    Router
a6        Waiting for Reset To Go               Hub
a7        LLP Failed After Reset                Hub
a8        LLP Never Up After Reset              Hub
a9        Node Board - No Good Local Memory     No RAM or Disabled RAM
ab        Network Discovery Failed
ac        NASID Calculation Failed              CrayLink Cabling Error
ad        Route Calculation Failed              CrayLink Cabling Error
ae        Route Distribution Failed             Check Router LEDs for Error
af        NASID Distribution Failed             CrayLink Cabling Error
b0        Master Not Assigned NASID             Check Router LEDs for Error
b1        Module ID Arbitration Failed          MSC
b2        Origin2000 Craylinked to Origin200    Illegal Configuration
b3        Partition Config Error                User Error
Node Board Early Exception LED Codes
If an exception occurs in the PROM code execution before the exception can be displayed by normal means, the CPU LEDs will begin a blinking error code. As the LEDs flash, they will alternate between flashing all eight LEDs and flashing only the LEDs that indicate the error code.
LED Code  Exception                 Suspected Point of Failure
f2        General Exception         CPU
f3        ECC Exception             CPU
f4        TLB Exception             CPU
f5        XTLB Exception            CPU
f6        Unimplemented Exception   CPU
f7        Cache Error Exception     CPU
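The three code tables in this article cover separate parts of the hexadecimal range, so a decoded value can be sorted into a category before looking it up. The following Python sketch shows one way to do that. The function name `classify_led_code` and the category labels are assumptions for illustration; the code sets are copied from the tables. The sketch looks only at the code value, so it does not take into account whether the display is static or blinking.

```python
# Codes listed in the fatal node board error table.
FATAL_CODES = {
    0x81, 0x83, 0x84, 0x85, 0x86, 0x87, 0x8C, 0x91, 0x93, 0x98, 0x9A, 0x9B,
    0x9E, 0x9F, 0xA0, 0xA1, 0xA4, 0xA5, 0xA6, 0xA7, 0xA8, 0xA9, 0xAB, 0xAC,
    0xAD, 0xAE, 0xAF, 0xB0, 0xB1, 0xB2, 0xB3,
}
# Codes listed in the early exception table (shown as a blinking display).
EARLY_EXCEPTION_CODES = {0xF2, 0xF3, 0xF4, 0xF5, 0xF6, 0xF7}


def classify_led_code(code):
    """Sort a two-digit hex LED code into the table that describes it."""
    value = int(code, 16)
    if not 0x00 <= value <= 0xFF:
        raise ValueError("LED codes are two hexadecimal digits")
    if value <= 0x7F:
        return "boot progress"  # 00 through 7f track PROM execution
    if value in FATAL_CODES:
        return "fatal node board error"
    if value in EARLY_EXCEPTION_CODES:
        return "early exception"
    return "unlisted"


for code in ("0d", "a5", "f4"):
    print(code, classify_led_code(code))
```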
Post Initialization LED Displays
After the CPUs have completed initialization they display a different set of LED patterns.
Prior to IRIX booting, the master CPU alternates between displaying 55 and 00 (see "Decoding the Node Board LED Display" for additional information).
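To recognize the 55/00 alternation on the board, it helps to translate a code back into an LED pattern. The sketch below reverses the decoding rule (binary 0 is illuminated, binary 1 is dark). It assumes these post-initialization codes use the same encoding as the boot codes, as the cross-reference suggests; the function name `led_pattern` is hypothetical.

```python
def led_pattern(code):
    """Return the state of eight LEDs, top to bottom, for a two-digit hex code."""
    if len(code) != 2:
        raise ValueError("expected a two-digit hexadecimal code")
    bits = format(int(code, 16), "08b")
    # Inverted convention from the article: binary 0 = on, binary 1 = off.
    return ["on" if bit == "0" else "off" for bit in bits]


for code in ("55", "00"):
    print(code, "-".join(led_pattern(code)))
# prints:
# 55 on-off-on-off-on-off-on-off
# 00 on-on-on-on-on-on-on-on
```

Under that assumption, the master CPU's column would alternate between every other LED lit (starting with the top one) and all eight LEDs lit.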
After IRIX loads, the bottom (eighth) LED is used to indicate the CPU heartbeat. LEDs 1 through 7 will progressively illuminate from bottom to top to indicate CPU activity.