Origin2000/Onyx2 Diagnostics: Node Board LEDs
Origin2000/Onyx2 Node Board LEDs
There is a series of LEDs on the bulkhead of each Origin2000/Onyx2 processor node board. As the system powers up, the LEDs display status information while the system executes boot-time PROM code. If the system hangs or cannot complete execution of the PROM code, the LED display can be used as a diagnostic tool.
Each Origin2000 or Onyx2 node board has two columns of LEDs. Each column has eight LEDs; the LEDs in the left column are assigned to CPU A, the right column to CPU B.
The following diagram represents the position and function of the node board LEDs. The descriptions "Binary Group 1 and 2", "Processor Activity" and "Processor Heartbeat" apply to the LEDs in either column:
          CPU A     CPU B
         ---X         X--
Binary _ |  X         X  |
Group 1  |  X         X  |
         ---X         X  |
                         |--Processor Activity
         ---X         X  |
Binary _ |  X         X  |
Group 2  |  X         X--
         ---X         X-----Processor Heartbeat (after IRIX loads)
The first seven pairs of LEDs indicate processor activity during PROM code execution and after the operating system (IRIX) loads. The eighth pair of LEDs indicates the processor heartbeat and is only active after IRIX loads.
Decoding the Node Board LED Display
If the system does not successfully execute PROM code, the coded pattern of LEDs illuminated on the node board can be used as a diagnostic tool to determine the failure point.
The LEDs should be read one column (CPU) at a time. Begin at the top of the column of LEDs, and read the first four LEDs as a group (marked as "Binary Group 1" in the diagram). Assign a binary value to each LED, where an illuminated LED receives a value of "0", and an inactive LED receives a value of "1".
Note that this assignment of binary values is the reverse of typical "on"/"off" value assignments.
So if the first four LEDs were off, this pattern would be assigned the value 1111. Conversely, if all four LEDs were on, the value would be 0000. If the LED illumination pattern was on-off-off-on, the value would be 0110.
Move to the second set of four LEDs in the same processor column (marked as "Binary Group 2" in the diagram) and apply the same binary values used with the first group of four.
Assign each group of binary numbers a hexadecimal value using the following list of equivalents:
Binary Hexadecimal
Value Value
0000 0
0001 1
0010 2
0011 3
0100 4
0101 5
0110 6
0111 7
1000 8
1001 9
1010 a
1011 b
1100 c
1101 d
1110 e
1111 f
As an example, if the first group of LEDs displayed as on-on-on-on, the binary value would be 0000 and the first hexadecimal value would be 0. If the second group of LEDs displayed off-off-on-off, the binary value would be 1101 and the second hexadecimal value would be d. The first and second hexadecimal values are read in sequence to determine the PROM code execution point. In the previous example, the LED code would be read as 0d.
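The decoding steps above can be sketched in a few lines of Python. This is only an illustration of the procedure described in this section; the function name `decode_column` and the use of `True`/`False` for lit/dark LEDs are assumptions made for the example, not part of any SGI tool.

```python
def decode_column(leds):
    """Decode one CPU column of LED states into a two-digit hex code.

    `leds` holds eight values read from top to bottom of the column;
    a truthy value means the LED is illuminated. (Hypothetical helper.)
    """
    if len(leds) != 8:
        raise ValueError("expected exactly eight LED states")
    # Inverted logic: an illuminated LED is 0, a dark LED is 1.
    bits = "".join("0" if lit else "1" for lit in leds)
    group1 = int(bits[:4], 2)  # Binary Group 1 (top four LEDs)
    group2 = int(bits[4:], 2)  # Binary Group 2 (bottom four LEDs)
    return f"{group1:x}{group2:x}"


ON, OFF = True, False
# The worked example from the text: on-on-on-on, then off-off-on-off.
print(decode_column([ON, ON, ON, ON, OFF, OFF, ON, OFF]))  # prints 0d
```

Reading each CPU column separately and passing it through such a helper gives one code per CPU, which can then be looked up in the tables that follow.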
Knowing the location of certain system components is helpful during diagnosis. For instance, the CPU, Scache and Hub chip are all located on the node board. The Bridge chip is on the Base I/O board (IO6 or IO6G) or other I/O board. The Crossbow (XBOW) chip is on the midplane.
Diagnosing PROM Code Boot Progress
If a system, node board or CPU does not successfully complete PROM code initialization, the static value displayed on the respective LEDs can be used to determine the point at which the node board or CPU failed.
During the initialization process there are also times when the node boards execute certain parts of the PROM code sequentially. As each node board executes the code in turn, the remaining node boards wait for that board to signal completion. If the node board hangs while executing that part of the PROM code, the remaining boards continue to wait for the completion signal, which might appear as a complete system hang. The LEDs on the node boards can be read to determine which boards have actually failed and which are merely waiting.
Hexadecimal values between 00 and 7f indicate the progress of PROM code execution. Values within this range that are not listed are unused.
If a suspected point of failure is not listed, the cause cannot be isolated to a specific component without additional proprietary diagnostics.
LED Code  Boot Phase                            Suspected Point of Failure
00        System Reset                          CPU
01        Init CPU                              CPU
02        Test CPU                              CPU
03        Run TLB                               CPU
04        Test Pri Instruction Cache            CPU
05        Test Pri Data Cache                   CPU
06        Test Secondary Cache                  CPU
07        Flush all Caches                      CPU
0a        Invalidate Pri Inst Cache             CPU
0b        Invalidate Pri Data Cache             CPU
0c        Invalidate Secondary Cache            Scache
0d        Succeed - Jump to Main
0e        About to increase PROM Access Speed   PROM
0f        Increased PROM Access Speed           PROM
10        Init Pri Data Cache                   CPU
11        Init Pri Instruction Cache            CPU
12        Init CPU COP0 Registers               CPU
13        Flush TLB                             CPU
1a        Probe for MSC                         MSC, nodeboard, midplane
1b        Probe for Junk UART                   MSC, nodeboard, midplane
1c        Done with MSC Probe                   MSC, UART
1d        About to Init UART                    MSC
1e        Done with UART Init                   UART
20        Start Power on Diagnostics (POD)
21        About to enter POD mode (C portion)
22        About to enter POD prompt loop
23        About to enter POD mode (assembler)
24        Local CPU (A/B) Arbitration           CPU
25        Init Secondary Cache                  Scache
28        About to perform 1st Local Barrier    CPU, nodeboard hub
2a        Config DEX mode - Stack and Data      CPU, Scache
2b        Reached Main                          CPU
38        1st Local Barrier Succeeded
3c        About to Jump to UALIAS Space         RAM
3d        Jumped to UALIAS Space                RAM
3e        About to Jump to Cached Space         Scache
3f        Jumped to Cached Space                Scache
40        About to Test Stack Area              RAM
41        Done Testing Stack Area               Scache, RAM
45        About to enter Slave Launch Loop      Master CPU
46        Received Launch Interrupt
47        Calling Launched Function             RAM, Scache
48        Launched Function Returned
4a        About to Init Hub MD & SIMM Controls  Hub
4b        About to Probe & Config Memory Size   RAM
4c        About to Init PCF8512C Chip           MSC, Midplane
4d        Done Init - PCF8512C Chip             MSC
4f        About to Discover Hub I/O             Hub
50        About to Write Hub Config info
51        About to Write Router Config info
52        About to Init Hub I/O                 Hub
53        About to Probe I/O for Console        Base I/O, Bridge, XBow
54        Probe I/O for Console Success         Base I/O
56        Hub I/O Init Done                     Midplane, I/O card, Base I/O
57        Saved Errors Stored from Reset        Hub
58        Cleared all Error Registers           Hub
59        Enabled Error Checking                Hub
5a        Done Discovering Hub I/O              Hub
5b        About to Init NMI Handler Area        RAM, Scache
5c        About to Test Hub Interrupts          Hub
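When many boards have to be checked, the table lookup can be scripted. The following is a minimal sketch that holds only a handful of rows copied from the table above; the dictionary `BOOT_PROGRESS` and the function `describe_progress_code` are hypothetical names chosen for this example, and the remaining rows would need to be filled in from the table.

```python
# Excerpt of the boot progress table: code -> (boot phase, suspected failure).
# None means the table lists no suspected point of failure for that code.
BOOT_PROGRESS = {
    "00": ("System Reset", "CPU"),
    "06": ("Test Secondary Cache", "CPU"),
    "0d": ("Succeed - Jump to Main", None),
    "4a": ("About to Init Hub MD & SIMM Controls", "Hub"),
    "4b": ("About to Probe & Config Memory Size", "RAM"),
}


def describe_progress_code(code):
    """Return a one-line description of a two-digit hex LED code."""
    value = int(code, 16)
    code = f"{value:02x}"  # normalize case and width, e.g. "D" -> "0d"
    if not 0x00 <= value <= 0x7F:
        return f"{code}: outside the 00-7f boot progress range"
    entry = BOOT_PROGRESS.get(code)
    if entry is None:
        return f"{code}: not in this excerpt (unused, or not copied from the table)"
    phase, suspect = entry
    if suspect is None:
        return f"{code}: {phase} (no suspected point of failure listed)"
    return f"{code}: {phase}; suspected point of failure: {suspect}"


print(describe_progress_code("4B"))
```

A code that is missing from the excerpt is reported as such rather than guessed at, which mirrors the rule that unlisted values in this range are unused.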
Fatal Node Board Error Codes
If a node board suffers a fatal error during PROM diagnostics and is disabled, the LEDs of each CPU on that node board record the failure code. The failure codes displayed on the node board LEDs can be read using the information in the "Decoding the Node Board LED Display" section of this article.
LED Code  Reason for Failure                  Suspected Point of Failure
81        CoProcessor Failed Register Test    CPU
83        Pri Instruction Cache Failed Test   CPU
84        Pri Data Cache Failed Test          CPU
85        Secondary Cache Failed Test         Scache
86        CPU Disabled by Another CPU         CPU
87        Real-Time Counter Broken            CPU
8c        General Exception
91        Hub Local Failed                    CPU, Hub
93        Some Node Not Premium (>32 CPUs)    Directory Memory
98        Node has no Local memory            No RAM or Disabled RAM
9a        CPU is Disabled                     CPU Disabled
9b        Memory Download Failed              RAM, PROM
9e        Failed Writing Hub Config Info      RAM
9f        Failed Writing Router Config Info   RAM
a0        Hub I/O Init Failed                 Hub
a1        Node failed Init                    RAM
a4        Hub Chip Failed                     Hub
a5        Router Chip Failed                  Router
a6        Waiting for Reset To Go             Hub
a7        LLP Failed After Reset              Hub
a8        LLP Never Up After Reset            Hub
a9        Node Board - No Good Local Memory   No RAM or Disabled RAM
ab        Network Discovery Failed
ac        NASID Calculation Failed            CrayLink Cabling Error
ad        Route Calculation Failed            CrayLink Cabling Error
ae        Route Distribution Failed           Check Router LEDs for Error
af        NASID Distribution Failed           CrayLink Cabling Error
b0        Master Not Assigned NASID           Check Router LEDs for Error
b1        Module ID Arbitration Failed        MSC
b2        Origin2000 Craylinked to Origin200  Illegal Configuration
b3        Partition Config Error              User Error
Node Board Early Exception LED Codes
If an exception occurs during PROM code execution before the exception can be displayed by normal means, the CPU LEDs begin blinking an error code. The display alternates between flashing all eight LEDs and flashing only the LEDs that indicate the error code.
LED Code  Exception                 Suspected Point of Failure
f2        General Exception         CPU
f3        ECC Exception             CPU
f4        TLB Exception             CPU
f5        XTLB Exception            CPU
f6        Unimplemented Exception   CPU
f7        Cache Error Exception     CPU
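As a first triage step, a decoded value can be sorted into one of the three families of codes described so far. The sketch below does this by membership in the code lists from this article's tables; the names `FATAL_CODES`, `EARLY_EXCEPTION_CODES` and `classify_led_code` are assumptions for the example, and a value that appears in none of these tables is simply reported as not listed.

```python
# Codes copied from the fatal error table in this article.
FATAL_CODES = {
    0x81, 0x83, 0x84, 0x85, 0x86, 0x87, 0x8C, 0x91, 0x93, 0x98, 0x9A, 0x9B,
    0x9E, 0x9F, 0xA0, 0xA1, 0xA4, 0xA5, 0xA6, 0xA7, 0xA8, 0xA9, 0xAB, 0xAC,
    0xAD, 0xAE, 0xAF, 0xB0, 0xB1, 0xB2, 0xB3,
}
# Codes from the early exception table: f2 through f7.
EARLY_EXCEPTION_CODES = set(range(0xF2, 0xF8))


def classify_led_code(code):
    """Classify a two-digit hex LED code into one of the documented families."""
    value = int(code, 16)
    if not 0x00 <= value <= 0xFF:
        raise ValueError("LED codes are two hex digits (00 to ff)")
    if value <= 0x7F:
        return "boot progress"
    if value in FATAL_CODES:
        return "fatal node board error"
    if value in EARLY_EXCEPTION_CODES:
        return "early exception"
    return "not listed"


for code in ("0d", "a4", "f3"):
    print(code, "->", classify_led_code(code))
```

Keep in mind that early exception codes are shown as a blinking pattern, so the frame read from the LEDs should be the one showing the error code rather than the all-LEDs flash.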
Post Initialization LED Displays
After the CPUs have completed initialization they display a different set of LED patterns.
Prior to IRIX booting, the master CPU will alternate the display of 55 and 00 (see "Decoding the Node Board LED Display" for additional information).
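To recognize this alternating display on the board, it helps to turn the two codes back into on/off patterns. The sketch below applies the inverted convention from the decoding section (0 is lit, 1 is dark) in the reverse direction; `led_pattern` is a hypothetical helper name used only for this illustration.

```python
def led_pattern(code):
    """Return the eight LED states, top to bottom, for a two-digit hex code."""
    value = int(code, 16)
    if not 0x00 <= value <= 0xFF:
        raise ValueError("LED codes are two hex digits (00 to ff)")
    bits = f"{value:08b}"  # Binary Group 1 first, then Binary Group 2
    # Inverted logic: a 0 bit is an illuminated LED, a 1 bit is a dark LED.
    return ["on" if bit == "0" else "off" for bit in bits]


for code in ("55", "00"):
    print(code, "-".join(led_pattern(code)))
# 55 on-off-on-off-on-off-on-off
# 00 on-on-on-on-on-on-on-on
```

Under this reading, the master CPU column switches between every other LED lit and all eight LEDs lit.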
After IRIX loads, the bottom (eighth) LED is used to indicate the CPU heartbeat. LEDs 1 through 7 will progressively illuminate from bottom to top to indicate CPU activity.
Full list of codes
LED Value Name Description
0x01 INITCPU Initialize the general-purpose registers (GPRs), floating-point registers (FPRs), and COP0 registers
0x02 TESTCP1 Test the COP1 registers
0x03 RUNTLB Switch to mapped mode
0x04 TESTICACHE Test the primary instruction cache
0x05 TESTDCACHE Test the primary data cache
0x06 TESTSCACHE Test secondary cache
0x07 FLUSHCACHES Flush all caches
0x0a INVICACHE Invalidate the primary instruction cache
0x0b INVDCACHE Invalidate the primary data cache
0x0c INVSCACHE Invalidate secondary cache
0x0d INMAIN Successfully jumped to the main() function
0x0e SPEEDUP Prepare to increase PROM access speed
0x0f SPEEDUPOK Successfully increased PROM access speed
0x1a MSCPROBE Prepare to probe for presence of the MSC
0x1c DONEPROBE Completed the probe for the presence of the MSC
0x1d UARTINIT Prepare to initialize selected UART
0x1e UARTINITDONE Completed the initialization of the selected UART
0x21 PODLOOP Prepare to enter POD mode (C code portion)
0x22 PODPROMPT Prepare to enter the POD prompt loop
0x23 PODMODE Prepare to enter POD mode (assembly code portion)
0x24 LOCALARB Perform local arbitration (between CPU A and CPU B)
0x28 BARRIER Prepare to perform first local barrier
0x2a MAKESTACK Prepare to configure Dex mode stack and data
0x2b MAIN Code execution has reached the main() function
0x31 NMI Received external nonmaskable interrupt
0x35 RTCINIT Prepare to initialize the HUB real-time counter
0x36 RTCINITDONE Completed the initialization of the HUB real-time counter
0x38 BARRIEROK Successfully completed the first local barrier operation
0x3c JUMPRAMU Prepare to jump to UALIAS space
0x3d JUMPRAMUOK Successfully jumped to UALIAS space
0x3e JUMPRAMC Prepare to jump to cached space
0x3f JUMPRAMCOK Successfully jumped to cached space
0x40 STACKRAM Prepare to test the stack area of memory
0x41 STACKRAMOK Successfully tested the stack area of memory
0x45 LAUNCHLOOP Prepare to enter the slave launch loop
0x46 LAUNCHINTR Received a launch interrupt
0x47 LAUNCHCALL Call the launched() function
0x48 LAUNCHDONE Returned from the launched() function
0x4a MDIRINIT Prepare to initialize the HUB MD and SIMM controls
0x4b MDIRCONFIG Prepare to determine and configure the memory size
0x4c I2CINIT Prepare to initialize the PCF8584 I2C chip
0x4d I2CDONE Completed the initialization of the PCF8584 I2C chip
0x4f IODISCOVER Prepare to discover Hub I/O
0x50 HUB_CONFIG Prepare to write Hub configuration information into the KLCONFIG structure
0x51 ROUTER_CONFIG Prepare to write the router configuration information into the KLCONFIG structure
0x52 INITIO Prepare to initialize the I/O section of the Hub
0x53 CONSOLE_GET Prepare to probe the I/O section of the Hub
0x54 CONSOLE_GET_OK Successfully completed the probe of the I/O section for the console
0x56 INITIODONE Completed the initialization of the I/O section of the Hub
0x57 STASH2 Reset error state saved
0x58 STASH3 Clear Hub error registers
0x59 STASH4 Enable Hub error checking
0x5a IODISCOVER_DONE Completed the discovery of the Hub I/O
0x5b NMI_INIT Prepare to initialize the NMI handler area
0x5c TEST_INTS Prepare to test Hub interrupts
0x5d IORESET Prepare to perform early reset of Hub I/O section
Failure codes
LED Value Name Description
0x81 CP1 Register test failed
0x82 RESTART Restart master was unable to load the BaseIO PROM
0x83 ICACHE Primary instruction cache test failed (The failing FRU is the node board)
0x84 DCACHE Primary data cache test failed (The failing FRU is the node board)
0x85 SCACHE Secondary cache test failed (The failing FRU is the node board)
0x86 KILLED CPU disabled by another node
0x87 RTC Real-time counter not counting
0x91 HUBLOCAL Hub local arbitration failed; ignore this CPU
0x93 PREM_DIR_REQ All nodes must have premium DIMMs for this configuration
0x97 MAINRET Returned from main() function
0x98 NOMEM Node board does not have local memory
0x9a DISABLED CPU disabled by an environment variable
0x9b DOWNLOAD Failure occurred while downloading the PROM code into RAM
0x9c COREDEBUG Boot process cannot set the CORE debug registers
0x9d IODISCOVER Failure occurred during the HUB I/O discovery process
0x9e HUB_CONFIG Failure occurred while writing the HUB information into the KLCONFIG structure
0x9f ROUTER_CONFIG Failure occurred while writing the router information into the KLCONFIG structure
0xa0 HUBIO_INIT Failure occurred while trying to initialize the HUB I/O interface
0xa1 CONFIG_INIT Failure occurred while trying to initialize the KLCONFIG structure
0xa2 RTRCHIP Failure occurred while testing the Router chip
0xa3 LINKDEAD Failure occurred while testing the LLP link
0xa4 HUBBIST Failure occurred while the HUB chip executed the built-in self test (BIST)
0xa5 RTRBIST Failure occurred while the router chip executed the built-in self test (BIST)
0xa6 RESETWAIT Waiting for a reset to occur
0xa7 LLP_FAIL LLP failed after the reset
0xa8 LLP_NORESET LLP never came up after the reset
0xa9 BADMEM Local memory is corrupted
0xab NET_DISCOVER Failure occurred for the Hub network discovery
0xac NASID_CALC Failure occurred for the NASID calculation
0xad ROUTE_CALC Failure occurred for the route calculation
0xae ROUTE_DIST Failure occurred for the route distribution
0xaf NASID_DIST Failure occurred for the NASID distribution
0xb0 NO_NASID Master did not assign a NASID
0xb1 NO_MODULEID Failure occurred for the module ID arbitration
0xb2 MIXED_SN00 Origin 200 system is configured with an Origin 2000 system
0xb3 ERRPART Failure occurred in the partition configuration
0xb4 MODEBIT Failure occurred while copying the processor mode bits
0xb5 BACK_CALC Failure occurred while calculating the midplane frequency