Origin2000/Onyx2 Diagnostics: Node Board LEDs

From Nekochan
Revision as of 14:42, 18 February 2008 by Recondas (Talk | contribs) (Origin2000/Onyx2 Node Board LEDs)

Origin2000/Onyx2 Node Board LEDs

There is a series of LEDs on the bulkhead of Origin2000/Onyx2 processor node boards. As the system is powered up, the LEDs display information as the system executes boot-time PROM code. If the system hangs or cannot complete execution of the PROM code, the LED display can be used as a diagnostic tool.

Each Origin2000 or Onyx2 node board has two columns of LEDs. Each column has eight LEDs; the LEDs in the left column are assigned to CPU A, the right column to CPU B.

The following diagram represents the position and function of the node board LEDs. The descriptions "Binary Group 1 and 2", "Processor Activity" and "Processor Heartbeat" apply to the LEDs in either column:

          CPU A    CPU B
          ---X      X--
Binary  _ |  X      X  |
Group 1   |  X      X  |
          ---X      X  |
                       |--Processor Activity
          ---X      X  |
Binary  _ |  X      X  |
Group 2   |  X      X--
          ---X      X-----Processor Heartbeat (after IRIX loads)

The first seven pairs of LEDs indicate processor activity during PROM code execution and after the operating system (IRIX) loads. The eighth pair of LEDs indicates the processor heartbeat, and is only active after IRIX loads.

Decoding the Node Board LED Display

If the system does not successfully execute PROM code, the coded pattern of LEDs illuminated on the node board can be used as a diagnostic tool to determine the failure point.

The LEDs should be read one column (CPU) at a time. Begin at the top of the column of LEDs, and read the first four LEDs as a group (marked as "Binary Group 1" in the diagram). Assign a binary numeric value to each LED, where an illuminated LED receives a value of "0", and an inactive LED receives a value of "1".

Note that this assignment of binary values is the reverse of typical "on"/"off" value assignments.

So if the first four LEDs were off, this pattern would be assigned the value 1111. Conversely, if all four LEDs were on, the value would be 0000. If the LED illumination pattern was on-off-off-on, the value would be 0110.
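
The inverted convention is easy to get backwards, so a minimal Python sketch of the per-group step may help. The function name group_to_bits and the "on"/"off" strings are illustrative choices, not part of any SGI tool:

```python
def group_to_bits(leds):
    """Convert four LED states (listed top to bottom) to a binary string.

    Each state is the string "on" or "off". Following the convention
    described above, a lit LED reads as 0 and a dark LED reads as 1.
    """
    if len(leds) != 4:
        raise ValueError("a binary group has exactly four LEDs")
    bits = []
    for state in leds:
        if state not in ("on", "off"):
            raise ValueError(f"unknown LED state: {state!r}")
        bits.append("0" if state == "on" else "1")
    return "".join(bits)


print(group_to_bits(["off", "off", "off", "off"]))  # 1111
print(group_to_bits(["on", "on", "on", "on"]))      # 0000
print(group_to_bits(["on", "off", "off", "on"]))    # 0110
```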

Move to the second set of four LEDs in the same processor column (marked as "Binary Group 2" in the diagram) and apply the same binary values used with the first group of four.

Assign each group of binary numbers a hexadecimal value using the following list of equivalents:

  Binary      Hexadecimal
   Value          Value
   0000            0
   0001            1
   0010            2
   0011            3
   0100            4
   0101            5
   0110            6
   0111            7 
   1000            8
   1001            9
   1010            a
   1011            b
   1100            c
   1101            d
   1110            e
   1111            f

As an example, if the first group of LEDs displayed as on-on-on-on, the binary value would be 0000 and the first hexadecimal value would be 0. If the second group of LEDs displayed off-off-on-off, the binary value would be 1101 and the second hexadecimal value would be d. The first and second hexadecimal values are read in sequence to determine the PROM code execution point. In the previous example, the LED code would be read as 0d.
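
The whole procedure for one CPU column can be sketched in a few lines of Python. This is only an illustration of the steps described above; the name decode_column and the "on"/"off" input format are assumptions made for the example:

```python
def decode_column(leds):
    """Decode one CPU column (eight LED states, top to bottom) to a hex code.

    Binary Group 1 (the top four LEDs) becomes the first hex digit and
    Binary Group 2 (the bottom four) becomes the second.
    """
    if len(leds) != 8:
        raise ValueError("a column has exactly eight LEDs")
    value = 0
    for state in leds:
        if state not in ("on", "off"):
            raise ValueError(f"unknown LED state: {state!r}")
        # Shift in one bit per LED: lit = 0, dark = 1 (inverted convention).
        value = (value << 1) | (0 if state == "on" else 1)
    return f"{value:02x}"


# The worked example: on-on-on-on followed by off-off-on-off.
column = ["on", "on", "on", "on", "off", "off", "on", "off"]
print(decode_column(column))  # 0d
```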

Knowing the location of certain system components is helpful during diagnosis. For instance, the CPU, Scache and Hub chip are all located on the node board. The Bridge chip is on the Base I/O board (IO6 or IO6G) or other I/O board. The Crossbow (XBOW) chip is on the midplane.

Diagnosing PROM Code Boot Progress

If a system, node board or CPU does not successfully complete PROM code initialization, the static value displayed on the respective LEDs can be used to determine the point at which the node board or CPU failed.

During the initialization process there are also times when the node boards execute certain aspects of the PROM code sequentially. As each node board executes the code in turn, the remaining node boards wait for that board to signal completion. If the node board hangs while executing that aspect of the PROM code, the remaining boards continue to wait for the completion signal. Because the rest of the node boards continue to wait for a completion signal, this might appear as a complete system hang. The LEDs on the node boards can be read to determine which have actually failed and which are merely waiting.

Hexadecimal values between 00 and 7f indicate the progress of PROM code execution. Values within this range that do not appear in the table below are unused.

If a suspected point of failure is not listed, the cause cannot be isolated to a specific component without additional proprietary diagnostics.

   LED Code      Boot Phase                   Suspected Point of Failure
      00         System Reset                          CPU
      01         Init CPU                              CPU
      02         Test CPU                              CPU
      03         Run TLB                               CPU
      04         Test Pri Instruction Cache            CPU
      05         Test Pri Data Cache                   CPU
      06         Test Secondary Cache                  CPU
      07         Flush all Caches                      CPU
      0a         Invalidate Pri Inst Cache             CPU
      0b         Invalidate Pri Data Cache             CPU 
      0c         Invalidate Secondary Cache            Scache
      0d         Succeed - Jump to Main                  
      0e         About to increase PROM Access Speed   PROM
      0f         Increased PROM Access Speed           PROM
      10         Init Pri data Cache                   CPU
      11         Init Pri Instruction Cache            CPU 
      12         Init CPU COP0 Registers               CPU
      13         Flush TLB                             CPU 
      1a         Probe for MSC                         MSC, nodeboard, midplane
      1b         Probe for Junk UART                   MSC, nodeboard, midplane 
      1c         Done with MSC Probe                   MSC, UART
      1d         About to Init UART                    MSC
      1e         Done with UART Init                   UART
      20         Start Power on Diagnostics (POD)                     
      21         About to enter POD mode C portion                  
      22         About to enter POD prompt loop
      23         About to enter POD mode (assembler)
      24         Local CPU (A/B) Arbitration           CPU
      25         Init Secondary Cache                  Scache
      28         About to perform 1st Local Barrier    CPU, nodeboard hub
      2a         Config DEX mode - Stack and Data      CPU, Scache
      2b         Reached Main                          CPU 
      38         1st Local Barrier Succeeded                  
      3c         About to Jump to UALIAS Space         RAM
      3d         Jumped to UALIAS Space                RAM
      3e         About to Jump to Cached Space         Scache
      3f         Jumped to Cached Space                Scache
      40         About to Test Stack Area              RAM 
      41         Done Testing Stack Area               Scache, RAM
      45         About to enter Slave Launch Loop      Master CPU
      46         Received Launch Interrupt
      47         Calling Launched Function             RAM, Scache
      48         Launched Function Returned
      4a         About to Init Hub MD & SIMM Controls  Hub
      4b         About to Probe & Config Memory Size   RAM
      4c         About to Init PCF8512C Chip           MSC, Midplane
      4d         Done Init - PCF8512C Chip             MSC
      4f         About to Discover Hub I/O             Hub
      50         About to Write Hub Config info        
      51         About to Write Router Config info
      52         About to Init Hub I/O                 Hub
      53         About to Probe I/O for Console        Base I/O, Bridge, XBow
      54         Probe I/O for Console Success         Base I/O
      56         Hub I/O Init Done                     Midplane, I/O card, Base I/O
      57         Saved Errors Stored from Reset        Hub
      58         Cleared all Error Registers           Hub
      59         Enabled Error Checking                Hub
      5a         Done Discovering Hub I/O              Hub
      5b         About to Init NMI Handler Area        RAM, Scache
      5c         About to Test Hub Interrupts          Hub  
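
For quick reference at a console, the table lends itself to a simple lookup. The sketch below is a hypothetical helper, not an SGI utility; it carries only a handful of rows copied from the table above, and the remaining rows would be added the same way:

```python
# A few rows from the boot progress table: code -> (phase, suspected failure).
# None means the table lists no suspected point of failure for that code.
BOOT_PHASES = {
    0x00: ("System Reset", "CPU"),
    0x06: ("Test Secondary Cache", "CPU"),
    0x0d: ("Succeed - Jump to Main", None),
    0x4a: ("About to Init Hub MD & SIMM Controls", "Hub"),
    0x4b: ("About to Probe & Config Memory Size", "RAM"),
}


def describe_boot_code(code):
    """Return a one-line description for a two-digit hex LED code."""
    value = int(code, 16)
    if not 0x00 <= value <= 0x7f:
        return f"{code}: outside the boot progress range 00-7f"
    entry = BOOT_PHASES.get(value)
    if entry is None:
        return f"{code}: not in this excerpt of the table"
    phase, suspect = entry
    if suspect is None:
        return f"{code}: {phase} (no specific component listed)"
    return f"{code}: {phase} (suspect: {suspect})"


print(describe_boot_code("4b"))
print(describe_boot_code("0d"))
```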

Fatal Node Board Error Codes

If a node board suffers a fatal error during PROM diagnostics and is disabled, the LEDs of each CPU on that node board record the failure code. The failure codes displayed on the node board LEDs can be read using the information in the "Decoding the Node Board LED Display" section of this article.

   LED Code      Reason for Failure           Suspected Point of Failure
      81         CoProcessor Failed Register Test      CPU
      83         Pri Instruction Cache Failed Test     CPU
      84         Pri Data Cache Failed Test            CPU
      85         Secondary Cache Failed Test           Scache
      86         CPU Disabled by Another CPU           CPU
      87         Real-Time Counter Broken              CPU
      8c         General Exception
      91         Hub Local Failed                      CPU, Hub
      93         Some Node Not Premium (>32 CPUs)      Directory Memory
      98         Node has no Local memory              No RAM or Disabled RAM 
      9a         CPU is Disabled                       CPU Disabled
      9b         Memory Download Failed                RAM, PROM
      9e         Failed Writing Hub Config Info        RAM
      9f         Failed Writing Router Config Info     RAM
      a0         Hub I/O Init Failed                   Hub
      a1         Node failed Init                      RAM 
      a4         Hub Chip Failed                       Hub
      a5         Router Chip Failed                    Router
      a6         Waiting for Reset To Go               Hub
      a7         LLP Failed After Reset                Hub
      a8         LLP Never Up After Reset              Hub
      a9         Node Board - No Good Local Memory     No RAM or Disabled RAM  
      ab         Network Discovery Failed
      ac         NASID Calculation Failed              CrayLink Cabling Error 
      ad         Route Calculation Failed              CrayLink Cabling Error
      ae         Route Distribution Failed             Check Router LEDs for Error
      af         NASID Distribution Failed             CrayLink Cabling Error
      b0         Master Not Assigned NASID             Check Router LEDs for Error
      b1         Module ID Arbitration Failed          MSC
      b2         Origin2000 Craylinked to Origin200    Illegal Configuration
      b3         Partition Config Error                User Error

Node Board Early Exception LED Codes

If an exception occurs in the PROM code execution before the exception can be displayed by normal means, the CPU LEDs will begin a blinking error code. As the LEDs flash, they will alternate between flashing all eight LEDs and flashing only the LEDs that indicate the error code.

   LED Code      Exception                    Suspected Point of Failure
      f2         General Exception                     CPU
      f3         ECC Exception                         CPU
      f4         TLB Exception                         CPU
      f5         XTLB Exception                        CPU
      f6         Unimplemented Exception               CPU
      f7         Cache Error Exception                 CPU
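
Once a code has been decoded, the first question is usually which table to consult. The Python sketch below sorts a code by range. Only the 00-7f boundary is stated in this article; the 81-b3 and f2-f7 boundaries are assumptions taken from the lowest and highest codes listed in the two error tables, and the function name is illustrative:

```python
def classify_led_code(code):
    """Suggest which table in this article a two-digit hex code belongs to."""
    value = int(code, 16)
    if not 0x00 <= value <= 0xff:
        raise ValueError("an LED code is two hexadecimal digits")
    if value <= 0x7f:
        # Range stated in the article for PROM boot progress codes.
        return "boot progress"
    if 0x81 <= value <= 0xb3:
        # Assumed range: spans the codes listed in the fatal error table.
        return "fatal node board error"
    if 0xf2 <= value <= 0xf7:
        # Assumed range: spans the codes listed in the early exception table.
        return "early exception"
    return "not covered by the tables in this article"


for code in ("0d", "a5", "f4", "c0"):
    print(code, "->", classify_led_code(code))
```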

Post Initialization LED Displays

After the CPUs have completed initialization they display a different set of LED patterns.

Prior to IRIX booting, the master CPU will alternate the display of 55 and 00 (see "Decoding the Node Board LED Display" for additional information).
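
To picture what the 55 and 00 patterns look like on the board, the decoding can be run in reverse. This is a sketch that assumes the same inverted convention (0 = lit, 1 = dark) applies to these codes, as the cross-reference suggests; code_to_leds is an illustrative name:

```python
def code_to_leds(code):
    """Expand a two-digit hex code into eight LED states, top to bottom."""
    value = int(code, 16)
    if not 0x00 <= value <= 0xff:
        raise ValueError("an LED code is two hexadecimal digits")
    bits = f"{value:08b}"
    # Inverted convention: a 0 bit is a lit LED, a 1 bit is a dark LED.
    return ["on" if bit == "0" else "off" for bit in bits]


print("-".join(code_to_leds("55")))  # on-off-on-off-on-off-on-off
print("-".join(code_to_leds("00")))  # on-on-on-on-on-on-on-on
```

Under that assumption, the master CPU column alternates between every other LED lit and all eight LEDs lit.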

After IRIX loads, the bottom (eighth) LED is used to indicate the CPU heartbeat. LEDs 1 through 7 will progressively illuminate from bottom to top to indicate CPU activity.