Onyx2/Origin2000 Node boards

From Nekochan
Revision as of 01:48, 6 October 2011 by Regan russell (Talk | contribs)

An Onyx2 node fits on a single 16" by 11" printed circuit board that contains one or two processors, the main memory, the directory memory and the Hub ASIC. The node board plugs into the backplane through a 300-pad CPOP (Compression Pad-on-Pad) connector. The connector actually combines two connections, one to the NUMAlink router network and another to the XIO I/O subsystem.



Each processor and its secondary cache are contained on a HIMM (Horizontal Inline Memory Module) daughter card that plugs into the node board. At the time of introduction, the Onyx2 used the IP27 board, featuring one or two R10000 processors clocked at 180 MHz with 1 MB secondary cache(s). A high-end model with two 195 MHz R10000 processors with 4 MB secondary caches was also available. In February 1998, the IP31 board was introduced with two 250 MHz R10000 processors with 4 MB secondary caches. Later, the IP31 board was upgraded to support two 300, 350 or 400 MHz R12000 processors. The 300 and 400 MHz models had 8 MB L2 caches, while the 350 MHz model had 4 MB L2 caches. Near the end of its life, a variant of the IP31 board that could utilize the 500 MHz R14000 with 8 MB L2 caches was made available.

Known node board CPU speeds

IP27: CPUs are mounted directly to the node board individually.

180 MHz R10000 (Cannot be mixed with node boards of other speeds)

195 MHz R10000

IP31: CPUs are mounted in pairs (along with their respective caches) to a PIMM, a pluggable module which then mounts to the node board.

250 MHz R10000

300 MHz R12000

350 MHz R12000 (Cannot be used in configurations greater than 8 CPUs)

400 MHz R12000

500 MHz R14000

Main memory and directory memory

Each node board can support a maximum of 4 GB of memory through 16 DIMM slots by using proprietary ECC SDRAM DIMMs with capacities of 16, 32, 64 and 256 MB. Because the memory bus is 144 bits wide (128 bits for data and 16 bits for ECC), memory modules are inserted in pairs. Directory memory, which contains information on the contents of remote caches for maintaining cache coherency, must be used in configurations with more than 32 processors as the Onyx2 uses a distributed shared memory model. The directory memory is contained on proprietary DIMMs that are inserted into eight DIMM slots set aside for its use. In configurations with 32 or fewer processors, the directory memory is contained within the main memory.
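The capacity figures above can be checked with a short sketch. The function name is hypothetical, and the rule that both DIMMs of a pair have the same size is an assumption made for illustration; the text only states that modules are inserted in pairs.

```python
# Hypothetical helper; not taken from any SGI tool.
VALID_DIMM_SIZES_MB = {16, 32, 64, 256}  # capacities listed in the text
MAX_SLOTS = 16                           # main-memory DIMM slots per node board


def node_memory_mb(dimms_mb):
    """Return total main memory in MB for a list of installed DIMM sizes."""
    if len(dimms_mb) > MAX_SLOTS:
        raise ValueError("a node board has only 16 main-memory DIMM slots")
    if len(dimms_mb) % 2 != 0:
        raise ValueError("DIMMs must be installed in pairs")
    for size in dimms_mb:
        if size not in VALID_DIMM_SIZES_MB:
            raise ValueError(f"unsupported DIMM size: {size} MB")
    # Assumption: each consecutive pair holds two DIMMs of the same size.
    for first, second in zip(dimms_mb[0::2], dimms_mb[1::2]):
        if first != second:
            raise ValueError("both DIMMs in a pair are assumed to match")
    return sum(dimms_mb)


# Sixteen 256 MB DIMMs give the 4 GB maximum.
print(node_memory_mb([256] * 16))  # 4096
```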



The Hub ASIC interfaces the processors, memory and XIO to the NUMAlink 2 system interconnect. The ASIC contains five major sections: the crossbar (referred to as the "XB"), the I/O interface (referred to as the "II"), the network interface (referred to as the "NI"), the processor interface (referred to as the "PI") and the memory and directory interface (referred to as the "DM"), which also serves as the memory controller. The interfaces communicate with each other via FIFO buffers that are connected to the crossbar. When two processors are connected to the Hub ASIC, the node does not behave in an SMP fashion. Instead, the two processors operate separately and their buses are multiplexed over the single processor interface. This was done to save pins on the Hub ASIC. The Hub ASIC is clocked at 100 MHz and contains 900,000 gates fabricated in a five-layer metal process.

Origin2000/Onyx2 Node Board LEDs

There is a series of LEDs on the bulkhead of Origin2000/Onyx2 processor node boards. As the system is powered up, the LEDs display information as the system executes boot-time PROM code. If the system hangs or cannot complete execution of the PROM code, the LED display can be used as a diagnostic tool.

Each Origin2000 or Onyx2 node board has two columns of LEDs. Each column has eight LEDs; the LEDs in the left column are assigned to CPU A, the right column to CPU B.

The following diagram represents the position and function of the node board LEDs. The descriptions "Binary Group 1 and 2", "Processor Activity" and "Processor Heartbeat" apply to the LEDs in either column:

          CPU A    CPU B
          ---X      X--
Binary  _ |  X      X  |
Group 1   |  X      X  |
          ---X      X  |
                       |--Processor Activity
          ---X      X  |
Binary  _ |  X      X  |
Group 2   |  X      X--
          ---X      X-----Processor Heartbeat (after IRIX loads)

The first seven pairs of LEDs indicate processor activity during PROM code execution and after the operating system (IRIX) loads. The eighth pair of LEDs indicates the processor heartbeat, and is only active after IRIX loads.

Decoding the Node Board LED Display

If the system does not successfully execute PROM code, the coded pattern of LEDs illuminated on the node board can be used as a diagnostic tool to determine the failure point.

The LEDs should be read one column (CPU) at a time. Begin at the top of the column of LEDs, and read the first four LEDs as a group (marked as "Binary Group 1" in the diagram). Assign a binary numeric value to each LED, where an illuminated LED receives a value of "0", and an inactive LED receives a value of "1".

Note that the assignment of binary values is the reverse of typical "on"/"off" value assignments.

So if the first four LEDs were off, this pattern would be assigned the value 1111. Conversely, if all four LEDs were on, the value would be 0000. If the LED illumination pattern was on-off-off-on, the value would be 0110.

Move to the second set of four LEDs in the same processor column (marked as "Binary Group 2" in the diagram) and apply the same binary values used with the first group of four.

Assign each group of binary numbers a hexadecimal value using the following list of equivalents:

  Binary      Hexadecimal
   Value          Value
   0000            0
   0001            1
   0010            2
   0011            3
   0100            4
   0101            5
   0110            6
   0111            7 
   1000            8
   1001            9
   1010            a
   1011            b
   1100            c
   1101            d
   1110            e
   1111            f

As an example, if the first group of LEDs displayed as on-on-on-on, the binary value would be 0000 and the first hexadecimal value would be 0. If the second group of LEDs displayed off-off-on-off, the binary value would be 1101 and the second hexadecimal value would be d. The first and second hexadecimal values are read in sequence to determine the PROM code execution point. In the previous example, the LED code would be read as 0d.
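The decoding procedure can be sketched in a few lines of Python. The function name and the boolean representation of LED states are illustrative choices, not part of any SGI tool; the sketch only encodes the rule stated above, that an illuminated LED counts as 0 and a dark LED as 1, read from the top of the column.

```python
def decode_led_column(states):
    """Convert eight LED states (top to bottom) into a two-digit hex code.

    states: sequence of eight booleans, True meaning the LED is illuminated.
    An illuminated LED counts as binary 0 and a dark LED as binary 1.
    """
    if len(states) != 8:
        raise ValueError("expected exactly eight LED states")
    bits = "".join("0" if lit else "1" for lit in states)
    return format(int(bits, 2), "02x")


ON, OFF = True, False
# Binary Group 1: on-on-on-on, Binary Group 2: off-off-on-off
print(decode_led_column([ON, ON, ON, ON, OFF, OFF, ON, OFF]))  # 0d
```

Each column is decoded separately, so a two-CPU node board yields two codes, one for CPU A and one for CPU B.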

Knowing the location of certain system components is helpful during diagnosis. For instance, the CPU, Scache and Hub chip are all located on the node board. The Bridge chip is on the Base I/O board (IO6 or IO6G) or other I/O board. The Crossbow (XBOW) chip is on the midplane.

Diagnosing PROM Code Boot Progress

If a system, node board or CPU does not successfully complete PROM code initialization, the static value displayed on the respective LEDs can be used to determine the point at which the node board or CPU failed.

During the initialization process there are also times when the node boards execute certain parts of the PROM code sequentially. As each node board executes the code in turn, the remaining node boards wait for that board to signal completion. If the node board hangs while executing that part of the PROM code, the remaining boards continue to wait for the completion signal, which might appear as a complete system hang. The LEDs on the node boards can be read to determine which boards have actually failed and which are merely waiting.

Hexadecimal values between 00 and 7f are used to indicate the progress of PROM code execution. Values within this range that are not listed are unused.

If a suspected point of failure is not listed, the cause cannot be isolated to a specific component without additional proprietary diagnostics.

   LED Code      Boot Phase                   Suspected Point of Failure
      00         System Reset                          CPU
      01         Init CPU                              CPU
      02         Test CPU                              CPU
      03         Run TLB                               CPU
      04         Test Pri Instruction Cache            CPU
      05         Test Pri Data Cache                   CPU
      06         Test Secondary Cache                  CPU
      07         Flush all Caches                      CPU
      0a         Invalidate Pri Inst Cache             CPU
      0b         Invalidate Pri Data Cache             CPU 
      0c         Invalidate Secondary Cache            Scache
      0d         Succeed - Jump to Main                  
      0e         About to increase PROM Access Speed   PROM
      0f         Increased PROM Access Speed           PROM
      10         Init Pri Data Cache                   CPU
      11         Init Pri Instruction Cache            CPU 
      12         Init CPU COP0 Registers               CPU
      13         Flush TLB                             CPU 
      1a         Probe for MSC                         MSC, nodeboard, midplane
      1b         Probe for Junk UART                   MSC, nodeboard, midplane 
      1c         Done with MSC Probe                   MSC, UART
      1d         About to Init UART                    MSC
      1e         Done with UART Init                   UART
      20         Start Power on Diagnostics (POD)                     
      21         About to enter POD mode C portion                  
      22         About to enter POD prompt loop
      23         About to enter POD mode (assembler)
      24         Local CPU (A/B) Arbitration           CPU
      25         Init Secondary Cache                  Scache
      28         About to perform 1st Local Barrier    CPU, nodeboard hub
      2a         Config DEX mode - Stack and Data      CPU, Scache
      2b         Reached Main                          CPU 
      38         1st Local Barrier Succeeded                  
      3c         About to Jump to UALIAS Space         RAM
      3d         Jumped to UALIAS Space                RAM
      3e         About to Jump to Cached Space         Scache
      3f         Jumped to Cached Space                Scache
      40         About to Test Stack Area              RAM 
      41         Done Testing Stack Area               Scache, RAM
      45         About to enter Slave Launch Loop      Master CPU
      46         Received Launch Interrupt
      47         Calling Launched Function             RAM, Scache
      48         Launched Function Returned
      4a         About to Init Hub MD & SIMM Controls  Hub
      4b         About to Probe & Config Memory Size   RAM
      4c         About to Init PCF8584 Chip            MSC, Midplane
      4d         Done Init - PCF8584 Chip              MSC
      4f         About to Discover Hub I/O             Hub
      50         About to Write Hub Config info        
      51         About to Write Router Config info
      52         About to Init Hub I/O                 Hub
      53         About to Probe I/O for Console        Base I/O, Bridge, XBow
      54         Probe I/O for Console Success         Base I/O
      56         Hub I/O Init Done                     Midplane, I/O card, Base I/O
      57         Saved Errors Stored from Reset        Hub
      58         Cleared all Error Registers           Hub
      59         Enabled Error Checking                Hub
      5a         Done Discovering Hub I/O              Hub
      5b         About to Init NMI Handler Area        RAM, Scache
      5c         About to Test Hub Interrupts          Hub
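A table like this lends itself to a simple lookup. The sketch below copies only a handful of rows from the table, and the function and dictionary names are hypothetical; it is meant to show the shape of such a lookup, not to be a complete decoder.

```python
# A few rows copied from the boot-progress table; extend as needed.
# Each entry maps an LED code to (boot phase, suspected point of failure).
BOOT_PHASES = {
    0x06: ("Test Secondary Cache", "CPU"),
    0x0c: ("Invalidate Secondary Cache", "Scache"),
    0x0d: ("Succeed - Jump to Main", None),  # no suspect listed in the table
    0x4a: ("About to Init Hub MD & SIMM Controls", "Hub"),
    0x4b: ("About to Probe & Config Memory Size", "RAM"),
}


def describe_boot_code(code):
    """Describe a two-digit hex LED code using the excerpt above."""
    value = int(code, 16)
    if not 0x00 <= value <= 0x7f:
        return f"{code}: not a boot-progress code"
    if value not in BOOT_PHASES:
        return f"{code}: not in this excerpt"
    phase, suspect = BOOT_PHASES[value]
    if suspect is None:
        return f"{code}: {phase} (no suspected component listed)"
    return f"{code}: {phase} (suspect: {suspect})"


print(describe_boot_code("4b"))
# 4b: About to Probe & Config Memory Size (suspect: RAM)
```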

Fatal Node Board Error Codes

If a node board suffers a fatal error during PROM diagnostics and is disabled, the LEDs of each CPU on that node board record the failure code. The failure codes displayed on the node board LEDs can be read using the information in the "Decoding the Node Board LED Display" section of this article.

   LED Code      Reason for Failure           Suspected Point of Failure
      81         CoProcessor Failed Register Test      CPU
      83         Pri Instruction Cache Failed Test     CPU
      84         Pri Data Cache Failed Test            CPU
      85         Secondary Cache Failed Test           Scache
      86         CPU Disabled by Another CPU           CPU
      87         Real-Time Counter Broken              CPU
      8c         General Exception
      91         Hub Local Failed                      CPU, Hub
      93         Some Node Not Premium (>32 CPUs)      Directory Memory
      98         Node has no Local memory              No RAM or Disabled RAM 
      9a         CPU is Disabled                       CPU Disabled
      9b         Memory Download Failed                RAM, PROM
      9e         Failed Writing Hub Config Info        RAM
      9f         Failed Writing Router Config Info     RAM
      a0         Hub I/O Init Failed                   Hub
      a1         Node failed Init                      RAM 
      a4         Hub Chip Failed                       Hub
      a5         Router Chip Failed                    Router
      a6         Waiting for Reset To Go               Hub
      a7         LLP Failed After Reset                Hub
      a8         LLP Never Up After Reset              Hub
      a9         Node Board - No Good Local Memory     No RAM or Disabled RAM  
      ab         Network Discovery Failed
      ac         NASID Calculation Failed              CrayLink Cabling Error 
      ad         Route Calculation Failed              CrayLink Cabling Error
      ae         Route Distribution Failed             Check Router LEDs for Error
      af         NASID Distribution Failed             CrayLink Cabling Error
      b0         Master Not Assigned NASID             Check Router LEDs for Error
      b1         Module ID Arbitration Failed          MSC
      b2         Origin2000 Craylinked to Origin200    Illegal Configuration
      b3         Partition Config Error                User Error

Node Board Early Exception LED Codes

If an exception occurs in the PROM code execution before the exception can be displayed by normal means, the CPU LEDs will begin a blinking error code. As the LEDs flash, they will alternate between flashing all eight LEDs and flashing only the LEDs that indicate the error code.

   LED Code      Exception                    Suspected Point of Failure
      f2         General Exception                     CPU
      f3         ECC Exception                         CPU
      f4         TLB Exception                         CPU
      f5         XTLB Exception                        CPU
      f6         Unimplemented Exception               CPU
      f7         Cache Error Exception                 CPU
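Since the code tables in this article occupy separate numeric ranges, a decoded value can be sorted into a category before looking it up. The following is a minimal sketch with a hypothetical function name. The 00 to 7f progress range is stated in the text; the 81 to b5 and f2 to f7 bounds are simply the lowest and highest codes listed in this article, so treating them as contiguous ranges is an assumption.

```python
def classify_led_code(code):
    """Sort a two-digit hex LED code into one of the article's categories."""
    value = int(code, 16)
    if not 0x00 <= value <= 0xff:
        raise ValueError("LED codes are two hex digits")
    if value <= 0x7f:
        return "boot progress"        # range stated in the text
    if 0xf2 <= value <= 0xf7:
        return "early exception"      # span of the listed blinking codes
    if 0x81 <= value <= 0xb5:
        return "fatal error"          # assumed span of the listed failure codes
    return "unlisted"


for code in ("0d", "a4", "f3", "c0"):
    print(code, classify_led_code(code))
```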

Post Initialization LED Displays

After the CPUs have completed initialization they display a different set of LED patterns.

Prior to IRIX booting, the master CPU will alternate the display of 55 and 00 (see "Decoding the Node Board LED Display" for additional information).

After IRIX loads, the bottom (eighth) LED is used to indicate the CPU heartbeat. LEDs 1 through 7 will progressively illuminate from bottom to top to indicate CPU activity.
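To picture what a code such as 55 or 00 looks like on the board, the decoding rule can be run in reverse. This sketch assumes the same inverted convention described in the decoding section (binary 0 means illuminated) also applies to these post-initialization patterns; the function name is hypothetical.

```python
def led_pattern(code):
    """Return the on/off state of the eight LEDs (top to bottom) for a hex code.

    Uses the inverted convention: binary 0 is an illuminated LED,
    binary 1 is a dark LED.
    """
    value = int(code, 16)
    if not 0x00 <= value <= 0xff:
        raise ValueError("LED codes are two hex digits")
    bits = format(value, "08b")
    return ["on" if bit == "0" else "off" for bit in bits]


print("-".join(led_pattern("55")))  # on-off-on-off-on-off-on-off
print("-".join(led_pattern("00")))  # on-on-on-on-on-on-on-on
```

Under that assumption, the master CPU column would alternate between every other LED lit and all eight LEDs lit while waiting for IRIX to boot.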

Full list of codes

LED   Name            Description
0x01  INITCPU         Initialize the general-purpose registers (GPRs), 
                      floating-point registers (FPR), and COP0 registers
0x02  TESTCP1         Test the COP1 registers
0x03  RUNTLB          Switch to mapped mode
0x04  TESTICACHE      Test the primary instruction cache
0x05  TESTDCACHE      Test the primary data cache
0x06  TESTSCACHE      Test secondary cache
0x07  FLUSHCACHES     Flush all caches
0x0a  INVICACHE       Invalidate the primary instruction cache
0x0b  INVDCACHE       Invalidate the primary data cache
0x0c  INVSCACHE       Invalidate secondary cache
0x0d  INMAIN          Successfully jumped to the main() function
0x0e  SPEEDUP         Prepare to increase PROM access speed
0x0f  SPEEDUPOK       Successfully increased PROM access speed
0x1a  MSCPROBE        Prepare to probe for presence of the MSC
0x1c  DONEPROBE       Completed the probe for the presence of the MSC
0x1d  UARTINIT        Prepare to initialize selected UART
0x1e  UARTINITDONE    Completed the initialization of the selected UART
0x21  PODLOOP         Prepare to enter POD mode (C code portion)
0x22  PODPROMPT       Prepare to enter the POD prompt loop
0x23  PODMODE         Prepare to enter POD mode (assembly code portion)
0x24  LOCALARB        Perform local arbitration (between CPU A and CPU B)
0x28  BARRIER         Prepare to perform first local barrier
0x2a  MAKESTACK       Prepare to configure Dex mode stack and data
0x2b  MAIN            Code execution has reached the main() function
0x31  NMI             Received external nonmaskable interrupt
0x35  RTCINIT         Prepare to initialize the HUB real-time counter
0x36  RTCINITDONE     Completed the initialization of the HUB real-time counter
0x38  BARRIEROK       Successfully completed the first local barrier operation
0x3c  JUMPRAMU        Prepare to jump to UALIAS space
0x3d  JUMPRAMUOK      Successfully jumped to UALIAS space
0x3e  JUMPRAMC        Prepare to jump to cached space
0x3f  JUMPRAMCOK      Successfully jumped to cached space
0x40  STACKRAM        Prepare to test the stack area of memory
0x41  STACKRAMOK      Successfully tested the stack area of memory
0x45  LAUNCHLOOP      Prepare to enter the slave launch loop
0x46  LAUNCHINTR      Received a launch interrupt
0x47  LAUNCHCALL      Call the launched() function
0x48  LAUNCHDONE      Returned from the launched() function
0x4a  MDIRINIT        Prepare to initialize the HUB MD and SIMM controls
0x4b  MDIRCONFIG      Prepare to determine and configure the memory size
0x4c  I2CINIT         Prepare to initialize the PCF8584 I2C chip
0x4d  I2CDONE         Completed the initialization of the PCF8584 I2C chip
0x4f  IODISCOVER      Prepare to discover Hub I/O
0x50  HUB_CONFIG      Prepare to write Hub configuration information into the KLCONFIG structure
0x51  ROUTER_CONFIG   Prepare to write the router configuration information into the KLCONFIG structure
0x52  INITIO          Prepare to initialize the I/O section of the Hub
0x53  CONSOLE_GET     Prepare to probe the I/O section of the Hub
0x54  CONSOLE_GET_OK  Successfully completed the probe of the I/O section for the console
0x56  INITIODONE      Completed the initialization of the I/O section of the Hub
0x57  STASH2          Reset error state saved
0x58  STASH3          Clear Hub error registers
0x59  STASH4          Enable Hub error checking
0x5a  IODISCOVER_DONE Completed the discovery of the Hub I/O
0x5b  NMI_INIT        Prepare to initialize the NMI handler area
0x5c  TEST_INTS       Prepare to test Hub interrupts
0x5d  IORESET         Prepare to perform early reset of Hub I/O section

Failure codes

LED   Name          Description
0x81  CP1           Register test failed         
0x82  RESTART       Restart master was unable to load the BaseIO PROM       
0x83  ICACHE        Primary instruction cache test failed (The failing FRU is the node board)         
0x84  DCACHE        Primary data cache test failed (The failing FRU is the node board)       
0x85  SCACHE        Secondary cache test failed (The failing FRU is the node board)   
0x86  KILLED        CPU disabled by another node       
0x87  RTC           Real-time counter not counting   
0x91  HUBLOCAL      Hub local arbitration failed; ignore this CPU
0x93  PREM_DIR_REQ  All nodes must have premium DIMMs for this configuration
0x97  MAINRET       Returned from main() function
0x98  NOMEM         Node board does not have local memory     
0x9a  DISABLED      CPU disabled by an environment variable     
0x9b  DOWNLOAD      Failure occurred while downloading the PROM code into RAM
0x9c  COREDEBUG     Boot process cannot set the CORE debug registers
0x9d  IODISCOVER    Failure occurred during the HUB I/O discovery process
0x9e  HUB_CONFIG    Failure occurred while writing the HUB information into the KLCONFIG structure
0x9f  ROUTER_CONFIG Failure occurred while writing the router information into the KLCONFIG structure
0xa0  HUBIO_INIT    Failure occurred while trying to initialize the HUB I/O interface
0xa1  CONFIG_INIT   Failure occurred while trying to initialize the KLCONFIG structure
0xa2  RTRCHIP       Failure occurred while testing the Router chip
0xa3  LINKDEAD      Failure occurred while testing the LLP link
0xa4  HUBBIST       Failure occurred while the HUB chip executed the built-in self test (BIST)
0xa5  RTRBIST       Failure occurred while the router chip executed the built-in self test (BIST)
0xa6  RESETWAIT     Waiting for a reset to occur     
0xa7  LLP_FAIL      LLP failed after the reset       
0xa8  LLP_NORESET   LLP never came up after the reset   
0xa9  BADMEM        Local memory is corrupted 
0xab  NET_DISCOVER  Failure occurred for the Hub network discovery
0xac  NASID_CALC    Failure occurred for the NASID calculation
0xad  ROUTE_CALC    Failure occurred for the route calculation
0xae  ROUTE_DIST    Failure occurred for the route distribution
0xaf  NASID_DIST    Failure occurred for the NASID distribution
0xb0  NO_NASID      Master did not assign a NASID       
0xb1  NO_MODULEID   Failure occurred for the module ID arbitration
0xb2  MIXED_SN00    Origin 200 system is configured with an Origin 2000 system 
0xb3  ERRPART       Failure occurred in the partition configuration
0xb4  MODEBIT       Failure occurred while copying the processor mode bits
0xb5  BACK_CALC     Failure occurred while calculating the midplane frequency