NUMAlink is a high-speed, low-latency switched-fabric interconnect used to link processors into shared-memory computer clusters in Silicon Graphics (SGI) computer systems. SGI developed NUMAlink for its Origin and Onyx systems. It was initially branded as "CrayLink" during SGI's ownership of Cray Research.
For computer clusters, interconnect latency often matters more to overall performance than bandwidth, especially for applications that pass many small messages. For instance, Gigabit Ethernet delivers on the order of 100 MB/s, but typical latencies are around 30 µs even for one-word messages. This is due to the overhead of the Ethernet protocol stack, which must encapsulate each message in a standardized packet and then unpack it at the far end.
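A simple way to see why latency dominates for small messages is the common latency-bandwidth cost model, in which an n-byte message takes t(n) = latency + n / bandwidth. The sketch below is an illustration under that model, not a measurement: it plugs in the Gigabit Ethernet figures quoted above (30 µs, about 100 MB/s) and assumes a "one-word" message is 8 bytes.

```python
# Latency-bandwidth cost model: t(n) = latency + n / bandwidth.
# Figures are the Gigabit Ethernet numbers quoted in the text; the 8-byte
# "one-word" message size is an assumption (a 64-bit word).

def transfer_time_us(n_bytes: float, latency_us: float, bandwidth_mb_s: float) -> float:
    """Estimated one-way time in microseconds for an n-byte message."""
    # With decimal units, 1 MB/s == 1 byte per microsecond,
    # so n_bytes / bandwidth_mb_s is already in microseconds.
    return latency_us + n_bytes / bandwidth_mb_s

GIGE_LATENCY_US = 30.0
GIGE_BANDWIDTH_MB_S = 100.0

for n in (8, 1_000, 100_000):
    total = transfer_time_us(n, GIGE_LATENCY_US, GIGE_BANDWIDTH_MB_S)
    latency_share = 100 * GIGE_LATENCY_US / total  # percent of time spent on fixed overhead
    print(f"{n:>7} bytes: {total:8.2f} us total, {latency_share:5.1f}% latency")
```

Under this model an 8-byte message spends over 99% of its time in fixed overhead, while a 100 KB message is dominated by transmission time, which is why reducing per-message overhead matters most for small-message workloads.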
NUMAlink, like other products aimed at the same market, attempts to improve performance by dramatically reducing per-packet overhead. Typically this is accomplished by using a much smaller minimum packet size and circuit-switched networks that do not have to be actively routed during transport (the route is set up only once). SGI claims particularly impressive numbers for NUMAlink, stating that the typical short-message overhead is only 1 µs, half that of competing systems.
Latency directly affects the "efficiency" of a system, one of the important measures used in Linpack benchmarking of supercomputer installations. NUMAlink systems averaged 84% efficiency on the TOP500 list, while QsNet and InfiniBand systems reached 75%, Myrinet 63%, and Gigabit Ethernet 59%.
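On the TOP500 list, Linpack efficiency is the ratio of achieved performance (Rmax) to theoretical peak (Rpeak). The sketch below applies the average efficiencies quoted above to a hypothetical machine with a 1000 GFLOPS peak; that peak figure is an assumption chosen only to make the comparison concrete.

```python
# Linpack efficiency as reported on the TOP500 list: Rmax / Rpeak.
def linpack_efficiency(rmax: float, rpeak: float) -> float:
    """Fraction of theoretical peak actually achieved on Linpack."""
    return rmax / rpeak

# Average efficiencies quoted in the text, as fractions.
AVG_EFFICIENCY = {
    "NUMAlink": 0.84,
    "QsNet": 0.75,
    "InfiniBand": 0.75,
    "Myrinet": 0.63,
    "Gigabit Ethernet": 0.59,
}

# Hypothetical machine: 1000 GFLOPS theoretical peak, for illustration only.
RPEAK_GFLOPS = 1000.0

for name, eff in AVG_EFFICIENCY.items():
    # Expected Rmax if this machine matched the interconnect's average efficiency.
    print(f"{name:>16}: Rmax ~ {eff * RPEAK_GFLOPS:.0f} GFLOPS")
```

With identical processors, the same peak would yield about 840 GFLOPS at NUMAlink's average efficiency but about 590 GFLOPS at Gigabit Ethernet's.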
NUMAlink also offers high bandwidth. The basic system provides 3.2 GB/s in each direction, about twice that of most similar systems and 32 times that of Gigabit Ethernet. Fully expanded InfiniBand systems, that is, "quad-rate 12X" systems, offer up to 12 GB/s, but it appears that no such configuration is actually in use.
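The bandwidth comparison above is simple arithmetic; the sketch below reproduces the 32x ratio and, as an illustration, the time each link would need to stream 1 GB. It uses decimal units and ignores latency and protocol overhead entirely, which is an assumption that flatters all three links equally.

```python
# Per-direction link bandwidths from the text, in MB/s (decimal units).
NUMALINK_MB_S = 3200.0     # "3.2 GB/s in each direction"
GIGE_MB_S = 100.0          # Gigabit Ethernet, on the order of 100 MB/s
IB_QDR_12X_MB_S = 12000.0  # "quad-rate 12X" InfiniBand, up to 12 GB/s

def seconds_to_move(n_megabytes: float, bandwidth_mb_s: float) -> float:
    """Time to stream a payload at the given rate, ignoring latency."""
    return n_megabytes / bandwidth_mb_s

ratio = NUMALINK_MB_S / GIGE_MB_S  # NUMAlink vs Gigabit Ethernet

links = [("NUMAlink", NUMALINK_MB_S), ("GigE", GIGE_MB_S), ("IB QDR 12X", IB_QDR_12X_MB_S)]
for name, bw in links:
    print(f"{name:>10}: 1 GB in {seconds_to_move(1000.0, bw):.3f} s")
```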
The following excerpt is taken from an archived copy of an SGI web page (see References, below) and refers to the NUMAlink 4 interconnect:
Data crosses over an SGI NUMAlink switch, round-trip, in as little as 50 nanoseconds—less time than it takes a beam of light to travel 50 feet—compared to 10,000 nanoseconds or more with many commodity clustering interconnects. Furthermore, SGI NUMAlink technology is the only interconnect that provides global shared memory between cluster nodes.
The industry-leading performance of NUMAlink interconnect technology is clear when comparing bandwidth and latency characteristics to other interconnects (Table 1). This translates into better system performance in MPI applications as well as industry standard system benchmarks, such as Linpack (Table 2).
Table 1: Latency and bandwidth by interconnect
|Interconnect||Vendor||Latency (µs, short message)||Bandwidth per link (MB/s)|
|NUMAlink 4 (Altix)||SGI||1||3200|
|RapidArray (XD1)||Cray||1.8||2000 (1)|
|QsNet II||Quadrics||2||900 (2)|
|High Performance Switch||IBM||5||1000 (4)|
|Myrinet XP2||Myricom||5.7||495 (5)|
|SP Switch 2||IBM||18||500 (6)|
Bandwidth-per-link citations are from the following sources:
1. http://www.cray.com/products/xd1/index.html#RapidArrayInterconnect
2. http://doc.quadrics.com/Quadrics/QuadricsHome.nsf/DisplayPages/81DD13F71CFD762580256EAD0010AA75/$File/Performance.pdf
3. http://nowlab.cis.ohio-state.edu/projects/mpi-iba/
4. http://publib-b.boulder.ibm.com/Redbooks.nsf/f338d71ccde39f08852568dd006f956d/55258945787efc2e85256db00051980a?OpenDocument
5. http://www.myricom.com/myrinet/performance/
6. http://www-1.ibm.com/servers/eserver/pseries/hardware/whitepapers/sp_switch_perf.pdf
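The latency and bandwidth figures in Table 1 can be combined using the same latency-bandwidth model, t(n) = latency + n / bandwidth. The sketch below is an illustrative calculation, not part of SGI's material: it estimates the time for a 4 KB message (an arbitrary, assumed size) and the "half-bandwidth" message size n_1/2, the size at which a link reaches half its peak rate. Setting n / t(n) equal to bandwidth / 2 gives n_1/2 = latency × bandwidth.

```python
# Short-message latency (us) and per-link bandwidth (MB/s) from Table 1.
TABLE_1 = {
    "NUMAlink 4 (Altix)": (1.0, 3200.0),
    "RapidArray (XD1)": (1.8, 2000.0),
    "QsNet II": (2.0, 900.0),
    "High Performance Switch": (5.0, 1000.0),
    "Myrinet XP2": (5.7, 495.0),
    "SP Switch 2": (18.0, 500.0),
}

def time_us(n_bytes: float, latency_us: float, bandwidth_mb_s: float) -> float:
    """Model time for an n-byte message (1 MB/s == 1 byte/us in decimal units)."""
    return latency_us + n_bytes / bandwidth_mb_s

def half_bandwidth_size(latency_us: float, bandwidth_mb_s: float) -> float:
    """Message size in bytes at which achieved bandwidth is half the peak."""
    return latency_us * bandwidth_mb_s

for name, (lat, bw) in TABLE_1.items():
    print(f"{name:<24} 4 KB in {time_us(4096, lat, bw):6.2f} us, "
          f"n_1/2 ~ {half_bandwidth_size(lat, bw):6.0f} B")
```

A larger n_1/2 is not a drawback by itself: NUMAlink's is large because its bandwidth is high, yet it still has the lowest modeled time at every message size in this table.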
Table 2: Linpack efficiency by interconnect
|System/interconnect||Avg. Linpack efficiency for 256P system (%)*||Sample size (number of systems on list)*|
|SGI Altix/NUMAlink 4||84||14|
|Various/InfiniBand||75||3 (one system @288P)|
* Linpack Rmax/Rpeak for 256P systems listed on the November 2004 TOP500 list; see www.top500.org
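The light-travel comparison in the SGI excerpt can be checked with a short calculation. The sketch assumes light in a vacuum (the exact SI value of c); in a cable or fiber, signals travel more slowly, so the real distance covered in 50 ns would be shorter still.

```python
# Speed of light in vacuum (exact by SI definition) and the feet-to-metres factor.
C_M_PER_S = 299_792_458.0
M_PER_FOOT = 0.3048

def light_distance_feet(t_ns: float) -> float:
    """Distance light travels in vacuum in t_ns nanoseconds, in feet."""
    return C_M_PER_S * t_ns * 1e-9 / M_PER_FOOT

print(f"{light_distance_feet(50):.1f} ft")  # ~49.2 ft, just under 50 ft
print(f"{10_000 / 50:.0f}x")                # 10,000 ns commodity round trip vs 50 ns
```

So 50 ns is indeed slightly less than the time light needs to cover 50 feet, and the quoted commodity figure is 200 times slower.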
There was no NUMAlink 1, as SGI's engineers regarded the system interconnect used in the Stanford DASH multiprocessor as the first-generation NUMAlink interconnect.
NUMAlink 2 is the second generation of the interconnect, introduced in 1996 and used in the Onyx2 visualization systems and the Origin 200 and Origin 2000 servers and supercomputers. The NUMAlink 2 interface was implemented in the Hub ASIC. NUMAlink 2 is capable of 1.6 GB/s of peak bandwidth through two 800 MB/s unidirectional links, each a 16-bit, 400 MHz PECL link.
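The NUMAlink 2 link figures are consistent with each other if each link moves one 16-bit word per clock cycle; that transfer scheme is an assumption made here to reproduce the arithmetic, not a documented detail. The sketch below derives the per-direction and total peak bandwidth from the clock rate and link width.

```python
def link_bandwidth_mb_s(clock_mhz: float, width_bits: int) -> float:
    """Peak bandwidth of one unidirectional link, assuming one transfer per clock.

    clock_mhz * 1e6 transfers/s * (width_bits / 8) bytes/transfer, expressed in
    MB/s (1e6 bytes/s), simplifies to clock_mhz * width_bits / 8.
    """
    return clock_mhz * width_bits / 8

per_direction = link_bandwidth_mb_s(400, 16)  # one 400 MHz, 16-bit link
total = 2 * per_direction                     # two unidirectional links

print(f"{per_direction:.0f} MB/s per direction, {total / 1000:.1f} GB/s peak")
```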
NUMAlink 3 is the third generation of the interconnect, introduced in 2000 and used in the Origin 3000 and Altix 3000. NUMAlink 3 is capable of 3.2 GB/s of peak bandwidth through two 1.6 GB/s unidirectional links.
NUMAlink 4 is the fourth generation of the interconnect, introduced in 2004 and used in the Altix 4000. NUMAlink 4 is capable of 6.4 GB/s of peak bandwidth through two 3.2 GB/s unidirectional links.
NUMAlink 5 is the fifth generation of the interconnect, introduced in 2009 and used in the Altix UV series. NUMAlink 5 is capable of 15 GB/s of peak bandwidth through two 7.5 GB/s unidirectional links.
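The generational figures above follow a common pattern: the quoted peak is the sum of two unidirectional links. The sketch below tabulates the per-direction rates from the text and the growth factor between generations; the year and bandwidth values are taken directly from the paragraphs above, and nothing else is assumed.

```python
# (generation, year introduced, per-direction link bandwidth in GB/s), from the text.
GENERATIONS = [
    ("NUMAlink 2", 1996, 0.8),
    ("NUMAlink 3", 2000, 1.6),
    ("NUMAlink 4", 2004, 3.2),
    ("NUMAlink 5", 2009, 7.5),
]

def peak_bandwidth(per_direction_gb_s: float) -> float:
    """Quoted peak bandwidth: two unidirectional links."""
    return 2 * per_direction_gb_s

prev = None
for name, year, bw in GENERATIONS:
    # Growth factor relative to the previous generation's per-direction rate.
    growth = f"{bw / prev:.2f}x" if prev is not None else "-"
    print(f"{name} ({year}): {peak_bandwidth(bw):.1f} GB/s peak, {growth} vs previous")
    prev = bw
```

Each of the first three steps doubles the link rate, while NUMAlink 5 raises it by a factor of about 2.34.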
- SGI NUMAlink Interconnect Fabric (courtesy Archive.org)
- Lenoski, D. et al., The Stanford DASH Multiprocessor, IEEE Computer, Vol. 25, Issue 3, March 1992.
- Heinrich, J., Origin and Onyx2 Theory of Operations Manual, 007-3439-002, Silicon Graphics.
- SGI NUMAlink White Paper, 3771, March 2005, Silicon Graphics