The Itanium 2 is a family of IA-64 64-bit microprocessors produced by Intel. The architecture was developed jointly by Hewlett-Packard (HP) and Intel. The first Itanium 2 was introduced on July 8, 2002, superseding the original Itanium. Several newer family members have been introduced. The Itanium 2 is intended for use in high-end "enterprise" servers.
- 1 Computing capabilities of Itanium
- 2 Architectural Features and Attributes
- 3 Competitors
- 4 Supercomputers
- 5 Itanium 2 processor versions
- 6 Upcoming revisions
- 7 References
- 8 External links
Computing capabilities of Itanium
Floating-point performance depends on both how many floating-point operations the processor can perform in parallel and the cycle time needed to execute them. The Itanium raises the operation count with functional units that perform two operations in a single pass. It is configured with two floating-point multiply-accumulate (FMAC) units, each of which can multiply two values and add the result to a third value. (Such operations are at the heart of many technical calculations.) Thus, an Itanium running at 800 MHz can produce four floating-point results per cycle, for a peak 64-bit performance rating of 3.2 billion floating-point operations per second (3.2 GFLOPS). The Itanium architecture also includes two single-precision (32-bit) FMACs tuned for 3D graphics performance, each of which can perform an additional four floating-point operations per cycle, for a 6.4 GFLOPS single-precision rating on an 800 MHz processor. These peak figures scale with clock rate, so they rise with each clock-speed increase in the Itanium processor family. In addition, the Itanium architecture is designed to allow future versions of the processor to be configured with additional FMACs.
The above analysis presents a best-case scenario in which the functional units are always busy. Although processors can maintain peak performance for only brief periods, Intel has incorporated a number of features in the Itanium architecture that help to maximize sustained performance. These include:
- Pipelined functional units
- Arithmetic operations generally require more than one machine cycle to complete. A pipelining scheme is used to allow the FMACs to produce results each cycle. The arithmetic operations are broken into a set of independent steps, each requiring one machine cycle to complete. The FMACs perform arithmetic operations in an assembly-line fashion, with each step accepting data from the previous step and sending results to the next step. Thus, after the pipeline is full, a result is produced each cycle.
- Dual-function arithmetic units
- A secondary benefit of the dual-function FMAC strategy is that the processor can use both functional units even when the mix of adds and multiplies is biased toward one operation. For example, if a section of code performs only additions, both FMACs can be employed on the task. In contrast, a system with separate addition and multiplication functional units would use the adder but would have to leave the multiply unit idle.
- Large register sets
- Intel designed the Itanium processor to support 128 integer, 128 floating-point, 8 branch and 64 predicate registers (for comparison, IA-32 processors provide 8 general-purpose registers and typical RISC processors provide 32). These registers allow more database data and intermediate calculations to be kept on chip and reduce the repetitive load/store of intermediate data values. The more data that is directly available to the FMACs, the less likely a functional unit is to stall for lack of data. In addition, the large register sets provide a buffer for the memory system to move data in and out of memory. These capabilities combine to greatly improve the overall response time of an application’s database manipulation request.
- Internal parallelism
- The Itanium can issue up to six instructions per cycle in a fixed set of combinations of four integer arithmetic/logical operations, two load/store operations, two floating-point operations, and three branch operations. The advantage of double-precision (64-bit) operations over single-precision (32-bit) operations is that the former allow larger sets of calculations to be performed before accumulated round-off errors begin to affect the accuracy of the final results. Because 64-bit systems are able to produce 64-bit results in a single cycle, as opposed to two cycles for 32-bit systems, the speed of operations on 64-bit data types (such as doubles) is greatly increased. Issuing multiple operations per cycle not only keeps as much of the processor working as possible but also allows data to be pre-fetched from memory into registers and cache memory, thus minimizing processor stalls due to data unavailability. The processor also provides a load-double-pair instruction to feed the processor with a balance of one memory operation per floating-point operation.
- Compiler support for parallelism
- The IA-64 architecture was designed to allow for closer coordination between the processor and compilers which generate the machine instructions for the processor. Three instructions are bundled along with a template field where the compiler can provide “hints” to the hardware on the interactions between the instructions. These hints are used by the processor to schedule instructions in real time and for pre-fetching of data for future operations.
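The bundling described in the last item can be illustrated with a short Python sketch. It assumes the standard IA-64 bundle layout, which the text above does not spell out: a 128-bit bundle holds a 5-bit template in its lowest bits, followed by three 41-bit instruction slots. The function names are hypothetical, and the sketch only packs and unpacks raw fields; it does not interpret the template values or decode the instructions themselves.

```python
# Field widths of an IA-64 bundle (assumed layout): 5 + 3 * 41 = 128 bits.
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOTS_PER_BUNDLE = 3
BUNDLE_BITS = TEMPLATE_BITS + SLOTS_PER_BUNDLE * SLOT_BITS


def encode_bundle(template, slots):
    """Pack a template and three raw instruction slots into one integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != SLOTS_PER_BUNDLE:
        raise ValueError("a bundle holds exactly three slots")
    bundle = template  # the template occupies the lowest bits
    for index, slot in enumerate(slots):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("each slot must fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + index * SLOT_BITS)
    return bundle


def decode_bundle(bundle):
    """Split a 128-bit bundle into (template, (slot0, slot1, slot2))."""
    if not 0 <= bundle < (1 << BUNDLE_BITS):
        raise ValueError("bundle must fit in 128 bits")
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slot_mask = (1 << SLOT_BITS) - 1
    slots = tuple(
        (bundle >> (TEMPLATE_BITS + index * SLOT_BITS)) & slot_mask
        for index in range(SLOTS_PER_BUNDLE)
    )
    return template, slots


# Round trip with arbitrary placeholder values (not real instruction encodings).
packed = encode_bundle(0x10, (1, 2, 3))
print(decode_bundle(packed))
```

Keeping the template in a fixed position lets the hardware read the compiler's scheduling information before it looks at any of the three instructions.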
Memory performance is measured in terms of both latency (i.e., how many cycles it takes to get data from memory to the processor) and bandwidth (i.e., how many bytes of data can be moved in a cycle). Many current systems attempt to solve the problems of latency and insufficient bandwidth through memory hierarchies, which include various levels of cache memory between main memory and the processor. Although this solution is effective, it is costly in terms of memory involved.
The Itanium can read or write 16 bytes of data to or from memory during every bus cycle; thus, for a 133 MHz bus, the memory bandwidth is 2.1 GBps. The 460GX chipset, which supports the Itanium processor, can also write an additional 2.1 GBps from I/O to memory, for a total memory bandwidth of 4.2 GBps. The Itanium processor uses a 4 MB L3 (level 3) cache for quick access to large data structures such as texture maps for digital content applications. The L3 cache communicates with the 96 KB L2 cache and the register file, moving data at 12.8 GBps (16 bytes per 800 MHz system clock) with a 24-cycle latency for floating-point numbers.
The L2 cache feeds data directly into the floating-point registers at a rate of 32 bytes per clock tick with a 9-clock latency. Although floating-point data bypasses the L1 cache, it is worth noting that the L1 cache is divided into a 16 KB instruction cache (L1I) and a 16 KB integer data cache (L1D). Both caches operate with a 2-clock latency to provide localized access to instructions and integer data, which is faster than retrieving them from memory.
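The peak compute and bandwidth figures in this section are all products of a clock rate and a per-cycle width, which the following Python sketch makes explicit. The function names are illustrative, and the 16-byte bus transfer is an assumption: it is the value implied by 2.1 GBps at 133 MHz, not a figure taken from a datasheet. The printed values are plain products of the quoted widths and clocks.

```python
def peak_gflops(clock_mhz, units, flops_per_unit_per_cycle):
    """Peak rate in GFLOPS if every functional unit is busy every cycle."""
    return clock_mhz * units * flops_per_unit_per_cycle / 1000.0


def bandwidth_gbps(clock_mhz, bytes_per_cycle):
    """Peak transfer rate in GB/s (decimal gigabytes) at a given clock."""
    return clock_mhz * bytes_per_cycle / 1000.0


# Two FMACs, each counted as two operations (multiply + add) per cycle.
print("double precision:", peak_gflops(800, 2, 2), "GFLOPS")
# Two single-precision units, four operations each per cycle.
print("single precision:", peak_gflops(800, 2, 4), "GFLOPS")

# Memory hierarchy, slowest to fastest; byte widths as quoted or assumed above.
print("system bus:", bandwidth_gbps(133, 16), "GB/s")
print("L3 cache:  ", bandwidth_gbps(800, 16), "GB/s")
print("L2 cache:  ", bandwidth_gbps(800, 32), "GB/s")
```

These are best-case numbers: they assume a transfer or a result on every cycle and ignore the latencies quoted for each cache level.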
Architectural Features and Attributes
IA-64 is a 64-bit architecture. Like other 64-bit architectures (POWER, MIPS, SPARC, EM64T, etc.), IA-64 implements a large physical and virtual address space, so the programmer rarely needs to worry about the size of the data or the code. Historically, 8-bit, 16-bit, and 32-bit architectures imposed constraints that affected the application programmer: most recently, programmers were generally constrained to 2 GiB of data space with a 32-bit architecture. The 64-bit architecture lets the programmer work with a 2^63-byte data space and a 2^63-byte code space.
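To put these address-space sizes in familiar units, the arithmetic can be checked with a few lines of Python. This is only a sketch: the helper name is made up for illustration, and it reads the limits as 2^31 bytes (2 GiB) for the 32-bit case and 2^63 bytes for IA-64.

```python
GIB = 2**30  # bytes in one gibibyte
EIB = 2**60  # bytes in one exbibyte


def space_in_units(address_bits, unit_bytes):
    """Size of a 2**address_bits byte space, expressed in the given unit."""
    return 2**address_bits // unit_bytes


print(space_in_units(31, GIB), "GiB")  # typical 32-bit data-space limit
print(space_in_units(63, EIB), "EiB")  # IA-64 data or code space
print(2**63 // 2**31, "times larger")
```

The ratio between the two limits is 2^32, so each 64-bit space is roughly four billion times the size of the 32-bit data space.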
The IA-64 architecture is based on a derivative of VLIW, dubbed Explicitly Parallel Instruction Computing (EPIC). It is theoretically capable of performing roughly 8 times more work per clock cycle than a non-superscalar CISC or RISC architecture because of its parallel microarchitecture. However, performance is heavily dependent on compilers and their ability to generate code that makes efficient use of the processor's execution units. The Itanium 2 has seen heavy use in compute-bound supercomputers and large corporate database servers, where parallelism and compile-time optimizations are most effective.
All Itanium 2 processors to date share a common cache hierarchy. They have 16 KiB of Level 1 instruction cache and 16 KiB of Level 1 data cache. The L2 cache is unified (both instruction and data) and is 256 KiB. The Level 3 cache is also unified and varies in size from 1.5 MiB to 24 MiB. In an interesting design choice, the L2 cache contains sufficient logic to handle semaphore operations without disturbing the main ALU. The latest Itanium processor, however, features a split L2 cache: it adds a dedicated 1 MiB L2 cache for instructions, so the original 256 KiB L2 cache becomes a dedicated data cache and the total L2 capacity grows.
The Itanium 2 bus is occasionally referred to as the Scalability Port, but much more frequently as the McKinley bus. It is a 200 MHz, 128-bit wide, double-pumped bus capable of 6.4 GB/s, more than three times the bandwidth of the original Itanium bus, known as the Merced bus. In 2004, Intel released processors with a 266 MHz bus, increasing bandwidth to 8.5 GB/s. In early 2005, processors with a 10.6 GB/s, 333 MHz bus were released.
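The bus figures follow from the width, the clock and the fact that a double-pumped bus transfers data twice per clock. The sketch below, with an illustrative function name, reproduces the arithmetic using the nominal clock values from the text; because the real clocks are not exactly 266 and 333 MHz, the last two results only approximate the quoted 8.5 and 10.6 GB/s.

```python
def bus_bandwidth_gbs(clock_mhz, width_bits, transfers_per_clock=2):
    """Peak bus bandwidth in GB/s; double-pumped means 2 transfers per clock."""
    bytes_per_transfer = width_bits // 8
    return clock_mhz * transfers_per_clock * bytes_per_transfer / 1000.0


for clock_mhz in (200, 266, 333):
    rate = bus_bandwidth_gbs(clock_mhz, 128)
    print(f"{clock_mhz} MHz bus: {rate:.2f} GB/s")

# Ratio to the 2.1 GB/s Merced bus, to check the "more than three times" claim.
print(f"ratio to Merced bus: {bus_bandwidth_gbs(200, 128) / 2.1:.2f}")
```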
Most systems sold by enterprise server vendors that contain 4 or more processor sockets use proprietary Non-Uniform Memory Access (NUMA) architectures that supersede the more limited front side bus of 1 and 2 CPU socket servers.
Competitors
The Itanium 2 competes in the enterprise server market. Itanium's major competitors include Sun Microsystems' UltraSPARC T1, IBM's Power5, AMD's Opteron, and Intel's own Xeon servers. In general, Itanium competes against Sun, IBM systems, and Opterons for running enterprise-class workloads on large, multi-processor servers in the back-end of corporate datacenters. It competes against Opteron and Xeon-based servers in smaller configurations and in cluster configurations.
The biggest change in the competitive landscape has been the emergence of the x86-64 64-bit architecture, created by AMD and first implemented in the Opteron in 2003. Opteron gained rapid acceptance in the enterprise server space because it provided an easy upgrade from IA-32. Intel eventually responded by implementing its own derivative of the architecture, EM64T, in its Xeon microprocessors in 2005.
Throughout its history, Itanium has had the best floating-point performance relative to fixed-point performance of any general-purpose microprocessor. This capability is not needed for most enterprise server workloads. Sun's latest server-class microprocessor, the UltraSPARC T1, acknowledges this explicitly, with performance dramatically skewed toward integer processing at the expense of floating-point performance (eight integer cores share a single FPU). Thus Itanium and Sun appear to be addressing separate subsets of the market. By contrast, IBM's Cell microprocessor, with a single general-purpose POWER core controlling eight simpler cores optimized for floating point, may eventually compete against Itanium for floating-point workloads.
Supercomputers
Four computers based on the Itanium 2 appeared in the top 20 of the November 2006 TOP500 list of supercomputers:
- #7 Tera-10, Commissariat à l'Énergie Atomique (CEA), France. Machine: Bull SMP Cluster, NovaScale 5160. CPU: 8,704 Itanium 2 (1.6 GHz). Connection: Quadrics QsNet II. Main Memory: 26,112 GB. Rmax: 42.9 Teraflops.
- #8 Columbia, NASA Ames Research Center, United States. Machine: SGI Altix 3700. CPU: 10,160 Itanium 2 (1.5 GHz). Connection: Voltaire Infiniband. Rmax: 51.8 Teraflops.
- #18 HLRB II, Leibniz Rechenzentrum, Bavaria, Germany. Machine: SGI Altix 4700. CPU: 4,096 Itanium 2 (1.6 GHz). Connection: SGI NUMAlink. Rmax: 24.36 Teraflops.
- #19 Thunder, Lawrence Livermore National Laboratory, United States. Machine: Intel Tiger4. CPU: 4,096 Itanium 2 (1.4 GHz). Connection: Quadrics. Rmax: 19.94 Teraflops.
The best position ever achieved by an Itanium 2-based system on the list was #2, first in June 2004, when Thunder (LLNL) entered the list with an Rmax of 19.94 Teraflops, and again in November 2004, when Columbia entered the list at #2 with 51.8 Teraflops.
The share of Itanium-based machines on the list peaked at 16.8% in November 2004. In November 2006 the share stood at 7.0%.
Itanium 2 processor versions
McKinley was the first version of the Itanium 2 processor, manufactured in a 180 nm process. It was released at speeds of 900 MHz and 1 GHz, with cache sizes of 1.5 MiB and 3 MiB, providing a major performance improvement over the original Itanium. It added hardware support for the branch long (brl) instruction of the IA-64 instruction set. IA-32 performance, while improved, was still only about 25% of that of a contemporaneous 2.4 GHz Xeon.
Madison was introduced on June 30, 2003. It was initially available in three versions: 1.3 GHz with 3 MiB of cache, 1.4 GHz with 4 MiB of cache and 1.5 GHz with 6 MiB of cache. Manufactured in a 130 nm process, it had a die size of 374 mm². Its power envelope remained unchanged from McKinley at 130 watts. On September 8, 2003, a 1.4 GHz version with 1.5 MiB of cache was released. 1.4 GHz and 1.6 GHz versions with 3 MiB of cache were launched on April 13, 2004. November 8, 2004 saw the release of the first processor in the Madison 9M series, at 1.6 GHz with 9 MiB of cache. On July 18, 2005, more variations of the Madison 9M were introduced, including 1.67 GHz models with a 333 MHz FSB and either 6 MiB or 9 MiB of cache. On introduction, the 9 MiB part set a record SPECfp2000 result of 2,801 in a Hitachi, Ltd. computing blade.
In January 2005 OpenVMS was added to the lineup of operating systems able to run on these processors.
Hondo was announced as the HP mx2 dual-processor module on February 18, 2003 and started shipping in early 2004. It consists of two Madison cores with 32 MiB of L4 cache and fits in the same space as a normal Itanium 2 CPU. It is only available from HP. Currently the cores run at 1.1 GHz with 4 MiB L3 cache each.
HP-UX, OpenVMS, Windows and Linux for Itanium were able to use the mx2 variant.
Deerfield was released on September 8, 2003. With 1.5 MiB of cache, running at 1 GHz, this was the first low voltage Itanium processor. Its 62 watt power envelope made it more suited for blade and 1U servers.
The Fanwood core debuted on November 8, 2004. Versions include a 1.6 GHz edition with 3 MiB of L3 cache with either 200 MHz or 266 MHz front side bus and a low voltage 1.3 GHz version with 3 MiB L3 cache at 200 MHz.
The Dual-Core Intel Itanium 2 processor 9000 series (code-named Montecito) was released on July 18, 2006. Montecito is the first Itanium processor to have two cores per die. It was originally planned to feature advanced power and thermal management improvements. However, the planned Foxton dynamic clock speed feature was removed due to unspecified engineering issues (Intel is considering it for inclusion in future Itanium 2 processor versions). Despite the elimination of this feature, Intel reports that Montecito doubles the performance of its single-core predecessor while reducing power consumption by approximately 20 percent. It also adds multi-threading capabilities (two threads per core), a greatly expanded cache subsystem (12 MiB of L3 per core), and silicon support for virtualization. Manufactured in a 90 nm process, Montecito debuted with speeds between 1.4 GHz for a low-power configuration and 1.6 GHz with 12 + 12 MiB of L3 cache at the high end. The front side bus runs at 400 MHz or 533 MHz.
Upcoming revisions
The future of the Itanium family apparently lies in multi-core chips, as the available information about coming generations such as Montvale and Tukwila shows. (Those are internal code names; the final products will most likely also bear the Itanium brand, possibly as Itanium 3 or perhaps just Itanium 2.)
Montvale is expected to be a revision of Montecito bringing higher clock speeds, larger caches, and a faster FSB.
Tukwila, the first 65 nanometer design, is due in 2008. Tukwila will consist of 4 cores, with each core being multithreaded. It is going to feature a new bus called Common System Interface and an on-die memory controller. Ultimately, CSI is intended to provide socket compatibility with Xeon processors; however, as of October 2005, the CSI roadmap for Xeon processors has been delayed until at least 2009.
Few details are known, other than the existence of the codename.