Reduced instruction set computer
The reduced instruction set computer, or RISC, is a CPU design philosophy that favors a reduced instruction set as well as a simpler set of instructions. The most common RISC microprocessors are DEC Alpha, ARC, ARM, Atmel AVR, MIPS, PA-RISC, PIC, Power Architecture, and SPARC.
The idea was originally inspired by the discovery that many of the features that were included in traditional CPU designs to facilitate coding were being ignored by the programs that were running on them. Also these more complex features took several processor cycles to be performed. Additionally, the performance gap between the processor and main memory was increasing. This led to a number of techniques to streamline processing within the CPU, while at the same time attempting to reduce the total number of memory accesses.
Pre-RISC design philosophy
In the early days of the computer industry, compiler technology did not exist at all. Programming was done in either machine code or assembly language. To make programming easier, computer architects created more and more complex instructions which were direct representations of high level functions of high level programming languages. The attitude at the time was that hardware design was easier than compiler design, so the complexity went into the hardware.
Another force that encouraged complexity was the lack of large memory. Since memory was small, it was advantageous for the density of information held in computer programs to be very high. When every byte of memory was precious, for example one's entire system only had a few kilobytes of storage, it moved the industry to such features as highly encoded instructions, instructions which could be variable sized, instructions which did multiple operations and instructions which did both data movement and data calculation. At that time, such instruction packing issues were of higher priority than the ease of decoding such instructions.
Memory was not only small, but rather slow since they were implemented using magnetic technology at the time. That was another reason to keep the density of information very high. By having dense information packing, one could decrease the frequency with which one had to access this slow resource.
CPUs had few registers for two reasons:
- bits in internal CPU registers are always more expensive than bits in external memory. The available level of silicon integration of the day meant large register sets would have been burdensome to the chip area or board areas available.
- Having a large number of registers would have required a large number of instruction bits (using precious RAM) to be used as register specifiers.
For the above reasons, CPU designers tried to make instructions that would do as much work as possible. This led to one instruction that would do all of the work in a single instruction: load up the two numbers to be added, add them, and then store the result back directly to memory. Another version would read the two numbers from memory, but store the result in a register. Another version would read one from memory and the other from a register and store to memory again. And so on. This processor design philosophy eventually became known as Complex Instruction Set Computer (CISC) once the RISC philosophy came onto the scene.
The general goal at the time was to provide every possible addressing mode for every instruction, a principle known as "orthogonality." This led to some complexity on the CPU, but in theory each possible command could be tuned individually, making the design faster than if the programmer used simpler commands.
The ultimate expression of this sort of design can be seen at two ends of the power spectrum, the MOS Technology 6502 at one end, and the VAX at the other. The $25 single-chip 1 MHz 6502 had only a single general-purpose register, but its simplistic single-cycle memory interface allowed byte-wide operations to perform almost on par with significantly higher clocked designs, such as a 4 MHz Zilog Z80 using equally slow memory chips (i.e. approx. 300ns). The VAX was a minicomputer whose initial implementation required 3 racks of equipment for a single cpu, and was notable for the amazing variety of memory access styles it supported, and the fact that every one of them was available for every instruction.
RISC design philosophy
In the late 1970s researchers at IBM (and similar projects elsewhere) demonstrated that the majority of these "orthogonal" addressing modes were ignored by most programs. This was a side effect of the increasing use of compilers to generate the programs, as opposed to writing them in assembly language. The compilers in use at the time only had a limited ability to take advantage of the features provided by Complex instruction set computer(CISC) CPUs; this was largely a result of the difficulty of writing a compiler. The market was clearly moving to even wider use of compilers, diluting the usefulness of these orthogonal modes even more.
Another discovery was that since these operations were rarely used, in fact they tended to be slower than a number of smaller operations doing the same thing. This seeming paradox was a side effect of the time spent designing the CPUs, designers simply did not have time to tune every possible instruction, and instead tuned only the most used ones. One famous example of this was the VAX's
INDEX instruction, which ran slower than a loop implementing the same code.
At about the same time CPUs started to run even faster than the memory they talked to. Even in the late 1970s it was apparent that this disparity was going to continue to grow for at least the next decade, by which time the CPU would be tens to hundreds of times faster than the memory. It became apparent that more registers (and later caches) would be needed to support these higher operating frequencies. These additional registers and cache memories would require sizeable chip or board areas that could be made available if the complexity of the CPU was reduced.
Yet another part of RISC design came from practical measurements on real-world programs. Andrew Tanenbaum summed up many of these, demonstrating that most processors were vastly overdesigned. For instance, he showed that 98% of all the constants in a program would fit in 13 bits, yet almost every CPU design dedicated some multiple of 8 bits to storing them, typically 8, 16 or 32, one entire word. Taking this fact into account suggests that a machine should allow for constants to be stored in unused bits of the instruction itself, decreasing the number of memory accesses. Instead of loading up numbers from memory or registers, they would be "right there" when the CPU needed them, and therefore much faster. However this required the operation itself to be very small, otherwise there would not be enough room left over in a 32-bit instruction to hold reasonably sized constants.
Since real-world programs spent most of their time executing very simple operations, some researchers decided to focus on making those common operations as simple and as fast as possible. Since the clock rate of the CPU is limited by the time it takes to execute the slowest instruction, speeding up that instruction -- perhaps by reducing the number of addressing modes it supports -- also speeds up the execution of every other instruction. The goal of RISC was to make instructions so simple, each one could be executed in a single clock cycle . The focus on "reduced instructions" led to the resulting machine being called a "reduced instruction set computer" (RISC).
The real difference between RISC and CISC is the philosophy of doing everything in registers and loading and saving the data to and from them. To avoid that misunderstanding, many researchers prefer the term load-store.
Over time the older design technique became known as Complex Instruction Set Computer, or CISC, although this was largely to give them a different name for comparison purposes.
Code was implemented as a series of these simple instructions, instead of a single complex instruction that had the same result. This had the side effect of leaving more room in the instruction to carry data with it, meaning that there was less need to use registers or memory. At the same time the memory interface was considerably simpler, allowing it to be tuned.
However RISC also had its drawbacks. Since a series of instructions is needed to complete even simple tasks, the total number of instructions read from memory is larger, and therefore takes longer. At the time it was not clear whether or not there would be a net gain in performance due to this limitation, and there was an almost continual battle in the press and design world about the RISC concepts.
While the RISC philosophy was coming into its own, new ideas about how to dramatically increase performance of the CPUs were starting to develop.
In the early 1980s it was thought that existing design was reaching theoretical limits. Future improvements in speed would be primarily through improved semiconductor "process", that is, smaller features (transistors and wires) on the chip. The complexity of the chip would remain largely the same, but the smaller size would allow it to run at higher clock rates. A considerable amount of effort was put into designing chips for parallel computing, with built-in communications links. Instead of making faster chips, a large number of chips would be used, dividing up problems among them. However history has shown that the original fears were not valid, and there were a number of ideas that dramatically improved performance in the late 1980s.
One idea was to include a pipeline which would break down instructions into steps, and work on one step of several different instructions at the same time. A normal processor might read an instruction, decode it, fetch the memory the instruction asked for, perform the operation, and then write the results back out. The key to pipelining is the observation that the processor can start reading the next instruction as soon as it finishes reading the last, meaning that there are now two instructions being worked on (one is being read, the next is being decoded), and after another cycle there will be three. While no single instruction is completed any faster, the next instruction would complete right after the previous one. The result was a much more efficient utilization of processor resources.
Yet another solution was to use several processing elements inside the processor and run them in parallel. Instead of working on one instruction to add two numbers, these superscalar processors would look at the next instruction in the pipeline and attempt to run it at the same time in an identical unit. However, this can be difficult to do, as many instructions in computing depend on the results of some other instruction.
Both of these techniques relied on increasing speed by adding complexity to the basic layout of the CPU, as opposed to the instructions running on them. With chip space being a finite quantity, in order to include these features something else would have to be removed to make room. RISC was tailor-made to take advantage of these techniques, because the core logic of a RISC CPU was considerably simpler than in CISC designs. Although the first RISC designs had marginal performance, they were able to quickly add these new design features and by the late 1980s they were significantly outperforming their CISC counterparts. In time this would be addressed as process improved to the point where all of this could be added to a CISC design and still fit on a single chip, but this took most of the late-80s and early 90s.
The long and short of it is that for any given level of general performance, a RISC chip will typically have many fewer transistors dedicated to the core logic. This allows the designers considerable flexibility; they can, for instance:
- increase the size of the register set
- implement measures to increase internal parallelism
- increase the size of caches
- add other functionality, like I/O and timers for microcontrollers
- add vector (SIMD) processors like AltiVec and Streaming SIMD Extensions (SSE)
- build the chips on older fabrication lines, which would otherwise go unused
- do nothing; offer the chip for battery-constrained or size-limited applications
Features which are generally found in RISC designs are:
- uniform instruction encoding (for example the op-code is always in the same bit position in each instruction, which is always one word long), which allows faster decoding;
- a homogeneous register set, allowing any register to be used in any context and simplifying compiler design (although there are almost always separate integer and floating point register files);
- simple addressing modes (complex addressing modes are replaced by sequences of simple arithmetic instructions);
- few data types supported in hardware (for example, some CISC machines had instructions for dealing with byte strings. Others had support for polynomials and complex numbers. Such instructions are unlikely to be found on a RISC machine).
RISC designs are also more likely to feature a Harvard memory model, where the instruction stream and the data stream are conceptually separated; this means that modifying the addresses where code is held might not have any effect on the instructions executed by the processor (because the CPU has a separate instruction and data cache), at least until a special synchronization instruction is issued. On the upside, this allows both caches to be accessed simultaneously, which can often improve performance.
Many of these early RISC designs also shared a not-so-nice feature, the branch delay slot. A branch delay slot is an instruction space immediately following a jump or branch. The instruction in this space is executed whether or not the branch is taken (in other words the effect of the branch is delayed). This instruction keeps the arithmetic and logical unit (ALU) of the CPU busy for the extra time normally needed to perform a branch. Nowadays the branch delay slot is considered an unfortunate side effect of a particular strategy for implementing some RISC designs, and modern RISC designs generally do away with it (such as PowerPC, more recent versions of SPARC, and MIPS).
The first system that would today be known as RISC was not at the time; it was the CDC 6600 supercomputer, designed in 1964 by Jim Thornton and Seymour Cray. Thornton and Cray designed it as a number-crunching CPU (with 74 op-codes, compared with a 8086's 400) plus 12 simple computers called "peripheral processors" to handle I/O (most of the operating system was in one of these). The CDC 6600 had a load-store architecture with only two addressing modes. There were eleven pipelined functional units for arithmetic and logic, plus five load units and two store units (the memory had multiple banks so all load-store units could operate at the same time). The basic clock cycle/instruction issue rate was 10 times faster than the memory access time.
Another early load-store machine was the Data General Nova minicomputer, designed in 1968.
The most public RISC designs, however, were the results of university research programs run with funding from the DARPA VLSI Program. The VLSI Program, practically unknown today, led to a huge number of advances in chip design, fabrication, and even computer graphics.
UC Berkeley's RISC project started in 1980 under the direction of David Patterson, based on gaining performance through the use of pipelining and an aggressive use of registers known as register windows. In a normal CPU one has a small number of registers, and a program can use any register at any time. In a CPU with register windows, there are a huge number of registers, e.g. 128, but programs can only use a small number of them, e.g. 8, at any one time. A program that limits itself to 8 registers per procedure can make very fast procedure calls: The call, and the return, simply move the window to the set of 8 registers used by that procedure. (On a normal CPU, most calls "flush" the contents of the registers to RAM to clear enough working space for the subroutine, and the return "restores" those values).
The RISC project delivered the RISC-I processor in 1982. Consisting of only 44,420 transistors (compared with averages of about 100,000 in newer CISC designs of the era) RISC-I had only 32 instructions, and yet completely outperformed any other single-chip design. They followed this up with the 40,760 transistor, 39 instruction RISC-II in 1983, which ran over three times as fast as RISC-I.
At about the same time, John L. Hennessy started a similar project called MIPS at Stanford University in 1981. MIPS focused almost entirely on the pipeline, making sure it could be run as "full" as possible. Although pipelining was already in use in other designs, several features of the MIPS chip made its pipeline far faster. The most important, and perhaps annoying, of these features was the demand that all instructions be able to complete in one cycle. This demand allowed the pipeline to be run at much higher speeds (there was no need for induced delays) and is responsible for much of the processor's speed. However, it also had the negative side effect of eliminating many potentially useful instructions, like a multiply or a divide.
The earliest attempt to make a chip-based RISC CPU was a project at IBM which started in 1975, predating both of the projects above. Named after the building where the project ran, the work led to the IBM 801 CPU family which was used widely inside IBM hardware. The 801 was eventually produced in a single-chip form as the ROMP in 1981, which stood for Research (Office Products Division) Mini Processor. As the name implies, this CPU was designed for "mini" tasks, and when IBM released the IBM RT-PC based on the design in 1986, the performance was not acceptable. Nevertheless the 801 inspired several research projects, including new ones at IBM that would eventually lead to their POWER system.
In the early years, the RISC efforts were well known, but largely confined to the university labs that had created them. The Berkeley effort became so well known that it eventually became the name for the entire concept. Many in the computer industry criticized that the performance benefits were unlikely to translate into real-world settings due to the decreased memory efficiency of multiple instructions, and that that was the reason no one was using them. But starting in 1986, all of the RISC research projects started delivering products. In fact, almost all modern RISC processors are direct copies of the RISC-II design.
Berkeley's research was not directly commercialized, but the RISC-II design was used by Sun Microsystems to develop the SPARC, by Pyramid Technology to develop their line of mid-range multi-processor machines, and by almost every other company a few years later. It was Sun's use of a RISC chip in their new machines that demonstrated that RISC's benefits were real, and their machines quickly outpaced the competition and essentially took over the entire workstation market.
John Hennessy left Stanford (temporarily) to commercialize the MIPS design, starting the company known as MIPS Computer Systems. Their first design was a second-generation MIPS chip known as the R2000. MIPS designs went on to become one of the most used RISC chips when they were included in the PlayStation and Nintendo 64 game consoles. Today they are one of the most common embedded processors in use for high-end applications.
IBM learned from the RT-PC failure and went on to design the RS/6000 based on their new POWER architecture. They then moved their existing AS/400 systems to POWER chips, and found much to their surprise that even the very complex instruction set ran considerably faster. POWER would also find itself moving "down" in scale to produce the PowerPC design, which eliminated many of the "IBM only" instructions and created a single-chip implementation. Today the PowerPC is one of the most commonly used CPUs for automotive applications (some cars have over 10 of them inside). It was also the CPU used in most Apple Macintosh machines sold until 2006. Starting in February 2006, Apple switched their PowerPC products for Intel x86 processors.
Almost all other vendors quickly joined. From the UK similar research efforts resulted in the INMOS transputer, the Acorn Archimedes and the Advanced RISC Machine line, which is a huge success today. Companies with existing CISC designs also quickly joined the revolution. Intel released the i860 and i960 by the late 1980s, although they were not very successful. Motorola built a new design called the 88000 in homage to their famed CISC 68000, but it saw almost no use and they eventually abandoned it and joined IBM to produce the PowerPC. AMD released their 29000 which would go on to become the most popular RISC design of the early 1990s.
Today the vast majority of all CPUs in use are RISC CPUs, and microcontrollers. RISC design techniques offers power in even small sizes, and thus has come to completely dominate the market for low-power embedded CPUs, which are by far the largest market for processors: while a family may own one or two PCs, their car(s), cell phones, and other devices may contain a total of dozens of embedded processors. RISC had also completely taken over the market for larger workstations for much of the 90s. After the release of the Sun SPARCstation the other vendors rushed to compete with RISC based solutions of their own. Even the mainframe world is now completely RISC based.
However, despite many successes, RISC has made few inroads into the desktop PC and commodity server markets, where Intel's x86 platform remains the dominant processor architecture (Intel is facing increased competition from AMD, but even AMD's processors implement the x86 platform, or a 64-bit superset known as x86-64). There are three main reasons for this. One, the very large base of proprietary PC applications are written for x86, whereas no RISC platform has a similar installed base, and this meant PC users were locked into the x86 despite a lack of performance. The second is that, although RISC was indeed able to scale up in performance quite quickly and cheaply, Intel took advantage of its large market by spending vast amounts of money on processor development. Intel could spend many times as much as any RISC manufacturer on improving design and manufacturing, making up for flaws inherent in the basic x86 architecture. The same could not be said about smaller firms like Cyrix and NextGen, but they all realized that they could apply RISC design philosophies and practices to Intel's architecture. The first x86 CPU to deploy RISC techniques was the NextGen Nx586, released in 1994, and it did this by expanding the majority of the CISC instructions into multiple simpler RISC operations. Internally the Nx586, Intel P6, AMD K5 and Cyrix 6x86 are RISC machines that emulate a CISC architecture.
Consumers are interested in speed, cost per chip, and compatibility with existing software rather than the cost of development of new chips. This has led to an interesting chain of events. As the complexity of developing ever more advanced CPUs rises, the cost of both development and fabrication of high-end CPUs has exploded. The cost gains given by RISC are now dwarfed by the high costs of developing any modern CPU. Today, only the biggest chip makers are able to make high performing CPUs. The result is that virtually all RISC platforms with the exception of IBM's Power Architecture have greatly shrunk in scale of development of high performing CPUs (like SPARC and MIPS) or were abandoned (like Alpha and PA-RISC) during the 00s. As of 2004, x86 chips are the fastest CPUs in SPECint displacing all RISC CPUs, and the fastest CPU in SPECfp is the IBM Power 5 processor.
Still, RISC designs have led to a number of successful platforms and architectures, some of the larger ones being:
- MIPS's MIPS line, found in most SGI computers and the PlayStation, PlayStation 2, PlayStation Portable, and Nintendo 64 game consoles
- IBM's and Freescale's (formerly Motorola SPS) Power Architecture, used in all of IBM's supercomputers, midrange servers and workstations, in Apple's Power Macintosh computers, in Nintendo's Gamecube and Wii, Microsoft's Xbox 360 and Sony's PlayStation 3 game consoles, and in many embedded applications like printers and cars.
- Sun's SPARC and UltraSPARC, found in all of their later machines
- Hewlett-Packard's PA-RISC, also known as HP/PA
- DEC Alpha, still used in some of HP's workstation and servers.
- XAP processor used in many wireless chips, e.g. Bluetooth
- ARM — Palm, Inc. originally used the (CISC) Motorola 680x0 processors in its early PDAs, but now uses (RISC) ARM processors in its latest PDAs. Apple Computer uses the ARM 7TDMI in its iPod products. Nintendo uses an ARM7 CPU in the Game Boy Advance and both an ARM7 and ARM9 in the Nintendo DS handheld game systems. The small Korean company Game Park also markets the GP32, which uses the ARM9 CPU. Also, many cell phones from, for example, Nokia are based on ARM designs.
- Hitachi's SuperH, originally in wide use in the Sega Saturn and Sega Dreamcast, now at the heart of many consumer electronics devices. The SuperH is the base platform for the Mitsubishi - Hitachi joint semiconductor group. The two groups merged in 2003, dropping Mitsubishi's own RISC architecture, the M32R.
Over many years, RISC instruction sets have tended to grow in size. Thus, some have started using the term "load-store" to describe RISC processors, since this is the key element of all such designs. Instead of the CPU itself handling many addressing modes, a load-store architecture uses a separate unit dedicated to handling very simple forms of load and store operations. CISC processors are then termed "register-memory" or "memory-memory".