PowerPC on Apple: An Architectural History, Part I

The first part of a three-part series on the PowerPC CPUs found in Macintosh …

Jon Stokes

Introduction

After I completed my recent architectural history of the Pentium product line (Part I, Part II), I got some requests from Apple fans to do a similar treatment of the PowerPC family of processors. When I agreed to look into the task, what I quickly grasped was that the PowerPC family tree is more like a family jungle, with variants of different processors combining with other lines to give rise to yet more processors, all for an array of markets that ranges from mainframes to routers to game consoles.

This being the case, the first decision that I made was to focus my coverage exclusively on PPC chips that have seen use in shipping Apple products. I stress the word "shipping" in that previous sentence, because there are a few lifeless branches on the aforementioned family tree that never quite sprouted. So even though Mac fans like to fantasize about What Might Have Been had this or that wonderchip seen the light of day, I'm not going to spend any time in this article on lost lore and apocryphal tales, as fascinating as such things certainly are.

And while I'm on the subject of what won't be in the article, I should take a moment to address another point that's very important to me as someone who started writing about CPUs in the context of the Mac vs. PC flame wars of the late 90s and who has seen way more than I care to of the ugly side of platform zealotry. This article is not an attempt to demonstrate the innate superiority of The Mac Way, or to prove that RISC 0wn3z lame old CISCy x86, or to relive any of the platform debates of yesteryear. I keep cross-platform comparisons to a minimum, so if you're looking for religion, look elsewhere. And following on this point, if you send me feedback in all caps and/or accuse me of being a bigoted Apple hater and/or suggest that if I "got laid" I might develop a proper appreciation for the Mac's superiority (why do true-blue Mac zealots always do this?) and/or flame me in any other way for any sins (real or imagined) against your pet platform (Mac or PC), then don't expect me to read the entire email before deleting it, and don't take my failure to reply as evidence that you must be right.

Finally, a word of warning to readers who might expect this article to have the same kind of larger narrative contours that the Pentium series had. The history of Apple's use of the PowerPC is the history of three companies: Apple, IBM, and Motorola. As such, the kind of story that I was able to tell about the Pentium brand would be a much more complicated tale involving multiple characters were I to try and duplicate that feat with the PowerPC line, so I'm not even going to bother trying. Besides, the fact that I'm a relative late-comer to the Mac platform means that I didn't follow the ins and outs of the AIM/PPC saga first-hand, so I'm not the most qualified person to write the kind of detailed history of it that does indeed deserve to be written.

In fact, I didn't even start following the platform wars until the G3 was well established. This being the case, the present series is a straight-ahead technical look at the PPC chips that have formed the heart of Apple's product line, no more and no less. Thus the processors, and not Apple as a company or Apple's computers as a whole, are the main focus of this series. So even though this article is organized around processors that Apple has used, Apple is a bit player in this story; the PowerPC processors are the stars of the show.

Basic processor terminology

If you've not read much of my writing before and you're unfamiliar with some key microprocessor concepts, then you may have some trouble getting through this article. If you want to take a short detour and familiarize yourself with some of the key concepts that I'll be using in this article, consider checking out some of the following links. These links point to specific pages of past articles that explain certain key microprocessor concepts in more detail than I'll go into here.

Check out some of the links in the list above to get yourself oriented, and keep the list handy as you read this series in case you need to refer back to it.

A brief history of PowerPC

This is not the place to recap the sordid history of the erstwhile AIM alliance, but a few notes about the origins of the PowerPC line are in order here at the outset of our discussion.

In the beginning was POWER (Performance Optimization With Enhanced RISC), IBM's RISC architecture developed for use in its workstations and servers. There was also Motorola, whose 68000 (a.k.a. the 68K) processor formed the core of Apple's desktop computing line, and whose more advanced 88000 processor wasn't making much headway in the market due to a lack of backwards compatibility with the 68K.

To make a long story very short, IBM needed a way to turn POWER into a wider range of computing products for use outside the server closet, Motorola needed a high-end RISC microprocessor in order to compete in the RISC workstation market, and Apple needed a CPU for its personal computers that would be both cutting-edge and backwards compatible with the 68K.

Thus the AIM (Apple, IBM, Motorola) alliance was born, and with it was also born a subset of the POWER architecture dubbed PowerPC. PowerPC processors were to be jointly designed and produced by IBM and Motorola with input from Apple, and were to be used in Apple computers and in the embedded market.

PowerPC 601

PowerPC 601 summary table

Introduction date: March 14, 1994
Process: 0.60 micron
Transistor Count: 2.8 million
Die size: 121mm2
Clock speed at introduction: 60-80MHz
Cache sizes: 32KB unified L1
First appeared in: Power Macintosh 6100/60

In 1993, AIM kicked off the PowerPC party by releasing the 32-bit PowerPC 601 at an initial speed of 66MHz. The 601, which was based on IBM's RISC Single Chip processor (RSC), combined IBM's POWER architecture with the 60x bus developed by Motorola for use with their 88000, and was designed to serve as a "bridge" between POWER and PowerPC. IBM describes this bridging aspect of the chip as follows:

The PowerPC 601 processor provides a bridge between the POWER and PowerPC Architectures by supporting most of the PowerPC and POWER instructions. The PowerPC 601 executes all compiler-generated user-level POWER instructions. The implementation also supports all but a few of the 32-bit PowerPC instructions. Existing binaries, which perform well on the PowerPC 601, may show a degradation on future PowerPC implementations which use emulation. However, the PowerPC 601 bridge allows time for software developers to recompile to a common set of instructions.

Even though the joint IBM-Motorola team in Austin, TX only had 12 months to get this chip off the ground, it was a very nice and full-featured RISC design for its time. This section will take a look at the architecture of the 601, as a way of laying the groundwork for a discussion of the PowerPC chips that followed it.

If you read my two-part architectural history of the Pentium line, then you know how complex the different Pentiums' front-ends tend to be. There's none of that with the 601, which has a classic four-stage RISC integer pipeline:

  1. Fetch
  2. Decode/Dispatch
  3. Execute
  4. Writeback

The fact that PowerPC's RISC instructions are all the same size means that the 601's instruction fetch logic doesn't have the instruction alignment headaches that plague x86 designs, which means that the fetch hardware can be simpler and faster. Back when transistor budgets were tight, this kind of thing could make a big difference.
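
To make that concrete, here's a toy Python sketch of my own (the 32-byte block size is just an example, not the 601's actual fetch width): with fixed 4-byte instructions, carving a block of code into instructions is pure indexing, with no length decoding required.

    # Illustrative only: fixed-length instructions let the fetcher slice a block
    # of code into instruction slots with simple indexing. An x86 fetcher can't
    # know where one instruction ends until it has at least partially decoded it.
    cache_block = bytes(range(32))                # pretend this is 32 bytes of code
    instructions = [cache_block[i:i + 4] for i in range(0, len(cache_block), 4)]
    print(len(instructions))                      # 8 fixed-size "instructions", no alignment logic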


Figure 1: PowerPC 601 architecture

As you can see from the diagram above, up to 8 instructions per cycle can be fetched directly into an eight-entry instruction queue (IQ), where they're decoded before being dispatched to the execution core. Get used to seeing this instruction queue, because it shows up in every single PPC model that we'll discuss in this series, all the way up to the PPC 970.

The instruction queue is used mainly for detecting and dealing with branches. The 601's branch unit scans the bottom four entries of the queue, identifying branch instructions and determining what type they are (conditional, unconditional, etc.). In cases where the branch unit has enough information to resolve the branch right then and there (e.g. in the case of an unconditional branch, or a conditional branch whose condition is dependent on information that's already in the condition register), the branch instruction is simply deleted from the instruction queue and replaced with the instruction located at the branch target.

This branch-elimination technique, called branch folding, speeds performance in two ways. First, it eliminates an instruction (the branch) from the code stream, which frees up dispatch bandwidth for other instructions. Second, it eliminates the single-cycle pipeline bubble that usually occurs immediately after a branch. (For more on branch bubbles and branch folding, go to this page of my P4 vs. G4e article and read the text under the very last diagram.)
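
If you'd like a more concrete picture of branch folding, here's a toy Python sketch of my own devising (it's not the 601's actual logic, and the instruction encoding is made up): a resolvable branch sitting in the queue is simply overwritten with the instruction at its target.

    # Toy branch folding: an unconditional branch in the queue is replaced, in
    # place, by the instruction at its target, so the branch never consumes a
    # dispatch slot and never creates a fetch bubble.
    program = {
        0: ("add", None), 1: ("b", 5), 2: ("sub", None),   # pc 1: unconditional branch to pc 5
        5: ("mul", None), 6: ("add", None),
    }

    def fetch(pc):
        op, target = program.get(pc, ("nop", None))
        return (op, target, pc)

    def fold_branches(queue):
        folded = []
        for op, target, pc in queue:
            if op == "b":                      # a branch we can resolve right now
                folded.append(fetch(target))   # fold: substitute the target instruction
            else:
                folded.append((op, target, pc))
        return folded

    queue = [fetch(pc) for pc in (0, 1)]       # fetched in program order
    print(fold_branches(queue))                # the branch is gone; 'mul' from pc 5 sits in its slot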

Non-branch instructions sit in the instruction queue while the dispatch logic examines the four bottommost entries to see which of them it can send off to the execution core on the next cycle. The dispatch logic can dispatch three instructions per cycle out-of-order from the bottom four queue entries, with a few restrictions that I won't go into here. (Actually, I should mention one because it's pretty important: integer instructions can only be dispatched from the bottommost queue entry.)

You'll notice that the 601 has no equivalent to the Pentium Pro's reorder buffer (ROB) for keeping track of the original program order. IBM's whitepaper on the 601 just says that the instructions are tagged with what amounts to metadata so that the write-back logic can commit the results to the register file in program order. This technique of tagging instructions with program-order "metadata" works fine for a simple design like the 601 with a very small instruction window, but later PPC designs will require dedicated structures for tracking larger numbers of in-flight instructions and making sure that they retire in-order.

From the dispatch stage instructions go into one of three different execution units: the integer unit, the floating-point unit, and the branch unit. Let's take a look at each of these units in turn.

The 601's integer unit

The 601's 32-bit integer unit is a straightforward fixed-point ALU that's responsible for all the integer math ? including address calculations ? on the chip. While contemporary x86 designs, like the original Pentium, needed dedicated address adders to keep all of the address calculations associated with x86's multiplicity of addressing modes from tying up the execution core's integer hardware, the 601's load-store, RISCy nature meant that it could feasibly handle memory traffic and regular ALU traffic with a single integer execution unit.

So the 601's integer unit handles the following memory-related functions, most of which are moved off into a dedicated load-store unit in subsequent PPC designs:

  • Integer and floating-point load-address calculations
  • Integer and floating-point store-address calculations
  • Integer and floating-point load-data operations
  • Integer store-data operations

Cramming all of these load-store functions into the 601's single integer ALU didn't help the chip's integer performance, but it was good enough to keep up with the Pentium in this area despite the fact that the Pentium had two integer ALUs. I imagine that most of this integer performance parity came from the 601's huge 32K unified L1 cache (compare the Pentium's split 8K + 8K L1).

A final point worth noting about the 601's integer unit is that multicycle integer instructions (e.g., integer multiplies and divides) are not fully pipelined. So when an instruction that takes, say, five cycles to execute enters the IU, it ties up the entire IU for the whole five cycles. Thankfully, the most common integer instructions are single-cycle instructions.
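
A quick back-of-the-envelope comparison shows why this matters (the five-cycle latency below is just the hypothetical figure from above, not a real 601 timing):

    # Rough throughput math, not cycle-accurate 601 timing: a non-pipelined unit
    # is busy for an instruction's full latency, while a pipelined unit could
    # accept a new instruction every cycle.
    def total_cycles(n_instructions, latency, pipelined):
        if pipelined:
            return latency + (n_instructions - 1)   # first result after 'latency', then one per cycle
        return n_instructions * latency             # each instruction occupies the unit fully

    print(total_cycles(10, latency=5, pipelined=False))  # 50 cycles for ten 5-cycle instructions
    print(total_cycles(10, latency=5, pipelined=True))   # 14 cycles if the unit were fully pipelined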

At any rate, while the 601 may not have been smoking its competitors in integer performance, floating-point was another story.

The 601's floating-point unit

With its single floating-point unit, which handled all floating-point calculations and floating-point store-data operations, the 601 was a very strong floating-point performer.

The 601's floating-point pipeline was six stages long, and included the four basic stages outlined above, but with an extra decode stage and an extra execute stage. What really set the chip's floating-point hardware apart was the fact that not only were almost all single-precision operations fully pipelined, but most double-precision (64-bit) floating-point operations were, as well. This meant that for single-precision operations (with the exception of divides) and most double-precision operations, the 601's floating-point hardware could turn out one instruction per cycle with a two-cycle latency.

Another factor in the 601's floating-point dominance was that its integer unit handled all of the memory traffic, with the exception of floating-point stores. This meant that during long stretches of floating-point-only code, the integer unit acted like a dedicated load-store unit (LSU) whose sole purpose was to keep the FPU fed with data.

Such an FPU + LSU combination rocked for two reasons: first, integer and floating-point code are rarely mixed, so it didn't matter for performance if the integer unit was tied up with floating-point-related memory traffic. Second, floating-point code is often data-intensive, with lots of loads and stores, and thus plenty of memory traffic to keep a de facto LSU busy.

When you combine both of these factors with the 601's hefty 32K L1 cache, you have a floating-point force to be reckoned with in 1994 terms.

The 601's branch execution unit

The 601's branch unit (BU) works in combination with the instruction fetcher to steer the front-end of the processor through the code stream by executing branch instructions and predicting branches. Regarding the latter function, the 601's BU uses a simple static branch predictor to predict conditional branches. We'll talk a bit more about branch prediction and speculative execution in the next major section, though.

The sequencer unit

The 601 contained a peculiar holdover from the IBM RSC chip, called the sequencer unit. The sequencer unit, which I'll admit is a bit of a mystery to me, appears to be a small, CISC-like processor with its own 18-bit instruction set, 32-word RAM, microcode ROM, register file, and execution unit, all of which are embedded on the 601. Its purpose is to execute some legacy instructions particular to the older RSC; to take care of housekeeping chores like self-test, reset, and initialization functions; and to handle exceptions, interrupts, and errors.

The inclusion of the sequencer unit on the 601 was quite obviously the result of the time crunch that the 601 team was under in bringing the first PowerPC chip to market; IBM admits this much in their 601 whitepaper. The team started with IBM's RSC chip as its basis, and began redesigning it to implement the PowerPC ISA. Instead of throwing out the sequencer unit, a component which played a major role in the functioning of the original RSC, they simply scaled back its size and functionality for use in the 601.

I don't have any exact figures, but I think it's safe to say that this embedded subprocessor unit took up a decent amount of die space on the 601, and that the design team would have thrown it out if they'd had more time. Subsequent PowerPC processors, which don't have to worry about RSC legacy support, implement all of the (non-RSC-related) functions of the 601's sequencer unit by spreading them out into other functional blocks.

601 conclusions

The 601 could spend a ton of transistors (at least, a ton for its day) on a 32K cache because its front end was so much simpler than that of the Pentium. This was a real advantage of RISC at the time. The chip made its debut in the PowerMac 6100 to good reviews, and it put Apple in the performance lead over its x86 competition. The 601 went a long way toward firmly establishing Apple as a maker of high-end computers.

Nonetheless, the 601 did leave some room for improvement. The sequencer unit that it inherited from its mainframe ancestor took up valuable die space that could've been put to better use. With a little more time to tweak it, the 601 could've been closer to perfect. But near perfection would have to wait for the one-two punch of the 603e and 604.

(Incidentally, those of you who've read my PPC 970 coverage may find some features of the 601's story familiar. A high-performance processor is adapted from an IBM high-end server chip for use in an Apple desktop. The chip is rushed to market but delivers solid performance in spite of a few flaws related to the time crunch that the design team was under. And if you haven't read my PPC 970 coverage, then you'll be seeing this story again.)

The PowerPC 603 and 603e

PowerPC 603 vitals
Introduction date: May 1, 1995
Process: 0.50 micron
Transistor count: 1.6 million
Die size: 81mm2
Clock speed at introduction: 75MHz
L1 cache size: 16K split L1
First appeared in: Macintosh Performa 5200CD

PowerPC 603e vitals
Introduction date: October 16, 1995
Process: 0.50 micron
Transistor count: 2.6 million
Die size: 98mm2
Clock speed at introduction: 100MHz
L1 cache size: 32K split L1
First appeared in: Macintosh Performa 6300CD

While the Austin team was putting the finishing touches on the 601, a team in Somerset had already begun working on the 601's successor, the 603. The 603 is a significantly different design from the 601, so it's less an evolutionary step than a completely different processor.

The 603 was designed with low power in mind, because Apple needed a chip for its PowerBook line. As a result, the processor had a very good performance-per-watt ratio on native PowerPC code, and in fact was able to match the 601 clock-for-clock even though it had about half the transistor count of the older processor. But the 603's smaller 16K split L1 cache meant that it stunk at emulating the legacy 68K code that formed a large part of Apple's OS and application base.

Thus the 603 was relegated to the very lowest end of Apple's product line (the Performas, beginning with the 6200; and the all-in-ones designed for the .edu market, beginning with the 5200), until a tweaked version (the 603e) with an enlarged, 32K split cache was released. The 603e performed better on emulated 68K code, so it saw widespread use in the PowerBook line.


Figure 2: PowerPC 603 architecture

Note that the 604 was also released at the same time as the original 603. The 604, which was intended for use in Apple's high-end products just like the 603 was intended for low-end products, was yet another brand-new design. We'll cover the 604 in the next section, though.

This section will take a quick look at the architecture of the 603e, because it was the version of the 603 that saw the most widespread use.

The 603e's execution core

Like the 601, the 603e sports the classic RISC four-stage pipeline. But unlike the 601, which can decode and dispatch up to three instructions per cycle to its execution core, the 603e is capable of decoding and dispatching only two instructions per cycle. The 603e's dispatch logic takes a maximum of two instructions per cycle from the bottom of the instruction queue and passes them to the execution core, where they're executed by one of five execution units:

  1. Integer Unit
  2. Floating-point Unit
  3. Branch Unit
  4. Load-Store Unit
  5. System Unit

You'll notice that the above list contains two more units than the analogous list for the 601: the load-store unit (LSU) and the system unit. The 603e's load-store unit takes over all of the address calculating labors that the older 601 foisted onto its lone integer ALU. Because the 603e has a dedicated LSU for performing address calculations and executing store-data operations, its integer unit is freed up from having to handle memory traffic and can therefore focus solely on integer arithmetic. This helps to improve the 603e's performance on integer code.

The 603e's dedicated system unit also takes over some of the functions of the 601's integer unit in that it handles updates to the PowerPC condition register. We'll talk more about the condition register in the article in this series that covers the 970, so don't worry if you don't know what it is. The 603e's system unit also contains a limited integer adder, which can take some of the burden off of the integer ALU by doing certain types of addition. (Note that the original 603's system unit lacked this feature.)

The 603e's basic floating-point pipeline differs from that of the 601 in that it has one more execute stage and one less decode stage. So most floating-point instructions have a three-cycle latency (and a one-cycle throughput) on the 603e, vs. a two-cycle latency on the 601. This three-cycle latency/one-cycle throughput design wouldn't be bad at all if it weren't for one serious problem: at its very fastest, the 603e's FPU can only execute three instructions every four cycles. In other words, after every third single-cycle floating-point instruction, there's a mandatory pipeline bubble. I won't get into the reason for this, but the 603e's FPU takes a non-trivial hit to performance for this three instructions/four cycles design.

The 603e's floating-point unit isn't all bad news, though. It can do both single- and double-precision floating-point multiply-add (fmadd) operations very rapidly, with a four-cycle latency and a one-cycle throughput. The fmadd is a core DSP (digital signal processing) calculation, so the 603e's fast fmadd capabilities make it a good DSP chip. Also, it improves on the 601's double-precision floating-point capabilities somewhat.
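
To see why a fast fmadd matters so much for DSP work, consider the inner loop of a FIR filter (or a dot product). The Python below is just my own illustration of the data flow, not PowerPC code: every tap is an "accumulator = sample × coefficient + accumulator" step, which is exactly the a×b + c shape that fmadd computes in a single instruction.

    # Each loop iteration maps onto one fused multiply-add on a PPC FPU.
    def fir(samples, coefficients):
        acc = 0.0
        for s, c in zip(samples, coefficients):
            acc = s * c + acc        # the a*b + c pattern of fmadd
        return acc

    print(fir([1.0, 2.0, 3.0, 4.0], [0.25, 0.25, 0.25, 0.25]))  # 2.5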

The 603e's front end and branch prediction

Up to 2 instructions per cycle can be fetched into the 603e's six-entry instruction queue. From there, a maximum of two instructions per cycle (one fewer than the 601) can be dispatched to the reservation stations in the 603e's execution core.

Not only does the 603e dispatch one less instruction per cycle to its execution core than does the 601, but its overall approach to superscalar and out-of-order execution differs from that of the 601 in another way, as well. The 603e uses a dedicated completion unit, which contains a five-entry completion queue (analogous to the P6's reorder buffer) for keeping track of the program order of in-flight instructions. When instructions execute out of order, the completion unit refers to the information stored in the completion queue and puts the instructions back in program order before retiring them. The 603e's completion queue is able to retire at most two instructions per cycle in program order.

The added instruction-tracking hardware, in the form of the completion queue and reservation stations, is needed because in spite of its narrower dispatch, the 603e's higher number of execution units means that it can have more instructions in-flight than the 601. This higher number of in-flight instructions makes it infeasible to manage instruction ordering after the manner of the 601.

To use a term that figured prominently in my architectural history of the Pentium, the 603 is the first PPC processor to feature a full-blown instruction window, complete with a reorder buffer (ROB) and reservation stations. We'll talk more about the concept of the instruction window, and about the structures that make it up (the ROB and the reservation stations) in the next section on the 604. For now, it suffices to say that the 603's instruction window is quite small compared to that of its successors: three of its four reservation stations are only single-entry, and one is double-entry (the one attached to the load-store unit).

The 603 and 603e followed the 601 in their ability to do speculative execution by means of a simple, static branch predictor. Like the static predictor on the 601, the 603e's predictor marks forward branches as not taken and backward branches as taken. This static branch predictor is simple and fast, but it's only mildly effective compared to even a weakly-designed dynamic branch predictor. If a PPC user wanted dynamic branch prediction, then they had to upgrade to the 604.
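
The rule itself is simple enough to express in a couple of lines. This is only a model of the heuristic described above, not the 603e's actual hardware:

    # Static prediction: backward branches (loop back-edges) are guessed taken,
    # forward branches are guessed not-taken.
    def static_predict(branch_pc, target_pc):
        """Return True if the branch is predicted taken."""
        return target_pc < branch_pc

    print(static_predict(branch_pc=0x1040, target_pc=0x1000))  # True: backward, looks like a loop
    print(static_predict(branch_pc=0x1040, target_pc=0x1080))  # False: forward, e.g. skipping an error path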

603e conclusions

With its stellar performance-per-watt ratio, the 603e was a great little embedded processor, and it would've made a decent lower-end to mid-range desktop processor as well if it weren't for Apple's legacy 68K code base. The 603e's tweaks and larger cache size helped with the legacy problems somewhat, but the chip still played second fiddle in Apple's product line to the larger, much more powerful 604.

The PowerPC 604

PowerPC 604 vitals
Introduction date: May 1, 1995
Process: 0.50 micron
Transistor count: 3.6 million
Die size: 197mm2
Clock speed at introduction: 120MHz
L1 cache size: 32K split L1
First appeared in: Power Mac 9500/120

PowerPC 604e vitals
Introduction date: July 19, 1996
Process: 0.35 micron
Transistor count: 5.1 million
Die size: 148mm2
Clock speed at introduction: 180-200MHz
L1 cache size: 64K split L1
First appeared in: PowerComputing PowerTower Pro 200 (Power Mac 9500/180 on August 7, 1996)

At the same time that the 603e was making its way toward the market, the 604 was in the works as well. The 604 was to be Apple's high-end PPC desktop processor, so its power and transistor budget was much higher than that of the 603e. A quick glance at a diagram of the 604 will show some obvious ways that it differs from its lower-end sibling.


Figure 3: PowerPC 604 architecture

The 604's pipeline and execution core

The 604's pipeline is deeper than that of the 601 and the 603, and consists of the following six stages:

  1. Fetch
  2. Decode
  3. Dispatch
  4. Execute
  5. Complete
  6. Write-back

Note the two new stages, dispatch and complete: they've been added to the classic RISC 4-stage pipeline that the PPC 601 and PPC 603 use. These two new stages are also the hallmark of out-of-order (OOO) execution and of the instruction window that makes OOO execution possible. I'll explain just how these two new stages work in the section on the instruction window below, but for now all you need to understand is that this lengthened pipeline enables the 604 to reach higher clock speeds than its predecessors. Because each pipeline stage is simpler, it takes less time to complete, which means that the CPU's clock cycle time can be shortened. (For a more detailed discussion of pipelining, see Understanding the Microprocessor and my series on the P4 vs. the G4e).
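
The arithmetic behind that clock speed claim is easy to sketch. The numbers below are hypothetical (real designs never divide the work this evenly, and latch overhead eats into the gains), but they show the basic relationship between pipeline depth and clock rate:

    # Idealized: splitting the same logic across more stages shortens the
    # critical path per stage, which raises the achievable clock rate.
    total_work_ns = 40.0                        # hypothetical logic delay per instruction
    for stages in (4, 6, 12):
        cycle_time_ns = total_work_ns / stages  # ignoring latch/skew overhead
        print(stages, "stages ->", round(1000 / cycle_time_ns), "MHz")
    # 4 stages -> 100 MHz, 6 stages -> 150 MHz, 12 stages -> 300 MHz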

Another factor that really sets the 604 apart from the other 600-series PPC designs is its wider execution core. The 604 can execute up to six instructions per clock cycle in the following six execution units:

  • Branch unit
  • Load-store unit
  • Floating-point unit
  • Three integer units
    • Two simple integer units (SIUs)
    • One complex integer unit (CIU)

Unlike the other 600-series designs, the 604 has multiple integer units. This division of labor, where multiple fast integer units execute simple integer instructions and one slower integer unit executes complex integer instructions, will be familiar to anyone who read my architectural history of the Pentium line (or my articles on the PPC 970). The two simple integer units act like express checkout lanes in the supermarket. Any integer instruction that takes only a single cycle to execute can pass through one of the two SIUs. On the other hand, integer instructions that take multiple cycles to execute, like integer divides, must pass through the slower CIU.
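
In dispatch terms, the "express lane" rule amounts to a simple routing decision. The sketch below is my own shorthand (the mnemonics are real PowerPC ones, but the routing table is illustrative, not Motorola's):

    # Single-cycle integer ops can go to either simple integer unit; multi-cycle
    # ones are routed to the lone complex integer unit.
    SINGLE_CYCLE_OPS = {"add", "subf", "and", "or", "xor"}

    def route(op):
        return "SIU" if op in SINGLE_CYCLE_OPS else "CIU"

    for op in ("add", "mullw", "or", "divw"):   # mullw/divw take multiple cycles
        print(op, "->", route(op))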

Like the 603e, the 604 has register renaming, a technique that is facilitated by the 12-entry register rename file attached to the 32-entry general-purpose register file. These rename buffers allow the 604's execution units more options for avoiding false dependencies and register-related stalls.
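
Here's a toy rename stage in Python, just to show what those rename buffers buy you (this is my own simplification, not the 604's actual renaming logic): once each write to r1 gets its own rename register, the second write no longer has to wait behind earlier instructions that read or wrote r1.

    # Renaming removes write-after-read and write-after-write hazards by giving
    # every new destination value its own rename register.
    free_rename_regs = ["p0", "p1", "p2", "p3"]    # a tiny pool of rename registers
    rename_map = {}                                 # architectural name -> rename register

    def rename_src(arch_reg):
        return rename_map.get(arch_reg, arch_reg)   # latest mapping, or the committed value

    def rename_dest(arch_reg):
        phys = free_rename_regs.pop(0)              # allocate a fresh rename register
        rename_map[arch_reg] = phys
        return phys

    # r1 = r2 + r3 ; r4 = r1 * r5 ; r1 = r6 - r7   <- note the second write to r1
    for dest, srcs in [("r1", ("r2", "r3")), ("r4", ("r1", "r5")), ("r1", ("r6", "r7"))]:
        mapped_srcs = [rename_src(s) for s in srcs]  # sources read existing mappings
        mapped_dest = rename_dest(dest)              # destination gets a fresh register
        print(mapped_dest, "<-", mapped_srcs)
    # The two writes to r1 land in different rename registers (p0 and p2), so the
    # subtraction needn't wait for the multiply to read the old r1.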

The 604's floating-point unit does most single- and double-precision operations with a three-cycle latency, just like with the 603e. Unlike the 603e, though, the 604's floating-point unit is truly pipelined for almost all instructions, even double-precision multiplies. Finally, the 604's 32-entry floating-point register file is attached to a 12-entry floating-point rename register buffer.

The 604's load-store unit (LSU) is also similar to that of the 603e. Like the 603e's LSU it contains an adder for doing address calculations and it handles all load-store traffic, but unlike the 603e it is connected to deeper load and store queues and allows a little more flexibility for the optimal reordering of memory operations.

You'll notice that the above list of execution units is missing a unit that was present on the 603e: the system unit. The 603e's system unit handled updates to the PPC condition register, a function that was handled by the integer execution unit on the older 601. The 604 moves the responsibility of dealing with the condition register onto the branch unit. So the 604's branch unit handles all condition register logical operations, in addition to its normal task of executing branch instructions. What does this do for performance? It probably doesn't have a huge impact, but whatever impact it does have is significant enough that the 604's immediate successor, the 604e, added a dedicated functional unit for CR logical operations.

The 604's front-end

The 604's branch unit also features a dynamic branch prediction scheme that's a vast improvement over the 603e's static branch predictor. The 604 has a large, 512-entry branch history table (BHT) with two bits per entry for tracking branches, coupled with a 64-entry branch target address cache (BTAC, which is the equivalent of the Pentium's BTB).
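
Each of those two-bit entries is a little saturating counter. The sketch below shows the standard scheme (the indexing and initial state are illustrative; I'm not claiming they match the 604's exact implementation): a strongly held prediction has to be wrong twice in a row before it flips, so a single loop exit doesn't also wreck the prediction for the next trip through the loop.

    # Two-bit saturating counters: 0-1 predict not-taken, 2-3 predict taken.
    BHT_ENTRIES = 512

    class TwoBitPredictor:
        def __init__(self):
            self.table = [1] * BHT_ENTRIES          # start weakly not-taken (illustrative)

        def predict(self, branch_pc):
            return self.table[branch_pc % BHT_ENTRIES] >= 2

        def update(self, branch_pc, taken):
            i = branch_pc % BHT_ENTRIES
            self.table[i] = min(3, self.table[i] + 1) if taken else max(0, self.table[i] - 1)

    bp = TwoBitPredictor()
    for outcome in [True, True, True, False, True, True]:   # a loop branch with one exit
        print("predicted taken:", bp.predict(0x2000), " actually taken:", outcome)
        bp.update(0x2000, outcome)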

As always, the more transistors you spend on branch prediction, the better performance will be, so the 604's branch unit helps it quite a bit. Still, in the case of a mispredict, the 604's longer pipeline will have to pay a higher price than its shorter-pipelined predecessors in terms of performance. Of course, the bigger performance loss associated with a mispredict is also why the 604 spends those extra resources on branch prediction.

The rest of the 604's front end looks like a combination of the best features of the 601 and the 603e. Like the 601, the 604's instruction queue is eight entries deep. Instructions are fetched from the L1 cache into the queue where they're decoded before being dispatched to the execution core. The 604's dispatch logic can dispatch up to four instructions per cycle (up from two on the 603e and three on the 601) from the bottom four entries of the instruction queue to the execution core. Though Motorola's 604 whitepaper doesn't explicitly say, I assume that the instructions are dispatched in-order.

If you looked closely at the diagram of the 604, you probably noticed reservation stations attached to the tops of the execution units. Dispatched instructions go from the IQ into these reservation stations, where they wait for their operands to become available. Once all of the operands for an instruction are available, it's eligible to issue to the attached execution unit.

The 604's five reservation stations are relatively small, two-entry affairs, but they make up the heart of the 604's instruction window because they allow instructions to issue out of program order. Their small size as compared to similar structures on the P6 and P4 is due to the fact that the 604's pipeline is relatively short. Pipeline stalls aren't quite as devastating for performance on a machine with a six-stage pipeline as they are on a machine with a twelve- or twenty-stage pipeline, so the 604 doesn't need as large an instruction window as its superpipelined counterparts.
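
A reservation station entry, reduced to its essentials, is just an instruction plus a scoreboard of which operands have shown up. Here's a toy version (two entries, like the 604's stations, though the bookkeeping is my own invention):

    # An instruction issues as soon as all of its source operands are available,
    # even if an older instruction in the same station is still waiting.
    station = [
        {"op": "add", "srcs": {"r1", "r2"}, "ready": {"r2"}},        # older, still waiting on r1
        {"op": "sub", "srcs": {"r3", "r4"}, "ready": {"r3", "r4"}},  # younger, all operands ready
    ]

    def pick_issue(entries):
        for entry in entries:                    # scan oldest-first
            if entry["srcs"] <= entry["ready"]:  # all sources available?
                return entry
        return None

    print(pick_issue(station)["op"])             # 'sub' issues ahead of the older 'add'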

As with the P6 architecture, the reservation stations aren't the only structures that make up the 604's instruction window. The 604 has a 16-entry reorder buffer (ROB) which performs the same function as the P6 core's much larger 40-entry ROB. In my first article on the Pentium, I described the P6's ROB as follows:

The ROB is like a large log book in which the P6 can record all the essential information about each instruction that enters the execution core. The primary function of the ROB is to ensure that instructions come out one end of the out-of-order execution core in the same order in which they entered it. In other words, it's the reservation station's job to see that instructions are executed in the most optimal order, even if that means executing them out of program order, and it's the reorder buffer's job to ensure that the finished instructions get put back in program order and that their results are written to the architectural register file in the proper sequence. To this end, the ROB stores data about each instruction's status, operands, register needs, original place in the program, etc.

This description fits the 604's ROB and reservation stations, as well.

The ROB, which corresponds to the simpler completion queue on older PPC processors, is the reason for the two extra pipeline stages I described at the beginning of our discussion of the 604 (the dispatch and completion stages). In the first of these two stages, the dispatch stage, not only are instructions sent to the execution core, but each dispatched instruction is also allocated an entry in the ROB and a set of rename registers. In the second of these two stages, the completion stage, the instructions are put back in program order so that their results can be written back to the register file in the subsequent write-back stage.
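
Stripped of the details, the ROB's job looks like this (a minimal sketch, not the 604's exact bookkeeping): entries are allocated in program order at dispatch, marked done whenever execution happens to finish, and retired only from the head of the buffer.

    # Completion out of order, retirement strictly in order.
    from collections import deque

    rob = deque()                                  # head = oldest in-flight instruction

    def dispatch(tag):
        rob.append({"tag": tag, "done": False})    # allocate an entry at dispatch time

    def complete(tag):
        for entry in rob:                          # execution may finish out of order
            if entry["tag"] == tag:
                entry["done"] = True

    def retire():
        retired = []
        while rob and rob[0]["done"]:              # only the head may retire
            retired.append(rob.popleft()["tag"])
        return retired

    for t in ("i1", "i2", "i3"):
        dispatch(t)
    complete("i3")                                 # i3 finishes first...
    print(retire())                                # ...but retires nothing: i1 is still pending
    complete("i1"); complete("i2")
    print(retire())                                # now i1, i2, i3 retire in program order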

If you want to know more about instruction windows, ROBs, and reservation stations in general, then this page of the first Pentium article is a good place to start.

At any rate, the 604's ROB is much smaller than the P6's ROB for the same reason that the 604's reservation stations are smaller: the 604 has a much more shallow pipeline, which means that it needs a much smaller instruction window for tracking fewer in-flight instructions in order to achieve the same performance.

The tradeoff for this lack of complexity and lower pipeline depth is a lower clock speed. The six-stage 604 debuted in May of 1995 at 120MHz, while the twelve-stage Pentium Pro debuted later that year (November, 1995) at speeds ranging from 150 to 200MHz. The Pentium Pro had a clear clock speed advantage over the 604, but it wasn't what you might expect given the fact that its pipeline was twice as long.

604 conclusions

With a 32K split L1 cache, the 604 had a much heftier cache than its predecessors, which it needed to help keep its longer pipeline fed. The larger cache, higher dispatch and issue rate, wider execution core, and deeper pipeline made for a solid RISC performer that was easily able to keep pace with its x86 competitors.

Still, the PPro was no slouch, and its performance was scaling well with process shrinks and die improvements. Apple needed more power from AIM to keep pace, and more power is what they got with a minor core revision that came to be called the 604e.

The PowerPC 604e

The 604e built on gains made by the 604 with a few core changes that included a doubling of the L1 cache sizes (32K instruction/32K data) and the addition of a new functional unit: the condition register unit (CRU).

The previous 600-series processors had moved the responsibility for handling condition register logical operations back and forth among various units (the integer unit in the 601, the system unit in the 603/603e, and the branch unit in the 604); with the 604e, these operations finally get a unit of their own. The 604e sports a functional block in its execution core that's dedicated to handling condition register logical operations, which means that these not-uncommon operations don't tie up execution units (like the integer unit or the branch unit) that have other, more serious work to do.

The 604e's branch unit, now that it was free from having to handle CR logical operations, got a few expanded capabilities that I won't detail here. The 604e's caches, in addition to being enlarged, also got additional copy-back buffers and some other enhancements.

The 604e was ultimately able to scale up to 350MHz once it moved from a 0.35 to a 0.25 micron process.

Conclusion to Part I

The 600-series saw the PPC line go from the new kid on the block to a mature RISC alternative that brought Apple's PowerMac workstation to the forefront of personal computing performance. While the initial 601 had a few teething problems, the line was in great shape after the 603e and 604e made it to market. The 603e was a great mobile chip that worked well in Apple's laptops, and even though it had a more limited instruction dispatch/retire bandwidth and a smaller cache than the 601, it still managed to beat its predecessor because of its more efficient use of transistors.

The 604 doubled the 603's instruction dispatch and retire bandwidth, and it sported a wider execution core and a larger instruction window that enabled its execution core to grind through more instructions per clock. Furthermore, its pipeline was deepened in order to increase the number of in-flight instructions and to allow for better clock speed scaling. The end result was that the 604 was a strong enough desktop chip to keep the PowerMac comfortably in the performance game.

It's important to remember, though, that the 600-series reigned at a time when transistor budgets were still relatively small by today's standards, so the PowerPC architecture's RISC nature gave it a definite cost, performance, and power consumption edge over the x86 competition. This is not to say that the 600-series was always in the performance lead; it wasn't. The performance crown changed hands a number of times during this period.

My point is that a RISC ISA was a strong mark in the platform's favor. If you read my history of the Pentium, though, then you know where this story is headed. As Moore's Curves drove transistor counts and MHz numbers ever higher, the relative cost of legacy x86 support began to go down and the PowerPC ISA's RISC advantage started to wane.

In conclusion, the 600-series was a success, and things were looking good for Apple as IBM's next major desktop processor, the 750 (a.k.a. the "G3"), began to make its way toward the market. The next article will cover the G3 and G4, and will take us into that infamous period when the G4's clock speed stagnated for long enough to drive even the most diehard Mac users into despair. In this respect, it's sort of The Empire Strikes Back of what will probably be a trilogy of PPC articles, so stay tuned.

Bibliography and Suggested Reading

Author's Note: I'd like to thank Eric "iPalindrome" Bangeman and the rest of the Macintoshian Achaia folks who pitched in with suggestions, information, and feedback on the draft of this article.

Revision History

Date         Version   Changes
8/04/2004    1.0       Release
