32 Bits Is Not Enough
For two very interesting years of my life, I worked for a small company in Silicon Valley called Pacific Data Images. The history of PDI goes back some twenty-odd years and includes some of the earliest commercially produced computer-generated imagery, but their most recent claim to fame is the production of the movie Shrek. When I joined the company in early 1999, they were just beginning the transition from SGI-manufactured, MIPS-based machines to commodity x86-based PCs running Linux as the machines of choice for the cluster colloquially known as their "render farm".
The render farm, or just the Farm, was simply the collection of machines contained in a locked, well air-conditioned room on the first floor of the PDI building. During the course of Shrek's production, the Farm grew from a few hundred machines to over a thousand. The newest computers were dual-CPU Pentium IIIs with a couple of gigs of RAM each. The motivation for moving away from SGI machines to Linux PCs was clear: you could get equivalent CPU power for just a fraction of the price. However, these nice Linux boxes were no better than their Irix-based equivalents in one respect: 32 bits of addressing was all you could get.
To understand why 32 bits just isn't enough in all circumstances, you've got to understand a little about how your operating system and your applications use memory. While I'm explaining this, remember that our goal is to produce the complex imagery you see in Shrek. First of all, you've got a lot of data. If you find yourself watching the movie, take a look at the trees. Notice the grass. The directors didn't really want you to focus your attention on such things the first time you saw the movie, but try to get an estimate of the polygon count for these things.
You're probably thinking that those leaves aren't modelled by hand. You'd be right; they're generated procedurally, so they're not stored explicitly on disk, but when you want to generate an image, they still make an impact. Remember, even if we stream the polygons through memory without explicitly storing all the vertices at once, we've still got to anti-alias these things -- which means that we can't simply store just one RGB triple for each pixel -- and remember that a film image is about two thousand pixels wide. You're starting to get some idea. Remember also that we're probably loading lots of very high-resolution texture maps for every object in the shot, and the characters alone are several million polygons.
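To put some rough numbers on it (these are my own illustrative guesses, not PDI's actual figures), consider what just the sample buffer costs once you're anti-aliasing. Assume a film frame around 2048x1024, sixteen samples per pixel, and four float channels per sample:

    #include <stdio.h>

    int main(void)
    {
        /* Illustrative numbers only: a film-resolution frame with
           16x supersampling and four 32-bit float channels (RGBA). */
        const long width             = 2048;
        const long height            = 1024;
        const long samples_per_pixel = 16;
        const long bytes_per_sample  = 4 * sizeof(float); /* RGBA floats */

        long total = width * height * samples_per_pixel * bytes_per_sample;
        printf("Sample buffer alone: %ld MB\n", total / (1024 * 1024));
        return 0;
    }

That works out to 512 MB for the sample buffer alone, before a single texture map or vertex has been loaded.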
What does this all add up to? A lot of RAM. We're talking, "We're not kidding, Mr. Gates, we're having a hard time fittin' everything in two gigs, let alone your laughable 640k" kind of memory usage.
At this point the astute reader will note that I said the machines only had two gigs of RAM, and a 32-bit architecture should give you four gigs of address space. Figuring that your performance would only be decent if your working set is less than twice as big as the amount of physical RAM on hand, there shouldn't be any problems, right? Right?
When Linux actually runs a collection of tasks, it gives each process its own four-gig address space. So far, so good. The problem is that Linux splits that address space into a myriad of regions. First, the kernel reserves a portion of the address space for its own use, leaving the rest for the process. Depending on how it was configured, the 2.2 Linux kernel claimed the upper one or two gigs of address space for itself, leaving the rest to user space. That can be half the address space right there! Furthermore, the lowest pages of the application's space are reserved as unmappable, because you want your NULL pointer dereferences to produce exceptions. The executable image is loaded a bit above the reserved NULL space at the bottom of memory. Above that, dynamically linked shared libraries are mapped. Above that, we've got the sbrk() memory heap. Next is the region for mmap()ed pages of memory. Finally, at the top of user memory space, just below the kernel's portion of the address space, we've got the main stack of the process.
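If you want to see this layout on your own machine, here's a little C sketch; the exact addresses will vary with your kernel and C library, but on a typical 32-bit Linux box they should climb from the executable near the bottom to the stack up top:

    #include <stdio.h>
    #include <stdlib.h>

    int global_data;    /* lands in the executable's data segment */

    int main(void)
    {
        int on_the_stack;                       /* lives on the main stack */
        void *small = malloc(32);               /* usually from the sbrk() heap */
        void *big   = malloc(8 * 1024 * 1024);  /* usually mapped via mmap() */

        printf("code  : %p\n", (void *)main);   /* the cast is a common POSIX-ism */
        printf("data  : %p\n", (void *)&global_data);
        printf("heap  : %p\n", small);
        printf("mmap  : %p\n", big);
        printf("stack : %p\n", (void *)&on_the_stack);

        free(small);
        free(big);
        return 0;
    }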
With all that sectioning off of memory, you'd be lucky to allocate one gig of memory for your process to use. You can tune things somewhat, but the problem of limited memory allocation is exacerbated by the two strategies used by the default implementation of malloc() in Linux's standard C library. Small memory allocations use the sbrk() heap, which slowly grows just above the executable's image in memory. Large memory allocations call mmap() and are mapped in higher up in the address space. This dichotomy makes sense: allocating small amounts of memory with mmap() would be very wasteful, and allocating large regions with sbrk() would prevent the process from giving memory back to the operating system after it finishes using it. The downside is that you've got two memory regions, each with a relatively small maximum size, and attempting to increase the size of one decreases the size of the other. If you are running a variety of programs, some of which allocate thousands and thousands of small chunks of memory, and others which allocate a few very large chunks, then you're in trouble, because you can't trade off one kind of memory for another.
What's the effect of all this? Well, your application runs out of address space, malloc() starts returning NULL pointers, and you probably experience a crash soon thereafter. There were certainly more than a few frames of Shrek which crashed because they ran out of address space. Note how this is different from running out of memory.
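You can watch the address space run dry with a throwaway probe like this one (the 64 MB chunk size is arbitrary, and the leak is deliberate). Because we never touch the pages we allocate, the kernel never has to back them with physical RAM, yet malloc() still gives up once the regions described above are full:

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        const size_t chunk = 64 * 1024 * 1024;  /* 64 MB per request */
        unsigned long total_mb = 0;

        /* Deliberately leak every chunk; we only want to see
           where allocation fails, not use the memory. */
        while (malloc(chunk) != NULL)
            total_mb += 64;

        printf("malloc() gave up after %lu MB of address space\n", total_mb);
        return 0;
    }

On a 32-bit box this tops out somewhere between one and three gigs, depending on your kernel split and how the regions are laid out, no matter how much RAM and swap you've got.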
Maybe you're able to discount all this as a problem only experienced in certain high-end niches, but I think that applications and the data they work on are getting bigger all across the industry, and it won't be long before this problem starts to show up when you are running Photoshop at home, editing video of the family trip to the Grand Tetons, or doing page layout for that indie punk rock magazine you publish. You will see the problem, and you'll be pissed off because your applications are crashing just when you added a bit more data. You'll start to want more address space. You'll start to want a 64-bit architecture.
A Brief History of x86
In the beginning, there was the 8088. Intel said, "Let there be a one-chip CPU." And there was a one-chip CPU, and it was good.
Although I'm too young to remember it, I'm sure the 8088 was a nice CPU when it was introduced. You've got eight semi-general-purpose 16-bit registers: AX, BX, CX, DX, SI, DI, BP and SP. I call them "semi-general-purpose" because some 8088 instructions only operate on particular registers, so BX is more useful than the other data registers for addressing memory, CX is more useful as a loop counter, and SP is always the stack pointer, but mostly you can use those registers as you please.
As far as address space goes, you'd expect the 8088, with 16-bit registers, to only be able to address 64k of memory, but Intel used something slightly clever and slightly annoying: segmented memory addressing. This meant that addresses for the 8088 consisted of two numbers, a segment and an offset, and looked something like 0080:0CD1. Internally, the 8088 would multiply the first number by 16 and then add the second number to find the actual desired memory location. Instead of 64k of address space, you've got about one megabyte of addressable space. You've got to use some of the address space for the ROM BIOS and video memory and other hardware needs. If you started the hardware region of memory at, say, A000:0000 and left everything below that as usable by the operating system and user applications, you'd have 640k of RAM available. And that's exactly what happened.
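The translation is simple enough to write down in a couple of lines of C. Note the annoying part: many different segment:offset pairs name the same physical location:

    #include <stdint.h>
    #include <stdio.h>

    /* Real-mode address translation: linear = segment * 16 + offset. */
    uint32_t real_mode_linear(uint16_t segment, uint16_t offset)
    {
        return ((uint32_t)segment << 4) + offset;
    }

    int main(void)
    {
        /* The example address from the text: */
        printf("0080:0CD1 -> %05X\n",
               (unsigned)real_mode_linear(0x0080, 0x0CD1)); /* 014D1 */

        /* Two different pairs, one physical address: */
        printf("0000:0010 -> %05X\n",
               (unsigned)real_mode_linear(0x0000, 0x0010)); /* 00010 */
        printf("0001:0000 -> %05X\n",
               (unsigned)real_mode_linear(0x0001, 0x0000)); /* 00010 */
        return 0;
    }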
The 8088 was fine for operating systems like DOS, but to run something like a real Unix operating system, a few CPU features were missing. The most important of these was memory protection. Intel's answer to this problem? The 286. Memory protection allows the CPU to expose certain areas of memory to some programs while protecting other areas of memory at the same time. When the 286 ran in protected mode, instead of using a segment and offset, memory addressing would use a selector and an offset. The selector was a 16-bit number stored in the segment register, but to get the base address, the CPU no longer multiplied this number by 16, but instead used this number as an index into a table of memory protection entries maintained by the operating system. Furthermore, the CPU could set up permissions on the memory protection entries so that a process could access some entries, but not others, based on its permission level.
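Here's a toy C model of that lookup; the real 286 descriptor format packs a 24-bit base, a segment limit and access bits into each table entry, so treat the names and structure below as hypothetical, a sketch of the mechanism rather than the hardware layout:

    #include <stdint.h>

    struct descriptor {
        uint32_t base;    /* where the segment starts in physical memory */
    };

    static struct descriptor descriptor_table[8192];  /* maintained by the OS */

    uint32_t protected_mode_address(uint16_t selector, uint16_t offset)
    {
        /* The low three bits of a selector hold privilege and
           table-choice flags; the rest indexes the table. */
        const struct descriptor *entry = &descriptor_table[selector >> 3];

        /* Real hardware would fault here if the offset exceeded the
           segment limit or the process lacked permission. */
        return entry->base + offset;
    }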
There are a couple of interesting implications of memory protection. First, notice that the base address stored in the memory protection table doesn't necessarily have to be the same size as the registers of the CPU, so a CPU could theoretically use more RAM than address space by divvying up sections of RAM between processes through clever use of the memory protection table. Second, we are actually providing support for security at the CPU level. Without memory protection, malicious programs would be able to subvert Unix-style file system permissions by simply patching the OS kernel in memory to skip file permission checks. Memory protection prevents such attacks from succeeding.
Memory protection was nice, but the 286 still had the problem that it used 16-bit registers for offsets, meaning that you could only access 64k at a time without changing your memory selectors for every pointer you followed. Intel fixed this with the 386. The 386 widened all the existing 16-bit registers to 32 bits. Our old friend AX? Now we call him EAX if we want all 32 bits and AX if we want only the lower 16. Similarly, we've got EBX, ECX, EDX, ESI, EDI, EBP and ESP. Wow, suddenly we've got four gigs of address space instead of 64k. Neat, isn't it? Of course, we've still got memory protection too.
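You can see the relationship from C with a union; on a little-endian x86, AX is literally the low half of EAX:

    #include <stdint.h>
    #include <stdio.h>

    union reg {
        uint32_t eax;
        uint16_t ax;    /* overlays the low 16 bits (little-endian assumption) */
    };

    int main(void)
    {
        union reg r;
        r.eax = 0xDEADBEEF;
        printf("EAX = %08X, AX = %04X\n", (unsigned)r.eax, (unsigned)r.ax);
        /* prints: EAX = DEADBEEF, AX = BEEF */
        return 0;
    }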
Even though your PC from Dell boots up in the same 16-bit real mode used by the 8088, nearly any operating system you are likely to be running on your Pentium-family CPU is using the 32-bit protected mode introduced with the 386. Linux uses it. FreeBSD uses it. Windows 2000 uses it. Windows XP uses it. BeOS uses it. AtheOS uses it. The fact is, as far as most assembly programmers are concerned, the Pentium III isn't anything other than a 386 tricked out for maximum speed. Intel has made a few additions with MMX and SSE, but the core instruction set has remained essentially the same.
And Now for Something Completely Different
Perhaps you've heard about Intel's Itanium CPU. The Itanium is a joint project between Intel and HP, and their answer for the 64-bit era. The instruction set it uses is called IA-64, but it really doesn't have much in common with the IA-16 instruction set of the 8088 or the IA-32 instruction set of the 386. In fact, it's quite a significant departure from what has come before.
First of all, instead of the familiar eight registers of the x86 architecture, we've now got 128 general-purpose registers. The first 32 of these registers are global registers, accessible in the same way throughout the lifetime of a process. The remaining 96 registers are managed through the ALLOC instruction, which allows a function or method to reserve a particular number of these registers for its use, shifting an internal index so that the contents of some of the calling function's registers are hidden from view while the ones which remain visible are used to pass function arguments.
If that wasn't a big enough change, the Itanium is also a VLIW (Very Long Instruction Word) CPU. This means that CPU instructions aren't the same atomic units they once were. Now CPU instructions come in bundles of three. Each instruction bundle also happens to be exactly 128 bits long, whereas old x86 instructions were of varying length and aligned willy-nilly throughout memory. Furthermore, instructions are organized into groups which can potentially be executed in parallel. Need to add some numbers and multiply some others? No problem; with IA-64 you can tell the CPU it can do both at once, without waiting for one or the other to complete. To further complicate things, a group boundary can actually fall in the middle of an instruction bundle!
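The bundle layout itself is straightforward: a 5-bit template field telling the CPU which execution units the three slots are destined for and where the group stops fall, then three 41-bit instruction slots (5 + 3 x 41 = 128). Here's a C sketch of pulling a bundle apart; the field positions follow the published IA-64 format, but double-check my bit arithmetic against Intel's manuals before relying on it:

    #include <stdint.h>

    struct bundle {
        uint64_t lo, hi;    /* the raw 128 bits, low half first */
    };

    /* Template: bits 0-4. */
    uint8_t template_of(const struct bundle *b)
    {
        return (uint8_t)(b->lo & 0x1f);
    }

    /* Slot 0: bits 5-45. */
    uint64_t slot0(const struct bundle *b)
    {
        return (b->lo >> 5) & ((1ULL << 41) - 1);
    }

    /* Slot 1: bits 46-86, straddling the two 64-bit halves. */
    uint64_t slot1(const struct bundle *b)
    {
        return ((b->lo >> 46) | (b->hi << 18)) & ((1ULL << 41) - 1);
    }

    /* Slot 2: bits 87-127. */
    uint64_t slot2(const struct bundle *b)
    {
        return (b->hi >> 23) & ((1ULL << 41) - 1);
    }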
As you can see, code for Itanium CPUs is at least as different from plain old x86 code as code for, say, PowerPC CPUs would be. This has both advantages and disadvantages. On the upside, CPUs don't have to spend a lot of time decoding the archaic and baroque x86 instruction format anymore. On the downside, all the work spent on getting gcc to generate good x86 code will be useless for Itanium CPUs. Kaffe and Mono's x86 JITs? They will have to be rewritten. Optimized graphics libraries and drivers? They will need to be reoptimized for Itanium's architecture. VMware's instruction-stream analyzer? It'll need a complete rewrite.
Now, the Itanium does have an x86 backwards-compatibility mode on the CPU, but that really makes the Itanium like two CPUs in one. You flip a few bits, and suddenly you are using completely different circuitry to decode and execute your instructions. Running x86 applications on an Itanium is analogous to running Windows applications under Linux with WINE -- maybe it works, but only because you've recreated one complete environment inside another.
Back to the Past
Obviously there's a lot of work involved in transitioning from the world of x86 to the Itanium world. Maybe there's another way? That's exactly what AMD is hoping. With their Hammer CPU architecture, AMD is introducing the x86-64 instruction set.
Remember, where IA-64 is completely different from IA-32, x86-64 is more of the same. We've got the same old eight registers, now stretched to 64 bits, and AMD has chosen to add eight more, for a total of sixteen. The instruction opcodes are the same familiar ones from the 8088. There aren't any bundles or groups here. Assembly programmers won't need to learn many new techniques to write streamlined code for x86-64, because all their old knowledge will be immediately applicable.
The clear advantage of x86-64 is that developers can get applications working on it much more quickly than with IA-64. No need to write a new back-end for your compiler; that old one will do, with a few 64-bit modifications. That graphics library you've got in the corner which emits x86 code based on the current state? A few modifications to it, and you'll be fine. Oh, you wanted streamlined performance and explicit parallelism in your instruction stream? Sorry, x86-64 is the same thing we've been dealing with for the past twenty-five years, warts and all.
And the winner is...
It's too early to tell who will win the battle of the 64-bit CPUs, but it is clear that the conflict will explode in the next few years. AMD has pragmatism on their side, but Intel has the theoretical performance advantage and the 500-pound gorilla advantage. If the situation were to be reversed, with Intel going the route of backwards compatibility and AMD coming up with the new-fangled instruction set, my money would be on Intel for certain, but as it is, it's very tough for me to make a call.
Will the performance advantage of VLIW make good, or will AMD be crowned the new CPU king?
Only time will tell.