Take this article with a grain of salt.
Mistakes I've found so far:
- Citing premature P4 vs. Thunderbird benchmarks.
Tom's initial benchmarks showed the Thunderbird beating the pants off the P4. Later benchmarks showed that this was just because nobody has a decent compiler for the P4 yet. Revised benchmarks with optimized code on both fronts put the P4 marginally ahead (though at a much higher cost).
Shipping compilers on time and distributing them far and wide is critically important for a new chip, but the article ignores the issue completely.
- Citing the P3 as being much slower than the Athlon.
Simply not true. Again, this was in part a compiler issue. Code compiled for the P3 *with* SSE enabled vs. code compiled for the Athlon *with* 3DNow! enabled gives you more or less a tie. Both sides had solid chips.
- Claiming that the PIII was a PII with a marketing gimmick added.
Um, no. SSE was what MMX was *supposed* to be - a SIMD extension to the instruction set that was actually *useful*. Tom, Sharky, et al. were getting about a 25% speedup in games vs. the PII, among other things. No other architectural changes? Well, if the new chip _works_, and performs better (due to SSE), where's the problem?
- Sketchy understanding of RISC vs. CISC tradeoffs.
The big advantage to RISC isn't that it's easier to increase the clock speed - it's that it's *MUCH* easier to pipeline the chip and to make it superscalar. This is why virtually all chips - including the x86 chips - have RISC-like cores, regardless of the external instruction set.
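For a rough feel of what pipelining buys you, here's a back-of-the-envelope sketch in Python. It assumes an idealized machine: every instruction goes through the same number of stages, one clock per stage, with no stalls or hazards. The function names and the 5-stage depth are made up for illustration; real chips are messier.

```python
def unpipelined_cycles(n_instructions, n_stages):
    """Each instruction runs start to finish before the next one begins."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages):
    """Ideal pipeline: fill it once, then retire one instruction per clock."""
    if n_instructions == 0:
        return 0
    return n_stages + n_instructions - 1


if __name__ == "__main__":
    n, stages = 100, 5  # assumed values, purely illustrative
    print(unpipelined_cycles(n, stages))  # 500
    print(pipelined_cycles(n, stages))    # 104
```

The simpler and more uniform the instructions, the closer a real design can get to that ideal number, which is the whole point of a RISC-like core.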
- Sketchy understanding of instruction load buffers.
Grouping of instructions in later processors wasn't as important as the author thinks it is. In practice, you fetch many more than just the next two or three instructions. You keep fetching instructions 2 or 3 at a time no matter how many you execute, until you fill up the scheduling window, which is typically 16 or more instructions deep. Instructions are selected from this window and executed in any order (as long as dependencies can be satisfied).
Yes, brain-deadness in the decoders of the later Intel chips makes this a problem if you've been running the chip flat-out for several clocks, but I've seen nothing to indicate that he knows about the scheduling window at all.
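To make the scheduling-window point concrete, here's a toy model in Python. It is a sketch under loud assumptions, not a model of any real chip: every instruction takes one clock, the issue width equals the fetch width, and the function name, parameters, and sample program are all invented for illustration. Instructions are fetched in order, a few per clock, until the window is full; anything in the window whose dependencies are satisfied may issue, regardless of program order.

```python
def simulate(deps, fetch_width=3, window_size=16, issue_width=3):
    """Return the list of instructions issued on each clock.

    deps[i] lists the earlier instructions that instruction i depends on.
    Assumes one-clock latency: a result is usable the clock after issue.
    """
    n = len(deps)
    window = []       # fetched but not yet issued, in program order
    done = set()      # instructions whose results are available
    next_fetch = 0
    schedule = []
    while len(done) < n:
        # Fetch: keep filling the window, a few instructions per clock.
        room = window_size - len(window)
        take = min(fetch_width, room, n - next_fetch)
        window.extend(range(next_fetch, next_fetch + take))
        next_fetch += take
        # Issue: any instruction whose dependencies are done may go.
        ready = [i for i in window if all(d in done for d in deps[i])]
        issued = ready[:issue_width]
        for i in issued:
            window.remove(i)
        done.update(issued)   # visible to dependents next clock
        schedule.append(issued)
    return schedule


# Instructions 0-3 form a dependency chain; 4-11 are independent.
program = [[], [0], [1], [2]] + [[] for _ in range(8)]

if __name__ == "__main__":
    print(len(simulate(program, window_size=16)))  # 5 clocks
    print(len(simulate(program, window_size=3)))   # 6 clocks
```

With the deeper window, the independent instructions slip past the stalled chain, so how the first few instructions happen to be grouped matters much less than the article suggests.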
- Simplistic assumptions re. L1 cache.
The author takes the "larger is always better" tack, but doesn't seem to realize that larger is usually also slower. Choice of cache associativity is also a difficult decision, involving substantial tradeoffs between hit latency and miss probability. This issue is completely ignored in the cache discussion, even though Intel and AMD have made very different decisions regarding it.
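Here's a small Python sketch of the miss-probability half of the associativity tradeoff. It is a toy: the cache geometry (8 lines of 64 bytes), the LRU replacement policy, and the access pattern are all assumptions picked to show conflict misses, and hit latency is not modeled at all. It compares a direct-mapped cache against a 2-way set-associative one of the same total size.

```python
from collections import OrderedDict


def count_misses(trace, n_lines, ways, line_size=64):
    """Count misses for a cache of n_lines lines, `ways`-way associative, LRU."""
    n_sets = n_lines // ways
    sets = [OrderedDict() for _ in range(n_sets)]
    misses = 0
    for addr in trace:
        line = addr // line_size
        s = sets[line % n_sets]
        if line in s:
            s.move_to_end(line)        # mark as most recently used
        else:
            misses += 1
            if len(s) >= ways:
                s.popitem(last=False)  # evict the least recently used line
            s[line] = True
    return misses


# Two addresses exactly one cache-size apart, accessed alternately.
trace = [0, 8 * 64] * 10

if __name__ == "__main__":
    print(count_misses(trace, n_lines=8, ways=1))  # 20: every access misses
    print(count_misses(trace, n_lines=8, ways=2))  # 2: only the first touches miss
```

The direct-mapped cache thrashes because both addresses land in the same slot, while the 2-way cache holds both. The price of the extra way is a slower, more complex lookup on every hit, which this sketch deliberately leaves out.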
Most of the historical timeline in the article, OTOH, looks pretty accurate. There are also a few technical details about the later Intel chips that I hadn't seen before.
The architectural flaws the author notes in the P4 are, by and large, valid; however, he fails to offer conclusive proof that the chip's performance is really that abysmal (instead, he cites known-bad benchmarks).