12-stage fetch/exec pipeline... mmm... For those who don't understand the fuss about the # of pipeline stages, as a general rule:
fewer pipeline stages = higher average instructions per cycle (IPC)
This happens because of branches in the instruction stream. Instruction execution happens near the end of a CPU's pipeline. When the CPU is executing sequential instructions, it can keep one instruction in every stage of the pipeline. Every clock cycle, one or more instructions execute and everything in the pipeline shifts forward one stage.
When a branch sends execution somewhere the CPU wasn't expecting, though, it has to discard everything in the pipeline and start anew. This means the CPU stalls for roughly as many cycles as the pipeline is long - no instructions complete until the pipeline fills up again.
The CPU *tries* to predict ahead of time where a program will go next to prevent these stalls. If you're lucky, the CPU guesses correctly, and a branch DOESN'T stall the CPU. But branch prediction can only do so much. In a normal environment EVERY CPU stalls thousands or even millions of times a second.
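To put rough numbers on that, here's a back-of-the-envelope sketch in Python. It assumes the simple model described above: each mispredicted branch costs one stall cycle per pipeline stage. The function name, the parameters, and the sample rates (20% of instructions are branches, 10% of those are mispredicted) are my own made-up illustration, not measurements from any real chip.

```python
def effective_ipc(pipeline_depth, branch_fraction, mispredict_rate, base_ipc=1.0):
    """Average IPC under a toy model where every mispredicted branch
    stalls the CPU for `pipeline_depth` cycles.

    All parameters are illustrative assumptions, not measured values.
    """
    # Cycles per instruction when nothing ever stalls.
    base_cpi = 1.0 / base_ipc
    # Extra stall cycles per instruction, averaged over the whole stream.
    stall_cpi = branch_fraction * mispredict_rate * pipeline_depth
    return 1.0 / (base_cpi + stall_cpi)


# Same assumed workload, two hypothetical pipeline depths.
for depth in (10, 20):
    print(depth, "stages:", round(effective_ipc(depth, 0.20, 0.10), 3))
```

With those assumed rates, the 20-stage case comes out with a lower average IPC than the 10-stage one, which is all the rule of thumb above is saying. Real CPUs overlap a lot of this work, so treat it as a feel for the trend and nothing more.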
On the other hand, having more stages in the pipeline means that each stage has to do less work, which means you can crank the clock speed higher than you could otherwise.
I believe the Athlon has a 9-stage pipeline vs. 20 stages for the P4. This is one of the main reasons the P4 is a slug - even though its execution engine runs at 2X the clock speed, every mispredicted branch causes a minimum 20-cycle stall, vs. only 9 cycles on the Athlon (and 12 on the Hammer).
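For anyone who wants to play with the tradeoff, here's a rough Python sketch that plugs the stage counts quoted above into a toy model: one instruction per cycle when nothing stalls, plus a full pipeline's worth of stall cycles per mispredicted branch. The branch and mispredict rates are assumptions I picked for illustration, and the helper names are hypothetical.

```python
# Stage counts as quoted in this post (the Athlon figure is from memory).
DEPTHS = {"Athlon": 9, "Hammer": 12, "P4": 20}

BRANCH_FRACTION = 0.20   # assumed: 1 in 5 instructions is a branch
MISPREDICT_RATE = 0.10   # assumed: 1 in 10 branches is mispredicted


def cycles_per_instruction(depth):
    """Toy model: 1 cycle per instruction plus the average mispredict stall."""
    return 1.0 + BRANCH_FRACTION * MISPREDICT_RATE * depth


def breakeven_clock_ratio(deep, shallow):
    """How much faster the deeper pipeline must be clocked just to match
    the shallower one in instructions per second, under this model."""
    return cycles_per_instruction(deep) / cycles_per_instruction(shallow)


for name, depth in DEPTHS.items():
    print(f"{name}: {cycles_per_instruction(depth):.2f} cycles/instruction")

ratio = breakeven_clock_ratio(DEPTHS["P4"], DEPTHS["Athlon"])
print(f"P4 needs {ratio:.2f}x the Athlon's clock to break even")
```

The result swings a lot depending on the rates you assume, so change BRANCH_FRACTION and MISPREDICT_RATE and see how the gap moves. It also ignores everything else that differs between these chips (caches, execution units, decoders), so it only isolates the pipeline-depth effect.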
So it looks like the Hammer will almost certainly be about the same speed as the Athlon - probably a tad slower at integer ops, and a tad faster at FP. It will definitely be a hell of a lot faster than the P4, clock for clock.