This next graph, showing the difference between 64-bit and 32-bit performance, may be more surprising to people:
The gap is even larger for single precision. There are a couple of reasons for the 64-bit performance lead. The first won't go away: in 64-bit mode, you have access to double the number of integer and floating point registers. I would guess that somewhere between one eighth and one quarter of the gap is due to this. The rest of the gap exists because I have not applied all the optimization tricks to the 32-bit code that I applied to the 64-bit code. Since I am perpetually short of time, I simply didn't do much special tuning for 32-bit mode. In fact, the 32-bit code is using a kernel I wrote for the Pentium 4.
I'm pretty happy with this decision, except for the usual gall: Windows. Under Windows right now, all people have is 32-bit mode, so they get only the crappy performance. Still, Windows (and Windows/cygnus) should eventually get to 64-bit mode, and then you'll see the better curves again.