Pentium IV vs Athlon. Linrad CPU usage.
(Jan 5 2003)

Intel vs AMD

Two similar computers were tested by running the different FFT implementations of Linrad at different sizes. The Pentium IV is the faster computer, but the difference depends strongly on the FFT implementation. The Pentium IV computer was equipped as follows:
Motherboard: ASUS P4S533 SIS645, Socket 478, P4, DDR, ATX.
Processor: Intel P4 1.8GHz 400MHz 512kB Northwood socket 478.
Memory: 1024 MB DIMM PC2700 333 MHz.

The Athlon computer was equipped as follows:
Motherboard: Soltek SL75-DRV5, chipset VIA KT-266.
Processor: AMD Athlon XP2000+ 1667 MHz.
Memory: 256 MB DDR ram.

The computers were tested while processing two channels of 96 kHz bandwidth.

The numbers given in the tables below are the CPU load in percent, as reported by the speed test routines that follow parameter selection screens 1 and 2. The Ratio column is the Athlon load divided by the Pentium IV load.

Small FFT with floating point arithmetic

By selecting a bandwidth of 100 Hz for the first FFT with a sine to the power 3 window, the fft1 size is set to 4096. The total memory needed to hold the transforms is then 65536 bytes, so everything should fit within the cache on both computers.
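The 65536-byte figure is consistent with 4096 points, two channels, and one complex value made of two 4-byte floats per point. The sketch below reproduces that arithmetic. The data layout is an assumption inferred from the numbers in the text, not taken from the Linrad source, and the function name is hypothetical.

```python
def fft_buffer_bytes(fft_size, channels=2, bytes_per_component=4):
    """Bytes needed to hold one complex transform for every channel.

    Assumed layout: each point is a complex number (real + imaginary),
    each component is a 4-byte float, and there are two channels.
    """
    components_per_point = 2  # real and imaginary parts
    return fft_size * channels * components_per_point * bytes_per_component


print(fft_buffer_bytes(4096))  # 65536
```

Under the same assumptions, an fft1 size of 16384 gives 262144 bytes.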

The CPU load is as follows:


FFT version               Pentium IV     Athlon        Ratio
0 Radix 2 DIF C             7.32          8.67         1.18
1 Radix 2 DIF asm           7.21          8.53         1.18
2 Twin radix 2 DIF asm      6.49          8.62         1.33
3 Radix 4 DIT C             7.03          9.26         1.32
4 Twin radix 4 DIT C        6.41          8.29         1.29
5 Twin radix 4 DIT SIMD     4.35          7.28         1.67
The SIMD (Single Instruction Multiple Data) instructions make a significant improvement on the Pentium IV but much less so on the Athlon computer.
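The Ratio column matches the Athlon load divided by the Pentium IV load, rounded to two decimals. A minimal sketch that recomputes it from the table values follows; the names `loads` and `load_ratio` are illustrative only.

```python
# CPU load in percent, copied from the table above: (Pentium IV, Athlon)
loads = {
    "0 Radix 2 DIF C": (7.32, 8.67),
    "1 Radix 2 DIF asm": (7.21, 8.53),
    "2 Twin radix 2 DIF asm": (6.49, 8.62),
    "3 Radix 4 DIT C": (7.03, 9.26),
    "4 Twin radix 4 DIT C": (6.41, 8.29),
    "5 Twin radix 4 DIT SIMD": (4.35, 7.28),
}


def load_ratio(pentium_load, athlon_load):
    """Athlon load relative to Pentium IV load, rounded to two decimals."""
    return round(athlon_load / pentium_load, 2)


for name, (pentium_load, athlon_load) in loads.items():
    print(f"{name:26s}{load_ratio(pentium_load, athlon_load):.2f}")
```

A ratio above 1 means the Athlon needs more CPU time than the Pentium IV for the same work.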

Medium size FFT with floating point arithmetic

By selecting a bandwidth of 25 Hz for the first FFT with a sine squared window, the fft1 size is set to 16384. The total memory needed to hold the transforms is then 262144 bytes. The Pentium IV has a big enough cache, but the Athlon suffers from having only 256 kB of cache. Besides the FFT data, the processor also has to keep sine/cosine tables and a few other things in the cache. It is interesting to note that the twin routines are faster than running the single routines twice, even though the single routines only have to keep 131072 bytes of transform data.
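The cache argument can be checked with simple arithmetic. The sketch below compares the transform data with the cache sizes quoted in this text (512 kB for the Pentium IV, 256 kB for the Athlon), assuming 1 kB means 1024 bytes; the helper name is hypothetical.

```python
KB = 1024  # assumption: cache sizes are quoted in units of 1024 bytes

caches = {"Pentium IV": 512 * KB, "Athlon": 256 * KB}
transform_bytes = 262144  # fft1 size 16384, figure taken from the text


def cache_headroom(cache_bytes, data_bytes):
    """Bytes left in the cache after the transform data is stored."""
    return cache_bytes - data_bytes


for cpu, cache_bytes in caches.items():
    print(cpu, cache_headroom(cache_bytes, transform_bytes))
```

The Athlon comes out with zero headroom, so the sine/cosine tables cannot stay in the cache together with the transform data.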

The CPU load is as follows:


FFT version               Pentium IV     Athlon        Ratio
0 Radix 2 DIF C             12.33         17.44        1.41
1 Radix 2 DIF asm           12.09         17.23        1.43
2 Twin radix 2 DIF asm      11.53         14.78        1.28
3 Radix 4 DIT C              9.95         19.70        1.98
4 Twin radix 4 DIT C         8.43         13.96        1.66
5 Twin radix 4 DIT SIMD      5.70         13.09        2.30
The SIMD (Single Instruction Multiple Data) instructions are efficient on the Pentium IV: they reduce the CPU load of the twin radix 4 decimation in time routine by 32%. On the Athlon the SIMD instructions only give a 6% improvement. For the medium size floating point FFTs the Pentium IV runs significantly faster, by a factor of up to 2.3 with the SIMD routine.
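The 32% and 6% figures follow from comparing rows 4 and 5 of the table, reading "improvement" as the relative reduction in CPU load. That reading is an assumption, and the function name below is illustrative.

```python
def load_reduction_percent(load_before, load_after):
    """Relative reduction in CPU load, in percent."""
    return 100.0 * (load_before - load_after) / load_before


# Twin radix 4 DIT C (row 4) versus twin radix 4 DIT SIMD (row 5)
print(round(load_reduction_percent(8.43, 5.70)))    # Pentium IV: 32
print(round(load_reduction_percent(13.96, 13.09)))  # Athlon: 6
```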

Integer arithmetic using MMX instructions

Version 2 of the second FFT (twin radix 4 DIT) was run at several sizes with sine squared windows on the two computers.

Here are the results:


FFT size      Memory    Pentium IV     Athlon        Ratio
 32768        262144      6.17          7.6          1.23
 65536        524288      7.43          8.6          1.16
262144       2097152      12.7         13.6          1.07
For very large transform sizes the cache size is no longer important, and the Athlon is nearly as fast as the Pentium IV.
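As a rough illustration, the sketch below lists, for each row of the table, whether the transform data is strictly smaller than each cache, together with the Athlon to Pentium IV load ratio. The cache sizes are the ones quoted in this text, 1 kB is assumed to be 1024 bytes, and the names are hypothetical.

```python
KB = 1024  # assumption: cache sizes are quoted in units of 1024 bytes
CACHE_BYTES = {"Pentium IV": 512 * KB, "Athlon": 256 * KB}

# (FFT size, memory in bytes, Pentium IV load %, Athlon load %) from the table
ROWS = [
    (32768, 262144, 6.17, 7.6),
    (65536, 524288, 7.43, 8.6),
    (262144, 2097152, 12.7, 13.6),
]


def leaves_cache_room(data_bytes, cache_bytes):
    """True only if the transform data is strictly smaller than the cache."""
    return data_bytes < cache_bytes


for fft_size, mem, pentium_load, athlon_load in ROWS:
    room = {cpu: leaves_cache_room(mem, size) for cpu, size in CACHE_BYTES.items()}
    print(fft_size, room, round(athlon_load / pentium_load, 2))
```

Only the smallest transform leaves room in the Pentium IV cache, and none leaves room in the Athlon cache. The ratio falls toward 1 as both processors end up working from main memory.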