Multithreaded SSE Timings

The N-body code has been posted to github; it has not yet been ported to Linux, but that will not be difficult to do. It does not use graphics, it is just a Win32 console application that responds to keyboard input (ironically, other than porting the threading code, the keyboard support is the hard part about the Linux port of this app).

So far there are 8 formulations:

1) CPU_AOS (array-of-structures implementation – the gold standard to which we compare other implementations),
2) CPU_SOA (structure-of-arrays implementation) – a stepping stone to the SSE implementation (and surprisingly faster than the AOS version),
3) CPU_SSE (SSE implementation) – I blogged about this a few weeks ago,
4) CPU_SSE_threaded( multithreaded SSE implementation) – multithreaded SSE implementation that uses N threads on an N-core processor,
5) GPU_AOS – a straight port of the CPU_AOS code,
6) GPU_Shared –  GPU_AOS, optimized to use shared memory per the original CUDA N-body paper,
7) GPU_Atomic – a Kepler-specific version that uses atomics to do half as many body-body interaction computations, (but isn’t faster – I plan to blog about that later),
8) GPU_Shuffle – a Kepler-specific version that uses the new warp shuffle instruction instead of shared memory.

Here are some initial timing results for 16K bodies:

CPU_AOS – 4494 ms (60.0 Minteractions/s)
CPU_SOA – 3210 ms (83.6 Minteractions/s)
CPU_SSE – 612.6 ms (438 Minteractions/s)
CPU_SSE_threaded – 192 ms (1400 Minteractions/s)
GPU_AOS – 62 ms (4300 Minteractions/s)
GPU_Atomic – 93 ms (2886 Minteractions/s)
GPU_Shared – 55 ms (4881 Minteractions/s)
GPU_Shuffle – 58.7 ms (4573 Minteractions/s)

The fastest GPU performance is about 3.5 faster than the multithreaded SSE implementation. Another way to look at it: the SSE implementation will run on any CPU built since 2001 or so, and it’s only a factor of 3.5 slower. Then again, the code is super-gnarly compared to the CUDA code, especially the threading code. (Don’t take my word for it – take a look for yourself.) So here again, CUDA “wins” as much for programmability as for the performance win.

The interesting thing is that this data is gathered on a rather old CPU (a 2.8 GHz Intel i7 “Nehalem”), and the GPU is a relatively recent GK104. Once I get the code running on a contemporary CPU (say, a Sandy Bridge) I’ll report timings on that platform.

I did take a look at an AVX port, but that seemed even more off-topic than the SSE port, which was a stretch. That project will have to wait until the book’s done.

Next up: a Linux port and multi-GPU support.

Leave a Reply