N-Body: Multithreaded SSE

The multithreaded variant of SSE N-body is complete, and I’ve had the opportunity gather some timing information.

Three variants of N-body were timed: single-threaded SSE, multithreaded SSE, and the shared memory (fastest so far) GPU formulation.

Three platforms were tested, two of them on Amazon EC2:

  • cg1.4xlarge (2xXeon 5570 “Nehalem” with 2xTesla M2050 GPUs)
  • cc2.8xlarge (2xXeon 2670 “Sandy Bridge”)
  • GeForce GTX 680.

If I thought operating system mattered, I wouldn’t put these timings next to one another – the EC2 instances are running Amazon Linux but my GeForce GTX 680 is running Windows 7. With that caveat, the timings for N=4096, N=8192, N=16384 are as follows:

N cg1 (SSE) cc2 (SSE) cg1 (MT) cc2 (MT) cg1 (GPU) GK104
4096 36.24 40.31 4.78 2.79 1.35 0.66
8192 144.50 161.32 20.13 9.45 3.83 1.60
16384 576.75 649.83 72.78 39.46 12.66 6.05

Times are in milliseconds. The only difference between the SSE and MT timings is that the MT timing are done with multiple threads.

The first thing that struck me about the timing is that on a single-threaded SSE workload, cc2.8xlarge is slower than cg1.4xlarge! This result is surprising since Intel generally takes performance compatibility very seriously – they have a business incentive to deliver higher performance on old workloads, since it removes the need for customers to update their software to benefit from newer hardware – but can be explained if Intel delivered the same single-threaded performance clock-for-clock: the clock rate of the CPUs in cc2.8xlarge is 10% slower.

So the bulk of the benefit of cc2.8xlarge over cg1.4xlarge (or its GPU-less equivalent, cc1.4xlarge) is from an increased core count*. And on that front, Sandy Bridge delivers: scaling is excellent in the multithreaded case, with speedups of 14.4-17x on cc2.8xlarge versus 7.2-7.9x on cg1.4xlarge. We could double the core count again and probably still see excellent scaling.

No surprise that Intel expects us to port to AVX to get the full benefit of Sandy Bridge. But even if that delivers the expected doubling of performance, the multithreaded version would only be within 3.26x of the 16K bodies case. NVIDIA has their own doubling of performance in the offing, in the form of GK110.

And, I haven’t pushed the multi-GPU version yet. That will scale nicely in the number of GPUs.

* The number of threads selected is based on the number of cores available – in Linux, it is implemented in chThread.h as follows:

inline unsigned int
processorCount()
{
    return sysconf( _SC_NPROCESSORS_ONLN );
}

The values are 16 and 8 for cc2.8xlarge and cg1.4xlarge, respectively.

Leave a Reply