N-Body: FP Atomics v. Recomputation

CUDA developers who are recent refugees from the land of CPU programming often have to learn the hard way that sensible CPU optimizations don’t always work well on GPUs. Lookup tables, for example, are to be avoided. (Okay, not the best example since it is also true on modern CPUs.) Here’s a better one: registers are so precious that it’s often better to recompute results than to store them. GPUs have brought brute force back into style!

But even experienced GPU coders need the occasional reminder that their sheer computational power often can overwhelm the benefits of small-scale optimizations. This week, as I continued developing the N-body chapter, I wanted to try out the idea of taking advantage of the fact that gravitational forces are reciprocal in character: for any two bodies i and jfji = − fij. By applying this optimization, only half as many of the expensive inner-loop body-body computations (estimated at 20 FLOPS) need to be performed.

The original N-body paper dismisses the idea in a footnote: “…this optimization has an adverse effect on parallel evaluation strategies (especially with small N), so it is not employed in our implementation.”

At first blush, subtracting the 3D force from the correct force vector (3 FLOPS) seems a lot cheaper than the computation needed for a body-body interaction: taking a 3D vector difference, computing a dot product and reciprocal square root, and doing a number of multiply-adds. But there is a problem: another thread is concurrently computing the acceleration sum from which the body-body gravitational force values must be subtracted. Some form of synchronization is needed.

To address that problem, the Kepler architecture gives us a tool that wasn’t available when the original N-body paper was written: fast atomics. (Fermi-class hardware added the ability to do atomic floating point add, but it was prohibitively slow until Kepler.) Multiple threads can using atomic add on the same memory locations and the hardware will enforce mutual exclusion to prevent race conditions from corrupting the result. Using atomic add, a few small modifications enable the inner loop to perform half as many body-body interactions and add the negative force to the correct sum:

 <        for ( int j = 0; j < N; j++ ) {
 >        for ( int j = 0; j < i; j++ ) {
             float4 body = ((float4 *) posMass)[j];
 
             T ax, ay, az;
             bodyBodyInteraction( ax, ay, az, myX, myY, myZ, body.x, body.y, body.z, body.w, softeningSquared);
 
             acc[0] += ax; acc[1] += ay; acc[2] += az;
 
 >            float *f = &force[3*j+0];
 >            atomicAdd( f+0, -ax );
 >            atomicAdd( f+1, -ay );
 >            atomicAdd( f+2, -az );
 
         }

By the way, the address arithmetic is written that way because the compiler generates much better code than if you write the equivalent (and more readable):

             atomicAdd( &force[3*j+0], -ax );
             atomicAdd( &force[3*j+1], -ay );
             atomicAdd( &force[3*j+2], -az );

A subtle benefit of this approach is that the atomics hardware complements the floating point hardware in the Streaming Multiprocessors. Unfortunately though, I can report that the atomics-based implementation isn’t any faster on GK104: to process 8192 bodies (about 67M interactions), the three implementations (atomics, naive, shared) perform as follows:

Atomics: 90 ms
Naive: 62 ms
Shared: 55 ms

This negative result can be explained if you cast it in the light of GPUs’ predilection for brute force: given a choice between 3 FLOPS and 3 atomic adds to disparate memory locations, or 20 FLOPS done on uniformly-referenced, read-only memory, the brute force 20-FLOP solution still wins.

Lesson learned.

It is possible that the global memory references by this kernel are not ideal (especially if it’s camping on a subset of the hardware that implements atomics), and it might be faster to conserve external bandwidth, perhaps taking inspiration from a tiled matrix transpose implementation – but the optimized implementation would have to outperform a shared memory implementation that’s almost twice as fast.

As with all CUDA handbook source code, this kernel is open source and available on github.

Leave a Reply