All posts by admin

N-Body: CUDA Versus Old-School SSE

October 12, 2012Uncategorizedadmin

This post will discuss SIMD instruction sets, peak floating point performance for CPUs, programming models, and ease-of-use, as applied to the flagship N-body application for CUDA.

My involvement in SIMD instruction sets dates back to the mid-1990s, when x86 vendors were adding MMX (all vendors), SSE (Intel) and 3DNow! (AMD, Cyrix et al.) to the x86 architecture. These SIMD instruction sets reflect a trend that had been developing in CPUs for some time: that because most of the die area in a CPU is dedicated to cache, building more-capable execution units incurs design and validation costs, but does not greatly increase manufacturing costs.

I was the development lead for Direct3D 6.0, and one of our goals was not only to optimize the geometry pipeline, but to incorporate SSE and 3DNow! optimizations from Intel and AMD, respectively. By closely collaborating with the CPU vendors, we gave developers a strong incentive to use Direct3D’s built-in geometry pipeline, and the CPU vendors got an easy way to deliver the benefits of their new instructions to end users.

SSE (Streaming SIMD Extensions) and 3DNow! are SIMD (single instruction, multiple data) instruction sets that can perform multiple floating point operations in a single instruction. SSE added a new set of 128-bit registers, plus instructions that could perform 4×32-bit floating point operations per instruction. (I am not going to further describe 3DNow! because all x86 vendors implemented SSE many years ago.)

Programming languages do not contain native language support for this type of packed data, so when they designed SSE, the x86 vendors were relying on compiler developers to build vectorizing compilers that could automatically identify parallelism opportunities in scalar code and compile it into high-quality, SIMD-optimized code. As a stopgap, they worked with compiler vendors to meet developers halfway and identify special data types and “intrinsics” (special functions that correspond to machine instructions) to operate on them. Developers using intrinsics must be intimately familiar with the instruction set, but the tasks of register allocation or instruction scheduling are offloaded onto the compiler. I don’t think intrinsics were ever intended to be anything other than a stopgap, but unfortunately, almost 15 years later, it looks like they are still the easiest way to write high-performance floating point code.

When I decided to focus on gravitational simulation for my N-body chapter, I also decided to see how fast a modern CPU can perform this computation. Many CUDA-versus-CPU performance comparisons are heavily biased toward CUDA, by dint of the CPU code being poorly optimized.

NVIDIA’s CPU implementation of N-body, used as a reference for correctness, is no exception. Unless a smart vectorizing compiler can translate the C code, the CPU reference implementation makes no use of the SSE or AVX instruction sets. That’s about a 4x (for SSE) or 8x (for AVX) performance disadvantage.

The body-body interaction code (which NVIDIA estimates to be 20 FLOPS) is as follows:

template <typename T>
__host__ __device__ void bodyBodyInteraction(
    T accel[3],
    T x0, T y0, T z0, T mass0,
    T x1, T y1, T z1, T mass1,
    T softeningSquared)
{
    T dx = x1 - x0;
    T dy = y1 - y0;
    T dz = z1 - z0;
    T distSqr = dx*dx + dy*dy + dz*dz;
    distSqr += softeningSquared;
    T invDist = (T)1.0 / (T)sqrt((double)distSqr);
    T invDistCube =  invDist * invDist * invDist;
    T s = mass1 * invDistCube;
    accel[0] += dx * s;
    accel[1] += dy * s;
    accel[2] += dz * s;
}

This function (templated to provide for both float and double precision) is invoked between every pair of bodies. For each body, the gravitational influence of every other body is added up, then these forces are integrated to change the bodies’ positions and velocities for the next timestep.

Now, if you think a bit about how to accelerate this function using a SIMD instruction set, you realize that what you really want to do is execute it four times in parallel: instead of

T dx = x1 - x0;

you want to write:

__m128 dx = _mm_sub_ps( x1, x0 );

where four bodies’ X positions are subtracted from your body’s X position, and so on. But the memory layout used by NVIDIA’s sample isn’t amenable to this formulation: it must be modified to use an SOA (structure of arrays) representation instead of the default AOS (array of structures) representation. In the case of N-body, where each body has a 3D position (x, y, z) and a mass, the N bodies can be represented as:

typedef struct _body {
    float x, y, z;
    float mass;
} body;

An array of structures (AOS):

struct _body bodies[N];

then contains the input data. The structure of arrays (SOA) representation splits each member of the structure into its own array:

float *bodiesX;
float *bodiesY;
float *bodiesZ;
float *bodiesMass;

It’s easy enough to transform between the two representations such that these identities hold true:

bodies[i].x == bodiesX[i]
bodies[i].y == bodiesY[i]
bodies[i].z == bodiesZ[i]
bodies[i].mass == bodiesMass[i]

Since programmers think of the X/Y/Z/Mass tuples together, the AOS representation seems more intuitive. But for SSE and AVX, the advantage of the SOA representation quickly becomes obvious; since instructions such as SUBPS (subtract packed single-precision float) operate on corresponding floats in the 128-bit registers, it works much better if the input data is streaming into the CPU in the form of packed X’s, Y’s and Z’s as opposed to structures that have to be rearranged in registers. Intel, in fact, was actively evangelizing SOA representations in the early days of SSE, since it enabled 4×4 matrix multiplication of vertices (needed for the geometry pipeline) without rearranging them. As an additional bonus, if the vertices were 3D, there was no need to deal with the inconvenient 12-byte vertex size. Truly general-purpose applications require extra work due to SSE’s alignment constraints (it is fastest to load and store on 16-byte boundaries), and its predilection for doing pretty much 4 of anything at a time necessitates careful handling of boundary conditions.

Once the data is rearranged as SOA, you can reformulate bodybodyInteraction() using SSE intrinsics to operate on four partial sums (gravitational force contributions to a given body) as follows:

inline void
bodyBodyInteraction(
    __m128& a0, __m128& a1, __m128& a2,
    const __m128& x0, const __m128& y0, const __m128& z0,
    const __m128& x1, const __m128& y1, const __m128& z1, const __m128& mass1,
    const __m128& softeningSquared)
{
    __m128 dx = _mm_sub_ps( x1, x0 );
    __m128 dy = _mm_sub_ps( y1, y0 );
    __m128 dz = _mm_sub_ps( z1, z0 );
    __m128 distSq = _mm_add_ps( _mm_add_ps( _mm_mul_ps( dx, dx ), _mm_mul_ps( dy, dy ) ), _mm_mul_ps( dz, dz ) );
    distSq = _mm_add_ps( distSq, softeningSquared );
    __m128 invDist = rcp_sqrt_nr_ps( distSq );
    __m128 invDistCube = _mm_mul_ps( invDist, _mm_mul_ps( invDist, invDist ) );
    __m128 s = _mm_mul_ps( mass1, invDistCube );
    a0 = _mm_add_ps( a0, _mm_mul_ps( dx, s ) );
    a1 = _mm_add_ps( a1, _mm_mul_ps( dy, s ) );
    a2 = _mm_add_ps( a2, _mm_mul_ps( dz, s ) );
}

This routine uses the <xmmintrin.h> header file (available both on Linux for gcc and Windows via MSVC), which declares the __m128 data type and the intrinsic functions that correspond to the SSE instruction set.

This routine uses two subroutines, both of which I was disappointed not to find in <xmmintrin.h> (in fact, <xmmintrin.h> seems not to have changed since 1998 or so when it was introduced): a subroutine to compute the sum of the four 32-bit floats in an SSE register, and a subroutine to compute the reciprocal square root. Because the RSQRTPS instruction only returns a 12-bit approximation, the rcp_sqrt_nr_ps() function, adapted from here, performs one Newton-Raphson iteration to achieve almost-full single precision on the four 32-bit floats in the SSE register. NOTE TO CPU ARCHITECTS: the SFU (“special function unit) in CUDA hardware implements a much higher-precision (22-23 bits) approximation to reciprocal square root that often can be used directly. The instruction sequence generated by rcp_sqrt_nr_ps() is 6 FLOPS but as many as 10-12 instructions (depending on how the constants are handled).

static inline __m128
rcp_sqrt_nr_ps(const __m128 x)
{
    const __m128
        nr      = _mm_rsqrt_ps(x),
        muls    = _mm_mul_ps(_mm_mul_ps(nr, nr), x),
        beta    = _mm_mul_ps(_mm_set_ps1(0.5f), nr),
        gamma   = _mm_sub_ps(_mm_set_ps1(3.0f), muls);
    return _mm_mul_ps(beta, gamma);
}

The other subroutine, adapted from a stackoverflow answer, computes the “horizontal” sum of the four 32-bit floats in an SSE register:

static inline __m128
horizontal_sum_ps( const __m128 x )
{
    const __m128 t = _mm_add_ps(x, _mm_movehl_ps(x, x));
    return _mm_add_ss(t, _mm_shuffle_ps(t, t, 1));
}

I don’t know about you guys, but I don’t think this intrinsics-based SSE code comes close to approaching the readability of the CUDA code, which “looks” scalar and is templatized to support double precision.

It does run fast, however; the resulting N-Body implementation, which is still single-threaded, takes 54 milliseconds to process 4096 bodies (4096²=16777216 interactions) as opposed to 583 milliseconds for the naïve (presumably x87) implementation. I haven’t yet analyzed why it is 10x faster instead of 4x faster; I suspect the SSE reciprocal square root sequence is quite a bit faster than the scalar equivalent. Note, however, that recompiling the application with SSE2 optimizations enabled does not improve performance noticeably (not with Visual Studio, anyway).

Further performance opportunities exist in multithreading (it should scale almost-linearly in the number of cores) and porting to AVX, which should almost-double performance.

The not-yet-optimized CUDA implementation takes about 4.5 ms to do the same computation on a GK104: not 100x faster, but 10x faster and the code is much more readable. A side-by-side comparison of the two bodyBodyInteraction() routines illustrates the point made by Dr. Vincent Natoli, in his Kudos for CUDA article in hpcwire:

In a recent project we reduced 3,500 lines of highly-optimized C code to a CUDA kernel of about 800 lines. The optimized C was peppered with inline assembly, SSE macros, unrolled loops and special cases, making it difficult to read, extract algorithmic meaning and extend in the future. By comparison the CUDA code was cleaner and more readable. Ultimately it will be easier to maintain.

Which would you rather write?

T distSqr = dx*dx + dy*dy + dz*dz;

__m128 distSq = _mm_add_ps( _mm_add_ps( _mm_mul_ps( dx, dx ), _mm_mul_ps( dy, dy ) ), _mm_mul_ps( dz, dz ) );

The story’s still out on exactly how much of a performance advantage CUDA will have over a truly optimized CPU implementation of this application, but until vectorized compilers go mainstream, for this type of code, CUDA will have a big advantage in programmability.

Stream Callbacks

September 18, 2012Uncategorizedadmin

CUDA 5.0 adds a new feature called “stream callbacks,” a new mechanism for CPU/GPU synchronization. Previously, CPU/GPU synchronization was accomplished by calling functions like cuStreamSynchronize(), which returns when all preceding commands in the stream have been completed, or cuEventSynchronize(), which waits until the specified event has been recorded by the GPU.

I spent some time investigating how stream callbacks are implemented on Windows 7. cudaStreamAddCallback() specifies a callback that will be called once, when all preceding operations in the stream have completed. CUDA makes no representation about when this callback will be performed, or the order in which the callbacks will be called, except that callbacks in a given stream will be called in the same order they were specified to that stream.

It is hard to imagine how NVIDIA implemented stream callbacks without creating at least one extra CPU thread, something that we never did while I was working on CUDA. We always knew that we could do useful things by creating CPU threads without the developer’s knowledge or involvement, like asynchronous memcpy of pageable memory. But, we were reluctant to create threads without an opt-in or some other sign that the developer wanted extra threads running around in their process. Also, making the driver fully thread-safe was a very difficult problem, not solved until CUDA 4.0.

So I wrote an application to look at the number of active threads in the system (and note, “the number of active threads” is always subject to change in a multithreaded operating system – in Windows, you take a “snapshot” of the process, presumably implemented using copy-on-write semantics like fork() in old-time UNIX, and count the active threads in the snapshot) using code from this MSDN page.

I learned some interesting things. First of all, the CUDA runtime creates 2 threads at initialization time – my app reports 3 active threads after cudaFree(0) has been called. (I had to add a one-second sleep to my app to detect this, since snapshotting and counting the active threads immediately after calling cudaFree() still yielded a thread count of 1 – the application thread). Using the CUDA driver API, I confirmed that the CUDA driver does not create any extra threads at initialization time (cuInit()) but does create them along with the context (cuCtxCreate()).

Anyway, it seems the first time cudaStreamAddCallback() is called, yet another CPU thread is created for purposes of calling into stream callbacks. I instrumented my app to detect and report a change in the ID of the calling thread, and confirmed that only one thread is ever used for this purpose. (But it would seem unwise to rely on that assumption!)

It’s important not to make any assumptions about the threading model for stream callbacks. Make sure your code is thread-safe. The verbiage in the CUDA C Programming Guide doesn’t exactly lend confidence, either:

[C]allbacks must not use any CUDA APIs. They can, however, make blocking calls but should not create a transitive dependency on CUDA APIs that are not guaranteed to be ready before the callback. For example, waiting for a thread that is waiting on cudaStreamSynchronize() is as bad as calling it directly. Callbacks carry the same scheduling restrictions as other commands issued on streams.

They, in a given stream, execute in the order in which they are added. However, false dependencies between commands issued to different streams (refer to Streams) can introduce false dependencies between callbacks from different streams and between kernels and callbacks in different streams. Further, as the execution order of callbacks is not guaranteed, it is not safe to have synchronization in user code between callbacks.

“Blocking” callbacks are ones that suspend execution of the CUDA stream until they’ve returned.

So what are stream callbacks good for? Frankly, I am not yet sure. I don’t think they enable anything that wasn’t possible before – they seem to be implemented in terms of previously-available CUDA abstractions (a combination of blocking events and having the CUDA driver’s CPU thread alternate between waiting for a CUDA event and calling into the application’s callback). Maybe they’re intended to enable something on applications that use nested parallelism.

In the meantime, if stream callbacks provide a more convenient mechanism to implement your application, no one will fault you for using them.

Streaming Multiprocessors – updated!

August 30, 2012Uncategorizedadmin

A new version of the Streaming Multiprocessors chapter has been uploaded, this one with merged coverage of the math library for float and double, plus improved coverage of shared memory (especially shared memory atomics) and conditional code.

The code emitted by the compiler when performing shared memory atomics turns out to be the perfect illustration of how CUDA hardware handles conditional code. For the SM 2.0 architecture, a shared atomic add compiles to the following microcode (excerpted from Listing 8-2):

     /*0040*/     SSY 0x80;
     /*0048*/     BAR.RED.POPC RZ, RZ;
     /*0050*/     LD R0, [R0];
     /*0058*/     LDSLK P0, R2, [R3];
     /*0060*/     @P0 IADD R2, R2, R0;
     /*0068*/     @P0 STSUL [R3], R2;
     /*0070*/     @!P0 BRA 0x58;
     /*0078*/     NOP.S CC.T;

The SSY/NOP.S instructions bracket the divergence and convergence of the loop, which iterates until all threads in the warp have performed their atomic operation. Inside the loop, the LDSLK instruction attempts to lock the bank in shared memory. The actual atomic operation is predicated on whether the lock was acquired; conversely, the branch to reattempt the atomic operation is predicated so as to be taken if the lock was not acquired.

Enjoy the new version of the chapter, and stay tuned as we turn our attention to the other topics that the book will cover!