I submitted the final manuscript last Friday, and thought I would reflect briefly on how it aligns with my original goals.
I’ve wanted write a book on CUDA for years. Until I left NVIDIA, I was just too busy building CUDA to work on it. So when it came time to write up a proposal, I’d been thinking about the subject matter and the organization for some time.
One of the exercises that authors undertake (and that editors demand as part of a proposal) is a competitive analysis: What books already exist that address the topic? How will yours be different? Why would someone buy your book instead of one of those other books? I knew I wanted my book to be comprehensive, covering topics like the driver API, all CUDA hardware, and all CUDA-capable platforms. I wanted it to cover both software abstractions (like streams and events) and how to write CUDA kernels. And as I weighed those aspirations and put together outlines and looked at the competitive landscape, it became clear to me: Writing a book on CUDA is hard.
When I said that to Ian Buck, he got an expression like a whipped puppy. “But why?” he asked. Hearing that anything about CUDA is hard upsets Ian because he believes the key to CUDA’s success has been simplicity. I told him, “Because in order to cover a topic correctly, I wind up explaining the same thing more than once, from different perspectives.” That is true, and is reflected in my book; but even books that cover only the CUDA runtime and the latest CUDA hardware can’t get away from the fundamental difficulty of CUDA programming: performance optimization is hard, and no one uses CUDA because it is cute and cuddly. People only ever use CUDA because it can run their applications faster. The reason CUDA programmers worry about performance bottlenecks and implementation details is because those considerations are weighed by developers engaged in low-level optimization, whether on GPUs or CPUs.
In looking at the landscape, I saw a lot of books that presented a lot of material on CUDA, but hadn’t imposed much of an organizational structure. So my book is organized into three parts. Part I gives overviews of the hardware, the software, and the operating environment; Part II (“Details”) gives in-depth descriptions of various CUDA abstractions, like memory, streams and events, and texturing; Part III (“Select Applications”) covers the full gamut of classes of CUDA application: streaming workloads (where PCI Express transfer overhead figures prominently), key parallel algorithms Reduction and Scan, an illustrative compute-intensive workload (N-body, the anti-streaming workload), and an image processing workload that combines texturing and shared memory for performance.
The source code accompanying Part II consists almost entirely of microdemos and microbenchmarks, while the source code accompanying Part III consists of several implementations of the same exact operation, with different tradeoffs in performance and complexity. Some of the microbenchmarks in Part II are intended to be reused (plug your own kernel into a makework kernel, for example), while reuse of code from Part III may be trickier. I did what I could, but reuse of application-specific code is intrinsically more difficult.
The code in the Streaming Workloads chapter can definitely intended be reused – just replace the SAXPY kernel with a different kernel, the more compute-intensive the better. Even calculations as complex as Black-Scholes options computation are transfer-bound on CUDA hardware*.
With Scan, one problem that hinders reuse is that NVIDIA has added numerous primitives that make Scan more efficient (like syncthreads_count() and warp shuffle). For a book that aspires to cover all CUDA hardware (all the way back to the seminal SM 1.x, “Tesla” hardware), that presents a challenge. For another, Scan has so many variants (inclusive/exclusive, segmented scan, predicate scan) that covering all permutations just of the basic primitive requires a lot of space, without covering any applications like Radix Sort or stream compaction. And if you try to cover all those permutations in the source code, you wind up with code that bears a disturbingly close resemblance to Thrust, except that Thrust was written by NVIDIA employees with a much more intimate understanding of modern C++ programming idioms than yours truly.
I will say this: The Scan chapter does a better job of covering the topic than any other book I have seen. I could have covered more, but then again, you could devote 100+ pages to the topic without covering everything.
For N-body, the fastest implementation just stages tiles into shared memory to make the data available to the SMs with lower latency; but after reviewing the literature on computationally-dense workloads, I saw an opportunity to broaden coverage beyond that. For one thing, some applications need shared memory for other purposes than a read-only, software-managed cache. The Direct Coulomb Summation code, described by Stone et al. and that is covered in detail in Kirk & Hwu’s Programming Massively Parallel Processors, stages read-only atom descriptions through constant memory, working around the 64K constant memory limit by doing the computation on 4,000 16-byte atoms at a time. So my book has an implementation that mirrors that strategy, even though it is slightly slower than the shared memory formulation.
A colleague pointed out that some N-body computations – in particular, the ones in the AMBER molecular modeling code – exploit symmetry of forces. So I spent some time exploring ways to take advantage of the symmetry of gravitational forces, without much success. One problem is that the gravitational computation is so lightweight that doing twice as many is faster than saving the work! AMBER does its calculations on 32×32 tiles (to correspond to CUDA’s warp size), and the source code on github includes an implementation that mirrors that strategy. It is sufficiently slower that I don’t even cover the source code in the book; as a compromise, I describe the strategy in the beginning of the chapter that gives an overview of the computation.
I’m cautiously optimistic that some method of exploiting symmetric forces will bear fruit, even for fairly lightweight calculations, but I wasn’t going to let that perfect be the enemy of the book’s done.
There are a couple of self-indulgent topics in the book: float->half conversion, normalized correlation, an SSE-optimized N-body implementation. But those were all included for good reasons. I still think the float->half conversion (which may seem out of place in the Streaming Multiprocessors chapter) is one of the best ways for developers to learn about floating point precision and rounding. Normalized correlation is the only application in the book that combines texturing and shared memory. And the SSE-optimized N-body implementation gives a stark illustration of how much easier CUDA is to program, while still yielding better performance. Even when biasing the results in favor of the CPU, N-body is still 8x faster on GK104 then on a dual-socket Xeon E2670 machine. (Porting to AVX might double performance on the CPU side, but by the same token, running on a GK110 might double performance on the GPU side, and using a faster GPU is a whole lot less software work than porting to AVX.) The reported result may not make NVIDIA happy (since I am reporting an 8x improvement, not 400x) and may not make Intel happy (since it underscores the difficulty of SIMD coding), but I think it puts CUDA in a positive light.
The book can be preordered on amazon.com.
* Black-Scholes was the workload that I used to prove out mapped pinned memory when we added it for the chipset team in CUDA 2.2, and we were pleasantly surprised to discover that GT200 also got a big benefit. (The architect who’d invested a lot of effort into making sure GT200 would be good at system memory rendering was gratified, but not surprised.)