The following errors have been discovered in the first printing of The CUDA Handbook.

If you find a mistake that is not listed here, please write an email so it can be added to this page and possibly corrected in future printings.

Page 34-35: Figures 2.24 and 2.25 (CPU-bound and GPU-bound push buffers) are reversed.

Page 102: Table 4.8 should read: “machine [32:64]” (credit to Daniel Galvez)

Page 175: “Without it, the GPU would still be processing the last kernel invocations when the end top is recorded…” – that should be “when the end time is recorded…”

Page 252: In Table 8.11, second row from the bottom (log2x), the base of the logarithm (2) should be subscripted and x should not be subscripted.

Page 259: For the atan2() function, the expression is incorrect. There should not be an ‘x’ between the superscripted -1 and the parenthesized expression (y/x).

Pages 307-308: the next multiple of 64 above 950 is 960, not 964.
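
The arithmetic behind this correction can be checked with a small round-up helper. This is only a sketch in plain host-side C++; the name `roundUpToMultiple` is hypothetical and does not come from the book.

```cpp
#include <cstddef>

// Hypothetical helper (not from the book): rounds n up to the nearest
// multiple of `multiple`. A value that is already a multiple is returned
// unchanged. Assumes multiple > 0.
constexpr std::size_t roundUpToMultiple(std::size_t n, std::size_t multiple)
{
    return ((n + multiple - 1) / multiple) * multiple;
}

// 950 + 63 = 1013, 1013 / 64 = 15 (integer division), 15 * 64 = 960.
static_assert(roundUpToMultiple(950, 64) == 960, "950 rounds up to 960");

// 964 leaves a remainder of 4, so it cannot be a multiple of 64.
static_assert(964 % 64 == 4, "964 is not a multiple of 64");
```

Both checks run at compile time, so the file compiles only if 960 is the right answer.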

Page 310: The “width” parameter of cudaMallocArray is the number of elements, not bytes.

Page 366: Figure 12.1 – the “log-step reduction” expression should read: “(((a0+a4)+(a1+a5))+((a2+a6)+(a3+a7)))”

Page 393: Figure 13.10 contains some extra fans. Numbering the columns 0 to 15 (left to right), and the rows 0 to 4 (top to bottom), the fans originating at locations (3, 1), (7, 1) and (11, 1) should not be present. Thanks to Peter Longhurst for pointing this out!

Page 417: Line 147 of streamCompact_odd.cuh declares an int instead of T (thanks to Louise Knight for pointing this out!):

<int value = (index < N) ? in[index] : 0;
>T value = (index < N) ? in[index] : 0;

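To illustrate why the declared type matters, here is a host-only C++ sketch of the same guarded load (no CUDA toolchain needed). The function names `loadAsInt` and `loadAsT` are hypothetical stand-ins, not code from the book. They assume T is an arithmetic type such as float.

```cpp
#include <cstddef>

// Hypothetical stand-ins for the guarded load; not the book's code.

// As printed: the element is stored in an int, so a float or double
// input silently loses its fractional part.
template <typename T>
T loadAsInt(const T *in, std::size_t index, std::size_t N)
{
    int value = (index < N) ? in[index] : 0;
    return static_cast<T>(value);
}

// As corrected: the element keeps the template type T.
// Out-of-range indices still yield 0.
template <typename T>
T loadAsT(const T *in, std::size_t index, std::size_t N)
{
    T value = (index < N) ? in[index] : 0;
    return value;
}
```

With a float input of 2.5f, the `int` version returns 2.0f and the `T` version returns 2.5f. For int inputs the two behave identically, which may be why the mistake is easy to miss.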

One thought on “Errata”

  1. Hi, I think there is something wrong in the function “chMemcpyHtoD”, which reads as follows:

    chMemcpyHtoD( void *device, const void *host, size_t N )
    cudaError_t status;
    char *dst = (char *) device;
    const char *src = (const char *) host;
    int stagingIndex = 0;
    while ( N ) {
        size_t thisCopySize = min( N, STAGING_BUFFER_SIZE );

        CUDART_CHECK( cudaEventSynchronize( g_events[stagingIndex] ) );
        memcpy( g_hostBuffers[stagingIndex], src, thisCopySize );
        CUDART_CHECK( cudaMemcpyAsync( dst, g_hostBuffers[stagingIndex], thisCopySize,
            cudaMemcpyHostToDevice, NULL ) );
        CUDART_CHECK( cudaEventRecord( g_events[1-stagingIndex], NULL ) );
        dst += thisCopySize;
        src += thisCopySize;
        N -= thisCopySize;
        stagingIndex = 1 - stagingIndex;
    }

    It seems odd that the event is recorded with “1-stagingIndex” in “CUDART_CHECK( cudaEventRecord( g_events[1-stagingIndex], NULL ) );” rather than “stagingIndex”; this causes the copies to run serially.
    After I changed “1-stagingIndex” to “stagingIndex”, I still got the right result, and the copy ran twice as fast.
