cudaGetLastError(): Its Raison D’Être

I am not a great fan of cudaGetLastError(), but after exhorting people never to call it, I feel obliged to mention that there is a circumstance where it is the only way to query whether an error occurred: after a kernel launch that may have failed due to a misconfiguration, such as launching the kernel in an invalid stream or requesting more shared memory than the hardware can support. In production code, almost all kernel launches are known to succeed, because the developer found and fixed all the bugs that led to misconfigured kernel launches in the first place. But for kernels whose configurations are computed at runtime, it is conceivable that the kernel launch itself may fail, and developers can check that error status by calling cudaGetLastError() immediately after the kernel launch.
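As a minimal sketch of that pattern (my example, not from the sample code; the kernel name is hypothetical), the launch below fails at launch time because it requests more dynamic shared memory than the hardware supports, and cudaGetLastError() is the only way to observe the failure:

#include <cstdio>

__global__ void EmptyKernel() { }

int main()
{
    // Request 1 MB of dynamic shared memory (far more than current
    // GPUs support), so the launch itself fails.
    EmptyKernel<<<1, 1, 1024*1024>>>();

    // The kernel never launched, so no amount of synchronizing would
    // surface the error; only cudaGetLastError() reports it.
    cudaError_t status = cudaGetLastError();
    printf( "launch status: %s\n", cudaGetErrorString( status ) );
    return 0;
}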

What it will not do, though—and this is important—is tell you whether the kernel executed successfully. Kernel launches have always been asynchronous, and calling cudaGetLastError() does not synchronize with the GPU. So if your kernel performs an invalid memory reference, the error status returned by cudaGetLastError() (and by cudaDeviceSynchronize() and the other functions that report errors caused by previous operations) is not updated until the kernel actually executes and encounters the error.

To illustrate the point, I created a variant of nullKernelAsync.cu called cudaGetLastErrorIsAsynchronous.cu. The original nullKernelAsync.cu is designed to measure the overhead of kernel launch by repeatedly launching a null kernel, measuring the wall clock time, and reporting the launch rate in microseconds per launch. The NULL kernel in question is designed to ensure that the compiler emits some code, but doesn’t actually do any work:

__global__
void
NullKernel( volatile int *p, bool write, int a=0, int b=1, int c=2, int d=3, int e=4, int f=5, int g=6 )
{
    if ( write && 0==threadIdx.x && 0==blockIdx.x ) {
        *p = a+b+c+d+e+f+g;
    }
}

The original timing loop invokes this kernel with the second parameter always false, ensuring that it does not try to write to the NULL memory location:

chTimerGetTime( &start );
for ( int i = 0; i < cIterations; i++ ) {
    NullKernel<<<1,1>>>( NULL, false );
}
cuda(DeviceSynchronize());  // cuda() is the sample's error-checking macro
chTimerGetTime( &stop );

Since the GPU can’t consume kernel launches, even ones that do nothing, as fast as the CPU can launch them, the end result is a measurement of the overhead of a kernel launch.

To show that cudaGetLastError() is asynchronous, I updated the sample to cause the NULL kernel to dereference a NULL pointer on the last invocation, triggering the asynchronous error reporting mechanism described in an earlier post, whereby cudaDeviceSynchronize() and other functions belatedly report an error condition encountered by a running kernel.

The kernel is the same as before, but after the timing loop, a final launch passes true for the second parameter, causing the kernel to dereference NULL and put the CUDA context into an invalid state:

cudaEvent_t ev;
cudaError_t status;
cuda(EventCreate( &ev ));

chTimerGetTime( &start );
for ( int i = 0; i < cIterations; i++ ) {
    NullKernel<<<1,1>>>( NULL, false );
}
cuda(EventRecord( ev ));
NullKernel<<<1,1>>>( NULL, true );

// The GPU has not yet caught up with the queued launches, so the
// faulting kernel has not executed and no error is visible yet.
status = cudaEventQuery( ev );
std::cout << "cudaEventQuery returned " << status << std::endl;
status = cudaGetLastError();
std::cout << "cudaGetLastError returned " << status << " (before cudaDeviceSynchronize())" << std::endl;

// this returns an error due to the deliberate dereference of NULL on the last kernel invocation
(void) cudaDeviceSynchronize();
status = cudaGetLastError();
std::cout << "cudaGetLastError returned " << status << " (after cudaDeviceSynchronize())" << std::endl;
cuda(EventDestroy( ev ));
chTimerGetTime( &stop );

The cudaEventQuery() call confirms that the queued launches haven't yet been processed: it returns cudaErrorNotReady (600). Only after cudaDeviceSynchronize() does cudaGetLastError() report cudaErrorIllegalAddress (700). Expected output from the application is as follows:

Measuring asynchronous launch time... cudaEventQuery returned 600
cudaGetLastError returned 0 (before cudaDeviceSynchronize())
cudaGetLastError returned 700 (after cudaDeviceSynchronize())
0.00 us

What’s really fun is that if enough time elapses between the faulting operation and the call to cudaGetLastError(), it may, indeed, return an error. Without explicit synchronization, it amounts to a race condition: a debug build with compiler optimizations disabled, for example, may run slowly enough for the error condition to be reported.
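If you need to know whether a kernel executed successfully, the race-free pattern is to synchronize first and only then query the error status. A minimal sketch, with a hypothetical FaultyKernel standing in for the sample's NullKernel:

#include <cstdio>

__global__ void FaultyKernel( int *p ) { *p = 42; }  // faults when passed NULL

int main()
{
    FaultyKernel<<<1, 1>>>( NULL );

    // Immediately after launch, only configuration errors are visible.
    cudaError_t launchStatus = cudaGetLastError();
    printf( "launch status: %d\n", launchStatus );  // expect 0 (cudaSuccess)

    // After synchronizing, the execution error has been reported.
    cudaError_t execStatus = cudaDeviceSynchronize();
    printf( "exec status: %d\n", execStatus );  // expect 700 (cudaErrorIllegalAddress)
    return 0;
}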

So if you need to know whether a kernel launch was correctly configured, go ahead and call cudaGetLastError() immediately after launching the kernel. Otherwise, I stand by the claim that cudaGetLastError() is best avoided; check your error codes, and make sure your kernel launches are always configured correctly.

What’s Next For Intel? No One Knows

This important point seems obvious in retrospect:

In addition to the CEO, the Board of Directors of enterprises whose core products are engineered must have enough of an engineering background to make a clear-eyed assessment of the best path forward.

Usually, the importance of having technologists run technology companies finds expression in the form of CEO selection. Microsoft suffered a lost decade during Steve Ballmer’s tenure as CEO. The story may be apocryphal, but Microsoft lore held that Ballmer’s signature accomplishment at Procter & Gamble had been to design packaging that literally crowded competitors out of store shelves. In any case, he was a marketer, not a technologist, and Microsoft found its footing again after appointing Satya Nadella to replace him as CEO.  

Boeing lost its way under CEOs whose backgrounds favored finance over engineering, with Jim “Prince Jim” McNerney referring to longtime engineers and skilled machinists as “phenomenally talented assholes” and encouraging their ouster from the company. The disastrous performances of the 787 and 737 MAX are widely attributed to Boeing’s embrace of financial as opposed to aeronautical engineering.

For Intel’s part, the beginning of the end appears to date back to the CEOs appointed after Paul Otellini (May 2005-May 2013): Brian Krzanich (May 2013-June 2018) and especially ex-CFO Bob Swan (June 2018-January 2021) recorded tenures marred by acquisitions of dubious merit, blunders in product development, and a loss of Intel’s historic lead in semiconductor fabrication.

The commentary around Gelsinger’s ouster quickly coalesced into two camps, as summarized by Dr. Ian Cutress:

[embedded tweet by Dr. Ian Cutress, summarizing the two camps]

John Carmack made his presence known in the first camp with this tweet:

[embedded tweet by John Carmack]

When Intel first announced that Pat Gelsinger would return as CEO, I was surprised to hear that he even wanted the role. Intel had been foundering for years, and Gelsinger was in a position to know just how deep a hole they’d dug for themselves. He tried, and apparently failed, to set expectations with the public and with the board as to what a long and difficult road lay ahead. Breaking apart CPU product development and chip fabrication, as AMD did with GlobalFoundries almost 15 years ago, was the right thing to do. Lobbying for the enactment of the CHIPS Act, and soliciting federal subsidies to re-shore semiconductor manufacturing, also was the right thing to do. It was going to take a long time, and some serious politicking: on the one hand, layoffs seemed inevitable and necessary; on the other, Members of Congress being asked to support a chipmaker with billions of taxpayer dollars don’t want to hear about the need for layoffs.

It turned out to be too difficult an optimization problem to solve in the time allotted. To many, it doesn’t seem reasonable for Intel’s Board to have expected a turnaround in the time Gelsinger had at the helm, and it is disqualifying for them to oust him without a succession plan.

I am no fan of Intel. In its heyday, Intel indulged in anticompetitive practices that put to shame anything Microsoft attempted in the 1990s, and it never got the regulatory scrutiny it deserved. Before Transmeta, there was Intergraph. AMD and NVIDIA eventually prevailed in antitrust settlements, and it is genuinely shocking to realize that as recently as 2016, Intel was paying $300M per year to NVIDIA as part of a $1.5B private antitrust settlement. How quickly the mighty have fallen!

At the same time, Intel’s steadfast commitment to executing on its core technical roadmap – improving the x86 architecture without breaking backward compatibility – enabled it to bring volume business models that positively disrupted not only the PC industry (1980s) but also the workstation industry (1990s) and the HPC and data center markets (2000s). Intel democratized computing in a way that few other companies can claim. For that, the company deserves our gratitude and respect, and we all should be pulling for its turnaround and a brighter future.

Unfortunately for Intel and the board, it is not at all clear that ousting Pat Gelsinger will lead to those favorable outcomes.