I am not a great fan of cudaGetLastError(), but after exhorting people never to call it, I feel obliged to mention that there is a circumstance where it is the only way to query whether an error occurred: after a kernel launch that may have failed due to a misconfiguration, such as launching the kernel in an invalid stream or requesting more shared memory than the hardware can support. In production code, almost all kernel launches are known to succeed, because the developer found and fixed all the bugs that led to misconfigured kernel launches in the first place. But for kernels whose configurations are computed at runtime, it is conceivable that the kernel launch itself may fail, and developers can check that error status by calling cudaGetLastError() immediately after the kernel launch.
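Here is a minimal sketch of that pattern; the kernel name, launch configuration, and arguments are placeholders for values computed at runtime, not part of any sample:
myKernel<<<blocks, threads, sharedBytes>>>( args );
cudaError_t launchStatus = cudaGetLastError();
if ( cudaSuccess != launchStatus ) {
    // The launch itself was rejected (e.g., invalid stream or more shared
    // memory than the hardware supports); nothing was enqueued on the GPU.
    fprintf( stderr, "Kernel launch failed: %s\n", cudaGetErrorString( launchStatus ) );
}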
What it will not do, though (and this is important), is tell you whether the kernel executed successfully. Kernel launches have always been asynchronous, and calling cudaGetLastError() does not synchronize with the GPU. So if your kernel makes an invalid memory reference, the error status returned by cudaGetLastError() (and by cudaDeviceSynchronize() and all the other functions that report errors caused by previous operations) is not updated until the running kernel actually encounters the error.
To illustrate the point, I created a variant of nullKernelAsync.cu called cudaGetLastErrorIsAsynchronous.cu. The original nullKernelAsync.cu is designed to measure the overhead of kernel launch by repeatedly launching a null kernel, measuring the wall clock time, and reporting the launch rate in microseconds per launch. The NULL kernel in question is designed to ensure that the compiler emits some code, but doesn’t actually do any work:
__global__
void
NullKernel( volatile int *p, bool write, int a=0, int b=1, int c=2, int d=3, int e=4, int f=5, int g=6 )
{
    if ( write && 0==threadIdx.x && 0==blockIdx.x ) {
        *p = a+b+c+d+e+f+g;
    }
}
The original timing loop invokes this kernel with the second parameter always false, ensuring that it does not try to write to the NULL memory location:
chTimerGetTime( &start );
for ( int i = 0; i < cIterations; i++ ) {
    NullKernel<<<1,1>>>( NULL, false );
}
cuda(DeviceSynchronize());
chTimerGetTime( &stop );
Since the GPU can’t consume kernel launches, even ones that do nothing, as fast as the CPU can launch them, the end result is a measurement of the overhead of a kernel launch.
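For reference, the reported figure is just the elapsed wall clock time divided by the iteration count. A sketch of that step, assuming the chTimerElapsedTime() helper used by the samples reports elapsed time in seconds, might look like this:
double elapsedSeconds = chTimerElapsedTime( &start, &stop );
double usPerLaunch = elapsedSeconds * 1e6 / (double) cIterations;
printf( "%.2f us\n", usPerLaunch );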
To show that cudaGetLastError() is asynchronous, I updated the sample so that the NULL kernel dereferences the NULL pointer on its last invocation. That triggers the asynchronous error reporting mechanism described in an earlier blog, where cudaDeviceSynchronize() and other functions belatedly report an error condition encountered by a running kernel.
The kernel is the same as before, but the timing loop calls it with the second parameter true, causing it to dereference NULL and put the CUDA context into an invalid state:
chTimerGetTime( &start );
for ( int i = 0; i < cIterations; i++ ) {
    NullKernel<<<1,1>>>( NULL, false );
}
cuda(EventRecord( ev ));
// Final launch: passing true for the second parameter makes the kernel dereference NULL.
NullKernel<<<1,1>>>( NULL, true );
// Expected to return cudaErrorNotReady: the GPU has not yet consumed the queued launches.
status = cudaEventQuery( ev );
std::cout << "cudaEventQuery returned " << status << std::endl;
status = cudaGetLastError();
std::cout << "cudaGetLastError returned " << status << " (before cudaDeviceSynchronize())" << std::endl;
// cudaDeviceSynchronize() returns an error here because of the deliberate
// dereference of NULL in the last kernel launch; its return value is discarded
// so that the subsequent cudaGetLastError() can report the sticky error.
(void) cudaDeviceSynchronize();
status = cudaGetLastError();
std::cout << "cudaGetLastError returned " << status << " (after cudaDeviceSynchronize())" << std::endl;
cuda(EventDestroy(ev));
chTimerGetTime( &stop );
The event is used to confirm that the previously queued kernel launches haven’t yet been processed by the GPU. Expected output from the application is as follows:
Measuring asynchronous launch time... cudaEventQuery returned 600
cudaGetLastError returned 0 (before cudaDeviceSynchronize())
cudaGetLastError returned 700 (after cudaDeviceSynchronize())
0.00 us
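For the record, 600 is cudaErrorNotReady (the event has not been reached yet) and 700 is cudaErrorIllegalAddress. If you would rather not decode raw enum values, cudaGetErrorName() and cudaGetErrorString() translate a cudaError_t into readable text; for example, the reporting lines above could have been written as:
std::cout << "cudaGetLastError returned " << cudaGetErrorName( status )
          << ": " << cudaGetErrorString( status ) << std::endl;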
What’s really fun is that if enough time elapses between the faulting kernel launch and the call to cudaGetLastError(), it may, indeed, return an error. Without explicit synchronization, it amounts to a race condition. For example, a debug build with compiler optimizations disabled may run slowly enough for the error condition to be returned, simply because the unoptimized host code gives the GPU time to hit the fault before cudaGetLastError() is called.
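You can make the race visible, though never reliable, with nothing more than a host-side delay. This hypothetical fragment (not part of the sample; it needs <chrono> and <thread>) may or may not report the error, depending entirely on how quickly the GPU hits the fault:
NullKernel<<<1,1>>>( NULL, true );                                // deliberately faulting launch
std::this_thread::sleep_for( std::chrono::milliseconds( 100 ) );  // give the GPU time to fault
cudaError_t racyStatus = cudaGetLastError();                      // cudaSuccess or cudaErrorIllegalAddress
std::cout << "cudaGetLastError returned " << racyStatus << std::endl;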
So if you need to know whether a kernel launch was correctly configured, go ahead and call cudaGetLastError() immediately after launching the kernel. Otherwise, I stand by the claim that cudaGetLastError() is best avoided; check your error codes, and make sure your kernel launches are always configured correctly.