A Tale of Technical Debt
Early on in CUDA’s development, I made a change to the driver API that moved the context from being a parameter to every function, to being stored in a TLS (thread local storage) slot.
The interface change resulted in the context parameter being removed from almost every CUDA entry point. For example, the function to create a CUDA array changed from:
CUresult cuArrayCreate( ctx, array, descriptor );
CUresult cuArrayCreate( array, descriptor );
and the context was taken from a TLS slot. TLS slots, described here for the GNU toolchain and here for Microsoft Windows, resemble globals, in that they can be accessed at any time without having been passed as a parameter; but every CPU thread gets its own copy of the contents. Whenever the operating system does a context switch and swaps in a new thread, it updates the registers to make it so the thread gets its own copies of its TLS slots.
The change engendered a conversation among the scant few developers working on CUDA at the time. All other things being equal, statefulness is a bad idea in API design, because it makes software less composable. To examine the implications of changing the scope of state, we’ll first examine the floating point control word (FPCW), an example of statefulness in CPU designs that causes problems.
FPCW (Floating Point Control Word)
The Floating Point Control Word (FPCW) is a special register that dates back to the original 8087 floating point unit (FPU), which first became available from Intel in 1981. As you might surmise from the name, the FPCW contains bit fields that control the precision, round mode, and exception handling behavior of every floating point operation. Although the designers of the IEEE floating point standard anticipated benefits of having a register that implicitly affected every floating point operation, that implicit behavior has had unintended side effects, as described around the Internet in fpcw trashing, revisited, or Third party code is modifying the FPU control word, or Someone’s Been Messing With My Subnormals.
If a function changes the FPCW, the behavior of your floating point code can change as a result of having called a seemingly-innocuous function. There also have been cases where loading DLLs (including the Microsoft C runtime) changed the FPCW, which is especially problematic given that Windows has no ordering guarantees as to the order in which DLLs get loaded!
The problem is that across a function call boundary, there are several equally legitimate considerations concerning the state in question:
- The caller may wish to influence the callee’s behavior by controlling the state→the callee must not modify the state.
- The callee may be providing the service of setting the state to its caller→the callee modifies the state on behalf of the caller.
- The callee may have specific requirements for the state that conflict with the caller’s requirements→the callee must save and restore the state.
Case 1) is reflected in the IEEE floating point specification, which requires that compliant implementations include a round mode in the control word. The directed rounding modes (round toward negative and positive infinity) were added to support interval arithmetic, where numbers are represented by a range (lower and upper bounds) instead of a singular value that is rounded to the precision of the floating point format being used for the computation. To correctly perform arithmetic on intervals, the lower bound must be rounded toward negative infinity and the positive bound must be rounded toward positive infinity. If we denote an interval as a tuple [lwr, upr], interval addition is implemented as follows:
a+b[lwr,upr] = [RoundDown(a.lwr,b.lwr),RoundUp(a.upr,b.upr)]
The reason the standard specified a round mode that implicitly affects behavior, rather than simply defining the operations needed by the standard, is that the designers believed interval arithmetic could be implemented by calling a function twice: compute the lower bound by calling the function with the round mode set to RoundDown, then compute the upper bound by calling the function with the round mode set to RoundUp. The problem, as shown above, is that even primitive arithmetic operations like addition must be done with different round modes in close proximity. If the round mode must be changed with great frequency, the performance hit from the increase in static and dynamic instruction count is exacerbated by a quirk of implementation: updates to FPCW are expensive.
I am a little perplexed as to why the IEEE specification requires that these round modes be included in a persistent state, rather than in each instruction. When the first edition of the standard was ratified, FPUs (floating point units) were still in the early stages of development; it may be that CPU designers simply did not want to waste op code space on round modes. The original FPCW also included mode bits for precision, a design tradeoff that sort of made sense for the Intel 8087 (which has a stack of up to 8 registers in a canonical 80-bit precision) but does not make sense when 64-bit double precision values occupy twice as much register space as 32-bit floats. For this reason, the SSE and, later, AVX variants of the x86 instruction set adopted the model of including the precision of the floating point operation in the instruction.
Cases 2) and 3) came to light during development of Direct3D’s geometry pipeline. For Case 3), we required that the FPCW be set to 32-bit precision, to ensure that divide instructions would execute faster. For Case 2), we found that forcing the FPCW to 32-bit precision caused some software to misbehave because its code relied on the FPCW being set to 64-bit precision. One way to deal with the caller and callee disagreeing on how the FPCW should be set would be to update the ABI to save and restore the FPCW across function call boundaries. The problem is, some functions provide the service of setting the FPCW. In Direct3D9, applications can specify the
D3DCREATE_FPU_PRESERVE flag, instructing Direct3D not to tamper with the value of the FPCW: “Set the precision for Direct3D floating-point calculations to the precision used by the calling thread.”
More recent instruction set designs, such as NVIDIA’s GPU instructions for directed rounding, tend to specify the round mode on a per-instruction basis. Hence, CUDA intrinsics like
__fadd_rd correspond to instructions that round in a particular direction, as detailed here. Enabling floating point operations to be rounded in different directions on an instruction-by-instruction basis is a better fit with the requirement to implement fast interval arithmetic.
The DEC Alpha instruction set struck an elegant compromise between the two: it encoded the round mode as a 2-bit field in the opcode, and one of the encodings specified that the round mode be retrieved from the floating point control word. This feature enabled Alpha developers to have their cake and eat it too: they had per-instruction control over the rounding, but if they wanted their caller to specify the round mode, they could just compile their apps to use the control word all the time. The only slight oversight in the design was that because only 2 bits were reserved in the op code, and there are four (4) IEEE-compliant round modes, the designers had to pick one of the 4 round modes to be the one that could only be specified through the control word, and unfortunately they chose round-up (round toward positive infinity). Probably they should have given that status to the most common round mode, round-to-nearest-even. If ever there were a round mode that developers wanted to override with the control word, round-to-nearest-even is the one.
What does all that FPCW history have to do with CUDA contexts? Well, just as every floating point operation must have a precision and a round mode, almost every CUDA operation must have a CUDA context. Just as instruction set designers had to choose between specifying the round mode in every instruction, versus inferring it from the control word, we had to choose between having every CUDA function take the context as a parameter, or placing it into less-frequently referenced state where we could infer the value.
Given all the difficulties discussed above with having per-thread state instead of per-instruction (or per-API call) parameters, why would we willingly incur this pain?
Current CUDA Contexts
There is a simple explanation as to why it made sense for us to make CUDA context per-thread. If you go back to the earliest versions of CUDA, you will see that the set-context functions required that a context can only be current to one thread at a time.
By imposing this restriction, we were able to ensure that CUDA contexts were thread-safe, a table stakes requirement for API designers in the mid-2000s and most API designers today. (The CUDA team seems to have revisited this requirement for CUDA Graphs, reverse-delegating thread safety onto API clients.) We were deliberately taking on technical debt, because making CUDA contexts thread-safe would not have been a good use of our scant engineering resources. There were several levels of granularity for thread safety in the driver: not just global and context-wide, but also some more-granular data structures that all had their own mutexes. When multiple levels of scope are involved, deadlock scenarios in large, complex code bases become eminently plausible, and writing the test code to smoke out those bugs didn’t seem like a good use of resources when we could just legislate our way out of the problem.
And we didn’t need multiple CPU threads to “feed” our GPUs; with streams and events, a single CPU thread was adequate to keep a GPU busy. We had plausible use cases where we wanted to concurrently drive multiple GPUs, but the application could just create a thread per GPU and make a different CUDA context current to each one.
The reason I can frame this decision as technical debt is because on the one hand, by ensuring that only one thread could be in a context at a time, we were ensuring that the API was thread-safe; but on the other, once we decided to make CUDA thread-safe, we could seamlessly expose that new capability simply by relaxing that restriction and enabling multiple threads to have a given CUDA context current at a time. CUDA 4.0 (c. 2011) was the first version to enable this functionality.
Technically, there is a compatibility break when you relax parameter validation restrictions; but it is such an innocuous compatibility break that we all take for granted that it won’t impact any real-world applications. Consider that the gold standard of compatibility, the x86 instruction set, also technically breaks compatibility whenever new instructions are added, because formerly-invalid op codes suddenly start having architectural side effects instead of signaling invalid-opcode exceptions. Similarly, relaxing CUDA 4.0’s parameter validation caused any apps that previously had been attempting to attach CUDA contexts to more than one thread at a time, to start succeeding where they had been failing. As with new x86 instructions, the number of CUDA apps that stopped working because NVIDIA relaxed this restriction is zero, or close to it.
Unlike the FPCW, where a clear industry consensus seems to be driving toward more-granular control, switching CUDA contexts is sufficiently uncommon that having the context current to CPU threads still seems like the right tradeoff. If anything, with benefit of hindsight, it made sense to conflate contexts and devices, as the CUDA runtime did. In practice, CUDA now does this, with the abstraction “primary context” being preferred, and having multiple driver API contexts per device being strongly discouraged. But in the mid-2000s, introducing a new, needed abstraction would have been much more difficult than keeping two where one could be hidden at the behest of the toolchain and runtime, exposed later when it made sense (as CUDA modules were hidden, then exposed when NVIDIA decided to add runtime compilation of source code).
There is one major decision relating to CUDA contexts that I would’ve made differently, and that is the current-context stack. When we first shipped the ability to detach CUDA contexts (CUDA 2.2, if memory serves), I added cuCtxPushCurrent() and cuCtxPopCurrent() to push and pop the current context. Additionally, the specification for cuCtxCreate() was revised to state that if successful, the newly-created context was pushed onto the calling thread’s context stack.
When I was on-site at NVIDIA headquarters at some point during this design process, Chris Lamb tried patiently to get me to change this API before it shipped, but I was invested in the policy imposition baked into the stack, and he gave me the benefit of the doubt. The intention was to codify Case 3 – an explicit save and restore of the state – into the API set. As Chris pointed out then, and as is now reflected in CUDA’s current API set, developers are accustomed to get/set semantics, not push/pop semantics, and it would have been better to adhere to the conventional wisdom. The reason conventional wisdom is conventional is that it’s usually correct!