October | 2023 |

Fatalism Can Be Useful Sometimes

Introduction

A wise man once told me: “Scientists build in order to learn. Engineers learn in order to build.” One of the most exciting experiences for an engineer is realizing that new discoveries or technologies enable new possibilities, but part of an engineer’s job also is to save work by eliminating possibilities. For example, it sometimes comes in handy to know that nothing can travel faster than the speed of light in a vacuum.

I call this principle “The Utility of Futility,” and it is an ethos that more software engineers would do well to embrace. As a species, software engineers are incorrigible optimists, and we work in a profession where 90% solutions can be deceptively easy to develop. These characteristics may lead us to overlook opportunities to exploit the utility of futility. But before exploring some costly mistakes that could have been avoided by embracing this ethos, let’s review some related areas where system design was informed by a recognition of what’s not possible.

Speed Bumps

Our first story begins with a (metaphorical) speed bump.

Work on CUDA began in earnest in early 2005, and when I joined the team, the driver consisted of a few hundred lines of code. Most of that code implemented a handle allocator that would ‘allocate’ fixed-length memory buffers (e.g., context structures) that then could be referenced through integer handles. Internally, the driver then would translate these handles into pointers through pointer arithmetic on the buffer from which the index had been allocated.

Early on, we replaced these integer handles with so-called “opaque pointers,” replacing e.g.

typedef unsigned int CUcontext;

with:

typedef CUctx *CUcontext;

Note that the typedef does not declare the context structure – it declares a pointer to same. C/C++ clients of CUDA know that there is a context structure, but they do not know what it contains.
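To make the trade-off concrete, here is a minimal sketch of the two styles side by side. Every name in it (Ctx, CtxHandle, ctx_alloc, ctx_from_handle, CtxOpaque, ctx_device) is hypothetical and invented for illustration; it is not the CUDA driver's code, and the real context structure is not public. In a real public header, only a forward declaration of the structure would be visible to clients.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical context structure; in a real public header only the
 * forward declaration "struct Ctx;" would be exposed to clients. */
typedef struct Ctx { int device; } Ctx;

/* Style 1: integer handles that index a fixed-length table. */
#define MAX_CTX 4
static Ctx ctx_table[MAX_CTX];
static unsigned char ctx_in_use[MAX_CTX];

typedef unsigned int CtxHandle;   /* 0 is reserved to mean "invalid" */

/* Claim a free slot and return its handle, or 0 if the table is full. */
static CtxHandle ctx_alloc(int device)
{
    for (unsigned int i = 0; i < MAX_CTX; i++) {
        if (!ctx_in_use[i]) {
            ctx_in_use[i] = 1;
            ctx_table[i].device = device;
            return i + 1;
        }
    }
    return 0;
}

/* Translate a handle into a pointer, validating it on the way. */
static Ctx *ctx_from_handle(CtxHandle h)
{
    if (h == 0 || h > MAX_CTX || !ctx_in_use[h - 1])
        return NULL;
    return &ctx_table[h - 1];
}

/* Style 2: an opaque pointer. The handle already is the pointer, so
 * there is no table, no translation, and no validation code. */
typedef struct Ctx *CtxOpaque;

static int ctx_device(CtxOpaque c)
{
    assert(c != NULL);
    return c->device;
}
```

The handle style pays for its table and validation code on every call; the opaque pointer style deletes both, at the price of trusting the caller not to peek inside the structure.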

When this minor refactoring was done, it caused a stir and prompted an internal discussion on our development team. To some, the integer handles seemed more secure, because the driver had to run some code to translate them into pointers to the driver’s internal structures. And developers who reverse-engineered the structure’s layout and then took advantage by, say, hard-coding offsets from the structure pointer into their own applications, would be risking their own applications’ continued functionality. Such code definitely breaks compatibility with future versions of CUDA, among other things. Why not put a little speed bump in the path of such developers?

Applying the utility of futility, we decided that the benefits of the speed bump were outweighed by the additional complexity needed to implement a fixed-length allocator and handle validation. Any developers determined to reverse-engineer the layout of the CUDA context structure would be able to do so.

One can consider this particular “utility of futility” story to occupy a gray area. Most do. Intel is famed for backward compatibility, but every time they ship a new instruction set, they break compatibility in a subtle way: the new instructions—previously-invalid opcodes—execute and have architectural side effects, instead of signaling invalid-opcode exceptions! But any developer who ships software that relies on that behavior would elicit little sympathy: about the same amount as any developer who reverse-engineered the layout of an internal structure in CUDA.

The Leaking Nanny

Software that runs in data centers must be robust. If a server running hundreds of virtual machines loses power, it must resume running them after regaining power, having lost as little work as possible. Such robustness involves heroics like journaling disk traffic to solid state drives that have enough capacitance built into their power supplies to post all their pending writes in the event of a power failure. The hypervisor software has to be able to restart all those virtual machines, in as close a state as possible to whatever they were doing before the server lost power.

I once worked at a utility computing vendor that ran a management process whose memory usage would steadily increase, and no one could figure out why. There was a memory leak somewhere in the code, and the code was big and complicated and written in a garbage-collecting language that made it difficult to diagnose such issues. Eventually, the excess memory usage caused the server to fail.

The “utility of futility” solution to this problem: instead of fixing the memory leak, the vendor simply created a watchdog that monitored this process’s memory usage and, when it became too much, killed the process. Remember, this process had been built to be robust in the face of power failure, so getting summarily executed by a peer process is also a recoverable event.
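A watchdog of this kind can be very small. The sketch below is not the vendor's code; it assumes a Linux host, where /proc/<pid>/statm reports a process's resident memory in pages, and the function names (statm_resident_pages, over_limit, watchdog_poll) are hypothetical. A real nanny would call watchdog_poll() on a timer and rely on the surrounding system to restart the killed process.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Parse the resident-set size, in pages, from the text of a Linux
 * /proc/<pid>/statm file ("size resident shared ...").
 * Returns -1 if the text cannot be parsed. */
static long statm_resident_pages(const char *statm_text)
{
    long size, resident;
    if (sscanf(statm_text, "%ld %ld", &size, &resident) != 2)
        return -1;
    return resident;
}

/* Policy: the process is over the limit once its resident memory
 * exceeds limit_bytes. */
static int over_limit(long resident_pages, long page_bytes, long limit_bytes)
{
    assert(page_bytes > 0);
    return resident_pages >= 0 && resident_pages * page_bytes > limit_bytes;
}

/* One poll of the watchdog. Returns 1 if the process was killed,
 * 0 if it was left alone, and -1 on any error. */
static int watchdog_poll(pid_t pid, long limit_bytes)
{
    char path[64], buf[128];
    snprintf(path, sizeof path, "/proc/%ld/statm", (long)pid);

    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    char *line = fgets(buf, sizeof buf, f);
    fclose(f);
    if (!line)
        return -1;

    long pages = statm_resident_pages(buf);
    if (pages < 0)
        return -1;
    if (!over_limit(pages, sysconf(_SC_PAGESIZE), limit_bytes))
        return 0;

    /* No graceful shutdown: the victim must already survive power loss. */
    return kill(pid, SIGKILL) == 0 ? 1 : -1;
}
```

The sketch sends SIGKILL rather than asking politely, because the whole point is that the process's recovery path already handles abrupt termination.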

If the stopgap measure is using a feature that the system must deliver in any case, it needn’t stay a stopgap.

Exit On Malloc Failure

QEMU, the hardware emulator that enables virtualization for HVM guests on Xen, features an interesting engineering compromise: its internal memory allocator, the equivalent of malloc(), exits on failure. As a result, patches that check the return value from this function are rejected in code review – the function either succeeds, or does not return at all because the whole process exited. The reason: gracefully handling out-of-memory situations introduces too many opportunities for errors that are (presumably) rare and difficult to diagnose, and therefore a security risk. Since QEMU instances and their clients have to be robust in the same ways as the preceding system (e.g., recovering machines to their last-known states from before an adverse event such as a power failure), the “utility of futility” favored having malloc() exit rather than doing a prohibitively expensive and error-prone security analysis.
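The pattern itself fits in a few lines. This is a generic sketch rather than QEMU's actual allocator: xmalloc and dup_string are hypothetical names, and the sketch terminates with abort() where another implementation might call exit().

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical wrapper: returns valid memory or does not return at all. */
static void *xmalloc(size_t size)
{
    /* malloc(0) may legally return NULL, so always request at least 1. */
    void *p = malloc(size ? size : 1);
    if (p == NULL) {
        fprintf(stderr, "out of memory allocating %zu bytes\n", size);
        abort();   /* terminate the whole process; nobody handles this */
    }
    return p;
}

/* A caller has no failure path to write, review, or test. */
static char *dup_string(const char *s)
{
    assert(s != NULL);
    size_t n = strlen(s) + 1;
    char *copy = xmalloc(n);
    memcpy(copy, s, n);
    return copy;
}
```

Note that dup_string() contains no check of xmalloc()'s return value; under this convention, adding one would be dead code.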

Memory Probes

As a young software engineer who cut his teeth on microcomputers with no memory protection whatsoever (IBM PCs running MS-DOS, Macs running the original MacOS), I was excited to work on Windows, a platform with some semblance of memory protection (don’t laugh – before Apple bought NeXT for its UNIX-like operating system, the MacOS was not any more secure than MS-DOS). When you ran your program under a debugger, invalid memory references were flagged immediately.

But the Windows API had something more: functions IsBadReadPtr() and IsBadWritePtr() that could check the validity of a memory location. To a budding API designer, these functions seemed like the perfect opportunity to elevate my parameter validation game: if my caller passed an invalid memory range to one of my functions, I could return an error rather than just having the program crash in my function.

The problem with this API is that even 16-bit Windows was a multitasking operating system. Memory could become invalid as a consequence of code running elsewhere in the system. Even if you wanted to build a function that “validated” a memory range before using it, one could contrive a scenario – say, a context switch at an inopportune time – where the memory was subsequently invalidated by other code running in the system. If Microsoft had recognized the utility of this futility, they would not have built this API, since all it did was give a false sense of security.
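The gap between check and use can be shown deterministically with a small simulation. Nothing below calls the Windows API: Region, probe_ok, other_task_unmaps, and checked_read are hypothetical stand-ins, and the "context switch" is modeled as a callback that runs between the probe and the access.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a memory range whose validity other code
 * in the system can change at any time. */
typedef struct Region {
    int mapped;   /* nonzero while the memory is valid */
    int value;
} Region;

/* Stand-in for a probe such as IsBadReadPtr(): the answer is only
 * true at the instant of the call. */
static int probe_ok(const Region *r) { return r->mapped; }

/* Simulates other code that invalidates the memory. */
static void other_task_unmaps(Region *r) { r->mapped = 0; }

typedef void (*Interleave)(Region *);

/* Returns 0 on success, -1 if the probe rejected the region, and -2
 * if the access would have faulted even though the probe passed. */
static int checked_read(Region *r, Interleave between, int *out)
{
    assert(out != NULL);
    if (!probe_ok(r))
        return -1;            /* time of check */
    if (between)
        between(r);           /* inopportune context switch */
    if (!r->mapped)
        return -2;            /* time of use: a real program crashes here */
    *out = r->value;
    return 0;
}
```

Calling checked_read() with other_task_unmaps as the interleaved step returns -2: the probe said yes, and the access still failed.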

Note: In Windows NT, the structured exception handling (SEH) feature did, in fact, enable robust memory validation. Using SEH, the memory’s validity is evaluated at the time it is referenced, and not before, as when using a memory probe API. But in the intervening years, a consensus has developed among API designers that the costs of such memory validation outweigh the benefits. It is left as an exercise for the student to determine whether APIs that crash when you pass in NULL parameters are a manifestation of the utility of futility!

Microsoft WDDM: A Missed Opportunity

This utility of futility story is going to require more background. A lot more background.

One of the signature achievements for graphics in Windows Vista (c. 2007) was to move most of the graphics software stack into user mode. Or I should say, back into user mode. For Windows NT 4.0 (c. 1996), Microsoft had moved GDI (the Graphics Device Interface) into kernel mode, reportedly after Michael Abrash buttonholed Bill Gates at a party (never a robust process for decision-making), and presumably because Abrash wanted to continue writing optimized software rasterizers and believed overall performance would be better if they were running in kernel mode. Wrong on all counts, and fodder for another blog to be written someday.

By the time Windows XP shipped (c. 2001), it was abundantly clear that moving GDI into kernel mode had been a catastrophic mistake, because the entirety of display drivers got moved into kernel mode with it, and by the time Windows XP shipped, graphics drivers included things like pixel shader compilers that were waaay too unstable and used waaaay too much stack to run in kernel mode. (In Windows NT, kernel stacks are limited to 12K on x86, or 24K on x86-64 – big enough for reasonable kernel applications, but not for things like shader compilers that should not run in kernel mode at all.) In fact, when I ported NVIDIA’s graphics driver to x86-64, one of the things I had to do to get the compiler running was spawn another thread and delegate onto it just to buy another kernel stack. Fortunately, the shader compiler didn’t seem to need more than 2 kernel stacks, or I would’ve been tempted to build a kernel stack usage nanny that spawned threads on an as-needed basis just to emulate a larger stack!

By the time Windows XP shipped, there was widespread consensus that most of the graphics driver had to be moved back to user mode. But at Microsoft, the kernel team still harbored a deep distrust of graphics hardware vendors, fueled by a mid-1990s era of incredibly buggy hardware, operated by poorly written drivers that had to try to work around the buggy hardware. Back then, dozens of hardware vendors had been competing for OEMs’ business, and schedule slips could be fatal; as a result, many hardware bugs were left in, as long as they could be papered over by driver work (provided those driver workarounds did not have too much performance impact). The bulk of this activity was occurring on Microsoft’s Windows 95 platform, which was built on a completely separate code base from Windows NT. Cries from the NT kernel team, who wanted robust hardware and drivers, went unheard by hardware developers who were more concerned about their companies’ continued existence. The number of OEMs was as daunting as the number of graphics IHVs: graphics companies such as S3, ATI, Cirrus Logic, Tseng Labs, Matrox, Chips and Technologies, Oak Technology, Number Nine, and Trident were selling to OEMs such as Acer, AST, Compaq, Dell, Gateway 2000, HP, IBM, NCR, NEC, and Packard Bell. Both of these businesses were competitive, with new entrants funded either by large companies seeking to diversify, or startups seeking to parlay niche expertise into market share. Consumer electronics titans Samsung and Sharp entered the PC business, for example, while startups like 3Dfx, 3Dlabs, Rendition, and NVIDIA entered the graphics chip business.

Suffice it to say that in that competitive environment, graphics chip companies were in no mood to slip schedule just to make their hardware more robust for a workstation platform whose unit sales were a fraction of the consumer platform – even if the workstation platform represented the future.

By the early 2000s, the competitive landscape had shifted, for graphics IHVs at least. Companies that couldn’t deliver competitive performance or features were acquired, went out of business, or were relegated to the margins. Consolidation eventually reduced the major players to a handful: Intel, NVIDIA and ATI accounted for most unit sales. These companies all had the wherewithal to build robust hardware and drivers, but between ongoing fierce competition and vastly more complicated hardware, the vendors did little to earn back the trust of the Windows NT kernel team after losing it in the 1990s.

To understand the landscape, it’s important also to understand the organizational tension between the NT kernel team and the multimedia team that owned the Windows graphics stack. Much of the NT kernel team, led by the brilliant and cantankerous operating system architect Dave Cutler, had been recruited to Microsoft from Digital Equipment Corporation in 1987; in contrast, the multimedia team that owned the 3D graphics drivers had developed in the Windows 95 organization and been reorganized into the NT organization in 1997. So, as the multimedia team redesigned the graphics stack to move most code back into user mode, they were doing so under the watchful eye of skeptical kernel architects who did not particularly trust them or the vendors whose capabilities were being exposed by the multimedia team.[1]

Even the hardware interfaces had changed so much that they would’ve been unrecognizable to graphics chip architects of the mid-1990s. Instead of submitting work to the hardware by writing to memory-mapped registers (MMIO), the drivers allocated memory that could be read directly by the graphics chips (via direct memory access or DMA), filled those buffers with hardware commands, then dispatched that work to the graphics chips[2]. Given that the NT architecture required that hardware be accessed only from kernel mode, management of these “command buffers” presented a challenge to the multimedia team. For performance and platform security, the bulk of the code to construct these command buffers had to run in user mode; but in keeping with the NT architecture, the command buffers could only be dispatched from kernel mode.

To avoid extra copying, Microsoft designed the system so that hardware-specific commands would be written directly into these buffers by the user mode driver, since one vendor’s idea of a “draw triangle” command may differ from that of another. These commands would be queued up until the command buffer was full, or had to be submitted for some other reason; the system then would do a “kernel thunk” (transition from user to kernel mode), where the kernel mode driver would validate the buffer before submitting it to the hardware.

For those familiar with the NT architecture, the flaw in this design should be obvious, and is somewhat related to the preceding memory probe “utility of futility” story: since Windows is a multitasking operating system, the buffer can be corrupted during validation by the kernel mode driver. No amount of validation by the kernel mode driver can prevent corruption by untrusted user mode code between when the kernel mode driver is done with the validation, and when the hardware reads and executes the commands in the buffer.

It is, frankly, incredible to me that this platform vulnerability was not identified before the WDDM design was closed. The NT kernel team may not like having to trust graphics hardware, but as long as buffers can be corrupted by user mode code before the hardware can read them, the only way to build a robust platform is to have the hardware validate the commands.

Another way to protect from this race condition would be to unmap the buffer so user mode code wouldn’t be able to change it, but editing the page tables and propagating news of the newly-edited page tables (“TLB invalidations”) would be too costly.
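The race in this section can be modeled in a few lines. The sketch below is a simulation, not WDDM code: the opcodes, buffer size, and function names (validate, user_corrupts, submit_in_place, submit_snapshot) are all hypothetical, the untrusted user mode write is modeled as a callback, and "what the hardware reads" is approximated by re-running the validator at the moment of consumption. For contrast, it also shows the extra copy that the design set out to avoid: validating a private snapshot that user mode code cannot reach.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical command encoding: only these opcodes are legal. */
#define CMD_NOP   0u
#define CMD_DRAW  1u
#define CMD_MAX   1u
#define BUF_WORDS 4

typedef void (*UserHook)(unsigned *shared);

/* Kernel-side check: every word must be a known opcode. */
static int validate(const unsigned *buf)
{
    for (size_t i = 0; i < BUF_WORDS; i++)
        if (buf[i] > CMD_MAX)
            return 0;
    return 1;
}

/* Simulates untrusted user mode code scribbling on the shared buffer. */
static void user_corrupts(unsigned *shared) { shared[0] = 0xDEADu; }

/* In-place: validate the shared buffer, then let the hardware read the
 * same memory. Returns 1 only if the hardware saw valid commands. */
static int submit_in_place(unsigned *shared, UserHook racing_user)
{
    assert(shared != NULL);
    if (!validate(shared))
        return 0;
    if (racing_user)
        racing_user(shared);      /* window between validation and DMA */
    return validate(shared);      /* stands in for what the hardware reads */
}

/* Snapshot: copy first, validate the private copy, and hand the copy
 * to the hardware. Later writes to the shared buffer cannot matter. */
static int submit_snapshot(const unsigned *shared_in, unsigned *shared,
                           UserHook racing_user)
{
    unsigned priv[BUF_WORDS];
    memcpy(priv, shared_in, sizeof priv);
    if (!validate(priv))
        return 0;
    if (racing_user)
        racing_user(shared);      /* too late: the hardware reads priv */
    return validate(priv);
}
```

In the simulation, submit_in_place() passes validation and still delivers a corrupted buffer, while submit_snapshot() is immune at the cost of a copy on every submission.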

Conclusion

As you explore design spaces in software architecture, if you can prove that an interface is making promises it is not in a position to keep, don’t be afraid to invoke the Utility of Futility to simplify the system.

[1] Talking with members of the NT kernel team was illuminating, because they had a different relationship with hardware than did the multimedia team; they literally had fixed the Pentium FDIV bug by trapping and emulating the FDIV (floating point divide) instruction. But since FDIV is infrequently executed, emulating it incurred a modest performance penalty that would go unnoticed by end users without the aid of measurement tools. Even if graphics hardware were designed to be trapped and emulated, like the CPU instruction set, trapping and emulating graphics functionality in the NT kernel would incur a large-enough performance penalty that the resulting product would be unusable.

[2] These are commands such as “move this 2D block of pixels from here to there,” or “draw the following set of triangles.” The WDDM (the Windows Display Driver Model) architecture had many other features, such as supporting multitasking so many applications could do 3D rendering into the Windows desktop concurrently, but those details are not relevant to this “utility of futility” discussion.