On January 3rd, 2018, Microsoft released an advisory and security updates that relate to a new class of discovered hardware vulnerabilities, termed speculative execution side channels, that affect the design methodology and implementation decisions behind many modern microprocessors. This post dives into the technical details of Kernel Virtual Address (KVA) Shadow which is the Windows kernel mitigation for one specific speculative execution side channel: the rogue data cache load vulnerability (CVE-2017-5754, also known as “Meltdown” or “Variant 3”). KVA Shadow is one of the mitigations that is in scope for Microsoft’s recently announced Speculative Execution Side Channel bounty program.
It’s important to note that there are several different types of issues that fall under the category of speculative execution side channels, and that different mitigations are required for each type of issue. Additional information about the mitigations that Microsoft has developed for other speculative execution side channel vulnerabilities (“Spectre”), as well as additional background information on this class of issue, can be found here.
Please note that the information in this post is current as of the date of this post.
Vulnerability description & background
The rogue data cache load hardware vulnerability relates to how certain processors handle permission checks for virtual memory. Processors commonly implement a mechanism to mark virtual memory pages as owned by the kernel (sometimes termed supervisor), or as owned by user mode. While executing in user mode, the processor prevents accesses to privileged kernel data structures by way of raising a fault (or exception) when an attempt is made to access a privileged, kernel-owned page. This protection of kernel-owned pages from direct user mode access is a key component of privilege separation between kernel and user mode code.
Certain processors capable of speculative out-of-order execution, including many currently in-market processors from Intel, and some ARM-based processors, are susceptible to a speculative side channel that is exposed when an access to a page incurs a permission fault. On these processors, an instruction that performs an access to memory that incurs a permission fault will not update the architectural state of the machine. However, these processors may, under certain circumstances, still permit a faulting internal memory load µop (micro-operation) to forward the result of the load to subsequent, dependent µops. These processors can be said to defer handling of permission faults to instruction retirement time.
Out of order processors are obligated to “roll back” the architecturally-visible effects of speculative execution down paths that are proven to have never been reachable during in-program-order execution, and as such, any µops that consume the result of a faulting load are ultimately cancelled and rolled back by the processor once the faulting load instruction retires. However, these dependent µops may still have issued subsequent cache loads based on the (faulting) privileged memory load, or otherwise may have left additional traces of their execution in the processor’s caches. This creates a speculative side channel: the remnants of cancelled, speculative µops that operated on the data returned by a load incurring a permission fault may be detectable through disturbances to the processor cache, and this may enable an attacker to infer the contents of privileged kernel memory that they would not otherwise have access to. In effect, this enables an unprivileged user mode process to disclose the contents of privileged kernel mode memory.
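The inference step can be illustrated with a toy model. The sketch below is a simulation only and not an exploit: the cache is a plain Python set, and the names `ToyCache`, `speculative_touch`, `probe`, and `leak_byte` are hypothetical and introduced purely for illustration. It shows how a cache line selected by a secret value stays observable even though the architectural result of the load is discarded.

```python
# Toy model of a cache-based side channel; it simulates the idea only.
LINE_COUNT = 256  # one probe line per possible byte value


class ToyCache:
    """Tracks which probe lines are currently cached."""

    def __init__(self):
        self.cached = set()

    def flush(self):
        self.cached.clear()

    def touch(self, line):
        self.cached.add(line)

    def is_fast(self, line):
        # On real hardware a cached line loads measurably faster.
        return line in self.cached


def speculative_touch(cache, secret_byte):
    """Stands in for the cancelled dependent micro-ops: their architectural
    result is rolled back, but the line indexed by the secret stays cached."""
    cache.touch(secret_byte)


def probe(cache):
    """Attacker side: find the single probe line that is 'fast'."""
    hits = [line for line in range(LINE_COUNT) if cache.is_fast(line)]
    return hits[0] if len(hits) == 1 else None


def leak_byte(secret_byte):
    cache = ToyCache()
    cache.flush()
    speculative_touch(cache, secret_byte)
    return probe(cache)
```

In this model, `leak_byte(0x41)` recovers `0x41` without the secret ever being returned by `speculative_touch`, which mirrors how the disturbance to the cache, and not the load itself, carries the information.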
Operating system implications
Most operating systems, including Windows, rely on per-page user/kernel ownership permissions as a cornerstone of enforcing privilege separation between kernel mode and user mode. A speculative side channel that enables unprivileged user mode code to infer the contents of privileged kernel memory is problematic given that sensitive information may exist in the kernel’s address space. Mitigating this vulnerability on affected, in-market hardware is especially challenging, as user/kernel ownership page permissions must be assumed to no longer prevent the disclosure (i.e., reading) of kernel memory contents from user mode. Thus, on vulnerable processors, the rogue data cache load vulnerability impacts the primary tool that modern operating system kernels use to protect themselves from privileged kernel memory disclosure by untrusted user mode applications.
In order to protect kernel memory contents from disclosure on affected processors, it is thus necessary to go back to the drawing board with how the kernel isolates its memory contents from user mode. With the user/kernel ownership permission no longer effectively safeguarding against memory reads, the only other broadly-available mechanism to prevent disclosure of privileged kernel memory contents is to entirely remove all privileged kernel memory from the processor’s virtual address space while executing user mode code.
This, however, is problematic, in that applications frequently make system service calls to request that the kernel perform operations on their behalf (such as opening or reading a file on disk). These system service calls, as well as other critical kernel functions such as interrupt processing, can only be performed if their requisite, privileged code and data are mapped in to the processor’s address space. This presents a conundrum: in order to meet the security requirements of kernel privilege separation from user mode, no privileged kernel memory may be mapped into the processor’s address space, and yet in order to reasonably handle any system service call requests from user mode applications to the kernel, this same privileged kernel memory must be quickly accessible for the kernel itself to function.
The solution to this quandary is to switch the processor’s address space on every transition between kernel mode and user mode, alternating between a kernel address space (which maps the entire user and kernel address space), and a shadow user address space (which maps the entire user memory contents of a process, but only a minimal subset of kernel mode transition code and data pages needed to switch into and out of the kernel address space). The select set of privileged kernel code and data transition pages handling the details of these address space switches, which are “shadowed” into the user address space, are “safe” in that they do not contain any privileged data that would be harmful to the system if disclosed to an untrusted user mode application. In the Windows kernel, the usage of this disjoint set of shadow address spaces for user and kernel modes is called “kernel virtual address shadowing”, or KVA shadow, for short.
In order to support this concept, each process may now have up to two address spaces: the kernel address space and the user address space. As there is no virtual memory mapping for other, potentially sensitive privileged kernel data when untrusted user mode code executes, the rogue data cache load speculative side channel is completely mitigated. This approach is not, however, without substantial complexity and performance implications, as will later be discussed.
On a historical note, some operating systems previously have implemented similar mechanisms for a variety of different and unrelated reasons: For example, in 2003 (prior to the common introduction of 64-bit processors in most broadly-available consumer hardware), with the intention of addressing larger amounts of virtual memory on 32-bit systems, optional support was added to the 32-bit x86 Linux kernel in order to provide a 4GB virtual address space to user mode, and a separate 4GB address space to the kernel, requiring address space switches on each user/kernel transition. More recently, a similar approach, termed KAISER, has been advocated to mitigate information leakage about the kernel virtual address space layout due to processor side channels. This is distinct from the rogue data cache load speculative side channel issue, in that no kernel memory contents, as opposed to address space layout information, were at the time considered to be at risk prior to the discovery of speculative side channels.
KVA shadow implementation in the Windows kernel
While the design requirements of KVA shadow may seem relatively innocuous (privileged kernel-mode memory must not be mapped in to the address space when untrusted user mode code runs), the implications of these requirements are far-reaching throughout Windows kernel architecture. This touches a substantial number of core facilities for the kernel, such as memory management, trap and exception dispatching, and more. The situation is further complicated by a requirement that the same kernel code and binaries must be able to run with and without KVA shadow enabled. Performance of the system in both configurations must be maximized, while simultaneously attempting to keep the scope of the changes required for KVA shadow as contained as possible. This maximizes maintainability of code in both KVA shadow and non-KVA-shadow configurations.
This section focuses primarily on the implications of KVA shadow for the 64-bit x86 (x64) Windows kernel. Most considerations for KVA shadow on x64 also apply to 32-bit x86 kernels, though there are some divergences between the two architectures. This is due to ISA differences between 64-bit and 32-bit modes, particularly with trap and exception handling.
Please note that the implementation details described in this section are subject to change without notice in the future. Drivers and applications must not take dependencies on any of the internal behaviors described below without first checking for updated documentation.
The best way to understand the complexities involved with KVA shadow is to start with the underlying low-level interface in the kernel that handles the transitions between user mode and kernel mode. This interface, called the trap handling code, is responsible for fielding traps (or exceptions) that may occur from either kernel mode or user mode. It is also responsible for dispatching system service calls and hardware interrupts. There are several events that the trap handling code must handle, but the most relevant for KVA shadow are those called “kernel entry” and “kernel exit” events. These events, respectively, involve transitions from user mode into kernel mode, and from kernel mode into user mode.
Trap handling and system service call dispatching overview and retrospective
As a quick recap of how the Windows kernel dispatches traps and exceptions on x64 processors, traditionally, the kernel programs the current thread’s kernel stack pointer into the current processor’s TSS (task state segment), specifically into the KTSS64.Rsp0 field, which informs the processor which stack pointer (RSP) value to load up on a ring transition to ring 0 (kernel mode) code. This field is traditionally updated by the kernel on context switch, and several other related internal events; when a switch to a different thread occurs, the processor KTSS64.Rsp0 field is updated to point to the base of the new thread’s kernel stack, such that any kernel entry event that occurs while that thread is running enters the kernel already on that thread’s stack. The exception to this rule is that of system service calls, which typically enter the kernel with a “syscall” instruction; this instruction does not switch the stack pointer and it is the responsibility of the operating system trap handling code to manually load up an appropriate kernel stack pointer.
On typical kernel entry, the hardware has already pushed what is termed a “machine frame” (internally, MACHINE_FRAME) on the kernel stack; this is the processor-defined data structure that the IRETQ instruction consumes and removes from the stack to effect an interrupt-return, and includes details such as the return address, code segment, stack pointer, stack segment, and processor flags on the calling application. The trap handling code in the Windows kernel builds a structure called a trap frame (internally, KTRAP_FRAME) that begins with the hardware-pushed MACHINE_FRAME, and then contains a variety of software-pushed fields that describe the volatile register state of the context that was interrupted. System calls, as noted above, are an exception to this rule, and must manually build the entire KTRAP_FRAME, including the MACHINE_FRAME, after effecting a stack switch to an appropriate kernel stack for the current thread.
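As a rough illustration of the two structures, the sketch below lays them out with Python's `ctypes`. The five-quadword machine frame follows the x64 interrupt frame layout; the field names, the reduced set of volatile registers, and the `machine_frame_for_syscall` helper are assumptions made for this example and are not the Windows definitions of MACHINE_FRAME or KTRAP_FRAME.

```python
import ctypes


class MachineFrame(ctypes.Structure):
    """Hardware-pushed interrupt frame in 64-bit mode, lowest address first.
    Field names are illustrative, not the Windows definition."""
    _fields_ = [
        ("rip", ctypes.c_uint64),     # return address
        ("cs", ctypes.c_uint64),      # code segment, padded to 8 bytes
        ("rflags", ctypes.c_uint64),  # processor flags
        ("rsp", ctypes.c_uint64),     # interrupted stack pointer
        ("ss", ctypes.c_uint64),      # stack segment, padded to 8 bytes
    ]


class TrapFrame(ctypes.Structure):
    """Hypothetical, heavily reduced trap frame. The stack grows downward, so
    the machine frame (pushed first) sits at the highest addresses and the
    software-saved volatile registers sit below it."""
    _fields_ = [
        ("rax", ctypes.c_uint64),
        ("rcx", ctypes.c_uint64),
        ("rdx", ctypes.c_uint64),
        ("r8", ctypes.c_uint64),
        ("r9", ctypes.c_uint64),
        ("r10", ctypes.c_uint64),
        ("r11", ctypes.c_uint64),
        ("machine", MachineFrame),
    ]


def machine_frame_for_syscall(rcx, r11, user_rsp, user_cs, user_ss):
    """SYSCALL pushes nothing: it leaves the return RIP in RCX and the saved
    RFLAGS in R11, so software must assemble the machine frame itself."""
    return MachineFrame(rip=rcx, cs=user_cs, rflags=r11, rsp=user_rsp, ss=user_ss)
```

The syscall helper reflects the point made above: for interrupts and exceptions the processor fills in these five fields, whereas on the system call path the trap handling code has to construct them from register state.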
KVA shadow trap and system service call dispatching design considerations
With a basic understanding of how traps are handled without KVA shadow, let’s dive into the details of the KVA shadow-specific considerations of trap handling in the kernel.
When designing KVA shadow, several design considerations applied for trap handling when KVA shadow was active, namely, that the security requirements were met, that performance impact on the system was minimized, and that changes to the trap handling code were kept as compartmentalized as possible in order to simplify code and improve maintainability. For example, it is desirable to share as much trap handling code between the KVA shadow and non-KVA shadow configurations as practical, so that it is easier to make changes to the kernel’s trap handling facilities in the future.
When KVA shadowing is active, user mode code typically runs with the user mode address space selected. It is the responsibility of the trap handling code to switch to the kernel address space on kernel entry, and to switch back to the user address space on kernel exit. However, additional details apply: it is not sufficient to simply switch address spaces, because the only transition kernel pages that can be permitted to exist (or be “shadowed into”) in the user address space are only those that hold contents that are “safe” to disclose to user mode. The first complication that KVA shadow encounters is that it would be inappropriate to shadow the kernel stack pages for each thread into the user mode address space, as this would allow potentially sensitive, privileged kernel memory contents on kernel thread stacks to be leaked via the rogue data cache load speculative side channel.
It is also desirable to keep the set of code and data structures that are shadowed into the user mode address space to a minimum, and if possible, to only shadow permanent fixtures in the address space (such as portions of the kernel image itself, and critical per-processor data structures such as the GDT (Global Descriptor Table), IDT (Interrupt Descriptor Table), and TSS). This simplifies memory management, as handling setup and teardown of new mappings that are shadowed into user mode address spaces has associated complexities, as would enabling any shadowed mappings to become pageable. For these reasons, it was clear that it would not be acceptable for the kernel’s trap handling code to continue to use the per-kernel-thread stack for kernel entry and kernel exit events. Instead, a new approach would be required.
The solution that was implemented for KVA shadow was to switch to a mode of operation wherein a small set of per-processor stacks (internally called KTRANSITION_STACKs) are the only stacks that are shadowed into the user mode address space. Eight of these stacks exist for each processor, the first of which represents the stack used for “normal” kernel entry events, such as exceptions, page faults, and most hardware interrupts, and the remaining seven transition stacks represent the stacks used for traps that are dispatched using the x64-defined IST (Interrupt Stack Table) mechanism (note that Windows does not use all 7 possible IST stacks presently).
When KVA shadow is active, then, the KTSS64.Rsp0 field of each processor points to the first transition stack of each processor, and each of the KTSS64.Ist[n] fields point to the n-th KTRANSITION_STACK for that processor. For convenience, the transition stacks are located in a contiguous region of memory, internally termed the KPROCESSOR_DESCRIPTOR_AREA, that also contains the per-processor GDT, IDT, and TSS, all of which are required to be shadowed into the user mode address space for the processor itself to be able to handle ring transitions properly. This contiguous memory block is, itself, shadowed in its entirety.
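The relationship between the TSS fields and the transition stacks can be sketched as simple address arithmetic. In the example below, the stack size, the placement of the stacks at the start of the area, and the function names are all hypothetical; only the count of eight stacks and the mapping of Rsp0 and Ist[n] to them come from the description above.

```python
TRANSITION_STACK_COUNT = 8     # from the text: one "normal" stack plus seven IST stacks
TRANSITION_STACK_SIZE = 0x200  # hypothetical size in bytes


def describe_transition_stacks(area_base):
    """Return the stack pointers a TSS would hold for one processor.

    Stacks grow downward, so each pointer is the highest address of its stack.
    """
    tops = [area_base + (i + 1) * TRANSITION_STACK_SIZE
            for i in range(TRANSITION_STACK_COUNT)]
    return {
        "rsp0": tops[0],  # used for normal kernel entry events
        "ist": {n: tops[n] for n in range(1, TRANSITION_STACK_COUNT)},  # IST1..IST7
    }


def on_transition_stack(address, area_base):
    """True if a stack pointer value falls inside the shadowed stack region."""
    end = area_base + TRANSITION_STACK_COUNT * TRANSITION_STACK_SIZE
    return area_base < address <= end
```

Keeping the stacks contiguous means a single range check, like `on_transition_stack`, is enough to tell whether a given stack pointer refers to shadowed memory.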
This configuration ensures that when a kernel entry event is fielded while KVA shadow is active, the current stack is both shadowed into the user mode address space, and does not contain sensitive memory contents that would be risky to disclose to user mode. However, in order to maintain these properties, the trap dispatch code must be careful to push no sensitive information onto any transition stack at any time. This necessitates the first several rules for KVA shadow in order to prevent any other memory contents from being stored onto the transition stacks: when executing on a transition stack, the kernel must be fielding a kernel entry or kernel exit event, interrupts must be disabled and must remain disabled throughout, and the code executing on a transition stack must be careful to never incur any other type of kernel trap. This also implies that the KVA shadow trap dispatch code can assume that traps arising in kernel mode already are executing with the correct CR3, and on the correct kernel stack (except for some special considerations for IST-delivered traps, as discussed below).
Fielding a trap with KVA shadow active
Based on the above design decisions, there is an additional set of tasks specific to KVA shadowing that must occur prior to the normal trap handling code in the kernel being invoked for kernel entry trap events. In addition, there is a similar set of tasks related to KVA shadow that must occur at the end of trap processing, if a kernel exit is occurring.
On normal kernel entry, the following sequence of events must occur:
- The kernel GS base value must be loaded. This enables the remaining trap code to access per-processor data structures, such as those that hold the kernel CR3 value for the current processor.
- The processor’s address space must be switched to the kernel address space, so that all kernel code and data are accessible (i.e., the kernel CR3 value must be loaded). This necessitates that the kernel CR3 value must be stored in a location that is, itself, shadowed. For the purposes of KVA shadow, a single per-processor KPRCB page that contains only “safe” contents maintains a copy of the current processor’s kernel CR3 value for easy access to the KVA shadow trap dispatch code. Context switch between address spaces, and process attach/detach update the corresponding KPRCB fields with the new CR3 value on process address space changes.
- The machine frame previously pushed by hardware as a part of the ring transition from user mode to kernel mode must be copied from the current (transition) stack, to the per-kernel-thread stack for the current thread.
- The current stack must be switched to the per-kernel-thread stack. At this point, the “normal” trap handling code can largely proceed as usual, and without invasive modifications (save that the kernel GS base has already been loaded).
Roughly speaking, the inverse sequence of events must occur on normal kernel exit; the machine frame at the top of the current kernel thread stack must be copied to the transition stack for the processor, the stacks must be switched, CR3 must be reloaded with the corresponding value for the user mode address space of the current process, the user mode GS base must be reloaded, and then control may be returned to user mode.
System service call entry and exit through the SYSCALL/SYSRETQ instruction pair is handled slightly specially, in that the processor does not already push a machine frame, because the kernel logically does not have a current stack pointer until it explicitly loads one. In this case, no machine frame needs be copied on kernel entry and kernel exit, but the other basic steps must still be performed.
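The entry and exit sequences described above can be sketched as a small state model. This is a simulation under stated assumptions, not the kernel's code: the `Cpu` class, the string placeholders for CR3 and GS base values, and the function names are hypothetical. Passing no machine frame models the SYSCALL path, where the hardware pushes nothing and the copy step is skipped.

```python
from dataclasses import dataclass, field

KERNEL_CR3, USER_CR3 = "kernel-cr3", "user-cr3"  # placeholders for page table roots


@dataclass
class Cpu:
    cr3: str = USER_CR3
    gs: str = "user-gs"
    stack: str = "user"
    transition_stack: list = field(default_factory=list)
    thread_stack: list = field(default_factory=list)
    steps: list = field(default_factory=list)


def kernel_entry(cpu, machine_frame=None):
    """Entry from user mode. machine_frame is None for SYSCALL."""
    if machine_frame is not None:
        cpu.stack = "transition"                    # hardware loaded RSP from the TSS
        cpu.transition_stack.append(machine_frame)  # hardware push
    cpu.gs = "kernel-gs"
    cpu.steps.append("load kernel GS base")
    cpu.cr3 = KERNEL_CR3
    cpu.steps.append("load kernel CR3")
    if machine_frame is not None:
        cpu.thread_stack.append(cpu.transition_stack.pop())
        cpu.steps.append("copy machine frame")
    cpu.stack = "thread"
    cpu.steps.append("switch to thread stack")


def kernel_exit(cpu, via_iretq=True):
    """Return to user mode; roughly the inverse of kernel_entry."""
    frame = None
    if via_iretq:
        cpu.transition_stack.append(cpu.thread_stack.pop())
        cpu.stack = "transition"
    cpu.cr3 = USER_CR3
    cpu.gs = "user-gs"
    if via_iretq:
        frame = cpu.transition_stack.pop()  # consumed by IRETQ
    cpu.stack = "user"
    return frame
```

Note that in this model the transition stack only ever holds the machine frame, and only while the processor is between the two address spaces, which matches the rule that nothing sensitive may be stored there.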
Special care needs to be taken by the KVA shadow trap dispatch code for NMI, machine check, and double fault type trap events, because these events may interrupt even normally uninterruptable code. This means that they could even interrupt the normally uninterruptable KVA shadow trap dispatch code itself, during a kernel entry or kernel exit event. These types of traps are delivered using the IST mechanism onto their own distinct transition stacks, and the trap handling code must carefully handle the case of the GS base or CR3 value being in any state due to the indeterminate state of the machine at the time in which these events may occur, and must preserve the pre-existing GS base or CR3 values.
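The save-and-restore requirement for these traps can be sketched as follows. This is a minimal model, assuming the processor state is just a dictionary holding a CR3 and a GS base; the constant values and the name `field_ist_trap` are hypothetical. The point it illustrates is that the handler cannot assume which state it interrupted, so it records whatever it finds and puts it back afterwards.

```python
KERNEL_CR3, KERNEL_GS = "kernel-cr3", "kernel-gs"  # placeholder values


def field_ist_trap(cpu, handler):
    """Run handler with the kernel CR3 and GS base loaded, then restore
    whatever values were present when the trap arrived."""
    saved_cr3, saved_gs = cpu["cr3"], cpu["gs"]
    cpu["cr3"], cpu["gs"] = KERNEL_CR3, KERNEL_GS
    try:
        return handler(cpu)
    finally:
        # The interrupted code may have been halfway through a kernel entry
        # or exit, so restore the exact state it was in.
        cpu["cr3"], cpu["gs"] = saved_cr3, saved_gs
```

For example, if the trap arrives after the kernel GS base has been loaded but before CR3 has been switched, the state `{"cr3": "user-cr3", "gs": "kernel-gs"}` is preserved across the call, so the interrupted entry sequence can resume where it left off.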
At this point, the basics for how to enter and exit the kernel with KVA shadow are in place. However, it would be undesirable to inline the KVA shadow trap dispatch code into the standard trap entry and trap exit code paths, as the standard trap entry and trap exit code paths could be located anywhere in the kernel’s .text code section, and it is desirable to minimize the amount of code that needs be shadowed into the user address space. For this reason, the KVA shadow trap dispatch code is collected into a series of parallel entry points packed within their own code section within the kernel image, and either the standard set of trap entry points, or the KVA shadow trap entry points are installed into the IDT at system boot time, based on whether KVA shadow is in use at system boot. Similarly, the system service call entry points are also located in this special code section in the kernel image.
Note that one implication of this design choice is that KVA shadow does not protect against attacks against kernel ASLR using speculative side channels. This is a deliberate decision given the design complexity of KVA shadow, timelines involved, and the realities of other side channel issues affecting the same processor designs. Notably, processors susceptible to rogue data cache load are also typically susceptible to other attacks on their BTBs (branch target buffers), and other microarchitectural resources that may allow kernel address space layout disclosure to a local attacker that is executing arbitrary native code.
Memory management considerations for KVA shadow
Now that KVA shadow is able to handle trap entry and trap exit, it’s necessary to understand the implications of KVA shadowing on memory management. As with the trap handling design considerations for KVA shadow, ensuring the correct security properties, providing good performance characteristics, and maximizing the maintainability of code changes were all important design goals. Where possible, rules were established to simplify the memory management design implementation. For example, all kernel allocations that are shadowed into the user mode address space are shadowed system-wide and not per-process or per-processor. As another example, all such shadowed allocations exist at the same kernel virtual address in both the user mode and kernel mode address spaces and share the same underlying physical pages in both address spaces, and all such allocations are considered nonpageable and are treated as though they have been locked into memory.
The most apparent memory management consequence of KVA shadowing is that each process typically now needs a separate address space (i.e., page table hierarchy, or top level page directory page) allocated to describe the shadow user address space, and that the top level page directory entries corresponding to user mode VAs must be replicated from the process’s kernel address space top level page directory page to the process’s user address space top level page directory page.
The top level page directory page entries for the kernel half of the VA space are not replicated, however, and instead only correspond to a minimal set of page table pages needed to map the small subset of pages that have been explicitly shadowed into the user mode address space. As noted above, pages that are shadowed into the user mode address space are left nonpageable for simplicity. In practice, this is not a substantial hardship for KVA shadow, as only a very small number of fixed allocations are ever shadowed system-wide. (Remember that only the per-processor transition stacks are shadowed, not any per-thread data structures, such as per-thread kernel stacks.)
Memory management must then replicate any updates to top level user mode page directory page entries between the two process address spaces, as any updates occur, and access bit handling for working set aging and other purposes must logically OR the access bits from both user and kernel address spaces together if a top level page directory page entry is being considered (and, similarly, working set aging must clear access bits in both top level page directory pages if a top level entry is being considered). Similarly, memory management must be aware of both address spaces that may exist for processes in various other edge-cases where top-level page directory pages are manipulated.
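The replication and access bit rules can be sketched with two top level tables per process. The model below assumes the x64 layout of 512 top level entries, with the lower 256 covering user mode VAs, and the x64 “accessed” bit at bit 5 of an entry; the class and method names are hypothetical and the entries are plain integers.

```python
TOP_LEVEL_ENTRIES = 512
USER_ENTRIES = 256     # entries 0..255 cover the user half of the VA space
PTE_ACCESSED = 1 << 5  # x64 "accessed" bit in a paging-structure entry


class ProcessAddressSpaces:
    def __init__(self, shadowed_kernel_entries):
        self.kernel_top = [0] * TOP_LEVEL_ENTRIES
        self.user_top = [0] * TOP_LEVEL_ENTRIES
        # Kernel-half entries that map only the explicitly shadowed pages.
        for index, entry in shadowed_kernel_entries.items():
            if index < USER_ENTRIES:
                raise ValueError("shadowed entries belong to the kernel half")
            self.kernel_top[index] = entry
            self.user_top[index] = entry

    def set_user_entry(self, index, entry):
        if index >= USER_ENTRIES:
            raise ValueError("not a user mode entry")
        self.kernel_top[index] = entry
        self.user_top[index] = entry  # replicate into the shadow address space

    def set_kernel_entry(self, index, entry):
        if index < USER_ENTRIES:
            raise ValueError("not a kernel mode entry")
        self.kernel_top[index] = entry  # deliberately not replicated

    def accessed(self, index):
        # Either table may have been used by the processor: OR the bits.
        combined = self.kernel_top[index] | self.user_top[index]
        return bool(combined & PTE_ACCESSED)

    def clear_accessed(self, index):
        self.kernel_top[index] &= ~PTE_ACCESSED
        self.user_top[index] &= ~PTE_ACCESSED
```

Because the processor sets the accessed bit only in whichever table it walked, checking a single table could wrongly age out a region that was in active use through the other address space, which is why both copies are consulted and cleared together.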
Finally, no general purpose kernel allocations can be marked as “global” in their corresponding leaf page table entries by the kernel. For KVA shadow protections to be effective, processors susceptible to rogue data cache load must not be able to observe, while in user mode, any cached virtual address translations for privileged kernel pages that could contain sensitive memory contents, and such global entries would still be cached in the processor translation buffer (TB) across an address space switch.
Booting is just the beginning of a journey
At this point, we have covered some of the major areas involved in the kernel with respect to KVA shadow. However, there’s much more that’s involved beyond just trap handling and memory management: For example, changes to how Windows handles multiprocessor initialization, hibernate and resume, processor shutdown and reboot, and many other areas were all required in order to make KVA shadow into a fully featured solution that works correctly in all supported software configurations.
Furthermore, preventing the rogue data cache load issue from exposing privileged kernel mode memory contents is just the beginning of turning KVA shadow into a feature that could be shipped to a diverse customer base. So far, we have only touched on the basics of the highlights of an unoptimized implementation of KVA shadow on x64 Windows. We’re far from done examining KVA shadowing, however; a substantial amount of additional work was still required in order to reduce the performance overhead of KVA shadow to the absolute minimum possible. As we’ll see, there are a number of options that have been considered and employed to that end with KVA shadow. The below optimizations are already included with the January 3rd, 2018 security updates to address rogue data cache load.
Performance optimizations
One of the primary challenges faced by the implementation of KVA shadow was maximizing system performance. The model of a unified, flat address space shared between user and kernel mode, with page permission bits to protect kernel-owned pages from access by unprivileged user mode code, is both convenient for an operating system kernel to implement, and easily amenable to high performance user/kernel transitions.
The reason why the traditional, unified address space model allows for fast user/kernel transitions relates to how processors handle virtual memory. Processors typically cache previously fetched virtual address translations in a small internal cache that is termed a translation buffer (or TB, for short); some literature also refers to these types of address translation caches as translation lookaside buffers (or TLBs for short). The processor TB operates on the principle of locality: if an application (or the kernel) has referenced a particular virtual address translation recently, it is likely to do so again, and the processor can save the costly process of re-walking the operating system’s page table hierarchy if the requisite translation is already cached in the processor TB.
Traditionally, a TB contains information that is primarily local to a particular address space (or page table hierarchy), and when a switch to a different page table hierarchy occurs, such as with a context switch between threads in different processes, the processor TB must be flushed so that translations from one process are not improperly used in the context of a different process. This is critical, as two processes can, and frequently do, map the same user mode virtual address to completely different physical pages.
KVA shadowing requires switching address spaces much more frequently than operating systems have traditionally done, however; on processors susceptible to the rogue data cache load issue, it is now necessary to switch the address space on every user/kernel transition, and these transitions are vastly more frequent events than cross-process context switches. In the absence of any further optimizations, the fact that the processor TB is flushed and invalidated on each user/kernel transition would substantially reduce the benefit of the processor TB, and would represent a significant performance cost on the system.
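The cost of flushing on every transition can be made concrete with a small counting model. The sketch below assumes an unbounded TB, a fixed set of pages touched on each side of the transition, and a hypothetical `count_misses` function; real workloads and real TB capacities will behave differently, so the numbers are illustrative only.

```python
def count_misses(round_trips, user_pages, kernel_pages, flush_on_switch):
    """Count TB misses for a workload that alternates between user and
    kernel mode, touching the same pages in each mode every time."""
    tb = set()
    misses = 0
    for _ in range(round_trips):
        for mode, pages in (("user", user_pages), ("kernel", kernel_pages)):
            if flush_on_switch:
                tb.clear()  # address space switch invalidates the TB
            for page in pages:
                key = (mode, page)
                if key not in tb:
                    misses += 1  # page table walk required
                    tb.add(key)
    return misses
```

With 10 round trips touching 4 user pages and 4 kernel pages each, this model incurs 80 misses when the TB is flushed on every switch and only 8 when it is preserved, since each translation then needs to be fetched just once.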
Fortunately, there are some techniques that the Windows KVA shadow implementation employs to substantially mitigate the performance costs of KVA shadowing on processor hardware that is susceptible to rogue data cache load. Optimizing KVA shadow for maximum performance presented a challenging exercise in finding creative ways to make use of existing, in-the-field hardware capabilities, sometimes outside the scope of their original intended use, while still maintaining system security and correct system operation, but several techniques have been developed to substantially reduce the cost.
PCID acceleration
The first optimization, the usage of PCID (process-context identifier) acceleration, is relevant to Intel Core-family processors of Haswell and newer microarchitectures. While the TB on many processors traditionally maintained information local to an address space, and which had to be flushed on any address space switch, the PCID hardware capability allows address translations to be tagged with a logical PCID that informs the processor which address space they are relevant to. An address space (or page table hierarchy) can be tagged with a distinguished PCID value, and this tag is maintained with any non-global translations that are cached in the processor’s TB; then, on address space switch to an address space with a different associated PCID, the processor can be instructed to preserve the previous TB contents. Because the processor requires the current address space’s PCID to match that of any cached translation in the TB for the purposes of matching any translation lookups in the TB, address translations from multiple address spaces can now be safely represented concurrently in the processor TB.
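On x64 processors with PCID enabled, the tag and the preserve request are both carried in the value written to CR3: bits 11:0 hold the PCID, and setting bit 63 asks the processor not to invalidate cached translations for that PCID. The sketch below composes such a value; the function name `make_cr3` and the example addresses are hypothetical.

```python
CR3_PCID_MASK = 0xFFF   # bits 11:0 hold the PCID when PCID support is enabled
CR3_NO_FLUSH = 1 << 63  # bit 63 set: keep cached translations for that PCID


def make_cr3(top_level_phys, pcid, preserve_tb):
    """Compose a CR3 value from a page table root, a PCID, and a flush choice."""
    if top_level_phys & 0xFFF:
        raise ValueError("page table root must be 4 KiB aligned")
    if not 0 <= pcid <= CR3_PCID_MASK:
        raise ValueError("PCID must fit in 12 bits")
    value = top_level_phys | pcid
    if preserve_tb:
        value |= CR3_NO_FLUSH
    return value
```

For instance, `make_cr3(0x1AB000, 1, True)` yields `0x80000000001AB001`: the same page table root and PCID as a flushing switch, with only the top bit distinguishing the two requests.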
On hardware that is PCID-capable and which requires KVA shadowing, the Windows kernel employs two distinguished PCID values, which are internally termed PCID_KERNEL and PCID_USER. The kernel address space is tagged with PCID_KERNEL, and the user address space is tagged with PCID_USER, and on each user/kernel transition, the kernel will typically instruct the processor to preserve the TB contents when switching address spaces. This enables the preservation of the entire TB contents on system service calls and other high frequency user/kernel transitions, and in many workloads, substantially mitigates almost all of the cost of KVA shadowing. Some duplication of TB entries between user and kernel mode is possible if the same user mode VA is referenced by user and kernel code, and additional processing is also required on some types of TB flushes, as certain types of TB flushes (such as those that invalidate user mode VAs) must be replicated to both user and kernel PCIDs. However, this overhead is typically relatively minor compared to the loss of all TB entries if the entire TB were not preserved on each user/kernel transition.
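The effect of the two PCID tags can be sketched with a TB model keyed by (PCID, virtual page). The numeric values chosen for PCID_KERNEL and PCID_USER, the class name, and its methods are assumptions for illustration; only the two-tag scheme and the need to replicate user VA flushes to both PCIDs come from the description above.

```python
PCID_KERNEL, PCID_USER = 2, 1  # hypothetical values for the two tags


class TaggedTB:
    def __init__(self):
        self.entries = {}  # (pcid, virtual page) -> physical page
        self.current = PCID_KERNEL

    def switch(self, pcid, preserve=True):
        if not preserve:
            # A non-preserving switch drops translations tagged with the
            # PCID being loaded.
            self.entries = {k: v for k, v in self.entries.items() if k[0] != pcid}
        self.current = pcid

    def fill(self, vpn, pfn):
        self.entries[(self.current, vpn)] = pfn

    def lookup(self, vpn):
        # Only translations tagged with the current PCID can match.
        return self.entries.get((self.current, vpn))

    def flush_user_page(self, vpn):
        # The same user VA may be cached under both tags.
        for pcid in (PCID_KERNEL, PCID_USER):
            self.entries.pop((pcid, vpn), None)

    def flush_all(self):
        self.entries.clear()  # e.g. context switch between processes
```

A translation filled while running under PCID_USER is invisible after switching to PCID_KERNEL, yet it is still there when the processor switches back, which is the property that makes frequent user/kernel transitions affordable.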
On address space switches between processes, such as context switches between two different processes, the entire TB is invalidated. This must be performed because the PCID values assigned by the kernel are not process-specific, but are global to the entire system. Assigning different PCID values to each process (which would be a more “traditional” usage of PCID) would remove the need to flush the entire TB on context switches between processes, but would also require TB flush IPIs (interprocessor interrupts) to be sent to a potentially much larger set of processors, specifically all of those that had previously loaded a given PCID. That is itself a performance trade-off, due to the cost of TB flush IPIs.
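The behavior described in this section can be captured in a small toy model. This is a simplified sketch, not a description of real hardware or of the Windows implementation: the TaggedTB class, its method names, and the PCID values are all invented for illustration, and the model ignores capacity, global pages, and multi-level paging.

```python
PCID_USER, PCID_KERNEL = 1, 2   # hypothetical values, for illustration only


class TaggedTB:
    """Toy translation buffer: entries are keyed by (PCID, virtual page number)."""

    def __init__(self):
        self.entries = {}
        self.current_pcid = PCID_KERNEL

    def switch_preserving(self, pcid):
        """User/kernel transition: change the active PCID, keep all entries."""
        self.current_pcid = pcid

    def flush_all(self):
        """Cross-process context switch: drop every entry, whatever its PCID."""
        self.entries.clear()

    def fill(self, vpn, pfn):
        """Cache a translation, tagged with the PCID that is active right now."""
        self.entries[(self.current_pcid, vpn)] = pfn

    def lookup(self, vpn):
        """Return the cached frame or None; only the active PCID can match."""
        return self.entries.get((self.current_pcid, vpn))

    def invalidate_user_page(self, vpn):
        """A user VA flush is replicated to both PCIDs, since either may cache it."""
        for pcid in (PCID_USER, PCID_KERNEL):
            self.entries.pop((pcid, vpn), None)


tb = TaggedTB()
tb.switch_preserving(PCID_USER)
tb.fill(0x10, 0xAAA)                 # user code touches a user page
tb.switch_preserving(PCID_KERNEL)    # system call entry
miss_in_kernel = tb.lookup(0x10)     # the user-tagged entry does not match here
tb.fill(0x10, 0xAAA)                 # kernel touches the same user VA: duplicate entry
tb.switch_preserving(PCID_USER)      # system call exit
hit_after_return = tb.lookup(0x10)   # still cached from before the system call
```

In this model the user translation survives the round trip through the kernel, the same user VA ends up cached twice (once per PCID), and invalidating that VA has to visit both tags. Because the two PCIDs are shared by every process, nothing in a tag says which process an entry belongs to, which is why a cross-process switch falls back to flush_all.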
PCID acceleration also requires the hypervisor to expose the CR4.PCIDE control register bit and the INVPCID instruction to the Windows kernel. The Hyper-V hypervisor was updated to expose these capabilities with the January 3rd, 2018 security updates. Additionally, the underlying PCID hardware capability is only defined for the native 64-bit paging mode, and thus a 64-bit kernel is required to take advantage of PCID acceleration (32-bit applications running under a 64-bit kernel can still benefit from the optimization).
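A kernel (or a curious reader with a CPUID dump) can tell whether these capabilities are visible by checking two feature bits: CPUID leaf 01H reports PCID support in ECX bit 17, and CPUID leaf 07H (sub-leaf 0) reports INVPCID support in EBX bit 10. The sketch below decodes register values that have already been read; the function name and the way the conditions are combined are assumptions for illustration, not the checks that Windows actually performs.

```python
CPUID_01_ECX_PCID = 1 << 17       # CPUID.01H:ECX bit 17: PCID supported
CPUID_07_EBX_INVPCID = 1 << 10    # CPUID.(EAX=07H,ECX=0):EBX bit 10: INVPCID supported


def pcid_acceleration_possible(cpuid_01_ecx, cpuid_07_ebx, kernel_is_64_bit):
    """Illustrative check: are all prerequisites named in the text visible?"""
    has_pcid = bool(cpuid_01_ecx & CPUID_01_ECX_PCID)
    has_invpcid = bool(cpuid_07_ebx & CPUID_07_EBX_INVPCID)
    return has_pcid and has_invpcid and kernel_is_64_bit


# A guest whose hypervisor hides INVPCID sees the EBX bit as clear.
hidden_by_hypervisor = pcid_acceleration_possible(1 << 17, 0, kernel_is_64_bit=True)
# Both bits exposed to a 64-bit kernel.
fully_exposed = pcid_acceleration_possible(1 << 17, 1 << 10, kernel_is_64_bit=True)
```

Inside a virtual machine the CPUID values are whatever the hypervisor chooses to report, which is why the hypervisor update matters as much as the physical processor does.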
User/global acceleration
Although many modern processors can take advantage of PCID acceleration, older Intel Core family processors and current Intel Atom family processors do not provide hardware support for PCID, and thus cannot use it to accelerate KVA shadowing. These processors do allow a more limited form of TB preservation across address space switches, however, in the form of the “global” page table entry bit. The global bit allows the operating system kernel to communicate to the processor that a given leaf translation is “global” to the entire system, and need not be invalidated on address space switches. (A special facility to invalidate all translations, including global translations, is provided by the processor for cases when the operating system changes global memory translations. On x64 and x86 processors, this is accomplished by toggling the CR4.PGE control register bit.)
Traditionally, the kernel would mark most kernel mode page translations as global, indicating that these translations can be preserved in the TB during cross-process address space switches while all non-global translations are flushed. For correct system operation, the kernel is then obligated to ensure that the incoming and outgoing address spaces translate every globally-mapped address identically across a global-preserving address space switch. This is a simple matter for the traditional use of kernel virtual address management, as most of the kernel address space is identical across all processes. The global bit thus elegantly allows most of the effective TB contents for kernel VAs to be preserved across context switches with minimal hardware and software complexity.
In the context of KVA shadow, however, the global bit can be used for a completely different purpose than its original intention, for an optimization termed “user/global acceleration”. Instead of marking kernel pages as global, KVA shadow marks user pages as global, indicating to the processor that all pages in the user mode half of the address space are safe to preserve across address space switches. While an address space switch must still occur on each user/kernel transition, global translations are preserved in the TB, which preserves the user TB entries. As most applications primarily spend their time executing in user mode, this mode of operation preserves the portion of the TB that is most relevant to most applications. The TB contents for kernel virtual addresses are unavoidably lost on each address space switch when user/global acceleration is in use, and as with PCID acceleration, some TB flushes must be handled differently (and cross-process context switches require an entire TB flush), but preserving the user TB contents substantially cuts the cost of KVA shadowing over the more naïve approach of marking no translations as global.
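The inversion of the global bit can be seen in a toy model. This is a simplified sketch under stated assumptions: the GlobalBitTB class, the is_global policy function, and the page numbers are invented for illustration, and the model treats all kernel pages alike even though the traditional policy marks only most of them global.

```python
class GlobalBitTB:
    """Toy translation buffer without PCID: each entry carries a global flag."""

    def __init__(self):
        self.entries = {}   # virtual page number -> (frame number, is_global)

    def fill(self, vpn, pfn, is_global):
        self.entries[vpn] = (pfn, is_global)

    def switch_address_space(self):
        """An ordinary address space switch drops only the non-global entries."""
        self.entries = {v: e for v, e in self.entries.items() if e[1]}

    def flush_all(self):
        """Models toggling CR4.PGE, which drops global entries as well."""
        self.entries.clear()

    def cached(self, vpn):
        return vpn in self.entries


def is_global(page_is_user, user_global_acceleration):
    """Traditional policy: kernel pages global. User/global acceleration: user pages."""
    return page_is_user if user_global_acceleration else not page_is_user


USER_VPN, KERNEL_VPN = 0x10, 0xFFFF0   # arbitrary example pages

tb = GlobalBitTB()
tb.fill(USER_VPN, 0xAAA, is_global(True, user_global_acceleration=True))
tb.fill(KERNEL_VPN, 0xBBB, is_global(False, user_global_acceleration=True))
tb.switch_address_space()              # user/kernel transition under KVA shadow
user_survived = tb.cached(USER_VPN)
kernel_survived = tb.cached(KERNEL_VPN)
```

Under the user/global policy the user entry survives the transition and the kernel entry is lost, which is the trade described above. A cross-process context switch corresponds to flush_all in this model, since one process's global user translations must not remain visible to the next.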
Privileged process acceleration
The purpose of KVA shadowing is to protect sensitive kernel mode memory contents from disclosure to untrusted user mode applications. This is required for security purposes in order to maintain privilege separation between kernel mode and user mode. However, highly-privileged applications that have complete control over the system are typically trusted by the operating system for a variety of tasks, up to and including loading drivers, creating kernel memory dumps, and so on. These applications effectively already have the privileges required in order to access kernel memory, and so KVA shadowing is of minimal benefit for these applications.
KVA shadow thus optimizes highly privileged applications (specifically, those whose primary token is a member of the BUILTIN\Administrators group, which includes LocalSystem and processes that execute as a fully-elevated administrator account) by running these applications only with the KVA shadow “kernel” address space, which is very similar to how applications execute on processors that are not susceptible to rogue data cache load. These applications avoid most of the overhead of KVA shadowing, as no address space switch occurs on user/kernel transitions. Because these applications are fully trusted by the operating system, and already have (or could obtain) the capability to load drivers that could naturally access kernel memory, KVA shadowing is not required for fully-privileged applications.
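The policy can be summarized as a single decision per process. The sketch below is only a paraphrase of the rule stated above, with invented names: it assumes the caller already has the set of group SIDs that are enabled in the process's primary token, and it uses the well-known SID S-1-5-32-544 for BUILTIN\Administrators. How the kernel actually evaluates token membership is not described in this post.

```python
BUILTIN_ADMINISTRATORS_SID = "S-1-5-32-544"   # well-known SID for BUILTIN\Administrators


def needs_shadow_address_space(enabled_group_sids, cpu_is_susceptible=True):
    """Illustrative policy: shadow every process except fully privileged ones."""
    if not cpu_is_susceptible:
        return False    # no KVA shadowing at all on unaffected processors
    return BUILTIN_ADMINISTRATORS_SID not in enabled_group_sids


# A standard user process: runs with separate user and kernel address spaces.
standard_user = needs_shadow_address_space({"S-1-5-32-545"})
# A fully elevated administrator process: runs with the kernel address space only.
elevated_admin = needs_shadow_address_space({"S-1-5-32-544", "S-1-5-32-545"})
```

The example deliberately takes only enabled groups as input, on the assumption that a non-elevated administrator token, where the Administrators group is present but not enabled, should be treated like a standard user.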
Optimizations are ongoing
The introduction of KVA shadowing radically alters how the Windows kernel fields traps and exceptions from a processor, and significantly changes several key aspects of memory management. While several high-value optimizations have already been deployed with the initial release of operating system updates to integrate KVA shadow support, research into additional avenues of improvement and opportunities for performance tuning continues. KVA shadow represents a substantial departure from some existing operating system design paradigms, and as with any such substantial shift in software design, exploring all possible optimizations and performance tuning opportunities is an ongoing effort.
Driver and application compatibility
A key consideration of KVA shadow was that existing applications and drivers must continue to work. Specifically, it would not have been acceptable to change the Windows ABI, or to invalidate how drivers work with user mode memory, in order to integrate KVA shadow support into the operating system. Applications and drivers that use supported and documented interfaces are highly compatible with KVA shadow, and no changes to how drivers access user mode memory through supported and documented means are necessary. For example, under a try/except block, it is still possible for a driver to use ProbeForRead to probe a user mode address for validity, and then to copy memory from that user mode virtual address (under try/except protection). Similarly, MDL mappings to/from user mode memory still function as before.
A small number of drivers and applications did, however, encounter compatibility issues with KVA shadow. The majority of incompatible drivers and applications used unsupported and undocumented means to interface with the operating system. For example, Microsoft encountered several software applications from multiple software vendors that assumed that the raw machine instructions in certain non-exported Windows kernel functions would remain unchanged across software updates. Such approaches are highly fragile and are liable to break at even slight perturbations of the operating system kernel code.
Operating system changes like KVA shadow, a security update that changed how the operating system manages memory and dispatches traps and exceptions, underscore the fragility of depending on unsupported and undocumented mechanisms in drivers and applications. Microsoft strongly encourages developers to use supported and documented facilities in drivers and applications. Keeping customers secure and up to date is a shared commitment, and avoiding dependencies on unsupported and undocumented facilities and behaviors is critical to meeting the expectations that customers have with respect to keeping their systems secure.
Conclusion
Mitigating hardware vulnerabilities in software is an extremely challenging proposition, whether you are an operating system vendor, driver writer, or an application vendor. In the case of rogue data cache load and KVA shadow, the Windows kernel is able to provide a transparent and strong mitigation for drivers and applications, albeit at the cost of additional operating system complexity, and especially on older hardware, at some potential performance cost depending on the characteristics of a given workload. The breadth of changes required to implement KVA shadowing was substantial, and KVA shadow support easily represents one of the most intricate, complex, and wide-ranging security updates that Microsoft has ever shipped. Microsoft is committed to protecting our customers, and we will continue to work with our industry partners in order to address speculative execution side channel vulnerabilities.
Ken Johnson, Microsoft Security Response Center (MSRC)