Line graph: "Time per Allocation, 1 Thread, NonPaged Pool", showing allocation time in nanoseconds versus allocation size in bytes, comparing "Without Pool Zeroing" (blue) and "With Pool Zeroing" (orange).

Solving Uninitialized Kernel Pool Memory on Windows

This blog post describes Microsoft's efforts to eliminate uninitialized kernel pool memory vulnerabilities on Windows, and our motivations for doing so.

Please refer to our previous blog post for background on uninitialized memory vulnerabilities and the solutions that have been tried in the past. In a nutshell: less than half of all uninitialized memory issues reported to Microsoft between 2017 and mid-2018 were caused by vulnerabilities in the kernel pool.

You can jump to the following sections of this blog post:

  1. Understanding How Uninitialized Pool Memory Can Be Solved
  2. Potential Implementation Options
  3. New APIs for the Windows Kernel Pool
  4. Performance Optimizations
  5. Deployment Plan
  6. Customer Impact
  7. Future Plans

None of this work would have been possible without close collaboration between MSRC and the Windows organization.

Understanding How Uninitialized Pool Memory Can Be Solved

After tackling uninitialized stack memory, we began work on uninitialized kernel pool memory.

As with uninitialized stack memory, we wanted a solution that deterministically prevents these vulnerabilities, rather than relying solely on static analysis, fuzzing, or code review. The ideal end state is knowing, by construction, that our code has no uninitialized kernel pool issues.

At first glance, uninitialized pool memory appeared much more difficult to solve than uninitialized stack memory. Consider the following differences:

|  | Stack Allocations | Kernel Pool Allocations |
| --- | --- | --- |
| Typical size | Small (kernel stacks are 20KB and cannot grow) | Can be many MB; the average pool allocation is larger than the average stack allocation |
| Cache | By the time code uses a variable, the active stack is typically in (or about to be in) the L1 cache | Pool allocations are typically used soon after allocation, but can be satisfied by memory that isn't in the cache at all |
| Ability to optimize | MSVC does a great job of eliminating redundant stores to stack variables | If allocations are zeroed in the pool API, the compiler cannot automatically optimize away the initial zeroing. MSVC would need custom optimization logic to recognize "this API zeroes, so a caller's memset-to-zero immediately after allocation can be eliminated." Redundant stores to pool/heap memory are generally harder for MSVC to optimize |
| Allocation time | Effectively instant: stack allocations happen in bulk when a function is entered and the stack pointer is adjusted | Fast, but not instant: involves multiple memory reads, branching logic, locks, consulting heap structures and metadata, and possibly calling the memory manager for more virtual address space |
| % of total allocation time spent initializing | If forced initialization cannot be optimized away, "allocation time" goes from zero to however long initialization takes. This is pure overhead | Since pool allocations already have overhead, we hope the memset is a single-digit percentage of total allocation time. For very large allocations, we expect the memset cost to dominate total allocation time |
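
To make the optimization problem concrete, here is a hypothetical sketch (the helper name, size, and tag are ours, not from this post) of the double-zeroing pattern the compiler generally cannot eliminate for pool memory:

```c
#include <ntddk.h>

// Hypothetical helper; size and pool tag are illustrative.
PVOID AllocateZeroedBuffer(VOID)
{
    // If the pool API already zeroed allocations, the RtlZeroMemory below
    // would be redundant. MSVC cannot prove the returned memory is zero,
    // so both stores survive and the buffer is zeroed twice. For a stack
    // local, the compiler can usually see and eliminate redundant stores.
    PVOID buffer = ExAllocatePoolWithTag(NonPagedPoolNx, 64, 'fuBz');
    if (buffer != NULL) {
        RtlZeroMemory(buffer, 64);
    }
    return buffer;
}
```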

In other words, we expect pool allocations to be larger than stack allocations on average, to touch memory that sits in slower CPU caches (or isn't cached at all), and to be harder to optimize redundant stores for. The one saving grace is that heap allocation already has overhead, so the extra time spent initializing allocations will hopefully blend into the noise of "time spent making a heap allocation."

We ran two sets of experiments to put this theory to the test.

Real-World Windows Performance Tests

In this test, we modified the existing pool APIs to unconditionally zero all allocations, with no way to opt out of the behavior. We then used our existing performance gate infrastructure to assess the impact of this change in important scenarios.

These tests yielded very encouraging results. There was no discernible performance regression in the majority of the benchmarks.

One of our most important benchmarks, Web Fundamentals (a measure of web server performance), is well known for being extremely sensitive to kernel pool allocator performance and is a good estimator of full system performance. When the old kernel pool allocator was replaced with an implementation based on the user-mode segment heap, Web Fundamentals initially saw 15% regressions (note: we fixed these regressions). The key point is that pool allocator performance greatly affects Web Fundamentals.

With pool zeroing in place, one of the Web Fundamentals tests showed a noise-level regression of about 1%. No regression was observed in the remaining Web Fundamentals tests.

This gave us confidence that pool zeroing was feasible, as long as developers could opt hot allocations out of zeroing if they caused regressions.

Microbenchmarks

To better understand the overhead of zeroing for allocations of various sizes, we also created microbenchmarks. Keep in mind that these microbenchmarks are noisy; if you notice a large spike at a single size, it is probably just noise. Also note that these benchmarks do not reflect current performance, since some performance improvements have landed since they were taken. The numbers listed here are the initial measurements.

Test 1: Allocating 8GB of Memory Using Identically Sized Allocations

The following benchmark measures the regression caused by pool zeroing when a single thread repeatedly allocates a fixed size. Keep in mind that each test makes 8GB of allocations in total.

This scenario is somewhat unrealistic, since during normal heap operation we expect allocations to be both made and freed, allowing virtual address space to be reused. In this test, the heap must periodically request more memory from the memory manager, which is slow.
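
As a rough illustration of what such a microbenchmark could look like, here is a hypothetical sketch; the function name, pool tag, and harness structure are ours, not Microsoft's actual test code. The with/without-zeroing comparison comes from running the same loop on kernel builds with and without the unconditional zeroing behavior described above:

```c
#include <ntddk.h>

// Allocate a total of 8GB in fixed-size chunks and compute the average
// nanoseconds per allocation. A real harness would record the pointers
// and free them after timing; this sketch omits that for brevity.
ULONG64 MeasureNsPerAllocation(SIZE_T allocationSize)
{
    const ULONG64 totalBytes = 8ULL * 1024 * 1024 * 1024;
    ULONG64 count = totalBytes / allocationSize;
    LARGE_INTEGER frequency, start, end;

    start = KeQueryPerformanceCounter(&frequency);
    for (ULONG64 i = 0; i < count; i++) {
        PVOID p = ExAllocatePoolWithTag(NonPagedPoolNx, allocationSize, 'hcnB');
        if (p == NULL) {
            count = i;  // ran out of memory earlier than expected
            break;
        }
    }
    end = KeQueryPerformanceCounter(NULL);

    if (count == 0) {
        return 0;
    }

    // Convert elapsed ticks to nanoseconds, then average per allocation.
    ULONG64 elapsedTicks = (ULONG64)(end.QuadPart - start.QuadPart);
    return (elapsedTicks * 1000000000ULL / (ULONG64)frequency.QuadPart) / count;
}
```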

Zeroing regression line chart

The following graphs depict the same scenario, but with four threads allocating each size simultaneously. This can introduce lock contention, interlocked operations, and colliding SList operations.

Time per allocation line chart

Zeroing regression line chart

It's important to note that multi-threaded tests typically have much higher noise levels than single-threaded tests, as the graph shows. It's also worth noting that overall, the regression appears lower here (although it was higher for very small sizes). This is expected: when multiple threads run the allocation path at once, the path incurs additional overhead (as previously mentioned), which makes the relative cost of zeroing smaller.

Test 2: Allocations and Frees

The following test both allocates and frees memory, which reduces the overhead of the memory manager periodically adding more memory to the heap. Since the heap itself should be running faster, we expect the relative regression to be worse in this set of tests.

Time per allocation line chart

Zeroing regression line chart

As expected, the graphs above show that the regression is significantly higher when allocations are both made and freed.

What about four threads running simultaneously?

Time per allocation line chart

Zeroing regression line chart

Once again, we can see that the regression is larger here than in the earlier tests.

The following graphs show how, as allocation size increases, the memset comes to completely dominate the time needed to allocate.

Time per allocation line chart

Zeroing regression line chart

Rationalizing the Performance Data

While some of the microbenchmark data appears fairly concerning, the real-world data looks good. Here is how we explain what we're seeing:

  1. In hot paths, smaller allocations are far more common than larger ones. For instance, it's rare to see multi-kilobyte (let alone megabyte) allocations in a hot path. Larger allocations tend to be made in a small number of places, and the code making them is usually not hot.
  2. Small allocations are not severely regressed. There is still an impact, but it isn't excessive.
  3. Many existing code paths already zero their allocations. Because our real-world test setup unconditionally zeroed in the pool API, many allocations were being zeroed twice (the double-zeroing pattern sketched earlier). If we ensure allocations are zeroed only once, we may recover some performance.
  4. Even so, the microbenchmarks are not fully accurate. Because each benchmark makes allocations of the same size over and over, the branch predictors inside the pool API are well trained. This does happen in the real world (for example, when an application makes many same-sized allocations in a row), but frequently it does not. When branch predictors aren't well trained, the normal pool allocation path incurs additional branch misprediction overhead that these tests don't account for.
  5. If zeroing becomes a bottleneck in a developer's code, we can always let them opt specific allocations out.

As you might have guessed, based on the data we gathered we ultimately decided to move forward with the pool zeroing project.

Potential Implementation Options

We considered three approaches to achieving initialized pool memory:

  1. Create new pool APIs that zero memory by default.
  2. Use compiler magic to zero-initialize pool allocations that aren't provably fully initialized.
  3. Change the existing pool APIs to zero memory by default and add a new opt-out flag.

We ruled out #2 because it would require one-off compiler logic to identify pool allocations, determine whether they are fully initialized, and insert initialization if they aren't. It would also only benefit developers who compile their drivers with MSVC, whereas the first approach helps developers using other compilers as well.

We ruled out #3 because it is a breaking change to the existing pool APIs. Many companies build a single driver for all Windows versions. If we changed the existing pool APIs, developers whose drivers need to run on older Windows versions would be forced to choose between two options:

  1. Not rely on the zeroing behavior (i.e., continue zeroing memory manually) so that their driver is functionally correct on Windows versions without this behavior.
  2. Write a driver that only works correctly on Windows versions that have the new pool API behavior.

Even if we shipped this change down-level, driver developers couldn't rely on it: some customers install updates slowly or not at all.

We spent some time exploring ways to "upgrade" the existing APIs to zeroing behavior, but we couldn't find one that met all of the following criteria:

  1. The zeroing API must be easier for developers to use than the non-zeroing API (i.e., we prefer opt-out over opt-in).
  2. It must not cause functional correctness issues on any in-support platform.
  3. It must not require double-zeroing on any in-support platform.
  4. It must be possible to turn zeroing off where performance requires it.

New APIs for the Windows Kernel Pool

APIs for Windows 10 Version 2004

For the Windows 10 Version 2004 release, we have added new pool APIs that zero memory by default.

These APIs include:

ExAllocatePool2 takes fewer parameters and is simpler to use. It covers the most common scenarios.

ExAllocatePool3 supports less common scenarios (such as priority allocations) that require more flexible parameters. Both APIs are designed to be extensible, so we won't need to keep adding new APIs in the future.
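
As a minimal illustration of the new API (the size and pool tag here are hypothetical):

```c
#include <ntddk.h>

NTSTATUS AllocateExample(VOID)
{
    // ExAllocatePool2 returns zeroed, non-executable memory by default;
    // no extra flag is needed to request zeroing.
    PVOID buffer = ExAllocatePool2(POOL_FLAG_NON_PAGED, 256, '2lpE');
    if (buffer == NULL) {
        return STATUS_INSUFFICIENT_RESOURCES;  // returns NULL on failure
    }

    // ... use the zeroed buffer ...

    ExFreePoolWithTag(buffer, '2lpE');
    return STATUS_SUCCESS;
}
```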

APIs that are Down-Level Compatible

Additionally, we've released a new set of wrapper APIs that are compatible with all down-level operating systems. These are implemented as forceinline functions, and to use them a driver developer must:

  1. Define POOL_ZERO_DOWN_LEVEL_SUPPORT in their driver (using a #define) before including any Windows headers.
  2. Call ExInitializeDriverRuntime before using these APIs.

On operating systems that natively support pool zeroing, these wrapper APIs simply call the new APIs and let the pool perform the zeroing. On operating systems that don't natively support pool zeroing (i.e., versions before Windows 10 Version 2004), they make the pool allocation and then memset the allocation to zero.

The goal here is to give driver developers a way to be explicit about what their code does. Since the behavior is explicitly stated in the API name, there is never any doubt as to whether a developer intended an allocation to be uninitialized or zeroed.
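
Here is a hedged sketch of the opt-in steps described above. It assumes the WDK's ExAllocatePoolZero wrapper; the tag, runtime-flags argument, and DriverEntry body are illustrative:

```c
// Must be defined before any Windows headers are included.
#define POOL_ZERO_DOWN_LEVEL_SUPPORT
#include <ntddk.h>

NTSTATUS DriverEntry(PDRIVER_OBJECT DriverObject, PUNICODE_STRING RegistryPath)
{
    UNREFERENCED_PARAMETER(DriverObject);
    UNREFERENCED_PARAMETER(RegistryPath);

    // Lets the wrappers detect whether the OS zeroes pool natively.
    // (Runtime flags shown as 0 here; combine with other opt-ins as needed.)
    ExInitializeDriverRuntime(0);

    // Zeroed on every OS version: natively on Windows 10 Version 2004 and
    // later, via an explicit memset on earlier releases.
    PVOID buffer = ExAllocatePoolZero(NonPagedPoolNx, 128, 'vDwn');
    if (buffer == NULL) {
        return STATUS_INSUFFICIENT_RESOURCES;
    }

    ExFreePoolWithTag(buffer, 'vDwn');
    return STATUS_SUCCESS;
}
```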

Improvements of ExAllocatePool2/3 over the Legacy APIs

Failure Behavior

The error-path behavior of the old pool APIs was confusing.

ExAllocatePoolWithQuotaTag raises an exception on failure, unless the POOL_QUOTA_FAIL_INSTEAD_OF_RAISE flag is passed, in which case it returns NULL. ExAllocatePoolWithTag and ExAllocatePoolWithTagPriority return NULL on failure, unless the POOL_RAISE_IF_ALLOCATION_FAILURE flag is passed, in which case they raise an exception. Having multiple APIs with different semantics is confusing.

ExAllocatePool2/3 return NULL on failure, unless the POOL_FLAG_RAISE_ON_FAILURE flag is specified, in which case they raise an exception instead.
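
A short sketch of the two failure modes (sizes and tags are illustrative):

```c
#include <ntddk.h>

NTSTATUS FailureModeExample(VOID)
{
    // Default behavior: NULL is returned on failure.
    PVOID a = ExAllocatePool2(POOL_FLAG_PAGED, 512, 'aliF');
    if (a == NULL) {
        return STATUS_INSUFFICIENT_RESOURCES;
    }
    ExFreePoolWithTag(a, 'aliF');

    // With POOL_FLAG_RAISE_ON_FAILURE, an exception is raised instead.
    PVOID b = NULL;
    __try {
        b = ExAllocatePool2(POOL_FLAG_PAGED | POOL_FLAG_RAISE_ON_FAILURE,
                            512, 'bliF');
        ExFreePoolWithTag(b, 'bliF');
    } __except (EXCEPTION_EXECUTE_HANDLER) {
        return GetExceptionCode();
    }

    return STATUS_SUCCESS;
}
```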

Tag Behavior

The old pool APIs accept pool tags of zero, which can make debugging more difficult. The new pool APIs do not allow zero pool tags.

Non-Executable, Non-Paged Pool by Default

POOL_FLAG_NON_PAGED now defaults to non-executable memory; executable non-paged pool memory must be requested explicitly with POOL_FLAG_NON_PAGED_EXECUTABLE. Making the safer allocation type the more convenient one makes developers less likely to unintentionally make an insecure choice.

On the x86 architecture, the paged pool is always executable; on all other architectures, it is not.
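
For example (hypothetical sizes and tags):

```c
#include <ntddk.h>

VOID ExecutablePoolExample(VOID)
{
    // Non-executable non-paged pool is the convenient default:
    PVOID data = ExAllocatePool2(POOL_FLAG_NON_PAGED, PAGE_SIZE, 'ataD');

    // Executable non-paged pool must now be requested explicitly:
    PVOID code = ExAllocatePool2(POOL_FLAG_NON_PAGED_EXECUTABLE,
                                 PAGE_SIZE, 'edoC');

    if (data != NULL) {
        ExFreePoolWithTag(data, 'ataD');
    }
    if (code != NULL) {
        ExFreePoolWithTag(code, 'edoC');
    }
}
```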

Zeroing by Default

The new pool APIs zero allocations by default. Callers that require uninitialized allocations must specify the POOL_FLAG_UNINITIALIZED flag.
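
For a hot path where zeroing measurably regresses performance, the opt-out might look like this (hypothetical helper, size, and tag):

```c
#include <ntddk.h>

PVOID AllocateHotPathScratch(VOID)
{
    // POOL_FLAG_UNINITIALIZED opts this allocation out of zeroing; the
    // caller takes responsibility for fully initializing the buffer.
    return ExAllocatePool2(POOL_FLAG_NON_PAGED | POOL_FLAG_UNINITIALIZED,
                           64 * 1024, 'rcSh');
}
```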

Performance Optimizations

Because performance looked good in real-world tests as-is, not much was done to optimize it further. A few things are worth highlighting:

  1. When making large allocations with zeroing requested, the heap may need to get memory from the memory manager. In this situation, the memory manager will attempt to supply pages that have already been zeroed by its background zeroing thread. This makes it possible to quickly allocate sizable amounts of zeroed memory.
  2. For very large allocations, developers can manually opt the allocation out of zeroing with the appropriate flag. This typically isn't needed, because very large allocations are not typically made in a hot path.
  3. We developed a heap-specific zeroing function that performs better than a typical memset implementation by taking advantage of the specific alignment guarantees the heap makes for its allocations. We'll publish another blog post about this in the future.

There was no need for additional optimizations.

Deployment Plan

Unlike InitAll, the new pool zeroing APIs require code changes.

For Windows 10 Version 2004, the Windows memory manager has been converted to the new pool zeroing APIs. Zeroed allocations are used everywhere except one location (a potentially sizable bitmap allocation).

We have also converted Hyper-V and several networking components to the new APIs (these changes will ship in a future release). In the near future, we intend to use automatic bug-filing tools to ensure that all kernel-mode code is converted to the new APIs.

Feedback on the new APIs has been positive so far. There have been no noticeable performance issues, and code size has decreased because developers no longer need to call both the pool API and memset to get a zeroed allocation.

We're also considering how to help third-party drivers move off the legacy pool APIs. Work is underway, but we don't yet have firm plans to share.

Customer Impact

Once we finish moving our code to the new pool APIs, the majority of uninitialized memory vulnerabilities affecting Windows customers will be eliminated. Uninitialized memory vulnerabilities will of course still be possible, but with InitAll protecting the stack and the majority of pool allocations zeroed, the likelihood of these issues creeping in is significantly lower.

It is also still possible for memory to be initialized to a value that isn't meaningful to the program (e.g., when memory must be initialized to a specific non-zero value for the program to be correct). In these situations, we at least get deterministic behavior (the program always sets the value the same, incorrect way) rather than behavior that varies with whatever the uninitialized value happened to be. Since we always know what the memory is set to, triaging issues and assessing a bug's impact becomes easier. And from a security standpoint, zero is typically the safest value to choose automatically, even when it isn't the "correct" one.

We remain hopeful that these mitigations will largely eliminate this vulnerability class, which in recent years has made up between 5 and 10% of all Microsoft CVEs.

Future Plans

Pool zeroing is a great place to start, but there are still some things to look into:

  1. How to handle user-mode heaps. Should we ban malloc and mandate calloc? Something else?
  2. How to handle C++ classes with constructors. Should these be required to fully initialize the class in the constructor instead of being zeroed? What about internal padding bytes?

We also plan to publish a future blog post about how we developed a new, specialized memset for the kernel pool to zero allocations with, and how this work led to higher-performance memset implementations across Windows.
