What’s the smallest variety of CHERI?

The Portmeirion project is a collaboration between Microsoft Research Cambridge, the Microsoft Security Response Center, and Azure Silicon Engineering & Solutions. Over the past year, we have been exploring how to scale the key ideas from CHERI down to tiny cores on the scale of the cheapest microcontrollers. These cores are very different from the desktop and server-class processors that have been the main focus of the Morello project.

Microcontrollers are still typically in-order systems with short pipelines and tens to hundreds of kilobytes of local SRAM. Systems like Morello, on the other hand, have wide and deep pipelines, perform out-of-order execution, and have gigabytes to terabytes of DRAM hidden behind layers of caches and a memory management unit with multiple levels of page tables. There are billions of microcontrollers in the world, and they are increasingly likely to be connected to the Internet. Because they lack virtual memory, they typically have no process-like abstraction and so run unsafe languages in a single privilege domain.

This project has now reached the point where we have a working RTOS running existing C/C++ components in compartments. Over the next few months, we’ll be open sourcing the software stack and working on an implementation of our proposed ISA extension based on the lowRISC project’s Ibex core, which we intend to contribute back upstream.

Our CHERI microcontroller project set out to determine whether we could obtain very strong security guarantees if we were willing to co-design the instruction set architecture (ISA), application binary interface (ABI), isolation model, and core components of the software stack. We follow the same two fundamental security principles as the rest of the CHERI project:

The principle of least privilege. No component should run with more privilege than it needs to do its job.

The principle of intentionality. No component should exercise privilege without explicitly intending to.

Current hardware security features respect neither of these well. Traditional processors define privilege with protection rings, in which each ring is strictly more privileged than the next. Anything running in one privilege mode effectively has full control over everything running in less privileged modes, yet most of the code in a typical kernel or hypervisor needs far less privilege than that. Similarly, any instruction that executes in a given privilege mode automatically exercises that privilege, whether or not it was intended to. Techniques such as SMAP are designed to help with this, so that a kernel does not act as a confused deputy and accidentally access userspace memory.

For protecting memory on microcontrollers, RISC-V provides a physical memory protection (PMP) unit. This performs only part of the job of a large system’s memory management unit (MMU): it protects address ranges but does not translate addresses. Again, this violates the principle of intentionality, because the permission that is granted (access to a range of the physical address space) is decoupled from the operation that uses it (load and store instructions with an arbitrary integer address). For instance, a PMP may define an explicit region for the stack, but out-of-bounds pointer arithmetic on a global or heap pointer may still end up writing to that region.

Our architecture is a subset of the smallest core RISC-V specification, RV32E. It has a 32-bit address space, only 15 registers (RISC-V typically has 31), one privilege level, and no PMP. We extend all of the registers to hold 64-bit capabilities, and we modify every RISC-V load and store instruction to require a capability as its address operand. Unlike the large CHERI systems, we don’t offer compatibility with legacy integer-addressed loads and stores, because we require that everything that runs on the microcontroller be recompiled.

What does CHERI provide for us?

CHERI, which was originally developed by the University of Cambridge and SRI International with DARPA funding, provides a capability model for accessing memory. Every memory access (load, store, or instruction fetch) must be authorized by a capability. A CHERI capability is a data type that the hardware protects, that can be manipulated only in carefully controlled ways, and that both describes and authorizes access to a range of memory. On a 32-bit address space, CHERI capabilities are 64-bit values protected by a non-addressable tag bit (65 bits in total). Capabilities cannot be created out of nothing. When the system boots, the register file contains capabilities that grant full access to the address space. Every other capability in the system is derived from one of these by copying it, removing permissions, or narrowing its bounds.
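The derivation rule can be illustrated with a toy model. This is only a sketch: the `Capability` class, its field names, and the permission strings are hypothetical, and the model clears the tag on an invalid request where real hardware may instead trap.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Capability:
    base: int          # lowest address the capability covers
    top: int           # one past the highest address it covers
    perms: frozenset   # permission names (hypothetical), e.g. {"load", "store"}
    tag: bool = True   # validity bit; a cleared tag means "not a capability"

    def set_bounds(self, base, top):
        """Narrow the bounds. Asking for anything wider clears the tag."""
        valid = self.tag and self.base <= base <= top <= self.top
        return replace(self, base=base, top=top, tag=valid)

    def drop_perms(self, *names):
        """Remove permissions. There is deliberately no way to add one."""
        return replace(self, perms=self.perms - frozenset(names))


# Hypothetical boot-time root covering a 32-bit address space.
ROOT = Capability(0x0000_0000, 0x1_0000_0000, frozenset({"load", "store"}))

# A read-only view of a 64-byte buffer, derived from the root.
buf = ROOT.set_bounds(0x2000, 0x2040).drop_perms("store")

# Trying to grow the bounds again does not yield a usable capability.
wider = buf.set_bounds(0x2000, 0x3000)
```

Every value in this model is reachable from `ROOT` only by shrinking, which is the property that makes pointers unforgeable.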

A C/C++ programmer can think of capabilities and pointers as equivalent, because capabilities are the hardware type that the compiler uses to represent pointers. A pointer in a CHERI C system is unforgeable, carries bounds checks that cannot be bypassed, and may have restricted permissions (for instance, it may be read-only). Because every function pointer, every data pointer, and every implicit pointer (such as the stack pointer or the global pointer) is a capability, the hardware enforces bounds checks on every access. This gives us a foundation for object-granularity memory safety and for fine-grained compartmentalization.

CHERI naturally respects both of our security principles. Every load or store instruction must take a capability as its base operand. Because the instruction captures your intent (you meant to access the object identified by the pointer that you gave as an operand), using an offset that would move a pointer from one object that you own into another will fail. Indirect jumps are another example: they take an executable capability as an operand and fail if the function pointer lacks execute permission, so trying to use a data pointer as a function pointer will trap. The principle of least privilege is also simple to enforce, because the memory that a piece of running code can access is limited to what is reachable from the capabilities it holds.
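A minimal sketch of these checks, assuming a toy byte-addressed memory and hypothetical permission names (`load`, `store`, `execute`). The `CapabilityFault` exception stands in for the hardware trap; none of these names come from the real ISA.

```python
from dataclasses import dataclass


class CapabilityFault(Exception):
    """Stands in for the trap that the hardware would raise."""


@dataclass(frozen=True)
class Capability:
    base: int
    top: int
    perms: frozenset
    tag: bool = True


MEMORY = bytearray(256)  # toy memory


def check(cap, addr, size, perm):
    """The checks performed on every access made through cap."""
    if not cap.tag:
        raise CapabilityFault("tag violation")
    if perm not in cap.perms:
        raise CapabilityFault("missing " + perm + " permission")
    if addr < cap.base or addr + size > cap.top:
        raise CapabilityFault("bounds violation")


def load_byte(cap, offset):
    addr = cap.base + offset
    check(cap, addr, 1, "load")
    return MEMORY[addr]


def store_byte(cap, offset, value):
    addr = cap.base + offset
    check(cap, addr, 1, "store")
    MEMORY[addr] = value


def jump(cap):
    """Indirect jump: the operand must carry execute permission."""
    check(cap, cap.base, 1, "execute")
    return cap.base  # the new program counter
```

With two adjacent 16-byte objects `a` (addresses 0 to 15) and `b` (16 to 31), `store_byte(a, 16, 1)` faults even if the caller also holds `b`: the operand says the access was meant for `a`, and address 16 is outside `a`.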

Scaling CHERI down

Up until now, most CHERI work has concentrated on 64-bit architectures with 128-bit capabilities. The Morello capability format has 18 bits of permissions, 15 bits of object type, and 31 bits of bounds. A direct transliteration would need far more than the 32 bits of metadata that a 64-bit capability leaves free on a 32-bit processor. Early work on 32-bit CHERI systems exists, but its encoding has many restrictions, such as 3 bits of precision, which forces a lot of padding on large allocations. It can offer byte-granularity bounds for objects of up to 64 bytes, but beyond that it requires stronger alignment of the base and top. This was sufficient to show that a 32-bit CHERI is feasible, but it is not sufficient for real-world deployment.

Our encoding saves space by observing that many of the permission combinations that CHERI can express are never used by secure systems, so we don’t need to be able to represent them. Our friends at the University of Cambridge proposed one such decomposition a few years ago: capabilities that convey sealing or unsealing permissions operate in a different namespace from all other capabilities and may therefore have their own format. They also suggested separating writeable and executable capabilities, but doing so would require changes to POSIX or Windows software.

Building on these ideas, we reduced the encoding space for 13 architectural permissions to 7 bits. Sealing and unsealing permissions cannot be combined with any memory-access permission. Execute and store permissions are mutually exclusive, so no capability can convey the right both to execute and to write memory. Microcontroller software is often written with at least the option of running on a Harvard architecture, and so it avoids the assumptions that make this kind of change problematic in desktop or server codebases.

Thanks to this compression, we can have byte-granularity bounds for any object (or sub-object) of up to 511 bytes, which is sufficient for embedded systems.

Permission compression also means that the system has no single omnipotent capability. A 64-bit CHERI system boots with capabilities that give the initial loader access to the entire address space in every possible way. Our core instead provides three distinct root capabilities when it boots: one for execution, one for sealing, and one for writing to memory. Every capability in a running system is derived from one of these. CHERI provides no mechanism for adding permissions, so building a write-and-execute capability would require adding write or execute to something derived from one of our roots, which is impossible. There is therefore no way ever to build a write-and-execute capability in our system.

This does not prevent code from holding two capabilities to the same memory, one that grants write access and one that permits execution, but the code must explicitly use the appropriate capability for each action. Even when you can both write to and execute from the same memory, you must state explicitly which operation you are performing and present the capability that authorizes it. From an intentionality perspective, this is a desirable property.
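The argument can be checked mechanically in a toy model. The sketch below enumerates every permission set that can be reached by removing permissions from three roots; the permission names and the exact contents of each root (for example, that the execute root also carries `load`) are assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical permission sets for the three boot-time roots.
ROOTS = {
    "execute": frozenset({"load", "execute"}),
    "memory": frozenset({"load", "store"}),
    "sealing": frozenset({"seal", "unseal"}),
}


def derivable_permission_sets():
    """Every permission set reachable by removing permissions from a root."""
    reachable = set()
    for perms in ROOTS.values():
        for size in range(len(perms) + 1):
            for subset in combinations(sorted(perms), size):
                reachable.add(frozenset(subset))
    return reachable


def has_write_and_execute(perms):
    return {"store", "execute"} <= perms


# Two separate capabilities may still cover the same region of memory:
# one authorizes writes, the other authorizes execution.
region = (0x1000, 0x2000)
write_cap = (region, ROOTS["memory"])
exec_cap = (region, ROOTS["execute"])
```

Because derivation only ever removes permissions, no element of `derivable_permission_sets()` satisfies `has_write_and_execute`, even though `write_cap` and `exec_cap` name the same memory.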

Adding temporal memory safety

Our work on large CHERI systems has suggested numerous optimizations that we hope will improve the performance of temporal safety on server-class systems. On small systems we face a somewhat simpler problem. There is no virtual memory, so we don’t need to worry about aliasing. Physical memory is small, so we can scan all of it very quickly. And embedded systems frequently have a single core, so we don’t need to handle races between memory accesses and object deallocation.

To track whether memory has been deallocated, we provide a hardware implementation of the Cornucopia revocation bitmap, which uses 1 bit per 8 bytes of SRAM. On free, the allocator sets the bits for the freed object and defers reuse of the memory until after a revocation scan has completed. An effective innovation for small CHERI systems is that the main CPU pipeline checks this bit whenever it loads a capability and clears the tag bit if the capability points to memory that has been marked as revoked. This means that the register file can never contain a capability that points to a deallocated object: registers never hold stale pointers as long as they are spilled and reloaded on context switch (which happens by definition) and on return from free (which happens as a consequence of calling into the memory allocator’s compartment). As a result, we do not need to perform revocation checks on data loads and stores (though a multicore microcontroller would still need them unless it explicitly synchronized the cores on free).
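A rough model of the bitmap and the load filter follows. The `Sram` class and its method names are hypothetical, capabilities are reduced to `(base, length, tag)` tuples, and the sketch assumes the filter tests the bit for the capability’s base address.

```python
GRANULE = 8  # bytes of SRAM covered by one revocation bit


class Sram:
    def __init__(self, size):
        self.revoked = [False] * (size // GRANULE)  # the revocation bitmap
        self.caps = {}  # address -> (base, length, tag) of a stored capability

    def store_cap(self, addr, cap):
        self.caps[addr] = cap

    def load_cap(self, addr):
        """Load filter: clear the tag if the capability's base is revoked."""
        base, length, tag = self.caps[addr]
        if tag and self.revoked[base // GRANULE]:
            tag = False
        return (base, length, tag)

    def free(self, base, length):
        """Allocator side of free: mark every granule of the object."""
        first = base // GRANULE
        last = (base + length - 1) // GRANULE
        for granule in range(first, last + 1):
            self.revoked[granule] = True

    def revocation_sweep(self):
        """Reload every stored capability through the filter and write it back."""
        for addr in self.caps:
            self.caps[addr] = self.load_cap(addr)

    def release(self, base, length):
        """After a sweep, clear the bits so that the memory can be reused."""
        first = base // GRANULE
        last = (base + length - 1) // GRANULE
        for granule in range(first, last + 1):
            self.revoked[granule] = False
```

Between `free` and the sweep, the stale capability is still present in memory with its tag set, but any attempt to load it into a register yields an untagged value. After the sweep, no tagged copy remains anywhere, so it is safe to call `release` and reuse the memory.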

This is sufficient for use-after-free protection, but in a world with multiple compartments, temporary delegation is frequently useful as well. If I pass a pointer to an object from one compartment to another, I want to be able to reuse the object as soon as the call returns, without having to free it to be sure that the callee has not kept a copy of the pointer. We provide a lexically scoped delegation mechanism that lets a caller delegate access to an object graph for the duration of a cross-compartment call.

We implement lexical delegation on top of the 2-bit information flow control mechanism that has been part of CHERI since the beginning. This relies on two permissions: global and store-local. A capability that lacks the global permission is called a local capability, and it can be stored only through a capability that has the store-local permission.

In our system, only stacks and the register save area (used for context switches) have the store-local permission. This means that a local capability can be passed between compartments but can be stored only on the stack. All that software then has to do is clear the stack on return.
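The store rule is small enough to write down directly. In this sketch the permission names and the example permission sets for the stack and the heap are illustrative assumptions, not the real encoding.

```python
def can_store(value_perms, target_perms):
    """May a capability with value_perms be stored through target_perms?"""
    if "store" not in target_perms:
        return False
    if "global" in value_perms:
        return True  # global capabilities can be stored anywhere writable
    return "store-local" in target_perms  # local ones need store-local


# Hypothetical permission sets for the places a compartment can write to.
STACK = frozenset({"load", "store", "store-local"})
HEAP = frozenset({"load", "store", "global"})

local_pointer = frozenset({"load", "store"})             # global removed
global_pointer = frozenset({"load", "store", "global"})
```

Here `can_store(local_pointer, STACK)` holds while `can_store(local_pointer, HEAP)` does not, so a callee that receives `local_pointer` can spill it to its stack but cannot stash it in a global or heap object for later use.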

For our CHERI variant, we added one more permission to the local/global mechanism: permit-indirect-load-global. If you hold this permission on a capability, the capabilities that you load through it keep their global permission. Without it, every capability that you load has both its global and its permit-indirect-load-global permissions cleared. This mechanism is analogous to the deep-immutability support that we added to Morello with Arm. If you remove the permit-indirect-load-global and global permissions from a capability, you can pass it to another compartment and be sure that, once the call returns, the callee has kept no pointer to anything reachable from it. You can combine this with the deep-immutability model to grant temporary read-only access to a complex data structure for the duration of a call.
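The transitive effect can be sketched as a function from the permissions of the authorizing capability to the permissions of whatever is loaded through it. The helper names `load_via` and `delegate` are hypothetical.

```python
GLOBAL = "global"
LOAD_GLOBAL = "permit-indirect-load-global"


def load_via(authorizing, loaded):
    """Permissions of a capability after it is loaded through `authorizing`."""
    if LOAD_GLOBAL in authorizing:
        return frozenset(loaded)
    # Without the permission, the result is local and cannot itself be
    # used to load global capabilities, so the restriction propagates.
    return frozenset(loaded) - {GLOBAL, LOAD_GLOBAL}


def delegate(perms):
    """What a caller strips before lending a pointer for one call."""
    return frozenset(perms) - {GLOBAL, LOAD_GLOBAL}


# A linked structure whose nodes are all stored with full permissions.
FULL = frozenset({"load", "store", GLOBAL, LOAD_GLOBAL})

head = delegate(FULL)               # pointer handed to the callee
child = load_via(head, FULL)        # callee follows head -> child
grandchild = load_via(child, FULL)  # and child -> grandchild
```

Every capability the callee reaches from `head` comes out local, so by the store-local rule it can live only in registers or on the callee’s stack, both of which are gone once the call returns.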

Threads and compartments

Compartments and threads are the two main isolation concepts in our software model. A compartment defines spatial ownership, while a thread defines temporal ownership. A compartment is a combination of code and data (global variables) that exposes functions as entry points. Threads are schedulable entities that own a stack and call into compartments. At any given time, the system is running one thread in one compartment.

To the CPU, a compartment is defined by two capability registers. The program counter capability (PCC) defines the code (and read-only data), while the capability global pointer (CGP) defines the range of (mutable) globals for that compartment. Function calls within a compartment are direct jumps that do not change the bounds of PCC. All globals are accessed through the CGP register, with the compiler inserting a bounds restriction if you take the address of a global.

In software, a compartment also defines a set of entry points that can be the targets of domain transitions. In the source code, calls between compartments look like ordinary C function calls, but the compiler emits a call sequence that jumps to the switcher. The switcher enforces cross-compartment isolation. It saves callee-saved registers, clears temporary and unused argument registers, truncates the stack, and zeroes the delegated portion of the stack, before finally jumping to the entry point identified in the callee’s export table. The switcher manages a trusted stack that holds the saved stack pointer and the cross-compartment return address, neither of which is accessible to ordinary compartment code. The trusted stack also makes it possible to return to the calling compartment if the callee crashes.

Stack truncation ensures that the called compartment cannot access any part of the caller’s stack that was not explicitly passed as an argument. Stack zeroing, which also happens on return, ensures that no secrets or capabilities leak between compartments.
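A minimal sketch of truncation and zeroing, modelling the stack as a `bytearray` that grows downward from its end. The function name and calling convention are invented for illustration; the real switcher is a short assembly routine that also handles registers and the trusted stack.

```python
def cross_compartment_call(stack, sp, callee, *args):
    """Call `callee` with a truncated, zeroed view of the stack.

    The caller's frames occupy stack[sp:]; the unused region stack[:sp]
    is what gets delegated to the callee.
    """
    with memoryview(stack)[:sp] as callee_stack:  # truncation
        callee_stack[:] = bytes(sp)               # zero before the call
        try:
            return callee(callee_stack, *args)
        finally:
            callee_stack[:] = bytes(sp)           # zero again on return


def untrusted_callee(view, value):
    # The callee can use its view freely, but the view ends at sp.
    view[0:6] = b"secret"
    return value + len(view)
```

For example, with a 64-byte stack and `sp = 32`, the callee sees exactly 32 zeroed bytes, the caller’s 32 bytes above `sp` are never exposed, and nothing that the callee wrote survives the return.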

Calling between compartments takes a few hundred cycles on average, which is slower than an ordinary function call but still quite fast. Zeroing the stack sounds slow, but keep in mind that this is an embedded system, where stacks are typically 1 to 2 KiB. Even a relatively slow 50 MHz embedded core can zero a KiB of local SRAM quickly. Unfortunately, this technique is hard to apply on systems such as CheriBSD on Morello, where the stack is typically 8 MiB of DRAM.
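As a back-of-the-envelope check, assume (purely for illustration) that the core can retire one 4-byte store to local SRAM per cycle. The store rate is an assumption, not a measured figure for any particular core.

```python
def zeroing_time_us(stack_bytes, clock_hz=50_000_000, bytes_per_cycle=4):
    """Microseconds needed to zero stack_bytes at the assumed store rate."""
    cycles = stack_bytes / bytes_per_cycle
    return cycles / clock_hz * 1_000_000


one_kib = zeroing_time_us(1024)  # 256 cycles, roughly 5 microseconds
two_kib = zeroing_time_us(2048)  # 512 cycles, roughly 10 microseconds
```

Under that assumption, zeroing a typical embedded stack costs a few hundred cycles. The cost grows linearly with stack size, which is why multi-megabyte stacks make the same approach unattractive.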

Because stacks and the register save area are the only store-local targets in the system, there is no way to pass a stack pointer from one thread to another, even when both are running in the same compartment. Trying to store a pointer to an on-stack object into a global or into the heap will trap. This provides strong non-interference guarantees between threads, in keeping with the intentionality principle: a store is visible to another thread only if it goes through an object that was intentionally shared.

Shared libraries

If every piece of code that is shared between compartments had to be duplicated, some embedded software would need a lot more memory. To avoid this, we also provide the notion of a shared library. A shared library can be thought of as an immutable compartment: it contains code but no mutable globals, which makes it safe to invoke through a sentry capability.

A sentry (sealed entry) capability is an existing CHERI feature that allows an executable capability to be sealed and then automatically unsealed by a jump instruction. Like any other sealed capability, a sentry cannot be modified, so it lets callers invoke a function without giving them access to the function’s code or read-only globals.

The restriction that shared libraries may not own globals is much less severe for embedded software than it would be on large systems. Even a JavaScript interpreter fits this model, which lets multiple mutually distrusting compartments run interpreted code.

Some library routines need to run with interrupts disabled. On RISC-V, interrupts are disabled by changing a flag bit in a control and status register (CSR). We can stop untrusted code from accessing this register by removing the access-system-registers CHERI permission, but that permission is a very coarse-grained control that grants far more privilege than a typical library routine should have. Instead, we have extended the sentry mechanism to encode interrupt posture. There are three kinds of sentry: those that enable interrupts on jump, those that disable them, and those that leave the interrupt state unchanged. A jump-and-link instruction always produces a return capability with an explicit interrupt posture. Because these are exposed to C as function attributes, interrupt state follows structured programming and is very simple to reason about at the source level.
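The scoping behaviour can be modelled in a few lines. The `Posture` and `Core` names below are hypothetical; the sketch only shows that the return capability records the caller’s interrupt state explicitly, so that the state is restored when the callee returns.

```python
from enum import Enum


class Posture(Enum):
    ENABLE = "enable"
    DISABLE = "disable"
    INHERIT = "inherit"


class Core:
    def __init__(self):
        self.interrupts_enabled = True

    def _enter(self, posture):
        if posture is Posture.ENABLE:
            self.interrupts_enabled = True
        elif posture is Posture.DISABLE:
            self.interrupts_enabled = False
        # INHERIT leaves the current state untouched.

    def jump_and_link(self, posture, function):
        """Jump through a sentry with the given posture, then return."""
        # The return sentry always records an explicit posture.
        return_posture = (Posture.ENABLE if self.interrupts_enabled
                          else Posture.DISABLE)
        self._enter(posture)
        try:
            return function(self)
        finally:
            self._enter(return_posture)  # jumping to the return sentry
```

A routine entered through a `DISABLE` sentry runs with interrupts off for exactly its own dynamic extent. Anything it calls through an `ENABLE` sentry runs with interrupts on, and they are turned off again when that call returns.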

A privilege-separated kernel

The loader is the only part of the system that runs with full authority, although several other parts need more privilege than a typical compartment. The loader runs as soon as the system boots and is responsible for setting up the compartments. It therefore starts with the root capabilities that, between them, authorize every possible access, and it derives more constrained capabilities from them. The loader retains the three root capabilities throughout its execution but uses more constrained capabilities for everything else that it does. Once the loader has finished, nothing in the system ever runs with full privilege again until the next system reset.

The switcher is the next most privileged part. It is responsible for all domain transitions. It belongs to the trusted computing base (TCB) because it enforces some of the key guarantees that ordinary compartments rely on. Its program counter capability grants it access to the trusted stack register, a special capability register (SCR) that holds a capability to a small stack used for tracking cross-compartment calls on each thread. The trusted stack also contains a pointer to the thread’s register save area. On a context switch (whether caused by an interrupt or by an explicit yield), the switcher saves the register state and passes the scheduler a sealed capability to the thread state.

The switcher has no state of its own beyond what it borrows from the running thread via the trusted stack, and it is small enough to be audited easily. It always runs with interrupts disabled, which further simplifies a security audit. This matters because a bug in the switcher could violate thread isolation, for example by failing to seal the pointer to the thread state before passing it to the scheduler or by improperly clearing state on a compartment transition.

The switcher comprises fewer than 200 RISC-V instructions and is the only component that both handles untrusted data and runs with the access-system-registers permission.

Because of the sealing operation, the scheduler cannot access a thread’s register state. The scheduler can tell the switcher which thread to run next, but it cannot violate compartment or thread isolation. It is in the TCB only with respect to availability (it may refuse to run any threads), not confidentiality or integrity. The scheduler needs access to the interrupt controller’s memory-mapped I/O (MMIO) space in order to configure it, but this again gives it control only over availability: it can decide whether interrupts are delivered and which thread to schedule when they are.

The memory allocator is the last element of the TCB. Because it is responsible for setting the bounds on objects, the allocator is a crucial part of the TCB on any CHERI system: if it gets this wrong, you do not even have spatial memory safety. Our allocator also manages revocation, so bugs in it may additionally introduce exploitable use-after-free vulnerabilities.

Because the memory allocator manages a heap that is shared between compartments, bugs in it may also lead to cross-compartment memory disclosure. Keep in mind, though, that not all compartments on embedded systems use the heap, so this affects only those that do. Most of the allocator holds capabilities only to the heap memory. The sole exception is a small (isolated) component that provides the revocation service, which must be able to scan all mutable memory and invalidate dangling pointers. The revocation service can be implemented either in hardware or as a loop of roughly ten RISC-V instructions.

Keep the seals intact.

CHERI’s sealing mechanism lets a capability with the permit-seal permission transform another capability into an opaque token that cannot be modified. The sealed capability has the address of the sealing capability embedded in its object type field. It can be turned back into a usable capability only by an unseal operation that uses a capability with the permit-unseal permission whose bounds include that object type value. On Morello the object type is 15 bits, so many different opaque types can be passed between compartments.

Our capability encoding for 32-bit systems leaves only 3 bits for the object type. All of these values are reserved for privileged components. The switcher uses one for sealing thread state and for unsealing pointers to the compartments being invoked. The scheduler uses one to protect message queues. The allocator also has one, which we use to provide a software-defined capability mechanism.

The software-defined capability mechanism uses a special allocator entry point to allocate an object with a header that holds the value that would normally go in the capability’s otype field. Because this value is stored in heap memory, it can be a full 32-bit value (less the few values that the hardware uses). The allocator returns a sealed capability to this object and, when presented with the right authorization, will return an unsealed capability to the part of the object that excludes the header word. This keeps the header tamperproof (accessible only inside the allocator) and allows a very large number of sealing types, at the cost of being able to seal only pointers to complete objects.

This mechanism lets the network stack, for instance, protect connections from one another by sealing each connection’s state, returning the sealed pointer to the caller, and unsealing it when the caller passes it back in. It also lets us extend the idea of intentionality to higher-level abstractions: for example, sending data through the network stack requires presenting the software-defined capability for a connection.
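The shape of such an interface can be sketched as follows. Everything here is hypothetical: the `Allocator` class, its method names, and the starting key value are invented, and a plain integer stands in for the sealing key, whereas in the real system the key would itself be an unforgeable capability.

```python
class Allocator:
    """Toy model of allocator-provided, software-defined sealing."""

    def __init__(self):
        self._headers = {}     # sealed handle -> (otype, payload); private
        self._next_otype = 16  # assumed: low values are reserved for hardware

    def new_sealing_key(self):
        key = self._next_otype
        self._next_otype += 1
        return key

    def alloc_sealed(self, key, payload):
        """Allocate an object whose hidden header records the sealing key."""
        handle = object()      # opaque token handed back to the caller
        self._headers[handle] = (key, payload)
        return handle

    def unseal(self, key, handle):
        """Return the payload only if the key matches the hidden header."""
        otype, payload = self._headers.get(handle, (None, None))
        return payload if otype == key else None


# A network stack sealing per-connection state (illustrative only).
allocator = Allocator()
network_key = allocator.new_sealing_key()
connection = {"peer": "192.0.2.1", "port": 443}
sealed_connection = allocator.alloc_sealed(network_key, connection)
```

A client compartment can hold `sealed_connection` and pass it back to the network stack, but only code that holds `network_key` can get from the token to the connection state.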

What about safe languages?

The foundational components of any OS, including an RTOS, have to do unsafe things. For instance, the memory allocator must construct the notion of objects out of a flat address range. Safe Rust cannot express these things, and unsafe Rust offers few advantages over modern C++ code that uses the type system to convey security properties.

For the rest of the code, which runs in compartments, we want to be able to adopt existing code without modification. Network stacks, TLS layers, and JavaScript interpreters rewritten in Rust may have fewer memory safety bugs, but a new implementation is more likely to contain logic bugs than an established and well-tested one. CHERI lets us retrofit memory safety to these codebases without rewriting them.

For new code, safe languages are a much better choice. Much performance-critical code is perfectly suited to a safe systems language like Rust, and a CHERI Rust target is being worked on. Rust’s delegation model closely matches the set of properties that we can enforce across compartment boundaries, so we anticipate being able to enforce, between compartments from mutually distrusting authors, the same properties that the type system enforces within a trust domain. Rust will be a great language for development in this setting.

Rust is not the end of the safe-language spectrum, however. A lot of the code that runs on microcontrollers is control-plane software that is not performance sensitive (in either throughput or latency). The system currently runs a JavaScript VM, and we intend to support additional fully managed languages. For programmers who want a fully garbage-collected, type-safe environment, JavaScript code running in one compartment is protected from C code in other compartments.

Today’s announcement

As the first step toward proposing it as an official RISC-V standard extension, we plan to publish a technical report over the coming weeks describing how we have adapted the CHERI RISC-V ISA and ABIs to embedded systems, for external feedback, review, and collaboration. By the end of 2022, we aim to release our software stack, including the LLVM-based compiler and the RTOS, and to move our ISA implementation to the lowRISC project’s Ibex core.

David Chisnall, Hongyan Xia, Wes Filardo, and Robert Norton, Microsoft Research

Saar Amar, Microsoft Security Response Center

Yucong Tao and Kunyan Liu, Azure Silicon Engineering & Solutions

Tony Chen, Azure Edge & Platform