1. RMM Locking Guidelines

This document outlines the locking requirements, discusses the implementation, and provides guidelines for a deadlock-free RMM implementation. The document is currently based upon the RMM Alpha-05 specification and is expected to change as the implementation proceeds.

1.1. Introduction

In order to meet the requirement for the RMM to be small, simple to reason about, and able to co-exist with contemporary hypervisors which are already designed to manage system memory, the RMM does not include a memory allocator. It instead relies on an untrusted caller providing granules of memory used to hold both metadata to manage realms and code and data for realms.

To maintain the confidentiality and integrity of these granules, the RMM implements memory access controls by maintaining awareness of the state of each granule (aka Granule State, ref Implementation) and enforcing rules on how memory granules can transition from one state to another and how a granule can be used depending on its state. For example, all granules that can be accessed by software outside the PAR of a realm are in a specific state, and a granule that holds metadata for a realm is in another specific state that prevents it from being used as data in a realm and accidentally corrupted by a realm, which could lead to internal failure in the RMM.

Due to this complex nature of the operations supported by the RMM, for example when managing page tables for realms, the RMM must be able to hold locks on multiple objects at the same time. It is a well known fact that holding multiple locks at the same time can easily lead to deadlocking the system, as for example illustrated by the dining philosophers problem [EWD310]. In traditional operating systems software such issues are avoided by defining a partial order on all system objects and always acquiring a lower-ordered object before a higher-ordered object. This solution was shown to be correct by Dijkstra [EWD625]. Solutions are typically obtained by assigning an arbitrary order based upon certain attributes of the objects, for example by using the memory address of the object.
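The address-based ordering technique can be sketched as follows. The object type and the trivial lock/unlock helpers below are illustrative placeholders, not RMM definitions:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative object with a trivial lock flag (not an RMM type). */
struct object {
        bool locked;
};

static void lock(struct object *o)   { o->locked = true; }
static void unlock(struct object *o) { o->locked = false; }

/*
 * Acquire both locks in a globally consistent order by comparing the
 * objects' memory addresses: the lower address is always locked first,
 * so no two processes can ever wait on each other in a cycle.
 */
void lock_two_in_order(struct object *a, struct object *b)
{
        if ((uintptr_t)a < (uintptr_t)b) {
                lock(a);
                lock(b);
        } else {
                lock(b);
                lock(a);
        }
}

/* Release both locks; release order does not affect deadlock freedom. */
void unlock_two_in_order(struct object *a, struct object *b)
{
        unlock(a);
        unlock(b);
}
```

Because every process acquires the pair in the same global order regardless of argument order, the circular wait required for a deadlock cannot form.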

Unfortunately, software such as the RMM cannot use these methods directly because the RMM receives an opaque pointer from the untrusted caller and it cannot know, before locking the object, whether it is indeed in the expected state. Furthermore, MMU page tables are hierarchical data structures, and operations on the page tables typically must be able to locate a leaf node in the hierarchy based on a single value (a virtual address) and therefore must walk the page tables in their hierarchical order. This implies an order of objects in the same Granule State which is not known by a process executing in the RMM before it holds at least one lock on an object in the page table hierarchy. An obvious solution to these problems would be to use a single global lock for the RMM, but that would serialize all operations across all shared data structures in the system and severely impact performance.

1.2. Requirements

To address the synchronization needs of the RMM described above, we must employ locking and lock-free mechanisms that satisfy a number of properties. These are discussed below:

1.2.1. Critical Section

A critical section can be defined as a section of code within a process that requires access to shared resources and that must not be executed while another process is in a corresponding section of code [WS2001].

Further, access to shared resources without appropriate synchronization can lead to race conditions, which can be defined as a situation in which multiple threads or processes read and write a shared item and the final result depends on the relative timing of their execution [WS2001].

In the context of the RMM, an access to a shared resource can be considered as a sequence of operations/instructions in program order that either read from or write to a shared memory location (e.g. the granule data structure or the memory granule described by the granule data structure, ref Implementation). It is also understood that this sequence of operations does not execute indefinitely, but eventually terminates.

We can now define our desired properties as follows:

1.2.2. Mutual Exclusion

Mutual exclusion can be defined as the requirement that when one process is in a critical section that accesses shared resources, no other process may be in a critical section that accesses any of those shared resources [WS2001].

The following example illustrates how an implementation might enforce mutual exclusion of critical sections using a lock on a valid granule data structure struct granule *a:

struct granule *a;
bool r;

r = try_lock(a);
if (!r) {
        return -ERROR;
}
critical_section(a);
unlock(a);
other_work();

We note that a process either fails to perform the lock operation on object a and returns an error, or successfully acquires the lock, executes critical_section(), calls unlock(), and then continues to make forward progress to the other_work() function.

1.2.3. Deadlock Avoidance

A deadlock can be defined as a situation in which two or more processes are unable to proceed because each is waiting for one of the others to do something [WS2001].

In other words, one or more processes are trying to enter their critical sections but none of them make forward progress.

We can then define the deadlock avoidance property as the inverse scenario:

When one or more processes are trying to enter their critical sections, at least one of them makes forward progress.

A deadlock is a fatal event if it occurs in supervisory software such as the RMM. This must be avoided as it can render the system vulnerable to exploits and/or unresponsive which may lead to data loss, interrupted service and eventually economic loss.

1.2.4. Starvation Avoidance

Starvation can be defined as a situation in which a runnable process is overlooked indefinitely by the scheduler; although it is able to proceed, it is never chosen [WS2001].

Starvation avoidance can then be defined as follows: all processes that are trying to enter their critical sections eventually make forward progress.

Starvation must be avoided, because if one or more processes do not make forward progress, the PE on which the process runs will not perform useful work and will be lost to the user, resulting in issues similar to those of a deadlocked system.

1.2.5. Nested Critical Sections

A critical section for an object may be nested within the critical section for another object for the same process. In other words, a process may enter more than one critical section at the same time.

For example, if the RMM needs to copy data from one granule to another granule, and must be sure that both granules can only be modified by the process itself, it may be implemented in the following way:

struct granule *a;
struct granule *b;
bool r;

r = try_lock(a);
if (!r) {
        return -ERROR;
}

/* critical section for granule a -- ENTER */

r = try_lock(b);
if (r) {
        /* critical section for granule b -- ENTER */
        b->foo = a->foo;
        /* critical section for granule b -- EXIT */
        unlock(b);
}

/* critical section for granule a -- EXIT */
unlock(a);

1.3. Implementation

The RMM maintains granule states by defining a data structure for each memory granule in the system. Conceptually, the data structure contains the following fields:

  • Granule State

  • Lock

  • Reference Count

The Lock field provides mutual exclusion for processes executing in critical sections that may access the shared granule data structure and the shared metadata stored in the memory granule when it is in one of the RD, REC, and Table states. Both the data structure describing the memory granule and the contents of the memory granule itself can be accessed by multiple PEs concurrently, and we therefore require some concurrency protocol to avoid corruption of shared data structures. An alternative to using a lock providing mutual exclusion would be to design all operations that access shared data structures as lock-free algorithms, but due to the complexity of the data structures and the operation of the RMM we consider this too difficult to accomplish in practice.

The Reference Count field is used to keep track of references between granules. For example, an RD describes a realm, and a REC describes an execution context within that realm, so an RD must always exist while a REC exists. To prevent the RMM from destroying an RD while a REC still exists, the RMM holds a reference count on the RD for each REC associated with the same realm; only when all the RECs in a realm have been destroyed and the reference count on the RD drops to zero can the RD be destroyed and the granule repurposed for other use.

Based on the above, we now describe the Granule State field and the current locking/refcount implementation:

  • UnDelegated: These are granules for which RMM does not prevent the PAS of the granule from being changed by another agent to any value. In this state, the granule content access is not protected by granule::lock, as it is always subject to reads and writes from Non-Realm worlds.

  • Delegated: These are granules with memory only accessible by the RMM. The granule content is protected by granule::lock. No reference counts are held on this granule state.

  • Realm Descriptor (RD): These are granules containing metadata describing a realm, and are only accessible by the RMM. Granule content access is protected by granule::lock. A reference count is also held on this granule for each associated REC granule.

  • Realm Execution Context (REC): These are granules containing metadata describing a virtual PE running in a realm, and are only accessible by the RMM. The granule content access is not protected by granule::lock, because the RMM cannot enter a realm while holding the lock. Further, the following rules apply with respect to the granule’s reference counts:

    • A reference count is held on this granule when a REC is running.

    • As a REC cannot run on two PEs at the same time, the maximum value of the reference count is one.

    • When the REC is entered, the reference count is incremented (set to 1) atomically while granule::lock is held.

    • When the REC exits, the reference counter is released (set to 0) atomically with store-release semantics without granule::lock being held.

    • The RMM can access the granule’s content on the entry and exit path from the REC while the reference is held.

  • Translation Table: These are granules containing metadata describing virtual-to-physical address translation for a realm, accessible by the RMM and by the hardware Memory Management Unit (MMU). Granule content access is protected by granule::lock, but hardware translation table walks may read the RTT at any point in time. Multiple granules in this state can only be locked at the same time if they are part of the same tree, and only in topological order from root to leaf. The topological order of concatenated root-level RTTs is from the lowest address to the highest address. The complete internal locking order for RTT granules is: RD -> [RTT] -> … -> RTT. A reference count is held on this granule for each entry in the RTT that refers to a granule:

    • Table s2tte.

    • Valid s2tte.

    • Valid_NS s2tte.

    • Assigned s2tte.

  • Data: These are granules containing realm data, accessible by the RMM and by the realm to which it belongs. Granule content access is not protected by granule::lock, as it is always subject to reads and writes from within a realm. A granule in this state is always referenced from exactly one entry in an RTT granule which must be locked before locking this granule. Only a single DATA granule can be locked at a time on a given PE. The complete internal locking order for DATA granules is: RD -> RTT -> RTT -> … -> DATA. No reference counts are held on this granule type.
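As an illustration, the REC reference count rules above can be sketched using C11 atomics. The structure and helper names below are hypothetical simplifications, not the RMM's actual definitions:

```c
#include <stdatomic.h>

/* Simplified granule with only the refcount field (hypothetical). */
struct granule {
        atomic_ulong refcount;
};

/*
 * REC entry: granule::lock is held at this point, so setting the
 * reference count to 1 needs no ordering beyond atomicity.
 */
void rec_reference_acquire(struct granule *g)
{
        atomic_store_explicit(&g->refcount, 1, memory_order_relaxed);
}

/*
 * REC exit: the lock is not held, so the count is cleared with
 * store-release semantics, ordering all prior accesses to the REC's
 * contents before the reference is dropped.
 */
void rec_reference_release(struct granule *g)
{
        atomic_store_explicit(&g->refcount, 0, memory_order_release);
}
```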

1.3.1. Locking

The RMM uses spinlocks, along with the object state, for its locking implementation. The lock provides exclusive acquire semantics similar to those of trivial spinlock implementations; however, it also allows verification of whether the locked object is in an expected state.

The data structure for the spinlock can be described in C as follows:

typedef struct {
        unsigned int val;
} spinlock_t;

This data structure can be embedded in any object that requires synchronization of access, such as the struct granule described above.

The following operations are defined on spinlocks:

Typical spinlock operations
/*
 * Locks a spinlock with acquire memory ordering semantics or goes into
 * a tight loop (spins) and repeatedly checks the lock variable
 * atomically until it becomes available.
 */
void spinlock_acquire(spinlock_t *l);

/*
 * Unlocks a spinlock with release memory ordering semantics. Must only
 * be called if the calling PE already holds the lock.
 */
void spinlock_release(spinlock_t *l);
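A minimal sketch of these two operations using C11 atomics is shown below. The val field is declared atomic here for illustration; a production implementation on Arm would likely use exclusive load/store instructions and WFE-based waiting rather than a bare spin loop:

```c
#include <stdatomic.h>

typedef struct {
        atomic_uint val;
} spinlock_t;

/*
 * Spin until the lock word can be changed from 0 to 1. The successful
 * exchange uses acquire ordering so that the critical section cannot
 * be reordered before the lock acquisition.
 */
void spinlock_acquire(spinlock_t *l)
{
        unsigned int expected;

        do {
                expected = 0;
        } while (!atomic_compare_exchange_weak_explicit(
                        &l->val, &expected, 1U,
                        memory_order_acquire, memory_order_relaxed));
}

/*
 * Store 0 with release ordering so that all writes made inside the
 * critical section are visible before the lock appears free.
 */
void spinlock_release(spinlock_t *l)
{
        atomic_store_explicit(&l->val, 0U, memory_order_release);
}
```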

The above functions should not be used directly for locking/unlocking granules; instead, the following should be used:

Granule locking operations
/*
 * Acquires a lock (or spins until the lock is available), then checks
 * if the granule is in the `expected_state`. If the `expected_state`
 * is matched, then returns `true`. Otherwise, releases the lock and
 * returns `false`.
 */
bool granule_lock_on_state_match(struct granule *g,
                                 enum granule_state expected_state);

/*
 * Used when we're certain of the state of an object (e.g. because we
 * hold a reference to it) or when locking objects whose reference is
 * obtained from another object, after that object is locked.
 */
void granule_lock(struct granule *g,
                  enum granule_state expected_state);

/*
 * Obtains a pointer to a locked granule at `addr` if `addr` is a valid
 * granule physical address and the state of the granule at `addr` is
 * `expected_state`.
 */
struct granule *find_lock_granule(unsigned long addr,
                                  enum granule_state expected_state);

/* Find two granules and lock them in order of their address. */
return_code_t find_lock_two_granules(unsigned long addr1,
                                     enum granule_state expected_state1,
                                     struct granule **g1,
                                     unsigned long addr2,
                                     enum granule_state expected_state2,
                                     struct granule **g2);

/*
 * Obtain a pointer to a locked granule at `addr` which is unused
 * (refcount = 0), if `addr` is a valid granule physical address and the
 * state of the granule at `addr` is `expected_state`.
 */
struct granule *find_lock_unused_granule(unsigned long addr,
                                         enum granule_state
                                         expected_state);
Granule unlocking operations
/*
 * Release a spinlock held on a granule. Must only be called if the
 * calling PE already holds the lock.
 */
void granule_unlock(struct granule *g);

/*
 * Sets the state and releases a spinlock held on a granule. Must only
 * be called if the calling PE already holds the lock.
 */
void granule_unlock_transition(struct granule *g,
                               enum granule_state new_state);
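To illustrate how the state check composes with the lock, granule_lock_on_state_match might be sketched as follows. The granule layout and the unconditional lock helper are simplified placeholders; the key point is that the state can only be trusted while the lock is held, so the lock must be taken before the check:

```c
#include <stdbool.h>

enum granule_state {
        GRANULE_STATE_NS,
        GRANULE_STATE_DELEGATED,
        GRANULE_STATE_RD,
        GRANULE_STATE_REC,
};

/* Simplified granule; the real layout is implementation-defined. */
struct granule {
        enum granule_state state;
        bool locked; /* stand-in for the embedded spinlock */
};

static void granule_lock_unconditional(struct granule *g) { g->locked = true; }
static void granule_unlock(struct granule *g) { g->locked = false; }

/*
 * Acquire the lock first, then check the state: if the state does not
 * match, release the lock and report failure to the caller.
 */
bool granule_lock_on_state_match(struct granule *g,
                                 enum granule_state expected_state)
{
        granule_lock_unconditional(g);

        if (g->state != expected_state) {
                granule_unlock(g);
                return false;
        }

        return true;
}
```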

1.3.2. Reference Counting

The reference count is implemented using the refcount variable within the granule structure to keep track of references between granules. For example, the refcount is used to prevent changes to the attributes of a parent granule which is referenced by child granules, i.e. a parent with a refcount not equal to zero.

Race conditions on the refcount variable are avoided by either locking the granule before accessing the variable or by lock-free mechanisms such as Single-Copy Atomic operations along with ARM weakly ordered ACQUIRE/RELEASE/RELAXED memory semantics to synchronize shared resources.

The following operations are defined on refcount:

Read a refcount value
/*
 * Single-copy atomic read of refcount variable with RELAXED memory
 * ordering semantics. Use this function if lock-free access to the
 * refcount is required with relaxed memory ordering constraints applied
 * at that point.
 */
unsigned long granule_refcount_read_relaxed(struct granule *g);

/*
 * Single-copy atomic read of refcount variable with ACQUIRE memory
 * ordering semantics. Use this function if lock-free access to the
 * refcount is required with acquire memory ordering constraints applied
 * at that point.
 */
unsigned long granule_refcount_read_acquire(struct granule *g);
Increment a refcount value
/*
 * Increments the granule refcount. Must be called with the granule
 * lock held.
 */
void __granule_get(struct granule *g);

/*
 * Increments the granule refcount by `val`. Must be called with the
 * granule lock held.
 */
void __granule_refcount_inc(struct granule *g, unsigned long val);

/* Atomically increments the reference counter of the granule.*/
void atomic_granule_get(struct granule *g);
Decrement a refcount value
/*
 * Decrements the granule refcount. Must be called with the granule
 * lock held.
 */
void __granule_put(struct granule *g);

/*
 * Decrements the granule refcount by `val`. Asserts if refcount can
 * become negative. Must be called with the granule lock held.
 */
void __granule_refcount_dec(struct granule *g, unsigned long val);

/* Atomically decrements the reference counter of the granule. */
void atomic_granule_put(struct granule *g);

/*
 * Atomically decrements the reference counter of the granule. Stores to
 * memory with RELEASE semantics.
 */
void atomic_granule_put_release(struct granule *g);
Directly access refcount value
/*
 * Directly reads/writes the refcount variable. Must be called with the
 * granule lock held.
 */
granule->refcount;
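A sketch of some of these operations using C11 atomics is given below. This is a simplified illustration under the assumptions stated in the comments, not the actual RMM code:

```c
#include <stdatomic.h>

/* Simplified granule with only the refcount field (hypothetical). */
struct granule {
        atomic_ulong refcount;
};

/* Locked-path increment: the granule lock provides the mutual
 * exclusion, so relaxed ordering is sufficient here. */
void __granule_get(struct granule *g)
{
        atomic_fetch_add_explicit(&g->refcount, 1, memory_order_relaxed);
}

/* Locked-path decrement, mirroring __granule_get. */
void __granule_put(struct granule *g)
{
        atomic_fetch_sub_explicit(&g->refcount, 1, memory_order_relaxed);
}

/* Lock-free decrement with RELEASE semantics: orders all prior
 * accesses to the granule before the reference is dropped. */
void atomic_granule_put_release(struct granule *g)
{
        atomic_fetch_sub_explicit(&g->refcount, 1, memory_order_release);
}
```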

1.4. Guidelines

In order to meet the Requirements discussed above, this section stipulates some locking and lock-free algorithm implementation guidelines for developers.

1.4.1. Mutual Exclusion

The spinlock, acquire/release and atomic operations provide trivial mutual exclusion implementations for RMM. However, the following general guidelines should be taken into consideration:

  • Appropriate deadlock avoidance techniques should be incorporated when using multiple locks.

  • Lock-free access to shared resources should be atomic.

  • Memory ordering constraints should be used prudently to avoid performance degradation. For example, on an unlocked granule (e.g. a REC), if there are associated memory operations prior to the refcount update, then the update should be done with release semantics. However, if there are no associated memory accesses to the granule prior to the refcount update, then release semantics are not required.

1.4.2. Deadlock Avoidance

Deadlock avoidance is provided by defining a partial order on all objects in the system where the locking operation will eventually fail if the caller tries to acquire a lock of a different state object than expected. This means that no two processes will be expected to acquire locks in a different order than the defined partial order, and we can rely on the same reasoning for deadlock avoidance as shown by Dijkstra [EWD625].

To establish this partial order, the objects referenced by RMM can be classified into two categories:

  1. External: A granule state belongs to the external class iff _any_ parameter in _any_ RMI command is an address of a granule which is expected to be in that state. The following granule states are external:

    • GRANULE_STATE_NS

    • GRANULE_STATE_DELEGATED

    • GRANULE_STATE_RD

    • GRANULE_STATE_REC

  2. Internal: A granule state belongs to the internal class iff it is not external. These are objects which are referenced from another object after that object is locked. Each internal object should be referenced from exactly one place. The following granule states are internal:

    • GRANULE_STATE_RTT

    • GRANULE_STATE_DATA

We now state the locking guidelines for RMM as:

  1. Granules expected to be in an external state must be locked before locking any granules in an internal state.

  2. Granules expected to be in an external state must be locked in order of their physical address, starting with the lowest address.

  3. Once a granule expected to be in an external state has been locked, its state must be checked against the expected state. If these do not match, the granule must be unlocked and no further granules may be locked within the currently-executing RMM command.

  4. Granules in an internal state must be locked in order of state:

    • RTT

    • DATA

  5. Granules in the same internal state must be locked in the Implementation defined order for that specific state.

  6. A granule’s state can be changed iff the granule is locked and the reference count is zero.
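Putting these guidelines together, an operation that updates a leaf RTT entry might take its locks in the order sketched below. The function and the unconditional lock/unlock helpers are hypothetical simplifications of the primitives described in the Implementation section:

```c
#include <stdbool.h>

/* Simplified granule with a trivial lock flag (illustrative only). */
struct granule {
        bool locked;
};

static void granule_lock(struct granule *g)   { g->locked = true; }
static void granule_unlock(struct granule *g) { g->locked = false; }

/*
 * Lock order for mapping a data granule: the external RD first
 * (guideline 1), then the internal RTT granules in topological order
 * from root to leaf (guidelines 4 and 5), then the single DATA granule
 * referenced from the locked leaf RTT entry.
 */
void map_data_granule(struct granule *rd, struct granule *rtt_root,
                      struct granule *rtt_leaf, struct granule *data)
{
        granule_lock(rd);
        granule_lock(rtt_root);
        granule_lock(rtt_leaf);
        granule_lock(data);

        /* ... update the leaf RTT entry to reference the data granule ... */

        /* Unlock in the reverse order of acquisition. */
        granule_unlock(data);
        granule_unlock(rtt_leaf);
        granule_unlock(rtt_root);
        granule_unlock(rd);
}
```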

1.4.3. Starvation Avoidance

Currently, the lock-free implementation for RMI.REC.Enter provides Starvation Avoidance in the RMM. However, for the locking implementation, Starvation Avoidance is yet to be accomplished. This can be added with a ticket- or MCS-style locking implementation [MCS].
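For reference, a ticket lock of the kind suggested above can be sketched with C11 atomics as follows. This is a hypothetical illustration of the technique, not current RMM code; because waiters are served in FIFO order of their tickets, no waiter can be overtaken indefinitely:

```c
#include <stdatomic.h>

typedef struct {
        atomic_uint next_ticket; /* next ticket to hand out */
        atomic_uint now_serving; /* ticket currently allowed to run */
} ticketlock_t;

/*
 * Take the next ticket, then spin until it is being served. Because
 * tickets are granted in the order they were taken, every waiter
 * eventually reaches the front of the queue.
 */
void ticketlock_acquire(ticketlock_t *l)
{
        unsigned int ticket = atomic_fetch_add_explicit(
                &l->next_ticket, 1U, memory_order_relaxed);

        while (atomic_load_explicit(&l->now_serving,
                                    memory_order_acquire) != ticket) {
                /* spin */
        }
}

/* Pass the lock to the next ticket holder with release ordering. */
void ticketlock_release(ticketlock_t *l)
{
        atomic_fetch_add_explicit(&l->now_serving, 1U,
                                  memory_order_release);
}
```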

1.4.4. Nested Critical Sections

Spinlocks provide support for nested critical sections. Processes can acquire multiple spinlocks at the same time, as long as the locking order is not violated.

1.5. References

EWD310

Dijkstra, E.W. Hierarchical ordering of sequential processes. EWD 310.

EWD625

Dijkstra, E.W. Two starvation free solutions to a general exclusion problem. EWD 625.

MCS

Mellor-Crummey, John M. and Scott, Michael L. Algorithms for scalable synchronization on shared-memory multiprocessors. ACM TOCS, Volume 9, Issue 1, Feb. 1991.

WS2001

Stallings, W. (2001). Operating systems: Internals and design principles. Upper Saddle River, N.J: Prentice Hall.