Rust's Mutex, Atomics and UnsafeCell – Spooky Action at a Distance?
August 07, 2024 – Leon Schuermann – #rust
A defining feature of Rust is its concept of aliasing ⊕
mutability. This rule governs that at any time a value may either
have multiple immutable shared references, or a single mutable
unique reference, and never both. While this greatly helps in
producing fast, efficient and correct code, it can be limiting. To
this end, Rust also features types that bend these rules, like
Mutex, RwLock,
Cell, RefCell, and
the ominous UnsafeCell types. In this post we
explore how these types interact with Rust's type system and concepts
of references and aliasing ⊕ mutability. We do so by looking at
how the AtomicUsize and Mutex types are implemented, how violating Rust's assumptions
can lead to incorrect optimizations by the compiler, and the
surprising global impact of synchronization primitives.
Table of Contents
1. Atomics From Scratch
William Henderson's blog features a great article on how to implement Rust's atomic types (and ultimately even Mutexes) from scratch. I encourage reading the full article, but will repeat the essential parts of the code here (for an AMD64 CPU) as well:
1: pub struct AtomicUsize { 2: inner: UnsafeCell<usize>, 3: } 4: 5: unsafe impl Send for AtomicUsize {} 6: unsafe impl Sync for AtomicUsize {} 7: 8: impl AtomicUsize { 9: pub const fn new(v: usize) -> Self { 10: Self { 11: inner: UnsafeCell::new(v), 12: } 13: } 14: pub fn load(&self) -> usize { 15: unsafe { *self.inner.get() } 16: } 17: pub fn store(&self, v: usize) { 18: unsafe { 19: asm!( 20: "lock; xchg [{address}], {v}", 21: address = in(reg) self.inner.get(), 22: v = in(reg) v 23: ); 24: } 25: } 26: }
We start by defining our AtomicUsize type to be a
wrapper around an UnsafeCell<usize>. This container
type allows us to modify its contents even if we hold just a shared reference to
it (a concept known as interior mutability). As put in William Henderson's
post:
The
usizeneeds to be within anUnsafeCellto effectively opt out of Rust’s borrow checker […]
Whereas UnsafeCell is the underlying primitive type
that causes the Rust compiler to provide these semantics for the objects
contained therein, it is more commonly used in safe wrappers such as
Cell or RefCell.
It is important to note that despite allowing interior mutability,
UnsafeCell does not perform any synchronization on its
own—one of the key distinguishing features of other types like Mutex. This restriction is visible by the fact that UnsafeCell does not implement the Sync
trait. Without this trait, Rust will not allow us to share a reference to this
type between threads, and thus guarantees that UnsafeCell will never have unsynchronized accesses.
However, this property is problematic for our AtomicUsize type whose entire purpose is to enable reading and
writing of a shared value among multiple threads! For this, the documentation of
UnsafeCell provides us with an escape hatch:
At all times, you must avoid data races. If multiple threads have access to the same UnsafeCell, then any writes must have a proper happens-before relation to all other accesses (or use atomics).
OK, so to safely use an UnsafeCell between threads we
will need to perform synchronization ourselves (establish a happens-before
relation) or use atomics. As the above is building an atomic usize primitive,
we should thus be safe to tell the Rust compiler that our own AtomicUsize type is indeed both Send (can be
moved between threads) and Sync (can be concurrently
used from multiple threads). Because the Rust compiler cannot check whether this
is actually the case, implementing these traits are unsafe operations.
The above example makes use of the fact that, in AMD64, a read-lock is not
required when all operations that modify a shared value use locking
instructions. As such, we can simply dereference the value stored in the
UnsafeCell:
Since our implementations of all the methods which change the value will be atomic, the CPU won’t let us observe the value in the middle of an operation. Therefore, to load the value, we can simply return the current value of the UnsafeCell.
For writing, William Henderson's version uses an AMD64 inline assembly instruction that performs a locking store operation.
2. Optimized to the Breaking Point
With our AtomicUsize implemented, let's write a basic
reader / writer test case: we spawn a thread that spin-waits
until the shared atomic assumes a non-zero value, and another that writes a 1 to
it after a short delay (Rust playground).
1: fn reader(a: &AtomicUsize) { 2: while a.load() == 0 { 3: // Wait until `a` contains a non-zero value. 4: } 5: } 6: 7: fn writer(a: &AtomicUsize) { 8: a.store(1); 9: } 10: 11: fn main() { 12: let shared_atomic = Arc::new(AtomicUsize::new(0)); 13: 14: // Start up the reader thread: 15: let shared_atomic_clone = shared_atomic.clone(); 16: let join_handle = std::thread::spawn( 17: move || reader(&shared_atomic_clone)); 18: 19: // Wait for 50ms: 20: std::thread::sleep(Duration::from_millis(50)); 21: 22: // Write a non-zero value to the shared atomic: 23: writer(&shared_atomic); 24: 25: // Wait for the reader thread to exit: 26: join_handle.join().unwrap() 27: }
When we run this example with the above AtomicUsize
implementation in a debug build, it runs for about 50ms—as expected. However,
once we turn on more aggressive compiler optimizations by building in release
mode, the program does not quit and fully consumes one CPU core … hm, what's
going on here?
If we look at the assembly of our fn reader (by
selecting "Show Assembly" in this Rust playground) we see that Rust generates
the following machine code:
1: fn_reader: 2: cmpq $0, (%rdi) 3: je .LBB24_1 4: retq 5: 6: .LBB24_1: 7: jmp .LBB24_1
Even though our source code calls a.load() for each
loop iteration, it seems like the generated function only reads the
usize value (at an adress in %rdi) once and, if it
happens to be equal to 0, jumps into an infinite loop at symbol
.LBB24_1. That's not at all what we want!
It seems that the Rust compiler determines that it should be enough to read the
value returned by a.load() once, and then assumes that
it may never change. If it was 0 when entering this function, because the
function never modifies it, the compiler thus assumes that it will always stay
at this value and never return from the while
loop. This seems quite counter-intuitive given that the entire purpose behind
UnsafeCell is to allow interior mutability. Thus, Rust
should need to expect that its underlying value changes even though we only hold
an immutable (&) reference to it.
We can observe similar behavior with the following minimal example
(Playground). Here we use Rust's Cell type instead of
UnsafeCell for convenience; Cell is nothing more than a safe wrapper around UnsafeCell.
1: pub fn cell_test(a: &Cell<usize>) -> bool { 2: let first = a.get(); 3: let second = a.get(); 4: first == second 5: }
Looking at the generated assembly, Rust turns this function into a simple "return true":
1: fn_cell_test: 2: movb 1, %al 3: retq
3. UnsafeCell Revisited
With the above behavior, one might wonder what an UnsafeCell is actually useful for? We cannot—by default—share it between
threads and clearly its concepts of interior mutability do not extend to give
any guarantees in the face of concurrent accesses to its memory. So what's the
point? To illustrate this, we can extend the above example like so:
1: use std::cell::Cell; 2: 3: pub fn cell_test(a: &Cell<usize>, writer: &dyn Fn()) -> bool { 4: let first = a.get(); 5: writer(); 6: let second = a.get(); 7: first == second 8: } 9: 10: pub fn main() { 11: let a = Cell::new(0); 12: println!( 13: "Cell contents identical? {:?}", 14: cell_test(&a, &|| { a.set(1) }) 15: ); 16: }
We extend our fn cell_test to take an
additional writer function reference
argument. This writer function is then called
in between our first and second read of the Cell<usize>.
After this change, we can observe that Rust instead generates the following assembly1 (Playground):
1: fn_cell_test: 2: pushq %r14 3: pushq %rbx 4: pushq %rax 5: movq %rdi, %rbx 6: movq (%rdi), %r14 7: movq %rsi, %rdi 8: callq *40(%rdx) 9: cmpq (%rbx), %r14 10: sete %al 11: addq $8, %rsp 12: popq %rbx 13: popq %r14 14: retq
There's a lot more happening here. The important bits are:
- on line 5, we copy the pointer to our
Cell<usize>, initially passed in register%rdi, into%rbx, - on line 6, we read the contents of the
Cell<usize>into register%r14, - on line 8, we invoke the
writerfunction, - and finally, on lines 9 and 10 we compare the current contents of
the
Cell<usize>to the value we read on line 6, and set the return value (%al) totrue($1) orfalse($0) using theseteinstruction.
This makes sense: we're handing out two shared references to the
Cell<usize>, one passed to fn cell_test directly, and one embedded in the closure constructed on line 14 of
fn main. When we invoke writer
on line 5, because it also holds to a reference to this Cell, we must assume that its contents have been changed and thus re-read
it.
We can force the compiler to generate quite similar assembly when we replace the
invocation of writer with a call to std::hint::black_box:
1: pub fn cell_test(a: &Cell<usize>) -> bool { 2: let first = a.get(); 3: std::hint::black_box(a); 4: let second = a.get(); 5: first == second 6: }
From its documentation, std::hint::black_box is
[an] identity function that hints to the compiler to be maximally pessimistic about what
black_boxcould do.
In this case, one of the possible effects that the compiler assumes
black_box to have is performing an a.set(1) operation. Hence it makes sense that black_box would force the compiler to re-read the Cell's contents on the second call to a.get().
However, things get even more interesting when we replace this with a call to
std::hint::black_box(()). In this case, the Rust
compiler will be maximally pessimistic about what black_box could do to its function argument, an instance of the unit
type. Its documentation doesn't say anything about what could happen to other
variables such as a. Yet, when we compile the following
code…
1: pub fn cell_test(a: &Cell<usize>) -> bool { 2: let first = a.get(); 3: std::hint::black_box(()); 4: let second = a.get(); 5: first == second 6: }
…we see that the Cell is indeed read twice. Curious!
1: fn_cell_test: 2: movq (%rdi), %rax 3: cmpq (%rdi), %rax 4: sete %al 5: retq
From these findings we can derive two properties of UnsafeCell:
- For part of the code where the Rust compiler assumes that it has full
visibility over all operations carried out on all references that are
accessible, it may make assumptions about an
UnsafeCell's contents not changing. - Across any code path where Rust does not have this degree of visibility
(e.g., by calling into an opaque function, foreign function, or invoking a
black_box), it instead assumes that anUnsafeCell's contents may have changed.
It is important to note that Rust generally assumes that an UnsafeCell is not shared across multiple threads (apart from the
happens-before condition mentioned above). Thus, even though multiple
references may exist for any UnsafeCell at any time, as
long as the compiler determines that the current thread does not modify a
particular reference, and no other code is invoked that could modify any other
reference to this UnsafeCell, its contents will not
change.
This explains the behavior of our AtomicUsize
example. As part of the load function, we're simply accessing and dereferencing
the contents of the inner UnsafeCell. Rust does not
assume that this value is shared with any other concurrent thread, and by having
full visibility of the operations carried out within the reader thread, it determines that its value may never be modified within
this function; hence reading its value once ought to be sufficient.
4. Concurrency and UnsafeCell
This raises the question: given that UnsafeCell does
not deliver our desired semantics, how are AtomicUsize,
Mutex, and friends actually implemented in Rust? Well
… using UnsafeCell!
Looking at the core::atomic module with macros
expanded, the implementation of AtomicUsize looks
roughly like this:
1: pub struct AtomicUsize { 2: v: UnsafeCell<usize>, 3: }
This seems virtually identical to how our own AtomicUsize is implemented. However, there is a crucial difference when
we look at the AtomicUsize::load function:
1: impl AtomicUsize{ 2: pub fn load(&self, order: Ordering) -> usize { 3: unsafe { atomic_load(self.v.get(), order) } 4: } 5: ... 6: } 7: 8: #[inline] 9: unsafe fn atomic_load<T: Copy>(dst: *const T, order: Ordering) -> T { 10: match order { 11: Relaxed => intrinsics::atomic_load_relaxed(dst), 12: Acquire => intrinsics::atomic_load_acquire(dst), 13: ... 14: } 15: }
In addition to the ability to specify a desired ordering or consistency
model, this implementation uses compiler intrinsics to generate the
corresponding atomic operations. This means that the compiler will automatically
generate the appropriate instructions for the target architecture to perform
these atomic operations. These intrinsics can also enforce other high-level
constraints, such as on the order of operations. In this case, the
Relaxed ordering model corresponds to our custom
implementation of the AtomicUsize type.
Here is a Rust playground that uses the standard library's AtomicUsize type with Ordering::Relaxed
instead. Looking at the generated assembly, we can observe that Rust turns both
the load and store operations
into simple reads and writes with the movq instruction:
1: fn_reader: 2: movq (%rdi), %rax 3: testq %rax, %rax 4: je fn_reader 5: retq 6: 7: fn_writer: 8: movq $1, (%rdi) 9: retq
Superficially, it seems that our implementation and the Rust standard library's
should thus be functionally equivalent! We're generating essentially the same
machine code, and rely on the target-architecture specific guarantee that
naturally aligned load and store operations of usize
values are always atomic.
However, there is another crucial difference: the generated assembly re-reads
the UnsafeCell's value in each loop iteration. Thus,
this AtomicUsize implementation generates actually
correct code—despite producing effectively equivalent instructions otherwise!
Now seems like a good time to revisit the UnsafeCell's
documentation concerning concurrency:
At all times, you must avoid data races. If multiple threads have access to the same UnsafeCell, then any writes must have a proper happens-before relation to all other accesses (or use atomics).
The phrasing here is unfortunate in two regards:
- When we don't have a proper happens-before relation (we'll get to that
later), we need to use atomic operations for concurrent accesses on the
UnsafeCell's memory instead. However, clearly these atomic operations must not only be used for writes, but also for reads! - It is not only important that the generated machine code instructions are
atomic (as is the case with our custom
AtomicUsize). We also need to communicate to the Rust compiler that these instructions are used as atomic operations. Somehow, something magic about theatomic_load_intrinsics causes the compiler to not assume that the memory behind this reference cannot change.
We can confirm the latter by looking at the LLVM intermediate representation (IR) that the Rust compiler generates. This format is then used by the LLVM compiler backend to generate optimized machine code for various architectures. However, for those optimizations to be correct, the Rust compiler has to encode a bunch of information on program behavior into this LLVM IR.
Rust generates the following slightly cryptic LLVM IR for fn reader using our custom AtomicUsize:
1: ; Function Attrs: nofree noinline norecurse nosync nounwind nonlazybind memory(argmem: read) uwtable 2: define dso_local void @fn_reader(ptr nocapture noundef nonnull readonly align 8 %a) unnamed_addr #4 { 3: start: 4: %_2.pr = load i64, ptr %a, align 8 5: %0 = icmp eq i64 %_2.pr, 0 6: br i1 %0, label %bb1, label %bb3 7: 8: bb1: ; preds = %start, %bb1 9: br label %bb1 10: 11: bb3: ; preds = %start 12: ret void 13: }
Let's see what changes if we swap this out for the standard library's
AtomicUsize:
1: ; Function Attrs: nofree noinline norecurse nounwind nonlazybind memory(argmem: readwrite) uwtable 2: define dso_local void @fn_reader(ptr nocapture noundef nonnull readonly align 8 %a) unnamed_addr #4 { 3: start: 4: br label %bb1 5: 6: bb1: ; preds = %bb1, %start 7: %0 = load atomic i64, ptr %a monotonic, align 8 8: %1 = icmp eq i64 %0, 0 9: br i1 %1, label %bb1, label %bb3 10: 11: bb3: ; preds = %bb1 12: ret void 13: }
The changes on line 4 and 7 respectively make sense: instead of a
load instruction, Rust generates a load atomic LLVM instruction. The added monotonic here is the desired consistency model, where monotonic corresponds to Rust's Ordering::Relaxed.
The branching behavior also changes: whereas in the former version the label
bb1: forms a simple infinite loop, the latter performs
a read every time it jumps back to bb1:.
However, the addition of the nosync function attribute
for our custom AtomicUsize version is perhaps most
telling. Here's what LLVM's language reference says about this attribute:
This function attribute indicates that the function does not communicate (synchronize) with another thread through memory or other well-defined means. Synchronization is considered possible in the presence of atomic accesses that enforce an order, thus not “unordered” and “monotonic”, volatile accesses, as well as convergent function calls. […]
If a nosync function does ever synchronize with another thread, the behavior is undefined.
This means that Rust compiler intrinsics such as intrinsics::atomic_load_relaxed implicitly tell the compiler that code
may use these atomic operations to synchronize with other concurrent
code. Without these operations, Rust simply assumes that variables are not
concurrently modified by other code and is thus allowed to reason about them as
if they aren't shared with other threads at all. Ultimately, this is safe as
UnsafeCell is not Sync—by
default, it cannot be shared between threads. And our custom AtomicUsize is unsound, as we promise to the compiler that
AtomicUsize is safe to share between threads, but do
not adequately instruct the compiler to synchronize all accesses to underlying
memory. This is regardless of whether or not the generated machine code uses
atomic instructions.
5. Spooky Action at a Distance?
Finally, let's look at how Mutex is implemented on top
of UnsafeCell. The Mutex type
in the standard library is implemented based on an UnsafeCell and a platform-specific mutex locking mechanism (we can ignore
poison for now):
1: pub struct Mutex<T: ?Sized> { 2: inner: sys::Mutex, 3: poison: poison::Flag, 4: data: UnsafeCell<T>, 5: }
Here, sys::Mutex is platform specific, and happens to
use the futex implementation on Linux:
1: // std::sys::sync::mutex::futex::Mutex 2: pub struct Mutex { 3: futex: AtomicU32, 4: }
The atomic integer type within this futex::Mutex also
varies between systems and happens to be AtomicU32 for
UNIX. Recall from the AtomicUsize example above that
these atomic types in turn are just another wrapper around an UnsafeCell. It's UnsafeCell all the way down!
To get access to the contents of a mutex we need to lock it. The
Mutex::lock function is implemented as follows:
1: pub fn lock(&self) -> LockResult<MutexGuard<'_, T>> { 2: unsafe { 3: self.inner.lock(); 4: MutexGuard::new(self) 5: } 6: }
After calling lock() on the platform-specific inner
Mutex struct, the user is provided a MutexGuard object. This MutexGuard is simply a
wrapper that retains a reference to the original Mutex. Notably, it provides entirely unsynchronized access to the
underlying data (which is simply contained in a UnsafeCell):
1: impl<T: ?Sized> Deref for MutexGuard<'_, T> { 2: type Target = T; 3: 4: fn deref(&self) -> &T { 5: unsafe { &*self.lock.data.get() } 6: } 7: }
This function looks a lot like our custom AtomicUsize::load implementation. In fact, what happens here is quite
similar to this first example: a reference to an UnsafeCell is shared between threads, and accesses to the
UnsafeCell's contents do not use any special intrinsics
or atomic operations. So … this surely isn't sound in practice?!
Of course, Rust's Mutex implementation is correct. To
understand why, we need to look into the implementation of the inner.lock() method. Here's the implementation of futex::Mutex::lock:
1: pub fn lock(&self) { 2: if self.futex.compare_exchange(UNLOCKED, LOCKED, Acquire, Relaxed).is_err() { 3: self.lock_contended(); 4: } 5: }
Essentially, the futex implementation uses an atomic integer shared between
threads to record the current lock state of the mutex. When attempting to lock
the mutex, it uses a compare-exchange operation to atomically write a value of
1 (locked) into this integer, if and only if the current value of the atomic
integer currently contains a value of 0 (unlocked). If the mutex is
currently locked, it asks the operating system to inform it when the lock is
free again (self.lock_contended()). See this excellent
post by Eli Bendersky for more context on futex.
Unfortunately, the implementation of AtomicU32::compare_exchange (as invoked on self.futex) is too complex to fully depict here. However, ultimately this
function ends up calling into the following helper function:
1: unsafe fn atomic_compare_exchange<T: Copy>( 2: dst: *mut T, 3: old: T, 4: new: T, 5: success: Ordering, 6: failure: Ordering, 7: ) -> Result<T, T> { 8: let (val, ok) = unsafe { 9: match (success, failure) { 10: (Relaxed, Relaxed) => { 11: intrinsics::atomic_cxchg_relaxed_relaxed(dst, old, new) 12: } 13: (Relaxed, Acquire) => { 14: intrinsics::atomic_cxchg_relaxed_acquire(dst, old, new) 15: } 16: ...
The implementation here again uses compiler intrinsics to generate the
actual underlying machine code. By compiling a simplified version of
the above (Playground), we can confirm that these intrinsics generate
corresponding lock cmpxchgb instructions
before the value stored in the UnsafeCell
is accessed:
1: fn_reader: 2: movb $1, %cl 3: .LBB24_1: 4: xorl %eax, %eax 5: lock cmpxchgb %cl, 8(%rdi) 6: jne .LBB24_1 7: cmpq $0, (%rdi) 8: movb $0, 8(%rdi) 9: je .LBB24_1 10: retq
Thus, if used correctly, a compare_exchange
operation seems to be sufficient to synchronize accesses to a Rust
UnsafeCell across threads.
…But wait! Our Mutex holds not one, but two UnsafeCells internally. And we only used a compare_exchange on one of the UnsafeCells, namely the one holding information on whether the
Mutex is locked or not. The other
UnsafeCell holding the actual data that the
mutex is supposed to be synchronized never has any atomic or locking
instructions used on it. In fact, we're still using the exact same
problematic code snippet (*self.inner.get())
that led to problems with our AtomicUsize in
the first place!
What we're observing here is that a local operation performed on one value has implicit impact on how the compiler treats assumptions around another, completely independent value. I think that this can be quite surprising and unintuitive; you might call it "spooky action at a distance".
Indeed, the atomic_cxchg compiler intrinsic
again does more than meets the eye: in addition to generating the
appropriate atomic machine instruction(s), it can also establish an
ordering, or happens-before relation of other program
operations. For example, when performing an atomic load operation with
an Acquire ordering, the program is granted
the following guarantee, per Rust's documentation:
When coupled with a load, if the loaded value was written by a store operation with
Release(or stronger) ordering, then all subsequent operations become ordered after that store. In particular, all subsequent loads will see data written before the store.
It is important to understand that these ordering requirements can not only influence the types of atomic machine instructions ultimately executed by the hardware. They can also influence other compiler assumptions and program optimizations, and in particular, limit the flexibility that a compiler has to re-order or elide operations that access memory.
Revisiting the example of our Mutex
implementation (Playground), there are two basic guarantees we need to
maintain:
- We must never give out concurrent access to the
datafield. We do this by acquiring a unique lock with a shared atomic value. - By the next time a thread acquires a lock on the
Mutex, all changes made by the previous holder of the lock must be visible in this new thread.
And memory ordering specifications help us achieve this second
goal. In practice, when we load an atomic value with Ordering::Acquire, we prohibit the compiler to move reads
on variables that could be shared with other threads to before
this operation. As hinted at by the Rust documentation, this is not
enough on its own though: the compiler would still be able to move
writes beyond the point where the mutex is unlocked by the previous
holder of the lock. These writes would thus not necessarily be visible
to the new lock holder, infringing on our guarantees. For this reason,
we release the lock by performing another atomic operation, this time
with Release ordering—preventing this exact
optimization. So long as these atomic operations are performed in
tandem on the same atomic value, the compiler will provide us these
guarantees for all, global variables that may potentially be shared
with other threads.
6. Conclusion
Concurrency and synchronization are tricky subjects on their own. As we have seen above, things get even more nuanced when we throw Rust's concepts around borrowing, references, and its compiler optimizations into the mix. While many of these basic concepts were familiar to me from both practical experience and theoretical computer science lectures, seeing how they play in to the actual implementation of concurrency primitives in a high-level language such as Rust is still interesting (and at times surprising). I hope that this post can demystify some of the "magic" behavior and optimizations you may observe around these constructs.
One of those particularly nebulous constructs to me has always been
UnsafeCell: sure, it "opts out" of the Rust
borrow checker, but what other effects does it have? We need to
"avoid data races", but how to do so exactly? Reasoning about this is
hard, in part because we need to consider both Rust's high-level
language invariants, low-level compiler optimization effects, and
their interactions. Some slightly clearer language in
UnsafeCell's documentation could help a great
amount here. For instance, a "happens-before" relation is well-defined
by LLVM, but there is no clear documentation on which Rust compiler
intrinsics establish it.
While they make sense when thinking about, something particularly surprising are the non-local "spooky" effects that certain compiler intrinsics have on other program behavior. Rust usually requires developers to think locally ("is this reference still alive here?" or "have this variable's contents been moved?"). Instead, these intrinsics have implicit, global effects on program behavior after compiler optimizations, and these effects do not manifest in Rust's type system at all.
7. Aside: What About VolatileCell?
Next to these concurrency primitives that Rust ships, some users
decide to develop their own. There is parking_lot, a crate with more
efficient implementations of synchronization primitives. If you want
to avoid using locks at all, the lockfree crate might be interesting
to you. And one of those special types that is popular among embedded
developers is VolatileCell (like in vcell,
or its Tock equivalent). In this section, we will try to apply some of
the concepts learned above to this construct. Note that while much of
the content of this section is trying to reason about the safety of
VolatileCell, I do not claim for any of
this to be authoritative information. These are mostly just notes I
wrote down while reasoning about this myself.
Embedded systems or bare-metal code commonly operate over memory that is not backed by RAM in the conventional sense. Instead, these memory address represent registers that are provided by peripherals, so-called "Memory Mapped I/O" (MMIO).
In general, these peripherals run in parallel to the CPU—thus, it may be reasonable to effectively model them to be separate threads in your system. They "share" certain MMIO memory addresses with your program and may, at times, even be able to write to arbitrary regular program memory too. Similar to threads, they should follow a contract for when it is safe to read from or write to certain memory.
However, despite behaving similar to a separate thread, these devices can differ in one significant regard: memory accesses (that is, both read or write operations) may also have arbitrary, device-specific side effects. A common example is that of a serial console (UART) controller. These devices feature internal queues that hold on to a limited amount of received bytes, which are later read by the CPU. While a separate thread might expose this as, e.g., a ring-buffer protected by a Mutex, such MMIO peripherals can implement a more efficient, lock-free way of exposing this information. Following example of a read-queue implemented within a serial console controller, it may expose the current queue's head element for every read to a specific queue register, but on each read also immediately discard (dequeue) this head element.
When paired with compiler optimizations as we've seen above, this can be problematic. The compiler makes many assumptions on how memory will behave, which can lead it to elide certain accesses, reorder them, or even insert spurious accesses when it believes this to be safe. These peripherals, though, don't share the compilers understanding around memory behavior. Thus, an optimizing compiler can translate a driver that correctly implements a device's hardware contract into something incorrect, or even dangerous (e.g., when a peripheral can override program memory).
To avoid these issues, Rust provides volatile memory operations (i.e., reads and writes). Volatile operations provide some unique guarantees:
Volatile operations are intended to act on I/O memory, and are guaranteed to not be elided or reordered by the compiler across other volatile operations.
The compiler shouldn’t change the relative order or number of volatile memory operations.
However, notably, volatile operations are independent from atomic operations:
Just like in C, whether an operation is volatile has no bearing whatsoever on questions involving concurrent access from multiple threads. Volatile accesses behave exactly like non-atomic accesses in that regard. In particular, a race between a
read_volatileand any other operation (reading or writing) on the same location is undefined behavior.
These properties make their interactions with synchronization,
ordering, and other compiler optimizations all the more
interesting. For this post, we'll focus only on three things that have
recently come up in discussions around Tock's use of VolatileCell. VolatileCell is not
something that Rust provides; you can view its implementation here. In
fact, its soundness is subject to debate (e.g., in the Rust unsafe
code guidelines) and is the motivation for this post.
7.1. Introducing VolatileCell
VolatileCell's purpose is to make a
developer's life easier: peripherals that expose an MMIO-based
interface typically do so by having their registers laid out such
that it can be modeled like a #[repr(C)] struct, like this:
1: #[repr(C)] 2: pub struct UartRegisters { 3: /// Transmit Data Register 4: txdata: u32, 5: /// Receive Data Register 6: rxdata: u32, 7: /// Transmit Control Register 8: txctrl: u32, 9: /// Receive Control Register 10: rxctrl: u32, 11: /// Interrupt Enable Register 12: ie: u32, 13: /// Interrupt Pending Register 14: ip: u32, 15: /// Baud Rate Divisor Register 16: div: u32, 17: }
If we then cast a pointer at the address at which the peripheral's
MMIO interface is exposed in memory (its base address) to a
&mut UartRegisters reference, we have a
convenient way to access its registers.
Unfortunately, doing so would not be sound. Given that the peripheral can act like a different thread and change the values of these registers independent of the CPU, creating a unique, mutable reference to this struct would infringe on Rust's "no mutable aliasing" requirement. Despite that, regular memory accesses are subject to the compiler optimizations described above; for these registers, we need all accesses to be volatile.
This is the niche that VolatileCell fills!
It's quite simple, and like all things we discuss today wraps an
UnsafeCell internally:
1: pub struct VolatileCell<T> { 2: value: UnsafeCell<T>, 3: }
To implement these volatile accesses, it features custom
get and set methods
respectively:
1: impl<T> VolatileCell<T> { 2: pub fn get(&self) -> T { 3: // self.value.get() returns a *mut T pointer here: 4: unsafe { ptr::read_volatile(self.value.get()) } 5: } 6: 7: pub fn set(&self, value: T) { 8: unsafe { ptr::write_volatile(self.value.get(), value) } 9: } 10: }
Thus, in essence, VolatileCell combines the
properties of interior mutability with those of volatile memory
accesses. With it, we can transform the above struct into a similar
version, while still benefiting from a convenient API:
1: pub struct UartRegisters { 2: /// Transmit Data Register 3: txdata: VolatileCell<u32>, 4: /// Receive Data Register 5: rxdata: VolatileCell<u32>, 6: ...
Neat.
7.2. VolatileCell and dereferenceable
Unfortunately, VolatileCell has issues. I
encourage you to read the discussion on the Rust unsafe code
guidelines, but in short, this problem comes from the fact that all
references in Rust are marked as dereferenceable by default.
This dereferenceability means that the compiler is free to, for instance as part of certain optimizations, insert a spurious read to the reference's contents. And performing such a spurious read on a memory location where reads can have side effects can mean that for many uses, using this abstraction can be dangerous and unsound.
Unfortunately, we do not have a good answer around this problem yet. For now, it seems that the only way to safely interact with such volatile memory is through volatile pointer operations, and never creating a Rust reference to the memory in question.
7.3. VolatileCell and Concurrency
Another interesting concern is that of concurrency. As peripherals can
be modeled as to behave effectively like another thread, we should
make sure that VolatileCell is safe to use in
these contexts—that is, making sure that it is safe to have a
VolatileCell defined over memory that is
being changed by hardware, and having all accesses be properly
synchronized.
Our analysis of Rust's Mutex gives us
confidence that the mere existence of VolatileCell over memory that is being modified by hardware
should not be of concern. In particular, this issue was raised on a PR
for the Tock operating system: given that UnsafeCell does not make any guarantees about thread safety,
and Rust is always free to dereference its contents based on the
dereferenceable attribute, how can it possibly be safe to retain a
reference to it in the first place? However, we can observe that
Mutex does exactly this: it holds a reference
to an UnsafeCell even when it may be modified
concurrently by a different thread. Importantly, Mutex does ensure that any direct, intentional accesses of
its value are properly synchronized.
This brings us to our second question on synchronization. Here, the following statement is particularly worrying:
In particular, a race between a
read_volatileand any other operation (reading or writing) on the same location is undefined behavior.
Does using read_volatile over MMIO memory
that may arbitrary change contents across any two accesses count as a
"data race"?
Unfortunately, atomic operations do not help here either:
Since C++ does not support mixing atomic and non-atomic accesses, or non-synchronized different-sized accesses to the same data, Rust does not support those operations either.
I do not see a clear answer to this question. In practice, many developers rely on volatile accesses are for interacting with such hardware-modified registers without issues. One of the reasons for this might be that on most hardware platforms, MMIO read operations are always consistent: registers are updated atomically by peripherals, and a CPU will never be exposed to a partial write. It would be great if the Rust documentation could provide more guidance for these questions.
7.4. VolatileCell and Happens-Before Relations
Finally, one important aspect to consider is that of volatile operations and their interactions with regular memory accesses. Again, Rust provides the following guarantees for volatile memory operations [emphasis added]:
Volatile operations are intended to act on I/O memory, and are guaranteed to not be elided or reordered by the compiler across other volatile operations.
This is a substantially different guarantee than the ordering-constraints imposed by atomic operations: while atomics can establish a happens-before relation between two parts of parallel threads in a program that affects a global set of variables, ordering guarantees around volatile operations are constrained to only these operations. This makes sense for many use-cases of volatile operations. For instance, it may matter that an "interrupt clear" flag is written before data is removed from a receive queue, but generally a peripheral should not be affected by other, independent memory accesses outside of its MMIO address space.
However, certain peripherals break this assumption, most prominently ones that perform "Direct Memory Access" (DMA). A DMA-capable peripheral not only has control over its MMIO registers, but can also access (a subset of) the same RAM that is accessible to the CPU. These devices can read from or write to a specified memory region after being instructed to do so through a write to a MMIO register.
This interaction between volatile accesses to an MMIO register and access to regular memory can create issues in the face of compiler optimizations, or hardware-reordering of memory accesses, though. For example, given that the compiler does not make promises of the order of volatile operations relative to other memory accesses, it could happen that a peripheral could read from a buffer prepared by the CPU, when those buffer writes have not actually been committed to memory yet.
For this reason, we need to establish an additional, explicit ordering
guarantee between other memory operations and these volatile
accesses. Rust provides us the std::sync::atomic::fence function to do just that. This
function works similar to atomic operations and takes an
Ordering argument. It will ensure that
compiler optimizations respect the requested ordering guarantees and,
if applicable, will also emit a CPU instruction that prevents
hardware-based reordering of accesses beyond this point.
Thus, when kicking off an operation that reads from memory written
by the CPU, we must use a fence operation
with an Ordering::Release argument, to ensure
that all writes become visible to another thread (or the hardware)
that is synchronizing with our current thread. For reading data that
the hardware has written into memory, we must use a fence operation with Ordering::Acquire
instead. This ensures that we are seeing all changes since the point
at which the hardware has informed us that an operation has finished
(e.g., through a volatile read or interrupt).
It should be noted that this relies on a fairly liberal interpretation
of the following comment in std::sync::atomic::fence's documentation:
A fence ‘A’ which has (at least) Release ordering semantics, synchronizes with a fence ‘B’ with (at least) Acquire semantics, if and only if there exist operations X and Y, both operating on some atomic object ‘M’ such that A is sequenced before X, Y is sequenced before B and Y observes the change to M. This provides a happens-before dependence between A and B.
Here, we assume that the volatile reads and writes to MMIO registers count as operations that are operating on an atomic object. However, this seems to be a fair assumption to make on many hardware platforms in practice, and clarifying it likely also ties in to the considerations of the previous subsection.
Footnotes:
In practice, Rust knows which exact function we pass into the
writer argument and generates a fn_cell_test.specialized.1: symbol that it calls instead. Because our
fn cell_test is marked pub
though, Rust also generates the general fn_cell_test:
symbol that does not make any assumptions on the function passed into its second
argument.