Rust's Mutex, Atomics and UnsafeCell – Spooky Action at a Distance?
August 07, 2024 – Leon Schuermann – #rust
A defining feature of Rust is its concept of aliasing ⊕ mutability. This rule dictates that, at any time, a value may either have multiple immutable shared references, or a single mutable unique reference, and never both. While this greatly helps in producing fast, efficient and correct code, it can be limiting. To this end, Rust also features types that bend these rules, like Mutex, RwLock, Cell, RefCell, and the ominous UnsafeCell. In this post we explore how these types interact with Rust's type system and concepts of references and aliasing ⊕ mutability. We do so by looking at how the AtomicUsize and Mutex types are implemented, how violating Rust's assumptions can lead to incorrect optimizations by the compiler, and the surprising global impact of synchronization primitives.
1. Atomics From Scratch
William Henderson's blog features a great article on how to implement Rust's atomic types (and ultimately even Mutexes) from scratch. I encourage reading the full article, but will repeat the essential parts of the code here (for an AMD64 CPU) as well:
use std::arch::asm;
use std::cell::UnsafeCell;

pub struct AtomicUsize {
    inner: UnsafeCell<usize>,
}

unsafe impl Send for AtomicUsize {}
unsafe impl Sync for AtomicUsize {}

impl AtomicUsize {
    pub const fn new(v: usize) -> Self {
        Self {
            inner: UnsafeCell::new(v),
        }
    }

    pub fn load(&self) -> usize {
        unsafe { *self.inner.get() }
    }

    pub fn store(&self, v: usize) {
        unsafe {
            asm!(
                "lock; xchg [{address}], {v}",
                address = in(reg) self.inner.get(),
                v = in(reg) v,
            );
        }
    }
}
We start by defining our AtomicUsize type to be a wrapper around an UnsafeCell<usize>. This container type allows us to modify its contents even if we hold just a shared reference to it (a concept known as interior mutability). As put in William Henderson's post:
The usize needs to be within an UnsafeCell to effectively opt out of Rust’s borrow checker […]
Whereas UnsafeCell is the underlying primitive type that causes the Rust compiler to provide these semantics for the objects contained therein, it is more commonly used in safe wrappers such as Cell or RefCell.
It is important to note that despite allowing interior mutability, UnsafeCell does not perform any synchronization on its own—one of the key distinguishing features of other types like Mutex. This restriction is visible in the fact that UnsafeCell does not implement the Sync trait. Without this trait, Rust will not allow us to share a reference to this type between threads, and thus guarantees that UnsafeCell will never have unsynchronized accesses.
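For example, the following sketch (my own, not from the original article) fails to compile, because UnsafeCell<usize> is not Sync, and thus a reference to it may not cross a thread boundary:

use std::cell::UnsafeCell;

fn main() {
    let cell = UnsafeCell::new(0usize);
    let r = &cell;
    // error[E0277]: `UnsafeCell<usize>` cannot be shared between threads safely
    std::thread::scope(|s| {
        s.spawn(|| unsafe { *r.get() });
    });
}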
However, this property is problematic for our AtomicUsize type, whose entire purpose is to enable reading and writing of a shared value among multiple threads! For this, the documentation of UnsafeCell provides us with an escape hatch:
At all times, you must avoid data races. If multiple threads have access to the same UnsafeCell, then any writes must have a proper happens-before relation to all other accesses (or use atomics).
OK, so to safely use an UnsafeCell between threads we will need to perform synchronization ourselves (establish a happens-before relation) or use atomics. As the above is building an atomic usize primitive, we should thus be safe to tell the Rust compiler that our own AtomicUsize type is indeed both Send (can be moved between threads) and Sync (can be concurrently used from multiple threads). Because the Rust compiler cannot check whether this is actually the case, implementing these traits is an unsafe operation.
The above example makes use of the fact that, on AMD64, a read-lock is not required when all operations that modify a shared value use locking instructions. As such, we can simply dereference the value stored in the UnsafeCell:
Since our implementations of all the methods which change the value will be atomic, the CPU won’t let us observe the value in the middle of an operation. Therefore, to load the value, we can simply return the current value of the UnsafeCell.
For writing, William Henderson's version uses an AMD64 inline assembly instruction that performs a locking store operation.
2. Optimized to the Breaking Point
With our AtomicUsize implemented, let's write a basic reader / writer test case: we spawn a thread that spin-waits until the shared atomic assumes a non-zero value, and another that writes a 1 to it after a short delay (Rust playground).
use std::sync::Arc;
use std::time::Duration;

fn reader(a: &AtomicUsize) {
    while a.load() == 0 {
        // Wait until `a` contains a non-zero value.
    }
}

fn writer(a: &AtomicUsize) {
    a.store(1);
}

fn main() {
    let shared_atomic = Arc::new(AtomicUsize::new(0));

    // Start up the reader thread:
    let shared_atomic_clone = shared_atomic.clone();
    let join_handle = std::thread::spawn(
        move || reader(&shared_atomic_clone));

    // Wait for 50ms:
    std::thread::sleep(Duration::from_millis(50));

    // Write a non-zero value to the shared atomic:
    writer(&shared_atomic);

    // Wait for the reader thread to exit:
    join_handle.join().unwrap()
}
When we run this example with the above AtomicUsize implementation in a debug build, it runs for about 50ms—as expected. However, once we turn on more aggressive compiler optimizations by building in release mode, the program does not quit and fully consumes one CPU core … hm, what's going on here?
If we look at the assembly of our fn reader (by selecting "Show Assembly" in this Rust playground) we see that Rust generates the following machine code:
fn_reader:
        cmpq    $0, (%rdi)
        je      .LBB24_1
        retq

.LBB24_1:
        jmp     .LBB24_1
Even though our source code calls a.load() for each loop iteration, it seems like the generated function only reads the usize value (at an address in %rdi) once and, if it happens to be equal to 0, jumps into an infinite loop at symbol .LBB24_1. That's not at all what we want!
It seems that the Rust compiler determines that it should be enough to read the value returned by a.load() once, and then assumes that it may never change. If the value was 0 when entering this function, the compiler assumes that, since the function never modifies it, it will stay at this value forever and never return from the while loop. This seems quite counter-intuitive given that the entire purpose behind UnsafeCell is to allow interior mutability. Thus, Rust should need to expect that its underlying value changes even though we only hold an immutable (&) reference to it.
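In other words, the optimizer effectively rewrites our loop into something like the following sketch (pseudocode of the optimized behavior, not actual source from anywhere):

fn reader_optimized(a: &AtomicUsize) {
    // The load has been hoisted out of the loop: the compiler assumes
    // that nothing can change the value behind our back in this function.
    if a.load() == 0 {
        // Never re-checks the value:
        loop {}
    }
}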
We can observe similar behavior with the following minimal example (Playground). Here we use Rust's Cell type instead of UnsafeCell for convenience; Cell is nothing more than a safe wrapper around UnsafeCell.
use std::cell::Cell;

pub fn cell_test(a: &Cell<usize>) -> bool {
    let first = a.get();
    let second = a.get();
    first == second
}
Looking at the generated assembly, Rust turns this function into a simple "return true":
fn_cell_test:
        movb    $1, %al
        retq
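As an aside, such a safe wrapper is straightforward to sketch on top of UnsafeCell. The following MyCell is a hypothetical, simplified version of what std's Cell provides (the real type has more methods and also supports non-Copy values):

use std::cell::UnsafeCell;

pub struct MyCell<T> {
    value: UnsafeCell<T>,
}

impl<T: Copy> MyCell<T> {
    pub const fn new(v: T) -> Self {
        Self { value: UnsafeCell::new(v) }
    }

    // Sound despite the shared reference: MyCell inherits !Sync from
    // UnsafeCell, so no other thread can access it concurrently.
    pub fn get(&self) -> T {
        unsafe { *self.value.get() }
    }

    pub fn set(&self, v: T) {
        unsafe { *self.value.get() = v }
    }
}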
3. UnsafeCell Revisited
With the above behavior, one might wonder: what is an UnsafeCell actually useful for? We cannot—by default—share it between threads, and clearly its concept of interior mutability does not extend to give any guarantees in the face of concurrent accesses to its memory. So what's the point? To illustrate this, we can extend the above example like so:
 1: use std::cell::Cell;
 2:
 3: pub fn cell_test(a: &Cell<usize>, writer: &dyn Fn()) -> bool {
 4:     let first = a.get();
 5:     writer();
 6:     let second = a.get();
 7:     first == second
 8: }
 9:
10: pub fn main() {
11:     let a = Cell::new(0);
12:     println!(
13:         "Cell contents identical? {:?}",
14:         cell_test(&a, &|| { a.set(1) })
15:     );
16: }
We extend our fn cell_test to take an additional writer function reference argument. This writer function is then called in between our first and second read of the Cell<usize>.
After this change, we can observe that Rust instead generates the following assembly [1] (Playground):
 1: fn_cell_test:
 2:     pushq   %r14
 3:     pushq   %rbx
 4:     pushq   %rax
 5:     movq    %rdi, %rbx
 6:     movq    (%rdi), %r14
 7:     movq    %rsi, %rdi
 8:     callq   *40(%rdx)
 9:     cmpq    (%rbx), %r14
10:     sete    %al
11:     addq    $8, %rsp
12:     popq    %rbx
13:     popq    %r14
14:     retq
There's a lot more happening here. The important bits are:
- On line 5, we copy the pointer to our Cell<usize>, initially passed in register %rdi, into %rbx.
- On line 6, we read the contents of the Cell<usize> into register %r14.
- On line 8, we invoke the writer function.
- Finally, on lines 9 and 10, we compare the current contents of the Cell<usize> to the value we read on line 6, and set the return value (%al) to true ($1) or false ($0) using the sete instruction.
This makes sense: we're handing out two shared references to the Cell<usize>, one passed to fn cell_test directly, and one embedded in the closure constructed on line 14 of fn main. When we invoke writer on line 5, because it also holds a reference to this Cell, we must assume that its contents have been changed and thus re-read it.
We can force the compiler to generate quite similar assembly when we replace the invocation of writer with a call to std::hint::black_box:
pub fn cell_test(a: &Cell<usize>) -> bool {
    let first = a.get();
    std::hint::black_box(a);
    let second = a.get();
    first == second
}
From its documentation, std::hint::black_box is

[an] identity function that hints to the compiler to be maximally pessimistic about what black_box could do.
In this case, one of the possible effects that the compiler assumes black_box to have is performing an a.set(1) operation. Hence it makes sense that black_box would force the compiler to re-read the Cell's contents on the second call to a.get().
However, things get even more interesting when we replace this with a call to std::hint::black_box(()). In this case, the Rust compiler will be maximally pessimistic about what black_box could do to its function argument, an instance of the unit type. Its documentation doesn't say anything about what could happen to other variables such as a. Yet, when we compile the following code…
pub fn cell_test(a: &Cell<usize>) -> bool {
    let first = a.get();
    std::hint::black_box(());
    let second = a.get();
    first == second
}
…we see that the Cell is indeed read twice. Curious!
fn_cell_test:
        movq    (%rdi), %rax
        cmpq    (%rdi), %rax
        sete    %al
        retq
From these findings we can derive two properties of UnsafeCell:

- For parts of the code where the Rust compiler assumes that it has full visibility over all operations carried out on all accessible references, it may make assumptions about an UnsafeCell's contents not changing.
- Across any code path where Rust does not have this degree of visibility (e.g., by calling into an opaque function, foreign function, or invoking a black_box), it instead assumes that an UnsafeCell's contents may have changed.
It is important to note that Rust generally assumes that an UnsafeCell is not shared across multiple threads (apart from the happens-before condition mentioned above). Thus, even though multiple references may exist for any UnsafeCell at any time, as long as the compiler determines that the current thread does not modify a particular reference, and no other code is invoked that could modify any other reference to this UnsafeCell, its contents will not change.
This explains the behavior of our AtomicUsize example. As part of the load function, we're simply accessing and dereferencing the contents of the inner UnsafeCell. Rust does not assume that this value is shared with any other concurrent thread, and by having full visibility of the operations carried out within the reader thread, it determines that its value may never be modified within this function; hence reading its value once ought to be sufficient.
4. Concurrency and UnsafeCell
This raises the question: given that UnsafeCell does not deliver our desired semantics, how are AtomicUsize, Mutex, and friends actually implemented in Rust? Well … using UnsafeCell!
Looking at the core::atomic module with macros expanded, the implementation of AtomicUsize looks roughly like this:
pub struct AtomicUsize {
    v: UnsafeCell<usize>,
}
This seems virtually identical to how our own AtomicUsize is implemented. However, there is a crucial difference when we look at the AtomicUsize::load function:
impl AtomicUsize {
    pub fn load(&self, order: Ordering) -> usize {
        unsafe { atomic_load(self.v.get(), order) }
    }
    ...
}

#[inline]
unsafe fn atomic_load<T: Copy>(dst: *const T, order: Ordering) -> T {
    match order {
        Relaxed => intrinsics::atomic_load_relaxed(dst),
        Acquire => intrinsics::atomic_load_acquire(dst),
        ...
    }
}
In addition to the ability to specify a desired ordering or consistency model, this implementation uses compiler intrinsics to generate the corresponding atomic operations. This means that the compiler will automatically generate the appropriate instructions for the target architecture to perform these atomic operations. These intrinsics can also enforce other high-level constraints, such as on the order of operations. In this case, the Relaxed ordering model corresponds to our custom implementation of the AtomicUsize type.
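As a sketch, porting the reader and writer functions of our earlier example to the standard library's type with Relaxed ordering looks like this:

use std::sync::atomic::{AtomicUsize, Ordering};

fn reader(a: &AtomicUsize) {
    // The atomic load tells the compiler that another thread may change
    // the value concurrently, so it is re-read on every iteration.
    while a.load(Ordering::Relaxed) == 0 {}
}

fn writer(a: &AtomicUsize) {
    a.store(1, Ordering::Relaxed);
}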
Here is a Rust playground that uses the standard library's AtomicUsize type with Ordering::Relaxed instead. Looking at the generated assembly, we can observe that Rust turns both the load and store operations into simple reads and writes with the movq instruction:
fn_reader:
        movq    (%rdi), %rax
        testq   %rax, %rax
        je      fn_reader
        retq

fn_writer:
        movq    $1, (%rdi)
        retq
Superficially, it seems that our implementation and the Rust standard library's should thus be functionally equivalent! We're generating essentially the same machine code, and rely on the target-architecture specific guarantee that naturally aligned load and store operations of usize values are always atomic.
However, there is another crucial difference: the generated assembly re-reads the UnsafeCell's value in each loop iteration. Thus, this AtomicUsize implementation actually generates correct code—despite producing effectively equivalent instructions otherwise!
Now seems like a good time to revisit the UnsafeCell's documentation concerning concurrency:
At all times, you must avoid data races. If multiple threads have access to the same UnsafeCell, then any writes must have a proper happens-before relation to all other accesses (or use atomics).
The phrasing here is unfortunate in two regards:

- When we don't have a proper happens-before relation (we'll get to that later), we need to use atomic operations for concurrent accesses to the UnsafeCell's memory instead. However, these atomic operations must clearly be used not only for writes, but also for reads!
- It is not only important that the generated machine code instructions are atomic (as is the case with our custom AtomicUsize). We also need to communicate to the Rust compiler that these instructions are used as atomic operations. Somehow, something magic about the atomic_load_ intrinsics makes the compiler drop its assumption that the memory behind this reference cannot change.
We can confirm the latter by looking at the LLVM intermediate representation (IR) that the Rust compiler generates. This format is then used by the LLVM compiler backend to generate optimized machine code for various architectures. However, for those optimizations to be correct, the Rust compiler has to encode a bunch of information on program behavior into this LLVM IR.
Rust generates the following slightly cryptic LLVM IR for fn reader using our custom AtomicUsize:
 1: ; Function Attrs: nofree noinline norecurse nosync nounwind nonlazybind memory(argmem: read) uwtable
 2: define dso_local void @fn_reader(ptr nocapture noundef nonnull readonly align 8 %a) unnamed_addr #4 {
 3: start:
 4:   %_2.pr = load i64, ptr %a, align 8
 5:   %0 = icmp eq i64 %_2.pr, 0
 6:   br i1 %0, label %bb1, label %bb3
 7:
 8: bb1:                                    ; preds = %start, %bb1
 9:   br label %bb1
10:
11: bb3:                                    ; preds = %start
12:   ret void
13: }
Let's see what changes if we swap this out for the standard library's AtomicUsize:
 1: ; Function Attrs: nofree noinline norecurse nounwind nonlazybind memory(argmem: readwrite) uwtable
 2: define dso_local void @fn_reader(ptr nocapture noundef nonnull readonly align 8 %a) unnamed_addr #4 {
 3: start:
 4:   br label %bb1
 5:
 6: bb1:                                    ; preds = %bb1, %start
 7:   %0 = load atomic i64, ptr %a monotonic, align 8
 8:   %1 = icmp eq i64 %0, 0
 9:   br i1 %1, label %bb1, label %bb3
10:
11: bb3:                                    ; preds = %bb1
12:   ret void
13: }
The changes on lines 4 and 7 make sense: instead of a load instruction, Rust generates a load atomic LLVM instruction. The added monotonic is the desired consistency model, where monotonic corresponds to Rust's Ordering::Relaxed.
The branching behavior also changes: whereas in the former version the label bb1: forms a simple infinite loop, the latter performs a read every time it jumps back to bb1:.
However, the addition of the nosync function attribute for our custom AtomicUsize version is perhaps most telling. Here's what LLVM's language reference says about this attribute:
This function attribute indicates that the function does not communicate (synchronize) with another thread through memory or other well-defined means. Synchronization is considered possible in the presence of atomic accesses that enforce an order, thus not “unordered” and “monotonic”, volatile accesses, as well as convergent function calls. […]
If a nosync function does ever synchronize with another thread, the behavior is undefined.
This means that Rust compiler intrinsics such as intrinsics::atomic_load_relaxed implicitly tell the compiler that code may use these atomic operations to synchronize with other concurrent code. Without these operations, Rust simply assumes that variables are not concurrently modified by other code and is thus allowed to reason about them as if they aren't shared with other threads at all. Ultimately, this is safe as UnsafeCell is not Sync—by default, it cannot be shared between threads. And our custom AtomicUsize is unsound, as we promise to the compiler that AtomicUsize is safe to share between threads, but do not adequately instruct the compiler to synchronize all accesses to the underlying memory. This is regardless of whether or not the generated machine code uses atomic instructions.
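As an aside, one way to repair the custom load method would be to route the access through a real atomic type. The following is a sketch under the assumption that the cell's memory is suitably aligned (AtomicUsize::from_ptr has been stable since Rust 1.75; the load_fixed name is made up):

impl AtomicUsize {
    pub fn load_fixed(&self) -> usize {
        // Reinterpret the UnsafeCell's memory as a standard library atomic,
        // so the compiler knows other threads may modify it concurrently.
        unsafe { core::sync::atomic::AtomicUsize::from_ptr(self.inner.get()) }
            .load(core::sync::atomic::Ordering::Relaxed)
    }
}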
5. Spooky Action at a Distance?
Finally, let's look at how Mutex is implemented on top of UnsafeCell. The Mutex type in the standard library is implemented based on an UnsafeCell and a platform-specific mutex locking mechanism (we can ignore poison for now):
pub struct Mutex<T: ?Sized> {
    inner: sys::Mutex,
    poison: poison::Flag,
    data: UnsafeCell<T>,
}
Here, sys::Mutex is platform specific, and happens to use the futex implementation on Linux:
// std::sys::sync::mutex::futex::Mutex
pub struct Mutex {
    futex: AtomicU32,
}
The atomic integer type within this futex::Mutex also varies between systems and happens to be AtomicU32 for UNIX. Recall from the AtomicUsize example above that these atomic types in turn are just another wrapper around an UnsafeCell. It's UnsafeCell all the way down!
To get access to the contents of a mutex we need to lock it. The Mutex::lock function is implemented as follows:
pub fn lock(&self) -> LockResult<MutexGuard<'_, T>> {
    unsafe {
        self.inner.lock();
        MutexGuard::new(self)
    }
}
After calling lock() on the platform-specific inner Mutex struct, the user is provided a MutexGuard object. This MutexGuard is simply a wrapper that retains a reference to the original Mutex. Notably, it provides entirely unsynchronized access to the underlying data (which is simply contained in an UnsafeCell):
impl<T: ?Sized> Deref for MutexGuard<'_, T> {
    type Target = T;

    fn deref(&self) -> &T {
        unsafe { &*self.lock.data.get() }
    }
}
This function looks a lot like our custom AtomicUsize::load implementation. In fact, what happens here is quite similar to this first example: a reference to an UnsafeCell is shared between threads, and accesses to the UnsafeCell's contents do not use any special intrinsics or atomic operations. So … this surely isn't sound in practice?!
Of course, Rust's Mutex implementation is correct. To understand why, we need to look into the implementation of the inner.lock() method. Here's the implementation of futex::Mutex::lock:
pub fn lock(&self) {
    if self.futex.compare_exchange(UNLOCKED, LOCKED, Acquire, Relaxed).is_err() {
        self.lock_contended();
    }
}
Essentially, the futex implementation uses an atomic integer shared between threads to record the current lock state of the mutex. When attempting to lock the mutex, it uses a compare-exchange operation to atomically write a value of 1 (locked) into this integer, if and only if the integer currently contains a value of 0 (unlocked). If the mutex is currently locked, it asks the operating system to inform it when the lock is free again (self.lock_contended()). See this excellent post by Eli Bendersky for more context on futex.
Unfortunately, the implementation of AtomicU32::compare_exchange (as invoked on self.futex) is too complex to fully depict here. However, ultimately this function ends up calling into the following helper function:
unsafe fn atomic_compare_exchange<T: Copy>(
    dst: *mut T,
    old: T,
    new: T,
    success: Ordering,
    failure: Ordering,
) -> Result<T, T> {
    let (val, ok) = unsafe {
        match (success, failure) {
            (Relaxed, Relaxed) => {
                intrinsics::atomic_cxchg_relaxed_relaxed(dst, old, new)
            }
            (Relaxed, Acquire) => {
                intrinsics::atomic_cxchg_relaxed_acquire(dst, old, new)
            }
            ...
The implementation here again uses compiler intrinsics to generate the actual underlying machine code. By compiling a simplified version of the above (Playground), we can confirm that these intrinsics generate corresponding lock cmpxchgb instructions before the value stored in the UnsafeCell is accessed:
fn_reader:
        movb    $1, %cl
.LBB24_1:
        xorl    %eax, %eax
        lock    cmpxchgb %cl, 8(%rdi)
        jne     .LBB24_1
        cmpq    $0, (%rdi)
        movb    $0, 8(%rdi)
        je      .LBB24_1
        retq
Thus, if used correctly, a compare_exchange operation seems to be sufficient to synchronize accesses to a Rust UnsafeCell across threads.
…But wait! Our Mutex holds not one, but two UnsafeCells internally. And we only used a compare_exchange on one of the UnsafeCells, namely the one holding information on whether the Mutex is locked or not. The other UnsafeCell, holding the actual data that the mutex is supposed to synchronize, never has any atomic or locking instructions used on it. In fact, we're still using the exact same problematic code snippet (*self.inner.get()) that led to problems with our AtomicUsize in the first place!
What we're observing here is that a local operation performed on one value has implicit impact on how the compiler treats assumptions around another, completely independent value. I think that this can be quite surprising and unintuitive; you might call it "spooky action at a distance".
Indeed, the atomic_cxchg compiler intrinsic again does more than meets the eye: in addition to generating the appropriate atomic machine instruction(s), it can also establish an ordering, or happens-before relation, of other program operations. For example, when performing an atomic load operation with an Acquire ordering, the program is granted the following guarantee, per Rust's documentation:

When coupled with a load, if the loaded value was written by a store operation with Release (or stronger) ordering, then all subsequent operations become ordered after that store. In particular, all subsequent loads will see data written before the store.
It is important to understand that these ordering requirements can not only influence the types of atomic machine instructions ultimately executed by the hardware. They can also influence other compiler assumptions and program optimizations, and in particular, limit the flexibility that a compiler has to re-order or elide operations that access memory.
Revisiting the example of our Mutex implementation (Playground), there are two basic guarantees we need to maintain:

- We must never give out concurrent access to the data field. We do this by acquiring a unique lock with a shared atomic value.
- By the next time a thread acquires a lock on the Mutex, all changes made by the previous holder of the lock must be visible in this new thread.
And memory ordering specifications help us achieve this second goal. In practice, when we load an atomic value with Ordering::Acquire, we prohibit the compiler from moving reads of variables that could be shared with other threads to before this operation. As hinted at by the Rust documentation, this is not enough on its own though: the compiler would still be able to move writes beyond the point where the mutex is unlocked by the previous holder of the lock. These writes would thus not necessarily be visible to the new lock holder, infringing on our guarantees. For this reason, we release the lock by performing another atomic operation, this time with Release ordering—preventing this exact optimization. So long as these atomic operations are performed in tandem on the same atomic value, the compiler will provide us these guarantees for all global variables that may potentially be shared with other threads.
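A minimal spinlock-style sketch can tie these pieces together. This is illustrative only (the SpinMutex name and with_lock API are made up for this post; the real std Mutex additionally uses futexes, guard types, and poisoning):

use std::cell::UnsafeCell;
use std::sync::atomic::{AtomicU32, Ordering};

const UNLOCKED: u32 = 0;
const LOCKED: u32 = 1;

pub struct SpinMutex<T> {
    state: AtomicU32,
    data: UnsafeCell<T>,
}

// Safety: all access to `data` is mediated by the `state` lock below.
unsafe impl<T: Send> Sync for SpinMutex<T> {}

impl<T> SpinMutex<T> {
    pub const fn new(v: T) -> Self {
        Self {
            state: AtomicU32::new(UNLOCKED),
            data: UnsafeCell::new(v),
        }
    }

    pub fn with_lock<R>(&self, f: impl FnOnce(&mut T) -> R) -> R {
        // Acquire: accesses of `data` below cannot be moved before this.
        while self
            .state
            .compare_exchange(UNLOCKED, LOCKED, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            std::hint::spin_loop();
        }
        // We now hold the lock and have exclusive access to `data`.
        let ret = f(unsafe { &mut *self.data.get() });
        // Release: writes to `data` above cannot be moved after this unlock.
        self.state.store(UNLOCKED, Ordering::Release);
        ret
    }
}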
6. Conclusion
Concurrency and synchronization are tricky subjects on their own. As we have seen above, things get even more nuanced when we throw Rust's concepts around borrowing, references, and its compiler optimizations into the mix. While many of these basic concepts were familiar to me from both practical experience and theoretical computer science lectures, seeing how they play into the actual implementation of concurrency primitives in a high-level language such as Rust is still interesting (and at times surprising). I hope that this post can demystify some of the "magic" behavior and optimizations you may observe around these constructs.
One of those particularly nebulous constructs to me has always been UnsafeCell: sure, it "opts out" of the Rust borrow checker, but what other effects does it have? We need to "avoid data races", but how exactly? Reasoning about this is hard, in part because we need to consider Rust's high-level language invariants, low-level compiler optimization effects, and their interactions. Some slightly clearer language in UnsafeCell's documentation could help a great amount here. For instance, a "happens-before" relation is well-defined by LLVM, but there is no clear documentation on which Rust compiler intrinsics establish it.
While they make sense once thought through, the non-local "spooky" effects that certain compiler intrinsics have on other program behavior are particularly surprising. Rust usually requires developers to think locally ("is this reference still alive here?" or "have this variable's contents been moved?"). Instead, these intrinsics have implicit, global effects on program behavior after compiler optimizations, and these effects do not manifest in Rust's type system at all.
7. Aside: What About VolatileCell?
Next to these concurrency primitives that Rust ships, some users decide to develop their own. There is parking_lot, a crate with more efficient implementations of synchronization primitives. If you want to avoid using locks at all, the lockfree crate might be interesting to you. And one of those special types that is popular among embedded developers is VolatileCell (like in vcell, or its Tock equivalent). In this section, we will try to apply some of the concepts learned above to this construct. Note that while much of this section tries to reason about the safety of VolatileCell, I do not claim any of this to be authoritative information. These are mostly just notes I wrote down while reasoning about this myself.
Embedded systems or bare-metal code commonly operate over memory that is not backed by RAM in the conventional sense. Instead, these memory addresses represent registers that are provided by peripherals, so-called "Memory Mapped I/O" (MMIO).
In general, these peripherals run in parallel to the CPU—thus, it may be reasonable to effectively model them as separate threads in your system. They "share" certain MMIO memory addresses with your program and may, at times, even be able to write to arbitrary regular program memory too. Similar to threads, they should follow a contract for when it is safe to read from or write to certain memory.
However, despite behaving similarly to a separate thread, these devices can differ in one significant regard: memory accesses (that is, both read and write operations) may also have arbitrary, device-specific side effects. A common example is that of a serial console (UART) controller. These devices feature internal queues that hold on to a limited number of received bytes, which are later read by the CPU. While a separate thread might expose this as, e.g., a ring buffer protected by a Mutex, such MMIO peripherals can implement a more efficient, lock-free way of exposing this information. Following the example of a read queue implemented within a serial console controller, the device may expose the queue's current head element for every read of a specific queue register, but on each read also immediately discard (dequeue) this head element.
When paired with compiler optimizations as we've seen above, this can be problematic. The compiler makes many assumptions about how memory will behave, which can lead it to elide certain accesses, reorder them, or even insert spurious accesses when it believes this to be safe. These peripherals, though, don't share the compiler's understanding of memory behavior. Thus, an optimizing compiler can translate a driver that correctly implements a device's hardware contract into something incorrect, or even dangerous (e.g., when a peripheral can override program memory).
To avoid these issues, Rust provides volatile memory operations (i.e., reads and writes). Volatile operations provide some unique guarantees:
Volatile operations are intended to act on I/O memory, and are guaranteed to not be elided or reordered by the compiler across other volatile operations.
The compiler shouldn’t change the relative order or number of volatile memory operations.
However, notably, volatile operations are independent from atomic operations:

Just like in C, whether an operation is volatile has no bearing whatsoever on questions involving concurrent access from multiple threads. Volatile accesses behave exactly like non-atomic accesses in that regard. In particular, a race between a read_volatile and any other operation (reading or writing) on the same location is undefined behavior.
These properties make their interactions with synchronization, ordering, and other compiler optimizations all the more interesting. For this post, we'll focus only on three things that have recently come up in discussions around Tock's use of VolatileCell. VolatileCell is not something that Rust provides; you can view its implementation here. In fact, its soundness is subject to debate (e.g., in the Rust unsafe code guidelines) and is the motivation for this post.
7.1. Introducing VolatileCell
VolatileCell's purpose is to make a developer's life easier: peripherals that expose an MMIO-based interface typically do so by having their registers laid out such that they can be modeled as a #[repr(C)] struct, like this:
#[repr(C)]
pub struct UartRegisters {
    /// Transmit Data Register
    txdata: u32,
    /// Receive Data Register
    rxdata: u32,
    /// Transmit Control Register
    txctrl: u32,
    /// Receive Control Register
    rxctrl: u32,
    /// Interrupt Enable Register
    ie: u32,
    /// Interrupt Pending Register
    ip: u32,
    /// Baud Rate Divisor Register
    div: u32,
}
If we then cast a pointer at the address at which the peripheral's MMIO interface is exposed in memory (its base address) to a &mut UartRegisters reference, we have a convenient way to access its registers.
Unfortunately, doing so would not be sound. Given that the peripheral can act like a different thread and change the values of these registers independently of the CPU, creating a unique, mutable reference to this struct would infringe on Rust's "no mutable aliasing" requirement. Beyond that, regular memory accesses are subject to the compiler optimizations described above; for these registers, we need all accesses to be volatile.
This is the niche that VolatileCell fills! It's quite simple, and like all things we discuss today wraps an UnsafeCell internally:
pub struct VolatileCell<T> {
    value: UnsafeCell<T>,
}
To implement these volatile accesses, it features custom get and set methods:
impl<T> VolatileCell<T> {
    pub fn get(&self) -> T {
        // self.value.get() returns a *mut T pointer here:
        unsafe { ptr::read_volatile(self.value.get()) }
    }

    pub fn set(&self, value: T) {
        unsafe { ptr::write_volatile(self.value.get(), value) }
    }
}
Thus, in essence, VolatileCell combines the properties of interior mutability with those of volatile memory accesses. With it, we can transform the above struct into a similar version, while still benefiting from a convenient API:
pub struct UartRegisters {
    /// Transmit Data Register
    txdata: VolatileCell<u32>,
    /// Receive Data Register
    rxdata: VolatileCell<u32>,
    ...
Neat.
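In practice, such a struct is used by conjuring a reference from the peripheral's base address, roughly like the following sketch (the address is a made-up example, and creating a reference like this is exactly the pattern whose soundness section 7.2 below calls into question):

fn uart_regs() -> &'static UartRegisters {
    // Hypothetical MMIO base address of the UART peripheral:
    unsafe { &*(0x1001_3000 as *const UartRegisters) }
}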
7.2. VolatileCell and dereferenceable
Unfortunately, VolatileCell has issues. I encourage you to read the discussion on the Rust unsafe code guidelines, but in short, this problem comes from the fact that all references in Rust are marked as dereferenceable by default.
This dereferenceability means that the compiler is free to, for instance as part of certain optimizations, insert a spurious read of the reference's contents. Performing such a spurious read on a memory location where reads can have side effects means that, for many uses, this abstraction can be dangerous and unsound.
Unfortunately, we do not have a good answer around this problem yet. For now, it seems that the only way to safely interact with such volatile memory is through volatile pointer operations, and never creating a Rust reference to the memory in question.
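A sketch of what such reference-free access can look like, using only raw pointers (the base address and register offset here are hypothetical, for illustration only):

// Hypothetical UART base address and receive-data register offset:
const UART_BASE: usize = 0x1001_3000;
const RXDATA_OFFSET: usize = 0x04;

fn read_rxdata() -> u32 {
    // We construct the register pointer directly from an integer address.
    // No &/&mut reference to the MMIO memory is ever created, so no
    // `dereferenceable` attribute applies to it.
    let rxdata = (UART_BASE + RXDATA_OFFSET) as *const u32;
    unsafe { rxdata.read_volatile() }
}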
7.3. VolatileCell and Concurrency
Another interesting concern is that of concurrency. As peripherals can be modeled as behaving effectively like another thread, we should make sure that VolatileCell is safe to use in these contexts—that is, making sure that it is safe to have a VolatileCell defined over memory that is being changed by hardware, and having all accesses be properly synchronized.
Our analysis of Rust's Mutex gives us confidence that the mere existence of a VolatileCell over memory that is being modified by hardware should not be of concern. In particular, this issue was raised on a PR for the Tock operating system: given that UnsafeCell does not make any guarantees about thread safety, and Rust is always free to dereference its contents based on the dereferenceable attribute, how can it possibly be safe to retain a reference to it in the first place? However, we can observe that Mutex does exactly this: it holds a reference to an UnsafeCell even when it may be modified concurrently by a different thread. Importantly, Mutex does ensure that any direct, intentional accesses of its value are properly synchronized.
This brings us to our second question on synchronization. Here, the following statement is particularly worrying:

In particular, a race between a read_volatile and any other operation (reading or writing) on the same location is undefined behavior.
Does using read_volatile over MMIO memory that may arbitrarily change contents between any two accesses count as a "data race"?
Unfortunately, atomic operations do not help here either:
Since C++ does not support mixing atomic and non-atomic accesses, or non-synchronized different-sized accesses to the same data, Rust does not support those operations either.
I do not see a clear answer to this question. In practice, many developers rely on volatile accesses for interacting with such hardware-modified registers without issues. One of the reasons for this might be that on most hardware platforms, MMIO read operations are always consistent: registers are updated atomically by peripherals, and a CPU will never be exposed to a partial write. It would be great if the Rust documentation could provide more guidance on these questions.
7.4. VolatileCell and Happens-Before Relations
Finally, one important aspect to consider is that of volatile operations and their interactions with regular memory accesses. Again, Rust provides the following guarantees for volatile memory operations [emphasis added]:
Volatile operations are intended to act on I/O memory, and are guaranteed to not be elided or reordered by the compiler across other volatile operations.
This is a substantially different guarantee than the ordering-constraints imposed by atomic operations: while atomics can establish a happens-before relation between two parts of parallel threads in a program that affects a global set of variables, ordering guarantees around volatile operations are constrained to only these operations. This makes sense for many use-cases of volatile operations. For instance, it may matter that an "interrupt clear" flag is written before data is removed from a receive queue, but generally a peripheral should not be affected by other, independent memory accesses outside of its MMIO address space.
However, certain peripherals break this assumption, most prominently ones that perform "Direct Memory Access" (DMA). A DMA-capable peripheral not only has control over its MMIO registers, but can also access (a subset of) the same RAM that is accessible to the CPU. These devices can read from or write to a specified memory region after being instructed to do so through a write to a MMIO register.
However, this interaction between volatile accesses to an MMIO register and accesses to regular memory can create issues in the face of compiler optimizations or hardware reordering of memory accesses. For example, given that the compiler makes no promises about the order of volatile operations relative to other memory accesses, it could happen that a peripheral reads from a buffer prepared by the CPU before those buffer writes have actually been committed to memory.
For this reason, we need to establish an additional, explicit ordering guarantee between other memory operations and these volatile accesses. Rust provides us the std::sync::atomic::fence function to do just that. This function works similarly to atomic operations and takes an Ordering argument. It will ensure that compiler optimizations respect the requested ordering guarantees and, if applicable, will also emit a CPU instruction that prevents hardware-based reordering of accesses beyond this point.
Thus, when kicking off an operation in which the hardware reads from memory written by the CPU, we must use a fence operation with an Ordering::Release argument, to ensure that all writes become visible to another thread (or the hardware) that is synchronizing with our current thread. For reading data that the hardware has written into memory, we must use a fence operation with Ordering::Acquire instead. This ensures that we are seeing all changes since the point at which the hardware has informed us that an operation has finished (e.g., through a volatile read or interrupt).
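As a sketch of this pattern (the register names and the DMA protocol here are entirely hypothetical, for illustration only):

use std::sync::atomic::{fence, Ordering};

// Start a (hypothetical) DMA transfer in which the peripheral reads `buffer`.
// `ctrl_reg` is a made-up MMIO control register address.
unsafe fn start_dma_read(ctrl_reg: *mut u32, buffer: &mut [u8]) {
    // Prepare the buffer that the peripheral is about to read:
    buffer.fill(0x42);

    // Ensure all writes to `buffer` are committed before the device starts:
    fence(Ordering::Release);

    // A volatile write to the control register kicks off the transfer:
    ctrl_reg.write_volatile(1);
}

// Wait for a (hypothetical) DMA transfer in which the peripheral wrote `buffer`.
unsafe fn finish_dma_write(status_reg: *const u32, buffer: &[u8]) -> u8 {
    // Spin on the (made-up) status register until the device signals completion:
    while status_reg.read_volatile() == 0 {}

    // Ensure we observe all of the device's writes to `buffer` from here on:
    fence(Ordering::Acquire);
    buffer[0]
}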
It should be noted that this relies on a fairly liberal interpretation of the following comment in std::sync::atomic::fence's documentation:

A fence ‘A’ which has (at least) Release ordering semantics, synchronizes with a fence ‘B’ with (at least) Acquire semantics, if and only if there exist operations X and Y, both operating on some atomic object ‘M’ such that A is sequenced before X, Y is sequenced before B and Y observes the change to M. This provides a happens-before dependence between A and B.

Here, we assume that the volatile reads and writes to MMIO registers count as operations that are operating on an atomic object. However, this seems to be a fair assumption to make on many hardware platforms in practice, and clarifying it likely also ties in to the considerations of the previous subsection.
Footnotes:
[1] In practice, Rust knows which exact function we pass into the writer argument and generates a fn_cell_test.specialized.1: symbol that it calls instead. Because our fn cell_test is marked pub though, Rust also generates the general fn_cell_test: symbol that does not make any assumptions about the function passed into its second argument.