
[RFC] AtomicPerByte (aka "atomic memcpy") #3301


Open · wants to merge 9 commits into master

Conversation

m-ou-se
Member

@m-ou-se m-ou-se commented Aug 14, 2022

@m-ou-se m-ou-se added the T-libs-api Relevant to the library API team, which will review and decide on the RFC. label Aug 14, 2022
@bjorn3
Member

bjorn3 commented Aug 14, 2022

cc @ojeda

@ibraheemdev
Member

ibraheemdev commented Aug 14, 2022

This could mention the atomic-maybe-uninit crate in the alternatives section (cc @taiki-e).

@5225225

5225225 commented Aug 14, 2022

With some way for the language to express "this type is valid for any bit pattern" (which project safe transmute will presumably provide, and which exists in the ecosystem as bytemuck, zerocopy, and probably others), I'm wondering if it would be better to return an AtomicPerByteRead<T>(MaybeUninit<T>), for which we/the ecosystem could provide a safe into_inner (returning a T) when T is valid for any bit pattern.

This would also require removing the safe uninit method. But you could always presumably do an AtomicPerByte<MaybeUninit<T>> with no runtime cost to passing MaybeUninit::uninit() to new.

That's extra complexity, but means that with some help from the ecosystem/future stdlib work, this can be used in 100% safe code, if the data is fine with being torn.
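The shape being suggested could look something like the following sketch. Nothing here exists in std: the AnyBitPattern marker trait stands in for what bytemuck/zerocopy (or a future safe-transmute feature) would provide, and AtomicPerByteRead is the hypothetical wrapper from the comment above.

```rust
use core::mem::MaybeUninit;

// Hypothetical marker trait, standing in for bytemuck's `AnyBitPattern`
// or zerocopy's `FromBytes`: every initialized bit pattern is a valid
// value of the type.
pub unsafe trait AnyBitPattern: Copy {}
unsafe impl AnyBitPattern for u8 {}
unsafe impl AnyBitPattern for u32 {}

// The proposed wrapper: the result of a possibly-torn read.
pub struct AtomicPerByteRead<T>(pub MaybeUninit<T>);

impl<T: AnyBitPattern> AtomicPerByteRead<T> {
    // Safe only because `T: AnyBitPattern` rules out invalid bit
    // patterns: a torn-but-initialized read is still some valid `T`.
    // Note this does not address truly uninitialized bytes, which are
    // not a bit pattern at all.
    pub fn into_inner(self) -> T {
        unsafe { self.0.assume_init() }
    }
}
```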

@Lokathor
Contributor

The "uninit" part of MaybeUninit is essentially not a bit pattern though. That's the problem. Even if a value is valid "for all bit patterns", you can't unwrap uninit memory into that type.

not without the fabled and legendary Freeze Intrinsic anyway.

@T-Dark0

T-Dark0 commented Aug 14, 2022

On the other hand, AnyBitPatternOrPointerFragment isn't a type we have, nor really a type we strictly need for this. Assuming tearing can't deinitialize initialized memory, then MaybeUninit would suffice I think?

@programmerjake
Member

note that LLVM already implements this operation:
llvm.memcpy.element.unordered.atomic Intrinsic
with an additional fence operation for acquire/release.

@comex

comex commented Aug 15, 2022

The trouble with that intrinsic is that unordered is weaker than monotonic aka Relaxed, and it can't easily be upgraded. There's no "relaxed fence" if the ordering you want is Relaxed; and even if the ordering you want is Acquire or Release, combining unordered atomic accesses with fences doesn't produce quite the same result. Fences provide additional guarantees regarding other memory accessed before/after the atomic access, but they don't do anything to restore the missing "single total order" per address of the atomic accesses themselves.

Comment on lines +180 to +181
- In order for this to be efficient, we need an additional intrinsic hooking into
special support in LLVM. (Which LLVM needs to have anyway for C++.)
Member

How do you plan to implement this until LLVM implements this?

I don't think it is necessary to explain the implementation details in the RFC, but if we provide an unsound implementation until the as-yet-unmerged C++ proposal is implemented in LLVM at some point in the future, that seems to be a problem.

(Also, if the language provides the functionality necessary to implement this soundly in Rust, the ecosystem can implement this soundly as well without inline assembly.)

Member Author

I haven't looked into the details yet of what's possible today with LLVM. There's a few possible outcomes:

  • We wait until LLVM supports this. (Or contribute it to LLVM.) This feature is delayed until some point in the future when we can rely on an LLVM version that includes it.
  • Until LLVM supports it, we use a theoretically unsound but known-to-work-today hack like ptr::{read_volatile, write_volatile} combined with a fence. In the standard library we can more easily rely on implementation details of today's compiler.
  • We use the existing llvm.memcpy.element.unordered.atomic, after figuring out the consequences of the unordered property.
  • Until LLVM support appears, we implement it in the library using a loop of AtomicUsize::load()/store()s and a fence, possibly with an efficient inline assembly alternative for some popular architectures.

I'm not fully sure yet which of these are feasible.
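The last option above, sketched for the load side: a loop of word-granular relaxed loads followed by an acquire fence. This is purely illustrative (fixed to a `&[AtomicUsize]` source for simplicity), not a proposed implementation.

```rust
use std::sync::atomic::{fence, AtomicUsize, Ordering};

// Library-only fallback sketch: an "acquire memcpy" built from
// relaxed word-sized loads plus a trailing acquire fence.
fn atomic_load_words(src: &[AtomicUsize]) -> Vec<usize> {
    let out: Vec<usize> = src
        .iter()
        .map(|w| w.load(Ordering::Relaxed)) // per-word; may tear across words
        .collect();
    fence(Ordering::Acquire); // upgrades the whole copy to acquire
    out
}
```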

@m-ou-se
Member Author

m-ou-se commented Aug 15, 2022

The trouble with that intrinsic is that unordered is weaker than monotonic aka Relaxed, and it can't easily be upgraded. There's no "relaxed fence" if the ordering you want is Relaxed; and even if the ordering you want is Acquire or Release, combining unordered atomic accesses with fences doesn't produce quite the same result. Fences provide additional guarantees regarding other memory accessed before/after the atomic access, but they don't do anything to restore the missing "single total order" per address of the atomic accesses themselves.

I'm very familiar with the standard Rust and C++ memory orderings, but I don't know much about llvm's unordered ordering. Could you give an example of unexpected results we might get if we were to implement AtomicPerByte<T>::{read, write} using llvm's unordered primitive and a fence? Thanks!

(It seems monotonic behaves identically to unordered for loads and stores?)

but it's easy to accidentally cause undefined behavior by using `load`
to make an extra copy of data that shouldn't be copied.

- Naming: `AtomicPerByte`? `TearableAtomic`? `NoDataRace`? `NotQuiteAtomic`?

Given these options and considering what the C++ paper chose, AtomicPerByte sounds OK and has the advantage of having Atomic as a prefix.

Member

AtomicPerByteMaybeUninit or AtomicPerByteManuallyDrop to also resolve the other concern around dropping? Those are terrible names though...

@ojeda

ojeda commented Aug 15, 2022

cc @ojeda

Thanks! Cc'ing @wedsonaf since he will like it :)

@thomcc
Member

thomcc commented Aug 15, 2022

Unordered is not monotonic (as in, it has no total order across all accesses), so LLVM is free to reorder loads/stores in ways it would not be allowed to with Relaxed (it behaves a lot more like a non-atomic variable in this sense).

In practical terms, in single-thread scenarios it behaves as expected, but when you load an atomic variable with unordered where the previous writer was another thread, you basically have to be prepared for it to hand you back any value previously written by that thread, due to the reordering allowed.

Concretely, I don't know how we'd implement relaxed ordering by fencing without having that fence have a cost on weakly ordered machines (e.g. without implementing it as an overly-strong acquire/release fence).

That said, I think we could add an intrinsic to LLVM that does what we want here. I just don't think it already exists.

(FWIW, another part of the issue is that this stuff is not that well specified, but it's likely described by the "plain" accesses explained in https://www.cs.tau.ac.il/~orilahav/papers/popl17.pdf)

@thomcc
Member

thomcc commented Aug 15, 2022

CC @RalfJung who has stronger opinions on Unordered (and is the one who provided that link in the past).

I think we can easily implement this with relaxed in compiler-builtins though, but it should get a new intrinsic, since many platforms can implement it more efficiently.

@bjorn3
Member

bjorn3 commented Aug 15, 2022

We already have unordered atomic memcpy intrinsics in compiler-builtins. For 1, 2, 4 and 8 byte access sizes.

@thomcc
Member

thomcc commented Aug 15, 2022

I'm not sure we'd want unordered, as mentioned above...

@thomcc
Member

thomcc commented Aug 16, 2022

To clarify on the difference between relaxed and unordered (in terms of loads and stores), if you have

static ATOM: AtomicU8 = AtomicU8::new(0);
const O: Ordering = ???;

fn thread1() {
    ATOM.store(1, O);
    ATOM.store(2, O);
}

fn thread2() {
    let a = ATOM.load(O);
    let b = ATOM.load(O);
    assert!(a <= b);
}

The assertion in thread2 will never fail if O is Relaxed, but it could fail if O is (the hypothetical) Unordered.

In other words, for unordered, it would be legal for 2 to be stored before 1, or for b to be loaded before a. In terms of fences, there's no fence that "upgrades" unordered to relaxed, although I believe (but am not certain) that stronger fences do apply to it.

@programmerjake
Member

something that could work but not be technically correct is:
compiler acquire fence
unordered atomic memcpy
compiler release fence

those fences are no-ops at runtime, but prevent the compiler from reordering the unordered atomics -- assuming you're on any modern CPU (except Alpha, IIRC) it will behave like relaxed atomics, because that's what standard load/store instructions do.
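A sketch of that shape, using std's compiler_fence and a volatile byte loop standing in for the unordered atomic memcpy (which stable Rust cannot name directly). As stated above, and as the following replies elaborate, this is not technically correct; it is shown only to make the suggestion concrete.

```rust
use std::sync::atomic::{compiler_fence, Ordering};

// Compiler-only fences around a copy. The copy itself should be
// llvm.memcpy.element.unordered.atomic; a volatile byte loop stands in,
// since it at least keeps the compiler from splitting or eliding the
// accesses. NOT a sound implementation of an atomic memcpy.
unsafe fn atomic_ish_memcpy(dst: *mut u8, src: *const u8, n: usize) {
    compiler_fence(Ordering::Acquire);
    for i in 0..n {
        unsafe { dst.add(i).write_volatile(src.add(i).read_volatile()) };
    }
    compiler_fence(Ordering::Release);
}
```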

@thomcc
Member

thomcc commented Aug 16, 2022

Those fences aren't always no-ops at runtime, they actually emit code on several platforms (rust-lang/rust#62256). It's also unclear what can and can't be reordered across compiler fences (rust-lang/unsafe-code-guidelines#347), certainly plain stores can in some cases (this is easy to show happening in godbolt).

Either way, my point has not been that we can't implement this. We absolutely can and it's probably even straightforward. My point is just that I don't really think those existing intrinsics help us do that.

@tschuett

I like MaybeAtomic, but following C++ with AtomicPerByte sounds reasonable.
The LLVM guys started something similar in 2016:
https://reviews.llvm.org/D27133

loop {
    let s1 = self.seq.load(Acquire);
    let data = read_data(&self.data, Acquire);
    let s2 = self.seq.load(Relaxed);
Member

@RalfJung RalfJung Aug 20, 2022

There's something very subtle here that I had not appreciated until a few weeks ago: we have to ensure that the load here cannot return an outdated value that would prevent us from noticing a seqnum bump.

The reason this is the case is that if there is a concurrent write, and if any part of data reads from that write, then we have a release-acquire pair, so we are guaranteed to see at least the first fetch_add from write, and thus we will definitely notice a version conflict. OTOH, if s1 reads-from some second fetch_add in write, then that forms a release-acquire pair, and we will definitely see the full data.

So, all the release/acquire are necessary here. (I know this is not a seqlock tutorial, and @m-ou-se is certainly aware of this, but it still seemed worth pointing out -- many people reading this will not be aware of this.)

(This is related to this comment by @cbeuw.)
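For readers following along: the write side being referred to bumps the sequence number before and after writing the data. A minimal single-writer sketch (with a plain store standing in for the proposed atomic-per-byte store, and orderings chosen for illustration):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Minimal single-writer seqlock write path, to make the two
// `fetch_add`s discussed above concrete.
struct SeqLock {
    seq: AtomicUsize,
    data: std::cell::UnsafeCell<u64>,
}
unsafe impl Sync for SeqLock {}

impl SeqLock {
    fn write(&self, value: u64) {
        // First bump: seqnum becomes odd, marking the data as in flux.
        self.seq.fetch_add(1, Ordering::Relaxed);
        // The release-ordered per-byte store of the data would go here;
        // a plain store (valid only with a single writer) stands in.
        unsafe { *self.data.get() = value };
        // Second bump: seqnum becomes even again, publishing the data.
        self.seq.fetch_add(1, Ordering::Release);
    }
}
```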

Member Author

Yeah exactly. This is why people are sometimes asking for a "release-load" operation. This second load operation needs to happen "after" the read_data() part, but the usual (incorrect) read_data implementation doesn't involve atomic operations or a memory ordering, so they attempt to solve this issue with a memory ordering on that final load, which isn't possible. The right solution is a memory ordering on the read_data() operation.

Member

@ibraheemdev ibraheemdev Aug 23, 2022

Under a reordering-based atomic model (as CPUs use), a release load makes sense and works. Under the C11 model, release loads don't really work unless they are also RMWs (fetch_add(0)).
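The RMW workaround mentioned here can be sketched in one line: fetch_add(0) is a no-op read-modify-write, but unlike a plain load it is allowed to carry Release ordering (the hypothetical release_load name is just for illustration).

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// "Read don't-modify-write": a no-op RMW that behaves as a
// release-ordered load of the sequence number.
fn release_load(seq: &AtomicUsize) -> usize {
    seq.fetch_add(0, Ordering::Release)
}
```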

Member

Yeah, the famous seqlock paper discusses "read dont-modify write" operations.

while the second one is basically a memory fence followed by series of `AtomicU8::store`s.
Except the implementation can be much more efficient.
The implementation is allowed to load/store the bytes in any order,
and doesn't have to operate on individual bytes.
Member

The "load/store bytes in any order" part is quite tricky, and I think means that the specification needs to be more complicated to allow for that.

I was originally thinking this would be specified as a series of AtomicU8 load/store with the respective order, no fence involved. That would still allow merging adjacent writes (I think), but it would not allow reordering bytes. I wonder if we could get away with that, or if implementations actually need the ability to reorder.

Contributor

For a memcpy (meaning the two regions are exclusive) you generally want to copy using increasing address order ("forward") on all hardware I've ever heard of. Even if a forward copy isn't faster (which it often is), it's still the same speed as a reverse copy.

I suspect the "any order is allowed" is just left in as wiggle room for potentially strange situations where somehow a reverse order copy would improve performance.

Member Author

The "load/store bytes in any order" part is quite tricky, and I think means that the specification needs to be more complicated to allow for that.

A loop of relaxed load/store operations followed/preceded by an acquire/release fence already effectively allows for the relaxed operations to happen in any order, right?

I was originally thinking this would be specified as a series of AtomicU8 load/store with the respective order, no fence involved.

In the C++ paper they are specified basically as:

for (size_t i = 0; i < count; ++i) {
  reinterpret_cast<char*>(dest)[i] =
      atomic_ref<char>(reinterpret_cast<char*>(source)[i]).load(memory_order::relaxed);
}
atomic_thread_fence(order);

and

atomic_thread_fence(order);
for (size_t i = 0; i < count; ++i) {
  atomic_ref<char>(reinterpret_cast<char*>(dest)[i]).store(
      reinterpret_cast<char*>(source)[i], memory_order::relaxed);
}
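Transliterated into Rust terms, the two specifications above look roughly like this (a sketch over byte slices; the real operation would work on arbitrary, possibly uninitialized memory rather than &[AtomicU8]):

```rust
use std::sync::atomic::{fence, AtomicU8, Ordering};

// Load side of the C++ wording: per-byte relaxed loads, then a fence.
// `order` would be Acquire (or SeqCst).
fn atomic_memcpy_load(dest: &mut [u8], src: &[AtomicU8], order: Ordering) {
    for (d, s) in dest.iter_mut().zip(src) {
        *d = s.load(Ordering::Relaxed);
    }
    fence(order);
}

// Store side: a fence, then per-byte relaxed stores.
// `order` would be Release (or SeqCst).
fn atomic_memcpy_store(dest: &[AtomicU8], src: &[u8], order: Ordering) {
    fence(order);
    for (d, s) in dest.iter().zip(src) {
        d.store(*s, Ordering::Relaxed);
    }
}
```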

Member

A loop of relaxed load/store operations followed/preceded by an acquire/release fence already effectively allows for the relaxed operations to happen in any order, right?

Yes, relaxed loads/stores to different locations can be reordered, so specifying their order is moot under the as-if rule.

In the C++ paper they are basically as:

Hm... but usually fences and accesses are far from equivalent. If we specify them like this, calling code can rely on the presence of these fences. For example changing a 4-byte atomic acquire memcpy to an AtomicU32 acquire load would not be correct (even if we know everything is initialized and aligned etc).

Fences make all preceding/following relaxed accesses potentially induce synchronization, whereas a release/acquire access only does that for that particular access.

@RalfJung
Member

RalfJung commented Aug 20, 2022

CC @RalfJung who has stronger opinions on Unordered (and is the one who provided that link in the past).

Yeah, I don't think we should expose Unordered to users in any way until we are ready and willing to have our own concurrency memory model separate from that of C++ (or until C++ has something like unordered, and it's been shown to also make sense formally). There are some formal memory models with "plain" memory accesses, which are similar to unordered (no total mo order but race conditions allowed), but I have no idea if those are an accurate model of LLVM's unordered accesses. Both serve the same goal though, so there's a high chance they are at least related: both aim to model Java's regular memory accesses.

We already have unordered atomic memcpy intrinsics in compiler-builtins. For 1, 2, 4 and 8 byte access sizes.

Well I sure hope we're not using them in any way that actually becomes observable in program behavior, as that would be unsound.

@VorpalBlade

Why not have multiple store methods (perhaps not all 6, but enough to cover the use cases)? They could dispatch to the same underlying intrinsic internally.

It isn't like Rust doesn't already do this in the standard library: foo, foo_mut, unchecked_foo, etc. Though perhaps coming up with suitable names will be just as difficult.

@m-ou-se
Member Author

m-ou-se commented Dec 16, 2024

Because that would just result in confusion and unexpected behaviour. E.g. it's unclear what reasonable behaviour would be for types that need to be dropped.

@programmerjake
Member

what if the only option was:

pub fn store(&self, value: &MaybeUninit<T>, ordering: Ordering);

and to make storing a copy more ergonomic, MaybeUninit gains:

impl<T> MaybeUninit<T> { // maybe have a ?Sized bound? I can't recall if that works with unions
    pub const fn from_ref(v: &T) -> &Self {
        // Safety: &Self can't be written to, so this works
        unsafe { &*(v as *const T as *const Self) }
    }
}

that way if you want to store a copy of some type, you just use: a.store(MaybeUninit::from_ref(&my_value), Ordering::Relaxed)
and my_value will still be dropped later. or you can just use a reference if that's all you have access to.

and if you want my_value to not be dropped, just write:
a.store(&MaybeUninit::new(my_value), Ordering::Relaxed)

do remember that atomic memcpy is not terribly common so being a bit more verbose is fine.

@DemiMarie

What about only providing the intrinsic as an unsafe raw pointer operation, and letting users write their own higher-level wrappers?

@arielb1
Contributor

arielb1 commented Dec 17, 2024

What about only providing the intrinsic as an unsafe raw pointer operation, and letting users write their own higher-level wrappers?

The intrinsic seems more fundamental to me than the API around it.

@ais523

ais523 commented Apr 19, 2025

I'd like to suggest an alternative approach to solving the same problem (which I was thinking of suggesting before I saw this thread): an unsafe intrinsic (which I think of as read_racy) that behaves as follows:

  • The intrinsic takes a raw pointer *const T as an argument. This could be of any Rust type (it doesn't have to be an atomic and doesn't have to be made of UnsafeCells).
  • The return value is a MaybeUninit<T>, specified to be chosen as follows:
    • if the memory referenced by the pointer has been/is being/will be written to in a way that could cause a read through the pointer to form a data race with the write, it returns a MaybeUninit<T> holding an uninitialised value;
    • if the memory referenced by the pointer is currently mutably borrowed (even by another thread), it returns a MaybeUninit<T> holding an uninitialised value (and does not cause "access to a mutably borrowed value" undefined behaviour – conceptually the read does not occur in this case, although of course in practice the CPU would likely read potentially garbage data from the memory in question);
    • in other cases, it returns a MaybeUninit<T> holding the bit pattern of the memory referenced by the pointer.

By combining this with an acquire-ordered load before doing a read_racy of the memory and a release-ordered load after doing the read_racy, it becomes possible to implement sequence locks and similar code (i.e. first you attempt a read, then you discover whether it worked or not) – of course, this would need release-ordered loads to be added to the language, although as mentioned above they can be simulated by adding 0. (The read_racy itself would not be atomic – you synchronize it by using the ability of an acquire…release sequence to synchronize reads that happen between the acquire and release.) The big advantage of this approach is that you don't need to have any special handling of the pointed-to T; a sequence lock can safely coexist with arbitrary safe code that operates on the same memory, as long as it ensures that no such code was running before attempting to assume_init() the bytes.

This should also be very easy to implement – it's basically just an assembly-level load instruction that's "opaque" to the compiler, preventing it performing optimisations related to knowledge of what address is being loaded. (I think it can be implemented as a load instruction written with inline assembly, that the compiler has to assume could place arbitrary bits into the returned value because it can't see that it's a load instruction.) If there is no race, then the load instruction will load the pointer value. If there is a race, then the load instruction might or might not return useful data, but it will load some sequence of bits, which is valid to store in a MaybeUnint as long as you don't actually try to do anything with the data. Thus, it complies with the specification I wrote above.

This approach seems to be more powerful than requiring T to be of a particular type (you can use it to, e.g., write a memory allocator that uses sequence locks to protect the memory being allocated), and simpler than the existing listed alternatives (because it doesn't, e.g., need an UnsafeCell).

@DemiMarie

This is sufficient for synchronization, but not for functions like copy_from_user that access data another (potentially malicious) process might be concurrently mutating.

@RalfJung
Member

@ais523 Allowing racy reads on non-atomic accesses without UB has some very non-trivial consequences and would be a huge departure from some of the fundamental principles that the C++ memory model (which we inherit) is based on. This paper explores this a bit by having two languages where the first has full UB on read-write races but the second makes them return "poison" similar to what you suggested. We should not do this unless either C++ also does it, or we are ready to make our memory model independent from that of C++ (with all the consequences that entails, e.g. making it impossible to use atomic operation on memory shared with C++ code).

This should also be very easy to implement

You could hardly be further from the truth here. ;) Remember that "implementing" any change to the concurrency memory model requires making sure that the model even still makes any sense and supports all the desired optimizations, which typically requires months of work by an expert (and there's very few experts that are able to do that kind of work; I am not one of them).

Suggesting to "just" change something fundamental about the concurrency memory model is like suggesting to "just" change some detail about a rocket engine. These are non-trivial pieces of engineering and you can't "just" change anything about them without great care.

@ais523

ais523 commented Apr 20, 2025

@RalfJung: I agree that changing the memory model is a bad idea. My suggestion is designed to avoid needing to change the memory model, via confining the racy reads to a particular intrinsic/function that the compiler can't optimise around (and thus can't exploit the fact that the read would be undefined behaviour if done normally) and whose observable behaviour always matches something that could be done in the existing memory model.

I agree that "very easy to implement" is quite different from "very easy to prove correct"! Nonetheless, I don't think this is too hard to prove correct on the basis of "the executable output by the compiler must match the behaviour of the source program". The idea is that, from the compiler's point of view, read_racy is an opaque/FFI function that takes in a pointer, and does one of two things (based on a condition that the compiler doesn't know):

  • either it reads the pointer, and returns the value stored there;
  • or it ignores the pointer and returns an arbitrary value.

The compiler cannot take advantage of the "maybe the pointer isn't read" case to, e.g., move reads and writes around in a way that would stop the read working, because it doesn't know whether or not the opaque function reads the pointer, and has to assume (in any case where it can't prove a race exists) that there might be no race and the function might be reading the pointer.

The compiler also cannot take advantage of the "maybe the pointer is read" case to assume no race and optimise on that basis, again because it doesn't know whether or not the opaque function reads the pointer; if the function chose to ignore the pointer on that call, there would be no race, and thus there would be no optimisation-enabling UB for it to exploit.

Another way to think about it is to imagine that we have a magic function that lets us know whether or not a read could race (or access mutably borrowed memory), and read_racy gets implemented as follows:

unsafe fn read_racy<T>(ptr: *const T) -> MaybeUninit<T> {
    if the_read_will_race(ptr) {
        MaybeUninit::uninit()
    } else {
        unsafe { core::ptr::read(ptr as *const MaybeUninit<T>) }
    }
}

Assuming the existence of the_read_will_race, the function works correctly entirely within the existing memory model – it has no data races, because it only ever reads data in a situation where no race exists.

Although the function in question can't be implemented in Rust, due to there being no working the_read_will_race function, it can be implemented in any language with a load instruction that returns an arbitrary or uninitialised value when a data race happens – the load instruction happens to implement both branches of the if at the same time, meaning that the the_read_will_race call can be optimised out (and in turn meaning that the function doesn't need to be defined). Most notably, LLVM defines its load instruction to read undef in the case of a data race, rather than causing undefined behaviour, so the function in question can be implemented using raw LLVM IR.

As for the paper you linked, it's basically discussing "what would happen if the memory model allowed any read to race with writes, producing an undefined value rather than undefined behaviour?" and its conclusion was "you would miss optimisations". By confining reads that can race with writes to a particular function/intrinsic, you avoid the missed optimisations in code in general. The compiler will optimise less around a read_racy call, but that's the entire reason it exists: to stop compiler optimisations that are correct in general but wrong when a racy read occurs.

@DemiMarie

@RalfJung What if one was okay with these operations being opaque to the optimizer? That would allow them to desugar to asm!, which Rust already supports.

@RalfJung
Member

RalfJung commented Apr 21, 2025 via email

@ais523

ais523 commented Apr 21, 2025 via email

@RalfJung
Member

RalfJung commented Apr 22, 2025

@DemiMarie it is definitely not legal to do this with inline asm; the requirements for inline asm blocks are not met. And in fact, there are optimizations which are incompatible with the existence of an operation to do non-atomic reads where races are not full UB, see Figure 3 in this paper: the second step there, "remove redundant read of g by the GVN pass", relies on full UB of racy non-atomic reads.

@ais523

I think it's reasonable to question whether this method of implementing things is legitimate

This is not a reliable way to build a compiler -- so no, this is not legitimate. You must present a consistent formal semantics and show that it has all the desired properties, and the compiler must implement those semantics. Hand-waving something involving "this is opaque and hence" does not suffice (unless you can produce a proper proof of correctness of your reasoning principle, of course). This is the reasoning we are applying to inline asm blocks, and to ensure soundness they are subject to a tight restriction, which means they are not suited to add the operation you are proposing.

Also I think this is getting off-topic for this RFC. Given how terrible GitHub is at threading, we should keep discussion here focused on the proposed new primitive, and explore possible alternatives elsewhere (a separate issue, a thread on IRLO, a topic on Zulip).

@programmerjake
Member

see Figure 3 in this paper: the second step there, "remove redundant read of g by the GVN pass", relies on full UB of racy non-atomic reads.

note the paper says they decided that step wasn't allowed in LLVM:

As a result of reporting this bug, the LLVM developers decided to restrict the second transformation rather than the first one, which means that the intended LLVM memory model is subtly different from the C11 model.

@RalfJung
Member

RalfJung commented Apr 22, 2025

Yes, LLVM does not use the C++ memory model, they have their own. In the LLVM memory model, data races are not UB, they behave like read_racy. However, Rust uses the C++ memory model, not the LLVM memory model -- and the LLVM model has been explored very little, so I'd caution against adopting it for Rust (aside from the fact that, as noted above, diverging from the C++ model would be an interop hazard).

@arielb1
Contributor

arielb1 commented Apr 22, 2025

You can definitely do an "inline assembly memcpy" from a raw pointer to e.g. a stack variable.

I am quite sure that:

  1. It is defined to at least put some non-deterministic but stable bytes in that address.
  2. If the memory is not being concurrently modified, it will put the right bytes in that address.

the second step there, "remove redundant read of g by the GVN pass", relies on full UB of racy non-atomic reads.

AFAICT, LLVM at least pretends it lowers a C read into an "UB on race" read, but in addition to that, it supports a "poison on race" read [and you can turn "poison on race" to "non-deterministic but stable bytes on race" via freeze], where LLVM is allowed to convert a program with a conditional "UB on race" read to a program with an unconditional "poison on race" read and a conditional use. I don't see a contradiction in that.

I personally believe that a good internal IR needs to have all of "UB on race", "poison on race", and "nondet but stable on race" reads, but I don't see why a semantics for Rust (as opposed for an internal IR) needs to have "poison on race".

This RFC argues that the surface language needs to have "nondet but stable on race" as well, though not "poison on race"; I am still not sure whether there is a need for poison in the surface semantics.

@RalfJung
Member

RalfJung commented Apr 22, 2025 via email

@arielb1
Contributor

arielb1 commented Apr 22, 2025 via email

@RalfJung
Member

RalfJung commented Apr 23, 2025

Why not? If the access is not racy, then this is well defined code that returns the right value. If the access is racy, then it’s the same as the assembly pulling the numbers from the environment, which is also well defined.

There is no AM operation that can check if an access is racy, so this is not a valid line of reasoning. You cannot use inline assembly to extend the AM with new operations, as correctness proofs involving the AM can and do rely on knowing the full set of operations that can be performed by arbitrary AM code.

Correctness of optimizations assumes that all code accessing AM-visible state is Rust code, therefore inline asm can only perform actions on the AM state that can also be performed by Rust code. See this link I already posted above for more context.

@comex
Copy link

comex commented Apr 23, 2025

I was going to retort that it's not a valid line of reasoning only according to a model which you invented but don't have time to explain. You're usually right about things, but a non-explanation doesn't really move the conversation forward and I had some doubts. But after reading some of your old comments, particularly about angelic and demonic nondeterminism, I became pretty convinced you're right. It's impossible to formalize an asm block like this in terms of case analysis or nondeterminism without, at minimum, making reasoning about the Abstract Machine much more complicated. It would be much simpler to add the specific race-tolerant read operation that's needed.

The formalization can't "just" be branching on non-observable state (namely, whether there was a race); at least, you can't branch on non-observable state in general because that would break the ability of optimizations to change it. In this case it's hard to see a concrete incorrect result from an optimization changing whether there was a race. But that seems to be largely because of how unobservable this particular type of state is. First, even though the formalization would be using a branch, the asm block's outputs don't allow normal code to fully determine whether there was a race. If the data was corrupted, you can conclude that there was a race, but if it wasn't corrupted, you can't conclude that there wasn't. Second, even if you do learn whether there was a race, you can't necessarily derive a contradiction from it. The original semantics are "nondeterministically races or not". An optimization will usually either leave that unchanged or refine it to "never races", but never observing a race doesn't let you prove that the optimization changed anything (because you might have just gotten lucky). In theory, under very specific scenarios, an optimization could change "nondeterministically races" to "always races", and then the program could derive a contradiction if it doesn't race, but again, a program can never prove that there wasn't a race. However, this reasoning is probably not airtight. At minimum it would require a more complex proof, and it would be very special-purpose reasoning; it wouldn't allow the correctness of the asm block to fall out of general principles.

edit: definitely not airtight: on further review, part of what I wrote made no sense!

The idea of angelic nondeterminism seems compelling: let's just say that the asm block angelically-nondeterministically either performs an ordinary read or makes up a value. If there's a race, the ordinary read would be UB, so the angelic choice is forced to go to the make-up-a-value branch. This somewhat sidesteps the observability issue. But as you've said, it runs into the issue that operations that perform demonic nondeterminism, such as malloc, couldn't be reordered across the asm block. Or at least, it's nontrivial to prove that they could be.

Let's see if I can explain why reordering is a problem. Suppose a program obtains a pair of bools a and b where a comes from an angelic choice and b comes from a demonic choice, and after obtaining them, the program executes UB if a != b. If the angelic choice comes first, then the demonic choice can set b to !a to force the program to be undefined. But if the demonic choice comes first, then the angelic choice can set a to b to force the program to be defined. So the definedness of the program depends on the order, and reordering must be disallowed.

In reality, things may not be that dire (I think). You can't just have a normal bool be based on an angelic choice, because that's totally unimplementable without a time machine. Angelic choices, if they exist in the AM, can only affect specific pieces of unobservable state, such as "whether a race occurred" or "which provenance a pointer has". Given the limitations on how you can manipulate that state, perhaps it's actually impossible to express the equivalent of "defined if a == b", i.e. the kind of conditional UB that would break with reordering. But I have no idea how to prove that. Or maybe it is possible to express a == b, but there's some way to limit the scope of nondeterminism in order to nevertheless preserve reorderability. But again I have no idea how to prove that.

So okay, I agree that the Rust AM would be a lot simpler if it didn't support angelic nondeterminism at all. I am less sure that with_exposed_provenance can be formalized without it, but I guess we'll see.

@DemiMarie
Copy link

I think there are two distinct needs being conflated here:

  1. Some lock-free algorithms need to perform operations that are currently not possible to implement in Rust without far too much overhead.
  2. Low-level software needs “read/write/copy to/copy from this memory address” primitives that have machine semantics, including never directly having undefined behavior (but potentially causing undefined behavior in other software).

My understanding is that these needs are very distinct. The first is still within the confines of the abstract machine, so if a program’s behavior can be changed by optimizations then it has undefined behavior. The second, however, is fundamentally an I/O operation, and it permits observing runtime behavior that is not guaranteed by the language and which can be altered by optimizations. rust-lang/unsafe-code-guidelines#321 (which I should make a full RFC out of) is intended for these purposes.

@ais523
Copy link

ais523 commented Apr 24, 2025 via email

@comex
Copy link

comex commented Apr 24, 2025

even when doing so is unsound.

I wouldn't worry about LLVM optimizations that are known to be unsound. Pretty much all of Rust formalization is premised on "this would be sound if LLVM was sound" rather than "this is sound with actual LLVM optimizations". I mean, LLVM still optimizes inttoptr(ptrtoint(x)) to x.

As such, the "this is opaque, so any compiler that misoptimised this would also misoptimise a permitted implementation of the function"

This is a good point. Ralf's basic approach, of formalizing inline asm blocks in terms of equivalent Rust code, fundamentally loses the ability to reason in terms of opaqueness in the way you've stated. I am not sure how valuable that ability is.

To be clear, opaqueness itself is well-settled. The optimizer cannot reason about the instructions inside an asm block because they might be patched at runtime. But for formal modeling purposes, if you're modeling code that includes patchable asm blocks, presumably you would model it in terms of the actual possibilities of what might be there at runtime. So you usually wouldn't need to invoke opaqueness at that step.

And it's easy to make mistakes when invoking opaqueness. For one thing, asm blocks are not totally opaque. As an extreme example, suppose an asm block is marked as pure, readonly, nomem and only takes one u8 as input. A silly but legal optimization would be to precompute the outputs for all possible inputs at process startup, then replace all uses with a table lookup. Now, the "racy read" asm blocks under discussion would not be nomem, but you would probably want to be able to mark them pure and readonly. That might allow the compiler to insert speculative invocations for inputs that don't occur in the original program, if the compiler can prove that they could have occurred given different nondeterministic choices. (If the assembly triggered a fault or made unexpected memory writes given those inputs, then it would be violating the pure condition.)

As a more practical example, opaqueness is often used to justify empty asm blocks as optimization barriers, but a lot of uses of optimization barriers are unsound in the face of value speculation - i.e., if the compiler replaces foo(x) with if x == 42 { foo(42) } else { foo(x) }, and then heavily optimizes the foo(42) part. In most cases that just means broken benchmarks, which aren't a big deal, since we're talking about theory rather than practice and in theory we guarantee nothing about performance characteristics. But if you're trying to use an optimization barrier for correctness, then value speculation might give you UB. Especially if x is a pointer. (Yes, if x is a pointer then that transform wouldn't be valid in general, but it would be valid under some conditions.) ...Which is why black_box probably does deserve to live under std::hint, as much as I dislike that fact.

Still, there must be some programs that can be legitimately justified based on opaqueness. Perhaps racy loads are an example - I haven't thought through your argument enough to judge - but it doesn't matter too much either way, because it's pretty clear that extending the Abstract Machine is a better path forward. Same goes for freeze. But my question is: In a future where we have all those nice things, would there still be useful programs that could only be justified using opaqueness, and not through Ralf's method?

@DemiMarie
Copy link

Debuggers are one case where opaqueness is really needed. My proposal for core::arch::{load, store} explicitly allows reading from freed memory without UB, and in-process debuggers actually need to be able to do that (because one might be debugging the allocator).

@VorpalBlade
Copy link

Debuggers are one case where opaqueness is really needed. My proposal for core::arch::{load, store} explicitly allows reading from freed memory without UB, and in-process debuggers actually need to be able to do that (because one might be debugging the allocator).

  1. How would that work if the address has been unmapped? That would typically trigger a sigsegv or similar.
  2. How would it work if you read memory with side effects (an mmio register for an UART input for example)?
  3. Some memory might have hardware UB. For example if I remember correctly, you are not allowed to access certain memory from interrupt handlers while the microcontroller is in low power mode on ESP32 (I might have mixed up the details here, I saw that when I was looking for something else in the docs).

@DemiMarie
Copy link

Debuggers are one case where opaqueness is really needed. My proposal for core::arch::{load, store} explicitly allows reading from freed memory without UB, and in-process debuggers actually need to be able to do that (because one might be debugging the allocator).

  1. How would that work if the address has been unmapped? That would typically trigger a sigsegv or similar.

Triggering SIGSEGV is the intended behavior here. Debuggers can generally handle this gracefully.

  2. How would it work if you read memory with side effects (an mmio register for an UART input for example)?

The side effects would (and should) happen.

  3. Some memory might have hardware UB. For example if I remember correctly, you are not allowed to access certain memory from interrupt handlers while the microcontroller is in low power mode on ESP32 (I might have mixed up the details here, I saw that when I was looking for something else in the docs).

That’s one reason the operation is unsafe. It has no Rust-level UB, but it also doesn’t guarantee that it doesn’t trigger UB elsewhere. Its semantics are whatever the underlying hardware gives you. The compiler promises to perform exactly the operation you tell it to (unless there is UB elsewhere), but if that breaks something, you get to keep both pieces.
