
[RFC] AtomicPerByte (aka "atomic memcpy") #3301


Open · wants to merge 9 commits into master

Conversation

m-ou-se
Member

@m-ou-se m-ou-se commented Aug 14, 2022

@m-ou-se m-ou-se added the T-libs-api Relevant to the library API team, which will review and decide on the RFC. label Aug 14, 2022
@bjorn3
Member

bjorn3 commented Aug 14, 2022

cc @ojeda

@ibraheemdev
Member

ibraheemdev commented Aug 14, 2022

This could mention the atomic-maybe-uninit crate in the alternatives section (cc @taiki-e).

@5225225

5225225 commented Aug 14, 2022

With some way for the language to express "this type is valid for any bit pattern" (which project safe transmute will presumably provide, and which exists in the ecosystem as bytemuck, zerocopy, and probably others), I'm wondering if it would be better to return an AtomicPerByteRead<T>(MaybeUninit<T>), for which we/the ecosystem could provide a safe into_inner (returning a T) when T is valid for any bit pattern.

This would also require removing the safe uninit method. But you could always presumably do an AtomicPerByte<MaybeUninit<T>> with no runtime cost to passing MaybeUninit::uninit() to new.

That's extra complexity, but means that with some help from the ecosystem/future stdlib work, this can be used in 100% safe code, if the data is fine with being torn.
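The shape being suggested could look something like the following sketch. Nothing here exists in std: the AnyBitPattern marker trait stands in for what bytemuck/zerocopy (or a future safe-transmute feature) would provide, and AtomicPerByteRead is the hypothetical wrapper from the comment above.

```rust
use core::mem::MaybeUninit;

// Hypothetical marker trait, standing in for bytemuck's `AnyBitPattern`
// or zerocopy's `FromBytes`: every initialized bit pattern is a valid
// value of the type.
pub unsafe trait AnyBitPattern: Copy {}
unsafe impl AnyBitPattern for u8 {}
unsafe impl AnyBitPattern for u32 {}

// The proposed wrapper: the result of a possibly-torn read.
pub struct AtomicPerByteRead<T>(pub MaybeUninit<T>);

impl<T: AnyBitPattern> AtomicPerByteRead<T> {
    // Safe only because `T: AnyBitPattern` rules out invalid bit
    // patterns: a torn-but-initialized read is still some valid `T`.
    // Note this does not address truly uninitialized bytes, which are
    // not a bit pattern at all.
    pub fn into_inner(self) -> T {
        unsafe { self.0.assume_init() }
    }
}
```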

@Lokathor
Contributor

The "uninit" part of MaybeUninit is essentially not a bit pattern though. That's the problem. Even if a value is valid "for all bit patterns", you can't unwrap uninit memory into that type.

not without the fabled and legendary Freeze Intrinsic anyway.

@T-Dark0

T-Dark0 commented Aug 14, 2022

On the other hand, AnyBitPatternOrPointerFragment isn't a type we have, nor really a type we strictly need for this. Assuming tearing can't deinitialize initialized memory, then MaybeUninit would suffice I think?

@programmerjake
Member

note that LLVM already implements this operation:
llvm.memcpy.element.unordered.atomic Intrinsic
with an additional fence operation for acquire/release.

@comex

comex commented Aug 15, 2022

The trouble with that intrinsic is that unordered is weaker than monotonic aka Relaxed, and it can't easily be upgraded. There's no "relaxed fence" if the ordering you want is Relaxed; and even if the ordering you want is Acquire or Release, combining unordered atomic accesses with fences doesn't produce quite the same result. Fences provide additional guarantees regarding other memory accessed before/after the atomic access, but they don't do anything to restore the missing "single total order" per address of the atomic accesses themselves.

Comment on lines +180 to +181
- In order for this to be efficient, we need an additional intrinsic hooking into
special support in LLVM. (Which LLVM needs to have anyway for C++.)
Member

How do you plan to implement this until LLVM implements this?

I don't think it is necessary to explain the implementation details in the RFC, but if we provide an unsound implementation until the as-yet-unmerged C++ proposal is implemented in LLVM at some point in the future, that seems to be a problem.

(Also, if the language provides the functionality necessary to implement this soundly in Rust, the ecosystem can implement this soundly as well without inline assembly.)

Member Author

I haven't looked into the details yet of what's possible today with LLVM. There's a few possible outcomes:

  • We wait until LLVM supports this. (Or contribute it to LLVM.) This feature is delayed until some point in the future when we can rely on an LLVM version that includes it.
  • Until LLVM supports it, we use a theoretically unsound but known-to-work-today hack like ptr::{read_volatile, write_volatile} combined with a fence. In the standard library we can more easily rely on implementation details of today's compiler.
  • We use the existing llvm.memcpy.element.unordered.atomic, after figuring out the consequences of the unordered property.
  • Until LLVM support appears, we implement it in the library using a loop of AtomicUsize::load()/store()s and a fence, possibly with an efficient inline assembly alternative for some popular architectures.

I'm not fully sure yet which of these are feasible.
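The last option above, sketched for the load side: a loop of word-granular relaxed loads followed by an acquire fence. This is purely illustrative (fixed to a `&[AtomicUsize]` source for simplicity), not a proposed implementation.

```rust
use std::sync::atomic::{fence, AtomicUsize, Ordering};

// Library-only fallback sketch: an "acquire memcpy" built from
// relaxed word-sized loads plus a trailing acquire fence.
fn atomic_load_words(src: &[AtomicUsize]) -> Vec<usize> {
    let out: Vec<usize> = src
        .iter()
        .map(|w| w.load(Ordering::Relaxed)) // per-word; may tear across words
        .collect();
    fence(Ordering::Acquire); // upgrades the whole copy to acquire
    out
}
```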

@m-ou-se
Member Author

m-ou-se commented Aug 15, 2022

The trouble with that intrinsic is that unordered is weaker than monotonic aka Relaxed, and it can't easily be upgraded. There's no "relaxed fence" if the ordering you want is Relaxed; and even if the ordering you want is Acquire or Release, combining unordered atomic accesses with fences doesn't produce quite the same result. Fences provide additional guarantees regarding other memory accessed before/after the atomic access, but they don't do anything to restore the missing "single total order" per address of the atomic accesses themselves.

I'm very familiar with the standard Rust and C++ memory orderings, but I don't know much about llvm's unordered ordering. Could you give an example of unexpected results we might get if we were to implement AtomicPerByte<T>::{read, write} using llvm's unordered primitive and a fence? Thanks!

(It seems monotonic behaves identically to unordered for loads and stores?)

but it's easy to accidentally cause undefined behavior by using `load`
to make an extra copy of data that shouldn't be copied.

- Naming: `AtomicPerByte`? `TearableAtomic`? `NoDataRace`? `NotQuiteAtomic`?

Given these options and considering what the C++ paper chose, AtomicPerByte sounds OK and has the advantage of having Atomic as a prefix.

Member

AtomicPerByteMaybeUninit or AtomicPerByteManuallyDrop to also resolve the other concern around dropping? Those are terrible names though...

@ojeda

ojeda commented Aug 15, 2022

cc @ojeda

Thanks! Cc'ing @wedsonaf since he will like it :)

@thomcc
Member

thomcc commented Aug 15, 2022

Unordered is not monotonic (as in, it has no total order across all accesses), so LLVM is free to reorder loads/stores in ways it would not be allowed to with Relaxed (it behaves a lot more like a non-atomic variable in this sense).

In practical terms, in single-thread scenarios it behaves as expected, but when you load an atomic variable with unordered where the previous writer was another thread, you basically have to be prepared for it to hand you back any value previously written by that thread, due to the reordering allowed.

Concretely, I don't know how we'd implement relaxed ordering by fencing without having that fence have a cost on weakly ordered machines (e.g. without implementing it as an overly-strong acquire/release fence).

That said, I think we could add an intrinsic to LLVM that does what we want here. I just don't think it already exists.

(FWIW, another part of the issue is that this stuff is not that well specified, but it's likely described by the "plain" accesses explained in https://www.cs.tau.ac.il/~orilahav/papers/popl17.pdf)

@thomcc
Member

thomcc commented Aug 15, 2022

CC @RalfJung who has stronger opinions on Unordered (and is the one who provided that link in the past).

I think we can easily implement this with relaxed in compiler-builtins though, but it should get a new intrinsic, since many platforms can implement it more efficiently.

@bjorn3
Member

bjorn3 commented Aug 15, 2022

We already have unordered atomic memcpy intrinsics in compiler-builtins. For 1, 2, 4 and 8 byte access sizes.

@thomcc
Member

thomcc commented Aug 15, 2022

I'm not sure we'd want unordered, as mentioned above...

@thomcc
Member

thomcc commented Aug 16, 2022

To clarify on the difference between relaxed and unordered (in terms of loads and stores), if you have

static ATOM: AtomicU8 = AtomicU8::new(0);
const O: Ordering = ???;

fn thread1() {
    ATOM.store(1, O);
    ATOM.store(2, O);
}

fn thread2() {
    let a = ATOM.load(O);
    let b = ATOM.load(O);
    assert!(a <= b);
}

The assertion in thread2 will never fail if O is Relaxed, but it could fail if O is (the hypothetical) Unordered.

In other words, for unordered, it would be legal for 2 to be stored before 1, or for b to be loaded before a. In terms of fences, there's no fence that "upgrades" unordered to relaxed, although I believe (but am not certain) that stronger fences do apply to it.

@programmerjake
Member

something that could work but not be technically correct is:
compiler acquire fence
unordered atomic memcpy
compiler release fence

those fences are no-ops at runtime, but prevent the compiler from reordering the unordered atomics -- assuming you're on any modern CPU (except Alpha, IIRC) it will behave like relaxed atomics, because that's what standard load/store instructions do.
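A sketch of that shape, using std's compiler_fence and a volatile byte loop standing in for the unordered atomic memcpy (which stable Rust cannot name directly). As stated above, and as the following replies elaborate, this is not technically correct; it is shown only to make the suggestion concrete.

```rust
use std::sync::atomic::{compiler_fence, Ordering};

// Compiler-only fences around a copy. The copy itself should be
// llvm.memcpy.element.unordered.atomic; a volatile byte loop stands in,
// since it at least keeps the compiler from splitting or eliding the
// accesses. NOT a sound implementation of an atomic memcpy.
unsafe fn atomic_ish_memcpy(dst: *mut u8, src: *const u8, n: usize) {
    compiler_fence(Ordering::Acquire);
    for i in 0..n {
        unsafe { dst.add(i).write_volatile(src.add(i).read_volatile()) };
    }
    compiler_fence(Ordering::Release);
}
```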

@thomcc
Member

thomcc commented Aug 16, 2022

Those fences aren't always no-ops at runtime, they actually emit code on several platforms (rust-lang/rust#62256). It's also unclear what can and can't be reordered across compiler fences (rust-lang/unsafe-code-guidelines#347), certainly plain stores can in some cases (this is easy to show happening in godbolt).

Either way, my point has not been that we can't implement this. We absolutely can and it's probably even straightforward. My point is just that I don't really think those existing intrinsics help us do that.

@tschuett

I like MaybeAtomic, but following C++ with AtomicPerByte sounds reasonable.
The LLVM guys started something similar in 2016:
https://reviews.llvm.org/D27133

loop {
    let s1 = self.seq.load(Acquire);
    let data = read_data(&self.data, Acquire);
    let s2 = self.seq.load(Relaxed);
Member

@RalfJung RalfJung Aug 20, 2022

There's something very subtle here that I had not appreciated until a few weeks ago: we have to ensure that the load here cannot return an outdated value that would prevent us from noticing a seqnum bump.

The reason this is the case is that if there is a concurrent write, and if any part of data reads from that write, then we have a release-acquire pair, so we are guaranteed to see at least the first fetch_add from write, and thus we will definitely notice a version conflict. OTOH, if s1 reads-from some second fetch_add in write, then that forms a release-acquire pair, and we will definitely see the full data.

So, all the release/acquire are necessary here. (I know this is not a seqlock tutorial, and @m-ou-se is certainly aware of this, but it still seemed worth pointing out -- many people reading this will not be aware of this.)

(This is related to this comment by @cbeuw.)
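For readers following along: the write side being referred to bumps the sequence number before and after writing the data. A minimal single-writer sketch (with a plain store standing in for the proposed atomic-per-byte store, and orderings chosen for illustration):

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Minimal single-writer seqlock write path, to make the two
// `fetch_add`s discussed above concrete.
struct SeqLock {
    seq: AtomicUsize,
    data: std::cell::UnsafeCell<u64>,
}
unsafe impl Sync for SeqLock {}

impl SeqLock {
    fn write(&self, value: u64) {
        // First bump: seqnum becomes odd, marking the data as in flux.
        self.seq.fetch_add(1, Ordering::Relaxed);
        // The release-ordered per-byte store of the data would go here;
        // a plain store (valid only with a single writer) stands in.
        unsafe { *self.data.get() = value };
        // Second bump: seqnum becomes even again, publishing the data.
        self.seq.fetch_add(1, Ordering::Release);
    }
}
```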

Member Author

Yeah exactly. This is why people are sometimes asking for a "release-load" operation. This second load operation needs to happen "after" the read_data() part, but the usual (incorrect) read_data implementation doesn't involve atomic operations or a memory ordering, so they attempt to solve this issue with a memory ordering on that final load, which isn't possible. The right solution is a memory ordering on the read_data() operation.

Member

@ibraheemdev ibraheemdev Aug 23, 2022

Under a reordering-based atomic model (as CPUs use), a release load makes sense and works. Under the C11 model, release loads don't really work unless they are also RMWs (fetch_add(0)).
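The RMW workaround mentioned here can be sketched in one line: fetch_add(0) is a no-op read-modify-write, but unlike a plain load it is allowed to carry Release ordering (the hypothetical release_load name is just for illustration).

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// "Read don't-modify-write": a no-op RMW that behaves as a
// release-ordered load of the sequence number.
fn release_load(seq: &AtomicUsize) -> usize {
    seq.fetch_add(0, Ordering::Release)
}
```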

Member

Yeah, the famous seqlock paper discusses "read dont-modify write" operations.

while the second one is basically a memory fence followed by series of `AtomicU8::store`s.
Except the implementation can be much more efficient.
The implementation is allowed to load/store the bytes in any order,
and doesn't have to operate on individual bytes.
Member

The "load/store bytes in any order" part is quite tricky, and I think means that the specification needs to be more complicated to allow for that.

I was originally thinking this would be specified as a series of AtomicU8 load/store with the respective order, no fence involved. That would still allow merging adjacent writes (I think), but it would not allow reordering bytes. I wonder if we could get away with that, or if implementations actually need the ability to reorder.

Contributor

For a memcpy (meaning the two regions are exclusive) you generally want to copy using increasing address order ("forward") on all hardware I've ever heard of. Even if a forward copy isn't faster (which it often is), it's still the same speed as a reverse copy.

I suspect the "any order is allowed" is just left in as wiggle room for potentially strange situations where somehow a reverse order copy would improve performance.

Member Author

The "load/store bytes in any order" part is quite tricky, and I think means that the specification needs to be more complicated to allow for that.

A loop of relaxed load/store operations followed/preceded by an acquire/release fence already effectively allows for the relaxed operations to happen in any order, right?

I was originally thinking this would be specified as a series of AtomicU8 load/store with the respective order, no fence involved.

In the C++ paper they are specified basically as:

for (size_t i = 0; i < count; ++i) {
  reinterpret_cast<char*>(dest)[i] =
      atomic_ref<char>(reinterpret_cast<char*>(source)[i]).load(memory_order::relaxed);
}
atomic_thread_fence(order);

and

atomic_thread_fence(order);
for (size_t i = 0; i < count; ++i) {
  atomic_ref<char>(reinterpret_cast<char*>(dest)[i]).store(
      reinterpret_cast<char*>(source)[i], memory_order::relaxed);
}
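Transliterated into Rust terms, the two specifications above look roughly like this (a sketch over byte slices; the real operation would work on arbitrary, possibly uninitialized memory rather than &[AtomicU8]):

```rust
use std::sync::atomic::{fence, AtomicU8, Ordering};

// Load side of the C++ wording: per-byte relaxed loads, then a fence.
// `order` would be Acquire (or SeqCst).
fn atomic_memcpy_load(dest: &mut [u8], src: &[AtomicU8], order: Ordering) {
    for (d, s) in dest.iter_mut().zip(src) {
        *d = s.load(Ordering::Relaxed);
    }
    fence(order);
}

// Store side: a fence, then per-byte relaxed stores.
// `order` would be Release (or SeqCst).
fn atomic_memcpy_store(dest: &[AtomicU8], src: &[u8], order: Ordering) {
    fence(order);
    for (d, s) in dest.iter().zip(src) {
        d.store(*s, Ordering::Relaxed);
    }
}
```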

Member

A loop of relaxed load/store operations followed/preceded by an acquire/release fence already effectively allows for the relaxed operations to happen in any order, right?

Yes, relaxed loads/stores to different locations can be reordered, so specifying their order is moot under the as-if rule.

In the C++ paper they are basically as:

Hm... but usually fences and accesses are far from equivalent. If we specify them like this, calling code can rely on the presence of these fences. For example changing a 4-byte atomic acquire memcpy to an AtomicU32 acquire load would not be correct (even if we know everything is initialized and aligned etc).

Fences make all preceding/following relaxed accesses potentially induce synchronization, whereas a release/acquire access only does that for that particular access.

@RalfJung
Member

RalfJung commented Aug 20, 2022

CC @RalfJung who has stronger opinions on Unordered (and is the one who provided that link in the past).

Yeah, I don't think we should expose Unordered to users in any way until we are ready and willing to have our own concurrency memory model separate from that of C++ (or until C++ has something like unordered, and it's been shown to also make sense formally). There are some formal memory models with "plain" memory accesses, which are similar to unordered (no total mo order but race conditions allowed), but I have no idea if those are an accurate model of LLVM's unordered accesses. Both serve the same goal though, so there's a high chance they are at least related: both aim to model Java's regular memory accesses.

We already have unordered atomic memcpy intrinsics in compiler-builtins. For 1, 2, 4 and 8 byte access sizes.

Well I sure hope we're not using them in any way that actually becomes observable in program behavior, as that would be unsound.

@VorpalBlade

Why not have multiple store methods (perhaps not all 6, but enough to cover the use cases)? They could dispatch to the same underlying intrinsic internally.

It isn't like Rust doesn't already do this in the standard library: foo, foo_mut, unchecked_foo, etc. Though perhaps coming up with suitable names will be just as difficult.

@m-ou-se
Member Author

m-ou-se commented Dec 16, 2024

Because that would just result in confusion and unexpected behaviour. E.g. it's unclear what reasonable behaviour would be for types that need to be dropped.

@programmerjake
Member

what if the only option was:

pub fn store(&self, value: &MaybeUninit<T>, ordering: Ordering);

and to make storing a copy more ergonomic, MaybeUninit gains:

impl<T> MaybeUninit<T> { // maybe have a ?Sized bound? I can't recall if that works with unions
    pub const fn from_ref(v: &T) -> &Self {
        // Safety: &Self can't be written to, so this works
        unsafe { &*(v as *const T as *const Self) }
    }
}

that way if you want to store a copy of some type, you just use: a.store(MaybeUninit::from_ref(&my_value), Ordering::Relaxed)
and my_value will still be dropped later. or you can just use a reference if that's all you have access to.

and if you want my_value to not be dropped, just write:
a.store(&MaybeUninit::new(my_value), Ordering::Relaxed)

do remember that atomic memcpy is not terribly common so being a bit more verbose is fine.

@DemiMarie

What about only providing the intrinsic as an unsafe raw pointer operation, and letting users write their own higher-level wrappers?

@arielb1
Contributor

arielb1 commented Dec 17, 2024

What about only providing the intrinsic as an unsafe raw pointer operation, and letting users write their own higher-level wrappers?

The intrinsic seems more fundamental to me than the API around it.

@ais523

ais523 commented Apr 19, 2025

I'd like to suggest an alternative approach to solving the same problem (which I was thinking of suggesting before I saw this thread): an unsafe intrinsic (which I think of as read_racy) that behaves as follows:

  • The intrinsic takes a raw pointer *const T as an argument. This could be of any Rust type (it doesn't have to be an atomic and doesn't have to be made of UnsafeCells).
  • The return value is a MaybeUninit<T>, specified to be chosen as follows:
    • if the memory referenced by the pointer has been/is being/will be written to in a way that could cause a read through the pointer to form a data race with the write, it returns a MaybeUninit<T> holding an uninitialised value;
    • if the memory referenced by the pointer is currently mutably borrowed (even by another thread), it returns a MaybeUninit<T> holding an uninitialised value (and does not cause "access to a mutably borrowed value" undefined behaviour – conceptually the read does not occur in this case, although of course in practice the CPU would likely read potentially garbage data from the memory in question);
    • in other cases, it returns a MaybeUninit<T> holding the bit pattern of the memory referenced by the pointer.

By combining this with an acquire-ordered load before doing a read_racy of the memory and a release-ordered load after doing the read_racy, it becomes possible to implement sequence locks and similar code (i.e. first you attempt a read, then you discover whether it worked or not) – of course, this would need release-ordered loads to be added to the language, although as mentioned above they can be simulated by adding 0. (The read_racy itself would not be atomic – you synchronize it by using the ability of an acquire…release sequence to synchronize reads that happen between the acquire and release.) The big advantage of this approach is that you don't need to have any special handling of the pointed-to T; a sequence lock can safely coexist with arbitrary safe code that operates on the same memory, as long as it ensures that no such code was running before attempting to assume_init() the bytes.

This should also be very easy to implement – it's basically just an assembly-level load instruction that's "opaque" to the compiler, preventing it performing optimisations related to knowledge of what address is being loaded. (I think it can be implemented as a load instruction written with inline assembly, that the compiler has to assume could place arbitrary bits into the returned value because it can't see that it's a load instruction.) If there is no race, then the load instruction will load the pointer value. If there is a race, then the load instruction might or might not return useful data, but it will load some sequence of bits, which is valid to store in a MaybeUnint as long as you don't actually try to do anything with the data. Thus, it complies with the specification I wrote above.

This approach seems to be more powerful than requiring T to be of a particular type (you can use it to, e.g., write a memory allocator that uses sequence locks to protect the memory being allocated), and simpler than the existing listed alternatives (because it doesn't, e.g., need an UnsafeCell).

@DemiMarie

This is sufficient for synchronization, but not for functions like copy_from_user that access data another (potentially malicious) process might be concurrently mutating.

@RalfJung
Member

@ais523 Allowing racy reads on non-atomic accesses without UB has some very non-trivial consequences and would be a huge departure from some of the fundamental principles that the C++ memory model (which we inherit) is based on. This paper explores this a bit by having two languages where the first has full UB on read-write races but the second makes them return "poison" similar to what you suggested. We should not do this unless either C++ also does it, or we are ready to make our memory model independent from that of C++ (with all the consequences that entails, e.g. making it impossible to use atomic operation on memory shared with C++ code).

This should also be very easy to implement

You could hardly be further from the truth here. ;) Remember that "implementing" any change to the concurrency memory model requires making sure that the model even still makes any sense and supports all the desired optimizations, which typically requires months of work by an expert (and there's very few experts that are able to do that kind of work; I am not one of them).

Suggesting to "just" change something fundamental about the concurrency memory model is like suggesting to "just" change some detail about a rocket engine. These are non-trivial pieces of engineering and you can't "just" change anything about them without great care.

@ais523

ais523 commented Apr 20, 2025

@RalfJung: I agree that changing the memory model is a bad idea. My suggestion is designed to avoid needing to change the memory model, via confining the racy reads to a particular intrinsic/function that the compiler can't optimise around (and thus can't exploit the fact that the read would be undefined behaviour if done normally) and whose observable behaviour always matches something that could be done in the existing memory model.

I agree that "very easy to implement" is quite different from "very easy to prove correct"! Nonetheless, I don't think this is too hard to prove correct on the basis of "the executable output by the compiler must match the behaviour of the source program". The idea is that, from the compiler's point of view, read_racy is an opaque/FFI function that takes in a pointer, and does one of two things (based on a condition that the compiler doesn't know):

  • either it reads the pointer, and returns the value stored there;
  • or it ignores the pointer and returns an arbitrary value.

The compiler cannot take advantage of the "maybe the pointer isn't read" case to, e.g., move reads and writes around in a way that would stop the read working, because it doesn't know whether or not the opaque function reads the pointer, and has to assume (in any case where it can't prove a race exists) that there might be no race and the function might be reading the pointer.

The compiler also cannot take advantage of the "maybe the pointer is read" case to assume no race and optimise on that basis, again because it doesn't know whether or not the opaque function reads the pointer; if the function chose to ignore the pointer on that call, there would be no race, and thus there would be no optimisation-enabling UB for it to exploit.

Another way to think about it is to imagine that we have a magic function that lets us know whether or not a read could race (or access mutably borrowed memory), and read_racy gets implemented as follows:

unsafe fn read_racy<T>(ptr: *const T) -> MaybeUninit<T> {
    if the_read_will_race(ptr) {
        MaybeUninit::uninit()
    } else {
        unsafe { core::ptr::read(ptr as *const MaybeUninit<T>) }
    }
}

Assuming the existence of the_read_will_race, the function works correctly entirely within the existing memory model – it has no data races, because it only ever reads data in a situation where no race exists.

Although the function in question can't be implemented in Rust, due to there being no working the_read_will_race function, it can be implemented in any language with a load instruction that returns an arbitrary or uninitialised value when a data race happens – the load instruction happens to implement both branches of the if at the same time, meaning that the the_read_will_race call can be optimised out (and in turn meaning that the function doesn't need to be defined). Most notably, LLVM defines its load instruction to read undef in the case of a data race, rather than causing undefined behaviour, so the function in question can be implemented using raw LLVM IR.

As for the paper you linked, it's basically discussing "what would happen if the memory model allowed any read to race with writes, producing an undefined value rather than undefined behaviour?" and its conclusion was "you would miss optimisations". By confining reads that can race with writes to a particular function/intrinsic, you avoid the missed optimisations in code in general. The compiler will optimise less around a read_racy call, but that's the entire reason it exists: to stop compiler optimisations that are correct in general but wrong when a racy read occurs.

@DemiMarie

@RalfJung What if one was okay with these operations being opaque to the optimizer? That would allow them to desugar to asm!, which Rust already supports.

@RalfJung
Member

RalfJung commented Apr 21, 2025 via email

@ais523

ais523 commented Apr 21, 2025 via email

@RalfJung
Member

RalfJung commented Apr 22, 2025

@DemiMarie it is definitely not legal to do this with inline asm; the requirements for inline asm blocks are not met. And in fact, there are optimizations which are incompatible with the existence of an operation to do non-atomic reads where races are not full UB, see Figure 3 in this paper: the second step there, "remove redundant read of g by the GVN pass", relies on full UB of racy non-atomic reads.

@ais523

I think it's reasonable to question whether this method of implementing things is legitimate

This is not a reliable way to build a compiler -- so no, this is not legitimate. You must present a consistent formal semantics and show that it has all the desired properties, and the compiler must implement those semantics. Hand-waving something involving "this is opaque and hence" does not suffice (unless you can produce a proper proof of correctness of your reasoning principle, of course). This is the reasoning we are applying to inline asm blocks, and to ensure soundness they are subject to a tight restriction, which means they are not suited to add the operation you are proposing.

Also I think this is getting off-topic for this RFC. Given how terrible GitHub is at threading, we should keep discussion here focused on the proposed new primitive, and explore possible alternatives elsewhere (a separate issue, a thread on IRLO, a topic on Zulip).

@programmerjake
Member

see Figure 3 in this paper: the second step there, "remove redundant read of g by the GVN pass", relies on full UB of racy non-atomic reads.

note the paper says they decided that step wasn't allowed in LLVM:

As a result of reporting this bug, the LLVM developers decided to restrict the second transformation rather than the first one, which means that the intended LLVM memory model is subtly different from the C11 model.

@RalfJung
Member

RalfJung commented Apr 22, 2025

Yes, LLVM does not use the C++ memory model, they have their own. In the LLVM memory model, data races are not UB, they behave like read_racy. However, Rust uses the C++ memory model, not the LLVM memory model -- and the LLVM model has been explored very little, so I'd caution against adopting it for Rust (aside from the fact that, as noted above, diverging from the C++ model would be an interop hazard).

@arielb1
Contributor

arielb1 commented Apr 22, 2025

You can definitely do an "inline assembly memcpy" from a raw pointer to e.g. a stack variable.

I am quite sure that:

  1. It is defined to at least put some non-deterministic but stable bytes in that address.
  2. If the memory is not being concurrently modified, it will put the right bytes in that address.

the second step there, "remove redundant read of g by the GVN pass", relies on full UB of racy non-atomic reads.

AFAICT, LLVM at least pretends it lowers a C read into an "UB on race" read, but in addition to that, it supports a "poison on race" read [and you can turn "poison on race" to "non-deterministic but stable bytes on race" via freeze], where LLVM is allowed to convert a program with a conditional "UB on race" read to a program with an unconditional "poison on race" read and a conditional use. I don't see a contradiction in that.

I personally believe that a good internal IR needs to have all of "UB on race", "poison on race", and "nondet but stable on race" reads, but I don't see why a semantics for Rust (as opposed for an internal IR) needs to have "poison on race".

This RFC argues that the surface language needs to have "nondet but stable on race" as well, though not "poison on race"; I am still not sure whether there is a need for poison in the surface semantics.

@RalfJung
Member

RalfJung commented Apr 22, 2025 via email

@arielb1
Contributor

arielb1 commented Apr 22, 2025 via email

@RalfJung
Member

RalfJung commented Apr 23, 2025

Why not? If the access is not racy, then this is well defined code that returns the right value. If the access is racy, then it’s the same as the assembly pulling the numbers from the environment, which is also well defined.

There is no AM operation that can check if an access is racy, so this is not a valid line of reasoning. You cannot use inline assembly to extend the AM with new operations, as correctness proofs involving the AM can and do rely on knowing the full set of operations that can be performed by arbitrary AM code.

Correctness of optimizations assumes that all code accessing AM-visible state is Rust code, therefore inline asm can only perform actions on the AM state that can also be performed by Rust code. See this link I already posted above for more context.

@comex
Copy link

comex commented Apr 23, 2025

I was going to retort that it's not a valid line of reasoning only according to a model which you invented but don't have time to explain. You're usually right about things, but a non-explanation doesn't really move the conversation forward and I had some doubts. But after reading some of your old comments, particularly about angelic and demonic nondeterminism, I became pretty convinced you're right. It's impossible to formalize an asm block like this in terms of case analysis or nondeterminism without, at minimum, making reasoning about the Abstract Machine much more complicated. It would be much simpler to add the specific race-tolerant read operation that's needed.

The formalization can't "just" be branching on non-observable state (namely, whether there was a race); at least, you can't branch on non-observable state in general because that would break the ability of optimizations to change it. In this case it's hard to see a concrete incorrect result from an optimization changing whether there was a race. But that seems to be largely because of how unobservable this particular type of state is. First, even though the formalization would be using a branch, the asm block's outputs don't allow normal code to fully determine whether there was a race. If the data was corrupted, you can conclude that there was a race, but if it wasn't corrupted, you can't conclude that there wasn't. Second, even if you do learn whether there was a race, you can't necessarily derive a contradiction from it. The original semantics are "nondeterministically races or not". An optimization will usually either leave that unchanged or refine it to "never races", but never observing a race doesn't let you prove that the optimization changed anything (because you might have just gotten lucky). In theory, under very specific scenarios, an optimization could change "nondeterministically races" to "always races", and then the program could derive a contradiction if it doesn't race, but again, a program can never prove that there wasn't a race. However, this reasoning is probably not airtight. At minimum it would require a more complex proof, and it would be very special-purpose reasoning; it wouldn't allow the correctness of the asm block to fall out of general principles.

edit: definitely not airtight: on further review, part of what I wrote made no sense!

The idea of angelic nondeterminism seems compelling: let's just say that the asm block angelically-nondeterministically either performs an ordinary read or makes up a value. If there's a race, the ordinary read would be UB, so the angelic choice is forced to go to the make-up-a-value branch. This somewhat sidesteps the observability issue. But as you've said, it runs into the issue that operations that perform demonic nondeterminism, such as malloc, couldn't be reordered across the asm block. Or at least, it's nontrivial to prove that they could be.

Let's see if I can explain why reordering is a problem. Suppose a program obtains a pair of bools a and b where a comes from an angelic choice and b comes from a demonic choice, and after obtaining them, the program executes UB if a != b. If the angelic choice comes first, then the demonic choice can set b to !a to force the program to be undefined. But if the demonic choice comes first, then the angelic choice can set a to b to force the program to be defined. So the definedness of the program depends on the order, and reordering must be disallowed.

In reality, things may not be that dire (I think). You can't just have a normal bool be based on an angelic choice, because that's totally unimplementable without a time machine. Angelic choices, if they exist in the AM, can only affect specific pieces of unobservable state, such as "whether a race occurred" or "which provenance a pointer has". Given the limitations on how you can manipulate that state, perhaps it's actually impossible to express the equivalent of "defined if a == b", i.e. the kind of conditional UB that would break with reordering. But I have no idea how to prove that. Or maybe it is possible to express a == b, but there's some way to limit the scope of nondeterminism in order to nevertheless preserve reorderability. But again I have no idea how to prove that.

So okay, I agree that the Rust AM would be a lot simpler if it didn't support angelic nondeterminism at all. I am less sure that with_exposed_provenance can be formalized without it, but I guess we'll see.

@DemiMarie
Copy link

I think there are two distinct needs being conflated here:

  1. Some lock-free algorithms need to perform operations that are currently not possible to implement in Rust without far too much overhead.
  2. Low-level software needs “read/write/copy to/copy from this memory address” primitives that have machine semantics, including never directly having undefined behavior (but potentially causing undefined behavior in other software).

My understanding is that these needs are very distinct. The first is still within the confines of the abstract machine, so if a program’s behavior can be changed by optimizations then it has undefined behavior. The second, however, is fundamentally an I/O operation, and it permits observing runtime behavior that is not guaranteed by the language and which can be altered by optimizations. rust-lang/unsafe-code-guidelines#321 (which I should make a full RFC out of) is intended for these purposes.

@ais523
Copy link

ais523 commented Apr 24, 2025 via email

@comex
Copy link

comex commented Apr 24, 2025

even when doing so is unsound.

I wouldn't worry about LLVM optimizations that are known to be unsound. Pretty much all of Rust formalization is premised on "this would be sound if LLVM was sound" rather than "this is sound with actual LLVM optimizations". I mean, LLVM still optimizes inttoptr(ptrtoint(x)) to x.

As such, the "this is opaque, so any compiler that misoptimised this would also misoptimise a permitted implementation of the function"

This is a good point. Ralf's basic approach, of formalizing inline asm blocks in terms of equivalent Rust code, fundamentally loses the ability to reason in terms of opaqueness in the way you've stated. I am not sure how valuable that ability is.

To be clear, opaqueness itself is well-settled. The optimizer cannot reason about the instructions inside an asm block because they might be patched at runtime. But for formal modeling purposes, if you're modeling code that includes patchable asm blocks, presumably you would model it in terms of the actual possibilities of what might be there at runtime. So you usually wouldn't need to invoke opaqueness at that step.

And it's easy to make mistakes when invoking opaqueness. For one thing, asm blocks are not totally opaque. As an extreme example, suppose an asm block is marked as pure, readonly, nomem and only takes one u8 as input. A silly but legal optimization would be to precompute the outputs for all possible inputs at process startup, then replace all uses with a table lookup. Now, the "racy read" asm blocks under discussion would not be nomem, but you would probably want to be able to mark them pure and readonly. That might allow the compiler to insert speculative invocations for inputs that don't occur in the original program, if the compiler can prove that they could have occurred given different nondeterministic choices. (If the assembly triggered a fault or made unexpected memory writes given those inputs, then it would be violating the pure condition.)

As a more practical example, opaqueness is often used to justify empty asm blocks as optimization barriers, but a lot of uses of optimization barriers are unsound in the face of value speculation - i.e., if the compiler replaces foo(x) with if x == 42 { foo(42) } else { foo(x) }, and then heavily optimizes the foo(42) part. In most cases that just means broken benchmarks, which aren't a big deal, since we're talking about theory rather than practice and in theory we guarantee nothing about performance characteristics. But if you're trying to use an optimization barrier for correctness, then value speculation might give you UB. Especially if x is a pointer. (Yes, if x is a pointer then that transform wouldn't be valid in general, but it would be valid under some conditions.) ...Which is why black_box probably does deserve to live under std::hint, as much as I dislike that fact.

Still, there must be some programs that can be legitimately justified based on opaqueness. Perhaps racy loads are an example - I haven't thought through your argument enough to judge - but it doesn't matter too much either way, because it's pretty clear that extending the Abstract Machine is a better path forward. Same goes for freeze. But my question is: In a future where we have all those nice things, would there still be useful programs that could only be justified using opaqueness, and not through Ralf's method?

@DemiMarie
Copy link

Debuggers are one case where opaqueness is really needed. My proposal for core::arch::{load, store} explicitly allows reading from freed memory without UB, and in-process debuggers actually need to be able to do that (because one might be debugging the allocator).

@VorpalBlade
Copy link

Debuggers are one case where opaqueness is really needed. My proposal for core::arch::{load, store} explicitly allows reading from freed memory without UB, and in-process debuggers actually need to be able to do that (because one might be debugging the allocator).

  1. How would that work if the address has been unmapped? That would typically trigger a sigsegv or similar.
  2. How would it work if you read memory with side effects (an mmio register for an UART input for example)?
  3. Some memory might have hardware UB. For example if I remember correctly, you are not allowed to access certain memory from interrupt handlers while the microcontroller is in low power mode on ESP32 (I might have mixed up the details here, I saw that when I was looking for something else in the docs).

@DemiMarie
Copy link

Debuggers are one case where opaqueness is really needed. My proposal for core::arch::{load, store} explicitly allows reading from freed memory without UB, and in-process debuggers actually need to be able to do that (because one might be debugging the allocator).

  1. How would that work if the address has been unmapped? That would typically trigger a sigsegv or similar.

Triggering SIGSEGV is the intended behavior here. Debuggers can generally handle this gracefully.

  2. How would it work if you read memory with side effects (an mmio register for an UART input for example)?

The side effects would (and should) happen.

  3. Some memory might have hardware UB. For example if I remember correctly, you are not allowed to access certain memory from interrupt handlers while the microcontroller is in low power mode on ESP32 (I might have mixed up the details here, I saw that when I was looking for something else in the docs).

That’s one reason the operation is unsafe. It has no Rust-level UB, but it also doesn’t guarantee that it doesn’t trigger UB elsewhere. Its semantics are whatever the underlying hardware gives you. The compiler promises to perform exactly the operation you tell it to (unless there is UB elsewhere), but if that breaks something, you get to keep both pieces.
