Add Global::value_ptr() for cross-thread cooperative signaling #12587

Closed
AlbertMarashi wants to merge 3 commits into bytecodealliance:main from AlbertMarashi:patch-1

Conversation

@AlbertMarashi commented Feb 13, 2026

Summary

  • Adds unsafe fn value_ptr(&self, store: impl AsContext) -> *mut u8 to Global
  • Returns a raw pointer to a mutable numeric global's underlying VMGlobalDefinition storage
  • Enables cross-thread cooperative scheduling without requiring &mut Store

Motivation

There is currently no way to signal a running WebAssembly instance from another thread using a wasm-visible global. The existing Global::set() requires &mut Store, which is exclusively held by the thread executing the module. This creates a deadlock in the API: you can't mutate the global until execution finishes, but execution won't finish until the global is mutated.

The concrete use case is cooperative suspend/resume (checkpoint/restore). A WASM transform inserts global.get $flag; br_if $suspend checks at loop heads and after call sites. The host signals suspension by writing 1 to that flag from a control thread. The module sees the flag, saves its locals to a shadow stack, and returns. On resume, the host clears the flag and re-calls the function, which restores locals and continues from where it left off.
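
For illustration, here is a minimal sketch of the instrumented shape, validated with the wat crate. The $suspending global, the labels, and the shadow-stack step are hypothetical names for this example, not part of the PR itself:

// Hypothetical sketch of the injected loop-head check described above.
let wasm = wat::parse_str(r#"
    (module
      (global $suspending (mut i32) (i32.const 0))
      (func (export "run")
        (block $suspend
          (loop $head
            ;; injected check at the loop head
            (br_if $suspend (global.get $suspending))
            ;; ... original loop body ...
            (br $head)))
        ;; exited via $suspend: save locals to a shadow stack, then return
      ))
"#).unwrap();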

Wasmtime's existing interruption mechanisms don't solve this:

  • Epoch interruption traps the instance — the module never gets a chance to save its own state before stopping, so it can't be resumed.
  • Fuel has the same problem — exhaustion causes a trap, not a cooperative yield.

What's needed is a way for the module itself to observe a host-written flag and choose to suspend, preserving its own stack. That requires the host to write to a global that the running module can read — without holding &mut Store.

Design

value_ptr() is deliberately narrow:

  • unsafe — the caller is responsible for lifetime management (pointer is valid only while the Store lives) and data-race avoidance (write_volatile or AtomicI32::from_ptr).
  • Restricted to numeric types — panics on funcref/externref, which require GC coordination.
  • Restricted to mutable globals — panics on Mutability::Const.
  • Only needs AsContext (shared ref) — the point is to obtain the pointer before moving the store to a worker thread. No &mut Store required.

Example

let flag = instance.get_global(&mut store, "suspending").unwrap();
let ptr = unsafe { flag.value_ptr(&store) } as *mut i32;

// Move store to worker thread
std::thread::spawn(move || {
    run.call(&mut store, ()).unwrap();
});

// Signal from control thread
unsafe { std::ptr::write_volatile(ptr, 1); }
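
The same signal can also be expressed through the AtomicI32::from_ptr view mentioned in the design notes (a sketch; from_ptr requires Rust 1.75+):

use std::sync::atomic::{AtomicI32, Ordering};

// Treat the global's storage as an atomic and store through it.
let atomic_flag = unsafe { AtomicI32::from_ptr(ptr) };
atomic_flag.store(1, Ordering::Relaxed);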

Implement the ability to obtain an unsafe raw pointer into global variable memory in WebAssembly modules.
@AlbertMarashi AlbertMarashi requested a review from a team as a code owner February 13, 2026 06:18
@AlbertMarashi AlbertMarashi requested review from fitzgen and removed request for a team February 13, 2026 06:18
@bjorn3 (Contributor) commented Feb 13, 2026

I don't think this is sound. Globals are accessed through non-atomic loads and stores and thus a concurrent modification is UB.

@AlbertMarashi (Author)

I don't think this is sound. Globals are accessed through non-atomic loads and stores and thus a concurrent modification is UB.

Aren't all writes to i32s atomic?

@bjorn3 (Contributor) commented Feb 13, 2026

No. In Rust, a store that is not explicitly marked as atomic is a data race, and therefore UB, if it is concurrent with any other load or store (atomic or not) on the same memory. And a store that is marked as atomic can only run concurrently with other atomic operations on the same memory. The compiler is allowed, for example, to assume that if it duplicates a non-atomic load, with no intervening operation on the same thread that could modify the memory, both loads will produce the same value. If another thread were allowed to store between the two operations, that would be UB. Similarly, the compiler is allowed to fold successive non-atomic loads with no stores in between. So, for example, while(!done) {} may be optimized into an infinite loop: the load is non-atomic, so no concurrent modification is possible without UB.

This is unlike wasm, where data races in linear memory are not UB when that linear memory is marked as shared (if not marked as shared, the wasm runtime is required to deny concurrent accesses). However, globals are not stored in linear memory, and current wasm versions do not support marking globals as shared. Also, the Pulley interpreter, for example, unconditionally uses non-atomic accesses in Rust code for globals and is thus subject to the Rust data-race rules.
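
To make the while(!done) {} point concrete, here is a minimal illustration (not from this thread) of the racy pattern versus its well-defined atomic equivalent:

use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};

// UB: a non-atomic bool read concurrently with a write from another thread
// is a data race; the compiler may hoist the load and spin forever.
//   static mut DONE: bool = false;
//   while unsafe { !DONE } {}   // may be optimized to `loop {}`

// Well-defined: atomic accesses make the concurrent store legal and visible.
let done = Arc::new(AtomicBool::new(false));
let done2 = Arc::clone(&done);
std::thread::spawn(move || done2.store(true, Ordering::Relaxed));
while !done.load(Ordering::Relaxed) {} // will eventually observe the store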

@AlbertMarashi (Author) commented Feb 13, 2026

Interesting... I didn't know that.

So, why don't we have global atomics in that case?

However, judging by how globals are currently implemented in Wasmtime, it appears that these globals are loaded from their memory address on each access, no?

@AlbertMarashi (Author)

As it stands, there appears to be no way to communicate with the code running inside a module once it has started, except by unsafely mutating data at a given index in the module's memory from another thread.

That approach likely has the same issues you describe, so I am not sure what the right approach is here. It might be worth looking at how the engine's increment_epoch functionality allows a running module to be stopped from another thread. I will report back.

@bjorn3 (Contributor) commented Feb 13, 2026

So, why don't we have global atomics in that case?

Because wasm doesn't yet support sharing an instance between threads. https://github.com/webAssembly/shared-everything-threads is still a draft.

However, judging by how globals are currently implemented in Wasmtime, it appears that these globals are loaded from their memory address on each access, no?

Using non-atomic loads, so the compiler is allowed to assume that the global won't change. Maybe Cranelift won't miscompile it, but for Pulley we are at the mercy of the compiler that compiled the interpreter, which does consider it UB:

let ret = unsafe { self.addr::<T, I>(i)?.read_unaligned() };

It is possible to make concurrent accesses to globals sound, but it would require auditing all the places where globals are accessed to ensure atomic accesses are used. It would also technically be an extension of the wasm specification, which, as I understand it, Wasmtime prefers not to do.

@AlbertMarashi (Author)

So, it appears that Wasmtime currently uses atomic integers to track the epoch:

pub(crate) fn current_epoch(&self) -> u64 {
    self.epoch_counter().load(Ordering::Relaxed)
}

And here:

#[cfg(target_has_atomic = "64")]
pub fn epoch_ptr(self: Pin<&mut Self>) -> &mut Option<VmPtr<AtomicU64>> {
    let offset = self.offsets().ptr.vmctx_epoch_ptr();
    unsafe { self.vmctx_plus_offset_mut(offset) }
}

So, essentially what I am asking for is an external API to do what Wasmtime already does internally for epoch-based suspension, as I am working on a crate that will allow snapshotting instances and arbitrary resumability.

Would this require our globals to become atomic? A new atomic global type? Etc.

What are your thoughts?

@github-actions github-actions bot added the wasmtime:api Related to the API of the `wasmtime` crate itself label Feb 13, 2026
@cfallin (Member) commented Feb 13, 2026

@AlbertMarashi bjorn3 is correct in all of the above: what you have built here is fundamentally at odds with the thread safety of core Wasmtime. A Store uniquely owns all storage used by the Wasm instance while running. We cannot provide the API as written in this PR, because it is incorrect.

Two ways out that I can think of:

  • You could make use of SharedMemory, arrange for your module to import that memory, and use atomic reads within Wasm. This will require work on the toolchain end and will also not work with components.
  • You could import a host function that itself has an Arc<AtomicBool> in its closure and reads that. Calling this will be a lot slower than reading a global, however.
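
The second option might look roughly like this (a sketch; the import names and the surrounding engine setup are assumptions for illustration, and error handling is elided):

use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};
use wasmtime::*;

let flag = Arc::new(AtomicBool::new(false));
let flag2 = Arc::clone(&flag);

// The guest imports this function and calls it at its injected check points.
let mut linker: Linker<()> = Linker::new(&engine);
linker.func_wrap("host", "should_suspend", move || -> i32 {
    flag2.load(Ordering::Relaxed) as i32
})?;

// Any other thread can now signal a suspend without touching the Store.
flag.store(true, Ordering::Relaxed);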

Taking a step back, though: what are you actually trying to achieve? When you say suspend/resume do you mean that the Wasm guest saves its state somewhere and returns? Is the goal to timeslice multiple Wasm invocations into a single instance?

If so, you will be interested in the new async component model features (see e.g. Store::run_concurrent and Func::call_concurrent), as well as the in-progress component model cooperative threading work. That work has done the hard part of thinking through state ownership handoffs that you're plowing through/disregarding here.

@AlbertMarashi (Author)

Taking a step back, though: what are you actually trying to achieve? When you say suspend/resume do you mean that the Wasm guest saves its state somewhere and returns? Is the goal to timeslice multiple Wasm invocations into a single instance?

If so, you will be interested in the new async component model features (see e.g. Store::run_concurrent and Func::call_concurrent), as well as the in-progress component model cooperative threading work. That work has done the hard part of thinking through state ownership handoffs that you're plowing through/disregarding here.

Can you tell me more about this?

How does it address things that may never yield back to the host, such as infinite loops or exponential function bombs?


To clarify, what I am attempting to achieve is near-native-speed transformation of WASM modules to support arbitrary suspend + snapshot + resume capabilities for WASM instances.

The business use case is a generic cloud execution platform that supports "durable" persistent functions/instances that have the ability for modules to suspend their execution for unbounded amounts of time.

This requires us to have a way to serialize all of the module state into a data file somewhere, to be later resumed by our orchestrator when a new request or event is triggered.

This requires us to be able to:

  1. Augment WASM code (or compiled output) to inject fast test + branch-if-nonzero-style checks, ideally at function-call/loop boundaries, or even at the instruction level if possible.
  2. Snapshot and serialize the instance state once suspended, which requires the ability to serialize the stack trace, linear memory, module code, and other resources and objects.
  3. Reload and deserialize the instance state in a consistent and deterministic manner, ready to continue executing the code where we effectively left off.

Note: not all resources will, of course, be fully serializable (e.g. TCP connections / WebSocket connections); however, we intend for our host to maintain these types of connections in the background whilst the module is "sleeping", until a new packet/event comes in.

@alexcrichton (Member)

In addition to the Wasmtime-specific thread-safety properties, I think this problem can also be viewed from the spec level: "this isn't possible to do in wasm right now".

Let's say that a wasm has a big call stack which bottoms out in an infinite loop. If I understand @AlbertMarashi's PR correctly, what you're thinking of doing is that this infinite loop would be instrumented (externally, via a wasm->wasm transformation) to have a global.get at the loop header which spills state and then returns back down the stack, triggering everything else to spill state too. Semantically what this would look like to wasm, however, is that the value of the global changes between iterations of the loop without wasm doing anything (e.g. no calls to the host, no mutations of the global, nothing). My understanding is that this is a violation of WebAssembly semantics and would be spec-noncompliant behavior were Wasmtime to allow it.

In theory, the best-fitting feature here is a shared global. This is part of the shared-everything-threads proposal that bjorn3 mentioned, and it is not yet implemented in Wasmtime. It would require atomic loads/stores to the global and would correctly model the ability for external actors to mutate the global during wasm execution. The next-best-fitting feature is what @cfallin mentioned: shared memory. This is a somewhat heavyweight feature to use here, since you'd need a full 64k of memory just for this one global, but it would work because the wasm would read a byte in memory to see if it should return and the host would mutate that byte when it wanted to inject a yield.

Can you tell me more about this?

How does it address things that may never yield back to the host, such as infinite loops or exponential function bombs?

Wasmtime's support for async-invoking wasm is documented here. In short with either epochs or fuel we force wasm to time-slice itself during infinite loops and exponential function bombs. It works very similarly to what you're thinking, we inject checks in loop headers and function headers.

What Wasmtime doesn't support, however, is mutation of the store while WebAssembly is suspended or time-sliced. Wasmtime also doesn't support serializing this state to get resumed later on.
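
For reference, the epoch-based time-slicing looks roughly like this on the host side (a sketch using the async API; error handling elided):

let mut config = Config::new();
config.async_support(true);
config.epoch_interruption(true);
let engine = Engine::new(&config)?;

let mut store: Store<()> = Store::new(&engine, ());
// Yield control back to the host future whenever the deadline lapses,
// then bump the deadline by 1 so execution can continue afterwards.
store.epoch_deadline_async_yield_and_update(1);

// Some other thread drives the clock:
// engine.increment_epoch();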


All that's to say: I think your best path forward right now is the same wasm->wasm transformation you have today to inject instrumentation. Instead of using a global to signal "please spill and suspend" you would instead use a shared memory and some byte within that shared memory. That should all work on Wasmtime as-is today and require no Wasmtime modifications.
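
A sketch of that shared-memory signal on the host side (the import names are hypothetical, and the engine/linker/store setup is assumed; the guest would read the byte with an atomic load):

use std::sync::atomic::{AtomicU8, Ordering};
use wasmtime::*;

// One 64KiB shared memory whose first byte carries the suspend flag.
let memory = SharedMemory::new(&engine, MemoryType::shared(1, 1))?;
linker.define(&store, "host", "signal", memory.clone())?;

// From any thread: set the flag byte atomically.
let byte_ptr = memory.data()[0].get();
unsafe { AtomicU8::from_ptr(byte_ptr).store(1, Ordering::Relaxed) };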

@AlbertMarashi (Author)

That makes sense.

Should this issue remain open in case there is a future proposal for atomic globals implemented in WASM?

@alexcrichton (Member)

I'll close this in favor of #9466, which is loosely the tracking issue for shared-everything-threads, which would include atomic globals.

@fitzgen (Member) commented Feb 17, 2026

All that's to say: I think your best path forward right now is the same wasm->wasm transformation you have today to inject instrumentation. Instead of using a global to signal "please spill and suspend" you would instead use a shared memory and some byte within that shared memory. That should all work on Wasmtime as-is today and require no Wasmtime modifications.

Agreed, although I would say that you could potentially represent the interrupt check with a call to an imported component function and use compile-time builtins to lower that to the tight code you are chasing. We would need to add unsafe intrinsics for atomic loads, but that seems pretty reasonable to me.


The final thing I would add is that your Wasm-to-Wasm transformation will need to unwind and rewind the stack (if you aren't relying on your Wasm programs cooperating with this interruption and state-saving stuff themselves) which is basically the same thing that binaryen's asyncify is doing, so it is worth looking into their implementation for inspiration if not even something you can reuse or fork. Also look into continuation-passing style transforms, which enable similar things. Both are going to add execution overheads compared to the original, uninstrumented Wasm program, however. That is pretty inescapable.

@AlbertMarashi (Author)

which is basically the same thing that binaryen's asyncify is doing

That's correct. In fact, that's pretty much the approach I was trying to implement before running into this roadblock.

Both are going to add execution overheads compared to the original, uninstrumented Wasm program, however. That is pretty inescapable

This is largely true; however, I came up with a potential solution I am currently experimenting with that provides a zero-cost approach to suspendability/resumability, with support for universal, serializable instance snapshots via cross-thread signalling/interrupts.

The idea behind that is essentially this:

  1. Leave the compiled code as-is.
  2. At compile time, track compilation information and program metadata to construct side tables that provide us the necessary information to perform arbitrary resumability at any WASM PC.
  3. For native code, this would involve tracking live variable state and logic to convert register state at native PCs into a more universal VM-like program stack.
  4. When an interrupt occurs (e.g. by page-table fault, or by thread kill/interrupt), we'd receive the register state of the running instance. From here, we map register state/locals into a stack-based representation of state.
  5. Next, we could snapshot the module and serialize or persist it, or alternatively proceed to the next step (resume).
  6. Resume would be the inverse of suspending: reading the program's stack to reload live values back into their respective registers, as expected by the native code, and resuming instance execution from where we left off (ideally in the identical state as before).

This approach keeps the hot execution path free of code augmentations and checks, and relies on the CPU's native capabilities to stop program execution. The cold path (suspend) would occur at far less frequent intervals (e.g. time-based instance scheduling), so the cost of suspending, serializing, and resuming modules is amortized over the main execution time.

The only extra cost would be the memory/disk required to store dense side tables that give our engine the information needed to map registers to stacks at arbitrary PC points, which may potentially double the generated code size (although all of this could live in on-disk/mmap'd code files, given the infrequent, random access pattern).


I've dropped this Wasmtime Zulip conversation log here for anyone who finds this in the future.

#general > WASM Snapshotting and Resumability

@cfallin (Member) commented Feb 19, 2026

@AlbertMarashi you're correct that having the ability to map from native machine state to Wasm VM state (locals and operand stack), and then from Wasm VM state back to native machine code, would let you take a portable snapshot.

However it's an enormous undertaking from a compiler and runtime perspective:

  • This means that you need to preserve all values that exist in the Wasm bytecode, or alternatively have a way to "recover" (recompute) them at any interruption point. The latter is a structural property of all compiler passes (they need to be "reversible" in the sense that they generate a recovery path when deleting code). If you don't take the recovery-code approach, you're not actually zero-overhead.
  • This means that you need to modify the register allocator to provide a fully precise mapping from register state to SSA (and then separately map SSA, via recovery code, to Wasm VM state) at every interruptible program point. Again possible, and regalloc2 even has some support for this via the debug-labels mechanism, but it's very expensive at compile time, and somewhat finicky in the presence of liverange merging.
  • This means that you need to essentially support on-stack replacement, i.e., the ability to generate a stack frame to "side-enter" a function when you resume it. This is even harder than the register-state-to-Wasm-state mapping, because the registers are an arbitrary superset of the Wasm state: you may have e.g. intermediate results that have been computed and kept in registers by register allocation, live across interruption points, and you need some way of dumping the expressions that can recompute them for every register in the machine.
    • You can sort of model this in a somewhat principled way by having an extra entrypoint in the IR's CFG for every resume-point (or equivalently, a single virtual entry block that has out-edges to every resume block), but that significantly pessimizes codegen: it means that the function preamble doesn't dominate most code anymore (anything past a resume point), so you're going to have oodles of recomputation of basic things like the memory base loaded from vmctx. Not to mention all of this is also a major change to fundamental invariants in the compiler.
    • Then you have the relocation problem: some of those values that are live in the compiler IR are going to be native pointers, and Cranelift doesn't preserve pointer provenance (for simplicity, by design), so they're just integers in a soup of computation once produced; you have no way to find them and update them if things are mapped at different addresses on resume.

I think you could approach this in a somewhat tractable way by starting from what I did for debug instrumentation, and going the other way for resume -- always loading values back from the stackslots after every interruption point; together with something to ensure that you never have other live values across these points. That's more or less what I describe in the debugging RFC v2 here, around "As a final interesting note: if we ever want to implement on-stack replacement (OSR) ...". But note that the instrumentation approach is decidedly not zero-overhead: it is something like a 2x slowdown. So it's great for debugging, but not something you'd deploy transparently.

tl;dr: "construct some side tables so I can simply read out and reconstitute the native register state" has these "side tables" doing a lot of heavy lifting in a way that requires fundamental changes to the compiler, because you need a fully accurate bijection from Wasm state to register state, with no register left out and all native address dependencies accounted for.
