Summary
I'm building an application that runs two independent ML models concurrently on Apple Silicon (both macOS and iOS). Each model processes different inputs and produces different outputs - no shared arrays or dependencies between them. Currently this crashes due to thread-safety issues in the Metal backend.
I've researched the existing issues (#2133, #2067, #2086) and PR #2104. I'd like to discuss whether this use case can be supported and what the best path forward might be.
Environment
- MLX version: 0.30.3
- Platform: macOS and iOS (Apple Silicon)
- API: C++ (direct MLX C++ API)
- Build: Built from source
Use Case
```
Thread A: model_a → inference(input_a) → output_a
Thread B: model_b → inference(input_b) → output_b
(concurrent execution, completely independent)
```
Both models are:
- Loaded once at startup via `mx::import_function()`
- Wrapped with `mx::compile()` for kernel fusion
- Called from their own dedicated threads
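For concreteness, here's a minimal sketch of the setup. This is not my production code; the file names, input shapes, and the `run_model` helper are placeholders:

```cpp
#include <string>
#include <thread>

#include "mlx/mlx.h"

namespace mx = mlx::core;

// Illustrative only: load one exported model, wrap it with mx::compile,
// and run a single inference on the calling thread.
void run_model(const std::string& path, mx::array input) {
  auto imported = mx::import_function(path);  // loaded once per model
  auto compiled = mx::compile(
      [imported](const std::vector<mx::array>& args) { return imported(args); });
  auto outputs = compiled({input});
  mx::eval(outputs);  // this is where the concurrent crash shows up
}

int main() {
  mx::array input_a = mx::zeros({1, 128});  // placeholder shapes
  mx::array input_b = mx::zeros({1, 64});

  // Two fully independent models, each on its own dedicated thread.
  std::thread a(run_model, "model_a.mlxfn", input_a);
  std::thread b(run_model, "model_b.mlxfn", input_b);
  a.join();
  b.join();
}
```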
Current Behavior
Concurrent inference produces Metal assertion failures such as:
```
-[_MTLCommandBuffer addScheduledHandler:]: failed assertion
'Scheduled handler provided after commit call'

A command encoder is already encoding to this command buffer
```
I can provide more detailed stack traces if helpful.
What I've Tried
1. Dedicated streams per model using StreamContext
```cpp
// Construction
dedicated_stream_ = mx::new_stream(mx::Device::gpu);

// Inference
mx::StreamContext ctx(dedicated_stream_);
auto outputs = compiled_model_({input_array});
mx::eval(outputs[0]);
mx::synchronize();
```

Result: Crashes. Two threads using `StreamContext` concurrently race on the global `default_streams_` map in `Scheduler::set_default_stream()`.
2. Default stream (no StreamContext)
Both threads use the default GPU stream.
Result: Crashes. Both threads race on the shared `DeviceStream::buffer` and `DeviceStream::encoder` fields in `get_command_buffer()` / `commit_command_buffer()`.
3. Mutex serialization
This works but negates the benefit of running models in parallel.
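Roughly what that workaround looks like (names such as `inference_mutex` and `run_serialized` are illustrative):

```cpp
#include <functional>
#include <mutex>
#include <vector>

#include "mlx/mlx.h"

namespace mx = mlx::core;

// Illustrative: a single process-wide mutex so only one model evaluates
// at a time. Correct, but the two models no longer overlap on the GPU.
std::mutex inference_mutex;

std::vector<mx::array> run_serialized(
    const std::function<std::vector<mx::array>(const std::vector<mx::array>&)>& model,
    const mx::array& input) {
  std::lock_guard<std::mutex> lock(inference_mutex);
  auto outputs = model({input});
  mx::eval(outputs);
  return outputs;
}
```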
Root Cause Analysis
I traced this to two issues:
A. StreamContext modifies global state
`set_default_stream()` writes to `Scheduler::default_streams_`, a non-thread-safe `unordered_map`. Concurrent `StreamContext` usage can corrupt the map or restore the wrong stream on destruction.
B. DeviceStream lacks synchronization
Even with unique stream indices, `DeviceStream::buffer` and `encoder` are accessed without locks in the eval path, causing races when operations from different threads interleave.
Potential Solutions
I'd appreciate guidance on which (if any) of these aligns with MLX's design:
Option 1: Thread-local default streams
Store per-thread stream overrides in thread-local storage: `get_default_stream()` checks TLS first, and `set_default_stream()` writes to TLS. This makes `StreamContext` thread-safe without locks.
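A rough sketch of the idea only, using plain integers for device and stream identifiers rather than MLX's actual types:

```cpp
#include <unordered_map>

// Sketch, not MLX's actual scheduler code. Each thread keeps its own
// device -> stream-index override, so a StreamContext on one thread never
// touches another thread's view of the default stream, and no lock is needed.
thread_local std::unordered_map<int, int> tls_default_streams;

int get_default_stream_index(int device, int process_wide_default) {
  auto it = tls_default_streams.find(device);
  return it != tls_default_streams.end() ? it->second : process_wide_default;
}

void set_default_stream_index(int device, int stream_index) {
  tls_default_streams[device] = stream_index;  // thread-local write, lock-free
}
```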
Option 2: Per-DeviceStream synchronization
Add a mutex to `DeviceStream` so different streams can execute in parallel while same-stream access is serialized. (I saw concerns about possible deadlock in the PR #2104 discussion.)
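Again only a sketch; the real `DeviceStream` holds Metal command buffer/encoder handles, which I've replaced with stand-ins here. The point is that the lock lives on the stream, not the device, so independent streams can still overlap:

```cpp
#include <mutex>

// Sketch only: per-stream mutex guarding that stream's buffer/encoder state.
struct DeviceStreamSketch {
  std::mutex mtx;                  // guards this stream's buffer/encoder only
  void* command_buffer = nullptr;  // stand-in for the Metal handles
  void* command_encoder = nullptr;
};

void commit_command_buffer(DeviceStreamSketch& stream) {
  std::lock_guard<std::mutex> lock(stream.mtx);
  // end encoding and commit stream.command_buffer under the per-stream lock
}
```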
Option 3: Explicit stream passing for compiled functions
Allow callers to pass a stream directly when invoking compiled functions, bypassing the default stream mechanism entirely.
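Purely hypothetical call shape to illustrate the idea; nothing like `call_on_stream` exists in MLX today:

```cpp
#include <functional>
#include <vector>

#include "mlx/mlx.h"

namespace mx = mlx::core;

// Hypothetical API sketch: the caller names the target stream at call time,
// so the compiled function never reads or writes the global default-stream map.
using CompiledFn =
    std::function<std::vector<mx::array>(const std::vector<mx::array>&)>;

std::vector<mx::array> call_on_stream(
    const CompiledFn& fn, const std::vector<mx::array>& inputs, mx::Stream stream);

// Intended usage (hypothetical):
//   auto outputs = call_on_stream(compiled_model_, {input_array}, dedicated_stream_);
```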
Option 4: Document as unsupported
If this isn't a priority use case, documenting the limitation clearly would also help users plan accordingly.
Questions
- Is concurrent multi-model inference something MLX aims to support?
- Is PR #2104 (Metal thread safety) the intended fix? Should this discussion continue there instead?
- Are there workarounds I've missed that avoid serialization?
- Would any of the proposed solutions be welcome as a contribution?
Related
- #2133 - Ongoing issue for thread safety in MLX (tracking issue)
- #2067 - [BUG] Thread issues with evaluation
- #2086 - [BUG] C++ compile cache should be thread safe
- #2104 - Metal thread safety (PR)
- #1448 - Question about running `mx.eval` in separate threads (discussion)
Thanks for MLX - the performance on Apple Silicon is excellent. Happy to provide more details or test proposed fixes.