
Request: Support for concurrent inference of independent models from separate threads #3078

@1-ashraful-islam

Description

Summary

I'm building an application that runs two independent ML models concurrently on Apple Silicon (both macOS and iOS). Each model processes different inputs and produces different outputs - no shared arrays or dependencies between them. Running them concurrently currently crashes due to thread-safety issues in the Metal backend.

I've researched the existing issues (#2133, #2067, #2086) and PR #2104. I'd like to discuss whether this use case can be supported and what the best path forward might be.

Environment

  • MLX version: 0.30.3
  • Platform: macOS and iOS (Apple Silicon)
  • API: C++ (direct MLX C++ API)
  • Build: Built from source

Use Case

Thread A: model_a → inference(input_a) → output_a
Thread B: model_b → inference(input_b) → output_b
(concurrent execution, completely independent)

Both models are:

  • Loaded once at startup via mx::import_function()
  • Wrapped with mx::compile() for kernel fusion
  • Called from their own dedicated threads (minimal repro sketch below)
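
Putting that together, a minimal sketch of the setup (file paths, input shapes, and the zero inputs are placeholders; error handling omitted):

#include <thread>

#include "mlx/mlx.h"

namespace mx = mlx::core;

int main() {
  // Each model is loaded and compiled once at startup.
  auto model_a = mx::compile(mx::import_function("model_a.mlxfn"));
  auto model_b = mx::compile(mx::import_function("model_b.mlxfn"));

  // One dedicated thread per model; inputs and outputs are independent.
  std::thread ta([&] {
    auto out = model_a({mx::zeros({1, 16})});
    mx::eval(out);
  });
  std::thread tb([&] {
    auto out = model_b({mx::zeros({1, 16})});
    mx::eval(out);
  });
  ta.join();
  tb.join();
}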

Current Behavior

Concurrent inference produces Metal assertion failures such as:

-[_MTLCommandBuffer addScheduledHandler:]: failed assertion
'Scheduled handler provided after commit call'
A command encoder is already encoding to this command buffer

I can provide more detailed stack traces if helpful.

What I've Tried

1. Dedicated streams per model using StreamContext

// Construction: give this model its own GPU stream.
dedicated_stream_ = mx::new_stream(mx::Device::gpu);

// Inference: route ops in this scope onto the dedicated stream.
mx::StreamContext ctx(dedicated_stream_);
auto outputs = compiled_model_({input_array});
mx::eval(outputs[0]);  // force evaluation of the lazy graph
mx::synchronize();     // block until the GPU work completes

Result: Crashes. Two threads using StreamContext concurrently race on the global default_streams_ map in Scheduler::set_default_stream().

2. Default stream (no StreamContext)

Both threads use the default GPU stream.

Result: Crashes. Both threads race on the shared DeviceStream::buffer and DeviceStream::encoder fields in get_command_buffer() / commit_command_buffer().

3. Mutex serialization

Wrapping every inference call in a single global mutex (sketch below). This works, but it serializes the two models and negates the benefit of running them in parallel.
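
For completeness, the workaround is just a process-wide lock around graph construction and evaluation (the mutex and helper names here are mine, not MLX's):

#include <functional>
#include <mutex>
#include <vector>

#include "mlx/mlx.h"

namespace mx = mlx::core;

std::mutex g_eval_mutex;  // shared by all inference threads

std::vector<mx::array> run_serialized(
    const std::function<std::vector<mx::array>(const std::vector<mx::array>&)>& model,
    const mx::array& input) {
  // Only one thread builds and evaluates a graph at a time, which
  // avoids the command-buffer races but also removes all parallelism.
  std::lock_guard<std::mutex> lock(g_eval_mutex);
  auto outputs = model({input});
  mx::eval(outputs);
  return outputs;
}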

Root Cause Analysis

I traced this to two issues:

A. StreamContext modifies global state

set_default_stream() writes to Scheduler::default_streams_, a non-thread-safe unordered_map. Concurrent StreamContext usage can corrupt the map or restore the wrong default stream when a context is destroyed.

B. DeviceStream lacks synchronization

Even with unique stream indices, DeviceStream::buffer and encoder are accessed without locks in the eval path, causing races when operations interleave.

Potential Solutions

I'd appreciate guidance on which (if any) of these aligns with MLX's design:

Option 1: Thread-local default streams

Store per-thread stream overrides in thread-local storage. get_default_stream() checks TLS first, set_default_stream() writes to TLS. This makes StreamContext thread-safe without locks.
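
Roughly (illustrative only; the real Scheduler keeps a per-device map, and these names are hypothetical):

#include <optional>

#include "mlx/mlx.h"

namespace mx = mlx::core;

// Per-thread override; empty means "use the process-wide default".
thread_local std::optional<mx::Stream> tls_default_stream;

mx::Stream get_default_stream_tls(mx::Device d) {
  return tls_default_stream ? *tls_default_stream : mx::default_stream(d);
}

void set_default_stream_tls(mx::Stream s) {
  // Writes touch only this thread's slot, so StreamContext's
  // save/restore on destruction cannot race with other threads.
  tls_default_stream = s;
}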

Option 2: Per-DeviceStream synchronization

Add a mutex to DeviceStream so different streams can execute in parallel while same-stream access is serialized. (I saw concerns about deadlock in PR #2104 discussion.)
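
Shape of the change (field names assumed from the Metal backend; this is not the actual MLX declaration):

#include <mutex>

struct DeviceStream {
  std::mutex mtx;           // added: guards the fields below
  void* buffer = nullptr;   // MTL::CommandBuffer* in the real backend
  void* encoder = nullptr;  // active command encoder, likewise
};

void encode_work(DeviceStream& s /*, op details */) {
  // Threads on the same stream serialize here; threads on different
  // streams hold different mutexes and can still overlap on the GPU.
  std::lock_guard<std::mutex> lock(s.mtx);
  // ... get_command_buffer(), encode, commit_command_buffer() ...
}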

Option 3: Explicit stream passing for compiled functions

Allow callers to pass a stream directly when invoking compiled functions, bypassing the default stream mechanism entirely.
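
For example (a hypothetical overload, just to make the shape concrete; compiled functions take no stream argument today):

mx::Stream s = mx::new_stream(mx::Device::gpu);

// Hypothetical: route everything this call encodes onto `s`, without
// ever touching the global default-stream map.
auto outputs = compiled_model_({input_array}, /*stream=*/s);
mx::eval(outputs);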

Option 4: Document as unsupported

If this isn't a priority use case, documenting the limitation clearly would also help users plan accordingly.

Questions

  1. Is concurrent multi-model inference something MLX aims to support?
  2. Is PR #2104 (Metal thread safety) the intended fix? Should discussion continue there instead?
  3. Are there workarounds I've missed that avoid serialization?
  4. Would any of the proposed solutions be welcome as a contribution?

Related

  • Issues: #2133, #2067, #2086
  • PR: #2104 (Metal thread safety)

Thanks for MLX - the performance on Apple Silicon is excellent. Happy to provide more details or test proposed fixes.
