
Design: make compute-device inference primary and runtime override secondary #602

@shinaoka

Summary

We should redesign tenferro's execution-selection model so that:

  • logical_memory_space remains the allocation-time placement concept
  • compute_device becomes the first-class execution-routing concept
  • explicit runtime/context overrides become secondary, advanced controls

This would move tenferro closer to PyTorch-style tensor-centric dispatch without collapsing memory placement and execution context into the same field.

Problem

The current design mixes three concerns in a way that is hard to reason about:

  1. Allocation placement
    • Tensor<T> already stores logical_memory_space
    • this is required when allocating a tensor
  2. Execution routing
    • Tensor<T> currently has preferred_compute_device: Option<ComputeDevice>
    • this is only a hint/preference, not a canonical execution identity
  3. Runtime/context selection
    • builder-style .run() paths depend on thread-local set_default_runtime(...)
    • if runtime is not configured, with_default_runtime(...) returns RuntimeNotConfigured

This makes the public model feel runtime-first rather than tensor-first.

In particular:

  • a tensor does not have a canonical compute_device() API today
  • user-facing code is encouraged to install a thread-local default runtime even when the tensors themselves already carry placement-related metadata
  • CpuContext, CudaContext, and RocmContext are exposed as separate concrete types, which is awkward for user-facing override APIs
  • CUDA context initialization is conceptually device-bound, but the current high-level API does not expose a clean for_device(device_id)-style constructor; CudaBackend::load(...) currently defaults to device 0 internally

Design goals

  1. Keep memory placement and execution routing separate.
  2. Make execution dispatch tensor-centric by default.
  3. Keep explicit execution context APIs for advanced control.
  4. Support multiple CPU execution pools as distinct compute devices.
  5. Make GPU context identity explicit via device_id.
  6. Improve ergonomics of override APIs in Rust.

Proposed model

1. Keep logical_memory_space as the allocation-time concept

This remains required for tensor allocation.

Examples:

  • LogicalMemorySpace::MainMemory
  • LogicalMemorySpace::GpuMemory { device_id }

This answers: where does the buffer live?
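As a concrete sketch, the placement enum implied by the examples above could look like the following. The variant names come from this issue; the derives and the `is_gpu` helper are illustrative additions, not a confirmed tenferro API.

```rust
/// Hypothetical sketch of the allocation-time placement concept.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum LogicalMemorySpace {
    /// Host (CPU-addressable) memory.
    MainMemory,
    /// Memory resident on a specific GPU.
    GpuMemory { device_id: usize },
}

impl LogicalMemorySpace {
    /// Answers "where does the buffer live?" without implying
    /// anything about how computation on it is routed.
    pub fn is_gpu(&self) -> bool {
        matches!(self, LogicalMemorySpace::GpuMemory { .. })
    }
}
```

Note that nothing here mentions execution: placement stays a pure "where does the buffer live?" question.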

2. Introduce a canonical compute_device concept

Execution routing should use a canonical compute_device() API rather than the current preferred_compute_device() terminology.

Proposed user-facing shape:

impl<T> Tensor<T> {
    pub fn compute_device(&self) -> Option<ComputeDevice>;
    pub fn set_compute_device(&mut self, device: Option<ComputeDevice>);
    pub fn with_compute_device(&self, device: Option<ComputeDevice>) -> Self;
}

preferred_compute_device can be treated as an implementation detail or migration step, but the public semantic should become:

  • this tensor's execution-routing default
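To make the intended semantics concrete, here is a minimal stand-in `Tensor` carrying only the routing metadata discussed above. The constructor, the `data` field, and the `ComputeDevice` variants are illustrative; real tenferro tensors also hold shape, strides, and `logical_memory_space`.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ComputeDevice {
    Cpu { pool_id: usize },
    Cuda { device_id: usize },
}

/// Minimal stand-in tensor: just data plus the execution-routing default.
#[derive(Clone)]
pub struct Tensor<T> {
    #[allow(dead_code)]
    data: Vec<T>,
    compute_device: Option<ComputeDevice>,
}

impl<T: Clone> Tensor<T> {
    pub fn new(data: Vec<T>) -> Self {
        Tensor { data, compute_device: None }
    }

    /// This tensor's execution-routing default, if any.
    pub fn compute_device(&self) -> Option<ComputeDevice> {
        self.compute_device
    }

    pub fn set_compute_device(&mut self, device: Option<ComputeDevice>) {
        self.compute_device = device;
    }

    /// Non-mutating variant: returns a copy with the new routing default.
    pub fn with_compute_device(&self, device: Option<ComputeDevice>) -> Self {
        let mut t = self.clone();
        t.compute_device = device;
        t
    }
}
```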

3. Dispatch order: compute-device inference first, runtime override second

Proposed priority order for execution:

  1. explicit op/device argument
  2. tensor-local compute_device
  3. scoped with_compute_device(...)
  4. scoped with_compute_context(...)
  5. inference from logical_memory_space when unambiguous
  6. otherwise return an ambiguity/configuration error

This makes the tensor the default source of truth for execution routing.
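The priority order can be encoded as a simple fall-through. Everything here is hypothetical (the function name, the error type, and how each source is obtained); the sketch only demonstrates the 1-through-6 ordering, with an ambiguity error as the final fallback.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ComputeDevice {
    Cpu { pool_id: usize },
    Cuda { device_id: usize },
}

#[derive(Debug, PartialEq, Eq)]
pub enum ResolveError {
    /// No source could determine a device (step 6).
    Ambiguous,
}

pub fn resolve_compute_device(
    explicit_arg: Option<ComputeDevice>,          // 1. explicit op/device argument
    tensor_local: Option<ComputeDevice>,          // 2. tensor-local compute_device
    scoped_device: Option<ComputeDevice>,         // 3. scoped with_compute_device(...)
    scoped_context_device: Option<ComputeDevice>, // 4. device implied by with_compute_context(...)
    inferred_from_memory: Option<ComputeDevice>,  // 5. unambiguous inference from logical_memory_space
) -> Result<ComputeDevice, ResolveError> {
    explicit_arg
        .or(tensor_local)
        .or(scoped_device)
        .or(scoped_context_device)
        .or(inferred_from_memory)
        .ok_or(ResolveError::Ambiguous)
}
```

Because the tensor-local device sits above the scoped overrides, the tensor wins whenever both are set, which is exactly the "tensor-first" behavior this proposal is after.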

4. Add a user-facing ComputeContext enum

The current split CpuContext / CudaContext / RocmContext is fine internally, but user-facing override APIs are easier to use if they accept a single enum.

Proposed shape:

pub enum ComputeContext {
    Cpu(CpuContext),
    Cuda(CudaContext),
    Rocm(RocmContext),
}

Suggested home:

  • implementation home: tenferro-internal-runtime
  • public re-export: tenferro and tenferro-dynamic-compute

Not tenferro-device, because ComputeContext depends on concrete backend context types and therefore belongs above the low-level foundation layer.
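One ergonomic payoff of the single enum is that a coarse device identity can be recovered from any backend context with one match. In this sketch the context structs are trivial stand-ins for the real backend types, and the `device()` helper is a hypothetical convenience, not a proposed requirement.

```rust
// Stand-ins for the concrete backend context types.
pub struct CpuContext { pub pool_id: usize }
pub struct CudaContext { pub device_id: usize }
pub struct RocmContext { pub device_id: usize }

/// Single user-facing wrapper over the backend-specific contexts.
pub enum ComputeContext {
    Cpu(CpuContext),
    Cuda(CudaContext),
    Rocm(RocmContext),
}

#[derive(Debug, PartialEq, Eq)]
pub enum ComputeDevice {
    Cpu { pool_id: usize },
    Cuda { device_id: usize },
    Rocm { device_id: usize },
}

impl ComputeContext {
    /// Recovers the coarse device identity behind this context.
    pub fn device(&self) -> ComputeDevice {
        match self {
            ComputeContext::Cpu(c) => ComputeDevice::Cpu { pool_id: c.pool_id },
            ComputeContext::Cuda(c) => ComputeDevice::Cuda { device_id: c.device_id },
            ComputeContext::Rocm(c) => ComputeDevice::Rocm { device_id: c.device_id },
        }
    }
}
```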

5. Make override APIs device-first and context-second

Normal users should mostly interact with ComputeDevice, not raw contexts.

Proposed APIs:

pub fn with_compute_device<R>(
    device: ComputeDevice,
    f: impl FnOnce() -> Result<R>,
) -> Result<R>;

pub fn with_compute_context<R>(
    ctx: ComputeContext,
    f: impl FnOnce() -> Result<R>,
) -> Result<R>;

Meaning:

  • with_compute_device(...)
    • normal execution override
  • with_compute_context(...)
    • advanced override for thread pools, streams, handles, allocators, etc.
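A plausible implementation of the scoped device override is a thread-local that is saved and restored around the closure, so nested scopes compose. This sketch simplifies the proposed signature (plain `R` instead of `Result<R>`) and invents `current_scoped_device`; both are illustrative only.

```rust
use std::cell::Cell;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ComputeDevice {
    Cpu { pool_id: usize },
    Cuda { device_id: usize },
}

thread_local! {
    // The currently active scoped override, if any.
    static SCOPED_DEVICE: Cell<Option<ComputeDevice>> = Cell::new(None);
}

pub fn with_compute_device<R>(device: ComputeDevice, f: impl FnOnce() -> R) -> R {
    // Save the previous override so nested scopes restore correctly.
    let prev = SCOPED_DEVICE.with(|d| d.replace(Some(device)));
    let out = f();
    SCOPED_DEVICE.with(|d| d.set(prev));
    out
}

/// What the dispatcher would consult at step 3 of the priority order.
pub fn current_scoped_device() -> Option<ComputeDevice> {
    SCOPED_DEVICE.with(|d| d.get())
}
```

A `with_compute_context(...)` override could follow the same save-and-restore shape, just carrying a richer context value.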

6. GPU contexts should be explicitly device-bound

Creating a GPU context should require selecting an existing device id.

Examples:

let ctx = CudaContext::for_device(0)?;
let ctx = RocmContext::for_device(1)?;

For normal execution, this should usually be resolved lazily from ComputeDevice via an internal registry/cache.
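The registry/cache idea can be sketched as a mutex-guarded map from device id to a shared, device-bound context, created on first use. `CudaContext::for_device` here is a stub standing in for real driver initialization, and `ContextRegistry` is a hypothetical name.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Stand-in for a device-bound backend context.
#[derive(Debug)]
pub struct CudaContext {
    pub device_id: usize,
}

impl CudaContext {
    pub fn for_device(device_id: usize) -> Result<CudaContext, String> {
        // Real code would initialize the driver context for this device here.
        Ok(CudaContext { device_id })
    }
}

/// Lazily creates and caches one context per GPU device id.
pub struct ContextRegistry {
    contexts: Mutex<HashMap<usize, Arc<CudaContext>>>,
}

impl ContextRegistry {
    pub fn new() -> Self {
        ContextRegistry { contexts: Mutex::new(HashMap::new()) }
    }

    /// Returns the cached context for `device_id`, creating it on first use.
    pub fn cuda(&self, device_id: usize) -> Result<Arc<CudaContext>, String> {
        let mut map = self.contexts.lock().unwrap();
        if let Some(ctx) = map.get(&device_id) {
            return Ok(Arc::clone(ctx));
        }
        let ctx = Arc::new(CudaContext::for_device(device_id)?);
        map.insert(device_id, Arc::clone(&ctx));
        Ok(ctx)
    }
}
```

With this in place, ordinary dispatch can go from `ComputeDevice::Cuda { device_id }` to a live context without users ever constructing one explicitly.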

7. Multiple CPU pools should be representable as distinct compute devices

This is the key reason not to collapse everything into a PyTorch-style single device field.

A plausible direction:

pub enum ComputeDevice {
    Cpu { pool_id: usize },
    Cuda { device_id: usize },
    Rocm { device_id: usize },
}

Then:

  • logical_memory_space still describes where data lives
  • compute_device describes where/how computation runs
  • CPU execution can be routed across multiple thread pools without pretending there is only one CPU device
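To illustrate the last point, routing can key CPU work off `pool_id` while GPU devices go down a different path. `CpuPool` here is a trivial stand-in (an id plus a thread count), not a real executor.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ComputeDevice {
    Cpu { pool_id: usize },
    Cuda { device_id: usize },
    Rocm { device_id: usize },
}

/// Stand-in for a configured CPU execution pool.
pub struct CpuPool {
    pub pool_id: usize,
    pub num_threads: usize,
}

/// Routes a CPU device to its pool; GPU devices are handled elsewhere.
pub fn route(device: ComputeDevice, pools: &[CpuPool]) -> Option<&CpuPool> {
    match device {
        ComputeDevice::Cpu { pool_id } => pools.iter().find(|p| p.pool_id == pool_id),
        _ => None,
    }
}
```

A single PyTorch-style `cpu` device could not distinguish the two pools below; `pool_id` makes that distinction first-class.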

Why this seems better than the current model

  • keeps allocation placement and execution routing conceptually separate
  • avoids forcing users to think in terms of thread-local runtime installation for ordinary tensor ops
  • aligns normal dispatch more closely with PyTorch's tensor-centric execution model
  • still supports richer execution contexts than PyTorch by separating ComputeDevice from ComputeContext
  • provides a cleaner home for future GPU stream / handle / allocator control

Open questions

  1. Should Tensor::compute_device() be stored directly, or computed from preferred_compute_device + logical_memory_space during a migration period?
  2. Should ComputeDevice::Cpu use device_id or pool_id in the public API?
  3. Should logical_memory_space continue to encode GPU device_id, or should that migrate into a separate placement descriptor long-term?
  4. Should set_default_runtime(...) remain public as an advanced compatibility layer, or should we replace it with with_compute_context(...) entirely?
  5. How should mixed-device tensor ops report ambiguity or incompatibility?
  6. How should tenferro-dynamic-compute and tenferro share the new execution-selection APIs without duplicating surface area?

Non-goals

This issue is for design review, not for immediate implementation.

In particular, this issue does not propose:

  • rewriting the backend contracts in this PR
  • collapsing memory placement and execution routing into one field
  • removing internal backend-specific context types

Request for review

I would like review on whether this separation is the right long-term direction:

  • logical_memory_space for placement
  • compute_device for routing
  • compute_context for advanced execution control

and whether the proposed API layering is the right fit for tenferro's architecture.
