## Summary

We should redesign tenferro's execution-selection model so that:

- `logical_memory_space` remains the allocation-time placement concept
- `compute_device` becomes the first-class execution-routing concept
- explicit runtime/context overrides become secondary, advanced controls

This would move tenferro closer to PyTorch-style tensor-centric dispatch without collapsing memory placement and execution context into the same field.
## Problem

The current design mixes three concerns in a way that is hard to reason about:

- Allocation placement
  - `Tensor<T>` already stores `logical_memory_space`
  - this is required when allocating a tensor
- Execution routing
  - `Tensor<T>` currently has `preferred_compute_device: Option<ComputeDevice>`
  - this is only a hint/preference, not a canonical execution identity
- Runtime/context selection
  - builder-style `.run()` paths depend on thread-local `set_default_runtime(...)`
  - if the runtime is not configured, `with_default_runtime(...)` returns `RuntimeNotConfigured`

This makes the public model feel runtime-first rather than tensor-first.
In particular:

- a tensor does not have a canonical `compute_device()` API today
- user-facing code is encouraged to install a thread-local default runtime even when the tensors themselves already carry placement-related metadata
- `CpuContext`, `CudaContext`, and `RocmContext` are exposed as separate concrete types, which is awkward for user-facing override APIs
- CUDA context initialization is conceptually device-bound, but the current high-level API does not expose a clean `for_device(device_id)`-style constructor and currently defaults to device 0 internally in `CudaBackend::load(...)`
## Design goals

- Keep memory placement and execution routing separate.
- Make execution dispatch tensor-centric by default.
- Keep explicit execution context APIs for advanced control.
- Support multiple CPU execution pools as distinct compute devices.
- Make GPU context identity explicit via `device_id`.
- Improve ergonomics of override APIs in Rust.
## Proposed model

### 1. Keep `logical_memory_space` as the allocation-time concept

This remains required for tensor allocation.

Examples:

- `LogicalMemorySpace::MainMemory`
- `LogicalMemorySpace::GpuMemory { device_id }`

This answers: where does the buffer live?
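As a point of reference, a minimal self-contained sketch of what this placement enum might look like — only the two variants listed above come from the proposal; the derives and the `describe` helper are illustrative assumptions:

```rust
// Hypothetical sketch of the allocation-time placement concept.
// Only MainMemory and GpuMemory appear in the proposal; the derives
// and the describe() helper are illustrative additions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LogicalMemorySpace {
    MainMemory,
    GpuMemory { device_id: usize },
}

impl LogicalMemorySpace {
    /// Answers "where does the buffer live?" as a human-readable string.
    pub fn describe(&self) -> String {
        match self {
            LogicalMemorySpace::MainMemory => "main memory".to_string(),
            LogicalMemorySpace::GpuMemory { device_id } => {
                format!("GPU memory on device {device_id}")
            }
        }
    }
}

fn main() {
    println!("{}", LogicalMemorySpace::MainMemory.describe());
    println!("{}", LogicalMemorySpace::GpuMemory { device_id: 1 }.describe());
}
```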
### 2. Introduce a canonical `compute_device` concept

Execution routing should use a canonical `compute_device()` API rather than the current `preferred_compute_device()` terminology.

Proposed user-facing shape:

```rust
impl<T> Tensor<T> {
    pub fn compute_device(&self) -> Option<ComputeDevice>;
    pub fn set_compute_device(&mut self, device: Option<ComputeDevice>);
    pub fn with_compute_device(&self, device: Option<ComputeDevice>) -> Self;
}
```

`preferred_compute_device` can be treated as an implementation detail or migration step, but the public semantic should become: this tensor's execution-routing default.
### 3. Dispatch order: compute-device inference first, runtime override second

Proposed priority order for execution:

1. explicit op/device argument
2. tensor-local `compute_device`
3. scoped `with_compute_device(...)`
4. scoped `with_compute_context(...)`
5. inference from `logical_memory_space` when unambiguous
6. otherwise return an ambiguity/configuration error

This makes the tensor the default source of truth for execution routing.
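To make the precedence concrete, here is a self-contained sketch of the resolution step. All types and names are stand-ins for the proposed API, not existing tenferro code, and the inference rule at step 5 is a simplifying assumption (GPU memory maps to a CUDA device; main memory is treated as ambiguous because of multiple CPU pools):

```rust
// Illustrative stand-ins for the proposed types; not real tenferro API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ComputeDevice {
    Cpu { pool_id: usize },
    Cuda { device_id: usize },
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LogicalMemorySpace {
    MainMemory,
    GpuMemory { device_id: usize },
}

/// Everything the dispatcher can consult, in priority order.
pub struct DispatchInputs {
    pub explicit_arg: Option<ComputeDevice>,          // 1. explicit op/device argument
    pub tensor_device: Option<ComputeDevice>,         // 2. tensor-local compute_device
    pub scoped_device: Option<ComputeDevice>,         // 3. scoped with_compute_device(...)
    pub scoped_context_device: Option<ComputeDevice>, // 4. scoped with_compute_context(...)
    pub memory_space: Option<LogicalMemorySpace>,     // 5. inference source
}

#[derive(Debug, PartialEq, Eq)]
pub enum DispatchError {
    Ambiguous,
}

pub fn resolve(inputs: &DispatchInputs) -> Result<ComputeDevice, DispatchError> {
    inputs
        .explicit_arg
        .or(inputs.tensor_device)
        .or(inputs.scoped_device)
        .or(inputs.scoped_context_device)
        .or_else(|| match inputs.memory_space {
            // 5. unambiguous inference: GPU memory implies a GPU device
            Some(LogicalMemorySpace::GpuMemory { device_id }) => {
                Some(ComputeDevice::Cuda { device_id })
            }
            // Main memory could map to any of several CPU pools: ambiguous here.
            _ => None,
        })
        // 6. otherwise: ambiguity/configuration error
        .ok_or(DispatchError::Ambiguous)
}

fn main() {
    let inputs = DispatchInputs {
        explicit_arg: None,
        tensor_device: Some(ComputeDevice::Cpu { pool_id: 0 }),
        scoped_device: Some(ComputeDevice::Cuda { device_id: 1 }),
        scoped_context_device: None,
        memory_space: None,
    };
    // Tensor-local device (step 2) wins over the scoped override (step 3).
    println!("{:?}", resolve(&inputs));
}
```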
### 4. Add a user-facing `ComputeContext` enum

The current split `CpuContext` / `CudaContext` / `RocmContext` is fine internally, but user-facing override APIs are easier to use if they accept a single enum.

Proposed shape:

```rust
pub enum ComputeContext {
    Cpu(CpuContext),
    Cuda(CudaContext),
    Rocm(RocmContext),
}
```

Suggested home:

- implementation home: `tenferro-internal-runtime`
- public re-export: `tenferro` and `tenferro-dynamic-compute`

Not `tenferro-device`, because `ComputeContext` depends on concrete backend context types and therefore belongs above the low-level foundation layer.
### 5. Make override APIs device-first and context-second

Normal users should mostly interact with `ComputeDevice`, not raw contexts.

Proposed APIs:

```rust
pub fn with_compute_device<R>(
    device: ComputeDevice,
    f: impl FnOnce() -> Result<R>,
) -> Result<R>;

pub fn with_compute_context<R>(
    ctx: ComputeContext,
    f: impl FnOnce() -> Result<R>,
) -> Result<R>;
```

Meaning:

- `with_compute_device(...)`: normal execution override
- `with_compute_context(...)`: advanced override for thread pools, streams, handles, allocators, etc.
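One plausible way to implement the scoped override is a thread-local stack, sketched below. The thread-local mechanism is an assumption, not settled design, and the closure is simplified to return `R` rather than `Result<R>` for brevity:

```rust
use std::cell::RefCell;

// Illustrative stand-in for the proposed device enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ComputeDevice {
    Cpu { pool_id: usize },
    Cuda { device_id: usize },
}

thread_local! {
    // Innermost scope wins; a stack supports nested overrides.
    static DEVICE_STACK: RefCell<Vec<ComputeDevice>> = RefCell::new(Vec::new());
}

pub fn with_compute_device<R>(device: ComputeDevice, f: impl FnOnce() -> R) -> R {
    DEVICE_STACK.with(|s| s.borrow_mut().push(device));
    let out = f();
    DEVICE_STACK.with(|s| {
        s.borrow_mut().pop();
    });
    out
}

/// What the dispatcher would consult at step 3 of the priority order.
pub fn current_scoped_device() -> Option<ComputeDevice> {
    DEVICE_STACK.with(|s| s.borrow().last().copied())
}

fn main() {
    assert_eq!(current_scoped_device(), None);
    let seen = with_compute_device(ComputeDevice::Cpu { pool_id: 2 }, current_scoped_device);
    assert_eq!(seen, Some(ComputeDevice::Cpu { pool_id: 2 }));
    // The override does not leak past the closure.
    assert_eq!(current_scoped_device(), None);
    println!("scoped override sketch ok");
}
```

Note that a production version would pop the stack via a drop guard so the override is also unwound if `f` panics.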
### 6. GPU contexts should be explicitly device-bound

Creating a GPU context should require selecting an existing device id.

Examples:

```rust
let ctx = CudaContext::for_device(0)?;
let ctx = RocmContext::for_device(1)?;
```

For normal execution, this should usually be resolved lazily from `ComputeDevice` via an internal registry/cache.
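A lazy per-device registry could look roughly like the following. The context type here is a simulated stand-in; in real code the expensive part being cached is CUDA/ROCm driver initialization inside `for_device`:

```rust
use std::collections::HashMap;
use std::sync::{Mutex, OnceLock};

/// Stand-in for an expensive, device-bound context (e.g. a CudaContext).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FakeGpuContext {
    pub device_id: usize,
}

impl FakeGpuContext {
    fn for_device(device_id: usize) -> Self {
        // Real code would initialize the driver context here (fallibly).
        FakeGpuContext { device_id }
    }
}

fn registry() -> &'static Mutex<HashMap<usize, FakeGpuContext>> {
    static REG: OnceLock<Mutex<HashMap<usize, FakeGpuContext>>> = OnceLock::new();
    REG.get_or_init(|| Mutex::new(HashMap::new()))
}

/// Resolve the context for a device id, creating it at most once.
pub fn context_for_device(device_id: usize) -> FakeGpuContext {
    let mut map = registry().lock().unwrap();
    map.entry(device_id)
        .or_insert_with(|| FakeGpuContext::for_device(device_id))
        .clone()
}

fn main() {
    let a = context_for_device(0);
    let b = context_for_device(0);
    // Repeated lookups hit the cache rather than re-initializing.
    assert_eq!(a, b);
    println!("context registry sketch ok");
}
```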
### 7. Multiple CPU pools should be representable as distinct compute devices

This is the key reason not to collapse everything into a PyTorch-style single `device` field.

A plausible direction:

```rust
pub enum ComputeDevice {
    Cpu { pool_id: usize },
    Cuda { device_id: usize },
    Rocm { device_id: usize },
}
```

Then:

- `logical_memory_space` still describes where data lives
- `compute_device` describes where/how computation runs
- CPU execution can be routed across multiple thread pools without pretending there is only one CPU device
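A toy illustration of why distinct CPU pool ids matter: two differently sized pools, selected purely by `ComputeDevice::Cpu { pool_id }`. The pool registry shape is an assumption; a real implementation would hold actual thread pools (e.g. rayon pools) rather than these stubs:

```rust
// Proposed device enum, as sketched above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ComputeDevice {
    Cpu { pool_id: usize },
    Cuda { device_id: usize },
    Rocm { device_id: usize },
}

/// Stand-in for a configured thread pool (a rayon pool in real code).
pub struct CpuPool {
    pub name: &'static str,
    pub num_threads: usize,
}

pub struct CpuPools {
    pools: Vec<CpuPool>,
}

impl CpuPools {
    /// Route a compute device to its CPU pool, if it is a CPU device.
    pub fn route(&self, device: ComputeDevice) -> Option<&CpuPool> {
        match device {
            ComputeDevice::Cpu { pool_id } => self.pools.get(pool_id),
            _ => None, // GPU devices are routed by the GPU context registry
        }
    }
}

fn main() {
    let pools = CpuPools {
        pools: vec![
            CpuPool { name: "latency", num_threads: 2 },
            CpuPool { name: "throughput", num_threads: 16 },
        ],
    };
    let pool = pools.route(ComputeDevice::Cpu { pool_id: 1 }).unwrap();
    println!("routed to {} ({} threads)", pool.name, pool.num_threads);
}
```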
## Why this seems better than the current model

- keeps allocation placement and execution routing conceptually separate
- avoids forcing users to think in terms of thread-local runtime installation for ordinary tensor ops
- aligns normal dispatch more closely with PyTorch's tensor-centric execution model
- still supports richer execution contexts than PyTorch by separating `ComputeDevice` from `ComputeContext`
- provides a cleaner home for future GPU stream / handle / allocator control
## Open questions

- Should `Tensor::compute_device()` be stored directly, or computed from `preferred_compute_device` + `logical_memory_space` during a migration period?
- Should `ComputeDevice::Cpu` use `device_id` or `pool_id` in the public API?
- Should `logical_memory_space` continue to encode GPU `device_id`, or should that migrate into a separate placement descriptor long-term?
- Should `set_default_runtime(...)` remain public as an advanced compatibility layer, or should we replace it with `with_compute_context(...)` entirely?
- How should mixed-device tensor ops report ambiguity or incompatibility?
- How should `tenferro-dynamic-compute` and `tenferro` share the new execution-selection APIs without duplicating surface area?
## Non-goals

This issue is for design review, not for immediate implementation.

In particular, this issue does not propose:

- rewriting the backend contracts in this PR
- collapsing memory placement and execution routing into one field
- removing internal backend-specific context types
## Request for review

I would like review on whether this separation is the right long-term direction:

- `logical_memory_space` for placement
- `compute_device` for routing
- `compute_context` for advanced execution control

and whether the proposed API layering is the right fit for tenferro's architecture.