
Design: make compute-device inference primary and runtime override secondary #602

@shinaoka

Summary

We should redesign tenferro's execution-selection model so that:

  • logical_memory_space remains the allocation-time placement concept
  • compute_device becomes the first-class execution-routing concept
  • explicit runtime/context overrides become secondary, advanced controls

This would move tenferro closer to PyTorch-style tensor-centric dispatch without collapsing memory placement and execution context into the same field.

Problem

The current design mixes three concerns in a way that is hard to reason about:

  1. Allocation placement
    • Tensor<T> already stores logical_memory_space
    • this is required when allocating a tensor
  2. Execution routing
    • Tensor<T> currently has preferred_compute_device: Option<ComputeDevice>
    • this is only a hint/preference, not a canonical execution identity
  3. Runtime/context selection
    • builder-style .run() paths depend on thread-local set_default_runtime(...)
    • if runtime is not configured, with_default_runtime(...) returns RuntimeNotConfigured

This makes the public model feel runtime-first rather than tensor-first.

In particular:

  • a tensor does not have a canonical compute_device() API today
  • user-facing code is encouraged to install a thread-local default runtime even when the tensors themselves already carry placement-related metadata
  • CpuContext, CudaContext, and RocmContext are exposed as separate concrete types, which is awkward for user-facing override APIs
  • CUDA context initialization is conceptually device-bound, but the current high-level API does not expose a clean for_device(device_id)-style constructor; CudaBackend::load(...) currently defaults to device 0 internally

Design goals

  1. Keep memory placement and execution routing separate.
  2. Make execution dispatch tensor-centric by default.
  3. Keep explicit execution context APIs for advanced control.
  4. Support multiple CPU execution pools as distinct compute devices.
  5. Make GPU context identity explicit via device_id.
  6. Improve ergonomics of override APIs in Rust.

Proposed model

1. Keep logical_memory_space as the allocation-time concept

This remains required for tensor allocation.

Examples:

  • LogicalMemorySpace::MainMemory
  • LogicalMemorySpace::GpuMemory { device_id }

This answers: where does the buffer live?
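As a concrete sketch, the placement enum implied by the examples above could look like the following. The variant names come from this issue; the derives and the `is_gpu` helper are illustrative additions, not a confirmed tenferro API.

```rust
/// Hypothetical sketch of the allocation-time placement concept.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum LogicalMemorySpace {
    /// Host (CPU-addressable) memory.
    MainMemory,
    /// Memory resident on a specific GPU.
    GpuMemory { device_id: usize },
}

impl LogicalMemorySpace {
    /// Answers "where does the buffer live?" without implying
    /// anything about how computation on it is routed.
    pub fn is_gpu(&self) -> bool {
        matches!(self, LogicalMemorySpace::GpuMemory { .. })
    }
}
```

Note that nothing here mentions execution: placement stays a pure "where does the buffer live?" question.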

2. Introduce a canonical compute_device concept

Execution routing should use a canonical compute_device() API rather than the current preferred_compute_device() terminology.

Proposed user-facing shape:

impl<T> Tensor<T> {
    pub fn compute_device(&self) -> Option<ComputeDevice>;
    pub fn set_compute_device(&mut self, device: Option<ComputeDevice>);
    pub fn with_compute_device(&self, device: Option<ComputeDevice>) -> Self;
}

preferred_compute_device can be treated as an implementation detail or migration step, but the public semantic should become:

  • this tensor's execution-routing default
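To make the intended semantics concrete, here is a minimal stand-in `Tensor` carrying only the routing metadata discussed above. The constructor, the `data` field, and the `ComputeDevice` variants are illustrative; real tenferro tensors also hold shape, strides, and `logical_memory_space`.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ComputeDevice {
    Cpu { pool_id: usize },
    Cuda { device_id: usize },
}

/// Minimal stand-in tensor: just data plus the execution-routing default.
#[derive(Clone)]
pub struct Tensor<T> {
    #[allow(dead_code)]
    data: Vec<T>,
    compute_device: Option<ComputeDevice>,
}

impl<T: Clone> Tensor<T> {
    pub fn new(data: Vec<T>) -> Self {
        Tensor { data, compute_device: None }
    }

    /// This tensor's execution-routing default, if any.
    pub fn compute_device(&self) -> Option<ComputeDevice> {
        self.compute_device
    }

    pub fn set_compute_device(&mut self, device: Option<ComputeDevice>) {
        self.compute_device = device;
    }

    /// Non-mutating variant: returns a copy with the new routing default.
    pub fn with_compute_device(&self, device: Option<ComputeDevice>) -> Self {
        let mut t = self.clone();
        t.compute_device = device;
        t
    }
}
```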

3. Dispatch order: compute-device inference first, runtime override second

Proposed priority order for execution:

  1. explicit op/device argument
  2. tensor-local compute_device
  3. scoped with_compute_device(...)
  4. scoped with_compute_context(...)
  5. inference from logical_memory_space when unambiguous
  6. otherwise return an ambiguity/configuration error

This makes the tensor the default source of truth for execution routing.
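The priority order can be encoded as a simple fall-through. Everything here is hypothetical (the function name, the error type, and how each source is obtained); the sketch only demonstrates the 1-through-6 ordering, with an ambiguity error as the final fallback.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ComputeDevice {
    Cpu { pool_id: usize },
    Cuda { device_id: usize },
}

#[derive(Debug, PartialEq, Eq)]
pub enum ResolveError {
    /// No source could determine a device (step 6).
    Ambiguous,
}

pub fn resolve_compute_device(
    explicit_arg: Option<ComputeDevice>,          // 1. explicit op/device argument
    tensor_local: Option<ComputeDevice>,          // 2. tensor-local compute_device
    scoped_device: Option<ComputeDevice>,         // 3. scoped with_compute_device(...)
    scoped_context_device: Option<ComputeDevice>, // 4. device implied by with_compute_context(...)
    inferred_from_memory: Option<ComputeDevice>,  // 5. unambiguous inference from logical_memory_space
) -> Result<ComputeDevice, ResolveError> {
    explicit_arg
        .or(tensor_local)
        .or(scoped_device)
        .or(scoped_context_device)
        .or(inferred_from_memory)
        .ok_or(ResolveError::Ambiguous)
}
```

Because the tensor-local device sits above the scoped overrides, the tensor wins whenever both are set, which is exactly the "tensor-first" behavior this proposal is after.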

4. Add a user-facing ComputeContext enum

The current split CpuContext / CudaContext / RocmContext is fine internally, but user-facing override APIs are easier to use if they accept a single enum.

Proposed shape:

pub enum ComputeContext {
    Cpu(CpuContext),
    Cuda(CudaContext),
    Rocm(RocmContext),
}

Suggested home:

  • implementation home: tenferro-internal-runtime
  • public re-export: tenferro and tenferro-dynamic-compute

Not tenferro-device, because ComputeContext depends on concrete backend context types and therefore belongs above the low-level foundation layer.
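One ergonomic payoff of the single enum is that a coarse device identity can be recovered from any backend context with one match. In this sketch the context structs are trivial stand-ins for the real backend types, and the `device()` helper is a hypothetical convenience, not a proposed requirement.

```rust
// Stand-ins for the concrete backend context types.
pub struct CpuContext { pub pool_id: usize }
pub struct CudaContext { pub device_id: usize }
pub struct RocmContext { pub device_id: usize }

/// Single user-facing wrapper over the backend-specific contexts.
pub enum ComputeContext {
    Cpu(CpuContext),
    Cuda(CudaContext),
    Rocm(RocmContext),
}

#[derive(Debug, PartialEq, Eq)]
pub enum ComputeDevice {
    Cpu { pool_id: usize },
    Cuda { device_id: usize },
    Rocm { device_id: usize },
}

impl ComputeContext {
    /// Recovers the coarse device identity behind this context.
    pub fn device(&self) -> ComputeDevice {
        match self {
            ComputeContext::Cpu(c) => ComputeDevice::Cpu { pool_id: c.pool_id },
            ComputeContext::Cuda(c) => ComputeDevice::Cuda { device_id: c.device_id },
            ComputeContext::Rocm(c) => ComputeDevice::Rocm { device_id: c.device_id },
        }
    }
}
```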

5. Make override APIs device-first and context-second

Normal users should mostly interact with ComputeDevice, not raw contexts.

Proposed APIs:

pub fn with_compute_device<R>(
    device: ComputeDevice,
    f: impl FnOnce() -> Result<R>,
) -> Result<R>;

pub fn with_compute_context<R>(
    ctx: ComputeContext,
    f: impl FnOnce() -> Result<R>,
) -> Result<R>;

Meaning:

  • with_compute_device(...)
    • normal execution override
  • with_compute_context(...)
    • advanced override for thread pools, streams, handles, allocators, etc.
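A plausible implementation of the scoped device override is a thread-local that is saved and restored around the closure, so nested scopes compose. This sketch simplifies the proposed signature (plain `R` instead of `Result<R>`) and invents `current_scoped_device`; both are illustrative only.

```rust
use std::cell::Cell;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ComputeDevice {
    Cpu { pool_id: usize },
    Cuda { device_id: usize },
}

thread_local! {
    // The currently active scoped override, if any.
    static SCOPED_DEVICE: Cell<Option<ComputeDevice>> = Cell::new(None);
}

pub fn with_compute_device<R>(device: ComputeDevice, f: impl FnOnce() -> R) -> R {
    // Save the previous override so nested scopes restore correctly.
    let prev = SCOPED_DEVICE.with(|d| d.replace(Some(device)));
    let out = f();
    SCOPED_DEVICE.with(|d| d.set(prev));
    out
}

/// What the dispatcher would consult at step 3 of the priority order.
pub fn current_scoped_device() -> Option<ComputeDevice> {
    SCOPED_DEVICE.with(|d| d.get())
}
```

A `with_compute_context(...)` override could follow the same save-and-restore shape, just carrying a richer context value.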

6. GPU contexts should be explicitly device-bound

Creating a GPU context should require selecting an existing device id.

Examples:

let ctx = CudaContext::for_device(0)?;
let ctx = RocmContext::for_device(1)?;

For normal execution, this should usually be resolved lazily from ComputeDevice via an internal registry/cache.
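The registry/cache idea can be sketched as a mutex-guarded map from device id to a shared, device-bound context, created on first use. `CudaContext::for_device` here is a stub standing in for real driver initialization, and `ContextRegistry` is a hypothetical name.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Stand-in for a device-bound backend context.
#[derive(Debug)]
pub struct CudaContext {
    pub device_id: usize,
}

impl CudaContext {
    pub fn for_device(device_id: usize) -> Result<CudaContext, String> {
        // Real code would initialize the driver context for this device here.
        Ok(CudaContext { device_id })
    }
}

/// Lazily creates and caches one context per GPU device id.
pub struct ContextRegistry {
    contexts: Mutex<HashMap<usize, Arc<CudaContext>>>,
}

impl ContextRegistry {
    pub fn new() -> Self {
        ContextRegistry { contexts: Mutex::new(HashMap::new()) }
    }

    /// Returns the cached context for `device_id`, creating it on first use.
    pub fn cuda(&self, device_id: usize) -> Result<Arc<CudaContext>, String> {
        let mut map = self.contexts.lock().unwrap();
        if let Some(ctx) = map.get(&device_id) {
            return Ok(Arc::clone(ctx));
        }
        let ctx = Arc::new(CudaContext::for_device(device_id)?);
        map.insert(device_id, Arc::clone(&ctx));
        Ok(ctx)
    }
}
```

With this in place, ordinary dispatch can go from `ComputeDevice::Cuda { device_id }` to a live context without users ever constructing one explicitly.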

7. Multiple CPU pools should be representable as distinct compute devices

This is the key reason not to collapse everything into a PyTorch-style single device field.

A plausible direction:

pub enum ComputeDevice {
    Cpu { pool_id: usize },
    Cuda { device_id: usize },
    Rocm { device_id: usize },
}

Then:

  • logical_memory_space still describes where data lives
  • compute_device describes where/how computation runs
  • CPU execution can be routed across multiple thread pools without pretending there is only one CPU device
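To illustrate the last point, routing can key CPU work off `pool_id` while GPU devices go down a different path. `CpuPool` here is a trivial stand-in (an id plus a thread count), not a real executor.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ComputeDevice {
    Cpu { pool_id: usize },
    Cuda { device_id: usize },
    Rocm { device_id: usize },
}

/// Stand-in for a configured CPU execution pool.
pub struct CpuPool {
    pub pool_id: usize,
    pub num_threads: usize,
}

/// Routes a CPU device to its pool; GPU devices are handled elsewhere.
pub fn route(device: ComputeDevice, pools: &[CpuPool]) -> Option<&CpuPool> {
    match device {
        ComputeDevice::Cpu { pool_id } => pools.iter().find(|p| p.pool_id == pool_id),
        _ => None,
    }
}
```

A single PyTorch-style `cpu` device could not distinguish the two pools below; `pool_id` makes that distinction first-class.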

Why this seems better than the current model

  • keeps allocation placement and execution routing conceptually separate
  • avoids forcing users to think in terms of thread-local runtime installation for ordinary tensor ops
  • aligns normal dispatch more closely with PyTorch's tensor-centric execution model
  • still supports richer execution contexts than PyTorch by separating ComputeDevice from ComputeContext
  • provides a cleaner home for future GPU stream / handle / allocator control

Open questions

  1. Should Tensor::compute_device() be stored directly, or computed from preferred_compute_device + logical_memory_space during a migration period?
  2. Should ComputeDevice::Cpu use device_id or pool_id in the public API?
  3. Should logical_memory_space continue to encode GPU device_id, or should that migrate into a separate placement descriptor long-term?
  4. Should set_default_runtime(...) remain public as an advanced compatibility layer, or should we replace it with with_compute_context(...) entirely?
  5. How should mixed-device tensor ops report ambiguity or incompatibility?
  6. How should tenferro-dynamic-compute and tenferro share the new execution-selection APIs without duplicating surface area?

Non-goals

This issue is for design review, not for immediate implementation.

In particular, this issue does not propose:

  • rewriting the backend contracts in this PR
  • collapsing memory placement and execution routing into one field
  • removing internal backend-specific context types

Request for review

I would like review on whether this separation is the right long-term direction:

  • logical_memory_space for placement
  • compute_device for routing
  • compute_context for advanced execution control

and whether the proposed API layering is the right fit for tenferro's architecture.
