Consolidate remaining v2 design issues around semiring backend contracts

cc @tensor4all-meta

## Summary

This issue consolidates the remaining design-v2 questions after the recent placement cleanup. The main unresolved theme is how custom semiring backends should relate to primitive vocabulary, trait contracts, and general execution engines.

## Remaining issues

### 1. Custom semiring backend minimum contract is still inconsistent

Two different contracts are described today.

- `primitive-catalog.md` treats tensor-structural operations like `permute`, `reshape`, `broadcast`, and `diagonal` as tensor-layer views / metadata transforms, so the backend-facing strict minimum is close to `BatchedGemm + ReduceSum` (with diagonal-related behavior depending on whether diagonal stays in the tensor layer).
- `tenferro-internal-design.md` instead puts `Transpose`, `Reshape`, and `BroadcastInDim` into `SemiringOpKind` / `SemiringOps`, which makes them part of the required semiring contract.

These are materially different obligations for custom backends. The design needs one source of truth.
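To make the first reading concrete, here is a minimal sketch of `permute` as a pure metadata transform. The `View` type and its methods are hypothetical illustrations, not tenferro's actual tensor layer. Under this reading the tensor layer rewrites shape and strides, and a custom backend never receives a `Transpose` op.

```rust
/// Hypothetical strided view: describes how to read a flat buffer.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct View {
    pub shape: Vec<usize>,
    pub strides: Vec<usize>,
    pub offset: usize,
}

impl View {
    /// Row-major view over a freshly allocated buffer.
    pub fn contiguous(shape: &[usize]) -> Self {
        let mut strides = vec![1; shape.len()];
        for i in (0..shape.len().saturating_sub(1)).rev() {
            strides[i] = strides[i + 1] * shape[i + 1];
        }
        View { shape: shape.to_vec(), strides, offset: 0 }
    }

    /// Reorders axes by permuting shape and strides; no data is moved.
    pub fn permute(&self, perm: &[usize]) -> Self {
        View {
            shape: perm.iter().map(|&p| self.shape[p]).collect(),
            strides: perm.iter().map(|&p| self.strides[p]).collect(),
            offset: self.offset,
        }
    }

    /// Flat buffer position of a multi-index.
    pub fn index(&self, idx: &[usize]) -> usize {
        self.offset + idx.iter().zip(&self.strides).map(|(i, s)| i * s).sum::<usize>()
    }
}
```

If the second reading wins instead, every custom backend would have to provide an equivalent of this logic (or a materializing copy) itself, which is the difference in obligation described above.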

### 2. Compile-cache identity rules are still inconsistent

`computegraph-design.md` says compiled programs are cached using `GlobalValKey`-based structure, which includes `InputKey`.
`tensor-api-pseudocode.md` says the cache key is based on normalized graph topology and should ignore concrete `InputKey` / `DiffPassId` values so repeated `differentiate` calls hit the same cache entry.

This needs an explicit decision:

- either tenferro owns a second normalized compile-cache layer above computegraph
- or computegraph itself changes its cache contract

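Without that decision, higher-order AD cache behavior is underspecified: it is unclear whether repeated `differentiate` calls reuse a compiled program or recompile it.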
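As a rough sketch of what a topology-normalized key could look like, the code below renumbers inputs by first appearance so that the concrete key values drop out. `InputKey` here is a local stand-in, and `Node`, `NormNode`, and `normalize` are hypothetical names; none of this is the computegraph API.

```rust
use std::collections::HashMap;

/// Local stand-in for a concrete input identity.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct InputKey(pub u64);

/// Tiny hypothetical graph: operands refer to earlier nodes by position.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub enum Node {
    Input(InputKey),
    Add(usize, usize),
    Mul(usize, usize),
}

/// Same graph with inputs renamed to slots 0, 1, 2, ... in order of first use.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
pub enum NormNode {
    Input(usize),
    Add(usize, usize),
    Mul(usize, usize),
}

/// Produces a cache key that depends on topology and input sharing,
/// but not on the concrete `InputKey` values.
pub fn normalize(graph: &[Node]) -> Vec<NormNode> {
    let mut slots: HashMap<InputKey, usize> = HashMap::new();
    graph
        .iter()
        .map(|node| match *node {
            Node::Input(key) => {
                let next = slots.len();
                NormNode::Input(*slots.entry(key).or_insert(next))
            }
            Node::Add(a, b) => NormNode::Add(a, b),
            Node::Mul(a, b) => NormNode::Mul(a, b),
        })
        .collect()
}
```

Two graphs that differ only in their key values normalize to the same vector, while a graph that feeds the same input twice stays distinct from one with two separate inputs. Whether such a layer lives in tenferro or in computegraph is exactly the open decision.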

### 3. `StdTensorOp` is not fully a single source of truth yet

There are still mismatches between the vocabulary and the lowering tables. Examples:

- direct `StdTensorOp::Add` / `Mul` / `DotGeneral` style descriptions vs `StdTensorOp::Semiring(SemiringOpKind)`
- `Cholesky` described as custom-call in one place and direct `stablehlo.cholesky` in another
- `LuFullPivot` / `StdTensorOp::CustomCall` shown in later sections even though they are not present in the main enum definition

The enum definition, lowering table, and extensibility story should be unified.
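One possible way to keep the enum and the lowering table in sync is to make the table an exhaustive `match`, so that adding a variant without a lowering fails to compile. The reduced `Op` and `SemiringKind` enums below are hypothetical and much smaller than the real vocabulary, and the targets chosen (for example Cholesky as a direct StableHLO op) are illustrative only; they do not settle the mismatches listed above.

```rust
/// Hypothetical, reduced semiring op kinds.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum SemiringKind {
    Add,
    Mul,
    DotGeneral,
}

/// Hypothetical, reduced op vocabulary with an explicit extension point.
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum Op {
    Semiring(SemiringKind),
    Cholesky,
    CustomCall { target: String },
}

/// The lowering table as an exhaustive match: no wildcard arm,
/// so a new variant cannot be added without a lowering entry.
pub fn lowering_target(op: &Op) -> String {
    match op {
        Op::Semiring(SemiringKind::Add) => "stablehlo.add".to_string(),
        Op::Semiring(SemiringKind::Mul) => "stablehlo.multiply".to_string(),
        Op::Semiring(SemiringKind::DotGeneral) => "stablehlo.dot_general".to_string(),
        Op::Cholesky => "stablehlo.cholesky".to_string(),
        Op::CustomCall { target } => format!("stablehlo.custom_call @{}", target),
    }
}
```

In this shape an op such as `LuFullPivot` would have to appear either as its own variant or as a `CustomCall` target, rather than only in a later section of the docs.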

## Current idea

A cleaner direction is to separate Tenferro primitives from the traits that execute them.

### Proposed layering

1. Primitive descriptors
   - Keep a primitive vocabulary crate/module that only describes operations and their planning/execution descriptors.
   - Examples: semiring core descriptors, scalar descriptors, linalg descriptors, transfer descriptors.

2. Executor traits
   - Define small family-specific traits that know how to `plan/execute` those descriptors.
   - Backends implement these traits, not a single giant monolithic backend trait.

3. General engines
   - Build generic interpreters / engines that consume primitive descriptors and dispatch through the executor traits.
   - Einsum becomes one such engine. Standard tensor execution could become another.

4. Backend implementations
   - CPU / CUDA / custom algebra backends only implement the trait families they actually support.
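The four layers could be sketched roughly as follows. All names (`MatMulDesc`, `ScaleDesc`, `ContractExec`, `ScalarExec`, `matmul_chain`, `NaiveCpu`) are hypothetical and only illustrate the split; the real descriptors would carry far more planning information.

```rust
// Layer 1: primitive descriptors are plain data describing an operation.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct MatMulDesc {
    pub m: usize,
    pub k: usize,
    pub n: usize,
}

#[derive(Clone, Copy, Debug, PartialEq)]
pub struct ScaleDesc {
    pub factor: f64,
}

// Layer 2: small family-specific executor traits.
pub trait ContractExec {
    fn matmul(&self, desc: &MatMulDesc, a: &[f64], b: &[f64]) -> Vec<f64>;
}

pub trait ScalarExec {
    fn scale(&self, desc: &ScaleDesc, x: &[f64]) -> Vec<f64>;
}

// Layer 3: a generic engine that only needs the contraction family.
pub fn matmul_chain<B: ContractExec>(
    backend: &B,
    first: Vec<f64>,
    rest: &[(MatMulDesc, Vec<f64>)],
) -> Vec<f64> {
    rest.iter()
        .fold(first, |acc, (desc, rhs)| backend.matmul(desc, &acc, rhs))
}

// Layer 4: a backend that implements only the family it supports.
pub struct NaiveCpu;

impl ContractExec for NaiveCpu {
    fn matmul(&self, d: &MatMulDesc, a: &[f64], b: &[f64]) -> Vec<f64> {
        // Row-major (m x k) times (k x n).
        let mut out = vec![0.0; d.m * d.n];
        for i in 0..d.m {
            for p in 0..d.k {
                for j in 0..d.n {
                    out[i * d.n + j] += a[i * d.k + p] * b[p * d.n + j];
                }
            }
        }
        out
    }
}
```

`NaiveCpu` never implements `ScalarExec`, yet it can still drive `matmul_chain`, because the engine's bound names only the trait family it actually uses.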

## Why this looks promising

This is close to what `origin/main` already does for semiring execution:

- `Semiring` defines the algebra
- `TensorSemiringCore<Alg>` is the required semiring execution contract
- `TensorSemiringFastPath<Alg>` is the optional optimization contract
- `EinsumBackend<Alg>` is just the composition of those traits
- `tenferro-einsum` is the general engine that interprets einsum plans through those traits

That pattern is attractive because backend authors implement capabilities, while high-level algorithms live in reusable engines.
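The shape of that composition can be sketched as below. The trait names and signatures here (`SemiringCore`, `SemiringFastPath`, `ContractionBackend`, `gemm`, `try_dot`) are simplified assumptions that mirror the roles listed above; they are not the actual `origin/main` definitions.

```rust
/// The algebra: element type plus the two operations and their identities.
pub trait Semiring {
    type Elem: Copy;
    fn zero() -> Self::Elem;
    fn one() -> Self::Elem;
    fn add(a: Self::Elem, b: Self::Elem) -> Self::Elem;
    fn mul(a: Self::Elem, b: Self::Elem) -> Self::Elem;
}

/// Example custom algebra: max-plus (tropical) semiring over f64.
pub struct MaxPlus;

impl Semiring for MaxPlus {
    type Elem = f64;
    fn zero() -> f64 { f64::NEG_INFINITY }
    fn one() -> f64 { 0.0 }
    fn add(a: f64, b: f64) -> f64 { a.max(b) }
    fn mul(a: f64, b: f64) -> f64 { a + b }
}

/// Required contract (hypothetical signature).
pub trait SemiringCore<Alg: Semiring> {
    fn gemm(&self, m: usize, k: usize, n: usize, a: &[Alg::Elem], b: &[Alg::Elem]) -> Vec<Alg::Elem>;
}

/// Optional contract: the default declines, so the engine falls back to the core.
pub trait SemiringFastPath<Alg: Semiring> {
    fn try_dot(&self, _a: &[Alg::Elem], _b: &[Alg::Elem]) -> Option<Alg::Elem> { None }
}

/// The composed backend is just both traits together.
pub trait ContractionBackend<Alg: Semiring>: SemiringCore<Alg> + SemiringFastPath<Alg> {}
impl<Alg: Semiring, T: SemiringCore<Alg> + SemiringFastPath<Alg>> ContractionBackend<Alg> for T {}

/// Reference backend that works for any algebra.
pub struct Reference;

impl<Alg: Semiring> SemiringCore<Alg> for Reference {
    fn gemm(&self, m: usize, k: usize, n: usize, a: &[Alg::Elem], b: &[Alg::Elem]) -> Vec<Alg::Elem> {
        let mut out = vec![Alg::zero(); m * n];
        for i in 0..m {
            for p in 0..k {
                for j in 0..n {
                    out[i * n + j] = Alg::add(out[i * n + j], Alg::mul(a[i * k + p], b[p * n + j]));
                }
            }
        }
        out
    }
}

impl<Alg: Semiring> SemiringFastPath<Alg> for Reference {}

/// Engine-side algorithm: try the fast path, otherwise use the required core.
pub fn inner_product<Alg: Semiring, B: ContractionBackend<Alg>>(
    backend: &B,
    a: &[Alg::Elem],
    b: &[Alg::Elem],
) -> Alg::Elem {
    if let Some(v) = backend.try_dot(a, b) {
        return v;
    }
    backend.gemm(1, a.len(), 1, a, b)[0]
}
```

A backend author here writes only `gemm` (and optionally `try_dot`), while `inner_product` lives once in the engine and works for every algebra and backend that satisfy the bounds.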

## Design questions to resolve next

- Should design-v2 explicitly adopt the `origin/main` descriptor + `plan/execute` pattern as the reference model?
- For custom semiring backends, what is the true required core: only contraction/reduction primitives, or also tensor-structural view-like ops?
- Should diagonal / trace behavior stay in the semiring core, or be normalized away earlier?
- Should the standard backend story also move from `Backend<Op>` toward capability-family traits plus one or more general engines?

## Suggested resolution order

1. Fix the custom semiring minimum contract
2. Fix compile-cache identity semantics
3. Make `StdTensorOp` + lowering tables a single source of truth
4. Then decide how far the primitive-descriptor / executor-trait / engine split should become the v2 architectural pattern

