Worker performance: MVO update conversion + advanced caching strategies #436
Summary
Follow-on performance work from #338 (Worker database performance optimisation, now closed). All original phases (1–5) plus Phase 6 MVO creates are complete. This issue tracks the remaining items: MVO update conversion to raw SQL and the three Future Considerations (A, B, C) from the plan document.
MVO Update Conversion to Raw SQL
Plan document: `docs/plans/doing/MVO_COPY_BINARY_PERSISTENCE.md`
Context: `CreateMetaverseObjectsAsync` was converted to COPY binary in #338 Phase 6. `UpdateMetaverseObjectsAsync` remains on EF Core with per-entity `Entry().State` management.
Why updates are fundamentally harder than creates
Creates are a single operation (INSERT rows). Updates involve four distinct operations in one flush:
| Operation | Raw SQL approach | Complexity |
|---|---|---|
| UPDATE parent MVO rows (scalar changes) | `UPDATE ... FROM (VALUES ...)` batch | Low |
| INSERT new attribute values | COPY binary (proven pattern) | Low |
| UPDATE existing attribute values | `UPDATE ... FROM (VALUES ...)` — 13 nullable columns | Moderate |
| DELETE removed attribute values | `DELETE ... WHERE "Id" IN (...)` | Low |
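The batch patterns in the table can be sketched in raw PostgreSQL. This is an illustration only — table and column names (`"MetaverseObjects"`, `"MetaverseObjectAttributeValues"`, `"Status"`, `"LastUpdated"`) are assumptions, not the actual schema:

```sql
-- Sketch: batch-update parent MVO rows from a VALUES list (one round trip).
UPDATE "MetaverseObjects" AS mvo
SET    "Status"      = v."Status",
       "LastUpdated" = v."LastUpdated"
FROM (VALUES
    ('00000000-0000-0000-0000-000000000001'::uuid, 1, now()),
    ('00000000-0000-0000-0000-000000000002'::uuid, 2, now())
) AS v("Id", "Status", "LastUpdated")
WHERE mvo."Id" = v."Id";

-- Sketch: batch-delete removed attribute values by primary key.
DELETE FROM "MetaverseObjectAttributeValues"
WHERE "Id" IN ('00000000-0000-0000-0000-000000000003'::uuid,
               '00000000-0000-0000-0000-000000000004'::uuid);
```

The AV UPDATE case follows the same `UPDATE ... FROM (VALUES ...)` shape but must enumerate all 13 nullable value columns with explicit type casts, which is where the "Moderate" complexity rating comes from.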
The crux is knowing which attribute values (AVs) are new vs modified vs deleted without EF's change tracker. Currently the caller applies `PendingAttributeValueAdditions`/`Removals` to `mvo.AttributeValues` before calling update, then EF uses `IsKeySet` to classify. A raw SQL conversion would need either the caller to preserve that classification or the repository to diff against the database — both options are invasive.
Additionally, MVOs arrive in mixed EF states (tracked vs detached, `AutoDetectChangesEnabled` on vs off) depending on whether it's a per-page flush or cross-page reference resolution. The current `UpdateDetachedSafe` + per-entity state management was the result of 11 debugging attempts (see `docs/notes/done/CROSS_PAGE_REFERENCE_IDENTITY_CONFLICT.md`).
Recommendation: Wait for profiling data from real-world delta syncs. Creates are the high-volume operation during initial sync; updates are smaller per-page and EF overhead is proportionally less significant.
Future Consideration A: Upgrade CSO Lookup Index to Full Entity Cache
Upgrade the Phase 1 `IMemoryCache` from storing a lookup index (external ID → CSO GUID) to storing full CSO entity graphs (`AsNoTracking`). Eliminates all database queries on cache hit — lookups become pure in-memory O(1).
Trade-offs: ~100x more memory per entry (~10 KB vs ~100 bytes), EF entity lifecycle complexity (attach/detach for modification), stricter cache invalidation requirements.
When to consider: If per-object PK loads on cache hit (~1–2ms each) cause unacceptable import times at scale (50k+ objects per connected system).
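A minimal sketch of what the Consideration A upgrade could look like. `IMemoryCache` is the real `Microsoft.Extensions.Caching.Memory` API; the class, repository, and method names (`CsoEntityCache`, `ICsoRepository`, `GetByIdNoTrackingAsync`) are illustrative assumptions, not the actual codebase:

```csharp
using Microsoft.Extensions.Caching.Memory;

// Sketch only: full-entity cache in front of the per-object PK load.
public sealed class CsoEntityCache
{
    private readonly IMemoryCache _cache;
    private readonly ICsoRepository _repository; // hypothetical repository abstraction

    public CsoEntityCache(IMemoryCache cache, ICsoRepository repository)
    {
        _cache = cache;
        _repository = repository;
    }

    public Task<ConnectedSystemObject?> HydrateAsync(Guid csoId) =>
        _cache.GetOrCreateAsync(csoId, async entry =>
        {
            entry.SetSize(1).SetSlidingExpiration(TimeSpan.FromMinutes(30));
            // AsNoTracking load: a cached graph must never be attached to a
            // DbContext directly — clone or re-attach explicitly on modification.
            return await _repository.GetByIdNoTrackingAsync(csoId);
        });

    // Strict invalidation: remove after any write that changes the entity graph.
    public void Invalidate(Guid csoId) => _cache.Remove(csoId);
}
```

The `SetSize` call matters at the stated ~10 KB/entry cost: a full-entity cache should run with a bounded `MemoryCacheOptions.SizeLimit` so 50k+ objects cannot exhaust the host.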
Future Consideration B: MVO Join-Attribute Lookup Cache
Extend `IMemoryCache` to cache MVO join-attribute values — the attributes referenced in object matching rules. Sync matching (`FindMetaverseObjectUsingMatchingRuleAsync`) becomes a pure in-memory lookup + PK load, eliminating N+1 matching queries entirely.
Trade-offs: MVO attribute values can change during sync (unlike CSO external IDs), requiring invalidation on updates. Cache warming must resolve which attributes are join-rule attributes from sync rule configuration.
When to consider: If Phase 2's lightweight query optimisation doesn't sufficiently reduce sync matching times at scale (50k+ MVOs).
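The shape of Consideration B can be sketched as a keyed index. The key structure and invalidation points are assumptions for illustration; only the constraint (MVO attribute values change during sync) comes from the text above:

```csharp
using System.Collections.Concurrent;

// Sketch: join-attribute lookup cache. Keyed on (attribute, value); the
// value is the owning MVO's id, which the caller then PK-loads.
public readonly record struct JoinKey(Guid AttributeId, string Value);

public sealed class MvoJoinAttributeCache
{
    private readonly ConcurrentDictionary<JoinKey, Guid> _index = new();

    // Warm-up: only attributes referenced by matching rules, resolved
    // from sync rule configuration.
    public void Add(Guid attributeId, string value, Guid mvoId) =>
        _index[new JoinKey(attributeId, value)] = mvoId;

    // Matching becomes a pure in-memory lookup + PK load.
    public bool TryMatch(Guid attributeId, string value, out Guid mvoId) =>
        _index.TryGetValue(new JoinKey(attributeId, value), out mvoId);

    // MVO attribute values can change during sync: invalidate on update.
    public void Invalidate(Guid attributeId, string oldValue) =>
        _index.TryRemove(new JoinKey(attributeId, oldValue), out _);
}
```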
Future Consideration C: Persistent In-Memory Model for the Worker
Architectural shift: maintain a persistent in-memory cache of everything needed for run profile execution (CSOs, MVOs, attribute values, sync rules, object types). The database becomes a persistence layer rather than the primary data source during execution.
Architectural options to explore:
- Service-lifetime cache with write-through updates
- Per-connected-system lazy-loaded segments
- Event-driven cache invalidation
- Hybrid: lightweight indexes always in memory, full entities on demand
Operator mode switch: Consider exposing as an admin-configurable processing mode:
- Direct mode (default) — query database on demand, lower memory, suitable for modest hosts
- In-memory mode — persistent cache; requires adequate memory but offers substantially higher throughput
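If the mode switch is exposed through configuration, it might look like the fragment below. Every key name here is a hypothetical suggestion for the design phase, not an existing setting:

```json
{
  "Worker": {
    "ProcessingMode": "Direct",
    "InMemoryModel": {
      "MaxCacheSizeMb": 4096,
      "LazyLoadPerConnectedSystem": true
    }
  }
}
```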
When to consider: If surgical optimisations don't bring sync cycle times to acceptable levels, or if we decide to push further. Requires a dedicated design phase.
Architectural notes: Lookup vs Hydration separation (#440 → here)
Added 2026-04-01 during #440 analysis. These notes capture design decisions that affect how #440's incremental work transitions into the caching strategies above.
The seam that matters
#440 introduces a pre-fetch dictionary for CSO external ID lookups during import. The key architectural insight is that lookup and hydration are fundamentally different operations:
| Operation | Data size | Lifetime | Cache-friendly? |
|---|---|---|---|
| Lookup (external ID → CSO ID) | ~200 bytes/entry | Stable across import | Excellent — small, rarely changes |
| Hydration (CSO ID → full entity graph) | ~20-40 KB/entry | Needed briefly for diffing | Poor for persistent cache at scale |
#440 should implement these as separate method boundaries (not interfaces — just clean method signatures):
```csharp
Guid? LookupCsoByExternalId(string externalId);   // Today: dictionary. Future: cache.
ConnectedSystemObject HydrateCso(Guid csoId);     // Today: batch DB query. Future: cache + DB fallback.
```
This means Future Consideration A (full entity cache) slots in by swapping the HydrateCso implementation, not rewriting the import loop. Future Consideration C (persistent in-memory model) swaps both.
What #440's dictionary becomes in each future
| Future | Dictionary role | Hydration role |
|---|---|---|
| Current (#440) | In-memory dictionary, loaded once per import | Batch DB query per page of ~500 |
| A: Full entity cache | Cache lookup (dictionary as warm-up/fallback) | Cache-first, DB fallback for misses |
| B: MVO join cache | Unchanged — CSO lookup is orthogonal to MVO matching | Unchanged |
| C: Persistent model | Subsumed by the persistent model's index layer | Subsumed by the persistent model's entity layer |
Cache invalidation constraint to track
The pre-fetch dictionary must be updated when new CSOs are created mid-import (same-batch creates that are then referenced by later objects). This is trivial for an in-memory dictionary (just add the new entry). However, persistent caches (A, C) face a harder problem:
- Cross-process visibility: If multiple workers share a cache (Redis), a CSO created by worker A must be visible to worker B's lookup
- Write-through timing: The CSO must be persisted to DB and cache before another object can match against it
- The existing `seenExternalIds` pattern (SyncImportTaskProcessor.cs line 978) handles same-batch duplicates and provides a model for how mid-import cache updates should work
This is not a blocker for #440 (in-memory dictionary, single worker) but must be addressed in the design phase for Future Consideration C.
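The write-through ordering constraint can be sketched as follows. The method and field names (`CreateCsoAsync`, `_repository`, `_sharedCache`, `_externalIdIndex`) and the cache key format are illustrative assumptions; only the ordering requirement comes from the text above:

```csharp
// Sketch: mid-import write-through for a shared cache (A/C futures).
// Persist first, then publish, so any worker that sees the cache entry
// can always load the corresponding row from the database.
public async Task CreateCsoAsync(ConnectedSystemObject cso)
{
    // 1. Persist to the database (COPY binary / INSERT).
    await _repository.CreateAsync(cso);

    // 2. Only then make the mapping visible to other workers' lookups.
    await _sharedCache.SetAsync(
        $"cso:{cso.ConnectedSystemId}:{cso.ExternalId}",
        cso.Id);

    // For #440's single-worker in-memory dictionary, step 2 collapses to:
    // _externalIdIndex[cso.ExternalId] = cso.Id;
}
```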
Summary: incremental steps are additive
The work in #440 is not throwaway:
- The dictionary becomes the cache warm-up / fallback path
- The batch hydration pattern becomes the cache-miss handler
- The lookup/hydration separation provides the seam for future cache implementations
- Index improvements (#440's Option C, distinct from Future Consideration C) benefit all paths regardless of caching strategy
Prioritisation
| Item | Type | Effort | Impact | Dependency |
|---|---|---|---|---|
| MVO update conversion | Surgical | High | Moderate — dominates delta sync flush | Profiling data from real-world delta syncs |
| A: Full entity cache | Incremental | Moderate | Medium — eliminates PK loads on cache hit | Profiling data |
| B: MVO join-attribute cache | Incremental | Moderate | Medium-High — eliminates sync matching queries | Profiling data |
| C: In-memory Worker model | Architectural | High | Very High — eliminates virtually all read latency | Design phase |
References
- Parent plan: `docs/plans/done/WORKER_DATABASE_PERFORMANCE_OPTIMISATION.md`
- MVO creates plan: `docs/plans/doing/MVO_COPY_BINARY_PERSISTENCE.md`
- Completed: Worker database performance optimisation - surgical raw SQL for hot paths #338 (Phases 1–6 creates, closed)
- Related: Worker redesign: Extract pure domain engine + optimised data access (Option A) #394 (intra-phase parallelism, closed — introduced ParallelBatchWriter/COPY binary patterns)
- Related: perf: batch pre-fetch CSO external ID mappings during full import #440 (FindMatchingCso batch pre-fetch — provides the lookup/hydration seam described above)