
Worker performance: MVO update conversion + advanced caching strategies #436

@JayVDZ


Summary

Follow-on performance work from #338 (Worker database performance optimisation, now closed). All original phases (1–5) plus Phase 6 MVO creates are complete. This issue tracks the remaining items: MVO update conversion to raw SQL and the three Future Considerations (A, B, C) from the plan document.

MVO Update Conversion to Raw SQL

Plan document: `docs/plans/doing/MVO_COPY_BINARY_PERSISTENCE.md`

Context: `CreateMetaverseObjectsAsync` was converted to COPY binary in #338 Phase 6. `UpdateMetaverseObjectsAsync` remains on EF Core with per-entity `Entry().State` management.

Why updates are fundamentally harder than creates

Creates are a single operation (INSERT rows). Updates involve four distinct operations in one flush:

| Operation | Raw SQL approach | Complexity |
| --- | --- | --- |
| UPDATE parent MVO rows (scalar changes) | `UPDATE ... FROM (VALUES ...)` batch | Low |
| INSERT new attribute values | COPY binary (proven pattern) | Low |
| UPDATE existing attribute values | `UPDATE ... FROM (VALUES ...)`, 13 nullable columns | Moderate |
| DELETE removed attribute values | `DELETE ... WHERE "Id" IN (...)` | Low |
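The `UPDATE ... FROM (VALUES ...)` batch pattern for the parent MVO rows could be sketched as follows. This is an illustrative Python helper that builds the SQL text and a flat parameter list; the table name, columns, and psycopg-style placeholders are assumptions, not the actual schema:

```python
def build_batch_update_sql(rows):
    """Build one UPDATE ... FROM (VALUES ...) statement for a batch of
    scalar changes. `rows` is a list of (id, status, modified) tuples.
    Column names and casts are illustrative only."""
    values = ", ".join(
        "(%s::uuid, %s::int, %s::timestamptz)" for _ in rows
    )
    sql = (
        'UPDATE "MetaverseObjects" AS m '
        'SET "Status" = v.status, "LastModified" = v.modified '
        f"FROM (VALUES {values}) AS v(id, status, modified) "
        'WHERE m."Id" = v.id'
    )
    # Flatten the tuples into positional parameters in VALUES order.
    params = [p for row in rows for p in row]
    return sql, params
```

One round trip updates the whole batch, which is why the complexity is rated Low: the statement shape is fixed and only the VALUES list grows with batch size.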

The crux is knowing which AVs are new vs modified vs deleted without EF's change tracker. Currently the caller applies `PendingAttributeValueAdditions`/`Removals` to `mvo.AttributeValues` before calling update, then EF uses `IsKeySet` to classify. A raw SQL conversion would need the caller to preserve that classification or the repository to diff against the database — both are invasive.
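One way to preserve that classification without EF's change tracker is to diff the caller's final attribute-value set against the IDs already persisted. A minimal sketch, assuming the caller has already applied `PendingAttributeValueAdditions`/`Removals` and that new AVs carry freshly generated IDs (names simplified):

```python
def classify_attribute_values(final_avs, persisted_ids):
    """Split attribute values into inserts, updates, deletes.

    final_avs: mapping of AV id -> value after pending additions/removals
               have been applied (new AVs have freshly generated ids).
    persisted_ids: set of AV ids currently stored in the database."""
    inserts = {i: v for i, v in final_avs.items() if i not in persisted_ids}
    # Note: this treats every surviving persisted AV as an update; a real
    # implementation could skip unchanged rows by comparing values.
    updates = {i: v for i, v in final_avs.items() if i in persisted_ids}
    deletes = persisted_ids - final_avs.keys()
    return inserts, updates, deletes
```

The cost noted above is that `persisted_ids` must come from somewhere: either the caller carries it through (invasive) or the repository issues an extra query to diff against the database.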

Additionally, MVOs arrive in mixed EF states (tracked vs detached, `AutoDetectChangesEnabled` on vs off) depending on whether it's a per-page flush or cross-page reference resolution. The current `UpdateDetachedSafe` + per-entity state management was the result of 11 debugging attempts (see `docs/notes/done/CROSS_PAGE_REFERENCE_IDENTITY_CONFLICT.md`).

Recommendation: Wait for profiling data from real-world delta syncs. Creates are the high-volume operation during initial sync; updates are smaller per-page and EF overhead is proportionally less significant.

Future Consideration A: Upgrade CSO Lookup Index to Full Entity Cache

Upgrade the Phase 1 `IMemoryCache` from storing a lookup index (external ID → CSO GUID) to storing full CSO entity graphs (`AsNoTracking`). Eliminates all database queries on cache hit — lookups become pure in-memory O(1).

Trade-offs: ~100x more memory per entry (~10 KB vs ~100 bytes), EF entity lifecycle complexity (attach/detach for modification), stricter cache invalidation requirements.

When to consider: If per-object PK loads on cache hit (~1–2ms each) cause unacceptable import times at scale (50k+ objects per connected system).
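The cache-hit path for Consideration A might look like this. An illustrative Python sketch (the real implementation would use `IMemoryCache` holding `AsNoTracking` entity graphs; names here are hypothetical):

```python
class CsoCache:
    """Two-tier lookup: an external-id index plus an optional full-entity
    store. A full-entity hit needs no database query at all; an index-only
    hit falls back to a PK load (the ~1-2 ms cost cited above)."""

    def __init__(self, load_by_pk):
        self.index = {}       # external id -> CSO id (~100 bytes/entry)
        self.entities = {}    # CSO id -> full entity graph (~10 KB/entry)
        self.load_by_pk = load_by_pk

    def get(self, external_id):
        cso_id = self.index.get(external_id)
        if cso_id is None:
            return None               # unknown external id
        entity = self.entities.get(cso_id)
        if entity is None:
            entity = self.load_by_pk(cso_id)  # DB fallback on entity miss
            self.entities[cso_id] = entity
        return entity
```

The ~100x memory trade-off is visible in the two dictionaries: the index alone is cheap and stable, while `entities` is what Consideration A adds.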

Future Consideration B: MVO Join-Attribute Lookup Cache

Extend `IMemoryCache` to cache MVO join-attribute values — the attributes referenced in object matching rules. Sync matching (`FindMetaverseObjectUsingMatchingRuleAsync`) becomes a pure in-memory lookup + PK load, eliminating N+1 matching queries entirely.

Trade-offs: MVO attribute values can change during sync (unlike CSO external IDs), requiring invalidation on updates. Cache warming must resolve which attributes are join-rule attributes from sync rule configuration.

When to consider: If Phase 2's lightweight query optimisation doesn't sufficiently reduce sync matching times at scale (50k+ MVOs).
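The invalidation requirement is the distinguishing constraint here. A minimal sketch of a join-attribute cache that re-indexes on MVO updates (illustrative names, not the actual `FindMetaverseObjectUsingMatchingRuleAsync` machinery):

```python
class JoinAttributeCache:
    """Maps (join attribute, value) -> MVO id for sync matching.
    Unlike CSO external ids, these values can change mid-sync, so an
    update must evict the stale key before indexing the new value."""

    def __init__(self):
        self.by_value = {}   # (attribute name, value) -> MVO id

    def match(self, attribute, value):
        return self.by_value.get((attribute, value))

    def on_mvo_updated(self, mvo_id, attribute, old_value, new_value):
        # Evict only if the stale entry still points at this MVO.
        if self.by_value.get((attribute, old_value)) == mvo_id:
            del self.by_value[(attribute, old_value)]
        self.by_value[(attribute, new_value)] = mvo_id
```

Cache warming would populate `by_value` only for attributes that sync rule configuration marks as join-rule attributes, per the constraint above.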

Future Consideration C: Persistent In-Memory Model for the Worker

Architectural shift: maintain a persistent in-memory cache of everything needed for run profile execution (CSOs, MVOs, attribute values, sync rules, object types). The database becomes a persistence layer rather than the primary data source during execution.

Architectural options to explore:

  • Service-lifetime cache with write-through updates
  • Per-connected-system lazy-loaded segments
  • Event-driven cache invalidation
  • Hybrid: lightweight indexes always in memory, full entities on demand

Operator mode switch: consider exposing this as an admin-configurable processing mode:

  • Direct mode (default) — query database on demand, lower memory, suitable for modest hosts
  • In-memory mode — persistent cache, requires adequate memory, massively increased throughput
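The mode switch could be as simple as selecting the data-access implementation at startup. A hypothetical sketch (in the real Worker this would live in DI registration, and all names here are invented for illustration):

```python
from enum import Enum

class ProcessingMode(Enum):
    DIRECT = "direct"        # query database on demand (default)
    IN_MEMORY = "in_memory"  # persistent cache, higher memory footprint

def select_repository(mode, db_repo, cached_repo_factory):
    """Pick the repository implementation from the operator setting.
    In-memory mode wraps the DB repository so it becomes the
    write-through persistence layer rather than the primary source."""
    if mode is ProcessingMode.IN_MEMORY:
        return cached_repo_factory(db_repo)
    return db_repo
```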

When to consider: If surgical optimisations don't bring sync cycle times to acceptable levels, or if we decide to push further. Requires a dedicated design phase.

Architectural notes: Lookup vs Hydration separation (#440 → here)

Added 2026-04-01 during #440 analysis. These notes capture design decisions that affect how #440's incremental work transitions into the caching strategies above.

The seam that matters

#440 introduces a pre-fetch dictionary for CSO external ID lookups during import. The key architectural insight is that lookup and hydration are fundamentally different operations:

| Operation | Data size | Lifetime | Cache-friendly? |
| --- | --- | --- | --- |
| Lookup (external ID → CSO ID) | ~200 bytes/entry | Stable across import | Excellent: small, rarely changes |
| Hydration (CSO ID → full entity graph) | ~20–40 KB/entry | Needed briefly for diffing | Poor for persistent cache at scale |

#440 should implement these as separate method boundaries (not interfaces — just clean method signatures):

```
LookupCsoByExternalId(externalId) → Guid? // Today: dictionary. Future: cache.
HydrateCso(csoId) → ConnectedSystemObject // Today: batch DB query. Future: cache + DB fallback.
```

This means Future Consideration A (full entity cache) slots in by swapping the HydrateCso implementation, not rewriting the import loop. Future Consideration C (persistent in-memory model) swaps both.
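The #440-era implementations behind those two method boundaries could take a shape like this (illustrative Python mirroring the pseudocode signatures above; the batch loader stands in for the real per-page DB query):

```python
class CsoImportStore:
    """Lookup and hydration as separate seams, per the table above.
    Today: lookup hits a pre-fetched dictionary, hydration batch-loads
    by PK. Consideration A swaps only hydrate_csos; Consideration C
    would subsume both."""

    def __init__(self, prefetched_index, batch_load):
        self.index = prefetched_index   # external id -> CSO id
        self.batch_load = batch_load    # [ids] -> {id: entity}, one round trip

    def lookup_cso_by_external_id(self, external_id):
        return self.index.get(external_id)   # pure in-memory, no DB

    def hydrate_csos(self, cso_ids):
        # Future: cache-first with DB fallback for misses; today a
        # single batch query per page of ~500 objects.
        return self.batch_load(cso_ids)
```

Because the import loop only ever calls these two methods, swapping a cache in behind either seam leaves the loop itself untouched.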

What #440's dictionary becomes in each future

| Future | Dictionary role | Hydration role |
| --- | --- | --- |
| Current (#440) | In-memory dictionary, loaded once per import | Batch DB query per page of ~500 |
| A: Full entity cache | Cache lookup (dictionary as warm-up/fallback) | Cache-first, DB fallback for misses |
| B: MVO join cache | Unchanged: CSO lookup is orthogonal to MVO matching | Unchanged |
| C: Persistent model | Subsumed by the persistent model's index layer | Subsumed by the persistent model's entity layer |

Cache invalidation constraint to track

The pre-fetch dictionary must be updated when new CSOs are created mid-import (same-batch creates that are then referenced by later objects). This is trivial for an in-memory dictionary (just add the new entry). However, persistent caches (A, C) face a harder problem:

  • Cross-process visibility: If multiple workers share a cache (Redis), a CSO created by worker A must be visible to worker B's lookup
  • Write-through timing: The CSO must be persisted to DB and cache before another object can match against it
  • The existing `seenExternalIds` pattern (SyncImportTaskProcessor.cs line 978) handles same-batch duplicates and provides a model for how mid-import cache updates should work

This is not a blocker for #440 (in-memory dictionary, single worker) but must be addressed in the design phase for Future Consideration C.
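For the #440 in-memory case, the mid-import update really is trivial. A sketch of the intent (names are illustrative; the actual `seenExternalIds` handling lives in SyncImportTaskProcessor.cs):

```python
def register_created_cso(index, seen_external_ids, external_id, cso_id):
    """Record a CSO created mid-import so that later objects in the same
    run can match against it. Duplicate external ids within a batch are
    rejected, mirroring the intent of the seenExternalIds guard."""
    if external_id in seen_external_ids:
        raise ValueError(f"duplicate external id in batch: {external_id}")
    seen_external_ids.add(external_id)
    index[external_id] = cso_id   # later lookups in this import now hit
```

The hard versions of this function are the persistent-cache ones: for A and C the same operation must also reach the shared cache, and only after the database write has committed.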

Summary: incremental steps are additive

The work in #440 is not throwaway:

  • The dictionary becomes the cache warm-up / fallback path
  • The batch hydration pattern becomes the cache-miss handler
  • The lookup/hydration separation provides the seam for future cache implementations
  • Index improvements (Option C) benefit all paths regardless of caching strategy

Prioritisation

| Item | Type | Effort | Impact | Dependency |
| --- | --- | --- | --- | --- |
| MVO update conversion | Surgical | High | Moderate: dominates delta sync flush | Profiling data from real-world delta syncs |
| A: Full entity cache | Incremental | Moderate | Medium: eliminates PK loads on cache hit | Profiling data |
| B: MVO join-attribute cache | Incremental | Moderate | Medium-High: eliminates sync matching queries | Profiling data |
| C: In-memory Worker model | Architectural | High | Very High: eliminates virtually all read latency | Design phase |

Metadata

Labels: enhancement (New feature or request), performance (Performance optimisation work)
