
perf: batch pre-fetch CSO external ID mappings during full import #440

@JayVDZ

Description

Summary

FindMatchingCso (called via TryAndFindMatchingConnectedSystemObjectAsync) consumed 85% of full import time — 5+ minutes out of ~6 minutes for a Medium template (1,000 users). It issued a separate EF Core database query for every imported object to check whether a matching CSO already existed.

At Medium scale: 17,000–21,000 individual DB round-trips at ~17ms each.

Resolution

✅ Phase 1: Lookup + Hydrate seam (complete)

Branch: `feature/findmatchingcso-batch-prefetch`
Commit: `f10583ba`

Implemented Option A (pre-fetch dictionary) with the Lookup/Hydrate separation:

  1. Pre-fetch dictionary — Single bulk query at import start loads all CSO external ID mappings into `Dictionary<string, Guid>` (~200 bytes/entry). Stored as `_csoExternalIdLookup` field on `SyncImportTaskProcessor`.

  2. Lookup phase (`LookupCsoByExternalId`) — O(1) dictionary lookup, no DB query. Builds the same cache key format as `GetAllCsoExternalIdMappingsAsync` (`cso:{csId}:{attrId}:{lowerValue}`).

  3. Hydrate phase (`HydrateCsoAsync`) — PK-based entity load by ID via new `GetConnectedSystemObjectsByIdsAsync`, with secondary external ID fallback for PendingProvisioning CSOs. Much cheaper than the previous LOWER() string-scan attribute-value query.

  4. Orchestrator (`TryAndFindMatchingConnectedSystemObjectAsync`) — thin method calling Lookup then Hydrate. Preserves the same external interface for the import loop.
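
The lookup side of the seam can be sketched roughly as follows. This is a minimal illustration only: the real code lives in `SyncImportTaskProcessor` and loads the dictionary from the database via EF Core; the method bodies, the `BuildKey` helper, and the constructor shape here are assumptions based on the description above, not the actual implementation.

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch of the pre-fetch dictionary + O(1) lookup phase.
public class CsoMatcherSketch
{
    // Phase 1: populated once at import start by a single bulk query.
    private readonly Dictionary<string, Guid> _csoExternalIdLookup;

    public CsoMatcherSketch(Dictionary<string, Guid> prefetched) =>
        _csoExternalIdLookup = prefetched;

    // Same cache key format as GetAllCsoExternalIdMappingsAsync:
    // cso:{csId}:{attrId}:{lowerValue}
    public static string BuildKey(int csId, int attrId, string value) =>
        $"cso:{csId}:{attrId}:{value.ToLowerInvariant()}";

    // Phase 2: dictionary lookup, no DB round-trip. Returns the CSO's
    // primary key if a mapping exists, otherwise null.
    public Guid? LookupCsoByExternalId(int csId, int attrId, string externalId) =>
        _csoExternalIdLookup.TryGetValue(BuildKey(csId, attrId, externalId), out var id)
            ? id
            : (Guid?)null;
}
```

Lower-casing the value inside the key is what replaces the old `LOWER()` string-scan query: case normalisation happens once at key-build time instead of per-row in the database.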

New repository methods:

  • `ISyncRepository.GetAllCsoExternalIdMappingsAsync` (previously only on `IConnectedSystemRepository`)
  • `ISyncRepository.GetConnectedSystemObjectsByIdsAsync` — batch loads full CSO entity graphs by ID with same `.Include()` chain as existing per-object queries

Tests: 10 new tests in `ImportBatchPrefetchTests.cs` covering empty CS, re-import with existing CSOs, mixed new/existing, mid-batch creation, and repository method validation. All 2,463 existing tests pass.

Integration test validation (S8 MediumLarge, OpenLDAP)

| Metric | Before | After | Improvement |
|---|---|---|---|
| Avg per call | 17.2ms | 2.2ms | ~8x faster |
| Steady-state per call | ~17ms | ~1.0ms | ~17x faster |
| Total FindMatchingCso | 5m 4s (17,703 calls) | 22.8s (10,556 calls) | ~13x faster |
| Pre-fetch load | N/A | 93–174ms (one-time) | Single query for 5,273 CSOs |

All 6 S8 tests passed (InitialSync, ForwardSync, DetectDrift, ReassertState, NewGroup, DeleteGroup).

⏭️ Phase 2: Page-level batch hydration (not proceeding)

What it would do: Collect ~500 CSO IDs per page from the dictionary, hydrate them all in a single `WHERE Id IN (...)` query instead of one-at-a-time.

Why we're not doing it: Phase 1 reduced FindMatchingCso from 5m 4s to 22.8s — it's no longer the bottleneck. Phase 2 would save an additional ~15–20s by collapsing 10k individual PK lookups into ~20 batch queries, but requires restructuring the single-pass import loop into a two-pass pattern (collect IDs → batch load → process), touching duplicate detection, RPEI creation, and error handling. The ROI is low relative to the structural complexity. Additionally, persistent caching (#436 Future Consideration C) will supersede this entirely — the Lookup/Hydrate seam we built provides the swap point for that future work.

⏭️ Phase 3: Index alignment (not proceeding)

What it would do: Composite index on `(ConnectedSystemId, AttributeId, LOWER(StringValue))` for the old attribute-value search query pattern.

Why we're not doing it: The old `GetConnectedSystemObjectByAttributeAsync` query is no longer on the hot path — Phase 1 replaced it with dictionary lookup + PK hydration. The only remaining call site is the secondary external ID fallback for PendingProvisioning CSOs, which is rare (narrow window between provisioning evaluation and confirming import). Adding a composite index would cost write overhead on every CSO attribute value insert/update for a query path that barely gets hit.

Architectural design: Lookup vs Hydration separation

Critical for future caching (#436) — this separation determines how easily persistent caching slots in later.

The implementation introduces a clear seam between lookup and hydration:

```
Import loop:

  1. csoId = LookupCsoByExternalId(externalId) ← cacheable, swappable
  2. if csoId != null:
    cso = HydrateCsoAsync(csoId) ← batched, swappable
    UpdateFromImport(cso, importObject)
  3. else:
    CreateNewCso(importObject)
```

Today:

  • `LookupCsoByExternalId` = pre-fetch dictionary (single query at import start)
  • `HydrateCsoAsync` = PK-based load per object

When persistent caching arrives (#436):

  • `LookupCsoByExternalId` = cache lookup (Redis/in-memory), dictionary as fallback/warm-up
  • `HydrateCsoAsync` = cache-first with DB batch fallback for misses
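
A hypothetical shape for that swap, to show why the seam makes it additive. The `ICsoIdCache` interface and all names below are invented for illustration; #436 may define something entirely different:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical cache abstraction (Redis or in-memory); not part of the PR.
public interface ICsoIdCache
{
    Task<Guid?> GetAsync(string key);
}

public class CachedCsoLookupSketch
{
    private readonly ICsoIdCache _cache;
    private readonly Dictionary<string, Guid> _prefetched; // today's dictionary as fallback/warm-up

    public CachedCsoLookupSketch(ICsoIdCache cache, Dictionary<string, Guid> prefetched)
    {
        _cache = cache;
        _prefetched = prefetched;
    }

    // Cache-first lookup; the pre-fetch dictionary answers misses.
    // The import loop calling this would not need to change.
    public async Task<Guid?> LookupCsoByExternalIdAsync(string key)
    {
        var cached = await _cache.GetAsync(key);
        if (cached.HasValue)
            return cached;
        return _prefetched.TryGetValue(key, out var id) ? id : (Guid?)null;
    }
}
```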

This separation ensures the work done here is additive, not throwaway when caching is implemented.

Key files

  • `src/JIM.Worker/Processors/SyncImportTaskProcessor.cs` — `LookupCsoByExternalId`, `HydrateCsoAsync`, `TryAndFindMatchingConnectedSystemObjectAsync`, pre-fetch in `PerformFullImportAsync`
  • `src/JIM.Data/Repositories/ISyncRepository.cs` — `GetAllCsoExternalIdMappingsAsync`, `GetConnectedSystemObjectsByIdsAsync`
  • `src/JIM.PostgresData/Repositories/ConnectedSystemRepository.cs` — `GetAllCsoExternalIdMappingsAsync` (existing), `GetConnectedSystemObjectsByIdsAsync` (new)
  • `test/JIM.Worker.Tests/Synchronisation/ImportBatchPrefetchTests.cs` — 10 tests for pre-fetch pipeline
