-
Notifications
You must be signed in to change notification settings - Fork 0
perf: batch pre-fetch CSO external ID mappings during full import #440
Description
Summary
FindMatchingCso (called via TryAndFindMatchingConnectedSystemObjectAsync) consumed 85% of full import time — 5+ minutes out of ~6 minutes for a Medium template (1,000 users). It issued a separate EF Core database query for every imported object to check whether a matching CSO already existed.
At Medium scale: 17,000–21,000 individual DB round-trips at ~17ms each.
Resolution
✅ Phase 1: Lookup + Hydrate seam (complete)
Branch: `feature/findmatchingcso-batch-prefetch`
Commit: `f10583ba`
Implemented Option A (pre-fetch dictionary) with the Lookup/Hydrate separation:
-
Pre-fetch dictionary — Single bulk query at import start loads all CSO external ID mappings into `Dictionary<string, Guid>` (~200 bytes/entry). Stored as `_csoExternalIdLookup` field on `SyncImportTaskProcessor`.
-
Lookup phase (`LookupCsoByExternalId`) — O(1) dictionary lookup, no DB query. Builds the same cache key format as `GetAllCsoExternalIdMappingsAsync` (`cso:{csId}:{attrId}:{lowerValue}`).
-
Hydrate phase (`HydrateCsoAsync`) — PK-based entity load by ID via new `GetConnectedSystemObjectsByIdsAsync`, with secondary external ID fallback for PendingProvisioning CSOs. Much cheaper than the previous LOWER() string-scan attribute-value query.
-
Orchestrator (`TryAndFindMatchingConnectedSystemObjectAsync`) — thin method calling Lookup then Hydrate. Preserves the same external interface for the import loop.
New repository methods:
- `ISyncRepository.GetAllCsoExternalIdMappingsAsync` (previously only on `IConnectedSystemRepository`)
- `ISyncRepository.GetConnectedSystemObjectsByIdsAsync` — batch loads full CSO entity graphs by ID with same `.Include()` chain as existing per-object queries
Tests: 10 new tests in `ImportBatchPrefetchTests.cs` covering empty CS, re-import with existing CSOs, mixed new/existing, mid-batch creation, and repository method validation. All 2,463 existing tests pass.
Integration test validation (S8 MediumLarge, OpenLDAP)
| Metric | Before | After | Improvement |
|---|---|---|---|
| Avg per call | 17.2ms | 2.2ms | ~8x faster |
| Steady-state per call | ~17ms | ~1.0ms | ~17x faster |
| Total FindMatchingCso | 5m 4s (17,703 calls) | 22.8s (10,556 calls) | ~13x faster |
| Pre-fetch load | N/A | 93–174ms (one-time) | Single query for 5,273 CSOs |
All 6 S8 tests passed (InitialSync, ForwardSync, DetectDrift, ReassertState, NewGroup, DeleteGroup).
⏭️ Phase 2: Page-level batch hydration (not proceeding)
What it would do: Collect ~500 CSO IDs per page from the dictionary, hydrate them all in a single `WHERE Id IN (...)` query instead of one-at-a-time.
Why we're not doing it: Phase 1 reduced FindMatchingCso from 5m 4s to 22.8s — it's no longer the bottleneck. Phase 2 would save an additional ~15–20s by collapsing 10k individual PK lookups into ~20 batch queries, but requires restructuring the single-pass import loop into a two-pass pattern (collect IDs → batch load → process), touching duplicate detection, RPEI creation, and error handling. The ROI is low relative to the structural complexity. Additionally, persistent caching (#436 Future Consideration C) will supersede this entirely — the Lookup/Hydrate seam we built provides the swap point for that future work.
⏭️ Phase 3: Index alignment (not proceeding)
What it would do: Composite index on `(ConnectedSystemId, AttributeId, LOWER(StringValue))` for the old attribute-value search query pattern.
Why we're not doing it: The old `GetConnectedSystemObjectByAttributeAsync` query is no longer on the hot path — Phase 1 replaced it with dictionary lookup + PK hydration. The only remaining call site is the secondary external ID fallback for PendingProvisioning CSOs, which is rare (narrow window between provisioning evaluation and confirming import). Adding a composite index would cost write overhead on every CSO attribute value insert/update for a query path that barely gets hit.
Architectural design: Lookup vs Hydration separation
Critical for future caching (#436) — this separation determines how easily persistent caching slots in later.
The implementation introduces a clear seam between lookup and hydration:
```
Import loop:
- csoId = LookupCsoByExternalId(externalId) ← cacheable, swappable
- if csoId != null:
cso = HydrateCsoAsync(csoId) ← batched, swappable
UpdateFromImport(cso, importObject) - else:
CreateNewCso(importObject)
```
Today:
- `LookupCsoByExternalId` = pre-fetch dictionary (single query at import start)
- `HydrateCsoAsync` = PK-based load per object
When persistent caching arrives (#436):
- `LookupCsoByExternalId` = cache lookup (Redis/in-memory), dictionary as fallback/warm-up
- `HydrateCsoAsync` = cache-first with DB batch fallback for misses
This separation ensures the work done here is additive, not throwaway when caching is implemented.
Key files
- `src/JIM.Worker/Processors/SyncImportTaskProcessor.cs` — `LookupCsoByExternalId`, `HydrateCsoAsync`, `TryAndFindMatchingConnectedSystemObjectAsync`, pre-fetch in `PerformFullImportAsync`
- `src/JIM.Data/Repositories/ISyncRepository.cs` — `GetAllCsoExternalIdMappingsAsync`, `GetConnectedSystemObjectsByIdsAsync`
- `src/JIM.PostgresData/Repositories/ConnectedSystemRepository.cs` — `GetAllCsoExternalIdMappingsAsync` (existing), `GetConnectedSystemObjectsByIdsAsync` (new)
- `test/JIM.Worker.Tests/Synchronisation/ImportBatchPrefetchTests.cs` — 10 tests for pre-fetch pipeline