
perf: batch pre-fetch CSO external ID mappings during full import #440

@JayVDZ

Description

Summary

FindMatchingCso (called via TryAndFindMatchingConnectedSystemObjectAsync) consumed 85% of full import time — 5+ minutes out of ~6 minutes for a Medium template (1,000 users). It issued a separate EF Core database query for every imported object to check whether a matching CSO already existed.

At Medium scale: 17,000–21,000 individual DB round-trips at ~17ms each.

Resolution

✅ Phase 1: Lookup + Hydrate seam (complete)

Branch: `feature/findmatchingcso-batch-prefetch`
Commit: `f10583ba`

Implemented Option A (pre-fetch dictionary) with the Lookup/Hydrate separation:

  1. Pre-fetch dictionary — Single bulk query at import start loads all CSO external ID mappings into `Dictionary<string, Guid>` (~200 bytes/entry). Stored as `_csoExternalIdLookup` field on `SyncImportTaskProcessor`.

  2. Lookup phase (`LookupCsoByExternalId`) — O(1) dictionary lookup, no DB query. Builds the same cache key format as `GetAllCsoExternalIdMappingsAsync` (`cso:{csId}:{attrId}:{lowerValue}`).

  3. Hydrate phase (`HydrateCsoAsync`) — PK-based entity load by ID via new `GetConnectedSystemObjectsByIdsAsync`, with secondary external ID fallback for PendingProvisioning CSOs. Much cheaper than the previous LOWER() string-scan attribute-value query.

  4. Orchestrator (`TryAndFindMatchingConnectedSystemObjectAsync`) — thin method calling Lookup then Hydrate. Preserves the same external interface for the import loop.
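
The lookup side of the seam can be sketched roughly as follows. This is a minimal illustration only: the real code lives in `SyncImportTaskProcessor` and loads the dictionary from the database via EF Core; the method bodies, the `BuildKey` helper, and the constructor shape here are assumptions based on the description above, not the actual implementation.

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch of the pre-fetch dictionary + O(1) lookup phase.
public class CsoMatcherSketch
{
    // Phase 1: populated once at import start by a single bulk query.
    private readonly Dictionary<string, Guid> _csoExternalIdLookup;

    public CsoMatcherSketch(Dictionary<string, Guid> prefetched) =>
        _csoExternalIdLookup = prefetched;

    // Same cache key format as GetAllCsoExternalIdMappingsAsync:
    // cso:{csId}:{attrId}:{lowerValue}
    public static string BuildKey(int csId, int attrId, string value) =>
        $"cso:{csId}:{attrId}:{value.ToLowerInvariant()}";

    // Phase 2: dictionary lookup, no DB round-trip. Returns the CSO's
    // primary key if a mapping exists, otherwise null.
    public Guid? LookupCsoByExternalId(int csId, int attrId, string externalId) =>
        _csoExternalIdLookup.TryGetValue(BuildKey(csId, attrId, externalId), out var id)
            ? id
            : (Guid?)null;
}
```

Lower-casing the value inside the key is what replaces the old `LOWER()` string-scan query: case normalisation happens once at key-build time instead of per-row in the database.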

New repository methods:

  • `ISyncRepository.GetAllCsoExternalIdMappingsAsync` (previously only on `IConnectedSystemRepository`)
  • `ISyncRepository.GetConnectedSystemObjectsByIdsAsync` — batch loads full CSO entity graphs by ID with same `.Include()` chain as existing per-object queries

Tests: 10 new tests in `ImportBatchPrefetchTests.cs` covering empty CS, re-import with existing CSOs, mixed new/existing, mid-batch creation, and repository method validation. All 2,463 existing tests pass.

Integration test validation (S8 MediumLarge, OpenLDAP)

| Metric | Before | After | Improvement |
|---|---|---|---|
| Avg per call | 17.2ms | 2.2ms | ~8x faster |
| Steady-state per call | ~17ms | ~1.0ms | ~17x faster |
| Total FindMatchingCso | 5m 4s (17,703 calls) | 22.8s (10,556 calls) | ~13x faster |
| Pre-fetch load | N/A | 93–174ms (one-time) | Single query for 5,273 CSOs |

All 6 S8 tests passed (InitialSync, ForwardSync, DetectDrift, ReassertState, NewGroup, DeleteGroup).

⏭️ Phase 2: Page-level batch hydration (not proceeding)

What it would do: Collect ~500 CSO IDs per page from the dictionary, hydrate them all in a single `WHERE Id IN (...)` query instead of one-at-a-time.

Why we're not doing it: Phase 1 reduced FindMatchingCso from 5m 4s to 22.8s — it's no longer the bottleneck. Phase 2 would save an additional ~15–20s by collapsing 10k individual PK lookups into ~20 batch queries, but requires restructuring the single-pass import loop into a two-pass pattern (collect IDs → batch load → process), touching duplicate detection, RPEI creation, and error handling. The ROI is low relative to the structural complexity. Additionally, persistent caching (#436 Future Consideration C) will supersede this entirely — the Lookup/Hydrate seam we built provides the swap point for that future work.

⏭️ Phase 3: Index alignment (not proceeding)

What it would do: Composite index on `(ConnectedSystemId, AttributeId, LOWER(StringValue))` for the old attribute-value search query pattern.

Why we're not doing it: The old `GetConnectedSystemObjectByAttributeAsync` query is no longer on the hot path — Phase 1 replaced it with dictionary lookup + PK hydration. The only remaining call site is the secondary external ID fallback for PendingProvisioning CSOs, which is rare (narrow window between provisioning evaluation and confirming import). Adding a composite index would cost write overhead on every CSO attribute value insert/update for a query path that barely gets hit.

Architectural design: Lookup vs Hydration separation

Critical for future caching (#436) — this separation determines how easily persistent caching slots in later.

The implementation introduces a clear seam between lookup and hydration:

```
Import loop:

  1. csoId = LookupCsoByExternalId(externalId) ← cacheable, swappable
  2. if csoId != null:
    cso = HydrateCsoAsync(csoId) ← batched, swappable
    UpdateFromImport(cso, importObject)
  3. else:
    CreateNewCso(importObject)
```

Today:

  • `LookupCsoByExternalId` = pre-fetch dictionary (single query at import start)
  • `HydrateCsoAsync` = PK-based load per object

When persistent caching arrives (#436):

  • `LookupCsoByExternalId` = cache lookup (Redis/in-memory), dictionary as fallback/warm-up
  • `HydrateCsoAsync` = cache-first with DB batch fallback for misses
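
A hypothetical shape for that swap, to show why the seam makes it additive. The `ICsoIdCache` interface and all names below are invented for illustration; #436 may define something entirely different:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical cache abstraction (Redis or in-memory); not part of the PR.
public interface ICsoIdCache
{
    Task<Guid?> GetAsync(string key);
}

public class CachedCsoLookupSketch
{
    private readonly ICsoIdCache _cache;
    private readonly Dictionary<string, Guid> _prefetched; // today's dictionary as fallback/warm-up

    public CachedCsoLookupSketch(ICsoIdCache cache, Dictionary<string, Guid> prefetched)
    {
        _cache = cache;
        _prefetched = prefetched;
    }

    // Cache-first lookup; the pre-fetch dictionary answers misses.
    // The import loop calling this would not need to change.
    public async Task<Guid?> LookupCsoByExternalIdAsync(string key)
    {
        var cached = await _cache.GetAsync(key);
        if (cached.HasValue)
            return cached;
        return _prefetched.TryGetValue(key, out var id) ? id : (Guid?)null;
    }
}
```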

This separation ensures the work done here is additive, not throwaway when caching is implemented.

Key files

  • `src/JIM.Worker/Processors/SyncImportTaskProcessor.cs` — `LookupCsoByExternalId`, `HydrateCsoAsync`, `TryAndFindMatchingConnectedSystemObjectAsync`, pre-fetch in `PerformFullImportAsync`
  • `src/JIM.Data/Repositories/ISyncRepository.cs` — `GetAllCsoExternalIdMappingsAsync`, `GetConnectedSystemObjectsByIdsAsync`
  • `src/JIM.PostgresData/Repositories/ConnectedSystemRepository.cs` — `GetAllCsoExternalIdMappingsAsync` (existing), `GetConnectedSystemObjectsByIdsAsync` (new)
  • `test/JIM.Worker.Tests/Synchronisation/ImportBatchPrefetchTests.cs` — 10 tests for pre-fetch pipeline
