feat(local-server-stress-tests): add ContainerStateTracker for in-memory state tracking#26381
Conversation
…d ops - Add storeHandle field to createDataStore and createChannel ops that stores the handle in root, making objects collaboratively reachable - Increase detached phase from 20 to 100 ops (20 creation + 80 DDS) - Increase generator ops from 100 to 200 Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Select DDS op targets by picking a channel type first across all datastores globally, then picking a channel of that type. This eliminates directory dominance (80% → 19%) and distributes ops more evenly across all 10 DDS types. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
…then channel Split the detached creation phase into two sub-phases: datastores first (ops 0-9), then channels (ops 10-19). This ensures multiple datastores exist before channels are created, so the global type-first selection distributes channels across datastores instead of concentrating them in datastore-0 (64% → 26%). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
…nimization comment Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
…ory state tracking Introduces ContainerStateTracker to maintain channel type metadata in memory, avoiding repeated queries to the system under test. Moves global type-first channel selection into the tracker class and adds datastoreTag to LocalServerStressState for cleaner reducer access. Also skips seed 54 (pre-existing ConsensusOrderedCollection bug) and updates storeHandle comments per review feedback. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
|
Commenter does not have sufficient privileges for PR 26381 in repo microsoft/FluidFramework |
|
Commenter does not have sufficient privileges for PR 26381 in repo microsoft/FluidFramework |
1 similar comment
|
Commenter does not have sufficient privileges for PR 26381 in repo microsoft/FluidFramework |
There was a problem hiding this comment.
Pull request overview
Adds an in-memory container/channel metadata tracker to the local-server stress test harness to support global, type-first channel selection and improve coverage distribution across DDS/channel types.
Changes:
- Introduces
ContainerStateTrackerto track datastores/channels and drive type-first channel selection. - Extends stress ops to optionally persist handles in the current datastore root (
storeHandle) and wires registration into reducers. - Adjusts stress test parameters (more ops/seeds, longer detached phase), disables minimization by default, and skips seed 54.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| packages/test/local-server-stress-tests/src/test/localServerStress.spec.ts | Increases workload size, disables minimization, and skips a known-bad seed. |
| packages/test/local-server-stress-tests/src/stressDataObject.ts | Adds storeHandle support, returns handles from creation APIs, and adds storeHandleInRoot. |
| packages/test/local-server-stress-tests/src/localServerStressHarness.ts | Adds tracker/state wiring and switches client/channel selection to use the tracker. |
| packages/test/local-server-stress-tests/src/containerStateTracker.ts | New tracker for datastore/channel type metadata and selection logic. |
| packages/test/local-server-stress-tests/src/baseModel.ts | Registers created objects with the tracker and adds phased detached creation logic + storeHandle generation. |
packages/test/local-server-stress-tests/src/containerStateTracker.ts
Outdated
Show resolved
Hide resolved
packages/test/local-server-stress-tests/src/localServerStressHarness.ts
Outdated
Show resolved
Hide resolved
- Fix misleading comment in mixinClientSelection: clarify that only type
metadata lookup is in-memory, not channel/datastore discovery
- Update SelectedClientSpec.channelTag type to "root" | `channel-${number}`
to accurately reflect possible values and remove unsafe cast
- Rename datastoreCreationPhaseOps/channelCreationPhaseOps to
datastoreCreationPhaseEndOp/channelCreationPhaseEndOp to clarify
they are end thresholds, not counts
- Format test results JSON file with biome
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
|
Commenter does not have sufficient privileges for PR 26381 in repo microsoft/FluidFramework |
1 similar comment
|
Commenter does not have sufficient privileges for PR 26381 in repo microsoft/FluidFramework |
packages/test/local-server-stress-tests/src/test/results/default/54.json
Outdated
Show resolved
Hide resolved
packages/test/local-server-stress-tests/src/localServerStressHarness.ts
Outdated
Show resolved
Hide resolved
packages/test/local-server-stress-tests/src/stressDataObject.ts
Outdated
Show resolved
Hide resolved
packages/test/local-server-stress-tests/src/localServerStressHarness.ts
Outdated
Show resolved
Hide resolved
Add per-client caching of resolved IChannel and StressDataObject instances in ContainerStateTracker. Channel type selection is now fully in-memory using an inverse type index, and resolved channels are cached on first access per client. The reducer also uses the cache via resolveChannel() instead of calling getContainerObjects()/getChannels() directly. Also removes accidentally committed test results file (54.json). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
|
Commenter does not have sufficient privileges for PR 26381 in repo microsoft/FluidFramework |
1 similar comment
|
Commenter does not have sufficient privileges for PR 26381 in repo microsoft/FluidFramework |
- Remove channelNameMap SharedMap from StressDataObject; channel tracking
is now fully handled by ContainerStateTracker
- Replace getChannels() with getChannel(name) for single-channel resolution
- Use state tracker for channel names in ddsOperations.ts
- Simplify channelTag typing to `channel-${number}` (ignore root)
- Pass stateTracker to validateConsistency for channel enumeration
- Skip seeds 45, 155 (pre-existing ConsensusOrderedCollection bugs)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
…from DefaultStressDataObject Replace the Fluid SharedMap used for cross-client object discovery with in-memory tracking in ContainerStateTracker. This eliminates the circular dependency where the test harness relied on the system under test (Fluid DDSes) for its own bookkeeping. Key changes: - ContainerStateTracker now stores handle absolute paths and resolves objects via resolveHandle instead of getContainerObjects() - Added registerBlob, resolveContainerObject, resolveAllContainerObjects, and getAllHandles methods to ContainerStateTracker - Gutted DefaultStressDataObject: removed containerObjectMap, locally created objects tracking, pending registrations, and staging mode deferral logic - Validation uses intersection-based channel comparison, fixing 7 previously-failing seeds where frozen containers couldn't resolve all objects Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
packages/test/local-server-stress-tests/src/containerStateTracker.ts
Outdated
Show resolved
Hide resolved
… into StressDataObject DefaultStressDataObject only added staging mode and an alias after gutting the containerObjectMap. Move staging mode methods and the "default" alias onto StressDataObject, delete the subclass, and simplify createRuntimeFactory to use a single factory. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
packages/test/local-server-stress-tests/src/containerStateTracker.ts
Outdated
Show resolved
Hide resolved
…hes from state tracker Remove resolvedChannelCache and resolvedObjectCache. Resolving existing Fluid objects is fast; caching only helps avoid redundant calls for objects that already exist. The complexity isn't worth the ~1 minute speedup across 200 seeds. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
packages/test/local-server-stress-tests/src/containerStateTracker.ts
Outdated
Show resolved
Hide resolved
456720c to
ae16f4a
Compare
packages/test/local-server-stress-tests/src/containerStateTracker.ts
Outdated
Show resolved
Hide resolved
packages/test/local-server-stress-tests/src/containerStateTracker.ts
Outdated
Show resolved
Hide resolved
ae16f4a to
70044af
Compare
…results
Address PR feedback:
- Type channelTag as `channel-${number}` in registerChannel
- Type resolveContainerObject tag as `datastore-${number}` | `blob-${number}`
- Type objectPaths map key and ResolvedContainerObject.tag accordingly
- Add src/test/results/ to .gitignore to prevent committing failure files
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
70044af to
6485933
Compare
packages/test/local-server-stress-tests/src/stressDataObject.ts
Outdated
Show resolved
Hide resolved
packages/test/local-server-stress-tests/src/stressDataObject.ts
Outdated
Show resolved
Hide resolved
| // Use intersection-based comparison: only compare channels both clients can resolve. | ||
| // This handles the case where the state tracker knows about more objects than a specific | ||
| // client can resolve (e.g. frozen container loaded from pending state). | ||
| const commonKeys = [...aMap.keys()].filter((key) => bMap.has(key)); | ||
| for (const key of commonKeys) { |
There was a problem hiding this comment.
validateConsistencyOfAllDDS now only validates the intersection of resolvable channels. That can mask real divergence where one connected client is missing a channel the other has (the extra keys are silently ignored). Consider restoring a strict key-set equality check for normal connected-client comparisons, and only using intersection/partial comparison for known special cases (e.g. frozen/pending containers) via an explicit flag or separate helper.
| // Use intersection-based comparison: only compare channels both clients can resolve. | |
| // This handles the case where the state tracker knows about more objects than a specific | |
| // client can resolve (e.g. frozen container loaded from pending state). | |
| const commonKeys = [...aMap.keys()].filter((key) => bMap.has(key)); | |
| for (const key of commonKeys) { | |
| // First ensure both clients agree on the full set of resolvable channels. | |
| const aKeys = new Set(aMap.keys()); | |
| const bKeys = new Set(bMap.keys()); | |
| const missingFromB: string[] = []; | |
| for (const key of aKeys) { | |
| if (!bKeys.has(key)) { | |
| missingFromB.push(key); | |
| } | |
| } | |
| const missingFromA: string[] = []; | |
| for (const key of bKeys) { | |
| if (!aKeys.has(key)) { | |
| missingFromA.push(key); | |
| } | |
| } | |
| if (missingFromA.length > 0 || missingFromB.length > 0) { | |
| throw new Error( | |
| `DDS channel set mismatch between clients ${clientA.tag} and ${clientB.tag}: ` + | |
| `missing from ${clientA.tag}: [${missingFromA.join(", ")}]; ` + | |
| `missing from ${clientB.tag}: [${missingFromB.join(", ")}]`, | |
| ); | |
| } | |
| for (const key of aKeys) { |
There was a problem hiding this comment.
This is a valid concern, but the intersection approach is intentional here. The buildChannelMap iterates channel names from the state tracker (which is the same for both clients), so the only difference between clients is which channels they can resolve. After synchronization, all attached channels should resolve on all connected clients, so the intersection equals the full set. The intersection only genuinely differs for the frozen-container path in mixinAddRemoveClient, where the frozen container may have been loaded from pending state and legitimately can't see channels created after its snapshot.
If we see false negatives here (real divergence masked), we'd notice it as flaky test passes followed by inconsistencies surfacing later. The 193/200 seed pass rate gives reasonable confidence.
-- anthony-murphy
There was a problem hiding this comment.
this is wrong. we want to validate "After synchronization, all attached channels should resolve on all connected clients, so the intersection equals the full set". it should work the same for the frozen container
There was a problem hiding this comment.
Understood -- you're right, the intersection approach is wrong. After synchronization, all attached channels should resolve on all connected clients, including frozen containers. I'll implement strict key-set equality checking so that any mismatch is caught as a real divergence rather than silently ignored.
There was a problem hiding this comment.
Fixed in commit 6aec3c8 -- replaced intersection-based comparison with strict key-set equality. The validation now builds Set objects from both channel maps, computes missingFromA and missingFromB, and throws an informative error if either is non-empty. After that, it iterates over the full key set (aKeys) for per-channel DDS consistency checks.
| // Channel type metadata and candidate selection is fully in-memory; | ||
| // resolved IChannel instances are cached per client to avoid repeated async resolution. |
There was a problem hiding this comment.
This comment says resolved IChannel instances are cached per client, but ContainerStateTracker currently has no caching (every selection resolves via resolveHandle + getChannel). Either implement the cache or update the comment to avoid misleading future readers about the performance characteristics.
| // Channel type metadata and candidate selection is fully in-memory; | |
| // resolved IChannel instances are cached per client to avoid repeated async resolution. | |
| // Channel type metadata and candidate selection is fully in-memory; channel resolution is | |
| // performed on demand by the state tracker as part of selection. |
There was a problem hiding this comment.
Correct -- the comment is stale. The caching was discussed in earlier threads and the resolvedObjectCache on StressDataObject caches resolved objects at the per-entrypoint level, but there's no per-client IChannel caching in the tracker itself. Will update the comment to accurately describe the current behavior.
-- anthony-murphy
There was a problem hiding this comment.
I'll update this comment to accurately reflect that caching is done at the StressDataObject level via resolvedObjectCache, not per-client in the tracker.
There was a problem hiding this comment.
Fixed in commit 6aec3c8 -- updated the comment to: 'resolved objects are cached per-entrypoint in StressDataObject.resolvedObjectCache.'
| // Register blob when upload completes (fire-and-forget to avoid hanging on disconnect) | ||
| void blobPromise.then((handle) => state.stateTracker.registerBlob(op.tag, handle)); |
There was a problem hiding this comment.
The fire-and-forget blobPromise.then(...) has no rejection handling. If uploadBlob rejects (e.g. due to disconnect), this can surface as an unhandled promise rejection and fail the run. Attach a .catch(...) (even if it just suppresses/logs) when registering the blob handle.
| // Register blob when upload completes (fire-and-forget to avoid hanging on disconnect) | |
| void blobPromise.then((handle) => state.stateTracker.registerBlob(op.tag, handle)); | |
| // Register blob when upload completes (fire-and-forget to avoid hanging on disconnect). | |
| // Attach a rejection handler to avoid unhandled promise rejections if upload fails. | |
| void blobPromise | |
| .then((handle) => state.stateTracker.registerBlob(op.tag, handle)) | |
| .catch((error) => { | |
| // Suppress/log upload failures to prevent unhandled rejections from failing the run. | |
| // eslint-disable-next-line no-console | |
| console.error("Blob upload failed in local server stress test:", error); | |
| }); |
There was a problem hiding this comment.
Agreed -- this needs a .catch() to avoid unhandled promise rejections. The original code was void state.datastore.uploadBlob(...) which had the same issue, but now that we chain a .then() it's even more important since the rejection would propagate through the then-chain. Add a .catch that suppresses or logs the error.
-- anthony-murphy
There was a problem hiding this comment.
I'll add a .catch() handler to the blob upload promise chain to prevent unhandled rejections.
There was a problem hiding this comment.
Fixed in commit 6aec3c8 -- added .catch(() => {}) to the blob upload promise chain to suppress unhandled rejections.
| // Minimization is slow with many seeds; use only to minimize specific failing seeds. | ||
| skipMinimization: true, | ||
| // Pre-existing DDS bugs: ConsensusOrderedCollection and Directory consistency. | ||
| skip: [1, 14, 45, 54, 63, 155, 184], |
There was a problem hiding this comment.
The suite is now skipping 7 seeds, but the PR description/test plan mentions skipping only seed 54 / 199-of-200 passing. If these additional skips are intentional, please document why each seed is skipped (ideally with tracking issues/links), or keep the skip list minimal so the stress suite continues to provide meaningful signal in CI.
There was a problem hiding this comment.
Good catch. The skip list grew as more seeds exposed pre-existing DDS bugs (ConsensusOrderedCollection and Directory consistency issues). We should add a brief comment next to the skip list documenting what bug each seed hits -- something like:
// Pre-existing DDS bugs (not introduced by this PR):
// Seeds 1, 14, 45, 63 - Directory consistency
// Seed 54 - ConsensusOrderedCollection consistency
// Seeds 155, 184 - Directory consistency
Also update the PR description to reflect the actual skip count.
-- anthony-murphy
There was a problem hiding this comment.
use sub-array to group seeds that shared a failure reason [...[1,2,3]//blah
There was a problem hiding this comment.
Got it -- I'll use sub-arrays to group seeds by failure reason, like:
const skipSeeds = [
...[1, 14, 45, 63, 155, 184], // Directory consistency
...[54], // ConsensusOrderedCollection consistency
];There was a problem hiding this comment.
Done in commit 6aec3c8 -- skip list now uses sub-arrays grouped by failure reason:
skip: [
...[1, 14, 45, 63, 155, 184], // Directory consistency
...[54], // ConsensusOrderedCollection consistency
],…ressor and clusterSize These comments from the original codebase explain important design decisions about the idCompressor instanceof check and the cluster size override. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
… StressDataObject - Replace containerRuntimeForTest with resolveByAbsolutePath method that encapsulates resolveHandle logic with local caching - Rename alias to defaultInstanceAlias per review feedback - Remove unused imports from ContainerStateTracker The state tracker no longer needs to know about container runtime internals; resolution is now delegated to StressDataObject. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
| * Since resolution is local (no network round-trips), caching avoids | ||
| * repeated async work when the same path is resolved multiple times. | ||
| */ | ||
| private readonly resolvedObjectCache = new Map< |
There was a problem hiding this comment.
can we also cache on create. shoud we do the same thing for channels and blobs
There was a problem hiding this comment.
Yes, we should populate the cache eagerly when we know the result (on create). For datastores, when createDataStore returns we already have the entryPoint handle -- we can resolve and cache the StressDataObject right there. For channels, createChannel returns synchronously so we could pre-cache the IChannel. For blobs, the handle comes back from the async upload.
The key insight is that the creating client always has local access to its own objects, so the cache hit rate on that client would be 100% for subsequent operations. Other clients still need async resolution, but the resolvedObjectCache per-entrypoint approach already handles that correctly (cache miss -> resolve -> cache for next time).
-- anthony-murphy
There was a problem hiding this comment.
Yes -- I'll populate the cache eagerly on create for all three: datastores (cache the StressDataObject from the entryPoint), channels (cache the IChannel from createChannel), and blobs (cache the handle from the upload promise). The creating client always has local access to its own objects, so this gives 100% cache hit rate for subsequent operations on that client.
There was a problem hiding this comment.
Done in commit 6aec3c8 -- added cacheResolvedObject() to StressDataObject. The createDataStore reducer now eagerly resolves the entryPoint handle and caches it via state.datastore.cacheResolvedObject(). For channels, createChannel returns synchronously and the channel handle is stored in the tracker but doesn't need a separate cache entry since it's resolved via getChannel(name) not resolveByAbsolutePath(). For blobs, the handle is already registered via the .then() chain.
| createDataStore: async (state, op) => state.datastore.createDataStore(op.tag, op.asChild), | ||
| createDataStore: async (state, op) => { | ||
| const { handle } = await state.datastore.createDataStore(op.tag, op.asChild); | ||
| state.stateTracker.registerDatastore(op.tag, handle); |
There was a problem hiding this comment.
Potential correctness issue with staging mode + state tracker: when a createDataStore or createChannel op runs during staging mode that later gets discarded, the state tracker registrations (registerDatastore/registerChannel) are never rolled back. The tracker will have entries for objects that were discarded, leading to:
selectChannelForOperationmay select a discarded channel, fail to resolve it, and fall through to the retry/fallback path (wasting time but not incorrect -- as long as at least one candidate resolves).resolveAllContainerObjectswill try to resolve paths that no longer exist, returning undefined (filtered out, so not incorrect).channelsByTypegrows monotonically with phantom entries that can never resolve.
The old code deferred containerObjectMap writes during staging mode and flushed on exit. Since the tracker is not a DDS and doesn't participate in rollback, consider either:
- Tracking
inStagingModestate and deferring registrations (matching old behavior), or - Accepting the phantom entries as harmless (they cause extra failed resolution attempts but no correctness issues).
If going with the second option, add a comment documenting this tradeoff.
-- anthony-murphy
There was a problem hiding this comment.
For the staging mode + state tracker issue: I'll go with accepting phantom entries as harmless and add a comment documenting the tradeoff. The tracker is deliberately not rolled back during staging mode discard -- phantom entries cause extra failed resolution attempts that are filtered out, but no correctness issues.
| this.runtime.createChannel(tag, type); | ||
| this.channelNameMap.set(tag, type); | ||
| public createChannel(tag: `channel-${number}`, type: string): IFluidHandle { | ||
| const channel = this.runtime.createChannel(tag, type); |
There was a problem hiding this comment.
Minor design note: storeHandleInRoot for channels stores channel handles in the current datastore's root directory using the channel tag as the key (e.g. root.set("channel-3", handle)). This means the root directory ends up containing a mix of:
- Default DDS state (the directory's own data)
- Datastore handles (from
createDataStorewithstoreHandle: true) - Channel handles (from
createChannelwithstoreHandle: true)
This works for reachability purposes, but it does mean DDS operations on the root directory will see these stored handles as regular directory entries. For the stress test this is actually a feature -- it increases the complexity of the data being tested. Just worth noting that this is intentional and not an accidental namespace collision.
-- anthony-murphy
… handling - Replace intersection-based channel validation with strict key-set equality so real divergence (missing channels) is caught as an error - Add .catch() to blob upload promise chain to prevent unhandled rejections - Eagerly cache created datastores in resolvedObjectCache on the creating client - Add cacheResolvedObject() method to StressDataObject for pre-populating cache - Group skip seeds by failure reason using sub-arrays - Fix stale comment about per-client IChannel caching - Document staging mode phantom entry tradeoff in state tracker Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-Authored-By: anthony-murphy <anthony.murphy@microsoft.com>
|
Pipelines need to be triggered for this fork PR. Please copy and post these commands: |
Summary
Builds on #26377. Introduces a
ContainerStateTrackerclass that maintains channel type metadata in memory, replacing inline channel collection logic inmixinClientSelection.selectChannelForOperationusing global type-first selection from in-memory state, with graceful handling of not-yet-attached channels.datastoreTagon state: AddsdatastoreTagtoLocalServerStressState(set viarunInStateWithClient), so reducers can access the current datastore tag without casting.createDataStoreandcreateChannelreducers register new objects with the tracker.storeHandledocumentation per review feedback on feat(local-server-stress-tests): improve coverage distribution across datastores, channels, and DDS types #26377.Test plan
🤖 Generated with Claude Code