Problem
`NodeWrapper` is an async Ray actor because `train()` needs to yield control for pause/resume (via `await asyncio.sleep(0)` and `await pause_event.wait()`). This makes every `ray.get` inside the actor (notably in `CachedDataLoader` and `BufferView`) block the event loop and trigger warnings:

> Using blocking ray.get inside async actor. This blocks the event loop.
The async design exists so the driver can call `pause.remote()` / `sample_proposal.remote()` / `resume.remote()` on the same actor that's running `train()`. But training and sampling are actually independent: `sample_proposal` reads `_best_model` while `train_step` updates `_model`, and `_best_model` is only updated on validation improvement.
Current architecture
- Single `NodeWrapper` actor per node handles both training and sampling
- `BaseEstimator.train()` is `async`; this is baked into the base contract
- Driver pauses training, samples, resumes, though pause isn't strictly necessary since `best_model` is separate state
- All data loading (`ray.get` on object refs) blocks the event loop (sketched below)
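
A condensed sketch of this pattern; only `NodeWrapper`, `train`, `pause`, and `resume` come from the issue above, and the loader internals and event name are assumptions rather than the actual code:

```python
import asyncio

import ray


@ray.remote
class NodeWrapper:
    def __init__(self, batch_refs):
        self._batch_refs = batch_refs       # ObjectRefs to cached batches
        self._pause_event = asyncio.Event() # assumed pause mechanism
        self._pause_event.set()             # start unpaused

    async def train(self, num_steps):
        for _ in range(num_steps):
            await self._pause_event.wait()  # park here while paused
            await asyncio.sleep(0)          # yield so pause()/sample() can run
            for ref in self._batch_refs:
                batch = ray.get(ref)        # blocks the event loop -> warning
                self._train_step(batch)

    def pause(self):
        self._pause_event.clear()

    def resume(self):
        self._pause_event.set()

    def _train_step(self, batch):
        ...                                 # gradient step, _best_model updates
```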
Possible directions
A. Separate training and sampling actors
- Training actor: sync, owns the training loop, no async needed, `ray.get` is fine
- Sampling actor: holds a copy of `best_model`, always available
- Training actor pushes new best weights on validation improvement (see the sketch after this list)
- Pro: clean separation, no async complexity
- Con: weight synchronization protocol, more actors
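
A minimal sketch of what option A could look like; the actor names, the placeholder epoch logic, and the push-based handoff are assumptions, since the exact weight-sync protocol is the open question here:

```python
import random

import ray


@ray.remote
class SamplingActor:
    """Always available; holds the latest best weights pushed by the trainer."""

    def __init__(self):
        self._best_weights = None

    def set_weights(self, weights):
        self._best_weights = weights

    def sample_proposal(self, n):
        if self._best_weights is None:
            raise RuntimeError("no weights pushed yet")
        # Placeholder: real code would draw n samples from the proposal.
        return [self._best_weights] * n


@ray.remote
class TrainingActor:
    """Plain sync actor: blocking ray.get in the data path is harmless here."""

    def __init__(self, sampler):
        self._sampler = sampler
        self._best_val_loss = float("inf")

    def train(self, num_epochs):
        for epoch in range(num_epochs):
            # Placeholder epoch; real code trains, validates, uses ray.get freely.
            weights, val_loss = {"epoch": epoch}, random.random()
            if val_loss < self._best_val_loss:
                self._best_val_loss = val_loss
                # Fire-and-forget push on validation improvement.
                self._sampler.set_weights.remote(weights)


sampler = SamplingActor.remote()
trainer = TrainingActor.remote(sampler)
trainer.train.remote(100)                             # runs independently
samples = ray.get(sampler.sample_proposal.remote(8))  # may raise before first push
```

Note the handoff detail this surfaces: until the first push lands, `sample_proposal` has no weights to serve, which is part of the synchronization protocol the con refers to.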
B. Driver-controlled epoch loop
- `NodeWrapper` exposes `train_epoch()` instead of a long-running `train()`
- Driver calls `train_epoch.remote()` in a loop, interleaved with `sample.remote()` (see the sketch after this list)
- Actor is sync, no pause/resume needed
- Pro: simplest change
- Con: `train_step`/`train_epoch` contract may not be universal for all future estimators
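
A sketch of the driver loop for option B, assuming `NodeWrapper` is reworked into a plain sync actor exposing `train_epoch()` and `sample()`; the metrics dict and convergence signal are illustrative assumptions:

```python
import ray


@ray.remote
class NodeWrapper:
    """Sketch: now a plain sync actor, no pause/resume machinery."""

    def train_epoch(self):
        ...                              # one full epoch; ray.get is fine here
        return {"converged": False}      # placeholder metrics

    def sample(self, n):
        ...                              # draw from _best_model
        return [None] * n                # placeholder proposals


node = NodeWrapper.remote()
for _ in range(100):                                # assumed max epochs
    metrics = ray.get(node.train_epoch.remote())    # epoch runs to completion
    proposal = ray.get(node.sample.remote(1024))    # actor is idle in between
    if metrics["converged"]:                        # assumed convergence signal
        break
```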
C. Fully async data loading
- Keep current design but make `CachedDataLoader` async (`__aiter__`/`__anext__`)
- Replace `ray.get` with `await` on object refs (see the sketch after this list)
- Pro: minimal architectural change
- Con: async spreads further into the codebase and doesn't address the fundamental design tension
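
A sketch of option C; only the `__aiter__`/`__anext__` shape and awaiting object refs come from the list above, and the loader internals are assumptions:

```python
import ray


class CachedDataLoader:
    """Async variant: hands out batches by awaiting ObjectRefs."""

    def __init__(self, batch_refs):
        self._batch_refs = batch_refs

    def __aiter__(self):
        self._pending = iter(self._batch_refs)
        return self

    async def __anext__(self):
        try:
            ref = next(self._pending)
        except StopIteration:
            raise StopAsyncIteration
        # Awaiting the ref yields to the event loop instead of blocking it.
        return await ref


# Inside the async actor, train() would iterate with `async for`:
#     async for batch in CachedDataLoader(self._batch_refs):
#         self._train_step(batch)
```

Awaiting ObjectRefs inside async actors is supported by Ray, so the change is local to the data path; the catch is that every caller of the loader must itself become async, which is the spreading the con above refers to.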
Context
- See the TODO comment above the `NodeWrapper` class in `deployed_graph.py`
- `LossBasedEstimator` (used by `SNPE_gaussian`) already maintains the `_best_model`/`_model` separation
- Current `StepwiseEstimator` has `train_step()` as a universal primitive across all estimators
- The GPU utilization gap between standalone scripts (~45%) and falcon (~18%) is partly due to this overhead