
Refactor NodeWrapper actor design to resolve async/ray.get tension #19

@cweniger

Description

Problem

NodeWrapper is an async Ray actor because train() needs to yield control for pause/resume (via await asyncio.sleep(0) and await pause_event.wait()). This makes every ray.get inside the actor — particularly in CachedDataLoader and BufferView — block the event loop and trigger warnings:

```
Using blocking ray.get inside async actor. This blocks the event loop.
```

The async design exists so the driver can call pause.remote() / sample_proposal.remote() / resume.remote() on the same actor that's running train(). But training and sampling are actually independent: sample_proposal reads _best_model while train_step updates _model, and _best_model is only updated on validation improvement.
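
For concreteness, a minimal sketch of the current shape (names simplified; the real NodeWrapper and its batch-ref plumbing are more involved):

```python
import asyncio
import ray

@ray.remote
class AsyncTrainer:  # simplified stand-in for NodeWrapper
    def __init__(self):
        self._pause_event = asyncio.Event()
        self._pause_event.set()  # start unpaused

    async def train(self, batch_refs):
        for ref in batch_refs:
            await self._pause_event.wait()  # parks here while paused
            await asyncio.sleep(0)          # yield so pause()/resume() can run
            batch = ray.get(ref)  # blocking call inside an async actor:
                                  # stalls the event loop, triggers the warning
            self._train_step(batch)

    async def pause(self):
        self._pause_event.clear()

    async def resume(self):
        self._pause_event.set()

    def _train_step(self, batch):
        ...  # one gradient step on self._model
```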

Current architecture

  1. Single NodeWrapper actor per node handles both training and sampling
  2. BaseEstimator.train() is async — baked into the base contract
  3. Driver pauses training, samples, resumes, but pausing isn't strictly necessary since _best_model is separate state
  4. All data loading (ray.get on object refs) blocks the event loop

Possible directions

A. Separate training and sampling actors

  • Training actor: sync, owns the training loop, no async needed, ray.get is fine
  • Sampling actor: holds a copy of best_model, always available
  • Training actor pushes new best weights on validation improvement (see the sketch after this list)
  • Pro: clean separation, no async complexity
  • Con: weight synchronization protocol, more actors
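
A minimal sketch of direction A, assuming a push-based weight protocol (all names here are hypothetical, not existing code):

```python
import ray

@ray.remote
class SamplerActor:
    """Holds a snapshot of the best weights; always available for sampling."""

    def __init__(self):
        self._best_weights = None

    def set_weights(self, weights):
        # Called by the trainer whenever validation improves.
        self._best_weights = weights

    def sample_proposal(self, n):
        assert self._best_weights is not None, "no weights pushed yet"
        ...  # rebuild the model from self._best_weights and draw n samples


@ray.remote
class TrainerActor:
    """Plain sync actor: owns the training loop, so blocking ray.get is fine."""

    def __init__(self, sampler):
        self._sampler = sampler  # handle to the SamplerActor

    def train(self, num_epochs):
        for _ in range(num_epochs):
            self._run_epoch()  # may call ray.get freely; no event loop here
            if self._validation_improved():
                # Fire-and-forget push: training never waits on the sampler.
                self._sampler.set_weights.remote(self._best_weights_snapshot())

    def _run_epoch(self):
        ...  # epoch of train_step()s updating self._model

    def _validation_improved(self):
        ...  # compare validation loss against best so far

    def _best_weights_snapshot(self):
        ...  # e.g. a CPU copy of the best model's state dict
```

The driver then holds two handles per node and never needs pause/resume: sampling always reads the last pushed snapshot while training owns the live model.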

B. Driver-controlled epoch loop

  • NodeWrapper exposes train_epoch() instead of long-running train()
  • Driver calls train_epoch.remote() in a loop, interleaving it with sample.remote() (see the sketch below)
  • Actor is sync, no pause/resume needed
  • Pro: simplest change
  • Con: train_step/train_epoch contract may not be universal for all future estimators
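
A sketch of the driver loop under direction B (method names and constants are illustrative; NodeWrapper is assumed refactored into a plain sync actor):

```python
import ray

num_epochs, sample_every, n_samples = 100, 5, 1_000  # illustrative values

node = NodeWrapper.remote()  # assumed sync actor: no pause/resume machinery

for epoch in range(num_epochs):
    # One epoch per call; inside a sync actor, blocking ray.get is fine.
    ray.get(node.train_epoch.remote())

    if epoch % sample_every == 0:
        # Interleave sampling between epochs, entirely from the driver.
        samples = ray.get(node.sample.remote(n_samples))
```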

C. Fully async data loading

  • Keep current design but make CachedDataLoader async (__aiter__/__anext__)
  • Replace ray.get with await on object refs (sketched below)
  • Pro: minimal architectural change
  • Con: async spreads further into the codebase, doesn't address the fundamental design tension
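
A sketch of direction C (only the class name comes from the issue; the internals are assumed): the loader implements __aiter__/__anext__ and awaits each ObjectRef, which suspends the coroutine instead of blocking the loop.

```python
import ray

class AsyncCachedDataLoader:
    """Async-iterator variant of CachedDataLoader (internals assumed)."""

    def __init__(self, batch_refs):
        self._batch_refs = batch_refs  # list of ray.ObjectRef

    def __aiter__(self):
        self._it = iter(self._batch_refs)
        return self

    async def __anext__(self):
        try:
            ref = next(self._it)
        except StopIteration:
            raise StopAsyncIteration
        # Awaiting an ObjectRef yields to the event loop until the object
        # is ready, so pause()/resume() stay responsive.
        return await ref

# Inside the async actor's train():
#     async for batch in AsyncCachedDataLoader(batch_refs):
#         self._train_step(batch)
```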

Context

  • See the TODO comment above the NodeWrapper class in deployed_graph.py
  • LossBasedEstimator (used by SNPE_gaussian) already maintains _best_model / _model separation
  • Current StepwiseEstimator has train_step() as a universal primitive across all estimators
  • GPU utilization gap between standalone scripts (~45%) and falcon (~18%) is partly due to this overhead
