
Refactor NodeWrapper actor design to resolve async/ray.get tension #19

@cweniger

Description

Problem

NodeWrapper is an async Ray actor because train() needs to yield control for pause/resume (via await asyncio.sleep(0) and await pause_event.wait()). This makes every ray.get inside the actor — particularly in CachedDataLoader and BufferView — block the event loop and trigger warnings:

```
Using blocking ray.get inside async actor. This blocks the event loop.
```

The async design exists so the driver can call pause.remote() / sample_proposal.remote() / resume.remote() on the same actor that's running train(). But training and sampling are actually independent: sample_proposal reads _best_model while train_step updates _model, and _best_model is only updated on validation improvement.
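
For concreteness, a minimal sketch of the current shape (names simplified; the real NodeWrapper and its batch-ref plumbing are more involved):

```python
import asyncio
import ray

@ray.remote
class AsyncTrainer:  # simplified stand-in for NodeWrapper
    def __init__(self):
        self._pause_event = asyncio.Event()
        self._pause_event.set()  # start unpaused

    async def train(self, batch_refs):
        for ref in batch_refs:
            await self._pause_event.wait()  # parks here while paused
            await asyncio.sleep(0)          # yield so pause()/resume() can run
            batch = ray.get(ref)  # blocking call inside an async actor:
                                  # stalls the event loop, triggers the warning
            self._train_step(batch)

    async def pause(self):
        self._pause_event.clear()

    async def resume(self):
        self._pause_event.set()

    def _train_step(self, batch):
        ...  # one gradient step on self._model
```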

Current architecture

  1. Single NodeWrapper actor per node handles both training and sampling
  2. BaseEstimator.train() is async — baked into the base contract
  3. Driver pauses training, samples, resumes, but pausing isn't strictly necessary since _best_model is separate state
  4. All data loading (ray.get on object refs) blocks the event loop

Possible directions

A. Separate training and sampling actors

  • Training actor: sync, owns the training loop, no async needed, ray.get is fine
  • Sampling actor: holds a copy of best_model, always available
  • Training actor pushes new best weights on validation improvement (see the sketch after this list)
  • Pro: clean separation, no async complexity
  • Con: weight synchronization protocol, more actors
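
A minimal sketch of direction A, assuming a push-based weight protocol (all names here are hypothetical, not existing code):

```python
import ray

@ray.remote
class SamplerActor:
    """Holds a snapshot of the best weights; always available for sampling."""

    def __init__(self):
        self._best_weights = None

    def set_weights(self, weights):
        # Called by the trainer whenever validation improves.
        self._best_weights = weights

    def sample_proposal(self, n):
        assert self._best_weights is not None, "no weights pushed yet"
        ...  # rebuild the model from self._best_weights and draw n samples


@ray.remote
class TrainerActor:
    """Plain sync actor: owns the training loop, so blocking ray.get is fine."""

    def __init__(self, sampler):
        self._sampler = sampler  # handle to the SamplerActor

    def train(self, num_epochs):
        for _ in range(num_epochs):
            self._run_epoch()  # may call ray.get freely; no event loop here
            if self._validation_improved():
                # Fire-and-forget push: training never waits on the sampler.
                self._sampler.set_weights.remote(self._best_weights_snapshot())

    def _run_epoch(self):
        ...  # epoch of train_step()s updating self._model

    def _validation_improved(self):
        ...  # compare validation loss against best so far

    def _best_weights_snapshot(self):
        ...  # e.g. a CPU copy of the best model's state dict
```

The driver then holds two handles per node and never needs pause/resume: sampling always reads the last pushed snapshot while training owns the live model.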

B. Driver-controlled epoch loop

  • NodeWrapper exposes train_epoch() instead of long-running train()
  • Driver calls train_epoch.remote() in a loop, interleaving it with sample.remote() (see the sketch below)
  • Actor is sync, no pause/resume needed
  • Pro: simplest change
  • Con: train_step/train_epoch contract may not be universal for all future estimators
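
A sketch of the driver loop under direction B (method names and constants are illustrative; NodeWrapper is assumed refactored into a plain sync actor):

```python
import ray

num_epochs, sample_every, n_samples = 100, 5, 1_000  # illustrative values

node = NodeWrapper.remote()  # assumed sync actor: no pause/resume machinery

for epoch in range(num_epochs):
    # One epoch per call; inside a sync actor, blocking ray.get is fine.
    ray.get(node.train_epoch.remote())

    if epoch % sample_every == 0:
        # Interleave sampling between epochs, entirely from the driver.
        samples = ray.get(node.sample.remote(n_samples))
```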

C. Fully async data loading

  • Keep current design but make CachedDataLoader async (__aiter__/__anext__)
  • Replace ray.get with await on object refs (sketched below)
  • Pro: minimal architectural change
  • Con: async spreads further into the codebase, doesn't address the fundamental design tension
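
A sketch of direction C (only the class name comes from the issue; the internals are assumed): the loader implements __aiter__/__anext__ and awaits each ObjectRef, which suspends the coroutine instead of blocking the loop.

```python
import ray

class AsyncCachedDataLoader:
    """Async-iterator variant of CachedDataLoader (internals assumed)."""

    def __init__(self, batch_refs):
        self._batch_refs = batch_refs  # list of ray.ObjectRef

    def __aiter__(self):
        self._it = iter(self._batch_refs)
        return self

    async def __anext__(self):
        try:
            ref = next(self._it)
        except StopIteration:
            raise StopAsyncIteration
        # Awaiting an ObjectRef yields to the event loop until the object
        # is ready, so pause()/resume() stay responsive.
        return await ref

# Inside the async actor's train():
#     async for batch in AsyncCachedDataLoader(batch_refs):
#         self._train_step(batch)
```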

Context

  • See the TODO comment above the NodeWrapper class in deployed_graph.py
  • LossBasedEstimator (used by SNPE_gaussian) already maintains _best_model / _model separation
  • Current StepwiseEstimator has train_step() as a universal primitive across all estimators
  • GPU utilization gap between standalone scripts (~45%) and falcon (~18%) is partly due to this overhead
