Conversation
Always add a PR description containing (1) what this PR does, (2) how to test the result, and (3) which issue it resolves.
(1) What this PR does: It adds streaming brute-force model selection with convergence-based auto-stop, plus fake-data validation utilities and tests. |
nice! |
Pull request overview
Adds streaming-capable model selection variants intended for “online” scenarios where labeled data arrives in batches (Issue #3).
Changes:
- Introduces `StreamingRandomSearchModelSelector` and `StreamingBruteForceModelSelector` with `update()`/`update_one()` and convergence tracking.
- Exposes the new selectors via `agentopt.model_selection` and the top-level `agentopt` exports; adds `method="streaming_random"` to `ModelSelector`.
- Adds tests and documentation updates for streaming random search (plus a README example for streaming brute force).
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 15 comments.
| File | Description |
|---|---|
| `src/agentopt/model_selection/streaming_random_search.py` | New streaming random-search selector with a fixed sampled subset and incremental updates |
| `src/agentopt/model_selection/streaming_brute_force.py` | New streaming brute-force selector intended to update cumulative metrics per batch |
| `src/agentopt/model_selection/__init__.py` | Exports streaming selectors from the `model_selection` package |
| `src/agentopt/__init__.py` | Exports streaming selectors at top level; adds `streaming_random` factory method |
| `tests/test_streaming_random_search.py` | Unit tests for streaming random search behavior and `ModelSelector` factory support |
| `tests/test_streaming_brute_force.py` | Unit tests for streaming brute force update behavior |
| `tests/streaming_brute_force_fake_data.py` | Runnable demo script for streaming brute force on synthetic data |
| `docs/concepts/algorithms.md` | Adds “Streaming Random Search” to the algorithms list and documents usage |
| `docs/api/selectors.md` | Adds API doc entry for `StreamingRandomSearchModelSelector` |
| `README.md` | Adds `streaming_random` method to the selector table and an example for streaming brute force |
```python
if combo_unchanged and improvement < self._CONVERGENCE_DELTA:
```
In _update_convergence_state, improvement = current_acc - self._best_accuracy and the stability check uses improvement < _CONVERGENCE_DELTA. If accuracy drops significantly (negative improvement), this still counts as “stable” and can trigger convergence even though performance changed materially. Consider using an absolute delta (abs(current_acc - self._best_accuracy) < ...) or tracking stability relative to a running best/EMA, so large regressions reset patience.
Suggested change:

```diff
-if combo_unchanged and improvement < self._CONVERGENCE_DELTA:
+delta = abs(improvement)
+if combo_unchanged and delta < self._CONVERGENCE_DELTA:
```
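The regression case can be shown with a standalone sketch. The function names and the `CONVERGENCE_DELTA` value below are illustrative, not the actual class internals:

```python
# Illustrative constant; the real threshold lives on the selector class.
CONVERGENCE_DELTA = 0.01

def is_stable(current_acc, best_accuracy, combo_unchanged=True):
    """Original check: any non-improvement counts as 'stable'."""
    improvement = current_acc - best_accuracy
    return combo_unchanged and improvement < CONVERGENCE_DELTA

def is_stable_fixed(current_acc, best_accuracy, combo_unchanged=True):
    """Absolute-delta check: a large drop is a change, not stability."""
    return combo_unchanged and abs(current_acc - best_accuracy) < CONVERGENCE_DELTA

# Accuracy falls from 0.90 to 0.60: the original check still reports "stable",
# so the patience counter advances and convergence can fire through a regression.
print(is_stable(0.60, 0.90))        # True
print(is_stable_fixed(0.60, 0.90))  # False
```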
```python
if combo_unchanged and improvement < self._CONVERGENCE_DELTA:
    self._stable_batches += 1
else:
    self._stable_batches = 0
```
Convergence logic uses improvement = current_acc - self._best_accuracy and treats improvement < _CONVERGENCE_DELTA as “stable”. Large negative changes (accuracy drops) will still be counted as stable and can trigger convergence prematurely. Using an absolute delta (abs(current_acc - self._best_accuracy) < ...) or another stability criterion would avoid converging through regressions.
Suggested change:

```diff
-if combo_unchanged and improvement < self._CONVERGENCE_DELTA:
-    self._stable_batches += 1
-else:
-    self._stable_batches = 0
+stable_accuracy = abs(improvement) < self._CONVERGENCE_DELTA
+if combo_unchanged and stable_accuracy:
+    self._stable_batches += 1
+else:
+    self._stable_batches = 0
```
```markdown
AgentOpt provides 9 selection algorithms. Choose based on your search space size and evaluation budget.

## At a Glance

| Algorithm | Strategy | Evaluations | Best For |
|:----------|:---------|:------------|:---------|
| [Brute Force](#brute-force) | Exhaustive | All | Small spaces (< 50 combos) |
| [Random Search](#random-search) | Sampling | Configurable fraction | Quick baselines |
| [Streaming Random Search](#streaming-random-search) | Streaming sampling | Incremental | Online / incoming batches |
| [Hill Climbing](#hill-climbing) | Greedy + restarts | Guided neighbors | Medium spaces |
```
The guide now says “AgentOpt provides 9 selection algorithms”, but this PR introduces two new selectors (StreamingRandomSearchModelSelector and StreamingBruteForceModelSelector). Either document both streaming algorithms here (and update the count/table accordingly), or explicitly clarify that streaming brute-force is considered part of “Brute Force” rather than a separate algorithm (and add a note/section so users can discover it).
```python
if self._converged:
    print(
        "\nStreaming random search converged; skipping new batch. "
        "Current best combo is stable."
    )
    return self.results()
```
update() returns early once _converged is set, preventing callers from continuing to incorporate new batches (even if drift occurs). Since should_continue() already exists, consider letting update() always process the batch (or provide an explicit override) and keep convergence as an advisory signal rather than a hard stop.
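One way to keep convergence advisory is an explicit override flag on `update()`. The sketch below uses stub class and attribute names for illustration only; it is not AgentOpt's actual implementation:

```python
import warnings

class StreamingSelectorSketch:
    """Minimal sketch: convergence warns but does not block new batches."""

    def __init__(self):
        self._converged = False
        self.batches_processed = 0

    def should_continue(self):
        # Advisory signal the caller can poll to decide when to stop feeding data.
        return not self._converged

    def update(self, batch, *, ignore_convergence=True):
        if self._converged and not ignore_convergence:
            return self.results()  # caller explicitly opted into the hard stop
        if self._converged:
            warnings.warn("selector has converged; processing batch anyway")
        self.batches_processed += 1  # always incorporate new data (handles drift)
        return self.results()

    def results(self):
        return {"batches": self.batches_processed}
```

With this shape, a caller that trusts convergence checks `should_continue()` itself, while a caller worried about drift keeps calling `update()` and the selector never silently drops a batch.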
```python
def test_streaming_update_accumulates_and_tracks_best():
    selector = StreamingBruteForceModelSelector(
        agent_fn=_agent_fn,
        models={"agent": ["good", "bad"]},
        eval_fn=_eval_fn,
        dataset=[({"x": 0}, 0)],
    )
```
This test constructs StreamingBruteForceModelSelector(agent_fn=...), but the selector currently inherits BaseModelSelector whose constructor requires agent (and does not accept agent_fn). As written, instantiating the selector will raise a TypeError before these assertions run. Once the selector’s constructor is aligned with the base agent interface, update this test to match the supported API.
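A minimal reproduction of the mismatch, using stub signatures rather than the real classes: a subclass that is handed `agent_fn=` while its inherited constructor only accepts `agent=` fails at construction time, before any assertion runs.

```python
class BaseStub:
    """Stand-in for a base constructor that accepts agent=, not agent_fn=."""

    def __init__(self, agent, models, eval_fn, dataset):
        self.agent = agent

class StreamingStub(BaseStub):
    pass  # inherits BaseStub.__init__ unchanged

try:
    StreamingStub(agent_fn=lambda x: x, models={}, eval_fn=len, dataset=[])
except TypeError as exc:
    print(exc)  # unexpected keyword argument 'agent_fn'
```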
```python
def _run_selection(
    self, parallel: bool = False, max_concurrent: int = 20,
) -> SelectionResults:
    # Keep select_best behavior: evaluate the provided dataset once.
    if not self._seed_consumed:
        result = self.update(self.dataset, parallel=parallel, max_concurrent=max_concurrent)
        self._seed_consumed = True
        return result
    return self.results()
```
_run_selection() is used via BaseModelSelector.select_best(), which always calls self._tracker.stop() afterward. For streaming selectors, users will typically call select_best() once (warm start) and then call update() repeatedly; after the first select_best(), the tracker will be stopped and subsequent updates won’t record token usage / cache hits. Consider overriding select_best() in streaming selectors (or restarting the tracker at the start of update()) so streaming updates continue to be tracked.
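The "re-arm the tracker in `update()`" option can be sketched as follows. The tracker and selector below are stubs with invented names, shown only to illustrate the control flow, not AgentOpt's API:

```python
class TrackerStub:
    """Stand-in for a usage tracker with start/stop semantics."""

    def __init__(self):
        self.running = False
        self.events = 0

    def start(self):
        self.running = True

    def stop(self):
        self.running = False

    def record(self):
        if self.running:
            self.events += 1  # silently dropped when stopped

class TrackedStreamingSelector:
    def __init__(self):
        self._tracker = TrackerStub()

    def select_best(self):
        self._tracker.start()
        self._tracker.record()  # warm-start evaluation
        self._tracker.stop()    # base-class behavior: always stops on return

    def update(self, batch):
        if not self._tracker.running:
            self._tracker.start()  # re-arm tracking for streaming updates
        self._tracker.record()
```

Without the re-arm in `update()`, every `record()` after the first `select_best()` would be dropped, which is exactly the lost token-usage / cache-hit accounting described above.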
```python
self._all_combos: List[Dict[str, ModelCandidate]] = self._all_combos()
total = len(self._all_combos)
```
This assigns self._all_combos to a list, which shadows the inherited BaseModelSelector._all_combos() method on the instance. That can break any later call to self._all_combos() (TypeError). Prefer a different attribute name (e.g., self._all_combos_list / self._combos).
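A self-contained demonstration of the shadowing bug (simplified class, illustrative names): assigning the result of `self._all_combos()` to an attribute of the same name makes any later call a `TypeError`, because instance attributes take precedence over class methods on lookup.

```python
class Selector:
    def _all_combos(self):
        return [{"agent": "good"}, {"agent": "bad"}]

    def setup_buggy(self):
        self._all_combos = self._all_combos()  # instance attr now shadows the method
        self._all_combos()                     # TypeError: 'list' object is not callable

    def setup_fixed(self):
        self._combos = self._all_combos()      # distinct name; method stays intact
        return len(self._all_combos())         # still callable

try:
    Selector().setup_buggy()
except TypeError as exc:
    print(exc)  # 'list' object is not callable

print(Selector().setup_fixed())  # 2
```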
```python
    dataset=warm_start_dataset,  # seed batch
    sample_fraction=0.25,
    seed=42,
)

selector.select_best()  # evaluate warm start once
selector.update(stream_batch_1)  # keep updating online
```
The example uses selector.select_best() and then calls selector.update(...). Note that select_best() (via BaseModelSelector) stops the tracker on return, so subsequent streaming updates won’t record token usage / cache hits unless the selector restarts the tracker. Either adjust the example to warm-start via update(warm_start_dataset) instead, or update the streaming selector implementation to keep tracking active across select_best() + update().
Suggested change:

```diff
-    dataset=warm_start_dataset,  # seed batch
     sample_fraction=0.25,
     seed=42,
 )
-selector.select_best()  # evaluate warm start once
-selector.update(stream_batch_1)  # keep updating online
+# warm-start with an initial batch
+selector.update(warm_start_dataset)
+# then keep updating online as new batches arrive
+selector.update(stream_batch_1)
```
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Addresses #3