Commit b989cf3

QianJaneXie and Qian Xie authored
Change "evaluation cost" to "search cost" (#77)
* distinguish "per-query cost" with "cumulative evaluation cost"
* change "evaluation cost" to "search cost"

---------

Co-authored-by: Qian Xie <90579251+JaneQianXie@users.noreply.github.com>
1 parent 22cb4f8 commit b989cf3

6 files changed

Lines changed: 12 additions & 12 deletions

README.md

Lines changed: 3 additions & 3 deletions
@@ -167,16 +167,16 @@ AgentOpt works with any LLM framework that uses `httpx` under the hood. Here we
 
 AgentOpt includes a rich set of selection algorithms. Advanced users may get significant speedups by choosing the right method for their use case. See the [documentation](https://agentoptimizer.github.io/agentopt/) and [advanced_selection_example.py](examples/advanced_selection_example.py) for details.
 
-If you do not need the strict best model combination and want **lower cumulative evaluation cost**, `epsilon_lucb` is often a good choice: it stops once an **ε-optimal** arm is found (tune `epsilon` to trade off how close to optimal you need to be versus how many runs you spend).
+If you do not need the strict best model combination and want **lower search cost**, `epsilon_lucb` is often a good choice: it stops once an **ε-optimal** arm is found (tune `epsilon` to trade off how close to optimal you need to be versus how many runs you spend).
 
 | `method=` | Best for | How it works |
 |-----------|----------|-------------|
-| `"auto"` (default) | General use | Automatically finds the best combination (wired to `arm_elimination` — strong best-arm identification with lower cumulative evaluation cost than `brute_force`) |
+| `"auto"` (default) | General use | Automatically finds the best combination (wired to `arm_elimination` — strong best-arm identification with lower search cost than `brute_force`) |
 | `"brute_force"` | Small search spaces | Evaluates all combinations |
 | `"random"` | Quick exploration | Samples a random fraction |
 | `"hill_climbing"` | Topology-aware search | Greedy search using model quality/speed rankings |
 | `"arm_elimination"` | Best-arm identification | Bandit; eliminates statistically dominated combinations |
-| `"epsilon_lucb"` | Extra cumulative evaluation cost savings when ε-optimal is enough | Bandit; stops when an epsilon-optimal best arm is identified |
+| `"epsilon_lucb"` | Extra search cost savings when ε-optimal is enough | Bandit; stops when an epsilon-optimal best arm is identified |
 | `"threshold"` | Thresholding objectives | Bandit; determines whether each combination is above/below a user-defined `threshold` on the performance metric (e.g., mean accuracy) |
 | `"lm_proposal"` | LLM-guided search | Uses a proposer LLM to shortlist promising combinations |
 | `"bayesian"` | Expensive evaluations | GP-based Bayesian optimization over categorical model choices; uses correlation between combinations (requires `pip install "agentopt-py[bayesian]"`) |

docs/blog/posts/technical-deep-dive.md

Lines changed: 4 additions & 4 deletions
@@ -106,7 +106,7 @@ Arm Elimination works in rounds:
 3. **Grow the batch**: double the datapoints, evaluate only the survivors
 4. **Repeat**: until one combo remains or you run out of data
 
-Bad combos get eliminated early and cheaply. Good combos earn more evaluation budget. The cumulative evaluation cost is far less than brute force.
+Bad combos get eliminated early and cheaply. Good combos earn more search budget. The search cost is far less than brute force.
 
 ### Epsilon-LUCB
 
@@ -130,14 +130,14 @@ The catch: Hill Climbing requires **topology information**. It needs a notion of
 
 Across our four benchmarks, Arm Elimination consistently achieves near-optimal accuracy while using 40-60% less budget than brute force:
 
-| Benchmark | Brute Force Accuracy | Arm Elimination Accuracy | Cumulative cost savings |
+| Benchmark | Brute Force Accuracy | Arm Elimination Accuracy | Search cost savings |
 |-----------|---------------------|------------------------|-------------|
 | HotpotQA | 74.78% | 74.12% | 64% |
 | GPQA | 80.30% | 80.14% | 49% |
 | MathQA | 98.83% | 98.80% | 46% |
 | BFCL | 72.00% | 72.00% | 11% |
 
-Nearly identical accuracy to exhaustive search, at roughly half the cumulative evaluation cost. These algorithms don't just save budget. They find the right combo with statistical guarantees.
+Nearly identical accuracy to exhaustive search, at roughly half the search cost. These algorithms don't just save budget. They find the right combo with statistical guarantees.
 
 ## Empirical Validation
 
@@ -177,7 +177,7 @@ We validated AgentOpt across four diverse benchmarks using 9 models on Amazon Be
 </tbody>
 </table>
 
-*Format: obtained accuracy / cumulative evaluation cost savings vs brute force. Averaged over 50 seeds. <span class="ao-efficient">Green</span> = within 5% of brute force accuracy AND >50% savings on that metric.*
+*Format: obtained accuracy / search cost savings vs brute force. Averaged over 50 seeds. <span class="ao-efficient">Green</span> = within 5% of brute force accuracy AND >50% savings on that metric.*
 
 Arm Elimination is the consistent winner: it achieves near-brute-force accuracy across all four benchmarks while using 40-60% less budget. LM Proposal (asking GPT-4.1 to predict the best combo) matches brute force on GPQA (where the answer is intuitive) but collapses to 34% on HotpotQA and 45% on BFCL. It can't predict that Ministral outperforms Opus as a planner.
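
Note on the hunk above: the diff only shows steps 3 and 4 of the Arm Elimination rounds (grow the batch, repeat). The sketch below is one way those rounds could look, not AgentOpt's implementation. The function name `arm_elimination`, the `score` callback, the `initial_batch` and `confidence` parameters, and the elimination rule (drop a combo once its upper confidence bound falls below the best lower bound, with a simple `confidence / sqrt(n)` radius) are all assumptions made for illustration. "Search cost" is counted here as the number of (combo, datapoint) evaluations.

```python
from typing import Callable, Dict, List, Sequence
import math


def arm_elimination(
    combos: Sequence[str],
    dataset: Sequence[int],
    score: Callable[[str, int], float],
    initial_batch: int = 4,
    confidence: float = 1.0,
) -> Dict[str, object]:
    """Illustrative successive elimination with a doubling batch size."""
    survivors: List[str] = list(combos)
    totals = {c: 0.0 for c in combos}  # running sum of scores per combo
    counts = {c: 0 for c in combos}    # datapoints evaluated per combo
    evaluations = 0                    # search cost in (combo, datapoint) runs
    start, batch = 0, initial_batch

    # Repeat until one combo remains or the data runs out.
    while len(survivors) > 1 and start < len(dataset):
        chunk = dataset[start:start + batch]
        for c in survivors:            # evaluate only the survivors
            for x in chunk:
                totals[c] += score(c, x)
                counts[c] += 1
                evaluations += 1
        start += len(chunk)

        # Assumed confidence radius; it shrinks as more data is seen.
        mean = {c: totals[c] / counts[c] for c in survivors}
        radius = {c: confidence / math.sqrt(counts[c]) for c in survivors}
        best_lower = max(mean[c] - radius[c] for c in survivors)

        # Keep combos whose upper bound still reaches the best lower bound.
        survivors = [c for c in survivors if mean[c] + radius[c] >= best_lower]
        batch *= 2                     # grow the batch

    best = max(survivors, key=lambda c: totals[c] / counts[c] if counts[c] else 0.0)
    return {"best": best, "survivors": survivors, "evaluations": evaluations}


if __name__ == "__main__":
    # Hypothetical noise-free scores, used only to show the call shape.
    quality = {"a": 0.9, "b": 0.5, "c": 0.1}
    result = arm_elimination(list(quality), list(range(64)), lambda c, x: quality[c])
    print(result["best"], result["evaluations"], "vs brute force", 3 * 64)
```

Under these assumptions, weak combos stop consuming evaluations after the first round or two, which is where the search cost savings relative to brute force come from.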

docs/concepts/algorithms.md

Lines changed: 1 addition & 1 deletion
@@ -161,7 +161,7 @@ selector = EpsilonLUCBModelSelector(
 | `confidence` | `1.0` | Confidence level for bound computation |
 
 !!! success "When to use"
-When finding the *exact* best combo isn't necessary and you can tolerate a small accuracy gap (epsilon) in exchange for significant cost savings. Particularly effective when many combos are close in performance.
+When finding the *exact* best combo isn't necessary and you can tolerate a small accuracy gap (epsilon) in exchange for significant search cost savings. Particularly effective when many combos are close in performance.
 
 ---
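
Note on the hunk above: the docs say Epsilon-LUCB stops once an ε-optimal arm is identified, but the diff does not show the stopping rule. The sketch below uses the textbook LUCB-style rule as a stand-in and is not taken from `EpsilonLUCBModelSelector`. The function name `epsilon_lucb`, the `pull` callback, the `max_pulls` cap, and the `confidence / sqrt(n)` radius are assumptions; only the parameter names `epsilon` and `confidence` appear in the source.

```python
from typing import Callable, Dict, Sequence
import math


def epsilon_lucb(
    combos: Sequence[str],
    pull: Callable[[str], float],
    epsilon: float = 0.05,
    confidence: float = 1.0,
    max_pulls: int = 10_000,
) -> Dict[str, object]:
    """Illustrative LUCB-style loop that stops at an epsilon-optimal combo."""
    totals = {c: 0.0 for c in combos}  # running sum of scores per combo
    counts = {c: 0 for c in combos}    # evaluations spent per combo
    pulls = 0                          # search cost in single evaluations

    def sample(c: str) -> None:
        nonlocal pulls
        totals[c] += pull(c)
        counts[c] += 1
        pulls += 1

    for c in combos:                   # one initial evaluation per combo
        sample(c)

    while len(combos) > 1 and pulls < max_pulls:
        mean = {c: totals[c] / counts[c] for c in combos}
        radius = {c: confidence / math.sqrt(counts[c]) for c in combos}  # assumed form
        leader = max(combos, key=lambda c: mean[c])
        challenger = max(
            (c for c in combos if c != leader),
            key=lambda c: mean[c] + radius[c],
        )
        # Stop when even the most optimistic challenger cannot beat the
        # leader's pessimistic estimate by more than epsilon.
        gap = (mean[challenger] + radius[challenger]) - (mean[leader] - radius[leader])
        if gap <= epsilon:
            break
        sample(leader)
        sample(challenger)

    best = max(combos, key=lambda c: totals[c] / counts[c])
    return {"best": best, "pulls": pulls}


if __name__ == "__main__":
    # Hypothetical noise-free scores, used only to show the call shape.
    quality = {"a": 0.9, "b": 0.5, "c": 0.1}
    print(epsilon_lucb(list(quality), lambda c: quality[c], epsilon=0.1))
```

In this sketch a larger `epsilon` lets the loop stop earlier, which matches the trade-off the docs describe between closeness to optimal and the number of runs spent.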

docs/index.md

Lines changed: 1 addition & 1 deletion
@@ -165,7 +165,7 @@ AgentOpt patches `httpx` at the transport level — the same HTTP library used b
 | **Random Search** | Random sampling | Quick baselines |
 | **Hill Climbing** | Greedy + restarts | Medium spaces with model topology |
 | **Arm Elimination** | Progressive pruning | Statistical early stopping |
-| **Epsilon LUCB** | ε-optimal best arm | Extra cost savings when ε-optimal is enough |
+| **Epsilon LUCB** | ε-optimal best arm | Extra search cost savings when ε-optimal is enough |
 | **Threshold SE** | Threshold classification | Filtering combos above/below a performance target |
 | **LM Proposal** | LLM-guided shortlist | Leveraging model knowledge |
 | **Bayesian Optimization** | Gaussian Process | Expensive evaluations |

examples/advanced_selection_example.py

Lines changed: 2 additions & 2 deletions
@@ -5,7 +5,7 @@
 all use method="brute_force" for simplicity. This example demonstrates the other
 selection algorithms available via ModelSelector(method=...). The default
 method="auto" automatically finds the best combination (wired to arm_elimination —
-strong best-arm identification with lower evaluation cost than brute_force).
+strong best-arm identification with lower search cost than brute_force).
 
 Prerequisites:
 1. pip install openai agentopt-py
@@ -210,7 +210,7 @@ def run_bayesian():
 formatter_class=argparse.RawDescriptionHelpFormatter,
 epilog="""
 Available methods:
-auto Automatically finds the best combination (wired to arm_elimination; lower evaluation cost than brute_force) (default)
+auto Automatically finds the best combination (wired to arm_elimination; lower search cost than brute_force) (default)
 random Evaluate a random subset of combinations
 hill_climbing Greedy search using model quality/speed rankings
 arm_elimination Eliminate statistically dominated combinations early

src/agentopt/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -60,7 +60,7 @@ def ModelSelector(
 dataset: List of ``(input_data, expected_output)`` pairs.
 method: Selection algorithm. ``"auto"`` (default) automatically finds
 the best combination (same implementation as ``"arm_elimination"`` —
-strong best-arm identification with lower evaluation cost than
+strong best-arm identification with lower search cost than
 ``"brute_force"``). Other options: ``"brute_force"``,
 ``"random"``, ``"hill_climbing"``, ``"arm_elimination"``,
 ``"epsilon_lucb"``, ``"threshold"``, ``"lm_proposal"``,
