README.md
AgentOpt works with any LLM framework that uses `httpx` under the hood.
AgentOpt includes a rich set of selection algorithms. Advanced users may get significant speedups by choosing the right method for their use case. See the [documentation](https://agentoptimizer.github.io/agentopt/) and [advanced_selection_example.py](examples/advanced_selection_example.py) for details.
If you do not strictly need the best model combination and want **lower search cost**, `epsilon_lucb` is often a good choice: it stops once an **ε-optimal** arm is found (tune `epsilon` to trade off how close to optimal you need to be against how many runs you spend).
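As a concrete illustration of this stopping rule, here is a self-contained pure-Python sketch of the LUCB idea (not AgentOpt's implementation; the arms and their accuracies are hypothetical): sample the empirical leader and its closest challenger until the challenger's upper confidence bound drops within `epsilon` of the leader's lower bound.

```python
import math
import random


def epsilon_lucb(pull, n_arms, epsilon=0.05, delta=0.05, max_pulls=100_000):
    """Illustrative epsilon-LUCB sketch: return an arm whose true mean is
    within epsilon of the best, with probability at least 1 - delta."""
    counts = [1] * n_arms
    sums = [pull(a) for a in range(n_arms)]  # one initial pull per arm
    total = n_arms
    while total < max_pulls:
        means = [s / c for s, c in zip(sums, counts)]
        rad = [math.sqrt(math.log(4 * total * total * n_arms / delta) / (2 * c))
               for c in counts]
        leader = max(range(n_arms), key=lambda a: means[a])
        challenger = max((a for a in range(n_arms) if a != leader),
                         key=lambda a: means[a] + rad[a])
        # Stop once the best challenger cannot beat the leader by more than epsilon.
        if means[challenger] + rad[challenger] <= means[leader] - rad[leader] + epsilon:
            return leader, total
        for a in (leader, challenger):  # pull both, LUCB-style
            sums[a] += pull(a)
            counts[a] += 1
            total += 1
    return leader, total


random.seed(0)
true_means = [0.55, 0.5, 0.8, 0.78]  # hypothetical per-combination accuracies
arm, cost = epsilon_lucb(lambda a: float(random.random() < true_means[a]),
                         n_arms=4, epsilon=0.05)
print(arm, cost)  # an arm within 0.05 of the best, well under the pull budget
```

Note how the last two arms are only 0.02 apart: with `epsilon=0.05` the search can stop as soon as either is certified ε-optimal, instead of paying to separate them exactly.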
|`method=`| Best for | How it works |
|-----------|----------|-------------|
|`"auto"` (default) | General use | Automatically finds the best combination (wired to `arm_elimination` — strong best-arm identification with lower search cost than `brute_force`) |
|`"brute_force"`| Small search spaces | Evaluates all combinations |
|`"random"`| Quick exploration | Samples a random fraction |
|`"hill_climbing"`| Topology-aware search | Greedy search using model quality/speed rankings |
|`"epsilon_lucb"`| Extra search cost savings when ε-optimal is enough | Bandit; stops when an epsilon-optimal best arm is identified |
|`"threshold"`| Thresholding objectives | Bandit; determines whether each combination is above/below a user-defined `threshold` on the performance metric (e.g., mean accuracy) |
|`"lm_proposal"`| LLM-guided search | Uses a proposer LLM to shortlist promising combinations |
|`"bayesian"`| Expensive evaluations | GP-based Bayesian optimization over categorical model choices; uses correlation between combinations (requires `pip install "agentopt-py[bayesian]"`) |
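For intuition on how `arm_elimination` saves budget relative to `brute_force`, here is a minimal pure-Python sketch of successive elimination (the textbook algorithm family the name refers to, not AgentOpt's implementation; the per-arm accuracies are hypothetical): pull every surviving arm once per round, then drop any arm whose upper confidence bound falls below the best arm's lower bound.

```python
import math
import random


def successive_elimination(pull, n_arms, delta=0.05, rounds=5000):
    """Illustrative successive elimination sketch: keep only arms whose
    upper confidence bound stays above the best lower confidence bound."""
    alive = set(range(n_arms))
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(1, rounds + 1):
        for a in alive:  # one pull per surviving arm each round
            sums[a] += pull(a)
            counts[a] += 1
        rad = math.sqrt(math.log(4 * n_arms * t * t / delta) / (2 * t))
        means = {a: sums[a] / counts[a] for a in alive}
        best_lcb = max(means.values()) - rad
        alive = {a for a in alive if means[a] + rad >= best_lcb}
        if len(alive) == 1:
            break
    total = sum(counts)
    return max(alive, key=lambda a: sums[a] / counts[a]), total


random.seed(1)
true_means = [0.45, 0.6, 0.85, 0.55, 0.3, 0.7]  # hypothetical accuracies
best, cost = successive_elimination(
    lambda a: float(random.random() < true_means[a]), n_arms=6)
print(best, cost)  # identifies the best arm using far fewer than 6 * 5000 pulls
```

Clearly bad arms are eliminated after a few hundred pulls, so most of the budget goes to separating the top contenders, which is where the savings over exhaustive evaluation come from.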
Nearly identical accuracy to exhaustive search, at roughly half the search cost. These algorithms don't just save budget. They find the right combo with statistical guarantees.
## Empirical Validation
We validated AgentOpt across four diverse benchmarks using 9 models on Amazon Bedrock.
</tbody>
</table>
*Format: obtained accuracy / search cost savings vs brute force. Averaged over 50 seeds. <span class="ao-efficient">Green</span> = within 5% of brute force accuracy AND >50% savings on that metric.*
Arm Elimination is the consistent winner: it achieves near-brute-force accuracy across all four benchmarks while using 40-60% less budget. LM Proposal (asking GPT-4.1 to predict the best combo) matches brute force on GPQA (where the answer is intuitive) but collapses to 34% on HotpotQA and 45% on BFCL. It can't predict that Ministral outperforms Opus as a planner.
| Parameter | Default | Description |
|-----------|---------|-------------|
|`confidence`|`1.0`| Confidence level for bound computation |
!!! success "When to use"
    When finding the *exact* best combo isn't necessary and you can tolerate a small accuracy gap (epsilon) in exchange for significant search cost savings. Particularly effective when many combos are close in performance.
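For intuition, stopping rules of this kind are typically built on Hoeffding-style confidence intervals. As a sketch of the general technique (not necessarily the exact bound AgentOpt computes): after $n$ evaluations of a combination with empirical mean $\hat\mu$, the true mean lies within a radius

$$
r(n, \delta) = \sqrt{\frac{\ln(2/\delta)}{2n}}
$$

of $\hat\mu$ with probability at least $1 - \delta$, and the search stops once every competing combination's upper bound $\hat\mu + r$ falls within $\epsilon$ of the leader's lower bound $\hat\mu - r$.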