README.md
AgentOpt works with any LLM framework that uses `httpx` under the hood.
AgentOpt includes a rich set of selection algorithms. Advanced users may get significant speedups by choosing the right method for their use case. See the [documentation](https://agentoptimizer.github.io/agentopt/) and [advanced_selection_example.py](examples/advanced_selection_example.py) for details.
If you do not strictly need the best model combination and want **lower search cost**, `epsilon_lucb` is often a good choice: it stops once an **ε-optimal** arm is found (tune `epsilon` to trade off how close to optimal you need to be against how many runs you spend).
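As a concrete illustration of this stopping rule, here is a self-contained pure-Python sketch of the LUCB idea (not AgentOpt's implementation; the arms and their accuracies are hypothetical): sample the empirical leader and its closest challenger until the challenger's upper confidence bound drops within `epsilon` of the leader's lower bound.

```python
import math
import random


def epsilon_lucb(pull, n_arms, epsilon=0.05, delta=0.05, max_pulls=100_000):
    """Illustrative epsilon-LUCB sketch: return an arm whose true mean is
    within epsilon of the best, with probability at least 1 - delta."""
    counts = [1] * n_arms
    sums = [pull(a) for a in range(n_arms)]  # one initial pull per arm
    total = n_arms
    while total < max_pulls:
        means = [s / c for s, c in zip(sums, counts)]
        rad = [math.sqrt(math.log(4 * total * total * n_arms / delta) / (2 * c))
               for c in counts]
        leader = max(range(n_arms), key=lambda a: means[a])
        challenger = max((a for a in range(n_arms) if a != leader),
                         key=lambda a: means[a] + rad[a])
        # Stop once the best challenger cannot beat the leader by more than epsilon.
        if means[challenger] + rad[challenger] <= means[leader] - rad[leader] + epsilon:
            return leader, total
        for a in (leader, challenger):  # pull both, LUCB-style
            sums[a] += pull(a)
            counts[a] += 1
            total += 1
    return leader, total


random.seed(0)
true_means = [0.55, 0.5, 0.8, 0.78]  # hypothetical per-combination accuracies
arm, cost = epsilon_lucb(lambda a: float(random.random() < true_means[a]),
                         n_arms=4, epsilon=0.05)
print(arm, cost)  # an arm within 0.05 of the best, well under the pull budget
```

Note how the last two arms are only 0.02 apart: with `epsilon=0.05` the search can stop as soon as either is certified ε-optimal, instead of paying to separate them exactly.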
|`method=`| Best for | How it works |
|-----------|----------|-------------|
|`"auto"` (default) | General use | Automatically finds the best combination (wired to `arm_elimination` — strong best-arm identification with lower search cost than `brute_force`) |
|`"brute_force"`| Small search spaces | Evaluates all combinations |
|`"random"`| Quick exploration | Samples a random fraction |
|`"hill_climbing"`| Topology-aware search | Greedy search using model quality/speed rankings |
|`"epsilon_lucb"`| Extra search cost savings when ε-optimal is enough | Bandit; stops when an epsilon-optimal best arm is identified |
|`"threshold"`| Thresholding objectives | Bandit; determines whether each combination is above/below a user-defined `threshold` on the performance metric (e.g., mean accuracy) |
|`"lm_proposal"`| LLM-guided search | Uses a proposer LLM to shortlist promising combinations |
|`"bayesian"`| Expensive evaluations | GP-based Bayesian optimization over categorical model choices; uses correlation between combinations (requires `pip install "agentopt-py[bayesian]"`) |
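For intuition on how `arm_elimination` saves budget relative to `brute_force`, here is a minimal pure-Python sketch of successive elimination (the textbook algorithm family the name refers to, not AgentOpt's implementation; the per-arm accuracies are hypothetical): pull every surviving arm once per round, then drop any arm whose upper confidence bound falls below the best arm's lower bound.

```python
import math
import random


def successive_elimination(pull, n_arms, delta=0.05, rounds=5000):
    """Illustrative successive elimination sketch: keep only arms whose
    upper confidence bound stays above the best lower confidence bound."""
    alive = set(range(n_arms))
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    for t in range(1, rounds + 1):
        for a in alive:  # one pull per surviving arm each round
            sums[a] += pull(a)
            counts[a] += 1
        rad = math.sqrt(math.log(4 * n_arms * t * t / delta) / (2 * t))
        means = {a: sums[a] / counts[a] for a in alive}
        best_lcb = max(means.values()) - rad
        alive = {a for a in alive if means[a] + rad >= best_lcb}
        if len(alive) == 1:
            break
    total = sum(counts)
    return max(alive, key=lambda a: sums[a] / counts[a]), total


random.seed(1)
true_means = [0.45, 0.6, 0.85, 0.55, 0.3, 0.7]  # hypothetical accuracies
best, cost = successive_elimination(
    lambda a: float(random.random() < true_means[a]), n_arms=6)
print(best, cost)  # identifies the best arm using far fewer than 6 * 5000 pulls
```

Clearly bad arms are eliminated after a few hundred pulls, so most of the budget goes to separating the top contenders, which is where the savings over exhaustive evaluation come from.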
Nearly identical accuracy to exhaustive search, at roughly half the search cost. These algorithms don't just save budget. They find the right combo with statistical guarantees.
## Empirical Validation
We validated AgentOpt across four diverse benchmarks using 9 models on Amazon Bedrock.
</tbody>
</table>
*Format: obtained accuracy / search cost savings vs brute force. Averaged over 50 seeds. <span class="ao-efficient">Green</span> = within 5% of brute force accuracy AND >50% savings on that metric.*
Arm Elimination is the consistent winner: it achieves near-brute-force accuracy across all four benchmarks while using 40-60% less budget. LM Proposal (asking GPT-4.1 to predict the best combo) matches brute force on GPQA (where the answer is intuitive) but collapses to 34% on HotpotQA and 45% on BFCL. It can't predict that Ministral outperforms Opus as a planner.
| Parameter | Default | Description |
|-----------|---------|-------------|
|`confidence`|`1.0`| Confidence level for bound computation |
!!! success "When to use"
    When finding the *exact* best combo isn't necessary and you can tolerate a small accuracy gap (epsilon) in exchange for significant search cost savings. Particularly effective when many combos are close in performance.
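For intuition, stopping rules of this kind are typically built on Hoeffding-style confidence intervals. As a sketch of the general technique (not necessarily the exact bound AgentOpt computes): after $n$ evaluations of a combination with empirical mean $\hat\mu$, the true mean lies within a radius

$$
r(n, \delta) = \sqrt{\frac{\ln(2/\delta)}{2n}}
$$

of $\hat\mu$ with probability at least $1 - \delta$, and the search stops once every competing combination's upper bound $\hat\mu + r$ falls within $\epsilon$ of the leader's lower bound $\hat\mu - r$.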