
Commit 437bbb7

Wenyueh and claude committed

Fix docs to match package source; update blog post title and remove duplicate table

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

1 parent ca24db4 · commit 437bbb7

5 files changed: 15 additions & 19 deletions

docs/api/results.md

Lines changed: 3 additions & 2 deletions
```diff
@@ -42,7 +42,7 @@ Each evaluated combination produces a `ModelResult`:
 | `latency_seconds` | `float` | Mean latency per datapoint |
 | `input_tokens` | `Dict[str, int]` | Input tokens by model |
 | `output_tokens` | `Dict[str, int]` | Output tokens by model |
-| `estimated_price` | `float` | Estimated cost in USD |
+| `price` | `float` (property) | Per-sample cost in USD, or `None` if pricing unavailable |
 | `is_best` | `bool` | Whether this is the top-ranked combination |
 | `datapoint_results` | `List[DatapointResult]` | Per-datapoint breakdown |
 
@@ -55,6 +55,7 @@ Per-datapoint evaluation detail:
 | Field | Type | Description |
 |:------|:-----|:------------|
 | `datapoint_index` | `int` | Index in the dataset |
-| `datapoint_id` | `str` | Unique identifier |
 | `score` | `float` | Eval score for this datapoint |
 | `latency_seconds` | `float` | Latency for this datapoint |
+| `input_tokens` | `Dict[str, int]` | Input tokens by model |
+| `output_tokens` | `Dict[str, int]` | Output tokens by model |
```
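The renamed `price` field is derived from the token-count fields documented above. A minimal sketch of how such a per-sample figure could be computed (our own illustration, not the package's implementation; the `estimate_price` helper and the USD-per-token assumption are ours), reusing the `model_prices` shape documented for selectors:

```python
# Hypothetical sketch: per-sample cost from per-model token counts, using the
# documented pricing shape {"model": {"input_price": x, "output_price": y}}.
# Prices are assumed to be USD per token. Returns None if any model used in
# the run has no pricing entry, mirroring `price` being None when pricing is
# unavailable.
def estimate_price(input_tokens, output_tokens, model_prices, n_datapoints):
    total = 0.0
    for model, n_in in input_tokens.items():
        pricing = model_prices.get(model)
        if pricing is None:
            return None  # pricing unavailable for this model
        total += n_in * pricing["input_price"]
        total += output_tokens.get(model, 0) * pricing["output_price"]
    return total / n_datapoints  # mean cost per datapoint

prices = {"m1": {"input_price": 1e-6, "output_price": 2e-6}}
print(estimate_price({"m1": 1_000_000}, {"m1": 500_000}, prices, 10))  # 0.2
```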

docs/api/selectors.md

Lines changed: 0 additions & 1 deletion
```diff
@@ -10,7 +10,6 @@ All selectors share a common constructor interface and the `select_best()` method
 | `models` | `Dict[str, List]` | Maps node names to candidate model lists |
 | `eval_fn` | `Callable` | `(expected, actual) -> float` score in `[0, 1]` |
 | `dataset` | `List[Tuple]` | `[(input_data, expected_answer), ...]` |
-| `invoke_fn` | `Callable`, optional | Custom `(agent, input) -> result`. Default: `agent(input)` |
 | `model_prices` | `Dict`, optional | Custom pricing: `{"model": {"input_price": x, "output_price": y}}` |
 | `tracker` | `LLMTracker`, optional | Custom tracker instance (e.g., with disk cache) |
 
```
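The `eval_fn` row above only fixes a signature: any callable mapping `(expected, actual)` to a score in `[0, 1]` qualifies. A minimal example of such a scorer (ours, not shipped with the package):

```python
# Example eval_fn matching the documented contract:
# (expected, actual) -> float score in [0, 1].
def exact_match(expected: str, actual: str) -> float:
    """Return 1.0 on a case- and whitespace-insensitive match, else 0.0."""
    return 1.0 if expected.strip().lower() == actual.strip().lower() else 0.0

print(exact_match("Paris", "  paris "))  # 1.0
print(exact_match("Paris", "London"))    # 0.0
```

Graded scorers (e.g. token-level F1) fit the same contract as long as they stay within `[0, 1]`.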

docs/api/tracker.md

Lines changed: 2 additions & 2 deletions
````diff
@@ -12,8 +12,8 @@ from agentopt.proxy import LLMTracker
 
 ```python
 tracker = LLMTracker(
-    cache=True, # Enable response caching (default: True)
-    cache_dir="./llm_cache", # Persist cache to disk (default: None, memory-only)
+    cache=True, # Enable response caching (default: True)
+    cache_dir=".agentopt_cache", # Persist cache to disk (default: ".agentopt_cache")
 )
 ```
 
````
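The corrected defaults say caching is on and persisted under `.agentopt_cache`. As a toy sketch of what disk-backed response caching looks like in general (purely illustrative; `ToyDiskCache` is our invention, not the tracker's real implementation):

```python
import hashlib
import json
from pathlib import Path

class ToyDiskCache:
    """Illustrative disk cache: identical requests hash to the same file."""

    def __init__(self, cache_dir=".agentopt_cache"):
        self.dir = Path(cache_dir)
        self.dir.mkdir(parents=True, exist_ok=True)

    def _path(self, request):
        # Stable key: hash of the canonical JSON form of the request.
        key = hashlib.sha256(
            json.dumps(request, sort_keys=True).encode()
        ).hexdigest()
        return self.dir / f"{key}.json"

    def get_or_call(self, request, call):
        path = self._path(request)
        if path.exists():                # hit: replay the stored response
            return json.loads(path.read_text())
        response = call(request)         # miss: invoke the model once
        path.write_text(json.dumps(response))
        return response
```

The practical consequence of a scheme like this is that re-running a selection over the same dataset pays for each unique request only once.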

docs/blog/posts/technical-deep-dive.md

Lines changed: 2 additions & 8 deletions
```diff
@@ -13,7 +13,7 @@ categories:
   - Model Selection
 ---
 
-# Why Your Agent Needs a Model Optimizer, Not Just a Model
+# Why Your Agent Needs a Model Combo Optimizer, Not Just a Model
 
 *Wenyue Hua\*, Qian Xie, Sripad Karne, Armaan Agrawal, Nikos Pagonas, Kostis Kaffes, Tianyi Peng\**
 
@@ -183,13 +183,7 @@ Arm Elimination is the consistent winner: it achieves near-brute-force accuracy
 
 ### Budget Alternatives
 
-For every benchmark, there exists a combination within 3-5% of the best accuracy that costs 10-100x less:
-
-| Benchmark | Best | Accuracy | Cost | Budget Pick | Accuracy | Cost | Ratio |
-|-----------|------|---------|------|------------|---------|------|-------|
-| HotpotQA | Ministral + Opus | 74.8% | $2.71 | Qwen3 Next + gpt-oss-120b | 71.3% | $0.13 | 21x |
-| MathQA | Opus + Qwen3 Next | 98.8% | $5.89 | Ministral + C3 Haiku | 94.0% | $0.05 | 118x |
-| BFCL | Opus | 72.0% | $60.78 | Qwen3 Next | 71.0% | $1.87 | 32x |
+As shown in the cost-savings table above, for every benchmark there exists a combination within 3-5% of the best accuracy that costs 10-100x less. You don't need the most expensive model to get near-optimal results.
 
 ## Get Started
 
```

docs/concepts/algorithms.md

Lines changed: 8 additions & 6 deletions
````diff
@@ -121,15 +121,14 @@ selector = ArmEliminationModelSelector(
     models=models,
     eval_fn=eval_fn,
     dataset=dataset,
-    n_initial=10,
     growth_factor=2.0,
     confidence=1.0,
 )
 ```
 
 | Parameter | Default | Description |
 |:----------|:--------|:------------|
-| `n_initial` | `10` | Initial batch size (datapoints) |
+| `n_initial` | `None` | Initial batch size. Default: 10% of dataset (`max(1, len(dataset)//10)`) |
 | `growth_factor` | `2.0` | Batch size multiplier per round |
 | `confidence` | `1.0` | Elimination confidence threshold |
 
@@ -205,15 +204,18 @@ selector = LMProposalModelSelector(
     models=models,
     eval_fn=eval_fn,
     dataset=dataset,
-    proposer_model="gpt-4o-mini",
-    max_combinations=12,
+    proposer_model="gpt-4.1",
+    objective="maximize accuracy and then minimize latency and cost",
+    dataset_preview_size=10,
 )
 ```
 
 | Parameter | Default | Description |
 |:----------|:--------|:------------|
-| `proposer_model` | `"gpt-4o-mini"` | Model used for proposal generation |
-| `max_combinations` | `12` | Max combinations to shortlist |
+| `proposer_model` | `"gpt-4.1"` | Model used for proposal generation |
+| `proposer_client` | `None` | Custom OpenAI-compatible client; auto-creates `OpenAI()` if omitted |
+| `objective` | `"maximize accuracy and then minimize latency and cost"` | Natural-language objective passed to the proposer |
+| `dataset_preview_size` | `10` | Number of dataset examples shown to the proposer |
 
 !!! success "When to use"
     When you want to leverage an LLM's knowledge about model capabilities to skip obviously bad combinations.
````
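The corrected `n_initial` default and `growth_factor` together imply a sampling schedule for arm elimination. A sketch of that schedule under the documented formula (how rounds actually consume the dataset, and the `batch_schedule` helper itself, are our assumptions for illustration):

```python
# Batch schedule implied by the documented defaults: when n_initial is None,
# the first batch is max(1, len(dataset) // 10) datapoints, and each round
# multiplies the batch size by growth_factor until the dataset is exhausted.
def batch_schedule(dataset_size, n_initial=None, growth_factor=2.0):
    size = n_initial if n_initial is not None else max(1, dataset_size // 10)
    batches, used = [], 0
    while used < dataset_size:
        take = min(int(size), dataset_size - used)  # last batch may be partial
        batches.append(take)
        used += take
        size *= growth_factor
    return batches

print(batch_schedule(100))  # [10, 20, 40, 30]
```

Weak combinations eliminated after the early, cheap batches never see the larger later ones, which is where the cost savings over brute force come from.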
