From 6b5e45b080a5a405e9a07eac2f0ded933ce10c6b Mon Sep 17 00:00:00 2001
From: Richard Abrich
Date: Thu, 29 Jan 2026 13:43:18 -0500
Subject: [PATCH 1/3] docs: replace aspirational claims with honest placeholders

- Remove unvalidated badges (95%+ success rate, 67% cost savings)
- Add "First open-source WAA reproduction" as headline
- Move WAA to top as main feature with status indicator
- Change "Recent Improvements" to "Roadmap (In Progress)"
- Remove v0.2.0 version references (current is v0.1.1)
- Add Azure quota requirements note for parallelization
- Mark features as [IN PROGRESS] where appropriate

Co-Authored-By: Claude Opus 4.5
---
 README.md | 59 +++++++++++++++++++++++++++++++------------------------
 1 file changed, 33 insertions(+), 26 deletions(-)

diff --git a/README.md b/README.md
index f2c8802..67c4144 100644
--- a/README.md
+++ b/README.md
@@ -5,41 +5,43 @@
 [![Downloads](https://img.shields.io/pypi/dm/openadapt-evals.svg)](https://pypi.org/project/openadapt-evals/)
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
 [![Python 3.10+](https://img.shields.io/badge/python-3.10%2B-blue)](https://www.python.org/downloads/)
-[![Azure Success Rate](https://img.shields.io/badge/Azure%20Success%20Rate-95%25%2B-success)](https://github.com/OpenAdaptAI/openadapt-evals)
-[![Cost Savings](https://img.shields.io/badge/Cost%20Savings-67%25-brightgreen)](https://github.com/OpenAdaptAI/openadapt-evals/blob/main/COST_OPTIMIZATION.md)
-
-Evaluation infrastructure for GUI agent benchmarks.
+Evaluation infrastructure for GUI agent benchmarks. **First open-source WAA (Windows Agent Arena) reproduction.**
 
 ## Overview
 
 `openadapt-evals` provides a unified framework for evaluating GUI automation agents across standardized benchmarks like Windows Agent Arena (WAA), OSWorld, WebArena, and others.
 
-## Recent Improvements
+## Windows Agent Arena (WAA) - Headline Feature
+
+> **Status**: Actively running full 154-task evaluation. Results coming soon.
+
+This is the **first open-source reproduction** of the [Windows Agent Arena](https://github.com/microsoft/WindowsAgentArena) benchmark, enabling:
+- Reproducible baseline measurements for GUI agents
+- Side-by-side model comparison (GPT-4o, Claude, etc.)
+- Per-domain breakdown of agent capabilities
+
+See the [WAA Benchmark Results](#waa-benchmark-results) section below for current status.
 
-We've made significant improvements to reliability, cost-efficiency, and observability:
+## Roadmap (In Progress)
 
-### Azure Reliability (v0.2.0 - January 2026)
-- **95%+ Success Rate Target**: Fixed nested virtualization issues that caused 0% task completion
-- **VM Configuration**: Upgraded to `Standard_D4s_v5` with proper nested virtualization support
+The following features are under active development:
+
+### Azure Reliability (`[IN PROGRESS]`)
+- **Goal**: 95%+ task completion rate (vs. early issues with 0%)
+- **VM Configuration**: Using `Standard_D8ds_v5` with nested virtualization
 - **Health Monitoring**: Automatic detection and retry of stuck jobs
-- **Fast Failure Detection**: 10-minute timeout instead of 8+ hour hangs
-- See [PR #11](https://github.com/OpenAdaptAI/openadapt-evals/pull/11) for details
-
-### Cost Optimization (v0.2.0 - January 2026)
-- **67% Cost Reduction**: From $7.68 to $2.50 per full evaluation (154 tasks)
-- **Tiered VM Sizing**: Automatic VM size selection based on task complexity (37% savings)
-- **Spot Instance Support**: 70-80% discount on compute costs (64% savings with tiered VMs)
-- **Azure Container Registry**: 10x faster image pulls (1-2 min vs 8-12 min)
-- **Real-time Cost Tracking**: Monitor costs during evaluation
-- See [COST_OPTIMIZATION.md](./COST_OPTIMIZATION.md) and [PR #13](https://github.com/OpenAdaptAI/openadapt-evals/pull/13) for details
-
-### Screenshot Validation & Viewer (v0.2.0 - January 2026)
-- **Real Benchmark Screenshots**: Viewer now displays actual WAA evaluation screenshots
+
+### Cost Optimization (`[IN PROGRESS]`)
+- **Goal**: Reduce per-evaluation cost from ~$7.68 to ~$2.50 (154 tasks)
+- **Tiered VM Sizing**: Match VM size to task complexity
+- **Spot Instance Support**: Use preemptible VMs for 70-80% discount
+- See [COST_OPTIMIZATION.md](./COST_OPTIMIZATION.md) for design
+
+### Benchmark Viewer (Available)
+- **Real Benchmark Screenshots**: Viewer displays actual WAA evaluation screenshots
 - **Auto-Screenshot Tool**: Automated screenshot generation with Playwright
-- **Screenshot Validation**: Manifest-based validation ensuring correctness
 - **Execution Logs**: Step-by-step logs with search and filtering
-- **Live Monitoring**: Real-time Azure ML job monitoring with auto-refresh
-- See [PR #6](https://github.com/OpenAdaptAI/openadapt-evals/pull/6) for details
+- **Live Monitoring**: Real-time progress tracking
 
 ## Installation
 
@@ -318,6 +320,11 @@ results = evaluate_agent_on_benchmark(agent, adapter, task_ids=[t.task_id for t
 
 Run WAA at scale using Azure ML compute with optimized costs:
 
+> **⚠️ Quota Requirements**: Parallel evaluation requires sufficient Azure vCPU quota.
+> - Each worker needs a `Standard_D8ds_v5` VM (8 vCPUs)
+> - 10 workers = 80 vCPUs required
+> - Default quota is typically 10 vCPUs - [request an increase](https://learn.microsoft.com/en-us/azure/quotas/quickstart-increase-quota-portal) before running parallel evaluations
+
 ```bash
 # Install Azure dependencies
 pip install openadapt-evals[azure]
@@ -358,7 +365,7 @@ results = orchestrator.run_evaluation(
 )
 ```
 
-**Azure Reliability**: The orchestrator now uses `Standard_D4s_v5` VMs with proper nested virtualization support and automatic health monitoring, achieving 95%+ success rates.
+**Azure Reliability**: The orchestrator uses `Standard_D8ds_v5` VMs with nested virtualization support and automatic health monitoring.
 
 ### Live Monitoring
 

From 18caac1a441cb212a41c110e4ad8e473283aec3c Mon Sep 17 00:00:00 2001
From: Richard Abrich
Date: Thu, 29 Jan 2026 14:06:54 -0500
Subject: [PATCH 2/3] docs: fix inaccurate "first reproduction" claim

WAA is already open-source from Microsoft.
Changed to an accurate claim: "Simplified CLI toolkit for Windows Agent Arena"

Updated value proposition to reflect what we actually provide:
- Azure VM setup and SSH tunnel management
- Agent adapters for Claude/GPT/custom agents
- Results viewer
- Parallelization support

Co-Authored-By: Claude Opus 4.5
---
 README.md | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/README.md b/README.md
index 67c4144..bc556be 100644
--- a/README.md
+++ b/README.md
@@ -5,7 +5,7 @@
 [![Downloads](https://img.shields.io/pypi/dm/openadapt-evals.svg)](https://pypi.org/project/openadapt-evals/)
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
 [![Python 3.10+](https://img.shields.io/badge/python-3.10%2B-blue)](https://www.python.org/downloads/)
-Evaluation infrastructure for GUI agent benchmarks. **First open-source WAA (Windows Agent Arena) reproduction.**
+Evaluation infrastructure for GUI agent benchmarks. **Simplified CLI toolkit for Windows Agent Arena.**
 
 ## Overview
 
@@ -15,10 +15,11 @@ Evaluation infrastructure for GUI agent benchmarks. **First open-source WAA (Win
 
 > **Status**: Actively running full 154-task evaluation. Results coming soon.
 
-This is the **first open-source reproduction** of the [Windows Agent Arena](https://github.com/microsoft/WindowsAgentArena) benchmark, enabling:
-- Reproducible baseline measurements for GUI agents
-- Side-by-side model comparison (GPT-4o, Claude, etc.)
-- Per-domain breakdown of agent capabilities
+A **simplified CLI toolkit** for the [Windows Agent Arena](https://github.com/microsoft/WindowsAgentArena) benchmark, providing:
+- Easy Azure VM setup and SSH tunnel management
+- Agent adapters for Claude, GPT-4o, and custom agents
+- Results viewer with per-domain breakdown
+- Parallelization support for faster evaluations
 
 See the [WAA Benchmark Results](#waa-benchmark-results) section below for current status.
 

From 50660a8420ff331a4a5088187d84aba3e74fb61b Mon Sep 17 00:00:00 2001
From: Richard Abrich
Date: Thu, 29 Jan 2026 14:10:20 -0500
Subject: [PATCH 3/3] docs: fix VM size to match code (D4s_v5 not D8ds_v5)

The code uses Standard_D4s_v5 (4 vCPUs) by default, not D8ds_v5.
Updated all references to be accurate.

Co-Authored-By: Claude Opus 4.5
---
 README.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/README.md b/README.md
index bc556be..32357e6 100644
--- a/README.md
+++ b/README.md
@@ -29,7 +29,7 @@ The following features are under active development:
 
 ### Azure Reliability (`[IN PROGRESS]`)
 - **Goal**: 95%+ task completion rate (vs. early issues with 0%)
-- **VM Configuration**: Using `Standard_D8ds_v5` with nested virtualization
+- **VM Configuration**: Using `Standard_D4s_v5` with nested virtualization (configurable)
 - **Health Monitoring**: Automatic detection and retry of stuck jobs
 
 ### Cost Optimization (`[IN PROGRESS]`)
@@ -322,8 +322,8 @@ Run WAA at scale using Azure ML compute with optimized costs:
 
 > **⚠️ Quota Requirements**: Parallel evaluation requires sufficient Azure vCPU quota.
-> - Each worker needs a `Standard_D8ds_v5` VM (8 vCPUs)
-> - 10 workers = 80 vCPUs required
+> - Default VM: `Standard_D4s_v5` (4 vCPUs per worker)
+> - 10 workers = 40 vCPUs required
 > - Default quota is typically 10 vCPUs - [request an increase](https://learn.microsoft.com/en-us/azure/quotas/quickstart-increase-quota-portal) before running parallel evaluations
 
 ```bash
@@ -366,7 +366,7 @@ results = orchestrator.run_evaluation(
 )
 ```
 
-**Azure Reliability**: The orchestrator uses `Standard_D8ds_v5` VMs with nested virtualization support and automatic health monitoring.
+**Azure Reliability**: The orchestrator uses `Standard_D4s_v5` VMs with nested virtualization support and automatic health monitoring.
 
 ### Live Monitoring