2 changes: 2 additions & 0 deletions .beads/issues.jsonl
@@ -1,3 +1,4 @@
{"id":"openadapt-evals-0an","title":"CLI: aws-costs and waa-image delete commands added","notes":"openadapt-evals PR #24: Added aws-costs command, waa-image delete action, changed default to Docker Hub","status":"open","priority":1,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-29T16:17:03.612486-05:00","created_by":"Richard Abrich","updated_at":"2026-01-29T16:17:03.612486-05:00"}
{"id":"openadapt-evals-0dt","title":"Add pre-flight check for Windows install issues","description":"Detect product key prompts or stuck installations BEFORE 10-minute timeout. Check container logs for specific error patterns.","status":"open","priority":1,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T18:57:42.24338-05:00","created_by":"Richard Abrich","updated_at":"2026-01-20T18:57:42.24338-05:00"}
{"id":"openadapt-evals-0ms","title":"Run 20-50 task evaluation","description":"Run WAA benchmark on 20-50 tasks to measure baseline success rate. Target is \u003e80% success rate. This provides quantitative data on agent performance.","notes":"2026-01-29: Azure quota limits parallelization to 2 workers max (10 vCPUs / 4 vCPUs per worker). 10-worker test failed with ClusterCoreQuotaReached. User declined manual portal quota increase. Waiting for api-openai test results before full 154-task run.","status":"open","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T17:44:26.461765-05:00","created_by":"Richard Abrich","updated_at":"2026-01-29T00:28:02.609085-05:00","dependencies":[{"issue_id":"openadapt-evals-0ms","depends_on_id":"openadapt-evals-c3f","type":"blocks","created_at":"2026-01-20T17:44:26.462904-05:00","created_by":"Richard Abrich"}]}
{"id":"openadapt-evals-2ar","title":"Implement permanent fix for Windows unattended install","status":"closed","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T18:59:36.544113-05:00","created_by":"Richard Abrich","updated_at":"2026-01-20T20:32:06.634857-05:00","closed_at":"2026-01-20T20:32:06.634857-05:00","close_reason":"Duplicate of openadapt-evals-b3l"}
@@ -8,5 +9,6 @@
{"id":"openadapt-evals-czj","title":"Docker installation fails on Azure VM - pkgProblemResolver error","description":"vm setup-waa fails to install Docker. Error: pkgProblemResolver::Resolve generated breaks. Need to investigate root cause before attempting fix.","status":"open","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T22:48:59.527637-05:00","created_by":"Richard Abrich","updated_at":"2026-01-20T22:48:59.527637-05:00"}
{"id":"openadapt-evals-dke","title":"SYSTEM: Create knowledge persistence workflow using Beads","description":"Every fix/approach must be logged as a Beads issue with:\n1. Problem description\n2. Attempted solution\n3. Result (worked/failed/partial)\n4. Root cause if known\n5. Files changed\n\nBefore any fix attempt, agent MUST:\n1. Run 'bd list --labels=fix,approach' to see prior attempts\n2. Review what was tried before\n3. Document new attempt BEFORE implementing\n\nAfter context compaction, first action:\n1. Run 'bd ready' for current tasks\n2. Run 'bd list --labels=recurring' for known recurring issues\n3. Check docs/RECURRING_ISSUES.md for patterns","status":"open","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T19:00:18.155796-05:00","created_by":"Richard Abrich","updated_at":"2026-01-20T19:00:18.155796-05:00"}
{"id":"openadapt-evals-gna","title":"Test simplified Dockerfile (Azure mode)","description":"Testing Dockerfile.simplified which uses vanilla WAA Azure mode: native OEM mechanism (C:\\oem), InstallFrom element for unattended install, VERSION=11e for no product key. Steps: 1) Delete current VM 2) Create fresh VM 3) Build simplified image 4) Test Windows installation via QEMU screenshots","notes":"2026-01-22: Confirmed the blocker is not just docker pull; even starting the existing 'winarena' container via az vm run-command timed out.\n\n- smoke-live tried to run docker start winarena via run-command and timed out (900s)\n- WAA server remained unreachable at http://172.171.112.41:5000\n- VM was deallocated after the attempt\n\nImplication: VM/docker state is unhealthy or container start is hanging (possibly due to incomplete image extraction / stuck daemon / disk pressure).\nNext: add/run a vm-debug command to capture docker/system logs and determine whether to rebuild VM/image, pin/mirror image (ACR), or adjust docker config.","status":"open","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-21T12:47:15.12243-05:00","created_by":"Richard Abrich","updated_at":"2026-01-22T10:32:01.038825-05:00","labels":["testing","waa"],"comments":[{"id":3,"issue_id":"openadapt-evals-gna","author":"Richard Abrich","text":"Session Recovery 2026-01-22 17:58: Previous agents killed during compaction. VM state: Docker/containerd unhealthy, disk /mnt only 32GB (need 47GB+ for vanilla WAA). Git-lfs failing. User feedback: 1) use beads, 2) larger disk, 3) clean up CLI, 4) vanilla WAA config.","created_at":"2026-01-22T18:05:45Z"},{"id":4,"issue_id":"openadapt-evals-gna","author":"Richard Abrich","text":"Launched 3 parallel agents: ae159fc (VM disk upgrade), aabad47 (CLI cleanup), aee4e8a (fix containerd). Check /private/tmp/claude/-Users-abrichr-oa-src-openadapt-ml/tasks/*.output for results.","created_at":"2026-01-22T18:06:18Z"},{"id":5,"issue_id":"openadapt-evals-gna","author":"Richard Abrich","text":"WORKFLOW DOCUMENTED: VM config changes = delete VM -\u003e update code -\u003e relaunch. Added to CLAUDE.md. Default VM size now D8ds_v5 (300GB). Launching fresh VM now.","created_at":"2026-01-22T18:09:12Z"},{"id":6,"issue_id":"openadapt-evals-gna","author":"Richard Abrich","text":"2026-01-22 18:20: VM resources cleaned up, launched agent a9be1f8 to add auto-cleanup to CLI, WAA setup retrying in background (b04fcbe). Workflow documented in CLAUDE.md and STATUS.md.","created_at":"2026-01-22T18:11:56Z"},{"id":7,"issue_id":"openadapt-evals-gna","author":"Richard Abrich","text":"2026-01-22 18:30: VM created with D8s_v3 fallback (D8ds_v5 quota 0), IP 20.120.37.97. Restored waa_deploy symlink. Docker image building. W\u0026B integration agent a21c3ef running.","created_at":"2026-01-22T18:25:29Z"},{"id":8,"issue_id":"openadapt-evals-gna","author":"Richard Abrich","text":"2026-01-22 19:05: WAA Docker image built successfully! Container running. Windows booting. VM: 20.120.37.97, VNC: http://20.120.37.97:8006","created_at":"2026-01-22T18:47:03Z"}]}
{"id":"openadapt-evals-hvm","title":"VL model fix PR #18 ready to merge","notes":"openadapt-ml PR #18: VL model detection, exception handling, assistant_only_loss fix. All tests passing. Ready to merge.","status":"open","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-29T16:17:03.491938-05:00","created_by":"Richard Abrich","updated_at":"2026-01-29T16:17:03.491938-05:00"}
{"id":"openadapt-evals-sz4","title":"RCA: Windows product key prompt recurring issue","status":"closed","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T18:59:36.266286-05:00","created_by":"Richard Abrich","updated_at":"2026-01-20T20:32:06.493102-05:00","closed_at":"2026-01-20T20:32:06.493102-05:00","close_reason":"RCA complete - root cause is VERSION mismatch (CLI=11, Dockerfile=11e). Fix documented in RECURRING_ISSUES.md and WINDOWS_PRODUCT_KEY_RCA.md"}
{"id":"openadapt-evals-wis","title":"Add pre-flight check to detect Windows install issues","status":"closed","priority":1,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T18:59:36.865052-05:00","created_by":"Richard Abrich","updated_at":"2026-01-20T20:32:06.757261-05:00","closed_at":"2026-01-20T20:32:06.757261-05:00","close_reason":"Duplicate of openadapt-evals-0dt"}
101 changes: 101 additions & 0 deletions .github/workflows/release.yml
@@ -0,0 +1,101 @@
name: Auto Release

on:
push:
branches:
- main
paths:
- '**.py'
- 'pyproject.toml'

jobs:
release:
name: Bump version and release
runs-on: ubuntu-latest
# Skip automated version bump commits (and empty commit messages) to avoid release loops
if: |
github.event.head_commit.message != '' &&
!startsWith(github.event.head_commit.message, 'chore: bump version')
permissions:
contents: write

steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
token: ${{ secrets.GITHUB_TOKEN }}

- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.12"

- name: Install toml
run: pip install toml

- name: Determine version bump type
id: bump-type
run: |
COMMIT_MSG="${{ github.event.head_commit.message }}"
# Extract the type from conventional commit (feat, fix, etc.)
if [[ "$COMMIT_MSG" =~ ^feat ]]; then
echo "type=minor" >> $GITHUB_OUTPUT
elif [[ "$COMMIT_MSG" =~ ^(fix|perf) ]]; then
echo "type=patch" >> $GITHUB_OUTPUT
elif [[ "$COMMIT_MSG" =~ ^(docs|style|refactor|test|chore|ci|build) ]]; then
echo "type=patch" >> $GITHUB_OUTPUT
else
# Default to patch for non-conventional commits
echo "type=patch" >> $GITHUB_OUTPUT
fi

- name: Bump version
id: bump
run: |
python << 'EOF'
import toml
import os

# Read current version
with open('pyproject.toml', 'r') as f:
data = toml.load(f)

current = data['project']['version']
major, minor, patch = map(int, current.split('.'))

bump_type = os.environ.get('BUMP_TYPE', 'patch')

if bump_type == 'major':
major += 1
minor = 0
patch = 0
elif bump_type == 'minor':
minor += 1
patch = 0
else: # patch
patch += 1

new_version = f"{major}.{minor}.{patch}"
data['project']['version'] = new_version

with open('pyproject.toml', 'w') as f:
toml.dump(data, f)

print(f"Bumped {current} -> {new_version}")

# Set output
with open(os.environ['GITHUB_OUTPUT'], 'a') as f:
f.write(f"version={new_version}\n")
f.write(f"tag=v{new_version}\n")
EOF
env:
BUMP_TYPE: ${{ steps.bump-type.outputs.type }}

- name: Commit and tag
run: |
git config user.name "github-actions[bot]"
git config user.email "github-actions[bot]@users.noreply.github.com"
git add pyproject.toml
git commit -m "chore: bump version to ${{ steps.bump.outputs.version }}"
git tag ${{ steps.bump.outputs.tag }}
git push origin main --tags
68 changes: 38 additions & 30 deletions README.md
@@ -5,41 +5,44 @@
[![Downloads](https://img.shields.io/pypi/dm/openadapt-evals.svg)](https://pypi.org/project/openadapt-evals/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.10+](https://img.shields.io/badge/python-3.10%2B-blue)](https://www.python.org/downloads/)
[![Azure Success Rate](https://img.shields.io/badge/Azure%20Success%20Rate-95%25%2B-success)](https://github.com/OpenAdaptAI/openadapt-evals)
[![Cost Savings](https://img.shields.io/badge/Cost%20Savings-67%25-brightgreen)](https://github.com/OpenAdaptAI/openadapt-evals/blob/main/COST_OPTIMIZATION.md)

Evaluation infrastructure for GUI agent benchmarks.
Evaluation infrastructure for GUI agent benchmarks. **Simplified CLI toolkit for Windows Agent Arena.**

## Overview

`openadapt-evals` provides a unified framework for evaluating GUI automation agents across standardized benchmarks like Windows Agent Arena (WAA), OSWorld, WebArena, and others.

## Recent Improvements
## Windows Agent Arena (WAA) - Headline Feature

> **Status**: Actively running full 154-task evaluation. Results coming soon.

A **simplified CLI toolkit** for the [Windows Agent Arena](https://github.com/microsoft/WindowsAgentArena) benchmark, providing:
- Easy Azure VM setup and SSH tunnel management
- Agent adapters for Claude, GPT-4o, and custom agents
- Results viewer with per-domain breakdown
- Parallelization support for faster evaluations

See the [WAA Benchmark Results](#waa-benchmark-results) section below for current status.

We've made significant improvements to reliability, cost-efficiency, and observability:
## Roadmap (In Progress)

### Azure Reliability (v0.2.0 - January 2026)
- **95%+ Success Rate Target**: Fixed nested virtualization issues that caused 0% task completion
- **VM Configuration**: Upgraded to `Standard_D4s_v5` with proper nested virtualization support
The following features are under active development:

### Azure Reliability (`[IN PROGRESS]`)
- **Goal**: 95%+ task completion rate (up from 0% in early runs)
- **VM Configuration**: Using `Standard_D4s_v5` with nested virtualization (configurable)
- **Health Monitoring**: Automatic detection and retry of stuck jobs (see the sketch below)
- **Fast Failure Detection**: 10-minute timeout instead of 8+ hour hangs
- See [PR #11](https://github.com/OpenAdaptAI/openadapt-evals/pull/11) for details
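
A minimal sketch of the health-monitoring and fast-failure behavior described above. The `get_status` and `restart` callables are hypothetical placeholders, not part of the `openadapt-evals` API; the orchestrator's actual implementation may differ:

```python
import time

TIMEOUT_SECONDS = 10 * 60       # fail fast instead of hanging for hours
POLL_INTERVAL_SECONDS = 30
MAX_RETRIES = 2


def run_with_health_check(job_id: str, get_status, restart) -> str:
    """Poll a job until it finishes, retrying it if it appears stuck."""
    for attempt in range(MAX_RETRIES + 1):
        started = time.monotonic()
        while time.monotonic() - started < TIMEOUT_SECONDS:
            status = get_status(job_id)
            if status in ("succeeded", "failed"):
                return status
            time.sleep(POLL_INTERVAL_SECONDS)
        if attempt < MAX_RETRIES:
            restart(job_id)  # job exceeded the timeout: treat it as stuck
    return "timed_out"
```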

### Cost Optimization (v0.2.0 - January 2026)
- **67% Cost Reduction**: From $7.68 to $2.50 per full evaluation (154 tasks)
- **Tiered VM Sizing**: Automatic VM size selection based on task complexity (37% savings)
- **Spot Instance Support**: 70-80% discount on compute costs (64% savings with tiered VMs)
- **Azure Container Registry**: 10x faster image pulls (1-2 min vs 8-12 min)
- **Real-time Cost Tracking**: Monitor costs during evaluation
- See [COST_OPTIMIZATION.md](./COST_OPTIMIZATION.md) and [PR #13](https://github.com/OpenAdaptAI/openadapt-evals/pull/13) for details

### Screenshot Validation & Viewer (v0.2.0 - January 2026)
- **Real Benchmark Screenshots**: Viewer now displays actual WAA evaluation screenshots

### Cost Optimization (`[IN PROGRESS]`)
- **Goal**: Reduce per-evaluation cost from ~$7.68 to ~$2.50 (154 tasks)
- **Tiered VM Sizing**: Match VM size to task complexity (see the sketch below)
- **Spot Instance Support**: Use preemptible VMs for 70-80% discount
- See [COST_OPTIMIZATION.md](./COST_OPTIMIZATION.md) for design
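
As a rough illustration of how tiered sizing works, the sketch below maps task complexity to a VM size. The tier names and sizes here are hypothetical; the authoritative design lives in COST_OPTIMIZATION.md:

```python
# Hypothetical tiers; the real mapping is defined in COST_OPTIMIZATION.md.
VM_TIERS = {
    "light": "Standard_D2s_v5",   # simple tasks (e.g. Notepad, Clock)
    "medium": "Standard_D4s_v5",  # default tier
    "heavy": "Standard_D8s_v5",   # heavier tasks (e.g. browser, Office)
}


def pick_vm_size(complexity: str) -> str:
    """Return the VM size for a task, defaulting to the medium tier."""
    return VM_TIERS.get(complexity, VM_TIERS["medium"])
```

Combined with spot instances at a 70-80% discount, tiered sizing is how the ~$7.68 per-run baseline (154 tasks) is expected to come down to roughly $2.50, i.e. about a 67% reduction.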

### Benchmark Viewer (Available)
- **Real Benchmark Screenshots**: Viewer displays actual WAA evaluation screenshots
- **Auto-Screenshot Tool**: Automated screenshot generation with Playwright
- **Screenshot Validation**: Manifest-based validation ensuring correctness
- **Execution Logs**: Step-by-step logs with search and filtering
- **Live Monitoring**: Real-time Azure ML job monitoring with auto-refresh
- See [PR #6](https://github.com/OpenAdaptAI/openadapt-evals/pull/6) for details
- **Live Monitoring**: Real-time progress tracking

## Installation

@@ -79,7 +82,7 @@ adapter = WAALiveAdapter(config)
agent = ApiAgent(provider="anthropic") # or "openai" for GPT-5.1

# Run evaluation
results = evaluate_agent_on_benchmark(agent, adapter, task_ids=["notepad_1"])
results = evaluate_agent_on_benchmark(agent, adapter, task_ids=["notepad_366de66e-cbae-4d72-b042-26390db2b145-WOS"])

# Compute metrics
metrics = compute_metrics(results)
@@ -262,7 +265,7 @@ The package provides a CLI for running WAA evaluations:
python -m openadapt_evals.benchmarks.cli probe --server http://vm-ip:5000

# Run live evaluation against a WAA server
python -m openadapt_evals.benchmarks.cli live --server http://vm-ip:5000 --task-ids notepad_1,notepad_2
python -m openadapt_evals.benchmarks.cli live --server http://vm-ip:5000 --task-ids notepad_366de66e-cbae-4d72-b042-26390db2b145-WOS,notepad_a7d4b6c5-569b-452e-9e1d-ffdb3d431d15-WOS

# Generate HTML viewer for results
python -m openadapt_evals.benchmarks.cli view --run-name my_eval_run
@@ -298,7 +301,7 @@ if not adapter.check_connection():
print("WAA server not ready")

# Run evaluation
results = evaluate_agent_on_benchmark(agent, adapter, task_ids=["notepad_1"])
results = evaluate_agent_on_benchmark(agent, adapter, task_ids=["notepad_366de66e-cbae-4d72-b042-26390db2b145-WOS"])
```

### Local WAA Evaluation
@@ -318,6 +321,11 @@ results = evaluate_agent_on_benchmark(agent, adapter, task_ids=[t.task_id for t

Run WAA at scale using Azure ML compute with optimized costs:

> **⚠️ Quota Requirements**: Parallel evaluation requires sufficient Azure vCPU quota.
> - Default VM: `Standard_D4s_v5` (4 vCPUs per worker)
> - 10 workers = 40 vCPUs required
> - Default quota is typically 10 vCPUs; [request an increase](https://learn.microsoft.com/en-us/azure/quotas/quickstart-increase-quota-portal) before running parallel evaluations (see the sketch below for the arithmetic)
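
The worker count is simply quota divided by vCPUs per worker. A minimal sketch of that arithmetic, assuming the default `Standard_D4s_v5` worker size:

```python
VCPUS_PER_WORKER = 4  # Standard_D4s_v5


def max_workers(quota_vcpus: int, vcpus_per_worker: int = VCPUS_PER_WORKER) -> int:
    """How many parallel workers fit inside a regional vCPU quota."""
    return quota_vcpus // vcpus_per_worker


print(max_workers(10))  # default 10-vCPU quota -> 2 workers
print(max_workers(40))  # after a quota increase -> 10 workers
```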

```bash
# Install Azure dependencies
pip install openadapt-evals[azure]
@@ -358,7 +366,7 @@ results = orchestrator.run_evaluation(
)
```

**Azure Reliability**: The orchestrator now uses `Standard_D4s_v5` VMs with proper nested virtualization support and automatic health monitoring, achieving 95%+ success rates.
**Azure Reliability**: The orchestrator uses `Standard_D4s_v5` VMs with nested virtualization support and automatic health monitoring.

### Live Monitoring

@@ -371,7 +379,7 @@ pip install openadapt-evals[viewer]
# Start an Azure evaluation (in terminal 1)
python -m openadapt_evals.benchmarks.cli azure \
--workers 1 \
--task-ids notepad_1,browser_1 \
--task-ids notepad_366de66e-cbae-4d72-b042-26390db2b145-WOS,chrome_2ae9ba84-3a0d-4d4c-8338-3a1478dc5fe3-wos \
--waa-path /path/to/WAA

# Monitor job logs in real-time (in terminal 2)