2 changes: 2 additions & 0 deletions .beads/issues.jsonl
@@ -1,3 +1,4 @@
{"id":"openadapt-evals-0an","title":"CLI: aws-costs and waa-image delete commands added","notes":"openadapt-evals PR #24: Added aws-costs command, waa-image delete action, changed default to Docker Hub","status":"open","priority":1,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-29T16:17:03.612486-05:00","created_by":"Richard Abrich","updated_at":"2026-01-29T16:17:03.612486-05:00"}
{"id":"openadapt-evals-0dt","title":"Add pre-flight check for Windows install issues","description":"Detect product key prompts or stuck installations BEFORE 10-minute timeout. Check container logs for specific error patterns.","status":"open","priority":1,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T18:57:42.24338-05:00","created_by":"Richard Abrich","updated_at":"2026-01-20T18:57:42.24338-05:00"}
{"id":"openadapt-evals-0ms","title":"Run 20-50 task evaluation","description":"Run WAA benchmark on 20-50 tasks to measure baseline success rate. Target is \u003e80% success rate. This provides quantitative data on agent performance.","notes":"2026-01-29: Azure quota limits parallelization to 2 workers max (10 vCPUs / 4 vCPUs per worker). 10-worker test failed with ClusterCoreQuotaReached. User declined manual portal quota increase. Waiting for api-openai test results before full 154-task run.","status":"open","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T17:44:26.461765-05:00","created_by":"Richard Abrich","updated_at":"2026-01-29T00:28:02.609085-05:00","dependencies":[{"issue_id":"openadapt-evals-0ms","depends_on_id":"openadapt-evals-c3f","type":"blocks","created_at":"2026-01-20T17:44:26.462904-05:00","created_by":"Richard Abrich"}]}
{"id":"openadapt-evals-2ar","title":"Implement permanent fix for Windows unattended install","status":"closed","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T18:59:36.544113-05:00","created_by":"Richard Abrich","updated_at":"2026-01-20T20:32:06.634857-05:00","closed_at":"2026-01-20T20:32:06.634857-05:00","close_reason":"Duplicate of openadapt-evals-b3l"}
@@ -8,5 +9,6 @@
{"id":"openadapt-evals-czj","title":"Docker installation fails on Azure VM - pkgProblemResolver error","description":"vm setup-waa fails to install Docker. Error: pkgProblemResolver::Resolve generated breaks. Need to investigate root cause before attempting fix.","status":"open","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T22:48:59.527637-05:00","created_by":"Richard Abrich","updated_at":"2026-01-20T22:48:59.527637-05:00"}
{"id":"openadapt-evals-dke","title":"SYSTEM: Create knowledge persistence workflow using Beads","description":"Every fix/approach must be logged as a Beads issue with:\n1. Problem description\n2. Attempted solution\n3. Result (worked/failed/partial)\n4. Root cause if known\n5. Files changed\n\nBefore any fix attempt, agent MUST:\n1. Run 'bd list --labels=fix,approach' to see prior attempts\n2. Review what was tried before\n3. Document new attempt BEFORE implementing\n\nAfter context compaction, first action:\n1. Run 'bd ready' for current tasks\n2. Run 'bd list --labels=recurring' for known recurring issues\n3. Check docs/RECURRING_ISSUES.md for patterns","status":"open","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T19:00:18.155796-05:00","created_by":"Richard Abrich","updated_at":"2026-01-20T19:00:18.155796-05:00"}
{"id":"openadapt-evals-gna","title":"Test simplified Dockerfile (Azure mode)","description":"Testing Dockerfile.simplified which uses vanilla WAA Azure mode: native OEM mechanism (C:\\oem), InstallFrom element for unattended install, VERSION=11e for no product key. Steps: 1) Delete current VM 2) Create fresh VM 3) Build simplified image 4) Test Windows installation via QEMU screenshots","notes":"2026-01-22: Confirmed the blocker is not just docker pull; even starting the existing 'winarena' container via az vm run-command timed out.\n\n- smoke-live tried to run docker start winarena via run-command and timed out (900s)\n- WAA server remained unreachable at http://172.171.112.41:5000\n- VM was deallocated after the attempt\n\nImplication: VM/docker state is unhealthy or container start is hanging (possibly due to incomplete image extraction / stuck daemon / disk pressure).\nNext: add/run a vm-debug command to capture docker/system logs and determine whether to rebuild VM/image, pin/mirror image (ACR), or adjust docker config.","status":"open","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-21T12:47:15.12243-05:00","created_by":"Richard Abrich","updated_at":"2026-01-22T10:32:01.038825-05:00","labels":["testing","waa"],"comments":[{"id":3,"issue_id":"openadapt-evals-gna","author":"Richard Abrich","text":"Session Recovery 2026-01-22 17:58: Previous agents killed during compaction. VM state: Docker/containerd unhealthy, disk /mnt only 32GB (need 47GB+ for vanilla WAA). Git-lfs failing. User feedback: 1) use beads, 2) larger disk, 3) clean up CLI, 4) vanilla WAA config.","created_at":"2026-01-22T18:05:45Z"},{"id":4,"issue_id":"openadapt-evals-gna","author":"Richard Abrich","text":"Launched 3 parallel agents: ae159fc (VM disk upgrade), aabad47 (CLI cleanup), aee4e8a (fix containerd). Check /private/tmp/claude/-Users-abrichr-oa-src-openadapt-ml/tasks/*.output for results.","created_at":"2026-01-22T18:06:18Z"},{"id":5,"issue_id":"openadapt-evals-gna","author":"Richard Abrich","text":"WORKFLOW DOCUMENTED: VM config changes = delete VM -\u003e update code -\u003e relaunch. Added to CLAUDE.md. Default VM size now D8ds_v5 (300GB). Launching fresh VM now.","created_at":"2026-01-22T18:09:12Z"},{"id":6,"issue_id":"openadapt-evals-gna","author":"Richard Abrich","text":"2026-01-22 18:20: VM resources cleaned up, launched agent a9be1f8 to add auto-cleanup to CLI, WAA setup retrying in background (b04fcbe). Workflow documented in CLAUDE.md and STATUS.md.","created_at":"2026-01-22T18:11:56Z"},{"id":7,"issue_id":"openadapt-evals-gna","author":"Richard Abrich","text":"2026-01-22 18:30: VM created with D8s_v3 fallback (D8ds_v5 quota 0), IP 20.120.37.97. Restored waa_deploy symlink. Docker image building. W\u0026B integration agent a21c3ef running.","created_at":"2026-01-22T18:25:29Z"},{"id":8,"issue_id":"openadapt-evals-gna","author":"Richard Abrich","text":"2026-01-22 19:05: WAA Docker image built successfully! Container running. Windows booting. VM: 20.120.37.97, VNC: http://20.120.37.97:8006","created_at":"2026-01-22T18:47:03Z"}]}
{"id":"openadapt-evals-hvm","title":"VL model fix PR #18 ready to merge","notes":"openadapt-ml PR #18: VL model detection, exception handling, assistant_only_loss fix. All tests passing. Ready to merge.","status":"open","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-29T16:17:03.491938-05:00","created_by":"Richard Abrich","updated_at":"2026-01-29T16:17:03.491938-05:00"}
{"id":"openadapt-evals-sz4","title":"RCA: Windows product key prompt recurring issue","status":"closed","priority":0,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T18:59:36.266286-05:00","created_by":"Richard Abrich","updated_at":"2026-01-20T20:32:06.493102-05:00","closed_at":"2026-01-20T20:32:06.493102-05:00","close_reason":"RCA complete - root cause is VERSION mismatch (CLI=11, Dockerfile=11e). Fix documented in RECURRING_ISSUES.md and WINDOWS_PRODUCT_KEY_RCA.md"}
{"id":"openadapt-evals-wis","title":"Add pre-flight check to detect Windows install issues","status":"closed","priority":1,"issue_type":"task","owner":"richard.abrich@gmail.com","created_at":"2026-01-20T18:59:36.865052-05:00","created_by":"Richard Abrich","updated_at":"2026-01-20T20:32:06.757261-05:00","closed_at":"2026-01-20T20:32:06.757261-05:00","close_reason":"Duplicate of openadapt-evals-0dt"}
101 changes: 101 additions & 0 deletions .github/workflows/release.yml
@@ -0,0 +1,101 @@
name: Auto Release

on:
push:
branches:
- main
paths:
- '**.py'
- 'pyproject.toml'

jobs:
release:
name: Bump version and release
runs-on: ubuntu-latest
# Skip automated version bump commits (and empty commit messages) to avoid release loops
if: |
github.event.head_commit.message != '' &&
!startsWith(github.event.head_commit.message, 'chore: bump version')
permissions:
contents: write

steps:
- uses: actions/checkout@v4
with:
fetch-depth: 0
token: ${{ secrets.GITHUB_TOKEN }}

- name: Set up Python
uses: actions/setup-python@v5
with:
python-version: "3.12"

- name: Install toml
run: pip install toml

- name: Determine version bump type
id: bump-type
run: |
COMMIT_MSG="${{ github.event.head_commit.message }}"
# Extract the type from conventional commit (feat, fix, etc.)
if [[ "$COMMIT_MSG" =~ ^feat ]]; then
echo "type=minor" >> $GITHUB_OUTPUT
elif [[ "$COMMIT_MSG" =~ ^(fix|perf) ]]; then
echo "type=patch" >> $GITHUB_OUTPUT
elif [[ "$COMMIT_MSG" =~ ^(docs|style|refactor|test|chore|ci|build) ]]; then
echo "type=patch" >> $GITHUB_OUTPUT
else
# Default to patch for non-conventional commits
echo "type=patch" >> $GITHUB_OUTPUT
fi

- name: Bump version
id: bump
run: |
python << 'EOF'
import toml
import os

# Read current version
with open('pyproject.toml', 'r') as f:
data = toml.load(f)

current = data['project']['version']
major, minor, patch = map(int, current.split('.'))

bump_type = os.environ.get('BUMP_TYPE', 'patch')

if bump_type == 'major':
major += 1
minor = 0
patch = 0
elif bump_type == 'minor':
minor += 1
patch = 0
else: # patch
patch += 1

new_version = f"{major}.{minor}.{patch}"
data['project']['version'] = new_version

with open('pyproject.toml', 'w') as f:
toml.dump(data, f)

print(f"Bumped {current} -> {new_version}")

# Set output
with open(os.environ['GITHUB_OUTPUT'], 'a') as f:
f.write(f"version={new_version}\n")
f.write(f"tag=v{new_version}\n")
EOF
env:
BUMP_TYPE: ${{ steps.bump-type.outputs.type }}

- name: Commit and tag
run: |
git config user.name "github-actions[bot]"
git config user.email "github-actions[bot]@users.noreply.github.com"
git add pyproject.toml
git commit -m "chore: bump version to ${{ steps.bump.outputs.version }}"
git tag ${{ steps.bump.outputs.tag }}
git push origin main --tags
68 changes: 38 additions & 30 deletions README.md
@@ -5,41 +5,44 @@
[![Downloads](https://img.shields.io/pypi/dm/openadapt-evals.svg)](https://pypi.org/project/openadapt-evals/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.10+](https://img.shields.io/badge/python-3.10%2B-blue)](https://www.python.org/downloads/)
[![Azure Success Rate](https://img.shields.io/badge/Azure%20Success%20Rate-95%25%2B-success)](https://github.com/OpenAdaptAI/openadapt-evals)
[![Cost Savings](https://img.shields.io/badge/Cost%20Savings-67%25-brightgreen)](https://github.com/OpenAdaptAI/openadapt-evals/blob/main/COST_OPTIMIZATION.md)

Evaluation infrastructure for GUI agent benchmarks.
Evaluation infrastructure for GUI agent benchmarks. **Simplified CLI toolkit for Windows Agent Arena.**

## Overview

`openadapt-evals` provides a unified framework for evaluating GUI automation agents across standardized benchmarks like Windows Agent Arena (WAA), OSWorld, WebArena, and others.

## Recent Improvements
## Windows Agent Arena (WAA) - Headline Feature

> **Status**: Actively running full 154-task evaluation. Results coming soon.

A **simplified CLI toolkit** for the [Windows Agent Arena](https://github.com/microsoft/WindowsAgentArena) benchmark, providing:
- Easy Azure VM setup and SSH tunnel management
- Agent adapters for Claude, GPT-4o, and custom agents
- Results viewer with per-domain breakdown
- Parallelization support for faster evaluations

See the [WAA Benchmark Results](#waa-benchmark-results) section below for current status.

We've made significant improvements to reliability, cost-efficiency, and observability:
## Roadmap (In Progress)

### Azure Reliability (v0.2.0 - January 2026)
- **95%+ Success Rate Target**: Fixed nested virtualization issues that caused 0% task completion
- **VM Configuration**: Upgraded to `Standard_D4s_v5` with proper nested virtualization support
The following features are under active development:

### Azure Reliability (`[IN PROGRESS]`)
- **Goal**: 95%+ task completion rate (up from 0% in early runs)
- **VM Configuration**: Using `Standard_D4s_v5` with nested virtualization (configurable)
- **Health Monitoring**: Automatic detection and retry of stuck jobs (see the sketch below)
- **Fast Failure Detection**: 10-minute timeout instead of 8+ hour hangs
- See [PR #11](https://github.com/OpenAdaptAI/openadapt-evals/pull/11) for details
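
A minimal sketch of the health-monitoring and fast-failure behavior described above. The `get_status` and `restart` callables are hypothetical placeholders, not part of the `openadapt-evals` API; the orchestrator's actual implementation may differ:

```python
import time

TIMEOUT_SECONDS = 10 * 60       # fail fast instead of hanging for hours
POLL_INTERVAL_SECONDS = 30
MAX_RETRIES = 2


def run_with_health_check(job_id: str, get_status, restart) -> str:
    """Poll a job until it finishes, retrying it if it appears stuck."""
    for attempt in range(MAX_RETRIES + 1):
        started = time.monotonic()
        while time.monotonic() - started < TIMEOUT_SECONDS:
            status = get_status(job_id)
            if status in ("succeeded", "failed"):
                return status
            time.sleep(POLL_INTERVAL_SECONDS)
        if attempt < MAX_RETRIES:
            restart(job_id)  # job exceeded the timeout: treat it as stuck
    return "timed_out"
```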

### Cost Optimization (v0.2.0 - January 2026)
- **67% Cost Reduction**: From $7.68 to $2.50 per full evaluation (154 tasks)
- **Tiered VM Sizing**: Automatic VM size selection based on task complexity (37% savings)
- **Spot Instance Support**: 70-80% discount on compute costs (64% savings with tiered VMs)
- **Azure Container Registry**: 10x faster image pulls (1-2 min vs 8-12 min)
- **Real-time Cost Tracking**: Monitor costs during evaluation
- See [COST_OPTIMIZATION.md](./COST_OPTIMIZATION.md) and [PR #13](https://github.com/OpenAdaptAI/openadapt-evals/pull/13) for details

### Screenshot Validation & Viewer (v0.2.0 - January 2026)
- **Real Benchmark Screenshots**: Viewer now displays actual WAA evaluation screenshots

### Cost Optimization (`[IN PROGRESS]`)
- **Goal**: Reduce per-evaluation cost from ~$7.68 to ~$2.50 (154 tasks)
- **Tiered VM Sizing**: Match VM size to task complexity (see the sketch below)
- **Spot Instance Support**: Use preemptible VMs for 70-80% discount
- See [COST_OPTIMIZATION.md](./COST_OPTIMIZATION.md) for design
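
As a rough illustration of how tiered sizing works, the sketch below maps task complexity to a VM size. The tier names and sizes here are hypothetical; the authoritative design lives in COST_OPTIMIZATION.md:

```python
# Hypothetical tiers; the real mapping is defined in COST_OPTIMIZATION.md.
VM_TIERS = {
    "light": "Standard_D2s_v5",   # simple tasks (e.g. Notepad, Clock)
    "medium": "Standard_D4s_v5",  # default tier
    "heavy": "Standard_D8s_v5",   # heavier tasks (e.g. browser, Office)
}


def pick_vm_size(complexity: str) -> str:
    """Return the VM size for a task, defaulting to the medium tier."""
    return VM_TIERS.get(complexity, VM_TIERS["medium"])
```

Combined with spot instances at a 70-80% discount, tiered sizing is how the ~$7.68 per-run baseline (154 tasks) is expected to come down to roughly $2.50, i.e. about a 67% reduction.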

### Benchmark Viewer (Available)
- **Real Benchmark Screenshots**: Viewer displays actual WAA evaluation screenshots
- **Auto-Screenshot Tool**: Automated screenshot generation with Playwright
- **Screenshot Validation**: Manifest-based validation ensuring correctness
- **Execution Logs**: Step-by-step logs with search and filtering
- **Live Monitoring**: Real-time Azure ML job monitoring with auto-refresh
- See [PR #6](https://github.com/OpenAdaptAI/openadapt-evals/pull/6) for details
- **Live Monitoring**: Real-time progress tracking

## Installation

@@ -79,7 +82,7 @@ adapter = WAALiveAdapter(config)
agent = ApiAgent(provider="anthropic") # or "openai" for GPT-5.1

# Run evaluation
results = evaluate_agent_on_benchmark(agent, adapter, task_ids=["notepad_1"])
results = evaluate_agent_on_benchmark(agent, adapter, task_ids=["notepad_366de66e-cbae-4d72-b042-26390db2b145-WOS"])

# Compute metrics
metrics = compute_metrics(results)
@@ -262,7 +265,7 @@ The package provides a CLI for running WAA evaluations:
python -m openadapt_evals.benchmarks.cli probe --server http://vm-ip:5000

# Run live evaluation against a WAA server
python -m openadapt_evals.benchmarks.cli live --server http://vm-ip:5000 --task-ids notepad_1,notepad_2
python -m openadapt_evals.benchmarks.cli live --server http://vm-ip:5000 --task-ids notepad_366de66e-cbae-4d72-b042-26390db2b145-WOS,notepad_a7d4b6c5-569b-452e-9e1d-ffdb3d431d15-WOS

# Generate HTML viewer for results
python -m openadapt_evals.benchmarks.cli view --run-name my_eval_run
@@ -298,7 +301,7 @@ if not adapter.check_connection():
print("WAA server not ready")

# Run evaluation
results = evaluate_agent_on_benchmark(agent, adapter, task_ids=["notepad_1"])
results = evaluate_agent_on_benchmark(agent, adapter, task_ids=["notepad_366de66e-cbae-4d72-b042-26390db2b145-WOS"])
```

### Local WAA Evaluation
@@ -318,6 +321,11 @@ results = evaluate_agent_on_benchmark(agent, adapter, task_ids=[t.task_id for t

Run WAA at scale using Azure ML compute with optimized costs:

> **⚠️ Quota Requirements**: Parallel evaluation requires sufficient Azure vCPU quota.
> - Default VM: `Standard_D4s_v5` (4 vCPUs per worker)
> - 10 workers = 40 vCPUs required
> - Default quota is typically 10 vCPUs; [request an increase](https://learn.microsoft.com/en-us/azure/quotas/quickstart-increase-quota-portal) before running parallel evaluations (see the sketch below for the arithmetic)
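
The worker count is simply quota divided by vCPUs per worker. A minimal sketch of that arithmetic, assuming the default `Standard_D4s_v5` worker size:

```python
VCPUS_PER_WORKER = 4  # Standard_D4s_v5


def max_workers(quota_vcpus: int, vcpus_per_worker: int = VCPUS_PER_WORKER) -> int:
    """How many parallel workers fit inside a regional vCPU quota."""
    return quota_vcpus // vcpus_per_worker


print(max_workers(10))  # default 10-vCPU quota -> 2 workers
print(max_workers(40))  # after a quota increase -> 10 workers
```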

```bash
# Install Azure dependencies
pip install openadapt-evals[azure]
@@ -358,7 +366,7 @@ results = orchestrator.run_evaluation(
)
```

**Azure Reliability**: The orchestrator now uses `Standard_D4s_v5` VMs with proper nested virtualization support and automatic health monitoring, achieving 95%+ success rates.
**Azure Reliability**: The orchestrator uses `Standard_D4s_v5` VMs with nested virtualization support and automatic health monitoring.

### Live Monitoring

@@ -371,7 +379,7 @@ pip install openadapt-evals[viewer]
# Start an Azure evaluation (in terminal 1)
python -m openadapt_evals.benchmarks.cli azure \
--workers 1 \
--task-ids notepad_1,browser_1 \
--task-ids notepad_366de66e-cbae-4d72-b042-26390db2b145-WOS,chrome_2ae9ba84-3a0d-4d4c-8338-3a1478dc5fe3-wos \
--waa-path /path/to/WAA

# Monitor job logs in real-time (in terminal 2)