[![Downloads](https://img.shields.io/pypi/dm/openadapt-evals.svg)](https://pypi.org/project/openadapt-evals/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.10+](https://img.shields.io/badge/python-3.10%2B-blue)](https://www.python.org/downloads/)

Evaluation infrastructure for GUI agent benchmarks, centered on a **simplified CLI toolkit for Windows Agent Arena**.

## Overview

`openadapt-evals` provides a unified framework for evaluating GUI automation agents across standardized benchmarks like Windows Agent Arena (WAA), OSWorld, WebArena, and others.

## Windows Agent Arena (WAA) - Headline Feature

> **Status**: Actively running full 154-task evaluation. Results coming soon.

A **simplified CLI toolkit** for the [Windows Agent Arena](https://github.com/microsoft/WindowsAgentArena) benchmark, providing:
- Easy Azure VM setup and SSH tunnel management
- Agent adapters for Claude, GPT-4o, and custom agents (a custom-adapter sketch follows below)
- Results viewer with per-domain breakdown
- Parallelization support for faster evaluations

See the [WAA Benchmark Results](#waa-benchmark-results) section below for current status.
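
The adapter surface is small: an agent receives an observation (screenshot plus instruction) and returns an action. Here is a minimal sketch of a custom agent, assuming a hypothetical `Action` shape and `act()` signature rather than the package's actual API:

```python
# Illustrative sketch only -- the Action shape and act() signature are
# assumptions, not the actual openadapt-evals adapter API.
from dataclasses import dataclass


@dataclass
class Action:
    kind: str        # "click", "type", "done", ...
    x: int = 0
    y: int = 0
    text: str = ""


class CenterClickAgent:
    """Trivial agent: clicks the screen center on every step."""

    def act(self, screenshot_png: bytes, instruction: str) -> Action:
        # A real adapter would send the screenshot and instruction to a
        # model (Claude, GPT-4o, ...) and parse the reply into an Action.
        return Action(kind="click", x=640, y=360)
```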

## Roadmap (In Progress)

The following features are under active development:

### Azure Reliability (`[IN PROGRESS]`)
- **Goal**: 95%+ task completion rate (early runs completed 0% due to nested-virtualization failures)
- **VM Configuration**: Using `Standard_D4s_v5` with nested virtualization (configurable)
- **Health Monitoring**: Automatic detection and retry of stuck jobs
- **Fast Failure Detection**: 10-minute stall timeout instead of 8+ hour hangs (see the sketch after this list)
- See [PR #11](https://github.com/OpenAdaptAI/openadapt-evals/pull/11) for details
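
Health monitoring and fast failure come down to bounding how long a job may run without making progress. A minimal sketch of the idea, with `get_progress` as a stand-in for a real Azure ML status query:

```python
import time

STALL_TIMEOUT_S = 10 * 60  # give up after 10 minutes without progress


def wait_for_job(get_progress, total_tasks, poll_interval_s=30):
    """Poll a job and fail fast if it stalls.

    get_progress is a stand-in for a real Azure ML status query; it
    returns the number of tasks completed so far.
    """
    last_progress = get_progress()
    last_change = time.monotonic()
    while last_progress < total_tasks:
        time.sleep(poll_interval_s)
        progress = get_progress()
        if progress > last_progress:
            last_progress, last_change = progress, time.monotonic()
        elif time.monotonic() - last_change > STALL_TIMEOUT_S:
            # Raise so the caller can retry the job instead of hanging
            # for hours on a stuck VM.
            raise TimeoutError("no progress in 10 minutes; job looks stuck")
    return last_progress
```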

### Cost Optimization (`[IN PROGRESS]`)
- **Goal**: Reduce per-evaluation cost from ~$7.68 to ~$2.50 (154 tasks)
- **Tiered VM Sizing**: Match VM size to task complexity (sketched after this list)
- **Spot Instance Support**: Use preemptible VMs for 70-80% discount
- See [COST_OPTIMIZATION.md](./COST_OPTIMIZATION.md) for design
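
The idea behind tiered sizing is to map each task to the smallest VM that can handle it, and to run on spot capacity wherever eviction is tolerable. A sketch under assumed tiers (the actual sizes and thresholds live in COST_OPTIMIZATION.md and may differ):

```python
# Illustrative tier table -- the real tiers/thresholds may differ.
VM_TIERS = {
    "light":  "Standard_D2s_v5",   # 2 vCPUs: simple UI tasks
    "medium": "Standard_D4s_v5",   # 4 vCPUs: the default
    "heavy":  "Standard_D8s_v5",   # 8 vCPUs: heavyweight applications
}


def pick_vm(task_complexity: str, use_spot: bool = True) -> dict:
    """Choose a VM size for a task and optionally request spot pricing."""
    return {
        "vm_size": VM_TIERS.get(task_complexity, VM_TIERS["medium"]),
        # Spot (preemptible) VMs are commonly 70-80% cheaper but can be
        # evicted, so the caller must be able to retry the task.
        "priority": "Spot" if use_spot else "Regular",
    }
```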

### Benchmark Viewer (Available)
- **Real Benchmark Screenshots**: Viewer displays actual WAA evaluation screenshots
- **Auto-Screenshot Tool**: Automated screenshot generation with Playwright
- **Screenshot Validation**: Manifest-based checks that each screenshot matches its recorded run (see the sketch below)
- **Execution Logs**: Step-by-step logs with search and filtering
- **Live Monitoring**: Real-time progress tracking
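
Manifest-based validation reduces to checking every expected screenshot against a recorded checksum. A minimal sketch, assuming a JSON manifest of `{filename: sha256}` pairs rather than the viewer's actual manifest format:

```python
import hashlib
import json
from pathlib import Path


def validate_screenshots(manifest_path: str, screenshot_dir: str) -> list[str]:
    """Return a list of problems; empty means every screenshot checks out."""
    # Assumed manifest format: {"task_001.png": "<sha256 hex>", ...}
    manifest = json.loads(Path(manifest_path).read_text())
    problems = []
    for name, expected_sha in manifest.items():
        path = Path(screenshot_dir) / name
        if not path.exists():
            problems.append(f"missing: {name}")
            continue
        actual = hashlib.sha256(path.read_bytes()).hexdigest()
        if actual != expected_sha:
            problems.append(f"checksum mismatch: {name}")
    return problems
```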

## Installation


Run WAA at scale using Azure ML compute with optimized costs:

> **⚠️ Quota Requirements**: Parallel evaluation requires sufficient Azure vCPU quota.
> - Default VM: `Standard_D4s_v5` (4 vCPUs per worker)
> - 10 workers = 40 vCPUs required
> - Default quota is typically 10 vCPUs - [request an increase](https://learn.microsoft.com/en-us/azure/quotas/quickstart-increase-quota-portal) before running parallel evaluations
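
The quota math is just workers times vCPUs per VM, which is worth checking before launching (the vCPU counts below are standard for these Azure sizes):

```python
VCPUS_PER_VM = {"Standard_D2s_v5": 2, "Standard_D4s_v5": 4, "Standard_D8s_v5": 8}


def required_vcpus(num_workers: int, vm_size: str = "Standard_D4s_v5") -> int:
    return num_workers * VCPUS_PER_VM[vm_size]


# 10 workers on the default VM size need 40 vCPUs, four times the
# typical default quota of 10.
assert required_vcpus(10) == 40
```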

```bash
# Install Azure dependencies
pip install openadapt-evals[azure]
```

```python
results = orchestrator.run_evaluation(
    ...  # arguments elided in this excerpt
)
```

**Azure Reliability**: The orchestrator uses `Standard_D4s_v5` VMs with nested virtualization support and automatic health monitoring.

### Live Monitoring
