A comprehensive toolkit for creating, validating, and managing tasks, tools, and SOPs for Tau2-Bench domains.
When you have the perfect combination of:
- ✅ A complete, well-designed tool set
- ✅ Clear, comprehensive domain rules
- ✅ High-quality, unambiguous instructions

...there should be only ONE path to achieve the ground truth.
Tau-Helper accelerates domain development by helping you:
- Validate instructions for quality and clarity
- Map instructions to SOPs and detect ambiguity
- Generate ground truth tasks automatically through iterative execution
- Debug and verify task execution
This ensures your domains provide deterministic, high-quality benchmarks for agent evaluation.
📌 Cursor IDE Users: This directory includes `.cursorrules` for AI assistant integration. The file is automatically loaded by Cursor to help you work with this library.
Setup: Place `tau_helper` as a sibling directory to `warrior-tau-bench`:

```
your-workspace/
├── warrior-tau-bench/   # Main project
└── tau_helper/          # Helper tools
```
Install:

```
# Install dependencies
pip install -r tau_helper/requirements.txt
```

Important: Always run tau_helper commands from the `warrior-tau-bench/` root directory. The tool uses the current working directory to find domains.
Create `.env` in the `tau_helper/` directory:

```
# Default model (used for instruction evaluation)
DEFAULT_MODEL=gpt-4o-mini
DEFAULT_API_KEY=your-openai-api-key
DEFAULT_BASE_URL=https://api.openai.com/v1

# Reasoning model (used for SOP mapping, scaffolding)
DEFAULT_MODEL_R=gpt-4o
DEFAULT_API_KEY_R=your-openai-api-key
DEFAULT_BASE_URL_R=https://api.openai.com/v1

# Multi-Agent Architecture (OPTIONAL)
# If both R2 and JUDGE are configured, scaffolding uses multi-agent mode:
# - R and R2 generate scaffolds independently
# - If they differ, JUDGE picks the best one
# - Improves quality through model consensus

# R2 Model: Can use stronger OpenAI models like gpt-5-mini, gpt-4o, etc.
DEFAULT_MODEL_R2=gpt-5-mini
DEFAULT_API_KEY_R2=your-openai-api-key
DEFAULT_BASE_URL_R2=https://api.openai.com/v1

# Judge Model: Reasoning models work well (DeepSeek R1, etc.)
DEFAULT_MODEL_R_JUDGE=deepseek-ai/DeepSeek-R1-0528
DEFAULT_API_KEY_R_JUDGE=your-friendli-api-key
DEFAULT_BASE_URL_R_JUDGE=https://api.friendli.ai/serverless/v1
```

- Instruction Evaluation - Score instructions 0-100 for user-facing, non-procedural quality
- SOP Chain Mapper - Map instructions to SOPs, detect ambiguity, suggest fixes
- Task Scaffolder (NEW) - Generate complete, executable tasks from instructions
- Action Executor - Execute and debug domain task actions
- Agent Log Reader - Analyze agent evaluation results
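The tiered `.env` variables above share a naming pattern: a suffix (`_R`, `_R2`, `_R_JUDGE`) selects the model tier, and the un-suffixed variables serve as the default. As a minimal sketch of suffix-based resolution with fallback (a hypothetical helper for illustration; the real loading logic lives in `tau_helper/llm.py` and may differ):

```python
import os

def resolve_model_config(tier: str = "") -> dict:
    """Resolve model settings for a tier ('' = default, '_R', '_R2', '_R_JUDGE').

    Falls back to the base (un-suffixed) variable when the tier-specific
    one is missing. Hypothetical helper, not part of tau_helper itself.
    """
    def get(name: str):
        # Prefer the tier-specific variable, fall back to the base one.
        return os.getenv(f"{name}{tier}") or os.getenv(name)

    return {
        "model": get("DEFAULT_MODEL"),
        "api_key": get("DEFAULT_API_KEY"),
        "base_url": get("DEFAULT_BASE_URL"),
    }

# With only DEFAULT_MODEL and DEFAULT_MODEL_R set, the _R2 tier
# falls back to the default model.
os.environ["DEFAULT_MODEL"] = "gpt-4o-mini"
os.environ["DEFAULT_MODEL_R"] = "gpt-4o"
print(resolve_model_config("_R")["model"])   # gpt-4o
print(resolve_model_config("_R2")["model"])  # gpt-4o-mini
```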
Always run from the `warrior-tau-bench/` root:

```
cd /path/to/warrior-tau-bench
python ../tau_helper/run.py <command>
```

```
# Show info and list domains
python ../tau_helper/run.py info
python ../tau_helper/run.py list-domains
python ../tau_helper/run.py list-domains --domain sec
```

```
# Evaluate single instruction
python ../tau_helper/run.py evaluate "You are analyzing financial data"

# Batch evaluation from file
python ../tau_helper/run.py evaluate -f instructions.txt --batch
```

```
# Map instruction from task
python ../tau_helper/run.py map-sop <domain> --variation <variation> --task <task_id>
python ../tau_helper/run.py map-sop sec --variation variation_2 --task task_001

# Compare predicted vs actual actions
python ../tau_helper/run.py map-sop sec --variation variation_2 --task task_001 --compare

# Map custom instruction
python ../tau_helper/run.py map-sop sec --variation variation_2 --instruction "Calculate WACC for Tesla"

# Use specific model (override DEFAULT_MODEL_R)
python ../tau_helper/run.py map-sop sec --variation variation_2 --task task_001 --model gpt-4o
```

SOP Mapper Output:
- Primary SOP chain with confidence score
- Alternative interpretations (if ambiguous)
- Ambiguity explanation
- Missing information needed to disambiguate
- Suggested instruction fix (non-procedural, minimal changes)
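Conceptually, the mapper's output can be modeled as a small record. The field names below are illustrative assumptions for discussion, not the actual `tau_helper.sop_mapper` schema:

```python
from dataclasses import dataclass, field

@dataclass
class SOPMapping:
    """Illustrative shape of a map-sop result (field names are assumed)."""
    primary_chain: list          # ordered SOP names
    confidence: float            # confidence in the primary chain
    alternatives: list = field(default_factory=list)  # competing interpretations
    ambiguity_explanation: str = ""
    missing_information: list = field(default_factory=list)
    suggested_fix: str = ""      # non-procedural rewrite of the instruction

    @property
    def is_ambiguous(self) -> bool:
        # Treat a mapping as ambiguous when alternatives exist or confidence is low.
        return bool(self.alternatives) or self.confidence < 0.8

clear = SOPMapping(primary_chain=["lookup_filing", "extract_financials"], confidence=0.95)
vague = SOPMapping(primary_chain=["calculate_wacc"], confidence=0.55,
                   missing_information=["which fiscal year to use"])
print(clear.is_ambiguous, vague.is_ambiguous)  # False True
```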
```
# Generate task from custom instruction
python ../tau_helper/run.py scaffold <domain> --variation <variation> --instruction "Your instruction"

# Generate task from existing task instruction
python ../tau_helper/run.py scaffold <domain> --variation <variation> --task <task_id>

# Specify custom task ID
python ../tau_helper/run.py scaffold <domain> --variation <variation> --instruction "..." --task-id task_new_001

# Show detailed progress including execution results
python ../tau_helper/run.py scaffold <domain> --variation <variation> --instruction "..." --verbose

# Examples
python ../tau_helper/run.py scaffold salesforce_management --variation variation_2 --instruction "Transfer accounts to Chris Sullivan"
python ../tau_helper/run.py scaffold salesforce_management --variation variation_2 --task task_001 --verbose
```

Task Scaffolder Features:
- Code-based generation: Agents write Python code that calls tools
- R/R2 Roundtable: R generates code, R2 reviews for correctness, up to 5 refinement rounds
- Live editing: Execution failures trigger automatic diagnosis (R2) and fixes (R)
- Real values only: No placeholders! Uses actual values from execution results
- Domain-agnostic: Works with any domain automatically
- Outputs a complete task in `tasks.py` format with real, executable values
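Because the scaffolder promises real values rather than placeholders, a quick sanity scan before pasting output into `tasks.py` can catch leftovers. This is a generic heuristic with assumed placeholder conventions, not part of tau_helper:

```python
import re

# Placeholder-looking tokens (assumed conventions: <ANGLE_TOKENS>, TODO, etc.)
PLACEHOLDER_RE = re.compile(r"<[A-Z_]+>|PLACEHOLDER|TODO|FIXME")

def has_placeholders(task_code: str) -> bool:
    """Return True if scaffolded task code still contains placeholder-like tokens."""
    return bool(PLACEHOLDER_RE.search(task_code))

print(has_placeholders('{"account_id": "<ACCOUNT_ID>"}'))    # True
print(has_placeholders('{"account_id": "001XX000003GZkQ"}')) # False
```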
🤖 Multi-Agent Architecture
Configure DEFAULT_MODEL_R2 and optionally DEFAULT_MODEL_R_JUDGE in .env:
How it works:
- Model R generates Python code for the task
- Model R2 reviews for correctness (tool calls, data flow, SOP compliance)
- Up to 5 rounds of refinement until R2 approves
- Judge (optional) mediates if R2 has critical concerns after all rounds
- Live editing on execution: failures trigger R2 diagnosis → R fix → re-execute
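The refinement loop above can be sketched as a small driver function, where `generate` plays model R and `review` plays model R2. The stand-ins below are toys for illustration, not the actual tau_helper implementation:

```python
def roundtable(generate, review, max_rounds=5):
    """Hypothetical sketch of the R/R2 refinement loop.

    generate(feedback) plays model R and returns candidate code;
    review(code) plays model R2 and returns (approved, feedback).
    """
    code, feedback = None, None
    for _ in range(max_rounds):
        code = generate(feedback)
        approved, feedback = review(code)
        if approved:
            return code, True
    return code, False  # unresolved; a judge model could mediate here

# Toy stand-ins: R "fixes" the code once R2 complains about a placeholder.
def generate(feedback):
    return "x = 42" if feedback else "x = PLACEHOLDER"

def review(code):
    ok = "PLACEHOLDER" not in code
    return ok, None if ok else "replace the placeholder with a real value"

print(roundtable(generate, review))  # ('x = 42', True)
```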
Recommended Models:
- R/R2: OpenAI models like `gpt-4o`, `gpt-5-mini`
- Judge: Reasoning models like `deepseek-ai/DeepSeek-R1-0528`
```
# List tasks
python ../tau_helper/run.py execute <domain> --variation <variation> --list-tasks

# Show task details
python ../tau_helper/run.py execute <domain> --variation <variation> --task <task_id> --show

# Execute specific action
python ../tau_helper/run.py execute <domain> --variation <variation> --task <task_id> --action <action_index>

# Execute all actions in sequence
python ../tau_helper/run.py execute <domain> --variation <variation> --task <task_id>

# Show database state
python ../tau_helper/run.py execute <domain> --variation <variation> --task <task_id> --db-state

# Reset database
python ../tau_helper/run.py execute <domain> --variation <variation> --task <task_id> --reset-db

# Examples
python ../tau_helper/run.py execute sec --variation variation_2 --list-tasks
python ../tau_helper/run.py execute sec --variation variation_2 --task task_001 --action 0
python ../tau_helper/run.py execute airline --variation variation_1 --task task_003
```

```
# Show statistics
python ../tau_helper/run.py agent-logs <domain> --variation <variation> --stats

# List all tasks with pass rates
python ../tau_helper/run.py agent-logs <domain> --variation <variation> --list-tasks

# Analyze specific task
python ../tau_helper/run.py agent-logs <domain> --variation <variation> --task <task_id>

# Show user simulation messages (initial turn) - 8 per task
python ../tau_helper/run.py agent-logs <domain> --variation <variation> --user-sim

# Show user sim for a specific task (all 8 agent attempts)
python ../tau_helper/run.py agent-logs <domain> --variation <variation> --task <task_id> --user-sim

# Limit results
python ../tau_helper/run.py agent-logs <domain> --variation <variation> --user-sim --limit 5

# Use custom agent.json file
python ../tau_helper/run.py agent-logs <domain> --variation <variation> --agent-json path/to/agent.json --user-sim

# Find problematic tasks
python ../tau_helper/run.py agent-logs <domain> --variation <variation> --task <task_id> --errors-only

# Custom agent.json location
python ../tau_helper/run.py agent-logs <domain> --variation <variation> --agent-json path/to/agent.json --stats

# Examples
python ../tau_helper/run.py agent-logs sec --variation variation_2 --stats
python ../tau_helper/run.py agent-logs sec --variation variation_2 --task task_072 --compare
```

User-facing, non-procedural:
"You are Sarah Chen, a financial analyst. You need to analyze
Tesla's Q4 2023 financial performance for your investment report."
Characteristics:
- Describes WHAT the user wants, not HOW to do it
- Natural, conversational language
- Role and context provided
- No function names or technical details
Procedural:
"First, call get_balance_sheet() for Tesla, then execute
calculate_wacc() with the parameters, and save the results."
Problems:
- Step-by-step instructions
- Function names and API calls
- Implementation details
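The `evaluate` command scores these qualities with an LLM. As a rough illustration of what "procedural" means, a simple pattern heuristic (assumed patterns, far cruder than the actual evaluator) flags the bad example above while passing the good one:

```python
import re

# Heuristic flags for procedural phrasing; not the scoring used by `evaluate`.
PROCEDURAL_PATTERNS = [
    r"\b\w+\(\)",                   # function-call syntax like calculate_wacc()
    r"\b(call|execute|invoke)\b",   # imperative API verbs
    r"\b(first|then|next|finally)\b",  # step-by-step sequencing words
]

def looks_procedural(instruction: str) -> bool:
    """Return True if the instruction contains procedural tell-tales."""
    text = instruction.lower()
    return any(re.search(p, text) for p in PROCEDURAL_PATTERNS)

good = ("You are Sarah Chen, a financial analyst. You need to analyze "
        "Tesla's Q4 2023 financial performance for your investment report.")
bad = ("First, call get_balance_sheet() for Tesla, then execute "
       "calculate_wacc() with the parameters, and save the results.")
print(looks_procedural(good), looks_procedural(bad))  # False True
```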
```python
from tau_helper.llm import get_llm_client, get_reasoning_llm
from tau_helper.evaluator import InstructionEvaluator

# Instruction evaluation
llm = get_llm_client()
evaluator = InstructionEvaluator(llm)
result = evaluator.evaluate("You are analyzing financial data")
print(f"Score: {result.score}/100")

# SOP mapping
from tau_helper.sop_mapper import SOPMapper
llm_r = get_reasoning_llm()
mapper = SOPMapper(llm_r, "sec", "variation_2")
mapping = mapper.map_instruction("Extract AAPL financials")

# Action execution
from tau_helper.action_executor import ActionExecutor
executor = ActionExecutor("sec", "variation_2")
result = executor.execute_action("task_001", 0)

# Agent log reading
from tau_helper.agent_log_reader import AgentLogReader
reader = AgentLogReader("sec", "variation_2")
stats = reader.analyze_stats()
```

The correct workflow for generating tasks:
```
# Step 1: Evaluate instruction quality
python ../tau_helper/run.py evaluate "You are Pat Manager. You want Chris Sullivan to succeed in Q1 2026..."

# Step 2: Map instruction to SOPs (check for ambiguity)
python ../tau_helper/run.py map-sop salesforce_management --variation variation_2 --instruction "You are Pat Manager..."

# If ambiguous, use the suggested fix from Step 2, then:
# Step 3: Evaluate the fixed instruction again
python ../tau_helper/run.py evaluate "FIXED_INSTRUCTION_HERE"

# Step 4: Scaffold with iterative execution (generates task with REAL values)
python ../tau_helper/run.py scaffold salesforce_management --variation variation_2 --instruction "..." --task-id task_006 --verbose

# Step 5: Copy the scaffolded code to tasks.py
# The code already contains real values from execution - no manual filling needed!

# Step 6: Validate the completed task
uv run alignerr validate --domain salesforce_management --variation variation_2 --task-id task_006
```

Why this workflow?
- Steps 1-3: Ensure the instruction is high quality, user-facing, and non-procedural
- Step 2: Detects missing information before scaffolding
- Step 4: Generates actions iteratively with real execution, producing ready-to-use code
- Step 5: Copy-paste the output - it's already complete with real values!
- Step 6: Validates the final task works correctly
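The workflow can also be driven from a script. A hedged sketch of a CLI wrapper (hypothetical helper; the `runner` parameter exists only to make it testable, and the flags mirror the commands above):

```python
import subprocess
import sys

def run_step(*args, runner="../tau_helper/run.py"):
    """Run one tau_helper CLI step and return its stdout.

    Hypothetical convenience wrapper; call it from the warrior-tau-bench/
    root, e.g.:
        run_step("evaluate", "You are Pat Manager...")
        run_step("map-sop", "salesforce_management",
                 "--variation", "variation_2", "--instruction", "...")
        run_step("scaffold", "salesforce_management",
                 "--variation", "variation_2",
                 "--instruction", "...", "--task-id", "task_006")
    """
    result = subprocess.run([sys.executable, runner, *args],
                            capture_output=True, text=True)
    if result.returncode != 0:
        # Surface the failing step and its stderr for debugging.
        raise RuntimeError(f"step {args[0]!r} failed:\n{result.stderr}")
    return result.stdout
```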
```
# Check if task_001 instruction is clear
python ../tau_helper/run.py map-sop sec --variation variation_2 --task task_001

# If ambiguous, you'll see:
# - Ambiguity explanation
# - Missing information
# - Suggested instruction fix (copy-paste ready)
```

```
# See what agent did wrong
python ../tau_helper/run.py agent-logs sec --variation variation_2 --task task_072 --compare

# Execute actions manually to reproduce
python ../tau_helper/run.py execute sec --variation variation_2 --task task_072 --show
python ../tau_helper/run.py execute sec --variation variation_2 --task task_072 --action 0
python ../tau_helper/run.py execute sec --variation variation_2 --task task_072 --db-state
```

```
# Execute all actions
python ../tau_helper/run.py execute sec --variation variation_2 --task task_001

# Check final database state
python ../tau_helper/run.py execute sec --variation variation_2 --task task_001 --db-state

# Reset and retry
python ../tau_helper/run.py execute sec --variation variation_2 --task task_001 --reset-db
```

```
tau_helper/
├── cli.py               # Main CLI entry point
├── llm.py               # LLM client (DEFAULT_MODEL, DEFAULT_MODEL_R)
├── evaluator.py         # Instruction evaluation
├── sop_mapper.py        # SOP chain mapping + ambiguity detection
├── action_executor.py   # Domain-agnostic action executor
├── agent_log_reader.py  # Agent log analysis
├── run.py               # Wrapper for easy execution
├── .env                 # Configuration
└── requirements.txt     # Dependencies
```
- README.md - This file (command reference)
- QUICKSTART.md - 5-minute quick start guide
- .cursorrules - Cursor IDE integration (auto-loaded for AI assistance)
```
# Get help
python ../tau_helper/run.py --help
python ../tau_helper/run.py <command> --help
```

Current version: 0.2.0