Version 0.1.0
Manager Agent Gym is a research platform for developing and evaluating autonomous agents that orchestrate complex workflows involving both human and AI collaborators. The library implements the Autonomous Manager Agent research challenge as described in the accompanying research paper, providing a complete POSG (Partially Observable Stochastic Game) framework for building and evaluating autonomous workflow management systems.
- **Hierarchical Task Decomposition**: AI managers break complex goals into executable task graphs using structured reasoning
- **Multi-Objective Optimization**: Balance competing objectives (cost, quality, time, oversight) under dynamic preferences
- **Ad Hoc Team Coordination**: Orchestrate mixed human and AI teams without prior joint training
- **Governance Compliance**: Maintain regulatory compliance while adapting to evolving constraints
- **Research Evaluation**: Comprehensive evaluation framework with multi-objective regret analysis
The system implements the formal framework ⟨I, S, b⁰, {Aᵢ}, {Oᵢ}, P, {Rᵢ}⟩:
- **I (Agents)**: Manager and worker agent implementations
- **S (State)**: `Workflow` containing tasks, resources, agents, and messages
- **Aᵢ (Actions)**: `BaseManagerAction` hierarchy for manager decisions
- **Oᵢ (Observations)**: `ManagerObservation` for partial state visibility
- **P (Transitions)**: `WorkflowExecutionEngine` manages state evolution
- **Rᵢ (Rewards)**: Multi-objective evaluation via `ValidationEngine` and regret analysis
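To make the tuple concrete, here is a toy, self-contained sketch of how the POSG components fit together in plain Python. All names here (`State`, `Observation`, `AssignAction`, `observe`, `transition`, `reward`) are illustrative stand-ins, not the library's classes:

```python
from dataclasses import dataclass, field


@dataclass
class State:  # S: workflow state (here just task sets)
    completed: set = field(default_factory=set)
    pending: set = field(default_factory=set)


@dataclass(frozen=True)
class Observation:  # O_i: a (possibly partial) view of S
    visible_pending: frozenset


@dataclass(frozen=True)
class AssignAction:  # A_i: one manager decision
    task: str


def observe(state: State) -> Observation:
    return Observation(frozenset(state.pending))


def transition(state: State, action: AssignAction) -> State:  # P
    if action.task in state.pending:
        return State(state.completed | {action.task}, state.pending - {action.task})
    return state


def reward(state: State) -> float:  # R_i: completion fraction as a toy reward
    total = len(state.completed) + len(state.pending)
    return len(state.completed) / total if total else 1.0


s = State(set(), {"draft", "review"})
s = transition(s, AssignAction("draft"))
print(reward(s))  # 0.5
```

The real library replaces each stand-in with the richer class named in the list above.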
```
manager_agent_gym/
├── core/                    # Core implementations
│   ├── manager_agent/       # Manager agent implementations
│   ├── workflow_agents/     # Worker agent implementations
│   ├── execution/           # Workflow execution engine
│   ├── evaluation/          # Validation and regret calculation
│   ├── communication/       # Agent communication system
│   ├── decomposition/       # Task decomposition services
│   └── common/              # Shared utilities and LLM interface
├── schemas/                 # Data models and type definitions
│   ├── core/                # POSG state components
│   ├── execution/           # Runtime state and actions
│   ├── evaluation/          # Success criteria and metrics
│   ├── preferences/         # Preferences and evaluators
│   └── workflow_agents/     # Agent configurations and outputs
└── examples/                # Progressive tutorials and demos
```
AI agents that observe workflow state and make strategic decisions:
- Assign tasks to specialized agents
- Create new tasks when needed through decomposition
- Monitor progress and adapt to changes
- Balance multiple preferences (quality, time, cost, oversight)
- Communicate with stakeholders to clarify requirements
Available Implementations:
- `ChainOfThoughtManagerAgent`: LLM-based structured decision making with constrained action generation
- `RandomManagerAgentV2`: Baseline random action selection for comparison
- `OneShotDelegateManagerAgent`: Simple assignment-only manager
Specialized agents that execute tasks:
- `AIAgent`: LLM-based task execution with structured tools and the OpenAI Agents SDK
- `MockHumanAgent`: Realistic human simulation with noise modeling and capacity constraints
- `StakeholderAgent`: Represents stakeholders who provide requirements and feedback
Collections of interconnected tasks with:
- Task dependencies and hierarchical subtask structures
- Resource requirements and outputs
- Regulatory constraints and governance rules
- Mixed human and AI agent teams
- Communication history and coordination state
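The dependency structure above drives scheduling: a task becomes runnable only once every task it depends on has completed. A minimal sketch of that readiness rule (plain Python, not the library's API):

```python
def ready_tasks(dependencies: dict[str, set[str]], completed: set[str]) -> set[str]:
    """Return tasks whose dependencies are all completed and that are not yet done."""
    return {
        task
        for task, deps in dependencies.items()
        if task not in completed and deps <= completed
    }


# A tiny three-task chain: collect data -> train model -> write report
deps = {
    "collect_data": set(),
    "train_model": {"collect_data"},
    "write_report": {"train_model"},
}
print(ready_tasks(deps, set()))             # {'collect_data'}
print(ready_tasks(deps, {"collect_data"}))  # {'train_model'}
```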
The simulation environment that:
- Runs discrete timesteps with manager observation and action phases
- Manages task execution asynchronously across multiple agents
- Tracks comprehensive metrics for evaluation
- Handles preference dynamics and stakeholder updates
- Provides callbacks for custom monitoring and analysis
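The heart of the engine is an observe-act-execute loop over discrete timesteps. The sketch below shows the shape of that loop with toy stand-ins (`ToyManager`, `run_loop` are illustrative, not the actual `WorkflowExecutionEngine` internals):

```python
import asyncio


class ToyManager:
    """Stand-in manager: always works on the first pending task."""

    async def take_action(self, observation: dict) -> str:
        pending = observation["pending"]
        return pending[0] if pending else "noop"


async def run_loop(manager: ToyManager, tasks: list[str], max_timesteps: int) -> list[str]:
    pending, done = list(tasks), []
    for _ in range(max_timesteps):
        if not pending:
            break
        observation = {"pending": pending, "done": done}  # manager observes state
        action = await manager.take_action(observation)   # manager chooses an action
        if action in pending:                             # engine applies the transition
            pending.remove(action)
            done.append(action)
    return done


completed = asyncio.run(run_loop(ToyManager(), ["a", "b", "c"], max_timesteps=10))
print(completed)  # ['a', 'b', 'c']
```

The real engine additionally runs worker tasks asynchronously, fires evaluation rubrics, and invokes registered callbacks at each timestep.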
Comprehensive multi-objective evaluation including:
- Preference adherence via rubric-based scoring
- Constraint compliance validation
- Workflow quality metrics (completion rate, coordination efficiency)
- Human-centric metrics (oversight burden, transparency)
- Regret analysis for multi-objective optimization
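At its simplest, multi-objective regret compares the preference-weighted score a manager achieved against the best score any candidate outcome would have earned under the same weights. The weights and scores below are invented for illustration; the library's actual regret machinery lives in its evaluation module:

```python
def weighted_score(weights: dict[str, float], scores: dict[str, float]) -> float:
    """Preference-weighted sum over the objectives named in `weights`."""
    return sum(weights[k] * scores[k] for k in weights)


def regret(weights: dict[str, float],
           achieved: dict[str, float],
           candidates: list[dict[str, float]]) -> float:
    """Gap between the best achievable weighted score and the achieved one."""
    best = max(weighted_score(weights, c) for c in candidates)
    return best - weighted_score(weights, achieved)


weights = {"quality": 0.4, "time": 0.3, "cost": 0.2, "oversight": 0.1}
achieved = {"quality": 0.8, "time": 0.6, "cost": 0.9, "oversight": 0.7}
alternatives = [achieved,
                {"quality": 0.9, "time": 0.7, "cost": 0.8, "oversight": 0.9}]
print(round(regret(weights, achieved, alternatives), 3))  # 0.07
```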
- Python 3.12+
- uv package manager (recommended)
- OpenAI API key (for LLM-based agents)
- Optional: Anthropic API key for Claude models
```bash
# Clone the repository
git clone https://github.com/your-org/manager-agent-gym
cd manager-agent-gym

# Install with uv (recommended)
uv pip install -e .

# Alternative: install with pip
pip install -e .

# Configure API keys
cp .env.example .env
# Edit the .env file with your API keys:
# OPENAI_API_KEY=sk-your-key-here
# ANTHROPIC_API_KEY=sk-ant-your-key-here  # Optional
```

Note: The library uses `pydantic-settings`, which automatically loads configuration from the `.env` file.
Key dependencies include:
- `pydantic` (2.10.5+): Type validation and data models
- `openai` (1.58.0+): LLM inference
- `litellm` (1.60.8+): Multi-model LLM interface
- `openai-agents` (0.2.4+): Structured agent execution
- `fastapi` (0.115.7+): Optional web interfaces
- `rich` (13.9.4+): Console output formatting
- `typer` (0.12.5+): CLI interfaces
```python
import asyncio

from manager_agent_gym import (
    ChainOfThoughtManagerAgent,
    WorkflowExecutionEngine,
    AgentRegistry,
    PreferenceWeights,
    Preference,
)

# Create preferences
preferences = PreferenceWeights(
    preferences=[
        Preference(name="quality", weight=0.4, description="High-quality deliverables"),
        Preference(name="time", weight=0.3, description="Reasonable timeline"),
        Preference(name="cost", weight=0.2, description="Cost-effective execution"),
        Preference(name="oversight", weight=0.1, description="Manageable oversight"),
    ]
)

# Create manager agent
manager = ChainOfThoughtManagerAgent(
    preferences=preferences,
    model_name="gpt-4o",
    manager_persona="Strategic Project Coordinator",
)

# Set up and run workflow
async def run_workflow():
    workflow = create_workflow()        # Your workflow creation
    stakeholder = create_stakeholder()  # Your stakeholder agent
    agent_registry = AgentRegistry()

    # Register agents
    for agent in workflow.agents.values():
        agent_registry.register_agent(agent)

    # Create execution engine
    engine = WorkflowExecutionEngine(
        workflow=workflow,
        agent_registry=agent_registry,
        manager_agent=manager,
        stakeholder_agent=stakeholder,
        max_timesteps=20,
        seed=42,
    )

    # Run execution
    results = await engine.run_full_execution()
    return results

# Run the workflow
results = asyncio.run(run_workflow())
```

Manager Agent Modes:
- `"cot"`: Chain of Thought manager (default, LLM-based)
- `"random"`: Random baseline for comparison
- `"assign_all"`: Simple one-shot delegation
Model Selection:
- `"gpt-4o"`, `"gpt-4o-mini"`: OpenAI GPT-4 variants
- `"gpt-5"`, `"gpt-5-mini"`: OpenAI GPT-5 (when available)
- `"o3"`: OpenAI o3 model (default)
- `"claude-3-5-sonnet"`: Anthropic Claude
- `"gemini-2.0-flash"`: Google Gemini
Environment Variables:
- `MAG_MANAGER_MODE`: Default manager mode
- `MAG_MODEL_NAME`: Default model name
- `OPENAI_API_KEY`: OpenAI API key
- `ANTHROPIC_API_KEY`: Anthropic API key
```python
class Workflow(BaseModel):
    # Identity
    id: UUID
    name: str
    workflow_goal: str
    owner_id: UUID

    # POSG components
    tasks: dict[UUID, Task]            # Task graph (G)
    resources: dict[UUID, Resource]    # Resource registry (R)
    agents: dict[str, AgentInterface]  # Available agents (W)
    messages: list[Message]            # Communication history (C)

    # Constraints and governance
    constraints: list[Constraint]

    # Execution state
    total_cost: float
    total_simulated_hours: float
    is_active: bool
```

```python
class Task(BaseModel):
    id: UUID
    name: str
    description: str
    status: TaskStatus  # PENDING, READY, RUNNING, COMPLETED, FAILED

    # Dependencies and hierarchy
    dependency_task_ids: list[UUID]
    subtasks: list[Task]

    # Assignment and execution
    assigned_agent_id: str | None
    estimated_duration_hours: float | None
    estimated_cost: float | None

    # Outputs
    output_resource_ids: list[UUID]
```

```python
class AIAgentConfig(AgentConfig):
    agent_type: str = "ai_agent"
    model_name: str = "gpt-4o"
    system_prompt: str
    max_concurrent_tasks: int = 3

class HumanAgentConfig(AgentConfig):
    agent_type: str = "human_agent"
    availability_schedule: str
    skill_areas: list[str]
    hourly_rate: float
```

Located in `examples/getting_started/`:
- `hello_manager_agent.py`: Complete workflow execution cycle
- `basic_agent_communication.py`: Agent coordination patterns
The library includes 20+ realistic business scenarios in `examples/end_to_end_examples/`:
Financial Services:
- `banking_license_application/`: Regulatory compliance workflow
- `icaap/`: Internal Capital Adequacy Assessment Process
- `orsa/`: Own Risk and Solvency Assessment
Legal & Compliance:
- `legal_global_data_breach/`: Crisis response and remediation
- `legal_contract_negotiation/`: Multi-party agreement workflows
- `legal_m_and_a/`: Merger and acquisition due diligence
Technology:
- `genai_feature_launch/`: AI product development lifecycle
- `tech_company_acquisition/`: Technical integration planning
- `data_science_analytics/`: ML model development pipeline
Marketing & Operations:
- `marketing_campaign/`: Multi-channel campaign execution
- `supply_chain_planning/`: Global logistics optimization
- `mnc_workforce_restructuring/`: Large-scale organizational change
Interactive CLI (Recommended):
```bash
# Interactive mode with scenario selection
python -m examples.cli

# Batch mode with specific scenarios
python -m examples.cli \
    --scenarios icaap data_science_analytics \
    --manager-mode cot \
    --model-name gpt-4o \
    --max-timesteps 30 \
    --parallel-jobs 4
```

Note: The CLI is the recommended way to run simulations, as it provides comprehensive experiment management and configuration options.
Programmatic Usage:
```python
from examples.run_examples import run_demo

# Run a specific scenario
results = await run_demo(
    workflow_name="icaap",
    max_timesteps=25,
    model_name="gpt-4o",
    manager_agent_mode="cot",
    seed=42,
)
```

- **Hierarchical Task Decomposition**
  - Moving beyond pattern matching to compositional reasoning
  - Systematic hierarchical planning for novel scenarios
  - Dynamic task creation and refinement
- **Multi-Objective Optimization**
  - Balancing competing objectives under non-stationary preferences
  - Adaptation without costly retraining
  - Preference learning from stakeholder feedback
- **Ad Hoc Team Coordination**
  - Orchestrating heterogeneous teams without prior coordination
  - Dynamic capability inference and role assignment
  - Mixed human-AI collaboration patterns
- **Governance by Design**
  - Maintaining compliance across dynamic workflows
  - Interpretable natural language constraint handling
  - Audit trails and transparency requirements
The platform implements comprehensive evaluation including:
Workflow-Level Quality:
- Task completion rates and success criteria
- Coordination efficiency and deadtime metrics
- Resource optimization and budget adherence
Compliance & Human-Centric:
- Oversight burden on human stakeholders
- Governance adherence and constraint violations
- Communication effectiveness and transparency
Preference Adherence:
- Multi-objective regret analysis
- Preference weight sensitivity
- Dynamic preference adaptation
Performance Metrics:
- Execution time and computational cost
- LLM token usage and API costs
- Scalability across workflow complexity
```python
class CustomManagerAgent(ManagerAgent):
    def __init__(self, preferences: PreferenceWeights):
        super().__init__("custom_manager", preferences)

    async def take_action(self, observation: ManagerObservation) -> BaseManagerAction:
        # Custom decision logic
        return action

# Register with factory
def create_custom_manager(preferences: PreferenceWeights) -> ManagerAgent:
    return CustomManagerAgent(preferences)
```

```python
from manager_agent_gym.schemas.preferences.rubric import WorkflowRubric

# Code-based rubric
def quality_validator(context: ValidationContext) -> EvaluatedScore:
    # Custom validation logic
    score = evaluate_quality(context.workflow)
    return EvaluatedScore(score=score, reasoning="Quality assessment")

quality_rubric = WorkflowRubric(
    name="quality_check",
    validator=quality_validator,
    max_score=1.0,
    run_condition=RunCondition.EACH_TIMESTEP,
)

# LLM-based rubric
llm_rubric = WorkflowRubric(
    name="stakeholder_satisfaction",
    llm_prompt="Evaluate stakeholder satisfaction based on communication quality...",
    max_score=1.0,
    model="gpt-4o",
)
```

```python
class CustomAgent(AgentInterface[AgentConfig]):
    async def execute_task(self, task: Task, resources: list[Resource]) -> ExecutionResult:
        # Custom task execution logic
        return ExecutionResult(
            success=True,
            outputs=[output_resource],
            metadata={"custom_metric": value},
        )
```

```python
from manager_agent_gym.core.execution.state_restorer import WorkflowStateRestorer

# Save workflow state
restorer = WorkflowStateRestorer()
checkpoint = await restorer.create_checkpoint(workflow, timestep=10)

# Restore from checkpoint
restored_workflow = await restorer.restore_from_checkpoint(checkpoint)
```

```python
class ExecutionResult(BaseModel):
    timestep: int
    workflow_state: str  # JSON snapshot
    manager_action: dict | None
    tasks_started: list[UUID]
    tasks_completed: list[UUID]
    tasks_failed: list[UUID]
    metrics: dict[str, float]
    evaluation_scores: dict[str, float]
```

The library provides analysis utilities in `analysis_outputs/`:
- Cost correction analysis
- Manager action pattern analysis
- Preference adherence tracking
- Cross-scenario performance comparison
```python
# Matplotlib integration for metrics visualization
import matplotlib.pyplot as plt

from manager_agent_gym.analysis import plot_execution_metrics

# Plot manager actions over time
plot_execution_metrics(execution_results)

# Generate performance reports
generate_performance_report(results, output_dir="reports/")
```

```
tests/
├── integration/          # End-to-end integration tests
├── unit/                 # Component unit tests
└── simulation_outputs/   # Test execution outputs
```
```bash
# Run all tests
pytest

# Run specific test categories
pytest tests/integration/
pytest tests/unit/

# Run with coverage
pytest --cov=manager_agent_gym
```

```python
# Unit test example
def test_workflow_validation():
    workflow = create_test_workflow()
    assert workflow.validate_task_graph()

# Integration test example
async def test_manager_agent_execution():
    manager = ChainOfThoughtManagerAgent(preferences)
    observation = create_test_observation()
    action = await manager.take_action(observation)
    assert isinstance(action, BaseManagerAction)
```

| Variable | Description | Default | Examples |
|---|---|---|---|
| `MAG_MANAGER_MODE` | Default manager type | `"cot"` | `"cot"`, `"random"`, `"assign_all"` |
| `MAG_MODEL_NAME` | Default LLM model | `"o3"` | `"gpt-4o"`, `"claude-3-5-sonnet"` |
| `OPENAI_API_KEY` | OpenAI API key | Required | `"sk-..."` |
| `ANTHROPIC_API_KEY` | Anthropic API key | Optional | `"sk-ant-..."` |
```python
from manager_agent_gym.schemas.config import OutputConfig

output_config = OutputConfig(
    base_dir="outputs/",
    save_workflow_snapshots=True,
    save_agent_communications=True,
    save_evaluation_details=True,
)
```

```bash
# Install development dependencies with uv (recommended)
uv pip install -e ".[dev]"

# Alternative: install with pip
pip install -e ".[dev]"

# Configure environment
cp .env.example .env
# Edit .env with your API keys

# Install pre-commit hooks
pre-commit install

# Run linting
ruff check manager_agent_gym/
ruff format manager_agent_gym/
```

Adding a new scenario:
- Create a scenario module in `examples/end_to_end_examples/`
- Implement the required functions:
  - `create_workflow()`
  - `create_preferences()`
  - `create_team_timeline()`
  - `create_preference_update_requests()`
  - `create_evaluator_to_measure_goal_achievement()`
- Register it in `examples/scenarios.py`

Adding a custom manager agent:
- Inherit from the `ManagerAgent` base class
- Implement the `take_action()` method
- Register in the factory system
- Add tests and documentation
Manager Agents:
- `ManagerAgent`: Abstract base class
- `ChainOfThoughtManagerAgent`: LLM-based structured manager
- `RandomManagerAgentV2`: Random baseline manager
Workflow Components:
- `Workflow`: Complete workflow state
- `Task`: Individual work items with dependencies
- `Resource`: Workflow artifacts and deliverables
- `AgentInterface`: Worker agent base class
Execution:
- `WorkflowExecutionEngine`: Main simulation engine
- `AgentRegistry`: Agent discovery and management
- `CommunicationService`: Inter-agent messaging
Evaluation:
- `ValidationEngine`: Rubric-based evaluation
- `Evaluator`: Preference evaluation configuration
- `WorkflowRubric`: Individual evaluation criteria
Available manager actions:
- `AssignTaskAction`: Assign tasks to agents
- `CreateTaskAction`: Create new tasks through decomposition
- `RefineTaskAction`: Modify existing task specifications
- `SendMessageAction`: Communicate with agents/stakeholders
- `UpdatePreferencesAction`: Adjust optimization weights
- `CreateResourceAction`: Define new workflow resources
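A closed action set like this is typically modeled as a tagged union so that model-emitted actions can be validated before execution. The sketch below illustrates that pattern in plain Python; the two toy classes and `parse_action` are stand-ins, not the library's `BaseManagerAction` schemas:

```python
from dataclasses import dataclass


@dataclass
class AssignTask:
    task_id: str
    agent_id: str


@dataclass
class SendMessage:
    recipient: str
    content: str


# The closed set of allowed action types, keyed by a discriminator tag
ACTIONS = {"assign_task": AssignTask, "send_message": SendMessage}


def parse_action(payload: dict):
    """Validate a model-emitted action dict against the allowed action set."""
    kind = payload.get("type")
    if kind not in ACTIONS:
        raise ValueError(f"unknown action type: {kind!r}")
    fields = {k: v for k, v in payload.items() if k != "type"}
    return ACTIONS[kind](**fields)


action = parse_action({"type": "assign_task", "task_id": "t1", "agent_id": "ai_writer"})
print(action)  # AssignTask(task_id='t1', agent_id='ai_writer')
```

Constraining generation to such a union is what lets an LLM manager emit only well-formed, executable actions.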
- **Research Paper**: See `paper.md` for theoretical foundations
- **API Documentation**: Complete reference in `docs/API.md`
- **Architecture Guide**: Technical details in `docs/SIMULATOR_ARCHITECTURE.md`
- **Research Guide**: Implementation guide in `docs/RESEARCH_GUIDE.md`
Manager Agent Gym v0.1.0 - Where AI learns to manage complex work in realistic environments.
For questions, issues, or contributions, please refer to the GitHub repository or contact the development team.