Version 0.1.0
Manager Agent Gym is a research platform for developing and evaluating autonomous agents that orchestrate complex workflows involving both human and AI collaborators. The library implements the Autonomous Manager Agent research challenge as described in the accompanying research paper, providing a complete POSG (Partially Observable Stochastic Game) framework for building and evaluating autonomous workflow management systems.
- **Hierarchical Task Decomposition**: AI managers break complex goals into executable task graphs using structured reasoning
- **Multi-Objective Optimization**: Balance competing objectives (cost, quality, time, oversight) under dynamic preferences
- **Ad Hoc Team Coordination**: Orchestrate mixed human and AI teams without prior joint training
- **Governance Compliance**: Maintain regulatory compliance while adapting to evolving constraints
- **Research Evaluation**: Comprehensive evaluation framework with multi-objective regret analysis
The system implements the formal framework ⟨I, S, b⁰, {Aᵢ}, {Oᵢ}, P, {Rᵢ}⟩:
- **I (Agents)**: Manager and worker agent implementations
- **S (State)**: `Workflow` containing tasks, resources, agents, and messages
- **Aᵢ (Actions)**: `BaseManagerAction` hierarchy for manager decisions
- **Oᵢ (Observations)**: `ManagerObservation` for partial state visibility
- **P (Transitions)**: `WorkflowExecutionEngine` manages state evolution
- **Rᵢ (Rewards)**: Multi-objective evaluation via `ValidationEngine` and regret analysis
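To make the tuple concrete, here is a toy, self-contained sketch of how the POSG components fit together in plain Python. All names here (`State`, `Observation`, `AssignAction`, `observe`, `transition`, `reward`) are illustrative stand-ins, not the library's classes:

```python
from dataclasses import dataclass, field


@dataclass
class State:  # S: workflow state (here just task sets)
    completed: set = field(default_factory=set)
    pending: set = field(default_factory=set)


@dataclass(frozen=True)
class Observation:  # O_i: a (possibly partial) view of S
    visible_pending: frozenset


@dataclass(frozen=True)
class AssignAction:  # A_i: one manager decision
    task: str


def observe(state: State) -> Observation:
    return Observation(frozenset(state.pending))


def transition(state: State, action: AssignAction) -> State:  # P
    if action.task in state.pending:
        return State(state.completed | {action.task}, state.pending - {action.task})
    return state


def reward(state: State) -> float:  # R_i: completion fraction as a toy reward
    total = len(state.completed) + len(state.pending)
    return len(state.completed) / total if total else 1.0


s = State(set(), {"draft", "review"})
s = transition(s, AssignAction("draft"))
print(reward(s))  # 0.5
```

The real library replaces each stand-in with the richer class named in the list above.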
```
manager_agent_gym/
├── core/                    # Core implementations
│   ├── manager_agent/       # Manager agent implementations
│   ├── workflow_agents/     # Worker agent implementations
│   ├── execution/           # Workflow execution engine
│   ├── evaluation/          # Validation and regret calculation
│   ├── communication/       # Agent communication system
│   ├── decomposition/       # Task decomposition services
│   └── common/              # Shared utilities and LLM interface
├── schemas/                 # Data models and type definitions
│   ├── core/                # POSG state components
│   ├── execution/           # Runtime state and actions
│   ├── evaluation/          # Success criteria and metrics
│   ├── preferences/         # Preferences and evaluators
│   └── workflow_agents/     # Agent configurations and outputs
└── examples/                # Progressive tutorials and demos
```
AI agents that observe workflow state and make strategic decisions:
- Assign tasks to specialized agents
- Create new tasks when needed through decomposition
- Monitor progress and adapt to changes
- Balance multiple preferences (quality, time, cost, oversight)
- Communicate with stakeholders to clarify requirements
Available Implementations:
- `ChainOfThoughtManagerAgent`: LLM-based structured decision making with constrained action generation
- `RandomManagerAgentV2`: Baseline random action selection for comparison
- `OneShotDelegateManagerAgent`: Simple assignment-only manager
Specialized agents that execute tasks:
- `AIAgent`: LLM-based task execution with structured tools and the OpenAI Agents SDK
- `MockHumanAgent`: Realistic human simulation with noise modeling and capacity constraints
- `StakeholderAgent`: Represents stakeholders who provide requirements and feedback
Collections of interconnected tasks with:
- Task dependencies and hierarchical subtask structures
- Resource requirements and outputs
- Regulatory constraints and governance rules
- Mixed human and AI agent teams
- Communication history and coordination state
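The dependency structure above drives scheduling: a task becomes runnable only once every task it depends on has completed. A minimal sketch of that readiness rule (plain Python, not the library's API):

```python
def ready_tasks(dependencies: dict[str, set[str]], completed: set[str]) -> set[str]:
    """Return tasks whose dependencies are all completed and that are not yet done."""
    return {
        task
        for task, deps in dependencies.items()
        if task not in completed and deps <= completed
    }


# A tiny three-task chain: collect data -> train model -> write report
deps = {
    "collect_data": set(),
    "train_model": {"collect_data"},
    "write_report": {"train_model"},
}
print(ready_tasks(deps, set()))             # {'collect_data'}
print(ready_tasks(deps, {"collect_data"}))  # {'train_model'}
```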
The simulation environment that:
- Runs discrete timesteps with manager observation and action phases
- Manages task execution asynchronously across multiple agents
- Tracks comprehensive metrics for evaluation
- Handles preference dynamics and stakeholder updates
- Provides callbacks for custom monitoring and analysis
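The heart of the engine is an observe-act-execute loop over discrete timesteps. The sketch below shows the shape of that loop with toy stand-ins (`ToyManager`, `run_loop` are illustrative, not the actual `WorkflowExecutionEngine` internals):

```python
import asyncio


class ToyManager:
    """Stand-in manager: always works on the first pending task."""

    async def take_action(self, observation: dict) -> str:
        pending = observation["pending"]
        return pending[0] if pending else "noop"


async def run_loop(manager: ToyManager, tasks: list[str], max_timesteps: int) -> list[str]:
    pending, done = list(tasks), []
    for _ in range(max_timesteps):
        if not pending:
            break
        observation = {"pending": pending, "done": done}  # manager observes state
        action = await manager.take_action(observation)   # manager chooses an action
        if action in pending:                             # engine applies the transition
            pending.remove(action)
            done.append(action)
    return done


completed = asyncio.run(run_loop(ToyManager(), ["a", "b", "c"], max_timesteps=10))
print(completed)  # ['a', 'b', 'c']
```

The real engine additionally runs worker tasks asynchronously, fires evaluation rubrics, and invokes registered callbacks at each timestep.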
Comprehensive multi-objective evaluation including:
- Preference adherence via rubric-based scoring
- Constraint compliance validation
- Workflow quality metrics (completion rate, coordination efficiency)
- Human-centric metrics (oversight burden, transparency)
- Regret analysis for multi-objective optimization
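At its simplest, multi-objective regret compares the preference-weighted score a manager achieved against the best score any candidate outcome would have earned under the same weights. The weights and scores below are invented for illustration; the library's actual regret machinery lives in its evaluation module:

```python
def weighted_score(weights: dict[str, float], scores: dict[str, float]) -> float:
    """Preference-weighted sum over the objectives named in `weights`."""
    return sum(weights[k] * scores[k] for k in weights)


def regret(weights: dict[str, float],
           achieved: dict[str, float],
           candidates: list[dict[str, float]]) -> float:
    """Gap between the best achievable weighted score and the achieved one."""
    best = max(weighted_score(weights, c) for c in candidates)
    return best - weighted_score(weights, achieved)


weights = {"quality": 0.4, "time": 0.3, "cost": 0.2, "oversight": 0.1}
achieved = {"quality": 0.8, "time": 0.6, "cost": 0.9, "oversight": 0.7}
alternatives = [achieved,
                {"quality": 0.9, "time": 0.7, "cost": 0.8, "oversight": 0.9}]
print(round(regret(weights, achieved, alternatives), 3))  # 0.07
```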
- Python 3.12+
- uv package manager (recommended)
- OpenAI API key (for LLM-based agents)
- Optional: Anthropic API key for Claude models
```bash
# Clone the repository
git clone https://github.com/your-org/manager-agent-gym
cd manager-agent-gym

# Install with uv (recommended)
uv pip install -e .

# Alternative: install with pip
pip install -e .

# Configure API keys
cp .env.example .env
# Edit the .env file with your API keys:
# OPENAI_API_KEY=sk-your-key-here
# ANTHROPIC_API_KEY=sk-ant-your-key-here  # Optional
```

Note: The library uses `pydantic-settings`, which automatically loads configuration from the `.env` file.
Key dependencies include:
- `pydantic` (2.10.5+): Type validation and data models
- `openai` (1.58.0+): LLM inference
- `litellm` (1.60.8+): Multi-model LLM interface
- `openai-agents` (0.2.4+): Structured agent execution
- `fastapi` (0.115.7+): Optional web interfaces
- `rich` (13.9.4+): Console output formatting
- `typer` (0.12.5+): CLI interfaces
```python
import asyncio

from manager_agent_gym import (
    ChainOfThoughtManagerAgent,
    WorkflowExecutionEngine,
    AgentRegistry,
    PreferenceWeights,
    Preference,
)

# Create preferences
preferences = PreferenceWeights(
    preferences=[
        Preference(name="quality", weight=0.4, description="High-quality deliverables"),
        Preference(name="time", weight=0.3, description="Reasonable timeline"),
        Preference(name="cost", weight=0.2, description="Cost-effective execution"),
        Preference(name="oversight", weight=0.1, description="Manageable oversight"),
    ]
)

# Create manager agent
manager = ChainOfThoughtManagerAgent(
    preferences=preferences,
    model_name="gpt-4o",
    manager_persona="Strategic Project Coordinator",
)

# Set up and run workflow
async def run_workflow():
    workflow = create_workflow()        # Your workflow creation
    stakeholder = create_stakeholder()  # Your stakeholder agent
    agent_registry = AgentRegistry()

    # Register agents
    for agent in workflow.agents.values():
        agent_registry.register_agent(agent)

    # Create execution engine
    engine = WorkflowExecutionEngine(
        workflow=workflow,
        agent_registry=agent_registry,
        manager_agent=manager,
        stakeholder_agent=stakeholder,
        max_timesteps=20,
        seed=42,
    )

    # Run execution
    results = await engine.run_full_execution()
    return results

# Run the workflow
results = asyncio.run(run_workflow())
```

Manager Agent Modes:
- `"cot"`: Chain of Thought manager (default, LLM-based)
- `"random"`: Random baseline for comparison
- `"assign_all"`: Simple one-shot delegation
Model Selection:
- `"gpt-4o"`, `"gpt-4o-mini"`: OpenAI GPT-4 variants
- `"gpt-5"`, `"gpt-5-mini"`: OpenAI GPT-5 (when available)
- `"o3"`: OpenAI o3 model (default)
- `"claude-3-5-sonnet"`: Anthropic Claude
- `"gemini-2.0-flash"`: Google Gemini
Environment Variables:
- `MAG_MANAGER_MODE`: Default manager mode
- `MAG_MODEL_NAME`: Default model name
- `OPENAI_API_KEY`: OpenAI API key
- `ANTHROPIC_API_KEY`: Anthropic API key
```python
class Workflow(BaseModel):
    # Identity
    id: UUID
    name: str
    workflow_goal: str
    owner_id: UUID

    # POSG components
    tasks: dict[UUID, Task]            # Task graph (G)
    resources: dict[UUID, Resource]    # Resource registry (R)
    agents: dict[str, AgentInterface]  # Available agents (W)
    messages: list[Message]            # Communication history (C)

    # Constraints and governance
    constraints: list[Constraint]

    # Execution state
    total_cost: float
    total_simulated_hours: float
    is_active: bool
```

```python
class Task(BaseModel):
    id: UUID
    name: str
    description: str
    status: TaskStatus  # PENDING, READY, RUNNING, COMPLETED, FAILED

    # Dependencies and hierarchy
    dependency_task_ids: list[UUID]
    subtasks: list[Task]

    # Assignment and execution
    assigned_agent_id: str | None
    estimated_duration_hours: float | None
    estimated_cost: float | None

    # Outputs
    output_resource_ids: list[UUID]
```

```python
class AIAgentConfig(AgentConfig):
    agent_type: str = "ai_agent"
    model_name: str = "gpt-4o"
    system_prompt: str
    max_concurrent_tasks: int = 3

class HumanAgentConfig(AgentConfig):
    agent_type: str = "human_agent"
    availability_schedule: str
    skill_areas: list[str]
    hourly_rate: float
```

Located in `examples/getting_started/`:
- `hello_manager_agent.py`: Complete workflow execution cycle
- `basic_agent_communication.py`: Agent coordination patterns
The library includes 20+ realistic business scenarios in `examples/end_to_end_examples/`:
Financial Services:
- `banking_license_application/`: Regulatory compliance workflow
- `icaap/`: Internal Capital Adequacy Assessment Process
- `orsa/`: Own Risk and Solvency Assessment
Legal & Compliance:
- `legal_global_data_breach/`: Crisis response and remediation
- `legal_contract_negotiation/`: Multi-party agreement workflows
- `legal_m_and_a/`: Merger and acquisition due diligence
Technology:
- `genai_feature_launch/`: AI product development lifecycle
- `tech_company_acquisition/`: Technical integration planning
- `data_science_analytics/`: ML model development pipeline
Marketing & Operations:
- `marketing_campaign/`: Multi-channel campaign execution
- `supply_chain_planning/`: Global logistics optimization
- `mnc_workforce_restructuring/`: Large-scale organizational change
Interactive CLI (Recommended):
```bash
# Interactive mode with scenario selection
python -m examples.cli

# Batch mode with specific scenarios
python -m examples.cli \
    --scenarios icaap data_science_analytics \
    --manager-mode cot \
    --model-name gpt-4o \
    --max-timesteps 30 \
    --parallel-jobs 4
```

Note: The CLI is the recommended way to run simulations, as it provides comprehensive experiment management and configuration options.
Programmatic Usage:
```python
from examples.run_examples import run_demo

# Run a specific scenario
results = await run_demo(
    workflow_name="icaap",
    max_timesteps=25,
    model_name="gpt-4o",
    manager_agent_mode="cot",
    seed=42,
)
```

- **Hierarchical Task Decomposition**
  - Moving beyond pattern matching to compositional reasoning
  - Systematic hierarchical planning for novel scenarios
  - Dynamic task creation and refinement
- **Multi-Objective Optimization**
  - Balancing competing objectives under non-stationary preferences
  - Adaptation without costly retraining
  - Preference learning from stakeholder feedback
- **Ad Hoc Team Coordination**
  - Orchestrating heterogeneous teams without prior coordination
  - Dynamic capability inference and role assignment
  - Mixed human-AI collaboration patterns
- **Governance by Design**
  - Maintaining compliance across dynamic workflows
  - Interpretable natural language constraint handling
  - Audit trails and transparency requirements
The platform implements comprehensive evaluation including:
Workflow-Level Quality:
- Task completion rates and success criteria
- Coordination efficiency and deadtime metrics
- Resource optimization and budget adherence
Compliance & Human-Centric:
- Oversight burden on human stakeholders
- Governance adherence and constraint violations
- Communication effectiveness and transparency
Preference Adherence:
- Multi-objective regret analysis
- Preference weight sensitivity
- Dynamic preference adaptation
Performance Metrics:
- Execution time and computational cost
- LLM token usage and API costs
- Scalability across workflow complexity
```python
class CustomManagerAgent(ManagerAgent):
    def __init__(self, preferences: PreferenceWeights):
        super().__init__("custom_manager", preferences)

    async def take_action(self, observation: ManagerObservation) -> BaseManagerAction:
        # Custom decision logic
        return action

# Register with factory
def create_custom_manager(preferences: PreferenceWeights) -> ManagerAgent:
    return CustomManagerAgent(preferences)
```

```python
from manager_agent_gym.schemas.preferences.rubric import WorkflowRubric

# Code-based rubric
def quality_validator(context: ValidationContext) -> EvaluatedScore:
    # Custom validation logic
    score = evaluate_quality(context.workflow)
    return EvaluatedScore(score=score, reasoning="Quality assessment")

quality_rubric = WorkflowRubric(
    name="quality_check",
    validator=quality_validator,
    max_score=1.0,
    run_condition=RunCondition.EACH_TIMESTEP,
)

# LLM-based rubric
llm_rubric = WorkflowRubric(
    name="stakeholder_satisfaction",
    llm_prompt="Evaluate stakeholder satisfaction based on communication quality...",
    max_score=1.0,
    model="gpt-4o",
)
```

```python
class CustomAgent(AgentInterface[AgentConfig]):
    async def execute_task(self, task: Task, resources: list[Resource]) -> ExecutionResult:
        # Custom task execution logic
        return ExecutionResult(
            success=True,
            outputs=[output_resource],
            metadata={"custom_metric": value},
        )
```

```python
from manager_agent_gym.core.execution.state_restorer import WorkflowStateRestorer

# Save workflow state
restorer = WorkflowStateRestorer()
checkpoint = await restorer.create_checkpoint(workflow, timestep=10)

# Restore from checkpoint
restored_workflow = await restorer.restore_from_checkpoint(checkpoint)
```

```python
class ExecutionResult(BaseModel):
    timestep: int
    workflow_state: str  # JSON snapshot
    manager_action: dict | None
    tasks_started: list[UUID]
    tasks_completed: list[UUID]
    tasks_failed: list[UUID]
    metrics: dict[str, float]
    evaluation_scores: dict[str, float]
```

The library provides analysis utilities in `analysis_outputs/`:
- Cost correction analysis
- Manager action pattern analysis
- Preference adherence tracking
- Cross-scenario performance comparison
```python
# Matplotlib integration for metrics visualization
import matplotlib.pyplot as plt

from manager_agent_gym.analysis import plot_execution_metrics

# Plot manager actions over time
plot_execution_metrics(execution_results)

# Generate performance reports
generate_performance_report(results, output_dir="reports/")
```

```
tests/
├── integration/          # End-to-end integration tests
├── unit/                 # Component unit tests
└── simulation_outputs/   # Test execution outputs
```
```bash
# Run all tests
pytest

# Run specific test categories
pytest tests/integration/
pytest tests/unit/

# Run with coverage
pytest --cov=manager_agent_gym
```

```python
# Unit test example
def test_workflow_validation():
    workflow = create_test_workflow()
    assert workflow.validate_task_graph()

# Integration test example
async def test_manager_agent_execution():
    manager = ChainOfThoughtManagerAgent(preferences)
    observation = create_test_observation()
    action = await manager.take_action(observation)
    assert isinstance(action, BaseManagerAction)
```

| Variable | Description | Default | Examples |
|---|---|---|---|
| `MAG_MANAGER_MODE` | Default manager type | `"cot"` | `"cot"`, `"random"`, `"assign_all"` |
| `MAG_MODEL_NAME` | Default LLM model | `"o3"` | `"gpt-4o"`, `"claude-3-5-sonnet"` |
| `OPENAI_API_KEY` | OpenAI API key | Required | `"sk-..."` |
| `ANTHROPIC_API_KEY` | Anthropic API key | Optional | `"sk-ant-..."` |
```python
from manager_agent_gym.schemas.config import OutputConfig

output_config = OutputConfig(
    base_dir="outputs/",
    save_workflow_snapshots=True,
    save_agent_communications=True,
    save_evaluation_details=True,
)
```

```bash
# Install development dependencies with uv (recommended)
uv pip install -e ".[dev]"

# Alternative: install with pip
pip install -e ".[dev]"

# Configure environment
cp .env.example .env
# Edit .env with your API keys

# Install pre-commit hooks
pre-commit install

# Run linting
ruff check manager_agent_gym/
ruff format manager_agent_gym/
```

Adding a new scenario:
- Create a scenario module in `examples/end_to_end_examples/`
- Implement the required functions:
  - `create_workflow()`
  - `create_preferences()`
  - `create_team_timeline()`
  - `create_preference_update_requests()`
  - `create_evaluator_to_measure_goal_achievement()`
- Register it in `examples/scenarios.py`

Adding a custom manager agent:
- Inherit from the `ManagerAgent` base class
- Implement the `take_action()` method
- Register in the factory system
- Add tests and documentation
Manager Agents:
- `ManagerAgent`: Abstract base class
- `ChainOfThoughtManagerAgent`: LLM-based structured manager
- `RandomManagerAgentV2`: Random baseline manager
Workflow Components:
- `Workflow`: Complete workflow state
- `Task`: Individual work items with dependencies
- `Resource`: Workflow artifacts and deliverables
- `AgentInterface`: Worker agent base class
Execution:
- `WorkflowExecutionEngine`: Main simulation engine
- `AgentRegistry`: Agent discovery and management
- `CommunicationService`: Inter-agent messaging
Evaluation:
- `ValidationEngine`: Rubric-based evaluation
- `Evaluator`: Preference evaluation configuration
- `WorkflowRubric`: Individual evaluation criteria
Available manager actions:
- `AssignTaskAction`: Assign tasks to agents
- `CreateTaskAction`: Create new tasks through decomposition
- `RefineTaskAction`: Modify existing task specifications
- `SendMessageAction`: Communicate with agents/stakeholders
- `UpdatePreferencesAction`: Adjust optimization weights
- `CreateResourceAction`: Define new workflow resources
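A closed action set like this is typically modeled as a tagged union so that model-emitted actions can be validated before execution. The sketch below illustrates that pattern in plain Python; the two toy classes and `parse_action` are stand-ins, not the library's `BaseManagerAction` schemas:

```python
from dataclasses import dataclass


@dataclass
class AssignTask:
    task_id: str
    agent_id: str


@dataclass
class SendMessage:
    recipient: str
    content: str


# The closed set of allowed action types, keyed by a discriminator tag
ACTIONS = {"assign_task": AssignTask, "send_message": SendMessage}


def parse_action(payload: dict):
    """Validate a model-emitted action dict against the allowed action set."""
    kind = payload.get("type")
    if kind not in ACTIONS:
        raise ValueError(f"unknown action type: {kind!r}")
    fields = {k: v for k, v in payload.items() if k != "type"}
    return ACTIONS[kind](**fields)


action = parse_action({"type": "assign_task", "task_id": "t1", "agent_id": "ai_writer"})
print(action)  # AssignTask(task_id='t1', agent_id='ai_writer')
```

Constraining generation to such a union is what lets an LLM manager emit only well-formed, executable actions.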
- **Research Paper**: See `paper.md` for theoretical foundations
- **API Documentation**: Complete reference in `docs/API.md`
- **Architecture Guide**: Technical details in `docs/SIMULATOR_ARCHITECTURE.md`
- **Research Guide**: Implementation guide in `docs/RESEARCH_GUIDE.md`
Manager Agent Gym v0.1.0 - Where AI learns to manage complex work in realistic environments.
For questions, issues, or contributions, please refer to the GitHub repository or contact the development team.