[ICML 2025] M3HF: Multi-agent Reinforcement Learning from Multi-phase Human Feedback of Mixed Quality
This repository contains the implementation of M3HF (Multi-agent Reinforcement Learning from Multi-phase Human Feedback of Mixed Quality), a novel framework that integrates multi-phase human feedback of mixed quality into multi-agent reinforcement learning (MARL) training processes.
M3HF addresses the significant challenge of designing effective reward functions in multi-agent reinforcement learning by incorporating human feedback of varying quality levels. The framework enables:
- Multi-phase feedback integration across training generations
- Mixed-quality feedback handling from both expert and non-expert humans
- LLM-powered feedback parsing for natural language instructions
- Adaptive reward shaping through predefined templates
- Robust learning with performance-based weight adjustments
Requirements:
- Python 3.8 or higher
- CUDA-capable GPU (recommended)
- OpenAI API key (optional, for LLM integration)
Installation:
git clone https://github.com/your-username/gym-macro-overcooked.git
cd gym-macro-overcooked
# Install system dependencies (Ubuntu/Debian)
sudo apt-get update
sudo apt-get install libsdl2-dev
# Install Python dependencies
pip install -e .

The package automatically installs all required dependencies:
- Core ML: torch, ray[rllib], gymnasium
- Environment: pygame, numpy, matplotlib
- Logging: wandb, tensorboard
- LLM Integration: openai (optional)
# Play the environment manually
python play.py --env_id Overcooked-MA-v1 --n_agent 3 --task 6 --map_type A

Two main scripts are available:
- m3hf.py: Full M3HF algorithm with human feedback integration and multi-generation training
- m3hf_main.py: Simplified training script for basic multi-agent PPO without the feedback loop
# Run M3HF with simulated human feedback
python m3hf.py \
--env_id Overcooked-MA-v1 \
--n_agent 3 \
--map_type A \
--task 6 \
--generations 5 \
--use_wandb \
--demo_mode
# Run with OpenAI integration for real human feedback
export OPENAI_API_KEY="your-api-key"
python m3hf.py \
--env_id Overcooked-MA-v1 \
--n_agent 3 \
--map_type A \
--task 6 \
--generations 5 \
--use_wandb
# Alternative: Simple training without feedback loop
python m3hf_main.py \
--env_id Overcooked-MA-v1 \
--num_workers 4 \
--training_iterations 100

For faster training with multiple GPUs:
# Custom GPU selection
python m3hf.py \
--env_id Overcooked-MA-v1 \
--n_agent 3 \
--map_type A \
--task 6 \
--generations 5 \
--num_gpus 2 \
--gpu_devices "0,2" \
--use_wandb \
--demo_mode

# Train IPPO baseline
python play_rllib_ippo.py \
--env_id Overcooked-MA-v1 \
--num_workers 4 \
--training_iterations 1000

M3HF operates through iterative generations of agent training and human feedback (a rough sketch of the loop follows the list below):
- Agent Training: Multi-agent policies trained with current reward functions
- Rollout Generation: Video demonstrations of agent behavior
- Human Feedback: Natural language feedback on agent performance
- LLM Parsing: Feedback converted to structured instructions
- Reward Shaping: New reward functions generated from templates
- Weight Update: Performance-based adjustment of reward weights
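A minimal sketch of this per-generation loop, assuming simplified interfaces; the function names (`train`, `rollout`, `collect_feedback`, `parse_feedback`, `update_weights`) are placeholders for illustration, not the actual APIs in m3hf.py:

```python
from typing import Callable, List, Tuple

# Placeholder type for a shaped-reward component: (obs, action) -> float
RewardFn = Callable[[list, int], float]

def run_m3hf(n_generations: int,
             train: Callable[[List[RewardFn], List[float]], Tuple[object, float]],
             rollout: Callable[[object], object],
             collect_feedback: Callable[[object], List[str]],
             parse_feedback: Callable[[List[str]], List[RewardFn]],
             update_weights: Callable[[List[float], List[RewardFn], float], List[float]]):
    """Illustrative M3HF outer loop: train -> rollout -> feedback -> parse -> reshape -> reweight."""
    reward_fns: List[RewardFn] = []   # reward components accumulated across generations
    weights: List[float] = []         # one weight per component
    policies = None
    for _ in range(n_generations):
        policies, score = train(reward_fns, weights)       # 1. agent training with current shaped reward
        video = rollout(policies)                          # 2. rollout generation for human review
        feedback = collect_feedback(video)                 # 3. natural-language human feedback
        new_fns = parse_feedback(feedback)                 # 4. LLM parsing into reward templates
        reward_fns.extend(new_fns)                         # 5. reward shaping with the new components
        weights = update_weights(weights, new_fns, score)  # 6. performance-based weight update
    return policies
```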
Key components of the framework:
- Multi-phase Human Feedback Markov Game (MHF-MG): Theoretical framework
- Feedback Quality Assessment: Automatic filtering of low-quality feedback
- Reward Function Templates: Distance, action, state, and cooperation-based rewards
- Meta-learning Weight Updates: Adaptive optimization of reward combinations (an illustrative sketch follows this list)
- Robustness Mechanisms: Handling of mixed-quality and contradictory feedback
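The exact meta-learning update is defined in the paper; the snippet below is only an illustrative performance-based scheme in the same spirit, where components associated with a performance drop are down-weighted and the weights are renormalized:

```python
import numpy as np

def adjust_weights(weights: np.ndarray, perf_delta: np.ndarray, lr: float = 0.1) -> np.ndarray:
    """Illustrative performance-based weight adjustment (not the paper's exact rule).

    weights:    current weights of the shaped-reward components
    perf_delta: performance change credited to each component
    """
    logits = np.log(np.clip(weights, 1e-8, None)) + lr * perf_delta
    shifted = np.exp(logits - logits.max())   # softmax-style renormalization
    return shifted / shifted.sum()

# Example: the second component coincided with a performance drop, so its weight shrinks.
print(adjust_weights(np.array([0.5, 0.3, 0.2]), np.array([1.0, -2.0, 0.5])).round(3))
```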
The environment simulates a cooperative cooking task where 3 agents must prepare salads:
| Map A | Map B | Map C |
|---|---|---|
| *(Map A layout image)* | *(Map B layout image)* | *(Map C layout image)* |
TASKLIST = [
"tomato salad", # task 0
"lettuce salad", # task 1
"onion salad", # task 2
"lettuce-tomato salad", # task 3
"onion-tomato salad", # task 4
"lettuce-onion salad", # task 5
"lettuce-onion-tomato salad" # task 6 (most complex)
]

Vector Observation (32-dimensional):
obs = [
tomato.x, tomato.y, tomato.status, # [0:3]
lettuce.x, lettuce.y, lettuce.status, # [3:6]
onion.x, onion.y, onion.status, # [6:9]
plate1.x, plate1.y, # [9:11]
plate2.x, plate2.y, # [11:13]
knife1.x, knife1.y, # [13:15]
knife2.x, knife2.y, # [15:17]
delivery.x, delivery.y, # [17:19]
agent1.x, agent1.y, # [19:21]
agent2.x, agent2.y, # [21:23]
agent3.x, agent3.y, # [23:25]
task_onehot # [25:32]
]
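For reference, a small helper (hypothetical, not part of the repository) that names the slices of this 32-dimensional vector using the index layout above:

```python
def parse_obs(obs):
    """Split the flat 32-dimensional observation into named fields (illustrative only)."""
    assert len(obs) == 32
    return {
        "tomato":   {"pos": (obs[0], obs[1]),  "status": obs[2]},
        "lettuce":  {"pos": (obs[3], obs[4]),  "status": obs[5]},
        "onion":    {"pos": (obs[6], obs[7]),  "status": obs[8]},
        "plate1":   (obs[9],  obs[10]),
        "plate2":   (obs[11], obs[12]),
        "knife1":   (obs[13], obs[14]),
        "knife2":   (obs[15], obs[16]),
        "delivery": (obs[17], obs[18]),
        "agent1":   (obs[19], obs[20]),
        "agent2":   (obs[21], obs[22]),
        "agent3":   (obs[23], obs[24]),
        "task":     obs[25:32],   # one-hot over the 7 tasks in TASKLIST
    }
```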
Macro Actions:
- Navigation: get_tomato, get_lettuce, get_onion
- Interaction: get_plate1, get_plate2, chop, deliver
- Movement: go_knife1, go_knife2, go_counter
- Basic: stay, up, down, left, right
# Example feedback for coordination improvement
feedback = "The red chef should get the tomato first, then the green chef can take it and cut it as quickly as possible."
# Example feedback for task efficiency
feedback = "Agents should work in parallel - one chopping while another gets plates."
# Example feedback for error correction
feedback = "Don't deliver wrong salads! Make sure all ingredients are properly chopped and combined."

You can extend M3HF by adding new reward function templates to the LLM prompts. The templates are defined in REWARD_FUNCTION_BUILD_PROMPT (in prompt.py) and used by the LLM to generate reward functions based on human feedback:
# Edit prompt.py to add new templates to REWARD_FUNCTION_BUILD_PROMPT
# Existing templates in the prompt:
# 1. Distance-based:
# lambda obs, act: -sqrt((obs[e1_x] - obs[e2_x])**2 + (obs[e1_y] - obs[e2_y])**2)
# Example: lambda obs, act: -sqrt((obs[19] - obs[0])**2 + (obs[20] - obs[1])**2) # Distance between agent 1 and tomato
# 2. Action-based:
# lambda obs, act: 1 if act == desired_action else 0
# Example: lambda obs, act: 1 if act == 5 else 0 # Reward for 'Interact' action
# 3. Status-based:
# lambda obs, act: 1 if obs[e_status] == desired_status else 0
# Example: lambda obs, act: 1 if obs[2] == 1 else 0 # Reward if tomato is chopped
# 4. Proximity-based:
# lambda obs, act: r_prox if sqrt((obs[e1_x] - obs[e2_x])**2 + (obs[e1_y] - obs[e2_y])**2) <= d else 0
# Example: lambda obs, act: 0.5 if sqrt((obs[19] - obs[13])**2 + (obs[20] - obs[14])**2) <= 1 else 0 # Reward if agent 1 is near knife 1
# 5. Time-based penalty:
# lambda obs, act, t: -beta * t
# Example: lambda obs, act, t: -0.001 * t # Increasing penalty over time
# 6. Success-based:
# lambda obs, act: r_success if goal_condition_met(obs) else 0
# Example: lambda obs, act: 10 if obs[25] == 1 and obs[17] == obs[19] and obs[18] == obs[20] else 0 # Reward for delivering completed order
# 7. Energy-based penalty:
# lambda obs, act: -gamma * energy_cost(act)
# Example: lambda obs, act: -0.2 * (1 if act != 0 else 0) # Penalty for non-zero actions
# 8. Composite reward:
# lambda obs, act: sum(weight_i * reward_function_i(obs, act) for i in range(n))
# To add custom templates:
# 1. Edit prompt.py
# 2. Add your template to REWARD_FUNCTION_BUILD_PROMPT
# 3. Provide examples showing how to parameterize it
# 4. Include documentation for the LLM to understand when to use it
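Putting the templates together, here is a hedged sketch of how parsed feedback could be combined via the composite template (#8). The structured format and helper below are illustrative assumptions for demonstration, not the actual output schema of the LLM parser:

```python
from math import sqrt

# Illustrative only: each entry pairs a weight with a reward template instantiated
# from a piece of parsed feedback, using the observation indices documented above.
parsed_feedback = [
    # "The red chef should get the tomato first" -> pull agent 1 toward the tomato
    {"weight": 0.5,
     "fn": lambda obs, act: -sqrt((obs[19] - obs[0])**2 + (obs[20] - obs[1])**2)},
    # "cut it as quickly as possible" -> reward a chopped tomato (status flag)
    {"weight": 1.0,
     "fn": lambda obs, act: 1 if obs[2] == 1 else 0},
]

def composite_reward(obs, act):
    """Composite template (#8): weighted sum of the parsed reward components."""
    return sum(item["weight"] * item["fn"](obs, act) for item in parsed_feedback)
```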
Common issues:

- OpenAI API Errors:
      export OPENAI_API_KEY="your-key-here"
      # Or disable LLM integration:
      python m3hf_main.py --no_openai
- Ray/RLlib Issues:
      # Check Ray version compatibility
      pip install ray[rllib]==2.0.0
- SDL Dependencies:
      sudo apt-get install libsdl2-dev libsdl2-image-dev libsdl2-mixer-dev
- CUDA/GPU Issues:
      # Check CUDA installation and GPU availability
      nvidia-smi
      # Use CPU-only mode
      python m3hf.py --num_gpus 0
      # Select specific GPUs to avoid conflicts
      python m3hf.py --gpu_devices "0,2" --num_gpus 2
      # Check GPU memory usage and free it up if needed
      nvidia-smi --query-gpu=memory.used,memory.total --format=csv
- GPU Memory Issues:
      # Reduce batch sizes if running out of memory
      # (edit m3hf.py and lower train_batch_size and sgd_minibatch_size),
      # or use fewer workers
      python m3hf.py --num_workers 2 --workers_per_cpu 1
      # Monitor GPU usage during training
      watch -n 1 nvidia-smi
Performance tuning:

- Multi-GPU Training:
      # Use specific GPUs for optimal performance
      python m3hf.py --num_gpus 2 --gpu_devices "0,2"
- Parallel Workers:
      # Increase workers based on CPU cores (recommended: CPU_cores / 2)
      python m3hf.py --num_workers 8 --workers_per_cpu 2
- Memory Usage:
      # Monitor memory and adjust if needed
      # Batch sizes are auto-optimized for multi-GPU setups
      # Default: train_batch_size=20480, sgd_minibatch_size=2048
- Hardware Recommendations:
- Minimum: 1 GPU, 8 CPU cores, 16GB RAM
- Recommended: 2+ GPUs, 16+ CPU cores, 32GB+ RAM
- Optimal: 3+ GPUs (A30/V100/A100), 32+ CPU cores, 64GB+ RAM
- Wandb Logging: Enable experiment tracking with the --use_wandb flag
If you use this code in your research, please cite:
@inproceedings{wang2025m3hf,
title={M3HF: Multi-agent Reinforcement Learning from Multi-phase Human Feedback of Mixed Quality},
author={Wang, Ziyan and Zhang, Zhicheng and Fang, Fei and Du, Yali},
booktitle={Proceedings of the 42nd International Conference on Machine Learning},
year={2025}
}