[ICML 2025] M3HF: Multi-agent Reinforcement Learning from Multi-phase Human Feedback of Mixed Quality
This repository contains the implementation of M3HF (Multi-agent Reinforcement Learning from Multi-phase Human Feedback of Mixed Quality), a novel framework that integrates multi-phase human feedback of mixed quality into multi-agent reinforcement learning (MARL) training processes.
M3HF addresses the significant challenge of designing effective reward functions in multi-agent reinforcement learning by incorporating human feedback of varying quality levels. The framework enables:
- Multi-phase feedback integration across training generations
- Mixed-quality feedback handling from both expert and non-expert humans
- LLM-powered feedback parsing for natural language instructions
- Adaptive reward shaping through predefined templates
- Robust learning with performance-based weight adjustments
Requirements:
- Python 3.8 or higher
- CUDA-capable GPU (recommended)
- OpenAI API key (optional, for LLM integration)
Installation:
git clone https://github.com/your-username/gym-macro-overcooked.git
cd gym-macro-overcooked
# Install system dependencies (Ubuntu/Debian)
sudo apt-get update
sudo apt-get install libsdl2-dev
# Install Python dependencies
pip install -e .

The package automatically installs all required dependencies:
- Core ML: torch, ray[rllib], gymnasium
- Environment: pygame, numpy, matplotlib
- Logging: wandb, tensorboard
- LLM Integration: openai (optional)
# Play the environment manually
python play.py --env_id Overcooked-MA-v1 --n_agent 3 --task 6 --map_type A

Two main scripts are available:
- m3hf.py: Full M3HF algorithm with human feedback integration and multi-generation training
- m3hf_main.py: Simplified training script for basic multi-agent PPO without the feedback loop
# Run M3HF with simulated human feedback
python m3hf.py \
--env_id Overcooked-MA-v1 \
--n_agent 3 \
--map_type A \
--task 6 \
--generations 5 \
--use_wandb \
--demo_mode
# Run with OpenAI integration for real human feedback
export OPENAI_API_KEY="your-api-key"
python m3hf.py \
--env_id Overcooked-MA-v1 \
--n_agent 3 \
--map_type A \
--task 6 \
--generations 5 \
--use_wandb
# Alternative: Simple training without feedback loop
python m3hf_main.py \
--env_id Overcooked-MA-v1 \
--num_workers 4 \
--training_iterations 100

For faster training with multiple GPUs:
# Custom GPU selection
python m3hf.py \
--env_id Overcooked-MA-v1 \
--n_agent 3 \
--map_type A \
--task 6 \
--generations 5 \
--num_gpus 2 \
--gpu_devices "0,2" \
--use_wandb \
--demo_mode

# Train IPPO baseline
python play_rllib_ippo.py \
--env_id Overcooked-MA-v1 \
--num_workers 4 \
--training_iterations 1000

M3HF operates through iterative generations of agent training and human feedback (a rough sketch of the loop follows the list below):
- Agent Training: Multi-agent policies trained with current reward functions
- Rollout Generation: Video demonstrations of agent behavior
- Human Feedback: Natural language feedback on agent performance
- LLM Parsing: Feedback converted to structured instructions
- Reward Shaping: New reward functions generated from templates
- Weight Update: Performance-based adjustment of reward weights
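A minimal sketch of this per-generation loop, assuming simplified interfaces; the function names (`train`, `rollout`, `collect_feedback`, `parse_feedback`, `update_weights`) are placeholders for illustration, not the actual APIs in m3hf.py:

```python
from typing import Callable, List, Tuple

# Placeholder type for a shaped-reward component: (obs, action) -> float
RewardFn = Callable[[list, int], float]

def run_m3hf(n_generations: int,
             train: Callable[[List[RewardFn], List[float]], Tuple[object, float]],
             rollout: Callable[[object], object],
             collect_feedback: Callable[[object], List[str]],
             parse_feedback: Callable[[List[str]], List[RewardFn]],
             update_weights: Callable[[List[float], List[RewardFn], float], List[float]]):
    """Illustrative M3HF outer loop: train -> rollout -> feedback -> parse -> reshape -> reweight."""
    reward_fns: List[RewardFn] = []   # reward components accumulated across generations
    weights: List[float] = []         # one weight per component
    policies = None
    for _ in range(n_generations):
        policies, score = train(reward_fns, weights)       # 1. agent training with current shaped reward
        video = rollout(policies)                          # 2. rollout generation for human review
        feedback = collect_feedback(video)                 # 3. natural-language human feedback
        new_fns = parse_feedback(feedback)                 # 4. LLM parsing into reward templates
        reward_fns.extend(new_fns)                         # 5. reward shaping with the new components
        weights = update_weights(weights, new_fns, score)  # 6. performance-based weight update
    return policies
```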
Key components of the framework:
- Multi-phase Human Feedback Markov Game (MHF-MG): Theoretical framework
- Feedback Quality Assessment: Automatic filtering of low-quality feedback
- Reward Function Templates: Distance, action, state, and cooperation-based rewards
- Meta-learning Weight Updates: Adaptive optimization of reward combinations (an illustrative sketch follows this list)
- Robustness Mechanisms: Handling of mixed-quality and contradictory feedback
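The exact meta-learning update is defined in the paper; the snippet below is only an illustrative performance-based scheme in the same spirit, where components associated with a performance drop are down-weighted and the weights are renormalized:

```python
import numpy as np

def adjust_weights(weights: np.ndarray, perf_delta: np.ndarray, lr: float = 0.1) -> np.ndarray:
    """Illustrative performance-based weight adjustment (not the paper's exact rule).

    weights:    current weights of the shaped-reward components
    perf_delta: performance change credited to each component
    """
    logits = np.log(np.clip(weights, 1e-8, None)) + lr * perf_delta
    shifted = np.exp(logits - logits.max())   # softmax-style renormalization
    return shifted / shifted.sum()

# Example: the second component coincided with a performance drop, so its weight shrinks.
print(adjust_weights(np.array([0.5, 0.3, 0.2]), np.array([1.0, -2.0, 0.5])).round(3))
```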
The environment simulates a cooperative cooking task where 3 agents must prepare salads:
| Map A | Map B | Map C |
|---|---|---|
| *(Map A layout image)* | *(Map B layout image)* | *(Map C layout image)* |
TASKLIST = [
"tomato salad", # task 0
"lettuce salad", # task 1
"onion salad", # task 2
"lettuce-tomato salad", # task 3
"onion-tomato salad", # task 4
"lettuce-onion salad", # task 5
"lettuce-onion-tomato salad" # task 6 (most complex)
]

Vector Observation (32-dimensional):
obs = [
tomato.x, tomato.y, tomato.status, # [0:3]
lettuce.x, lettuce.y, lettuce.status, # [3:6]
onion.x, onion.y, onion.status, # [6:9]
plate1.x, plate1.y, # [9:11]
plate2.x, plate2.y, # [11:13]
knife1.x, knife1.y, # [13:15]
knife2.x, knife2.y, # [15:17]
delivery.x, delivery.y, # [17:19]
agent1.x, agent1.y, # [19:21]
agent2.x, agent2.y, # [21:23]
agent3.x, agent3.y, # [23:25]
task_onehot # [25:32]
]
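For reference, a small helper (hypothetical, not part of the repository) that names the slices of this 32-dimensional vector using the index layout above:

```python
def parse_obs(obs):
    """Split the flat 32-dimensional observation into named fields (illustrative only)."""
    assert len(obs) == 32
    return {
        "tomato":   {"pos": (obs[0], obs[1]),  "status": obs[2]},
        "lettuce":  {"pos": (obs[3], obs[4]),  "status": obs[5]},
        "onion":    {"pos": (obs[6], obs[7]),  "status": obs[8]},
        "plate1":   (obs[9],  obs[10]),
        "plate2":   (obs[11], obs[12]),
        "knife1":   (obs[13], obs[14]),
        "knife2":   (obs[15], obs[16]),
        "delivery": (obs[17], obs[18]),
        "agent1":   (obs[19], obs[20]),
        "agent2":   (obs[21], obs[22]),
        "agent3":   (obs[23], obs[24]),
        "task":     obs[25:32],   # one-hot over the 7 tasks in TASKLIST
    }
```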
Macro Actions:
- Navigation: get_tomato, get_lettuce, get_onion
- Interaction: get_plate1, get_plate2, chop, deliver
- Movement: go_knife1, go_knife2, go_counter
- Basic: stay, up, down, left, right
# Example feedback for coordination improvement
feedback = "The red chef should get the tomato first, then the green chef can take it and cut it as quickly as possible."
# Example feedback for task efficiency
feedback = "Agents should work in parallel - one chopping while another gets plates."
# Example feedback for error correction
feedback = "Don't deliver wrong salads! Make sure all ingredients are properly chopped and combined."

You can extend M3HF by adding new reward function templates to the LLM prompts. The templates are defined in REWARD_FUNCTION_BUILD_PROMPT (in prompt.py) and used by the LLM to generate reward functions based on human feedback:
# Edit prompt.py to add new templates to REWARD_FUNCTION_BUILD_PROMPT
# Existing templates in the prompt:
# 1. Distance-based:
# lambda obs, act: -sqrt((obs[e1_x] - obs[e2_x])**2 + (obs[e1_y] - obs[e2_y])**2)
# Example: lambda obs, act: -sqrt((obs[19] - obs[0])**2 + (obs[20] - obs[1])**2) # Distance between agent 1 and tomato
# 2. Action-based:
# lambda obs, act: 1 if act == desired_action else 0
# Example: lambda obs, act: 1 if act == 5 else 0 # Reward for 'Interact' action
# 3. Status-based:
# lambda obs, act: 1 if obs[e_status] == desired_status else 0
# Example: lambda obs, act: 1 if obs[2] == 1 else 0 # Reward if tomato is chopped
# 4. Proximity-based:
# lambda obs, act: r_prox if sqrt((obs[e1_x] - obs[e2_x])**2 + (obs[e1_y] - obs[e2_y])**2) <= d else 0
# Example: lambda obs, act: 0.5 if sqrt((obs[19] - obs[13])**2 + (obs[20] - obs[14])**2) <= 1 else 0 # Reward if agent 1 is near knife 1
# 5. Time-based penalty:
# lambda obs, act, t: -beta * t
# Example: lambda obs, act, t: -0.001 * t # Increasing penalty over time
# 6. Success-based:
# lambda obs, act: r_success if goal_condition_met(obs) else 0
# Example: lambda obs, act: 10 if obs[25] == 1 and obs[17] == obs[19] and obs[18] == obs[20] else 0 # Reward for delivering completed order
# 7. Energy-based penalty:
# lambda obs, act: -gamma * energy_cost(act)
# Example: lambda obs, act: -0.2 * (1 if act != 0 else 0) # Penalty for non-zero actions
# 8. Composite reward:
# lambda obs, act: sum(weight_i * reward_function_i(obs, act) for i in range(n))
# To add custom templates:
# 1. Edit prompt.py
# 2. Add your template to REWARD_FUNCTION_BUILD_PROMPT
# 3. Provide examples showing how to parameterize it
# 4. Include documentation for the LLM to understand when to use it
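Putting the templates together, here is a hedged sketch of how parsed feedback could be combined via the composite template (#8). The structured format and helper below are illustrative assumptions for demonstration, not the actual output schema of the LLM parser:

```python
from math import sqrt

# Illustrative only: each entry pairs a weight with a reward template instantiated
# from a piece of parsed feedback, using the observation indices documented above.
parsed_feedback = [
    # "The red chef should get the tomato first" -> pull agent 1 toward the tomato
    {"weight": 0.5,
     "fn": lambda obs, act: -sqrt((obs[19] - obs[0])**2 + (obs[20] - obs[1])**2)},
    # "cut it as quickly as possible" -> reward a chopped tomato (status flag)
    {"weight": 1.0,
     "fn": lambda obs, act: 1 if obs[2] == 1 else 0},
]

def composite_reward(obs, act):
    """Composite template (#8): weighted sum of the parsed reward components."""
    return sum(item["weight"] * item["fn"](obs, act) for item in parsed_feedback)
```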
Common issues:

- OpenAI API Errors:
      export OPENAI_API_KEY="your-key-here"
      # Or disable LLM integration:
      python m3hf_main.py --no_openai
- Ray/RLlib Issues:
      # Check Ray version compatibility
      pip install ray[rllib]==2.0.0
- SDL Dependencies:
      sudo apt-get install libsdl2-dev libsdl2-image-dev libsdl2-mixer-dev
- CUDA/GPU Issues:
      # Check CUDA installation and GPU availability
      nvidia-smi
      # Use CPU-only mode
      python m3hf.py --num_gpus 0
      # Select specific GPUs to avoid conflicts
      python m3hf.py --gpu_devices "0,2" --num_gpus 2
      # Check GPU memory usage and free it up if needed
      nvidia-smi --query-gpu=memory.used,memory.total --format=csv
- GPU Memory Issues:
      # Reduce batch sizes if running out of memory
      # (edit m3hf.py and lower train_batch_size and sgd_minibatch_size),
      # or use fewer workers
      python m3hf.py --num_workers 2 --workers_per_cpu 1
      # Monitor GPU usage during training
      watch -n 1 nvidia-smi
Performance tuning:

- Multi-GPU Training:
      # Use specific GPUs for optimal performance
      python m3hf.py --num_gpus 2 --gpu_devices "0,2"
- Parallel Workers:
      # Increase workers based on CPU cores (recommended: CPU_cores / 2)
      python m3hf.py --num_workers 8 --workers_per_cpu 2
- Memory Usage:
      # Monitor memory and adjust if needed
      # Batch sizes are auto-optimized for multi-GPU setups
      # Default: train_batch_size=20480, sgd_minibatch_size=2048
- Hardware Recommendations:
- Minimum: 1 GPU, 8 CPU cores, 16GB RAM
- Recommended: 2+ GPUs, 16+ CPU cores, 32GB+ RAM
- Optimal: 3+ GPUs (A30/V100/A100), 32+ CPU cores, 64GB+ RAM
- Wandb Logging: Enable experiment tracking with the --use_wandb flag
If you use this code in your research, please cite:
@inproceedings{wang2025m3hf,
title={M3HF: Multi-agent Reinforcement Learning from Multi-phase Human Feedback of Mixed Quality},
author={Wang, Ziyan and Zhang, Zhicheng and Fang, Fei and Du, Yali},
booktitle={Proceedings of the 42nd International Conference on Machine Learning},
year={2025}
}