LatentMAS is a multi-agent reasoning framework that moves agent collaboration from token space into the model's latent space.
Instead of producing long textual reasoning traces, agents communicate by passing latent thoughts through their own working memory. LatentMAS has the following key features:
- Efficient multi-step reasoning with drastically fewer tokens
- Training-free latent-space alignment for stable generation
- A general technique compatible with any Hugging Face model, with optional vLLM backend support
Overall, LatentMAS achieves superior performance, lower token usage, and major wall-clock speedups for multi-agent systems.
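The latent-communication idea can be pictured with a minimal, self-contained sketch. Everything below (the linear "agents", the tiny vocabulary, the dimensions) is our own illustration, not the repository's API: one agent hands its hidden state directly to the next, versus a token-space handoff that must quantize the state to a discrete token first.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16       # hidden size (illustrative)
vocab = 8    # tiny vocabulary for the token round-trip

W_a = rng.normal(size=(d, d))    # toy "agent A" transform
W_b = rng.normal(size=(d, d))    # toy "agent B" transform
E = rng.normal(size=(vocab, d))  # toy token-embedding matrix

x = rng.normal(size=d)
thought = np.tanh(W_a @ x)       # agent A's latent "thought"

# Latent-space handoff: agent B consumes the hidden state directly.
latent_msg = thought

# Token-space handoff: the thought is collapsed to a discrete token
# (argmax over vocab logits) and re-embedded, discarding information.
token = int(np.argmax(E @ thought))
text_msg = E[token]

latent_out = np.tanh(W_b @ latent_msg)
text_out = np.tanh(W_b @ text_msg)

# The latent channel carries the full d-dimensional state; the token
# channel can only transmit one of `vocab` distinct embeddings.
print(np.linalg.norm(latent_msg - thought))  # 0.0: nothing lost in transit
```

The point of the sketch is only the information bottleneck: the latent message is lossless, while the token message is limited to the vocabulary size per step.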
- [2025-12-20] Check out Science-LatentMAS, an excellent extension of LatentMAS developed by Prof. Markus J. Buehler and the LAMM Lab at MIT. Science-LatentMAS is designed specifically for scientific-discovery downstream applications! For more details and instructions, please check our README section "Science-LatentMAS" below and the new `Science-LatentMAS` branch.
- [2025-12-15] Check out these amazing community-driven extensions of LatentMAS!
  - KNN-LatentMAS: enables more efficient KV utilization for latent memory.
  - Hybrid-LatentMAS: extends LatentMAS to support hybrid, heterogeneous multi-agent systems.
- [2025-11-25] We have released our paper and code implementations for LatentMAS! Stay tuned for more model-backbone support and advanced features!
- [2025-11-25] We are featured as the 🤗 Hugging Face #1 Paper of the Day!
Explore community-driven extensions that expand LatentMAS into new domains, architectures, and collaboration patterns:
Science-LatentMAS (by Prof. Markus J. Buehler & the MIT LAMM Group)
- New Branch: https://github.com/Gen-Verse/LatentMAS/tree/Science-LatentMAS
- Original Code: https://github.com/lamm-mit/LatentMAS/tree/flexible_agents
- What it adds: extends LatentMAS for scientific modeling and material-system collaboration, enabling flexible agent types and specialized latent communication for science domains.
kNN-LatentMAS (by Bookmaster9)
- Blog (Overview): https://bookmaster9.github.io/kNN-latentMAS/
- Code: https://github.com/Bookmaster9/kNN-latentMAS
- What it adds: introduces kNN-based latent retrieval to improve KV-cache usage, boosting memory efficiency and multi-step reasoning stability across agents.
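The kNN-retrieval idea can be sketched generically. This is not code from the kNN-LatentMAS repo; it only assumes cached latent memories are stored as vectors and that each query fetches its k most similar entries by cosine similarity:

```python
import numpy as np

def knn_retrieve(query, memory, k=2):
    """Return indices and vectors of the k cached latents most similar
    to `query` (cosine similarity). Generic sketch, not the repo's API."""
    mem = np.asarray(memory)
    sims = mem @ query / (
        np.linalg.norm(mem, axis=1) * np.linalg.norm(query) + 1e-9
    )
    top = np.argsort(-sims)[:k]  # indices of the k highest similarities
    return top, mem[top]

rng = np.random.default_rng(1)
memory = rng.normal(size=(100, 32))               # 100 cached latent "thoughts"
query = memory[42] + 0.01 * rng.normal(size=32)   # a query near entry 42
idx, vecs = knn_retrieve(query, memory, k=3)
print(idx[0])  # 42: the closest cached memory comes back first
```

Retrieving only k entries instead of attending over the whole cache is what makes the KV usage cheaper as the latent memory grows.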
Hybrid-LatentMAS (by nhminle)
- Code: https://github.com/nhminle/LatentMAS-Hybrid
- What it adds: supports heterogeneous/hybrid agent collaboration (LLM + non-LLM agents), enabling modular multi-agent pipelines that mix models, tools, and reasoning strategies.
If your work extends LatentMAS, feel free to open a PR and we'll feature it here!
Three main tables from our paper span 9 tasks across math & science reasoning, commonsense reasoning, and code generation:
- Table 1: LatentMAS under the sequential MAS setting
- Table 2: LatentMAS under the hierarchical MAS setting
- Table 3: Main results on reasoning-intensive tasks
Overall, compared to standard Text-MAS and chain-of-thought baselines, LatentMAS reduces:
- token usage by ~50-80%
- wall-clock time by ~3x-7x
This repository provides all code for reproducing LatentMAS, TextMAS, and baseline single-agent experiments across GSM8K, AIME24/25, GPQA, ARC-Easy/Challenge, MBPP+, HumanEval+, and MedQA.
We recommend setting your HF cache directory to avoid repeated downloads:
```shell
export HF_HOME=/path/to/huggingface
export TRANSFORMERS_CACHE=$HF_HOME
export HF_DATASETS_CACHE=$HF_HOME
```

Models and datasets will automatically be downloaded into `$HF_HOME`.
```shell
conda create -n latentmas python=3.10 -y
conda activate latentmas
pip install -r requirements.txt
```

If you want vLLM support, also install:

```shell
pip install vllm
```

Clone the repository:

```shell
git clone https://github.com/Gen-Verse/LatentMAS.git
```
```shell
cd LatentMAS
```

```
LatentMAS/
├── run.py             # Main entry for experiments
├── models.py          # Wrapper for HF + vLLM + latent realignment
├── methods/
│   ├── baseline.py    # Single-agent baseline
│   ├── text_mas.py    # Token-space multi-agent method
│   └── latent_mas.py  # Latent-space multi-agent (our method)
├── prompts.py         # Prompt constructors
├── data.py            # Dataset loaders
├── data/              # Provided data + figures (medqa.json given as an example)
├── utils.py           # Answer parsing / timeout / helpers
├── example_logs/      # Example logs from LatentMAS
└── requirements.txt
```
Single-agent baseline:

```shell
python run.py --method baseline --model_name Qwen/Qwen3-14B --task gsm8k --max_samples -1 --max_new_tokens 2048
```

Token-space multi-agent (TextMAS):

```shell
python run.py --method text_mas --model_name Qwen/Qwen3-14B --task gsm8k --prompt sequential --max_samples -1 --max_new_tokens 2048
```

Latent-space multi-agent (LatentMAS):

```shell
python run.py --method latent_mas --model_name Qwen/Qwen3-14B --task gsm8k --prompt sequential --max_samples -1 --max_new_tokens 2048
```

Key hyperparameters:
- `--latent_steps`: in [0, 80]; tune for best performance.
- `--latent_space_realign`: enables latent-embedding alignment. We treat this as a hyperparameter; enable or disable it depending on the task and model:

```shell
python run.py --method latent_mas --model_name Qwen/Qwen3-14B --task gsm8k --prompt sequential --max_samples -1 --latent_space_realign --max_new_tokens 2048
```

Two example LatentMAS logs are provided for reference purposes:
- `example_logs/qwen3_14b_mbppplus_sequential.txt`
- `example_logs/qwen3_14b_humanevalplus_hierarchical.txt`
Please refer to additional experiment logs here. You can open them to view the full agent interaction traces and outputs.
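To build intuition for what a latent-to-embedding alignment flag like `--latent_space_realign` could be doing, here is a toy sketch of one common realignment strategy. This is our own illustration under stated assumptions, not the repository's implementation: each latent vector is snapped to its nearest row of the token-embedding matrix, keeping subsequent generation on the embedding manifold.

```python
import numpy as np

def realign_to_embeddings(latents, embed_matrix):
    """Illustrative latent->embedding alignment: replace each latent with
    its nearest token embedding (L2 distance). Not the repo's actual code."""
    # d2[i, j] = ||latents[i] - embed_matrix[j]||^2, via the expansion
    # ||a||^2 - 2 a.b + ||b||^2 (broadcast over the vocabulary axis)
    d2 = (
        (latents**2).sum(1, keepdims=True)
        - 2 * latents @ embed_matrix.T
        + (embed_matrix**2).sum(1)
    )
    nearest = d2.argmin(axis=1)
    return embed_matrix[nearest], nearest

rng = np.random.default_rng(0)
E = rng.normal(size=(50, 8))                       # toy embedding table
lat = E[[3, 7]] + 0.05 * rng.normal(size=(2, 8))   # latents near rows 3 and 7
aligned, idx = realign_to_embeddings(lat, E)
print(idx)  # [3 7]
```

Whether snapping like this helps or hurts depends on how far the latent rollout drifts from the embedding distribution, which is consistent with the README treating the flag as a per-task hyperparameter.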
LatentMAS supports vLLM for faster inference.
```shell
python run.py --method baseline --model_name Qwen/Qwen3-14B --task gsm8k --max_samples -1 --use_vllm --max_new_tokens 2048
```

```shell
python run.py --method text_mas --model_name Qwen/Qwen3-14B --task gsm8k --prompt sequential --max_samples -1 --use_vllm --max_new_tokens 2048
```

LatentMAS supports a hybrid HF + vLLM pipeline for fast inference:
- vLLM handles final text generation (with prefix caching, tensor parallelism, etc.)
- A HuggingFace model handles latent-space rollout and hidden-state alignment
For this setup, we recommend using two GPUs:
- One GPU for vLLM (`--device`, e.g., `cuda:0`)
- One GPU for the auxiliary HF model (`--device2`, e.g., `cuda:1`)
```shell
CUDA_VISIBLE_DEVICES=0,1 python run.py --method latent_mas --model_name Qwen/Qwen3-14B --task gsm8k --prompt sequential --max_samples -1 --max_new_tokens 2048 \
    --use_vllm \
    --use_second_HF_model \
    --enable_prefix_caching \
    --device2 cuda:1
```

Important note:
vLLM does not officially support modifying the KV cache or prompting via latent embeddings, so we patch parts of the vLLM backend internals to implement our method. Minor numeric differences may arise compared to the official HF backend due to different decoding (generation) strategies. Please use the HF backend to reproduce the official published results.
If you find LatentMAS helpful, please kindly give us a star and cite the paper below. Thanks!
```bibtex
@article{zou2025latentmas,
  title={Latent Collaboration in Multi-Agent Systems},
  author={Zou, Jiaru and Yang, Xiyuan and Qiu, Ruizhong and Li, Gaotang and Tieu, Katherine and Lu, Pan and Shen, Ke and Tong, Hanghang and Choi, Yejin and He, Jingrui and Zou, James and Wang, Mengdi and Yang, Ling},
  journal={arXiv preprint arXiv:2511.20639},
  year={2025}
}
```
This code is partially based on the amazing work of vLLM.




