MCPMark is a comprehensive suite for evaluating the agentic abilities of frontier models.
MCPMark provides Model Context Protocol (MCP) services in the following environments:
- Notion
- GitHub
- Filesystem
- Postgres
- Playwright
- Playwright-WebArena
MCPMark is designed to run agentic tasks in complex environments safely. Specifically, it sets up an isolated environment for each experiment, completes the task, and then destroys the environment without affecting existing user profiles or data.
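The set-up / run / tear-down lifecycle described above can be sketched as follows, using the filesystem environment as an example. This is a hypothetical illustration, not MCPMark's actual API: the `run_in_isolation` helper and the `run_task` callback are made up for this sketch.

```python
# Hypothetical sketch of the isolate -> run -> destroy lifecycle, using a
# temporary directory as the sandbox for a filesystem-style task.
import shutil
import tempfile
from pathlib import Path

def run_in_isolation(run_task):
    """Create a sandbox directory, run the task inside it, then remove it."""
    sandbox = Path(tempfile.mkdtemp(prefix="mcpmark_"))
    try:
        return run_task(sandbox)  # the task only ever sees the sandbox path
    finally:
        shutil.rmtree(sandbox, ignore_errors=True)  # teardown always runs

# Example: the task writes a file inside the sandbox; nothing outside it
# is touched, and the sandbox is deleted afterwards.
result = run_in_isolation(lambda root: (root / "notes.txt").write_text("hi"))
```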
- Install MCPMark.
- Authorize services (for GitHub and Notion).
- Configure the environment variables in .mcp_env.
- Run an MCPMark experiment.
Please refer to the Quick Start for details on how to properly start a sample filesystem experiment, and to the Task Page for task details. Please visit the Installation and Docker Usage pages for full MCPMark setup information.
MCPMark supports the following modes for running experiments (suppose the experiment is named new_exp, the models used are o3 and gpt-4.1, the environment is notion, and each task is repeated K times).
# Evaluate ALL tasks
python -m pipeline --exp-name new_exp --mcp notion --tasks all --models o3 --k K
# Evaluate a single task group (online_resume)
python -m pipeline --exp-name new_exp --mcp notion --tasks online_resume --models o3 --k K
# Evaluate one specific task (task_1 in online_resume)
python -m pipeline --exp-name new_exp --mcp notion --tasks online_resume/task_1 --models o3 --k K
# Evaluate multiple models
python -m pipeline --exp-name new_exp --mcp notion --tasks all --models o3,gpt-4.1 --k K
# Run all tasks for one service
./run-task.sh --mcp notion --models o3 --exp-name new_exp --tasks all
# Run comprehensive benchmark across all services
./run-benchmark.sh --models o3,gpt-4.1 --exp-name new_exp --docker
When re-running an experiment, only unfinished tasks are executed. Tasks that previously failed due to pipeline errors (such as a State Duplication Error or MCP Network Error) are also retried automatically.
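The resume behavior can be sketched as a simple filter over prior results: tasks never attempted are run, finished tasks are skipped, and tasks that failed with a pipeline error are retried. The result-record schema below is hypothetical, for illustration only.

```python
# Hypothetical sketch of re-run task selection. Records with an error in
# PIPELINE_ERRORS are retried; tasks with no record at all are run fresh.
PIPELINE_ERRORS = {"State Duplication Error", "MCP Network Error"}

def tasks_to_run(all_tasks, previous_results):
    """Return the subset of tasks a re-run should execute."""
    todo = []
    for task in all_tasks:
        record = previous_results.get(task)
        if record is None:                            # never attempted
            todo.append(task)
        elif record.get("error") in PIPELINE_ERRORS:  # pipeline failure: retry
            todo.append(task)
    return todo
```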
The experiment results are written to ./results/ (JSON + CSV).
MCPMark supports the aggregated metrics pass@1, pass@K, pass^K, and avg@K.
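These metrics usually follow standard conventions over K boolean trial outcomes per task; the sketch below shows those common definitions, which may differ in detail from MCPMark's exact implementation.

```python
# Minimal sketch of common definitions for the four aggregated metrics,
# computed from K boolean trial outcomes for a single task.
def aggregate(trials):
    """trials: list of K booleans, one per repeated attempt at a task."""
    k = len(trials)
    return {
        "pass@1": float(trials[0]),    # the first attempt succeeds
        "pass@K": float(any(trials)),  # at least one of K attempts succeeds
        "pass^K": float(all(trials)),  # all K attempts succeed
        "avg@K": sum(trials) / k,      # mean success rate over K attempts
    }
```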
python -m src.aggregators.aggregate_results --exp-name new_exp
MCPMark supports the following models with their corresponding providers (model codes in brackets).
- GPT-5 (gpt-5)
- o3 (o3)
- Claude-4.1-Opus (claude-4.1-opus)
- Claude-4-Sonnet (claude-4-sonnet)
- Gemini-2.5-Pro (gemini-2.5-pro)
- Grok-4 (grok-4)
- DeepSeek-Chat (deepseek-chat)
- Qwen3-Coder (qwen-3-coder)
- Kimi-K2 (k2)
Visit the Contributing Page to learn how to contribute to MCPMark.