MAGMA: Multistep Algorithmic Reasoning Benchmark

Summary

MAGMA v2 evaluates large language models (LLMs) on five classical graph algorithms—BFS, DFS, Dijkstra’s, Floyd–Warshall, and Prim’s MST—with an emphasis on multistep reasoning and intermediate-step accuracy. Despite progress in LLMs, structured tasks like graph algorithms remain challenging. This benchmark highlights where LLMs excel and where they fall short.


Features

  • Five Graph Algorithms: BFS, DFS, Dijkstra’s, Floyd–Warshall, and Prim’s MST.
  • Intermediate Steps: Measures chain-of-thought accuracy, not just final outputs.
  • Multiple Sources: Synthetic and real-world graph data.
  • Flexible Prompting: Chain-of-thought or instruction-based queries.
  • Modular & Extensible: Easy to add new tasks or adapt data generation.

Installation

  1. Clone the Repo:
    git clone https://github.com/ataylor24/MAGMA_v2.git
    cd MAGMA_v2
  2. Install (editable mode recommended):
    pip install -e .

Conda users:

conda env create --file environment.yml
conda activate nar2
pip install -e .

Data Generation

Standard usage for all algorithms:

python magma_generation.py all

This runs BFS, DFS, Dijkstra’s, Floyd–Warshall, and Prim’s MST with default settings in globals.py. Adjust via CLI flags as needed.

Key Defaults (in globals.py)

  • TRAIN_TEST_SPLIT: maps graph sizes to train/test example counts.
  • _OOD_TRAIN_LENGTHS & _OOD_EVAL_LENGTHS: graph sizes used for out-of-distribution training and evaluation.
  • OUTPUT_FORMATS: supported data formats, e.g. cot_analysis, magma.
  • FORMATTED_ALGORITHMS: per-algorithm instructions and output formatting.
  • COT_PROMPT: chain-of-thought prompt template.
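For orientation, here is a hypothetical sketch of how these constants might be laid out. Only the constant names come from the list above; the values are placeholders, and the shipped globals.py will differ.

# Hypothetical globals.py layout -- constant names from the README,
# values are placeholders only.

# Graph size -> (train examples, test examples)
TRAIN_TEST_SPLIT = {
    5: (1000, 100),
    7: (1000, 100),
    10: (1000, 100),
}

# In-distribution sizes for training vs. larger OOD sizes for evaluation
_OOD_TRAIN_LENGTHS = [5, 7, 10]
_OOD_EVAL_LENGTHS = [15, 20]

# Supported serialization formats for generated examples
OUTPUT_FORMATS = ["cot_analysis", "magma"]

# Per-algorithm task instructions and expected output formatting
FORMATTED_ALGORITHMS = {
    "bfs": {"instructions": "...", "output_format": "..."},
    # dfs, dijkstra, floyd_warshall, mst_prim follow the same structure
}

# Template wrapping each example in a chain-of-thought style query
COT_PROMPT = (
    "Solve the task step by step, listing each intermediate state "
    "before giving the final answer."
)

Adjust these values, or override them with CLI flags, to change dataset sizes, OOD splits, and output formats.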

Example Usage

python sample_data.py bfs \
  --graph_sizes 5 7 10 \
  --seed 1234 \
  --ood_generation False \
  --output_dir /path/to/output \
  --output_formats cot_analysis
  • bfs can be replaced with dfs, dijkstra, floyd_warshall, mst_prim, or all.
  • graph_sizes sets node counts.
  • ood_generation enables out-of-distribution sampling.
  • output_formats selects the data format (chain-of-thought analysis, etc.); a batch-generation sketch follows this list.
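To generate data for several algorithms in one go, the documented CLI can be driven from a short script. A minimal sketch, assuming only the flags shown above; the output paths, sizes, and seed are placeholders.

# Minimal batch-generation sketch built on the documented sample_data.py CLI.
# Paths, graph sizes, and the seed are placeholders -- adjust to your setup.
import subprocess

ALGORITHMS = ["bfs", "dfs", "dijkstra", "floyd_warshall", "mst_prim"]

for algo in ALGORITHMS:
    subprocess.run(
        [
            "python", "sample_data.py", algo,
            "--graph_sizes", "5", "7", "10",
            "--seed", "1234",
            "--output_dir", f"data/{algo}",
            "--output_formats", "cot_analysis",
        ],
        check=True,  # stop immediately if any generation run fails
    )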

Performance Metrics

  • Exact Match Accuracy: The prediction matches the reference solution exactly.
  • F1 Score: Partial credit for overlap between prediction and reference (see the scoring sketch after this list).
  • Intermediate Steps: Evaluates step-by-step reasoning, not just the final answer.
  • Final Step: Scores only the final result.
  • Trajectory: Scores the entire chain of intermediate steps.
  • Independent: Each step is scored independently of the others.
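The repository's evaluation code is the source of truth for these metrics; the following is only an illustrative sketch of how the different views can be computed, assuming each example is a list of reasoning steps and each step is a set of items (e.g., nodes visited at that step).

# Illustrative scoring sketch -- not the repository's evaluation code.
# Assumes each example is a list of reasoning steps, where every step is a
# set of items (e.g., nodes visited or edges added at that step).

def step_f1(pred, ref):
    """Set-based F1 for a single step (partial credit)."""
    pred, ref = set(pred), set(ref)
    if not pred and not ref:
        return 1.0
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def score(pred_steps, ref_steps):
    """Final-step, trajectory, and independent (per-step) views of one example."""
    return {
        # Only the last step counts.
        "final_step_f1": step_f1(pred_steps[-1], ref_steps[-1]),
        # Exact match over the whole chain of intermediate steps.
        "trajectory_exact_match": float(pred_steps == ref_steps),
        # Average partial credit, each step scored independently.
        # (Extra or missing steps are ignored in this simplified sketch.)
        "independent_step_f1": sum(
            step_f1(p, r) for p, r in zip(pred_steps, ref_steps)
        ) / max(len(ref_steps), 1),
    }

# Example: a three-step prediction scored against its reference trajectory.
pred = [{"A"}, {"B", "C"}, {"D"}]
ref = [{"A"}, {"B", "C"}, {"D", "E"}]
print(score(pred, ref))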

Contributing

  1. Fork the repo.
  2. Create a branch.
  3. Commit changes.
  4. Push the branch.
  5. Open a Pull Request.

Please run pytest before submitting.


Reproducibility

  • Seed: The default seed 100898 ensures consistent data generation; pass it explicitly to regenerate the datasets, as shown below.
  • Other defaults are set in the code or configuration files.
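To regenerate the data deterministically with the default seed (a hedged example using the documented CLI; the output path is a placeholder):

python sample_data.py all --seed 100898 --output_dir /path/to/output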

License

MIT License © 2025 Alexander Taylor


Acknowledgements


Thank you for using MAGMA v2! We hope it accelerates research into LLM-based graph algorithmic reasoning.
