
LongProc: Benchmarking Long-Context Language Models on Long Procedural Generation

[COLM 2025 Paper]

LongProc (Long Procedural Generation) is a benchmark that evaluates long-context LLMs through long procedural generation, which requires models to follow specified procedures and generate structured outputs. See the example tasks below.


Data

LongProc consists of 6 tasks. Each task generally includes three difficulty levels, with the maximum number of output tokens set at 500, 2K, and 8K. The 6 tasks are as follows:

  • html_to_tsv (HTML to TSV): Extract specified information from HTML pages and structure it into a table format (TSV).
  • pseudo_to_code (Pseudocode to Code): Translate pseudocode that is structured line-by-line into corresponding C++ code.
  • path_traversal (Path Traversal): Traverse a route that connects two cities in a graph where each city has only one outgoing connection.
  • tom_tracking (Theory-of-Mind Tracking): Track the locations of objects and the beliefs of characters in stories about object placement, and answer the question asked.
  • countdown (Countdown): Search for a way to combine a set of numbers with basic arithmetic operations to reach a target number.
  • travel_planning (Travel Planning): Search for a trip plan that satisfies constraints on the duration of stays and on direct flights.
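
For reference, dataset names combine a task name with an output-length suffix (see also the usage example below). A minimal sketch of how the names are formed, assuming the suffixes 0.5k, 2k, and 8k; 0.5k matches the example in this README, the other two are assumptions based on the 2K and 8K levels:

# Sketch: assumed LongProc dataset names of the form "[task_name]_[length]".
# Note: each task only "generally" includes three levels, so not every
# combination below is guaranteed to exist.
TASKS = ["html_to_tsv", "pseudo_to_code", "path_traversal",
         "tom_tracking", "countdown", "travel_planning"]
LENGTHS = ["0.5k", "2k", "8k"]
dataset_names = [f"{t}_{l}" for t in TASKS for l in LENGTHS]
print(dataset_names)  # e.g. "html_to_tsv_0.5k", ..., "travel_planning_8k"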

Example

NOTE: This repository mainly serves to manage the data; we provide minimal code examples for loading and evaluating on our dataset. Please install the necessary packages with pip install -r requirements.txt.

We provide easy-to-follow examples for loading data and evaluating results in example_usage.py.

Using OpenAI: please install openai and set up your OPENAI_API_KEY if you would like to run the example.

python example_usage.py --dataset html_to_tsv_0.5k --model gpt-4o-mini
# dataset names are specified as [task_name]_[length]

Using vLLM: please set up your environment with vllm.

python example_usage.py --dataset html_to_tsv_0.5k --model meta-llama/Llama-3.1-8B

Loading Data and Evaluation Function

Call load_longproc_data in longproc.longproc_data. The function returns:

  • A list of data points; each is a dict with input_prompt (the prompt string), reference_output (the ground-truth procedure trace), and item (meta information for the data point).
  • The corresponding evaluation function for the task. An evaluation function (e.g., eval_path_traversal in longproc.longproc_data) takes in the prediction (a string) and the data point, and returns: 1) metrics, and 2) additional information such as parsed outputs or brief descriptions of the errors. A short usage sketch is given below.
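
A minimal sketch of loading a task and scoring one prediction, assuming load_longproc_data accepts a dataset name of the form [task_name]_[length] (as used by example_usage.py) and returns the (data, evaluation function) pair described above; the echoed prediction is only a placeholder for a real model call:

# Minimal sketch (assumptions noted above); not the official example_usage.py.
from longproc.longproc_data import load_longproc_data

# Assumption: the function takes a dataset name such as "path_traversal_0.5k".
data, eval_fn = load_longproc_data("path_traversal_0.5k")

example = data[0]
print(example["input_prompt"])      # the prompt string
print(example["reference_output"])  # the ground-truth procedure trace

# Placeholder prediction: echo the reference instead of calling a model.
prediction = example["reference_output"]
metrics, extra_info = eval_fn(prediction, example)
print(metrics, extra_info)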

Running Evaluation

With the code for loading data and evaluating predictions above, you should be able to flexibly use your own codebase to run experiments. A minimal sketch of such a loop is given below.
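
As a sketch of such a loop, assuming the same load_longproc_data behavior as above and that each metrics object is a dict of floats; my_model_generate is a hypothetical stand-in for your own generation call:

# Sketch of an evaluation loop over one dataset with your own model.
from longproc.longproc_data import load_longproc_data

def my_model_generate(prompt: str) -> str:
    # Hypothetical: replace with your own LLM call (OpenAI, vLLM, etc.).
    raise NotImplementedError

examples, eval_fn = load_longproc_data("countdown_2k")

all_metrics = []
for ex in examples:
    prediction = my_model_generate(ex["input_prompt"])
    metrics, _info = eval_fn(prediction, ex)
    all_metrics.append(metrics)

# Assumption: metrics are dicts of floats; average each metric over the dataset.
avg = {k: sum(m[k] for m in all_metrics) / len(all_metrics) for k in all_metrics[0]}
print(avg)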

In addition, we recommend using the HELMET codebase for more convenient evaluation. Please refer to the LongProc Addon of HELMET for evaluating LongProc.

Contacts

For questions, feel free to open an issue or email xi.ye@princeton.edu.

References

# Dataset
@inproceedings{ye25longproc,
    title={LongProc: Benchmarking Long-Context Language Models on Long Procedural Generation},
    author={Ye, Xi and Yin, Fangcong and He, Yinghui and Zhang, Joie and Yen, Howard and Gao, Tianyu and Durrett, Greg and Chen, Danqi},
    booktitle={Conference on Language Modeling (COLM)},
    year={2025}
}

# Running evaluation with HELMET
@inproceedings{yen2025helmet,
    title={HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly},
    author={Howard Yen and Tianyu Gao and Minmin Hou and Ke Ding and Daniel Fleischer and Peter Izsak and Moshe Wasserblat and Danqi Chen},
    booktitle={International Conference on Learning Representations (ICLR)},
    year={2025}
}

Our benchmark adapts several existing datasets. Please also cite the original datasets, listed below:

@article{arborist,
    title={Efficient Bottom-Up Synthesis for Programs with Local Variables},
    author={Li, Xiang and Zhou, Xiangyu and Dong, Rui and Zhang, Yihong and Wang, Xinyu},
    year={2024},
    issue_date={January 2024},
    publisher={Association for Computing Machinery},
    address={New York, NY, USA},
    volume={8},
    number={POPL},
    url={https://doi.org/10.1145/3632894},
    doi={10.1145/3632894},
    journal={Proc. ACM Program. Lang.},
    month=jan,
    articleno={52},
    numpages={29},
    keywords={Observational Equivalence, Program Synthesis, Web Automation}
}

@inproceedings{spoc,
    title={{SPoC: Search-based Pseudocode to Code}},
    author={Kulal, Sumith and Pasupat, Panupong and Chandra, Kartik and Lee, Mina and Padon, Oded and Aiken, Alex and Liang, Percy S},
    booktitle={Proceedings of the Conference on Advances in Neural Information Processing Systems (NeurIPS)},
    year={2019}
}

@inproceedings{gandhi2024stream,
    title={{Stream of Search (SoS): Learning to Search in Language}},
    author={Kanishk Gandhi and Denise H J Lee and Gabriel Grand and Muxin Liu and Winson Cheng and Archit Sharma and Noah Goodman},
    booktitle={First Conference on Language Modeling},
    year={2024}
}

@article{natplan,
    title={{NATURAL PLAN: Benchmarking LLMs on Natural Language Planning}},
    author={Zheng, Huaixiu Steven and Mishra, Swaroop and Zhang, Hugh and Chen, Xinyun and Chen, Minmin and Nova, Azade and Hou, Le and Cheng, Heng-Tze and Le, Quoc V and Chi, Ed H and others},
    journal={arXiv preprint arXiv:2406.04520},
    year={2024}
}
