ContextLM

This repository contains the official implementation of the paper "Context-level Language Modeling by Learning Predictive Context Embeddings".

ContextLM augments large language models (LLMs) with a context-level prediction objective alongside standard next-token prediction (NTP). This lets the model capture higher-level semantic structure and long-range dependencies while remaining fully compatible with existing autoregressive LLM architectures and evaluation paradigms (e.g., perplexity).
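The paper's exact loss is defined in the repository code; purely as a rough illustration (the function names, the cosine-distance form of the auxiliary term, and the weighting factor below are our assumptions, not the paper's definitions), a context-level term can be added to the standard NTP loss like this:

```python
import numpy as np

def ntp_loss(logits, targets):
    # Standard next-token cross-entropy over a (T, V) logit matrix.
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def context_loss(pred_ctx, target_ctx):
    # Illustrative auxiliary term: cosine distance between a predicted
    # context embedding and the embedding of the context that follows.
    pred = pred_ctx / np.linalg.norm(pred_ctx, axis=-1, keepdims=True)
    tgt = target_ctx / np.linalg.norm(target_ctx, axis=-1, keepdims=True)
    return (1.0 - (pred * tgt).sum(axis=-1)).mean()

def total_loss(logits, targets, pred_ctx, target_ctx, lam=0.5):
    # Combined objective: token-level NTP plus a weighted context-level term.
    return ntp_loss(logits, targets) + lam * context_loss(pred_ctx, target_ctx)
```

The key property this sketch captures is that the context term is zero when the predicted and target context embeddings align, so it only steers training when the model's higher-level prediction drifts.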

📖 Paper: Context-level Language Modeling by Learning Predictive Context Embeddings
📅 Under review at ICLR 2026

[Figure: ContextLM overview]

Installation

1. Clone the Repository

git clone https://github.com/dbylynn/ContextLM.git
cd ContextLM

2. Install Dependencies

conda create -n contextlm python=3.9
conda activate contextlm
pip install -r requirements.txt

Usage

  1. Data preprocessing (generate train.bin and val.bin)
python data/openwebtext_preprocess/prepare.py

The script downloads and tokenizes OpenWebText with the Hugging Face datasets library and writes out binary token files. Adjust num_proc and other variables inside the script to match your hardware.
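The exact binary layout is defined by prepare.py; assuming a nanoGPT-style layout (an assumption on our part), the .bin files hold token ids as a flat uint16 array that training code can memory-map rather than load whole:

```python
import numpy as np

# Hypothetical example: write a few token ids to train.bin, then memory-map it.
tokens = np.array([50256, 15496, 11, 995, 50256], dtype=np.uint16)  # toy ids
tokens.tofile("train.bin")

# Training code can then read arbitrary slices without loading the whole file.
data = np.memmap("train.bin", dtype=np.uint16, mode="r")
block = data[0:4]  # one context window
```

Memory-mapping keeps RAM usage constant regardless of corpus size, which is why large pre-training corpora are commonly stored this way.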

  2. Evaluate a single checkpoint (example)

test.py is a Hydra-based evaluation script. Example command to evaluate a single checkpoint:

python test.py eval_single_ckpt=true load_path=/path/to/checkpoint 

Common flags:

  • eval_single_ckpt: boolean, evaluate a single checkpoint when true
  • load_path: path to the checkpoint or directory containing checkpoint steps
  3. Evaluate multiple checkpoints in a directory (periodic evaluation)
python test.py eval_single_ckpt=false load_path=/path/to/checkpoint_dir

Evaluation results print to stdout and are written to eval_results.json or results/<load_path>/eval_results_ppl.json depending on the configuration.
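The JSON schema depends on the Hydra configuration; assuming the file holds per-checkpoint loss records (a hypothetical schema, not taken from the repo), the results can be post-processed like this:

```python
import json
import math

# Hypothetical schema: one {"step": ..., "val_loss": ...} record per checkpoint.
records = json.loads(
    '[{"step": 1000, "val_loss": 3.2}, {"step": 2000, "val_loss": 2.9}]'
)

for rec in records:
    # Perplexity is the exponential of the mean cross-entropy loss.
    rec["ppl"] = math.exp(rec["val_loss"])

best = min(records, key=lambda r: r["ppl"])
print(f"best step {best['step']}: ppl={best['ppl']:.2f}")
```

Replace the inline JSON string with `json.load(open("eval_results.json"))` once you have real output; the field names in your file may differ.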

Checkpoints

We release the following ContextLM models, pre-trained at GPT-2 scales following the GPT-2 scaling-law setup. All models are trained with our context-level prediction objective and can be loaded directly via Hugging Face's transformers library.

| Model | Hugging Face Hub Link |
| --- | --- |
| ContextLM-Base | https://huggingface.co/daibeiya/ContextLM/tree/main/contextlm_gpt2_base |
| ContextLM-Medium | https://huggingface.co/daibeiya/ContextLM/tree/main/contextlm_gpt2_med |
| ContextLM-Large | https://huggingface.co/daibeiya/ContextLM/tree/main/contextlm_gpt2_large |
| ContextLM-XL | https://huggingface.co/daibeiya/ContextLM/tree/main/contextlm_gpt2_xl |
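A minimal loading sketch based on the table above (the subfolder names come from the Hub links; whether the standard Auto classes suffice or the checkpoints need repository-specific code is something we cannot verify here, so treat this as an assumption):

```python
# Map of released models to their subfolders on the Hub (from the table above).
CHECKPOINTS = {
    "ContextLM-Base": "contextlm_gpt2_base",
    "ContextLM-Medium": "contextlm_gpt2_med",
    "ContextLM-Large": "contextlm_gpt2_large",
    "ContextLM-XL": "contextlm_gpt2_xl",
}

def load_contextlm(name: str):
    # Deferred import so the mapping above is usable without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    subfolder = CHECKPOINTS[name]
    model = AutoModelForCausalLM.from_pretrained(
        "daibeiya/ContextLM", subfolder=subfolder
    )
    tokenizer = AutoTokenizer.from_pretrained(
        "daibeiya/ContextLM", subfolder=subfolder
    )
    return model, tokenizer
```

Usage: `model, tok = load_contextlm("ContextLM-Base")` (this downloads the weights from the Hub on first call).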

License

See the LICENSE file in the repository root for license details.

Citation

If you find this work useful, please cite our paper (ICLR 2026 under review):

@inproceedings{contextlm2026,
  title={Context-level Language Modeling by Learning Predictive Context Embeddings},
  author={Dai, Beiya and Liu, Yuliang and Xue, Daozheng and Guo, Qipeng and Chen, Kai and Wang, Xinbing and Zhou, Bowen and Lin, Zhouhan},
  booktitle={arXiv preprint},
  year={2026}
}
