This repository contains the official implementation of the paper "Context-level Language Modeling by Learning Predictive Context Embeddings".
ContextLM enhances large language models (LLMs) by introducing a context-level prediction objective alongside standard next-token prediction (NTP). This lets the model capture higher-level semantic structure and long-range dependencies while remaining fully compatible with existing autoregressive LLMs and evaluation paradigms (e.g., perplexity).
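The idea of training on two objectives at once can be sketched as a toy in NumPy. This is purely illustrative and not the paper's implementation: the context-level term used here (MSE between a predicted context vector and the mean embedding of the next chunk's tokens) and the weight `lam` are assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def combined_loss(logits, targets, pred_ctx, tok_emb, next_chunk, lam=0.5):
    """Toy combined objective: next-token cross-entropy plus a context term.

    logits:     (T, V) next-token logits
    targets:    (T,)   gold next-token ids
    pred_ctx:   (D,)   predicted embedding for the upcoming context chunk
    tok_emb:    (V, D) token embedding table
    next_chunk: (K,)   token ids of the upcoming chunk
    """
    probs = softmax(logits)
    ntp = -np.mean(np.log(probs[np.arange(len(targets)), targets] + 1e-9))
    # Context target: mean embedding of the next chunk (an assumption here).
    target_ctx = tok_emb[next_chunk].mean(axis=0)
    ctx = np.mean((pred_ctx - target_ctx) ** 2)
    return ntp + lam * ctx

rng = np.random.default_rng(0)
T, V, D, K = 8, 16, 4, 4
loss = combined_loss(
    logits=rng.normal(size=(T, V)),
    targets=rng.integers(0, V, size=T),
    pred_ctx=rng.normal(size=D),
    tok_emb=rng.normal(size=(V, D)),
    next_chunk=rng.integers(0, V, size=K),
)
print(loss)
```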
📖 Paper: Context-level Language Modeling by Learning Predictive Context Embeddings
📅 Under review at ICLR 2026
```bash
git clone https://github.com/dbylynn/ContextLM.git
cd ContextLM

conda create -n contextlm python=3.9
conda activate contextlm
pip install -r requirements.txt
```

- Data preprocessing (generate `train.bin` and `val.bin`)

```bash
python data/openwebtext_preprocess/prepare.py
```

The script will download and tokenize OpenWebText using HuggingFace `datasets` and write out binary token files. You can adjust `num_proc` and other variables inside the script depending on your hardware.
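The binary files can then be memory-mapped during training instead of loaded into RAM. The sketch below assumes the script writes a flat array of `uint16` token ids, as in nanoGPT-style pipelines; check `data/openwebtext_preprocess/prepare.py` for the actual dtype.

```python
import numpy as np

# Write a tiny fake train.bin in the assumed format: a flat array of
# uint16 token ids (verify against the actual output of prepare.py).
tokens = np.array([50256, 318, 257, 1332], dtype=np.uint16)
tokens.tofile("train.bin")

# Training code can memory-map the file and slice out batches lazily.
data = np.memmap("train.bin", dtype=np.uint16, mode="r")
print(len(data), data[:4])
```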
- Evaluate a single checkpoint (example)

`test.py` is a Hydra-based evaluation script. Example command to evaluate a single checkpoint:

```bash
python test.py eval_single_ckpt=true load_path=/path/to/checkpoint
```

Common flags:

- `eval_single_ckpt`: boolean; evaluate a single checkpoint when true
- `load_path`: path to the checkpoint, or to a directory containing checkpoint steps

- Evaluate multiple checkpoints in a directory (periodic evaluation)

```bash
python test.py eval_single_ckpt=false load_path=/path/to/checkpoint_dir
```

Evaluation results are printed to stdout and written to `eval_results.json` or `results/<load_path>/eval_results_ppl.json`, depending on the configuration.
We release the following pre-trained ContextLM models under the GPT-2 scaling law framework. All models are enhanced with our context-level prediction objective and can be directly loaded via Hugging Face's transformers library.
| Model | Hugging Face Hub Link |
|---|---|
| ContextLM-Base | https://huggingface.co/daibeiya/ContextLM/tree/main/contextlm_gpt2_base |
| ContextLM-Medium | https://huggingface.co/daibeiya/ContextLM/tree/main/contextlm_gpt2_med |
| ContextLM-Large | https://huggingface.co/daibeiya/ContextLM/tree/main/contextlm_gpt2_large |
| ContextLM-XL | https://huggingface.co/daibeiya/ContextLM/tree/main/contextlm_gpt2_xl |
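Since all four checkpoints live in subfolders of a single Hub repository, they can presumably be loaded with `transformers` via the `subfolder` argument of `from_pretrained`. This is a sketch, not verified against the actual checkpoints; whether a stock `AutoModelForCausalLM` class matches the released config is an assumption.

```python
# Mapping from model name to its subfolder in the daibeiya/ContextLM repo,
# taken from the Hub links in the table above.
SUBFOLDERS = {
    "ContextLM-Base": "contextlm_gpt2_base",
    "ContextLM-Medium": "contextlm_gpt2_med",
    "ContextLM-Large": "contextlm_gpt2_large",
    "ContextLM-XL": "contextlm_gpt2_xl",
}

def load_contextlm(name: str):
    """Download and load one ContextLM checkpoint from the Hugging Face Hub."""
    from transformers import AutoModelForCausalLM, AutoTokenizer
    sub = SUBFOLDERS[name]
    model = AutoModelForCausalLM.from_pretrained("daibeiya/ContextLM", subfolder=sub)
    tokenizer = AutoTokenizer.from_pretrained("daibeiya/ContextLM", subfolder=sub)
    return model, tokenizer

# Usage (requires network access and the transformers library):
#   model, tokenizer = load_contextlm("ContextLM-Base")
```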
See the LICENSE file in the repository root for license details.
If you find this work useful, please cite our paper (ICLR 2026 under review):
@inproceedings{contextlm2026,
title={Context-level Language Modeling by Learning Predictive Context Embeddings},
author={Dai, Beiya and Liu, Yuliang and Xue, Daozheng and Guo, Qipeng and Chen, Kai and Wang, Xinbing and Zhou, Bowen and Lin, Zhouhan},
booktitle={arXiv preprint},
year={2026}
}
