ContextLM

This repository contains the official implementation of the paper "Context-level Language Modeling by Learning Predictive Context Embeddings".

ContextLM augments large language models (LLMs) with a context-level prediction objective alongside standard next-token prediction (NTP). This lets the model capture higher-level semantic structure and long-range dependencies while remaining fully compatible with existing autoregressive LLM architectures and evaluation paradigms (e.g., perplexity).
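The paper's exact loss is defined in the repository code; purely as a rough illustration (the function names, the cosine-distance form of the auxiliary term, and the weighting factor below are our assumptions, not the paper's definitions), a context-level term can be added to the standard NTP loss like this:

```python
import numpy as np

def ntp_loss(logits, targets):
    # Standard next-token cross-entropy over a (T, V) logit matrix.
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def context_loss(pred_ctx, target_ctx):
    # Illustrative auxiliary term: cosine distance between a predicted
    # context embedding and the embedding of the context that follows.
    pred = pred_ctx / np.linalg.norm(pred_ctx, axis=-1, keepdims=True)
    tgt = target_ctx / np.linalg.norm(target_ctx, axis=-1, keepdims=True)
    return (1.0 - (pred * tgt).sum(axis=-1)).mean()

def total_loss(logits, targets, pred_ctx, target_ctx, lam=0.5):
    # Combined objective: token-level NTP plus a weighted context-level term.
    return ntp_loss(logits, targets) + lam * context_loss(pred_ctx, target_ctx)
```

The key property this sketch captures is that the context term is zero when the predicted and target context embeddings align, so it only steers training when the model's higher-level prediction drifts.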

📖 Paper: Context-level Language Modeling by Learning Predictive Context Embeddings
📅 Under review at ICLR 2026

[Figure: ContextLM overview]

Installation

1. Clone the Repository

git clone https://github.com/dbylynn/ContextLM.git
cd ContextLM

2. Install Dependencies

conda create -n contextlm python=3.9
conda activate contextlm
pip install -r requirements.txt

Usage

  1. Data preprocessing (generate train.bin and val.bin)
python data/openwebtext_preprocess/prepare.py

The script downloads and tokenizes OpenWebText with the Hugging Face datasets library and writes out binary token files. Adjust num_proc and other variables inside the script to match your hardware.
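The exact binary layout is defined by prepare.py; assuming a nanoGPT-style layout (an assumption on our part), the .bin files hold token ids as a flat uint16 array that training code can memory-map rather than load whole:

```python
import numpy as np

# Hypothetical example: write a few token ids to train.bin, then memory-map it.
tokens = np.array([50256, 15496, 11, 995, 50256], dtype=np.uint16)  # toy ids
tokens.tofile("train.bin")

# Training code can then read arbitrary slices without loading the whole file.
data = np.memmap("train.bin", dtype=np.uint16, mode="r")
block = data[0:4]  # one context window
```

Memory-mapping keeps RAM usage constant regardless of corpus size, which is why large pre-training corpora are commonly stored this way.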

  2. Evaluate a single checkpoint (example)

test.py is a Hydra-based evaluation script. Example command to evaluate a single checkpoint:

python test.py eval_single_ckpt=true load_path=/path/to/checkpoint 

Common flags:

  • eval_single_ckpt: boolean, evaluate a single checkpoint when true
  • load_path: path to the checkpoint or directory containing checkpoint steps
  3. Evaluate multiple checkpoints in a directory (periodic evaluation)
python test.py eval_single_ckpt=false load_path=/path/to/checkpoint_dir

Evaluation results print to stdout and are written to eval_results.json or results/<load_path>/eval_results_ppl.json depending on the configuration.
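The JSON schema depends on the Hydra configuration; assuming the file holds per-checkpoint loss records (a hypothetical schema, not taken from the repo), the results can be post-processed like this:

```python
import json
import math

# Hypothetical schema: one {"step": ..., "val_loss": ...} record per checkpoint.
records = json.loads(
    '[{"step": 1000, "val_loss": 3.2}, {"step": 2000, "val_loss": 2.9}]'
)

for rec in records:
    # Perplexity is the exponential of the mean cross-entropy loss.
    rec["ppl"] = math.exp(rec["val_loss"])

best = min(records, key=lambda r: r["ppl"])
print(f"best step {best['step']}: ppl={best['ppl']:.2f}")
```

Replace the inline JSON string with `json.load(open("eval_results.json"))` once you have real output; the field names in your file may differ.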

Checkpoints

We release the following ContextLM models, pre-trained at GPT-2 scales following the GPT-2 scaling-law setup. All models are trained with our context-level prediction objective and can be loaded directly via Hugging Face's transformers library.

| Model | Hugging Face Hub Link |
| --- | --- |
| ContextLM-Base | https://huggingface.co/daibeiya/ContextLM/tree/main/contextlm_gpt2_base |
| ContextLM-Medium | https://huggingface.co/daibeiya/ContextLM/tree/main/contextlm_gpt2_med |
| ContextLM-Large | https://huggingface.co/daibeiya/ContextLM/tree/main/contextlm_gpt2_large |
| ContextLM-XL | https://huggingface.co/daibeiya/ContextLM/tree/main/contextlm_gpt2_xl |
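A minimal loading sketch based on the table above (the subfolder names come from the Hub links; whether the standard Auto classes suffice or the checkpoints need repository-specific code is something we cannot verify here, so treat this as an assumption):

```python
# Map of released models to their subfolders on the Hub (from the table above).
CHECKPOINTS = {
    "ContextLM-Base": "contextlm_gpt2_base",
    "ContextLM-Medium": "contextlm_gpt2_med",
    "ContextLM-Large": "contextlm_gpt2_large",
    "ContextLM-XL": "contextlm_gpt2_xl",
}

def load_contextlm(name: str):
    # Deferred import so the mapping above is usable without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    subfolder = CHECKPOINTS[name]
    model = AutoModelForCausalLM.from_pretrained(
        "daibeiya/ContextLM", subfolder=subfolder
    )
    tokenizer = AutoTokenizer.from_pretrained(
        "daibeiya/ContextLM", subfolder=subfolder
    )
    return model, tokenizer
```

Usage: `model, tok = load_contextlm("ContextLM-Base")` (this downloads the weights from the Hub on first call).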

License

See the LICENSE file in the repository root for license details.

Citation

If you find this work useful, please cite our paper (ICLR 2026 under review):

@inproceedings{contextlm2026,
  title={Context-level Language Modeling by Learning Predictive Context Embeddings},
  author={Dai, Beiya and Liu, Yuliang and Xue, Daozheng and Guo, Qipeng and Chen, Kai and Wang, Xinbing and Zhou, Bowen and Lin, Zhouhan},
  booktitle={arXiv preprint},
  year={2026}
}
