This is the official implementation of the ICLR'25 paper "QERA: an Analytical Framework for Quantization Error Reconstruction".
```bash
git clone git@github.com:ChengZhang-98/QERA.git
cd QERA
git submodule update --init
conda env create -f environment.yml
conda activate qera
pip install -r requirements.txt
pip install -e .
```

In the source code and scripts, we use the following abbreviations for the low-rank term types:
- If `--disable-qera` is set, no low-rank terms are used, i.e., weight-only quantization.
- Otherwise, the low-rank term type is one of:
  - `identity`: truncated SVD on the quantized weight matrix, i.e., ZeroQuant-V2.
  - `lqer`: the heuristic method proposed in the LQER paper.
  - `diag`: QERA-approx in our paper.
  - `exact`: QERA-exact in our paper.
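All four variants share the same idea: augment the quantized weight with a low-rank term that reconstructs part of the quantization error. Here is a minimal NumPy sketch of that idea using a plain rank-k truncated SVD; the uniform quantizer, matrix shapes, and rank are toy placeholders for illustration, not the repository's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)).astype(np.float32)

def quantize(w, n_levels=16):
    """Toy symmetric uniform quantizer (illustrative only)."""
    scale = np.abs(w).max() / (n_levels // 2)
    return np.clip(np.round(w / scale), -n_levels // 2, n_levels // 2 - 1) * scale

W_q = quantize(W)
E = W - W_q  # quantization error to be reconstructed

# Rank-k truncated SVD of the error gives the low-rank term A @ B.
k = 8
U, S, Vt = np.linalg.svd(E, full_matrices=False)
A = U[:, :k] * S[:k]  # (64, k)
B = Vt[:k, :]         # (k, 64)

err_plain = np.linalg.norm(W - W_q)             # weight-only quantization
err_lowrank = np.linalg.norm(W - (W_q + A @ B))  # with low-rank reconstruction
```

Since the rank-k SVD is the best rank-k approximation of the error in the Frobenius norm, `err_lowrank` is strictly smaller than `err_plain`; the variants above differ in how (and with which calibration statistics) the low-rank factors are computed.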
- `ptq_bf16_baseline.py` evaluates the BF16 baseline.
- `ptq_q_baseline.py` evaluates the PTQ baseline.
- `ptq_pipeline.py` runs data calibration (if needed), computes the low-rank terms, and evaluates the quantized model.
- `ptq_pipeline_chunked.py` runs data calibration (if needed) and computes the low-rank terms for a chunk of layers, which is useful for large models. Once all chunks (layers) are computed, this script also triggers the evaluation of the quantized model.
- `chunk_checker.py` checks the completion of the chunks (optional).
- `adapt_and_save.py` runs data calibration, quantizes the model, computes the initial values of the low-rank terms, and saves the quantized model + low-rank terms.
- `glue_train.py` fine-tunes the qLoRA-adapted model with low-rank terms on GLUE tasks.
- `clm_train.py` fine-tunes the qLoRA-adapted model with low-rank terms on WikiText2.
- `gsm8k_train.py` fine-tunes the qLoRA-adapted model with low-rank terms on GSM8K.
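To see why the initial value of the low-rank terms matters before fine-tuning, here is a hedged NumPy sketch (not the repository's code; the quantizer, shapes, and rank are made up for illustration): adapters initialized from an SVD of the quantization error reproduce the original layer's outputs more closely at step zero than zero-initialized LoRA-style adapters.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((32, 32)).astype(np.float32)
X = rng.standard_normal((128, 32)).astype(np.float32)  # toy calibration inputs

# Toy symmetric uniform quantizer (illustrative only).
scale = np.abs(W).max() / 8
W_q = np.clip(np.round(W / scale), -8, 7) * scale

k = 4
U, S, Vt = np.linalg.svd(W - W_q, full_matrices=False)
A_svd, B_svd = U[:, :k] * S[:k], Vt[:k, :]              # error-SVD initialization
A_zero, B_zero = np.zeros((32, k)), np.zeros((k, 32))   # plain zero initialization

y_ref = X @ W.T  # original (unquantized) layer output
err_svd = np.linalg.norm(y_ref - X @ (W_q + A_svd @ B_svd).T)
err_zero = np.linalg.norm(y_ref - X @ (W_q + A_zero @ B_zero).T)
```

With zero-initialized adapters the layer starts from the raw quantized weights, whereas the error-SVD initialization removes the dominant error directions, so `err_svd < err_zero` and fine-tuning starts closer to the original model.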
See `experiments/ptq` and `experiments/qpeft` for the PTQ and qLoRA fine-tuning experiments, respectively.
```bibtex
@article{zhang2024qera,
  title={QERA: an Analytical Framework for Quantization Error Reconstruction},
  author={Zhang, Cheng and Wong, Jeffrey TH and Xiao, Can and Constantinides, George A and Zhao, Yiren},
  journal={arXiv preprint arXiv:2410.06040},
  year={2024}
}
```