BoA

This repository contains the code for the ICML 2025 paper BoA: Attention-Aware Post-Training Quantization without Backpropagation.

The current release includes the following features:

  • Implementation of the proposed BoA method: boa.py
  • Quantization of OPT, Llama, Llama2, Llama3, Qwen2.5, and Qwen3 models: main.py (an example invocation is sketched below)
  • Evaluation of the perplexity and 0-shot accuracy (8 tasks) of quantized models
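
As a starting point, an invocation might look like the following. The argument names other than the documented BoA options are placeholders; check process_args.py for the actual argument names and defaults.

```bash
# Hypothetical example: 3-bit weight-only quantization of a Qwen2.5 model.
# --model and --wbits are placeholder names; block_v, act_order_col, and
# qparam_comput are the BoA options documented below.
python main.py --model Qwen/Qwen2.5-7B --wbits 3 \
    --block_v --act_order_col --qparam_comput Hessian
```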

Dependencies

  • See requirements.txt

BoA options

  • block_v: whether to apply the block-wise objective to the value projection layer. In memory-limited settings, deactivating this option significantly reduces memory usage at the cost of a slight performance degradation.
  • act_order_col: whether to re-order columns before quantization based on the column-wise Hessian $\mathbf{H}_{col}$ (the GPTQ act-order heuristic)
  • act_order_row: whether to re-order rows before quantization based on the row-wise Hessian $\mathbf{H}_{row}$
  • qparam_comput: how to select quantization grids. Grids can be determined naively from the weight range (MinMax), or chosen to minimize the weight perturbation (MMSE) or the layer-wise reconstruction error (Hessian); a sketch of these strategies follows below.

For more details on other arguments, please refer to process_args.py.
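
To make these options concrete, here is a minimal PyTorch sketch (not the repository's implementation) of the MinMax and MMSE grid-selection strategies and the act-order reordering heuristic. All function names and the clipping-ratio search are illustrative assumptions.

```python
import torch

def quantize(w, scale, zero, qmax):
    """Uniform affine quantization followed by dequantization."""
    q = torch.clamp(torch.round(w / scale) + zero, 0, qmax)
    return (q - zero) * scale

def minmax_grid(w, bits):
    """MinMax: the grid spans the full observed weight range."""
    qmax = 2 ** bits - 1
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min) / qmax
    zero = torch.round(-w_min / scale)
    return scale, zero

def mmse_grid(w, bits, n_steps=100):
    """MMSE: search over clipping ratios and keep the grid that minimizes
    the weight perturbation ||w - Q(w)||^2.
    (The Hessian strategy would instead weight this error by the layer's
    input Hessian, i.e., the layer-wise reconstruction error.)"""
    qmax = 2 ** bits - 1
    best, best_err = None, float("inf")
    for i in range(1, n_steps + 1):
        ratio = i / n_steps
        w_min, w_max = ratio * w.min(), ratio * w.max()
        scale = (w_max - w_min) / qmax
        zero = torch.round(-w_min / scale)
        err = (w - quantize(w, scale, zero, qmax)).pow(2).sum()
        if err < best_err:
            best, best_err = (scale, zero), err
    return best

def act_order_permutation(H):
    """GPTQ-style act-order heuristic: process the columns (or rows)
    with the larger Hessian diagonal entries first."""
    return torch.argsort(torch.diag(H), descending=True)
```

For example, `scale, zero = mmse_grid(layer.weight.data, bits=3)` would pick a 3-bit grid for a linear layer's weights. Again, this is only an illustration of what the options control; the actual implementation lives in boa.py.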

Experimental Results

  • Setup
    • All experiments were run on an NVIDIA H100 GPU.
    • The block_v option was activated.
    • The qparam_comput option was set to Hessian.
    • All combinations of act_order_row and act_order_col were tested; the best result with respect to Wiki2 perplexity is reported.

Results on Qwen2.5 Models

  • INT2 weight-only quantization

    | Size | act_order_row | act_order_col | Wiki2 ($\downarrow$) | C4-new ($\downarrow$) | 0-shot ($\uparrow$) |
    |------|---------------|---------------|----------------------|-----------------------|---------------------|
    | 0.5B | O | O | 144.7 | 455.8 | 32.56 |
    | 1.5B | O | O | 58.09 | 235.7 | 36.93 |
    | 3B   | X | O | 26.55 | 90.77 | 43.28 |
    | 7B   | X | O | 23.14 | 103.4 | 43.79 |
    | 14B  | O | O | 12.05 | 37.64 | 57.4  |
  • INT3 weight-only quantization

    | Size | act_order_row | act_order_col | Wiki2 ($\downarrow$) | C4-new ($\downarrow$) | 0-shot ($\uparrow$) |
    |------|---------------|---------------|----------------------|-----------------------|---------------------|
    | 0.5B | O | O | 20.12 | 40.14 | 45.68 |
    | 1.5B | X | O | 12.03 | 23.52 | 56.90 |
    | 3B   | O | O | 9.541 | 17.32 | 60.29 |
    | 7B   | O | O | 9.054 | 18.19 | 66.12 |
    | 14B  | O | O | 6.492 | 12.20 | 71.80 |
  • Quantization processing time

    | Size | Time (min) |
    |------|------------|
    | 0.5B | 6.145 |
    | 1.5B | 22.81 |
    | 3B   | 39.59 |
    | 7B   | 63.14 |
    | 14B  | 150.9 |

Results on Qwen3 Models

  • INT2 weight-only quantization

    | Size | act_order_row | act_order_col | Wiki2 ($\downarrow$) | C4-new ($\downarrow$) | 0-shot ($\uparrow$) |
    |------|---------------|---------------|----------------------|-----------------------|---------------------|
    | 4B   | X | O | 78.57 | 302.8 | 35.13 |
    | 8B   | X | O | 32.53 | 96.43 | 38.47 |
    | 14B  | X | O | 24.29 | 75.76 | 42.41 |
  • INT3 weight-only quantization

    | Size | act_order_row | act_order_col | Wiki2 ($\downarrow$) | C4-new ($\downarrow$) | 0-shot ($\uparrow$) |
    |------|---------------|---------------|----------------------|-----------------------|---------------------|
    | 4B   | X | O | 28.19 | 48.24 | 45.25 |
    | 8B   | X | O | 15.62 | 29.60 | 53.69 |
    | 14B  | X | O | 10.62 | 18.61 | 66.69 |
  • Quantization processing time

    | Size | Time (min) |
    |------|------------|
    | 4B   | 43.10 |
    | 8B   | 83.77 |
    | 14B  | 132.6 |

License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC).

Citation

If you find this work useful for your research, please cite our paper:

@inproceedings{kimboa,
  title={BoA: Attention-aware Post-training Quantization without Backpropagation},
  author={Kim, Junhan and Kim, Ho-young and Cho, Eulrang and Lee, Chungman and Kim, Joonyoung and Jeon, Yongkweon},
  booktitle={Forty-second International Conference on Machine Learning},
  year={2025}
}
