---
layout: post
title: "ParetoQ: Scaling Laws in Extremely Low-bit LLM Quantization"
date: 2025-03-23
description: "ParetoQ: Scaling Laws in Extremely Low-bit LLM Quantization"
tags: ml paper llm quantization
comments: true
---

<style>
li {
  font-size: 1.1rem; /* Adjust as needed */
}
</style>

# [ParetoQ: Scaling Laws in Extremely Low-bit LLM Quantization](https://arxiv.org/abs/2502.02631)
> [TL;DR]
> The paper introduces ParetoQ, a unified framework for comparing LLM quantization across 1-bit, 1.58-bit, 2-bit, 3-bit, and 4-bit settings. It identifies a key transition between 2-bit and 3-bit quantization: models retain their original representations at 3 bits and above but undergo substantial representation changes at lower bit widths. ParetoQ shows that 2-bit quantization is a strong alternative to 4-bit thanks to its superior efficiency-accuracy trade-offs.


## Highlights
- Demonstrates that 2-bit, 3-bit, and ternary quantization often outperform 4-bit in accuracy-memory trade-offs.
<div class="row mt-3">
  <div class="col-sm-6 mt-3 mt-md-0 offset-3">
    {% include figure.html path="assets/img/posts/paretoq/pareto.png" class="img-fluid rounded z-depth-1" %}
  </div>
</div>
<br>
- Identifies a sharp transition between 2-bit and 3-bit quantization: models at 3 bits and above retain their pre-trained distributions, while 2-bit models undergo major representation shifts.
<div class="row mt-3">
  <div class="col-sm-6 mt-3 mt-md-0 offset-3">
    {% include figure.html path="assets/img/posts/paretoq/compansate_reconstruct.png" class="img-fluid rounded z-depth-1" %}
  </div>
</div>
<br>
- Quantization-aware training (QAT) fine-tuning consistently surpasses both post-training quantization (PTQ, no fine-tuning) and QAT from scratch.
<div class="row mt-3">
  <div class="col-sm-6 mt-3 mt-md-0 offset-3">
    {% include figure.html path="assets/img/posts/paretoq/qat_vs_ptq.png" class="img-fluid rounded z-depth-1" %}
  </div>
</div>
<br>
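For readers unfamiliar with the mechanics, QAT trains latent full-precision weights while the forward pass uses their quantized values, passing gradients through the rounding with a straight-through estimator (STE). The toy below is our own minimal, deterministic sketch (hypothetical data and quantizer, not the paper's code): the latent weight drifts to compensate for quantization error even though only quantized values are ever used in the forward pass.

```python
import numpy as np

# Toy one-parameter "model": predict y = w_q * x, where w_q is the 2-bit
# quantization of a latent full-precision weight w.
x = np.array([1.0, -1.0, 2.0, -2.0])
y = 0.45 * x  # the target weight is 0.45

def quantize(v, alpha=1.0):
    # Toy symmetric 2-bit quantizer: levels {-1.5, -0.5, 0.5, 1.5} * alpha.
    return alpha * (np.clip(np.floor(v / alpha), -2, 1) + 0.5)

w, lr = 0.6, 0.02
for _ in range(30):
    w_q = quantize(w)                          # forward pass uses the quantized weight
    grad_wq = np.mean(2 * (w_q * x - y) * x)   # dLoss/dw_q for an MSE loss
    w -= lr * grad_wq                          # STE: copy the gradient through the rounding
```

Here the latent `w` slides from 0.6 toward the target 0.45 while the forward pass always sees the quantized level 0.5; starting QAT from a pre-trained checkpoint gives these latent weights a good initialization, which is consistent with the post's observation that QAT fine-tuning beats QAT from scratch.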
- Proposes a refined quantization function, Stretched Elastic Quant (SEQ), for low-bit settings.
$$
\mathbf{W}_Q^i = \alpha \left( \left\lfloor \text{Clip} \left( \frac{\mathbf{W}_R^i}{\alpha}, -1, 1 \right) \times \frac{k}{2} - 0.5 \right\rfloor + 0.5 \right) / k \times 2
$$
$$
\mathbf{W}_Q^i = \alpha \mathbf{\hat{W}}_Q^i
=
\begin{cases}
\alpha \cdot \text{Sign}(\mathbf{W}_R^i), & \text{if } N_{bit} = 1 \\
\alpha \left( \left\lfloor \text{Clip} \left( \frac{\mathbf{W}_R^i}{\alpha}, -1, 1 \right) \times \frac{k}{2} - 0.5 \right\rfloor + 0.5 \right) / k \times 2, & \text{if } N_{bit} = 1.58, 2 \\
\alpha \left\lfloor \text{Clip} \left( \frac{\mathbf{W}_R^i}{\alpha}, n, p \right) \right\rfloor, & \text{if } N_{bit} = 3, 4
\end{cases}
$$
<div class="row mt-3">
  <div class="col-sm-10 mt-3 mt-md-0 offset-1">
    {% include figure.html path="assets/img/posts/paretoq/seq.png" class="img-fluid rounded z-depth-1" %}
  </div>
</div>
<br>
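The case equation above can be sketched in NumPy. This is a minimal illustration, not the paper's reference implementation: the function name, the NumPy dependency, and treating both the scale $$\alpha$$ and the SEQ level count $$k$$ as plain arguments (the paper learns $$\alpha$$ during QAT and fixes $$k$$ per bit width) are our own choices.

```python
import numpy as np

def paretoq_quantize(w, alpha, n_bit, k=4):
    """Sketch of the bit-specific quantizers in the case equation.

    w: real-valued weight tensor; alpha: quantization scale (learned
    during QAT in the paper, passed in here); k: number of quantization
    levels for the SEQ branch.
    """
    if n_bit == 1:
        # 1-bit: binary sign quantization, scaled by alpha.
        return alpha * np.sign(w)
    if n_bit in (1.58, 2):
        # Stretched Elastic Quant (SEQ): levels are stretched to cover
        # the clipped range [-1, 1] instead of leaving its ends unused.
        clipped = np.clip(w / alpha, -1.0, 1.0)
        return alpha * (np.floor(clipped * k / 2 - 0.5) + 0.5) / k * 2
    if n_bit in (3, 4):
        # 3-/4-bit: clip to the integer range [n, p] and floor, matching
        # the last case of the equation above.
        n, p = -2 ** (n_bit - 1), 2 ** (n_bit - 1) - 1
        return alpha * np.floor(np.clip(w / alpha, n, p))
    raise ValueError(f"unsupported bit width: {n_bit}")
```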


## Summary
- **Observation 1**: Recent studies on scaling laws in the low-precision regime have reached conflicting conclusions.
  - [Dettmers & Zettlemoyer](https://proceedings.mlr.press/v202/dettmers23a) and [Kumar et al.](https://arxiv.org/abs/2411.04330) argue that 4-bit or 6-bit quantization often resides on the Pareto frontier, balancing accuracy and efficiency.
  - In contrast, [Ma et al.](https://storage.prod.researchhub.com/uploads/papers/2024/02/29/2402.17764.pdf) and [Kaushal et al.](https://arxiv.org/abs/2407.12327) suggest that bit-widths as low as 1.58 bits per parameter offer significant potential for optimal scaling trade-offs.
- **Observation 2**: Prior studies overlook the impact of the training scheme, denoted $$\mathbf{S}_{\text{train}}$$, and the bit-specific quantization function $$\mathcal{F}$$.
- **The problem statement**: How can the optimal trade-off between bit-width and model size be determined while preserving accuracy?
- **The solution**: The authors propose a scaling law $$\mathcal{L}(\mathcal{N}, \mathcal{D}, \mathcal{P}, \mathbf{S}_{\text{train}}, \mathcal{F})$$ spanning five dimensions, and systematically optimize quantization functions and training schemes across bit-widths.
  - Introduces Stretched Elastic Quant (SEQ), which balances quantization grids for the 2-bit and ternary settings.
  - Applies learnable quantization ranges, outperforming static min-max methods.
- **The proposed framework**: The framework evaluates models under 1-bit, 1.58-bit, 2-bit, 3-bit, and 4-bit precision.


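The "learnable quantization range" point can be illustrated with a toy experiment. This is a hypothetical stand-in, not the paper's method: instead of learning the scale $$\alpha$$ by gradient during QAT, we grid-search it to minimize quantization MSE on synthetic Gaussian weights, which already shows why an optimized range beats a static min-max one (the clip threshold trades rounding error against clipping error).

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=4096)  # synthetic stand-in for a weight matrix

def quantize_2bit(w, alpha):
    # Plain symmetric 2-bit quantizer: levels {-1.5, -0.5, 0.5, 1.5} * alpha.
    return alpha * (np.clip(np.floor(w / alpha), -2, 1) + 0.5)

def quant_mse(alpha):
    return float(np.mean((quantize_2bit(w, alpha) - w) ** 2))

# Static "min-max" range: scale so the largest weight lands on the outer level.
alpha_minmax = np.abs(w).max() / 1.5

# Optimized range: pick the alpha that minimizes quantization error.
grid = np.linspace(0.1, alpha_minmax, 200)
alpha_opt = grid[np.argmin([quant_mse(a) for a in grid])]
```

In QAT proper, $$\alpha$$ would be a trainable parameter updated alongside the weights rather than grid-searched, but the conclusion is the same: the best clipping range at low bit widths is much tighter than the weights' min-max range.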
## Experiments

- Accuracy-Compression and Accuracy-Speed Trade-offs
<div class="row mt-3">
  <div class="col-sm-12 mt-3 mt-md-0 offset-0">
    {% include figure.html path="assets/img/posts/paretoq/pareto_models_latency.png" class="img-fluid rounded z-depth-1" %}
  </div>
</div>
<br>


- 2-bit / 3-bit / 4-bit Comparisons
<div class="row mt-3">
  <div class="col-sm-12 mt-3 mt-md-0 offset-0">
    {% include figure.html path="assets/img/posts/paretoq/bits_2_3_4.png" class="img-fluid rounded z-depth-1" %}
  </div>
</div>
<br>

- 1.58-bit Comparison on Sub-8B Models
  - Note: for reference, the floating-point LLaMA-3 3B model reaches 69.9 accuracy.
<div class="row mt-3">
  <div class="col-sm-6 mt-3 mt-md-0 offset-3">
    {% include figure.html path="assets/img/posts/paretoq/sub_8b_model.png" class="img-fluid rounded z-depth-1" %}
  </div>
</div>
<br>
- Main Results
<div class="row mt-3">
  <div class="col-sm-8 mt-3 mt-md-0 offset-2">
    {% include figure.html path="assets/img/posts/paretoq/main_table.png" class="img-fluid rounded z-depth-1" %}
  </div>
</div>
<br>

## Conclusions
- 2-bit quantization outperforms 4-bit in efficiency-accuracy trade-offs.
- Fine-tuning is crucial for sub-4-bit quantization, especially for binary and ternary models.
- QAT fine-tuning consistently surpasses both post-training quantization (PTQ, no fine-tuning) and QAT from scratch.
- QAT serves as a compensation mechanism at 3-bit and above and as a reconstruction process at 2-bit and below, where weights adapt to form new representations.
- Extremely low-bit quantization is highly sensitive to the choice of quantization function, with no single function optimal across all bit widths.