Commit 31570cf
add paretoq
1 parent 0b35317 commit 31570cf

12 files changed, +121 -3 lines changed

_posts/2025-02-16-halo-summary.md

Lines changed: 1 addition & 1 deletion

@@ -8,7 +8,7 @@ comments: true
 ---
 <style>
 li {
-font-size: 1.1em; /* Adjust as needed */
+font-size: 1.1rem; /* Adjust as needed */
 }
 </style>

_posts/2025-02-23-PEQA-summary.md

Lines changed: 1 addition & 1 deletion

@@ -9,7 +9,7 @@ comments: true

 <style>
 li {
-font-size: 1.1em; /* Adjust as needed */
+font-size: 1.1rem; /* Adjust as needed */
 }
 </style>

_posts/2025-03-16-minions-summary.md

Lines changed: 1 addition & 1 deletion

@@ -9,7 +9,7 @@ comments: true

 <style>
 li {
-font-size: 1.1em; /* Adjust as needed */
+font-size: 1.1rem; /* Adjust as needed */
 }
 </style>

Lines changed: 118 additions & 0 deletions
@@ -0,0 +1,118 @@
---
layout: post
title: "ParetoQ: Scaling Laws in Extremely Low-bit LLM Quantization"
date: 2025-03-23
description: "ParetoQ: Scaling Laws in Extremely Low-bit LLM Quantization"
tags: ml paper llm quantization
comments: true
---

<style>
li {
font-size: 1.1rem; /* Adjust as needed */
}
</style>

# [ParetoQ: Scaling Laws in Extremely Low-bit LLM Quantization](https://arxiv.org/abs/2502.02631)

> [TL;DR]
> The paper introduces ParetoQ, a unified framework that compares LLM quantization across 1-bit, 1.58-bit, 2-bit, 3-bit, and 4-bit settings. It discovers a key transition between 2-bit and 3-bit quantization, where models retain their original representations at 3-bit and higher, but undergo substantial changes at lower bit widths. ParetoQ shows that 2-bit quantization is a strong alternative to 4-bit due to its superior efficiency-accuracy trade-offs.


## Highlights
- Demonstrates that 2-bit, 3-bit, and ternary quantization often outperform 4-bit in terms of accuracy-memory trade-offs.
<div class="row mt-3">
  <div class="col-sm-6 mt-3 mt-md-0 offset-3">
    {% include figure.html path="assets/img/posts/paretoq/pareto.png" class="img-fluid rounded z-depth-1" %}
  </div>
</div>
<br>
- Identifies a sharp transition between 2-bit and 3-bit quantization, where 3-bit models and above retain their pre-trained distributions, while 2-bit models undergo major representation shifts.
<div class="row mt-3">
  <div class="col-sm-6 mt-3 mt-md-0 offset-3">
    {% include figure.html path="assets/img/posts/paretoq/compansate_reconstruct.png" class="img-fluid rounded z-depth-1" %}
  </div>
</div>
<br>
- Quantization-aware training (QAT) fine-tuning consistently surpasses both post-training quantization (PTQ, no fine-tuning) and QAT from scratch.
<div class="row mt-3">
  <div class="col-sm-6 mt-3 mt-md-0 offset-3">
    {% include figure.html path="assets/img/posts/paretoq/qat_vs_ptq.png" class="img-fluid rounded z-depth-1" %}
  </div>
</div>
<br>
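The advantage of loss-aware fine-tuning over plain rounding can be illustrated with a toy experiment in pure Python (an illustration of the principle only, not the paper's training setup; the data, grid, and weights below are made up):

```python
import random
from itertools import product

random.seed(0)

# Toy linear layer y = w1*x1 + w2*x2, quantized to a 1-bit unsigned
# grid {0, 1} per weight. All numbers here are made up for illustration.
xs = []
for _ in range(64):
    x1 = random.gauss(0, 1)
    xs.append((x1, x1 + 0.1 * random.gauss(0, 1)))  # correlated features
w = (0.6, 0.6)                                       # "pretrained" weights
ys = [w[0] * a + w[1] * b for a, b in xs]

def loss(codes):
    """Mean squared error of the quantized layer on the toy data."""
    return sum((codes[0] * a + codes[1] * b - y) ** 2
               for (a, b), y in zip(xs, ys)) / len(xs)

# PTQ-style: round each weight to its nearest grid point independently.
ptq = (round(w[0]), round(w[1]))                     # -> (1, 1)

# QAT-style: pick the code combination that minimizes the task loss.
qat = min(product((0.0, 1.0), repeat=2), key=loss)

assert loss(qat) < loss(ptq)  # loss-aware assignment wins here
```

Because the two features are correlated, rounding each weight in isolation overshoots the layer's output, while the loss-aware search finds a code combination whose errors cancel; QAT achieves the same effect at scale by updating weights under the quantizer.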
- Proposes a refined quantization function, Stretched Elastic Quant (SEQ), for low-bit settings:

$$
\mathbf{W}_Q^i = \alpha \left( \left\lfloor \text{Clip} \left( \frac{\mathbf{W}_R^i}{\alpha}, -1, 1 \right) \times \frac{k}{2} - 0.5 \right\rceil + 0.5 \right) \times \frac{2}{k}
$$

$$
\mathbf{W}_Q^i = \alpha \mathbf{\hat{W}}_Q^i
=
\begin{cases}
\alpha \cdot \text{Sign}(\mathbf{W}_R^i), & \text{if } N_{bit} = 1 \\
\alpha \left( \left\lfloor \text{Clip} \left( \frac{\mathbf{W}_R^i}{\alpha}, -1, 1 \right) \times \frac{k}{2} - 0.5 \right\rceil + 0.5 \right) \times \frac{2}{k}, & \text{if } N_{bit} = 1.58, 2 \\
\alpha \left\lfloor \text{Clip} \left( \frac{\mathbf{W}_R^i}{\alpha}, n, p \right) \right\rceil, & \text{if } N_{bit} = 3, 4
\end{cases}
$$

where $$\left\lfloor \cdot \right\rceil$$ denotes rounding to the nearest integer and $$[n, p]$$ is the integer clipping range.
<div class="row mt-3">
  <div class="col-sm-10 mt-3 mt-md-0 offset-1">
    {% include figure.html path="assets/img/posts/paretoq/seq.png" class="img-fluid rounded z-depth-1" %}
  </div>
</div>
<br>
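The piecewise quantizer above can be sketched in plain Python (scalar weights for clarity; the 1.58-bit case is omitted, the scale alpha is taken as given rather than learned, and the choices k = 4 for 2-bit and n = -2^(nbit-1), p = 2^(nbit-1) - 1 are assumptions of this sketch, not values from the paper):

```python
def quantize(w, alpha, nbit, k=4):
    """Bit-width-dependent weight quantizer (sketch; 1.58-bit omitted).

    alpha: scale, taken as given here (the paper learns it).
    k: SEQ grid size for 2-bit (k = 4 is an assumption of this sketch).
    """
    if nbit == 1:
        return alpha if w >= 0 else -alpha            # binary: +/- alpha
    if nbit == 2:
        # SEQ: half-integer codes on a grid stretched to cover [-1, 1]
        x = min(max(w / alpha, -1.0), 1.0)
        code = round(x * k / 2 - 0.5) + 0.5
        code = min(max(code, -k / 2 + 0.5), k / 2 - 0.5)  # guard the w = alpha tie
        return alpha * code * 2 / k
    # 3-/4-bit: round-to-nearest on a clipped integer grid, with the
    # bounds n, p assumed to be the usual signed-integer range.
    n, p = -(2 ** (nbit - 1)), 2 ** (nbit - 1) - 1
    return alpha * round(min(max(w / alpha, n), p))

# 2-bit maps weights onto {-0.75, -0.25, 0.25, 0.75} * alpha:
# quantize(0.9, 1.0, 2) -> 0.75, quantize(0.1, 1.0, 2) -> 0.25
```

The half-integer codes in the 2-bit branch are what produce SEQ's evenly spaced, symmetric levels at ±α/4 and ±3α/4, rather than a grid with a level pinned at zero.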


## Summary
- **Observation 1**: Recent studies on scaling laws in the low-precision domain have reached conflicting conclusions.
  - [Dettmers & Zettlemoyer](https://proceedings.mlr.press/v202/dettmers23a) and [Kumar et al.](https://arxiv.org/abs/2411.04330) argue that 4-bit or 6-bit quantization often resides on the Pareto frontier, balancing accuracy and efficiency.
  - In contrast, [Ma et al.](https://storage.prod.researchhub.com/uploads/papers/2024/02/29/2402.17764.pdf) and [Kaushal et al.](https://arxiv.org/abs/2407.12327) suggest that bit-widths as low as 1.58 bits per parameter offer significant potential for optimal scaling trade-offs.
- **Observation 2**: Prior studies overlook the impact of the training scheme, denoted $$\mathbf{S}_{\text{train}}$$, and of the bit-specific quantization function $$\mathcal{F}$$.
- **The problem statement**: How can one determine the optimal trade-off between bit-width and model size while preserving accuracy?
- **The solution**: The authors propose a scaling law $$\mathcal{L}(\mathcal{N}, \mathcal{D}, \mathcal{P}, \mathbf{S}_{\text{train}}, \mathcal{F})$$ spanning five dimensions, and systematically optimize the quantization function and training scheme for each bit-width.
  - Introduces Stretched Elastic Quant (SEQ), which balances quantization grids for 2-bit and ternary settings.
  - Applies learnable quantization ranges, outperforming static min-max methods.
- **The proposed framework**: The unified framework evaluates quantized models under 1-bit, 1.58-bit, 2-bit, 3-bit, and 4-bit precision.

## Experiments

- Accuracy-compression and Accuracy-speed Trade-offs
<div class="row mt-3">
  <div class="col-sm-12 mt-3 mt-md-0 offset-0">
    {% include figure.html path="assets/img/posts/paretoq/pareto_models_latency.png" class="img-fluid rounded z-depth-1" %}
  </div>
</div>
<br>

- 2-bit / 3-bit / 4-bit Comparisons
<div class="row mt-3">
  <div class="col-sm-12 mt-3 mt-md-0 offset-0">
    {% include figure.html path="assets/img/posts/paretoq/bits_2_3_4.png" class="img-fluid rounded z-depth-1" %}
  </div>
</div>
<br>

- 1.58-bit Comparison on Sub-8B Models
  - Note: the floating-point LLaMA-3 3B model achieves 69.9 accuracy.
<div class="row mt-3">
  <div class="col-sm-6 mt-3 mt-md-0 offset-3">
    {% include figure.html path="assets/img/posts/paretoq/sub_8b_model.png" class="img-fluid rounded z-depth-1" %}
  </div>
</div>
<br>

- Main Results
<div class="row mt-3">
  <div class="col-sm-8 mt-3 mt-md-0 offset-2">
    {% include figure.html path="assets/img/posts/paretoq/main_table.png" class="img-fluid rounded z-depth-1" %}
  </div>
</div>
<br>

## Conclusions
- 2-bit quantization outperforms 4-bit in efficiency-accuracy trade-offs.
- Fine-tuning is crucial for sub-4-bit quantization, especially for binary and ternary models.
- Quantization-aware training (QAT) fine-tuning consistently surpasses both post-training quantization (PTQ, no fine-tuning) and QAT from scratch.
- QAT serves as a compensation mechanism at bit widths above 2 bits and as a reconstruction process at 2 bits and below, where weights adapt to form new representations.
- Extremely low-bit quantization is highly sensitive to the choice of quantization function, with no single function optimal across all bit widths.
Binary image files changed: 185 KB, 167 KB, 380 KB, 130 KB, 343 KB, 249 KB
