---
layout: page
title: Quamba2
full_title: "Quamba2: A Robust and Scalable Post-training Quantization Framework for Selective State Space Models"
authors: Hung-Yueh Chiang, Chi-Chih Chang, Natalia Frumkin, Kai-Chiang Wu, Mohamed S. Abdelfattah, Diana Marculescu
description: "A Quantization Framework for Selective State Space Models"
img: assets/img/publication_preview/quamba2_blog.jpg
importance: 1
category: research
---

<style>
li {
  font-size: 1.1rem; /* Adjust as needed */
}
</style>

<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/4.7.0/css/font-awesome.min.css">

<div style="text-align: center; padding-bottom: 1rem;">
<!-- <abbr class="badge" style="background-color:#00369f; margin-left:0.1rem; margin-right:0.1rem; font-size:1.1rem;">ICLR 2025</abbr> -->
<abbr class="badge" style="background-color:#BF5700; margin-left:0.1rem; margin-right:0.1rem; font-size:1.1rem; width:80px; display:inline-block; text-align:center;">arXiv</abbr>
</div>

<div class="authors">
  <a href="https://hychiang.info">Hung-Yueh Chiang</a><sup>1</sup>,
  <a href="https://ccchang.info/">Chi-Chih Chang</a><sup>2</sup>,
  <a href="https://www.nfrumkin.com/">Natalia Frumkin</a><sup>1</sup>,
  <br>
  <a href="https://people.cs.nycu.edu.tw/~kcw/">Kai-Chiang Wu</a><sup>3</sup>,
  <a href="https://www.mohsaied.com/">Mohamed S. Abdelfattah</a><sup>2</sup>,
  <a href="https://users.ece.utexas.edu/~dianam/">Diana Marculescu</a><sup>1</sup>
</div>
<div class="authors">
  <sup>1</sup>The University of Texas at Austin,
  <sup>2</sup>Cornell University,
  <sup>3</sup>National Yang Ming Chiao Tung University
</div>
<div style="text-align: center; margin-top:12px;">
  <a href="https://arxiv.org/abs/2503.22879"><i class="fa fa-file-pdf-o" style="font-size:24px;"></i><b> Paper </b></a>
  <a href="https://github.com/enyac-group/Quamba"><i class="fa fa-github" style="font-size:24px;"></i><b> Code </b></a>
  <a href="https://huggingface.co/ut-enyac"><span style="font-size: 22px;">🤗</span><b> Models </b></a>
</div>

<br>
<div style="text-align: center;">
  <p style="font-family: Comic Neue; font-size: 1.4rem;">
    🔻 <b>4<span>×</span> memory reduction</b>
    🚀 <b>13 tokens per second on Orin Nano 8G</b>
  </p>
</div>
<div class="row mt-3">
  <div class="col-sm-8 mt-3 mt-md-0 offset-2">
    {% include figure.html path="assets/img/projects/quamba2/quamba2.png" title="example image" class="img-fluid rounded z-depth-1" %}
  </div>
</div>
<br>

# 4-bit Mamba1 and Mamba2 blocks
- **W4A8**, **W4A16**, **W4AX**, and **W8A8** for both **Mamba1** and **Mamba2**
- **Head-to-toe (H2T)** 4/8-bit quantization from the embedding layer, through the SSM blocks, to the final output layer (a numeric sketch of W4A8 follows the figure)
<div class="row">
  <div class="col-sm mt-3 mt-md-0">
    {% include gif.html path="assets/img/projects/quamba2/quamba2_supports.jpg" title="example image" class="img-fluid rounded z-depth-1" %}
  </div>
</div>
<br>
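
To make the precision tags concrete, here is a minimal PyTorch sketch of the numerics behind **W4A8**: symmetric per-channel 4-bit weight quantization paired with symmetric per-tensor 8-bit activation quantization. This illustrates the arithmetic only; it is not Quamba2's kernels, which run on packed integers and apply the paper's channel/head grouping, and the function names here are ours.

```python
import torch

def quantize_weight_w4(w: torch.Tensor):
    """Illustrative symmetric per-output-channel 4-bit weight quantization."""
    qmax = 7                                           # signed 4-bit range: [-8, 7]
    scale = w.abs().amax(dim=1, keepdim=True) / qmax   # one scale per output channel
    q = torch.clamp(torch.round(w / scale), -8, qmax)
    return q.to(torch.int8), scale                     # 4-bit values stored in int8

def quantize_act_a8(x: torch.Tensor):
    """Illustrative symmetric per-tensor 8-bit activation quantization."""
    scale = x.abs().max() / 127                        # signed 8-bit range: [-128, 127]
    q = torch.clamp(torch.round(x / scale), -128, 127)
    return q.to(torch.int8), scale

# A W4A8 linear layer then runs an integer matmul and rescales once at the end:
w, x = torch.randn(256, 128), torch.randn(16, 128)
qw, sw = quantize_weight_w4(w)
qx, sx = quantize_act_a8(x)
y = (qx.float() @ qw.float().t()) * sx * sw.t()        # approximates x @ w.t()
```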

# Storage reduction
- Achieve a **4** $$\times$$ memory reduction with Head-to-toe (H2T) 4-bit quantization (see the back-of-the-envelope math below)
- Enable deploying Mamba2-8B on the **Orin Nano 8G**
<div class="row mt-3">
  <div class="col-sm-10 mt-3 mt-md-0 offset-1">
    {% include figure.html path="assets/img/projects/quamba2/quamba2_size_2.png" title="example image" class="img-fluid rounded z-depth-1" %}
  </div>
</div>
<br>
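
The **4** $$\times$$ figure follows directly from the bit widths. A back-of-the-envelope check for an 8B-parameter model, counting weight storage only:

```python
# Weight storage for an 8B-parameter model at different bit widths
# (weights only; quantization scales and runtime state are ignored).
params = 8e9

fp16_gb = params * 16 / 8 / 1e9   # ~16 GB: too large for an 8 GB board
w8_gb   = params * 8 / 8 / 1e9    # ~8 GB
w4_gb   = params * 4 / 8 / 1e9    # ~4 GB: 4x smaller than FP16, fits on Orin Nano 8G

print(f"FP16: {fp16_gb:.0f} GB | W8: {w8_gb:.0f} GB | W4: {w4_gb:.0f} GB")
```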

# End-to-end latency speedup
- Speed up generation by **3** $$\times$$ on the A5000 GPU
- Run at **13** tokens/second on the **Orin Nano 8G**
<div class="row mt-4">
  <div class="col-sm mt-3 mt-md-0 offset-0">
    {% include figure.html path="assets/img/projects/quamba2/quamba2_latency_2.jpg" title="example image" class="img-fluid rounded z-depth-1" %}
  </div>
</div>
<br>

# Generalization and robustness
We search for a mixed-precision **W4A**$$X$$ configuration (the last row, in red) to improve the generalization and robustness of low bit-width SSMs, and evaluate on the large multitask MMLU benchmark. A generic sketch of such a bit-width search follows the figure.
<div class="row mt-4">
  <div class="col-sm mt-3 mt-md-0 offset-0">
    {% include figure.html path="assets/img/projects/quamba2/quamba2_searched_2.jpg" title="example image" class="img-fluid rounded z-depth-1" %}
  </div>
</div>
<br>
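
One simple way to arrive at such a mixed configuration is a greedy sensitivity search: start with everything at 4 bits, then repeatedly upgrade to 8 bits the block whose upgrade reduces calibration loss the most per extra bit of storage. The sketch below is a generic heuristic of this kind, not the exact search used in the paper; `blocks` and `eval_loss` are hypothetical stand-ins for the model's quantizable blocks and a calibration-loss evaluator.

```python
def greedy_bitwidth_search(blocks, eval_loss, budget_bits):
    """Generic greedy mixed-precision search (illustrative only).

    blocks:    {block_name: parameter_count}
    eval_loss: hypothetical callback; quantizes the model with a
               {block_name: bits} config and returns calibration loss.
    """
    cfg = {name: 4 for name in blocks}            # start fully 4-bit
    used = sum(4 * n for n in blocks.values())
    while True:
        base = eval_loss(cfg)
        best, best_gain = None, 0.0
        for name, n in blocks.items():
            extra = 4 * n                         # cost of upgrading 4 -> 8 bits
            if cfg[name] == 8 or used + extra > budget_bits:
                continue
            gain = (base - eval_loss(dict(cfg, **{name: 8}))) / extra
            if gain > best_gain:                  # loss drop per extra bit
                best, best_gain = name, gain
        if best is None:                          # nothing fits or nothing helps
            return cfg
        cfg[best] = 8
        used += 4 * blocks[best]
```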

# Zero-shot evaluation
<div class="row mt-4">
  <div class="col-sm-10 mt-4 mt-md-0 offset-1">
    {% include figure.html path="assets/img/projects/quamba2/quamba2_main_table.png" title="example image" class="img-fluid rounded z-depth-1" %}
  </div>
</div>
<br>

# Citation
{% raw %}
```latex
@article{chiang2025quamba2,
  title={Quamba2: A Robust and Scalable Post-training Quantization Framework for Selective State Space Models},
  author={Chiang, Hung-Yueh and Chang, Chi-Chih and Frumkin, Natalia and Wu, Kai-Chiang and Abdelfattah, Mohamed S. and Marculescu, Diana},
  journal={arXiv preprint arXiv:2503.22879},
  year={2025}
}
```
{% endraw %}

<br>
# Acknowledgements
This work was supported in part by the ONR Minerva program, NSF CCF Grant No. 2107085, iMAGiNE - the Intelligent Machine Engineering Consortium at UT Austin, UT Cockrell School of Engineering Doctoral Fellowships, NSF Grant No. 2339084, and Taiwan’s NSTC Grant No. 111-2221-E-A49-148-MY3.