
Commit a4ba4a3

Site updated: 2025-07-25 16:19:32

1 parent 0bf2ea0 commit a4ba4a3

4 files changed (+99 lines, -13 lines)

2025/06/29/A3-modeling-mlp/index.html

Lines changed: 24 additions & 5 deletions
@@ -28,7 +28,7 @@
 <meta property="og:description" content="对于本次作业,我们将继续 Modeling 任务,以帮助你更深入地理解 Transformer 的各个组成模块。本次将特别关注 Transformer 结构核心的关键层之一:MLP 层。 Task 1: Dense MLPMulti-Layer Perceptron (MLP) 模块是深度学习中的一个基本模块,特别适用于处理复杂模式和非线性关系的任务。它已被广泛应用于基于 Transformer">
 <meta property="og:locale" content="zh_CN">
 <meta property="article:published_time" content="2025-06-29T13:54:26.000Z">
-<meta property="article:modified_time" content="2025-06-30T09:39:08.458Z">
+<meta property="article:modified_time" content="2025-07-25T08:17:10.554Z">
 <meta property="article:author" content="DeepEngine">
 <meta property="article:tag" content="MLP">
 <meta property="article:tag" content="LoRA">
@@ -187,7 +187,7 @@ <h1 class="post-title" itemprop="name headline">
 <i class="far fa-calendar-check"></i>
 </span>
 <span class="post-meta-item-text">更新于</span>
-<time title="修改时间:2025-06-30 17:39:08" itemprop="dateModified" datetime="2025-06-30T17:39:08+08:00">2025-06-30</time>
+<time title="修改时间:2025-07-25 16:17:10" itemprop="dateModified" datetime="2025-07-25T16:17:10+08:00">2025-07-25</time>
 </span>
 <span class="post-meta-item">
 <span class="post-meta-item-icon">
@@ -320,8 +320,27 @@ <h1 id="Sparse-MLP-小结"><a href="#Sparse-MLP-小结" class="headerlink" title
 <li>接收输入 $\mathbf{X}$,对于每个 <code>token t</code>,计算其 top-k expert 子集,仅对与当前 <code>rank</code> 管理的本地 expert 集合 R 的交集执行 <code>forward</code> 计算流程,对于未路由到本地 experts 的 <code>token</code>,其输出保持为全零向量。</li>
 <li>最终返回与输入 $\mathbf{X}$ 具有相同形状的输出 <code>hidden states</code> $\mathbf{O}$,最终某个 <code>token t</code> 的非零输出为路由到的本地 experts 所产生的子输出的加权和。</li>
 </ol>
-<p>以下是一些可能对你完成该任务有帮助的参考资料,也可以用于加深或拓宽你对 dense MLP 层、sparse-moe MLP 层、LoRA Adapters 和激活函数的理解:</p>
-<h2 id="TODO-补充参考文献"><a href="#TODO-补充参考文献" class="headerlink" title="TODO 补充参考文献"></a>TODO 补充参考文献</h2>
+<h1 id="References"><a href="#References" class="headerlink" title="References"></a>References</h1><ul>
+<li><a target="_blank" rel="noopener" href="https://github.com/huggingface/transformers/blob/v4.46.3/src/transformers/models/llama/modeling_llama.py#L229">Llama MLP Module</a></li>
+<li><a target="_blank" rel="noopener" href="https://huggingface.co/THUDM/chatglm3-6b/blob/main/modeling_chatglm.py#L459">ChatGLM MLP Module</a></li>
+<li><a target="_blank" rel="noopener" href="https://arxiv.org/abs/1612.08083">GLU Paper</a></li>
+<li><a target="_blank" rel="noopener" href="https://arxiv.org/abs/2002.05202">GLU Variants Paper</a></li>
+<li><a target="_blank" rel="noopener" href="https://huggingface.co/docs/peft/index">PEFT Documentation</a></li>
+<li><a target="_blank" rel="noopener" href="https://arxiv.org/abs/2106.09685">LoRA Paper</a></li>
+<li><a target="_blank" rel="noopener" href="https://github.com/huggingface/peft/blob/main/src/peft/tuners/lora/layer.py#L400">PEFT LoRA-Linear Layer Implementation</a></li>
+<li><a target="_blank" rel="noopener" href="https://pytorch.org/docs/stable/generated/torch.nn.functional.silu.html">Pytorch SiLU Functional</a></li>
+<li><a target="_blank" rel="noopener" href="https://pytorch.org/docs/stable/generated/torch.nn.functional.gelu.html">Pytorch GELU Functional</a></li>
+<li><a target="_blank" rel="noopener" href="https://pytorch.org/docs/stable/generated/torch.nn.functional.relu.html">Pytorch ReLU Functional</a></li>
+<li><a target="_blank" rel="noopener" href="https://pytorch.org/docs/stable/generated/torch.nn.functional.sigmoid.html">Pytorch Sigmoid Functional</a></li>
+<li><a target="_blank" rel="noopener" href="https://pytorch.org/docs/stable/nn.init.html#torch.nn.init.kaiming_normal_">Pytorch Kaiming Normal Initialization</a></li>
+<li><a target="_blank" rel="noopener" href="https://pytorch.org/docs/stable/nn.init.html#torch.nn.init.xavier_normal_">Pytorch Xavier Normal Initialization</a></li>
+<li><a target="_blank" rel="noopener" href="https://arxiv.org/abs/1701.06538">MoE Paper</a></li>
+<li><a target="_blank" rel="noopener" href="https://arxiv.org/abs/2401.04088">Mixtral Paper</a></li>
+<li><a target="_blank" rel="noopener" href="https://github.com/huggingface/transformers/blob/v4.46.3/src/transformers/models/mixtral/modeling_mixtral.py#L610">Mixtral MoE MLP Module</a></li>
+</ul>
+<p>以上是一些可能对你完成任务有帮助的参考资料,也可以用来加深或拓宽你对 <code>Dense MLP</code> 层、<code>LoRA Adapter</code><code>稀疏 MoE(Mixture of Experts)MLP</code> 以及深度学习中激活函数的理解。</p>
+<p>!!请记住:查阅论文、源码以及官方文档,并从中进行思考和学习,是一项基本且至关重要的能力。请尽量不要过度依赖一些带有偏见或内容浅显的博客,例如 CSDN!!</p>
+
 </div>
 
 
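The Sparse-MLP summary in the hunk above describes per-token top-k routing against a rank-local expert set R. A minimal sketch of that forward pass follows; all names here (`sparse_moe_forward`, `local_ids`) are illustrative, not the assignment's actual interface:

```python
import torch

def sparse_moe_forward(x, gate_w, experts, local_ids, top_k=2):
    """Sketch of a per-rank sparse-MoE forward (hypothetical names).

    x:         (num_tokens, hidden) input X
    gate_w:    (hidden, num_experts) router weight
    experts:   dict {expert_id: callable} holding only this rank's local experts
    local_ids: set of expert ids managed by this rank (the set R above)
    """
    # Router: per-token scores over all experts, then the top-k subset.
    logits = x @ gate_w                                # (T, E)
    weights, idx = torch.topk(logits.softmax(dim=-1), top_k, dim=-1)
    weights = weights / weights.sum(-1, keepdim=True)  # renormalize over top-k

    out = torch.zeros_like(x)  # unrouted tokens keep an all-zero output row
    for t in range(x.size(0)):
        for w, e in zip(weights[t], idx[t]):
            e = int(e)
            if e in local_ids:                         # intersection with R only
                out[t] += w * experts[e](x[t])         # weighted sum of sub-outputs
    return out                                         # same shape as X
```

As described above, the result has the same shape as the input, and a token's row is nonzero only if at least one of its top-k experts is local to this rank.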
@@ -405,7 +424,7 @@ <h2 id="TODO-补充参考文献"><a href="#TODO-补充参考文献" class="heade
 
 <!--noindex-->
 <div class="post-toc-wrap sidebar-panel">
-<div class="post-toc motion-element"><ol class="nav"><li class="nav-item nav-level-1"><a class="nav-link" href="#Task-1-Dense-MLP"><span class="nav-number">1.</span> <span class="nav-text">Task 1: Dense MLP</span></a><ol class="nav-child"><li class="nav-item nav-level-2"><a class="nav-link" href="#TODO"><span class="nav-number">1.1.</span> <span class="nav-text">TODO</span></a></li></ol></li><li class="nav-item nav-level-1"><a class="nav-link" href="#Task-2-Dense-MLP-with-LoRA-Adapters"><span class="nav-number">2.</span> <span class="nav-text">Task 2: Dense MLP with LoRA Adapters</span></a><ol class="nav-child"><li class="nav-item nav-level-2"><a class="nav-link" href="#TODO-1"><span class="nav-number">2.1.</span> <span class="nav-text">TODO</span></a></li></ol></li><li class="nav-item nav-level-1"><a class="nav-link" href="#Dense-MLP-%E5%B0%8F%E7%BB%93"><span class="nav-number">3.</span> <span class="nav-text">Dense MLP 小结</span></a></li><li class="nav-item nav-level-1"><a class="nav-link" href="#Task-3-Sparse-MLP"><span class="nav-number">4.</span> <span class="nav-text">Task 3: Sparse MLP</span></a><ol class="nav-child"><li class="nav-item nav-level-2"><a class="nav-link" href="#TODO-2"><span class="nav-number">4.1.</span> <span class="nav-text">TODO</span></a></li></ol></li><li class="nav-item nav-level-1"><a class="nav-link" href="#Sparse-MLP-%E5%B0%8F%E7%BB%93"><span class="nav-number">5.</span> <span class="nav-text">Sparse MLP 小结</span></a><ol class="nav-child"><li class="nav-item nav-level-2"><a class="nav-link" href="#TODO-%E8%A1%A5%E5%85%85%E5%8F%82%E8%80%83%E6%96%87%E7%8C%AE"><span class="nav-number">5.1.</span> <span class="nav-text">TODO 补充参考文献</span></a></li></ol></li></ol></div>
+<div class="post-toc motion-element"><ol class="nav"><li class="nav-item nav-level-1"><a class="nav-link" href="#Task-1-Dense-MLP"><span class="nav-number">1.</span> <span class="nav-text">Task 1: Dense MLP</span></a><ol class="nav-child"><li class="nav-item nav-level-2"><a class="nav-link" href="#TODO"><span class="nav-number">1.1.</span> <span class="nav-text">TODO</span></a></li></ol></li><li class="nav-item nav-level-1"><a class="nav-link" href="#Task-2-Dense-MLP-with-LoRA-Adapters"><span class="nav-number">2.</span> <span class="nav-text">Task 2: Dense MLP with LoRA Adapters</span></a><ol class="nav-child"><li class="nav-item nav-level-2"><a class="nav-link" href="#TODO-1"><span class="nav-number">2.1.</span> <span class="nav-text">TODO</span></a></li></ol></li><li class="nav-item nav-level-1"><a class="nav-link" href="#Dense-MLP-%E5%B0%8F%E7%BB%93"><span class="nav-number">3.</span> <span class="nav-text">Dense MLP 小结</span></a></li><li class="nav-item nav-level-1"><a class="nav-link" href="#Task-3-Sparse-MLP"><span class="nav-number">4.</span> <span class="nav-text">Task 3: Sparse MLP</span></a><ol class="nav-child"><li class="nav-item nav-level-2"><a class="nav-link" href="#TODO-2"><span class="nav-number">4.1.</span> <span class="nav-text">TODO</span></a></li></ol></li><li class="nav-item nav-level-1"><a class="nav-link" href="#Sparse-MLP-%E5%B0%8F%E7%BB%93"><span class="nav-number">5.</span> <span class="nav-text">Sparse MLP 小结</span></a></li><li class="nav-item nav-level-1"><a class="nav-link" href="#References"><span class="nav-number">6.</span> <span class="nav-text">References</span></a></li></ol></div>
 </div>
 <!--/noindex-->
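The reference list added in this commit points to the LoRA paper and PEFT's LoRA-Linear layer. The core idea can be sketched in a few lines; this is a simplification under assumed shapes, not the PEFT implementation, and `lora_linear` is an illustrative name:

```python
import torch

def lora_linear(x, w0, lora_a, lora_b, alpha=16.0):
    """y = x W0^T + (alpha / r) * (x A^T) B^T : a frozen base linear layer plus
    a rank-r update B A, per the LoRA paper. Shapes follow nn.Linear convention:
      w0:     (out_features, in_features)  -- frozen pretrained weight
      lora_a: (r, in_features)             -- trainable down-projection
      lora_b: (out_features, r)            -- trainable up-projection
    """
    r = lora_a.size(0)
    base = x @ w0.t()                         # frozen path
    update = (x @ lora_a.t()) @ lora_b.t()    # low-rank path, cost O(r) per dim
    return base + (alpha / r) * update
```

Note that with the paper's initialization (B set to zero) the adapter starts as an exact no-op, so training begins from the pretrained model's behavior.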

2025/07/13/A4-attention-module/index.html

Lines changed: 27 additions & 3 deletions
@@ -31,7 +31,7 @@
 <meta property="og:image" content="https://big-trex.github.io/2025/07/13/A4-attention-module/window.svg">
 <meta property="og:image" content="https://big-trex.github.io/2025/07/13/A4-attention-module/bottom-right.svg">
 <meta property="article:published_time" content="2025-07-13T03:01:59.000Z">
-<meta property="article:modified_time" content="2025-07-25T08:06:36.817Z">
+<meta property="article:modified_time" content="2025-07-25T08:19:23.419Z">
 <meta property="article:author" content="DeepEngine">
 <meta property="article:tag" content="Attention">
 <meta property="article:tag" content="Mask">
@@ -192,7 +192,7 @@ <h1 class="post-title" itemprop="name headline">
 <i class="far fa-calendar-check"></i>
 </span>
 <span class="post-meta-item-text">更新于</span>
-<time title="修改时间:2025-07-25 16:06:36" itemprop="dateModified" datetime="2025-07-25T16:06:36+08:00">2025-07-25</time>
+<time title="修改时间:2025-07-25 16:19:23" itemprop="dateModified" datetime="2025-07-25T16:19:23+08:00">2025-07-25</time>
 </span>
 <span class="post-meta-item">
 <span class="post-meta-item-icon">
@@ -358,6 +358,30 @@ <h2 id="TODO-1"><a href="#TODO-1" class="headerlink" title="TODO"></a>TODO</h2><
 <li>需要注意的是,<code>OnlineSlidingWindowAttn</code> 模块中 <code>forward</code> 方法的每一次 <code>online attention</code> 计算,都应被视为对应 <code>OfflineSlidingWindowAttn</code> 模块中的一次内部迭代步骤(inner iterative step)。即如果我们遍历每一个合法的块索引:$\text{bq}_i \in [0, \frac{\text{sq}}{\text{block_size_q}})$,$\text{bkv}_j \in [0, \frac{\text{skv}}{\text{block_size_kv}})$,并依次在该在线模块中执行对应的 <code>forward</code> 操作,那么最终更新得到的全局输出 $\mathbf{O}$,在忽略数值累积误差(accumulation error)的前提下,应当与 <code>OfflineSlidingWindowAttn</code> 模块输出的结果完全一致。</li>
 </ul>
 <h2 id="Online-Sliding-Window-Attention-小结"><a href="#Online-Sliding-Window-Attention-小结" class="headerlink" title="Online Sliding-Window Attention 小结"></a>Online Sliding-Window Attention 小结</h2><p>总结来说,你需要实现 <code>OnlineSlidingWindowAttn</code> 模块,该模块以块索引 <code>block_idx_q</code><code>block_idx_kv</code> 为输入,接收格式为 <code>AttnQKVLayout.BSHD</code> 布局和 <code>AttnQKVPackFormat.Q_K_V</code> 打包格式的一组张量 $\mathbf{Q}_{\text{bq}_i},\mathbf{K}_{\text{bkv}_j},\mathbf{V}_{\text{bkv}_j}$,对该块应用本地的离线滑动窗口注意力操作,计算出该局部输出 $\mathbf{O}_{\text{bq}_i}^{\text{bkv}_j}$ 及其对应的局部统计量 $\text{lse}_{bq_i}^{bkv_j}$,并将其就地更新到给定的全局输出 $\mathbf{O}$ 和全局统计量 <code>lse</code> 中。</p>
+<h1 id="References"><a href="#References" class="headerlink" title="References"></a>References</h1><ul>
+<li><a target="_blank" rel="noopener" href="https://arxiv.org/pdf/2410.16682">Nvidia Methods of Improving LLM Training Stability</a></li>
+<li><a target="_blank" rel="noopener" href="https://github.com/huggingface/transformers/blob/v4.46.3/src/transformers/models/llama/modeling_llama.py#L275">Llama Attention Layer</a></li>
+<li><a target="_blank" rel="noopener" href="https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf">Google MHA paper</a></li>
+<li><a target="_blank" rel="noopener" href="https://arxiv.org/pdf/1911.02150">Google MQA paper</a></li>
+<li><a target="_blank" rel="noopener" href="https://arxiv.org/pdf/2305.13245">Google GQA paper</a></li>
+<li><a target="_blank" rel="noopener" href="https://pytorch.org/docs/stable/generated/torch.repeat_interleave.html#torch.repeat_interleave">Pytorch Repeat Interleave Functional</a></li>
+<li><a target="_blank" rel="noopener" href="https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf">Transformer paper</a></li>
+<li><a target="_blank" rel="noopener" href="https://arxiv.org/pdf/2112.05682">Online Softmax Paper</a></li>
+<li><a target="_blank" rel="noopener" href="https://en.wikipedia.org/wiki/LogSumExp">LSE Wiki</a></li>
+<li><a target="_blank" rel="noopener" href="https://pytorch.org/docs/stable/generated/torch.logsumexp.html#torch-logsumexp">Pytorch LSE Functional</a></li>
+<li><a target="_blank" rel="noopener" href="https://pytorch.org/docs/stable/generated/torch.log1p.html#torch.log1p">Pytorch Log1p Functional</a></li>
+<li><a target="_blank" rel="noopener" href="https://pytorch.org/docs/stable/generated/torch.nn.functional.softplus.html#torch.nn.functional.softplus">Pytorch Softplus Functional</a></li>
+<li><a target="_blank" rel="noopener" href="https://arxiv.org/pdf/2410.16682">Nvidia Methods of Improving LLM Training Stability</a></li>
+<li><a target="_blank" rel="noopener" href="https://github.com/huggingface/transformers/blob/v4.46.3/src/transformers/models/llama/modeling_llama.py#L275">Llama Attention Layer</a></li>
+<li><a target="_blank" rel="noopener" href="https://pytorch.org/docs/stable/generated/torch.repeat_interleave.html#torch.repeat_interleave">Pytorch Repeat Interleave Functional</a></li>
+<li><a target="_blank" rel="noopener" href="https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf">Transformer paper</a></li>
+<li><a target="_blank" rel="noopener" href="https://arxiv.org/pdf/2307.08691.pdf">Flash Attention 2 Paper</a></li>
+<li><a target="_blank" rel="noopener" href="https://github.com/Dao-AILab/flash-attention/blob/main/flash_attn/flash_attn_interface.py">Flash Attention Interface</a></li>
+<li><a target="_blank" rel="noopener" href="https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html#torch.nn.functional.scaled_dot_product_attention">Pytorch SDPA Functional</a></li>
+<li><a target="_blank" rel="noopener" href="https://pytorch.org/docs/main/nn.attention.flex_attention.html#module-torch.nn.attention.flex_attention">Pytorch FlexAttention Functional</a></li>
+</ul>
+<p>提示:以上是一些可能对你的任务有帮助的参考资料,也可以加深或拓宽你对 Transformer 中注意力机制的理解。</p>
+<p>!!请记住:查阅论文、源码以及官方文档,并从中进行思考和学习,是一项基本且至关重要的能力。请尽量不要过度依赖一些带有偏见或内容浅显的博客,例如 CSDN!!</p>
 
 </div>
 
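The online-attention summary above reduces each (bq_i, bkv_j) step to a rescale-and-accumulate update driven by the per-row log-sum-exp. A sketch of that merge step under simplified shapes (single head, no sliding-window mask; `merge_online_block` is an illustrative name, not the assignment's API):

```python
import torch

def merge_online_block(o_global, lse_global, o_local, lse_local):
    """Fold one block's local attention output into the running global state,
    as in online softmax / Flash Attention.
      o_global, o_local:     (block_q, head_dim)
      lse_global, lse_local: (block_q,) row-wise log-sum-exp statistics
    Returns the updated (o_global, lse_global)."""
    lse_new = torch.logaddexp(lse_global, lse_local)      # combined normalizer
    scale_g = torch.exp(lse_global - lse_new).unsqueeze(-1)
    scale_l = torch.exp(lse_local - lse_new).unsqueeze(-1)
    return scale_g * o_global + scale_l * o_local, lse_new
```

Initializing the global output to zeros and the global lse to negative infinity, then merging every kv block in turn, reproduces full-softmax attention up to accumulation error, matching the stated equivalence with the offline module.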
@@ -443,7 +467,7 @@ <h2 id="Online-Sliding-Window-Attention-小结"><a href="#Online-Sliding-Window-
 
 <!--noindex-->
 <div class="post-toc-wrap sidebar-panel">
-<div class="post-toc motion-element"><ol class="nav"><li class="nav-item nav-level-1"><a class="nav-link" href="#Task-1-Offline-Sliding-Window-Attention"><span class="nav-number">1.</span> <span class="nav-text">Task 1: Offline Sliding-Window Attention</span></a><ol class="nav-child"><li class="nav-item nav-level-2"><a class="nav-link" href="#TODO"><span class="nav-number">1.1.</span> <span class="nav-text">TODO</span></a></li><li class="nav-item nav-level-2"><a class="nav-link" href="#Offline-Sliding-Window-Attention-%E5%B0%8F%E7%BB%93"><span class="nav-number">1.2.</span> <span class="nav-text">Offline Sliding-Window Attention 小结</span></a></li></ol></li><li class="nav-item nav-level-1"><a class="nav-link" href="#Optional-Task2%EF%BC%9AOnline-Sliding-Window-Attention"><span class="nav-number">2.</span> <span class="nav-text">[Optional] Task2:Online Sliding-Window Attention</span></a><ol class="nav-child"><li class="nav-item nav-level-2"><a class="nav-link" href="#TODO-1"><span class="nav-number">2.1.</span> <span class="nav-text">TODO</span></a></li><li class="nav-item nav-level-2"><a class="nav-link" href="#Online-Sliding-Window-Attention-%E5%B0%8F%E7%BB%93"><span class="nav-number">2.2.</span> <span class="nav-text">Online Sliding-Window Attention 小结</span></a></li></ol></li></ol></div>
+<div class="post-toc motion-element"><ol class="nav"><li class="nav-item nav-level-1"><a class="nav-link" href="#Task-1-Offline-Sliding-Window-Attention"><span class="nav-number">1.</span> <span class="nav-text">Task 1: Offline Sliding-Window Attention</span></a><ol class="nav-child"><li class="nav-item nav-level-2"><a class="nav-link" href="#TODO"><span class="nav-number">1.1.</span> <span class="nav-text">TODO</span></a></li><li class="nav-item nav-level-2"><a class="nav-link" href="#Offline-Sliding-Window-Attention-%E5%B0%8F%E7%BB%93"><span class="nav-number">1.2.</span> <span class="nav-text">Offline Sliding-Window Attention 小结</span></a></li></ol></li><li class="nav-item nav-level-1"><a class="nav-link" href="#Optional-Task2%EF%BC%9AOnline-Sliding-Window-Attention"><span class="nav-number">2.</span> <span class="nav-text">[Optional] Task2:Online Sliding-Window Attention</span></a><ol class="nav-child"><li class="nav-item nav-level-2"><a class="nav-link" href="#TODO-1"><span class="nav-number">2.1.</span> <span class="nav-text">TODO</span></a></li><li class="nav-item nav-level-2"><a class="nav-link" href="#Online-Sliding-Window-Attention-%E5%B0%8F%E7%BB%93"><span class="nav-number">2.2.</span> <span class="nav-text">Online Sliding-Window Attention 小结</span></a></li></ol></li><li class="nav-item nav-level-1"><a class="nav-link" href="#References"><span class="nav-number">3.</span> <span class="nav-text">References</span></a></li></ol></div>
 </div>
 <!--/noindex-->
css/main.css

Lines changed: 1 addition & 1 deletion
@@ -1168,7 +1168,7 @@ pre .javascript .function {
 }
 .links-of-author a::before,
 .links-of-author span.exturl::before {
-  background: #a28c21;
+  background: #0e062a;
   border-radius: 50%;
   content: ' ';
   display: inline-block;

0 commit comments
