
Commit 5c875ce

update A4

1 parent a4162bb

File tree: 2 files changed (+52 additions, -5 deletions)

source/_posts/A3-modeling-mlp.md

Lines changed: 23 additions & 4 deletions
@@ -166,7 +166,26 @@ $$
2. Receive the input $\mathbf{X}$; for each `token t`, compute its top-k expert subset and run the `forward` computation only on the intersection of that subset with the local expert set R managed by the current `rank`; for tokens not routed to any local expert, the output remains an all-zero vector.

3. Finally, return an output `hidden states` tensor $\mathbf{O}$ with the same shape as the input $\mathbf{X}$; the non-zero output of a given `token t` is the weighted sum of the sub-outputs produced by the local experts it was routed to.
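The local routing scheme in steps 2 and 3 above can be sketched as follows. This is a minimal illustration, not the assignment's actual API: the class name `LocalSparseMoE`, the parameter `local_expert_ids`, and the expert architecture are all assumptions made for the example.

```python
import torch
import torch.nn as nn

class LocalSparseMoE(nn.Module):
    """Sketch: each rank owns a subset of experts and computes outputs
    only for tokens routed to those local experts; all names here are
    illustrative, not the assignment's real interface."""

    def __init__(self, hidden_size, num_experts, top_k, local_expert_ids):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        # One small feed-forward per *local* expert; the rest live on other ranks.
        self.experts = nn.ModuleDict({
            str(e): nn.Sequential(
                nn.Linear(hidden_size, 4 * hidden_size),
                nn.SiLU(),
                nn.Linear(4 * hidden_size, hidden_size),
            )
            for e in local_expert_ids
        })

    def forward(self, x):                      # x: (batch, seq, hidden)
        b, s, h = x.shape
        flat = x.reshape(-1, h)                # (tokens, hidden)
        probs = self.gate(flat).softmax(dim=-1)
        weights, idx = torch.topk(probs, self.top_k, dim=-1)
        out = torch.zeros_like(flat)           # unrouted tokens stay all-zero
        for e_str, expert in self.experts.items():
            e = int(e_str)
            rows, slots = (idx == e).nonzero(as_tuple=True)
            if rows.numel() == 0:
                continue
            # Weighted sum: accumulate this local expert's contribution.
            out.index_add_(0, rows,
                           weights[rows, slots].unsqueeze(-1)
                           * expert(flat[rows]))
        return out.reshape(b, s, h)
```

Note that the output shape matches the input, and tokens whose top-k experts are all remote contribute zero rows, exactly as described above.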

- Below are some references that may help you complete this task; they can also deepen or broaden your understanding of dense MLP layers, sparse-MoE MLP layers, LoRA Adapters, and activation functions:
- ##TODO add references
+ # References
+
+ * [Llama MLP Module](https://github.com/huggingface/transformers/blob/v4.46.3/src/transformers/models/llama/modeling_llama.py#L229)
+ * [ChatGLM MLP Module](https://huggingface.co/THUDM/chatglm3-6b/blob/main/modeling_chatglm.py#L459)
+ * [GLU Paper](https://arxiv.org/abs/1612.08083)
+ * [GLU Variants Paper](https://arxiv.org/abs/2002.05202)
+ * [PEFT Documentation](https://huggingface.co/docs/peft/index)
+ * [LoRA Paper](https://arxiv.org/abs/2106.09685)
+ * [PEFT LoRA-Linear Layer Implementation](https://github.com/huggingface/peft/blob/main/src/peft/tuners/lora/layer.py#L400)
+ * [PyTorch SiLU Functional](https://pytorch.org/docs/stable/generated/torch.nn.functional.silu.html)
+ * [PyTorch GELU Functional](https://pytorch.org/docs/stable/generated/torch.nn.functional.gelu.html)
+ * [PyTorch ReLU Functional](https://pytorch.org/docs/stable/generated/torch.nn.functional.relu.html)
+ * [PyTorch Sigmoid Functional](https://pytorch.org/docs/stable/generated/torch.nn.functional.sigmoid.html)
+ * [PyTorch Kaiming Normal Initialization](https://pytorch.org/docs/stable/nn.init.html#torch.nn.init.kaiming_normal_)
+ * [PyTorch Xavier Normal Initialization](https://pytorch.org/docs/stable/nn.init.html#torch.nn.init.xavier_normal_)
+ * [MoE Paper](https://arxiv.org/abs/1701.06538)
+ * [Mixtral Paper](https://arxiv.org/abs/2401.04088)
+ * [Mixtral MoE MLP Module](https://github.com/huggingface/transformers/blob/v4.46.3/src/transformers/models/mixtral/modeling_mixtral.py#L610)
+
+ The above are some references that may help you complete the task; they can also deepen or broaden your understanding of the `Dense MLP` layer, `LoRA Adapter`, the `sparse MoE (Mixture of Experts) MLP` layer, and activation functions in deep learning.
+
+ !!Remember: consulting papers, source code, and official documentation, and thinking and learning from them directly, is a fundamental and essential skill. Try not to over-rely on biased or shallow blogs such as CSDN!!

source/_posts/A4-attention-module.md

Lines changed: 29 additions & 1 deletion
@@ -211,4 +211,32 @@ $$
## Online Sliding-Window Attention Summary

+ In summary, you need to implement the `OnlineSlidingWindowAttn` module. It takes the block indices `block_idx_q` and `block_idx_kv` as input, receives a group of tensors $\mathbf{Q}_{\text{bq}_i},\mathbf{K}_{\text{bkv}_j},\mathbf{V}_{\text{bkv}_j}$ in `AttnQKVLayout.BSHD` layout with `AttnQKVPackFormat.Q_K_V` packing format, applies the local offline sliding-window attention operation to that block to compute the local output $\mathbf{O}_{\text{bq}_i}^{\text{bkv}_j}$ and its corresponding local statistics $\text{lse}_{\text{bq}_i}^{\text{bkv}_j}$, and updates them in place into the given global output $\mathbf{O}$ and global statistics `lse`.
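The in-place update of the global output and `lse` follows the standard online-softmax rescaling trick. A minimal sketch of that merge step, assuming simplified shapes (`o` is `(..., seq_q, head_dim)`, `lse` is `(..., seq_q)`); the function name `merge_online_block` is illustrative, not the module's actual interface:

```python
import torch

def merge_online_block(o_global, lse_global, o_local, lse_local):
    """Fold one block's local attention output and log-sum-exp statistics
    into the running global ones, in place (illustrative sketch)."""
    # New normalizer: log(exp(lse_g) + exp(lse_l)), computed stably.
    lse_new = torch.logaddexp(lse_global, lse_local)
    # Rescale each partial output by its share of the new normalizer.
    w_g = torch.exp(lse_global - lse_new).unsqueeze(-1)
    w_l = torch.exp(lse_local - lse_new).unsqueeze(-1)
    o_global.mul_(w_g).add_(w_l * o_local)
    lse_global.copy_(lse_new)
```

Merging block by block this way reproduces, exactly, the softmax-weighted output that a single pass over the full (unblocked) key/value sequence would give.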
+ # References
+
+ * [Nvidia Methods of Improving LLM Training Stability](https://arxiv.org/pdf/2410.16682)
+ * [Llama Attention Layer](https://github.com/huggingface/transformers/blob/v4.46.3/src/transformers/models/llama/modeling_llama.py#L275)
+ * [Google MHA Paper](https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf)
+ * [Google MQA Paper](https://arxiv.org/pdf/1911.02150)
+ * [Google GQA Paper](https://arxiv.org/pdf/2305.13245)
+ * [PyTorch Repeat Interleave Functional](https://pytorch.org/docs/stable/generated/torch.repeat_interleave.html#torch.repeat_interleave)
+ * [Transformer Paper](https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf)
+ * [Online Softmax Paper](https://arxiv.org/pdf/2112.05682)
+ * [LSE Wiki](https://en.wikipedia.org/wiki/LogSumExp)
+ * [PyTorch LSE Functional](https://pytorch.org/docs/stable/generated/torch.logsumexp.html#torch-logsumexp)
+ * [PyTorch Log1p Functional](https://pytorch.org/docs/stable/generated/torch.log1p.html#torch.log1p)
+ * [PyTorch Softplus Functional](https://pytorch.org/docs/stable/generated/torch.nn.functional.softplus.html#torch.nn.functional.softplus)
+ * [Flash Attention 2 Paper](https://arxiv.org/pdf/2307.08691.pdf)
+ * [Flash Attention Interface](https://github.com/Dao-AILab/flash-attention/blob/main/flash_attn/flash_attn_interface.py)
+ * [PyTorch SDPA Functional](https://pytorch.org/docs/stable/generated/torch.nn.functional.scaled_dot_product_attention.html#torch.nn.functional.scaled_dot_product_attention)
+ * [PyTorch FlexAttention Functional](https://pytorch.org/docs/main/nn.attention.flex_attention.html#module-torch.nn.attention.flex_attention)
+
+ Hint: the above are some references that may help you with your task; they can also deepen or broaden your understanding of the attention mechanism in Transformers.
+
+ !!Remember: consulting papers, source code, and official documentation, and thinking and learning from them directly, is a fundamental and essential skill. Try not to over-rely on biased or shallow blogs such as CSDN!!
