blogs/mlsys/cs25.md at master · LukeLIN-web/blogs

斯坦福cs25 学习笔记

第一课

V2 I Introduction to Transformers w/ Andrej Karpathy

https://www.youtube.com/watch?v=XfpMkf4rD6E 感觉v4的Lec1不用看，直接看V2 Andrej Karpathy讲transformer的部分就行了

RNN , 不适合长序列和context.

想要domain specfic models,医生GPT, 律师GPT.

encode token embedding和 position embedding. 然后直接加起来. optional dropout 一下.

经过多个block 传播, 之后layer norm, 之后用linear 层 decode logits for the next word.

block也很简单, 就是layer norm然后attention, 然后layernorm然后mlp. mlp是两层linear 加一个dropout.

attention是啥? attention

先通过x linear出 q k v. 四维度 , (batch, head, 时间T, hs 就是feature )

mask fill 把不需要通讯的填上-inf, 为了 softmax之后变成zero. 删除mask fill的话, 就是从decode only的GPT 架构变成有 encoder block. 比如 T5架构 ,需要 cross attention,计算就比较复杂了 .

attention 是多个node communication

alphafold 里面也是transformer.

为什么LLM 可以work well?

手动inspect data