Transformer Decoder: extend docs on input embedding scale #91
NeoLegends wants to merge 3 commits into main
Conversation
I don't think this is true in general. I have seen both variants. Also for LMs. For example, Gemma3:

```python
self.embed_tokens = Gemma3TextScaledWordEmbedding(
    config.vocab_size, config.hidden_size, self.padding_idx, embed_scale=self.config.hidden_size**0.5
)
```

Note, there are some other things to consider: if you don't apply the scale, the embedding weight init matters, e.g.:

```python
elif isinstance(module, nn.Embedding):
    torch.nn.init.normal_(module.weight, mean=0.0, std=1.0)
```

I also saw that some people use custom (much larger) LRs for embeddings, which again might compensate for not using a scale. E.g. see nanochat.

If you share the embedding weights with the LM head, this might affect whether you want such a scale or not (I'm not sure in what way, though...). Most LMs do this.
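To make the interaction between scale and init concrete, here is a minimal sketch (my own illustration, not from the PR): applying a `d_model**0.5` scale on top of a `std=d_model**-0.5` init yields roughly the same initial output statistics as no scale with `std=1.0`, so the two variants mainly differ in how gradients and weight decay act on the embedding table during training:

```python
import torch

d_model, vocab = 512, 1000

# Variant A: scaled embedding output, small init (original Transformer / Gemma3 style).
emb_a = torch.nn.Embedding(vocab, d_model)
torch.nn.init.normal_(emb_a.weight, mean=0.0, std=d_model**-0.5)

# Variant B: no scale in the forward pass, unit-std init (as in the init snippet above).
emb_b = torch.nn.Embedding(vocab, d_model)
torch.nn.init.normal_(emb_b.weight, mean=0.0, std=1.0)

tokens = torch.randint(0, vocab, (4, 16))
out_a = emb_a(tokens) * d_model**0.5  # scale applied in the forward pass
out_b = emb_b(tokens)                 # scale baked into the init instead

# Both outputs have approximately unit std at init.
print(out_a.std(), out_b.std())
```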
```diff
     input_dropout: Dropout applied to the input embedding.
     input_embedding_scale: Scale applied to the input embedding.
-        Set to `None` to apply a (tuned) default.
+        Set to `None` to apply a default that is suitable for ASR AED decoder models.
```
I would not mention any specific model at all here. I think this is just confusing. I would instead just say what default you use.
Mohammad did not use the scale in a pure LM setup.
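Following that suggestion, here is a minimal sketch of stating the concrete default rather than a model family (the helper name and the `sqrt(model_dim)` value are my assumptions, not the actual code under review):

```python
from typing import Optional


def resolve_input_embedding_scale(input_embedding_scale: Optional[float], model_dim: int) -> float:
    """Resolve the effective input embedding scale.

    Hypothetical helper: ``None`` selects the default. The ``sqrt(model_dim)``
    value here is an assumption, not necessarily the PR's actual default.
    """
    if input_embedding_scale is None:
        return model_dim**0.5  # assumed default, as in the original Transformer
    return input_embedding_scale


# Usage in a decoder forward pass (sketch):
# embedded = embedding(tokens) * resolve_input_embedding_scale(cfg.input_embedding_scale, cfg.model_dim)
```

Documenting the default this way ("defaults to `sqrt(model_dim)`", or whatever the tuned value is) avoids naming any specific model in the docstring.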