TransformerDecoder: optional positional encoding and final matmul #93

Gerstenberger wants to merge 4 commits into main from …
Conversation
    num_output: int
    logits_bias: bool
    share_embedding: bool
    use_positional_encoding: bool = True
I wonder if, instead of a flag, this should be a configurable module, which you simply replace with a no-op if you don't want any positional encoding. This would also allow positional encoding schemes other than sinusoidal.
Yes, agreed, it would be better to have this more dynamic.
ConformerMHSARelPosV1._sinusoidal_pe should maybe be moved to a separate function, and then you would have positional_encoding=absolute_sinusoidal_positional_encoding as the default, with None also allowed.
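For illustration, a minimal sketch of what that could look like (the names here, e.g. PositionalEncodingFn, are hypothetical and not part of this PR): the boolean flag is replaced by an optional callable in the config, with the extracted sinusoidal function as the intended default and None disabling positional encoding.

```python
from dataclasses import dataclass
from typing import Callable, Optional

from torch import Tensor

# Assumed signature for the extracted PE function: absolute positions [T] plus the
# embedding dim map to an encoding [T, embed_dim] (see the sketch further down).
PositionalEncodingFn = Callable[[Tensor, int], Tensor]


@dataclass
class TransformerDecoderV1Config:
    # ... other existing fields unchanged ...
    num_output: int
    logits_bias: bool
    share_embedding: bool
    # replaces `use_positional_encoding: bool = True`; the intended default would be
    # the extracted absolute sinusoidal PE function, None means no positional encoding
    positional_encoding: Optional[PositionalEncodingFn] = None
```

In the decoder this would reduce the branching to a single `if self.cfg.positional_encoding is not None` check, and other schemes (learned, relative, no-op) could be plugged in without touching the decoder itself.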
    logits_bias: bool
    share_embedding: bool
    use_positional_encoding: bool = True
    do_output_embedding_matmul: bool = True
Perhaps

    - do_output_embedding_matmul: bool = True
    + embed_outputs_to_vocab_dim: bool = True

is clearer naming-wise?
I don't think it's clearer. But I also don't like the original name. And I'm also not sure whether I like the logic at all (see my separate comment on why to have out_logits at all if it is not used).
As a first comment (I will try to comment in more detail later): the same questions have been thought about in the RF implementation, for the Transformer encoder, decoder, and, very related, also the Conformer encoder (to make the frontend optional, etc.); see the current RF TransformerDecoder implementation. It already has the …
    @@ -190,13 +194,20 @@ def __init__(self, cfg: TransformerDecoderV1Config):
        else:
            self.out_logits = nn.Linear(self.model_dim, cfg.num_output, bias=cfg.logits_bias)
I just realized: this sharing is weird. I would always set self.out_logits. If sharing, you can just do self.out_logits.weight = self.input_embedding.weight. That would simplify the other code.
Also, self.out_logits should always be set (None if not used). But with my suggestion, you don't need to care about this.
And then you would also allow logits_bias=True together with share_embedding=True.
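A minimal sketch of that suggestion (the class and argument names are made up for illustration; only the tying logic matters):

```python
from torch import nn


class _SharedHeadSketch(nn.Module):
    """Illustrative only, not the actual decoder code."""

    def __init__(self, model_dim: int, num_output: int, logits_bias: bool, share_embedding: bool):
        super().__init__()
        self.input_embedding = nn.Embedding(num_output, model_dim)
        # always construct out_logits ...
        self.out_logits = nn.Linear(model_dim, num_output, bias=logits_bias)
        if share_embedding:
            # ... and tie the weights: nn.Linear.weight is [num_output, model_dim],
            # the same shape as the embedding matrix, so the Parameter can be shared.
            # The bias stays independent, so logits_bias=True still works with sharing.
            self.out_logits.weight = self.input_embedding.weight
```

With this, the forward path always goes through self.out_logits, regardless of sharing.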
    logits_bias: bool
    share_embedding: bool
    use_positional_encoding: bool = True
    do_output_embedding_matmul: bool = True
If this is False and cfg.share_embedding is not set, out_logits is not used at all. Does it make sense to even have it then?
I made an initial proposal, but I am not really sure about the changes. Maybe the proposal is too complicated / not straightforward enough, which is what I tend towards. In general, would a …? Please let me know.
        :param lengths: input lengths
        :param state: current state of positional encoding.
        """
        sinus_pe = ConformerMHSARelPosV1._sinusoidal_pe(
Maybe this should be moved to primitives?
        :param state: current state of positional encoding.
        """
        sinus_pe = ConformerMHSARelPosV1._sinusoidal_pe(
            torch.arange(labels.shape[-1], device=labels.device) + state["pos"], self.embed_dim
Do we have the constraint labels.shape[-1] == lengths.max()? If so, the labels input can be removed.
Or do we only return sinus_pe.unsqueeze(0) and apply the addition later?
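A sketch of the second variant, assuming the time axis can be derived from lengths and that only the encoding is returned for the caller to add (names are hypothetical; the actual _sinusoidal_pe may lay out sin/cos differently, e.g. interleaved instead of concatenated):

```python
import torch
from torch import Tensor


def positional_encoding_step(lengths: Tensor, pos_offset: Tensor, embed_dim: int) -> Tensor:
    """
    :param lengths: [B] sequence lengths of the current chunk
    :param pos_offset: absolute position of the first frame (the "pos" carried in the state)
    :param embed_dim: embedding dimension, assumed even here
    :return: [1, T, embed_dim] with T = lengths.max(), added to the embeddings by the caller
    """
    positions = torch.arange(int(lengths.max()), device=lengths.device) + pos_offset  # [T]
    inv_freq = 1.0 / (
        10000 ** (torch.arange(0, embed_dim, 2, device=lengths.device, dtype=torch.float32) / embed_dim)
    )
    angles = positions.to(torch.float32).unsqueeze(-1) * inv_freq  # [T, embed_dim // 2]
    sinus_pe = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)  # [T, embed_dim]
    return sinus_pe.unsqueeze(0)


# caller side, roughly:
#   pe = positional_encoding_step(lengths, state["pos"], self.embed_dim)
#   x = input_embedding + pe
```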
    block_state: List[TransformerDecoderBlockV1State]
    pos: Tensor
    pos_state: NotRequired[PositionalEncodingV1State]
Okay, maybe we should not change the names and the type here, as this breaks existing setups.
Changes to make both the positional encoding and the final matrix multiplication of the model output with the output embedding matrix optional.
This allows us to use the implementation for self-normalized LM Transformer training, where positional encoding is not required and the final matmul is replaced by another matmul in the sampling loss.

My only question is: should this be a TransformerDecoderV2 instead?
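To make the intended use concrete, here is a rough sketch of how a sampling loss could consume the raw decoder outputs when do_output_embedding_matmul=False (the function and argument names below are hypothetical, not from this PR, and the exact sampling scheme in the actual setup may differ):

```python
from typing import Tuple

import torch
from torch import Tensor


def sampling_loss_scores(hidden: Tensor, embedding: Tensor, targets: Tensor, sampled_ids: Tensor) -> Tuple[Tensor, Tensor]:
    """
    :param hidden: decoder outputs [B, T, F] (do_output_embedding_matmul=False)
    :param embedding: (shared) output embedding matrix [V, F]
    :param targets: gold label indices [B, T]
    :param sampled_ids: sampled negative label indices [K]
    :return: target scores [B, T] and sampled scores [B, T, K]
    """
    target_emb = embedding[targets]                # [B, T, F]
    target_scores = (hidden * target_emb).sum(-1)  # [B, T]
    sampled_emb = embedding[sampled_ids]           # [K, F]
    sampled_scores = hidden @ sampled_emb.T        # [B, T, K]
    return target_scores, sampled_scores
```

A self-normalized criterion (NCE- or sampled-softmax-style) is then computed from these scores, so the full [B, T, num_output] logits are never materialized.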