Conversation
Perhaps merge master and the CI issues will go away.
    @@ -1,8 +1,13 @@
    __all__ = ["FeedForwardConfig", "FeedForwardModel"]
What about these old items here? Why are they gone from __all__?
    input_dim: int
    layer_sizes: List[int]
    dropouts: List[float]
Are you customizing these values per layer? If not, consider also allowing the simple one-value-fits-all variant, like:

    - dropouts: List[float]
    + dropouts: Union[float, Sequence[float]]
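As a minimal sketch of how that variant could be handled (class and field names here are made up, not the PR's actual ones), a scalar value can be broadcast to one dropout per layer in __post_init__:

```python
from dataclasses import dataclass
from typing import Sequence, Union


@dataclass
class _ExampleFeedForwardConfig:  # illustrative only
    layer_sizes: Sequence[int]
    dropouts: Union[float, Sequence[float]]

    def __post_init__(self):
        # Broadcast a single dropout value to one entry per layer so the
        # model code can always iterate over a per-layer sequence.
        if isinstance(self.dropouts, float):
            self.dropouts = [self.dropouts] * len(self.layer_sizes)


cfg = _ExampleFeedForwardConfig(layer_sizes=[512, 512], dropouts=0.1)
assert list(cfg.dropouts) == [0.1, 0.1]
```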
| """ | ||
|
|
||
| input_dim: int | ||
| layer_sizes: List[int] |
I generally prefer

    - layer_sizes: List[int]
    + layer_sizes: Sequence[int]

in config places like this, as that is also correct for tuples, and I think tuples fit configs better because they are immutable (but this is debatable).
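For illustration (hypothetical config class), Sequence[int] accepts both lists and tuples, so the immutable variant type-checks without changes elsewhere:

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class _ExampleConfig:  # illustrative only
    layer_sizes: Sequence[int]


cfg_list = _ExampleConfig(layer_sizes=[512, 512, 512])
cfg_tuple = _ExampleConfig(layer_sizes=(512, 512, 512))  # immutable variant
```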
    def __post_init__(self):
        super().__post_init__()
        assert all(0.0 <= dropout <= 1.0 for dropout in self.dropouts), "Dropout values must be probabilities"
        assert len(self.layer_sizes) > 0, "layer_sizes must not be empty"
        assert len(self.layer_sizes) == len(self.layer_activations)
        assert len(self.layer_sizes) == len(self.dropouts)
    network_layers.append(nn.Linear(prev_size, layer_size))
    prev_size = layer_size
    if cfg.layer_activations[i] is not None:
        network_layers.append(cfg.layer_activations[i])
I have a feeling we should be using ModuleFactorys instead, or at least add support for them, too.
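Roughly sketched, assuming i6_models' ModuleFactoryV1 (a dataclass holding a module class plus its config that instantiates the module when called); the function and parameter names below are illustrative, not the PR's actual code:

```python
from typing import Optional, Sequence

import torch.nn as nn
from i6_models.config import ModuleFactoryV1


def build_ffnn_layers(
    input_dim: int,
    layer_sizes: Sequence[int],
    layer_activations: Sequence[Optional[ModuleFactoryV1]],
) -> nn.Sequential:
    network_layers = []
    prev_size = input_dim
    for layer_size, activation_factory in zip(layer_sizes, layer_activations):
        network_layers.append(nn.Linear(prev_size, layer_size))
        prev_size = layer_size
        if activation_factory is not None:
            # The factory builds a fresh nn.Module here, so the same config
            # can be reused across layers and model instances.
            network_layers.append(activation_factory())
    return nn.Sequential(*network_layers)
```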
| """ | ||
| Reduces the context embedding using a weighted sum based on position vectors. | ||
| """ | ||
| emb_expanded = emb.unsqueeze(3) # [B, S, H, 1, E] |
Consider unsqueezing from the back.
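For example (toy sizes), unsqueeze(-2) produces the same [B, S, H, 1, E] shape and stays valid if extra leading batch dims are added:

```python
import torch

emb = torch.randn(2, 5, 4, 8)  # [B, S, H, E]
assert emb.unsqueeze(3).shape == emb.unsqueeze(-2).shape  # both: [B, S, H, 1, E]
```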
    pos_expanded = self.position_vectors.unsqueeze(0).unsqueeze(0)  # [1, 1, H, K, E]
    alpha = (emb_expanded * pos_expanded).sum(dim=-1, keepdim=True)  # [B, S, H, K, 1]
    weighted = alpha * emb_expanded  # [B, S, H, K, E]
    reduced = weighted.sum(dim=2).sum(dim=2)  # [B, S, E]
Consider indexing dims from the back.
dim can be a tuple of ints, so we could do it in one step:
https://docs.pytorch.org/docs/stable/generated/torch.sum.html

    - reduced = weighted.sum(dim=2).sum(dim=2)  # [B, S, E]
    + reduced = weighted.sum(dim=(-3, -2))  # sum over H and K -> [B, S, E]

(Note that the two summed axes of [B, S, H, K, E] are H and K, i.e. -3 and -2 from the back.)
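A quick shape check with toy sizes (the reduction runs over the H and K axes and keeps E):

```python
import torch

weighted = torch.randn(2, 5, 4, 3, 8)  # [B, S, H, K, E]
reduced = weighted.sum(dim=(-3, -2))   # sum over H and K
assert reduced.shape == (2, 5, 8)      # [B, S, E]
```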
| """ | ||
|
|
||
| def __init__(self, cfg: FfnnTransducerPredictionNetworkV1Config): | ||
| super().__init__(EmbeddingTransducerPredictionNetworkV1Config.from_child(cfg)) |
Since the first config inherits from the second one, you can just do:

    - super().__init__(EmbeddingTransducerPredictionNetworkV1Config.from_child(cfg))
    + super().__init__(cfg)
EDIT: With composition instead of inheritance, this comment is no longer relevant.
    cfg.ffnn_cfg.input_dim = self.output_dim
    self.ffnn = FeedForwardBlockV1(cfg.ffnn_cfg)
Leave the configs immutable; that is always safer w.r.t. bugs.

    - cfg.ffnn_cfg.input_dim = self.output_dim
    - self.ffnn = FeedForwardBlockV1(cfg.ffnn_cfg)
    + self.ffnn = FeedForwardBlockV1(
    +     dataclasses.replace(cfg.ffnn_cfg, input_dim=self.output_dim)
    + )

dataclasses.replace creates a copy of the config with the field overridden, so nothing is mutated in place.
Or we could not change anything and instead throw an error if a wrong value is configured.
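Sketched out with the names from the diff above, that alternative could look like this inside __init__ (illustrative, not a definitive implementation):

```python
if cfg.ffnn_cfg.input_dim != self.output_dim:
    raise ValueError(
        f"ffnn_cfg.input_dim ({cfg.ffnn_cfg.input_dim}) must match the "
        f"prediction network output dim ({self.output_dim})"
    )
self.ffnn = FeedForwardBlockV1(cfg.ffnn_cfg)
```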
    ffnn_cfg: FeedForwardBlockV1Config

    class FfnnTransducerPredictionNetworkV1(EmbeddingTransducerPredictionNetworkV1):
I think this class would benefit from using composition instead of inheritance. Make it contain/own an EmbeddingTransducerPredictionNetworkV1 instead of inheriting from one. That resolves all your issues wrt. config nesting/updating.
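A rough sketch of the composition variant; the embedding_cfg/ffnn_cfg fields and the forward signature are illustrative, only the class names follow the PR:

```python
import torch
from torch import nn


class FfnnTransducerPredictionNetworkV1(nn.Module):
    def __init__(self, cfg: "FfnnTransducerPredictionNetworkV1Config"):
        super().__init__()
        # Own an embedding network instead of inheriting from one; each part
        # keeps its own, unmodified config.
        self.embedding_net = EmbeddingTransducerPredictionNetworkV1(cfg.embedding_cfg)
        self.ffnn = FeedForwardBlockV1(cfg.ffnn_cfg)

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        return self.ffnn(self.embedding_net(history))
```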
    if not self.training:
        output = torch.log_softmax(output, dim=-1)  # [B, T, S, F]
+1, I think we usually get logits in the train step and apply the appropriate softmax function there
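As a minimal sketch of that pattern (the function name is illustrative): forward returns raw logits unconditionally, and normalization happens where it is needed, e.g. in the train or recognition step:

```python
import torch


def train_step(joint_logits: torch.Tensor) -> torch.Tensor:
    # joint_logits: [B, T, S, F] raw scores from the model's forward;
    # the loss/search code decides how to normalize them.
    return torch.log_softmax(joint_logits, dim=-1)
```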
    source_encodings: torch.Tensor,  # [1, T, E]
    target_encodings: torch.Tensor,  # [B, S, P]
Are source_encodings the output of the acoustic encoder and target_encodings the output of the prediction network? Maybe we could rename (and document) these better.
    Processes the input history through the embedding layer and optional reduction.
    """
    if len(history.shape) == 2:  # reshape if input shape [B, H]
        history = history.view(*history.shape[:-1], 1, history.shape[-1])  # [B, 1, H]
*history.shape[:-1] reads oddly; that should be the same as history.shape[0], since we have len(history.shape) == 2. But talk to @NeoLegends about making this work with more batch dims.
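For illustration, an unsqueeze from the back would express the same reshape and also work with additional leading batch dims:

```python
import torch

history = torch.randint(0, 10, (3, 4))                 # [B, H]
assert history.unsqueeze(-2).shape == (3, 1, 4)        # [B, 1, H]

history_2b = torch.randint(0, 10, (2, 3, 4))           # [B1, B2, H]
assert history_2b.unsqueeze(-2).shape == (2, 3, 1, 4)  # [B1, B2, 1, H]
```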
    def forward_fullsum(
        self,
        targets: torch.Tensor,  # [B, S]
        target_lengths: torch.Tensor,  # [B]
target_lengths seems to be unused in any of the forward calls. Is it needed?
    reduce_embedding: bool
    num_reduction_heads: Optional[int]

    def __post__init__(self):
    - def __post__init__(self):
    + def __post_init__(self):

Typo.
    non_context_padding = torch.full(
        (targets.size(0), self.cfg.context_history_size),
        fill_value=self.blank_id,
        dtype=targets.dtype,
        device=targets.device,
    )  # [B, H]
    extended_targets = torch.cat([non_context_padding, targets], dim=1)  # [B, S+H]
    history = torch.stack(
        [
            extended_targets[:, self.cfg.context_history_size - 1 - i : (-i if i != 0 else None)]
            for i in reversed(range(self.cfg.context_history_size))
        ],
        dim=-1,
    )  # [B, S+1, H]
ChatGPT suggested this code:

    B, S = targets.shape
    H = self.cfg.context_history_size
    # Pad left with H blanks: [B, S+H]
    extended = F.pad(targets, (H, 0), value=self.blank_id)
    # Unfold over the sequence dim to get [B, S+1, H]
    # (PyTorch: unfold(size=H, step=1) slides a length-H window)
    history = extended.unfold(dimension=1, size=H, step=1)  # [B, S+1, H]
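A toy check (H=2, blank_id=0) that the unfold variant produces the same history tensor as the stacked-slices version from the diff:

```python
import torch
import torch.nn.functional as F

targets = torch.tensor([[3, 1, 4, 1, 5]])  # [B, S]
H, blank_id = 2, 0

extended = F.pad(targets, (H, 0), value=blank_id)        # [B, S+H]
unfolded = extended.unfold(dimension=1, size=H, step=1)  # [B, S+1, H]

stacked = torch.stack(
    [extended[:, H - 1 - i : (-i if i != 0 else None)] for i in reversed(range(H))],
    dim=-1,
)  # [B, S+1, H]

assert torch.equal(unfolded, stacked)
```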
I add a transducer prediction network and joint network here. The advantage is that the interface supports three modes: recognition, fixed-path (Viterbi) training, and fullsum training (standard RNN-T). Other network structures, such as an LSTM prediction network (I tested training but not recognition, so it is not included here), can also be added easily. I also support embedding reduction as introduced in this paper, which gave a slight improvement in my tests. Many lines of code originate from Simon's setup.