
Conversation

@csukuangfj
Collaborator

The K2SpeechRecognitionDataset is changed so that

  • For each audio, it first creates two copies, then augments each copy with a different musan audio at the feature level
  • In the end, it returns 3 versions of each audio:
    • The original one, without any augmentation
    • Two copies, each with a different musan augmentation

To make the code review easier, the changes are posted below.

diff --git a/egs/librispeech/ASR/zipformer/speech_recognition.py b/egs/librispeech/ASR/zipformer/speech_recognition.py
index 4a3520b3..828602fc 100644
--- a/egs/librispeech/ASR/zipformer/speech_recognition.py
+++ b/egs/librispeech/ASR/zipformer/speech_recognition.py
@@ -103,13 +103,15 @@ class K2SpeechRecognitionDataset(torch.utils.data.Dataset):
         # Sort the cuts by duration so that the first one determines the batch time dimensions.
         cuts = cuts.sort_by_duration(ascending=False)

-        # Optional CutSet transforms - e.g. padding, or speed perturbation that adjusts
-        # the supervision boundaries.
-        for tnfm in self.cut_transforms:
-            cuts = tnfm(cuts)
+        if self.cut_transforms:
+            orig_cuts = cuts

-        # Sort the cuts again after transforms
-        cuts = cuts.sort_by_duration(ascending=False)
+            cuts = cuts.repeat(times=2)
+
+            for tnfm in self.cut_transforms:
+                cuts = tnfm(cuts)
+
+            cuts = orig_cuts + cuts

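The resulting batch layout can be sketched in plain Python. This is an illustrative stand-in (the `batch_layout` helper and the cut ids are made up), but the ordering and the `_repeat0`/`_repeat1` id suffixes match what the usage example further down asserts:

```python
# Hypothetical sketch only: it mimics "orig_cuts + cuts" from the
# diff above -- originals first, then the two copies produced by
# cuts.repeat(times=2), whose ids carry _repeat0/_repeat1 suffixes.
def batch_layout(orig_ids):
    copies = [f"{cut_id}_repeat{k}" for k in range(2) for cut_id in orig_ids]
    return list(orig_ids) + copies


ids = batch_layout(["cut-a", "cut-b"])
n = len(ids) // 3
assert ids[:n] == ["cut-a", "cut-b"]                         # un-augmented
assert ids[n : 2 * n] == ["cut-a_repeat0", "cut-b_repeat0"]  # 1st musan copy
assert ids[2 * n :] == ["cut-a_repeat1", "cut-b_repeat1"]    # 2nd musan copy
```
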
Usage

#!/usr/bin/env python3

from lhotse import load_manifest, load_manifest_lazy
from lhotse.dataset import CutMix, SimpleCutSampler
from lhotse.dataset.input_strategies import PrecomputedFeatures
from torch.utils.data import DataLoader

from zipformer.speech_recognition import K2SpeechRecognitionDataset


def main():
    cuts_musan = load_manifest("./data/fbank/musan_cuts.jsonl.gz")
    test_clean = load_manifest_lazy("./data/fbank/librispeech_cuts_test-clean.jsonl.gz")

    transforms = [
        CutMix(
            cuts=cuts_musan,
            p=1.0,
            snr=(10, 20),
            preserve_id=True,
            pad_to_longest=False,
        )
    ]

    train = K2SpeechRecognitionDataset(
        input_strategy=PrecomputedFeatures(),
        cut_transforms=transforms,
        input_transforms=[],
        return_cuts=True,
    )

    train_sampler = SimpleCutSampler(
        test_clean,
        max_duration=50,
        shuffle=True,
    )

    train_dl = DataLoader(
        train,
        sampler=train_sampler,
        batch_size=None,
        num_workers=1,
        persistent_workers=False,
    )

    for batch in train_dl:
        features = batch["inputs"]

        cuts = batch["supervisions"]["cut"]

        # The batch has 3 equal blocks: originals first, then the two
        # musan-augmented copies (see the _repeat0/_repeat1 id checks below).
        num_orig_audio = features.shape[0] // 3
        orig_cuts = cuts[:num_orig_audio]
        copy0_cuts = cuts[num_orig_audio : 2 * num_orig_audio]
        copy1_cuts = cuts[2 * num_orig_audio : 3 * num_orig_audio]

        orig_feats = features[:num_orig_audio]
        copy0_feats = features[num_orig_audio : 2 * num_orig_audio]
        copy1_feats = features[2 * num_orig_audio : 3 * num_orig_audio]

        for i in range(num_orig_audio):
            assert f"{orig_cuts[i].id}_repeat0" == copy0_cuts[i].id, (
                orig_cuts[i].id,
                copy0_cuts[i].id,
            )
            assert f"{orig_cuts[i].id}_repeat1" == copy1_cuts[i].id, (
                orig_cuts[i].id,
                copy1_cuts[i].id,
            )

            # augmentation with musan does not change the feature frames
            assert (
                orig_cuts[i].num_frames
                == copy0_cuts[i].num_frames
                == copy1_cuts[i].num_frames
            )


if __name__ == "__main__":
    main()

Note that the following transform with musan won't change the number of feature frames of a cut.

    transforms = [
        CutMix(
            cuts=cuts_musan,
            p=1.0,
            snr=(10, 20),
            preserve_id=True,
            pad_to_longest=False,
        )
    ]

In addition, the modified dataset still returns the original audio without any augmentation.
@csukuangfj changed the title to "Support using different musan augmentations for the same audio." Jul 1, 2025
@danpovey
Collaborator

danpovey commented Jul 1, 2025

Cool... so this would return a batch whose size is a multiple of 3, right? And the first block would be the un-augmented features?

@csukuangfj
Collaborator Author

> Cool... so this would return a batch whose size is a multiple of 3, right? And the first block would be the un-augmented features?

Yes, you are right.

@shylockasr

The idea is great. We have been testing on internal data here: training converges with the original musan noise-augmentation scheme, but does not converge with the tripled-data noise scheme.

Also, the training batch size has to be reduced.

@csukuangfj
Collaborator Author

> ...noise-augmentation scheme converges; with the tripled-data...

This is mainly meant to be used together with CR-CTC.
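For readers unfamiliar with CR-CTC: the point of having two differently-augmented copies of the same utterance is a consistency-regularization term between their outputs. The sketch below is illustrative only (the function names and the symmetric-KL formulation are assumptions for exposition, not code from this PR; real CR-CTC operates on CTC output log-probs in PyTorch):

```python
import math

# Hypothetical sketch of a consistency-regularization term of the kind
# CR-CTC applies between the two musan-augmented copies of the same
# utterance: a symmetric KL divergence between per-frame posteriors.
def kl_div(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def consistency_loss(posteriors0, posteriors1):
    """Average symmetric KL over frames of the two augmented copies."""
    total = sum(
        0.5 * (kl_div(p, q) + kl_div(q, p))
        for p, q in zip(posteriors0, posteriors1)
    )
    return total / len(posteriors0)


# Identical posteriors incur no penalty; diverging ones are penalized.
assert consistency_loss([[0.7, 0.3]], [[0.7, 0.3]]) == 0.0
assert consistency_loss([[0.9, 0.1]], [[0.1, 0.9]]) > 0.0
```
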

@shylockasr

> > ...noise-augmentation scheme converges; with the tripled-data...
>
> This is mainly meant to be used together with CR-CTC.

I see what you mean, thanks!
