
Conversation

@csukuangfj
Collaborator

The K2SpeechRecognitionDataset is changed so that

  • For each audio, it first creates two copies, then augments each copy with a different musan audio at the feature level
  • In the end, it returns 3 versions of each audio:
    • The original one, without any augmentation
    • Two copies, each with a different musan augmentation

To make the code review easier, the changes are posted below.

diff --git a/egs/librispeech/ASR/zipformer/speech_recognition.py b/egs/librispeech/ASR/zipformer/speech_recognition.py
index 4a3520b3..828602fc 100644
--- a/egs/librispeech/ASR/zipformer/speech_recognition.py
+++ b/egs/librispeech/ASR/zipformer/speech_recognition.py
@@ -103,13 +103,15 @@ class K2SpeechRecognitionDataset(torch.utils.data.Dataset):
         # Sort the cuts by duration so that the first one determines the batch time dimensions.
         cuts = cuts.sort_by_duration(ascending=False)

-        # Optional CutSet transforms - e.g. padding, or speed perturbation that adjusts
-        # the supervision boundaries.
-        for tnfm in self.cut_transforms:
-            cuts = tnfm(cuts)
+        if self.cut_transforms:
+            orig_cuts = cuts

-        # Sort the cuts again after transforms
-        cuts = cuts.sort_by_duration(ascending=False)
+            cuts = cuts.repeat(times=2)
+
+            for tnfm in self.cut_transforms:
+                cuts = tnfm(cuts)
+
+            cuts = orig_cuts + cuts

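The resulting batch layout can be sketched in plain Python. This is an illustrative stand-in (the `batch_layout` helper and the cut ids are made up), but the ordering and the `_repeat0`/`_repeat1` id suffixes match what the usage example further down asserts:

```python
# Hypothetical sketch only: it mimics "orig_cuts + cuts" from the
# diff above -- originals first, then the two copies produced by
# cuts.repeat(times=2), whose ids carry _repeat0/_repeat1 suffixes.
def batch_layout(orig_ids):
    copies = [f"{cut_id}_repeat{k}" for k in range(2) for cut_id in orig_ids]
    return list(orig_ids) + copies


ids = batch_layout(["cut-a", "cut-b"])
n = len(ids) // 3
assert ids[:n] == ["cut-a", "cut-b"]                         # un-augmented
assert ids[n : 2 * n] == ["cut-a_repeat0", "cut-b_repeat0"]  # 1st musan copy
assert ids[2 * n :] == ["cut-a_repeat1", "cut-b_repeat1"]    # 2nd musan copy
```
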
Usage

#!/usr/bin/env python3

from lhotse import load_manifest, load_manifest_lazy
from lhotse.dataset import CutMix, SimpleCutSampler
from lhotse.dataset.input_strategies import PrecomputedFeatures
from torch.utils.data import DataLoader

from zipformer.speech_recognition import K2SpeechRecognitionDataset


def main():
    cuts_musan = load_manifest("./data/fbank/musan_cuts.jsonl.gz")
    test_clean = load_manifest_lazy("./data/fbank/librispeech_cuts_test-clean.jsonl.gz")

    transforms = [
        CutMix(
            cuts=cuts_musan,
            p=1.0,
            snr=(10, 20),
            preserve_id=True,
            pad_to_longest=False,
        )
    ]

    train = K2SpeechRecognitionDataset(
        input_strategy=PrecomputedFeatures(),
        cut_transforms=transforms,
        input_transforms=[],
        return_cuts=True,
    )

    train_sampler = SimpleCutSampler(
        test_clean,
        max_duration=50,
        shuffle=True,
    )

    train_dl = DataLoader(
        train,
        sampler=train_sampler,
        batch_size=None,
        num_workers=1,
        persistent_workers=False,
    )

    for batch in train_dl:
        features = batch["inputs"]

        cuts = batch["supervisions"]["cut"]

        # The batch has 3 equal blocks: originals first, then the two
        # musan-augmented copies (see the _repeat0/_repeat1 id checks below).
        num_orig_audio = features.shape[0] // 3
        orig_cuts = cuts[:num_orig_audio]
        copy0_cuts = cuts[num_orig_audio : 2 * num_orig_audio]
        copy1_cuts = cuts[2 * num_orig_audio : 3 * num_orig_audio]

        orig_feats = features[:num_orig_audio]
        copy0_feats = features[num_orig_audio : 2 * num_orig_audio]
        copy1_feats = features[2 * num_orig_audio : 3 * num_orig_audio]

        for i in range(num_orig_audio):
            assert f"{orig_cuts[i].id}_repeat0" == copy0_cuts[i].id, (
                orig_cuts[i].id,
                copy0_cuts[i].id,
            )
            assert f"{orig_cuts[i].id}_repeat1" == copy1_cuts[i].id, (
                orig_cuts[i].id,
                copy1_cuts[i].id,
            )

            # augmentation with musan does not change the feature frames
            assert (
                orig_cuts[i].num_frames
                == copy0_cuts[i].num_frames
                == copy1_cuts[i].num_frames
            )


if __name__ == "__main__":
    main()

Note that the following transform with musan won't change the number of feature frames of a cut.

    transforms = [
        CutMix(
            cuts=cuts_musan,
            p=1.0,
            snr=(10, 20),
            preserve_id=True,
            pad_to_longest=False,
        )
    ]

In addition, the modified dataset still returns the original audio without any augmentation.
@csukuangfj changed the title to "Support using different musan augmentations for the same audio." Jul 1, 2025
@danpovey
Collaborator

danpovey commented Jul 1, 2025

Cool... so this would return a batch whose size is a multiple of 3, right? And the first block would be the un-augmented features?

@csukuangfj
Collaborator Author

> Cool... so this would return a batch whose size is a multiple of 3, right? And the first block would be the un-augmented features?

Yes, you are right.

@shylockasr

The idea is great. We have been testing on internal data here: training converges with the original musan noise-augmentation scheme, but does not converge with the tripled-data noise scheme.

Also, the training batch size has to be reduced.

@csukuangfj
Collaborator Author

> ...noise-augmentation scheme converges; with the tripled-data...

This is mainly meant to be used together with CR-CTC.
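For readers unfamiliar with CR-CTC: the point of having two differently-augmented copies of the same utterance is a consistency-regularization term between their outputs. The sketch below is illustrative only (the function names and the symmetric-KL formulation are assumptions for exposition, not code from this PR; real CR-CTC operates on CTC output log-probs in PyTorch):

```python
import math

# Hypothetical sketch of a consistency-regularization term of the kind
# CR-CTC applies between the two musan-augmented copies of the same
# utterance: a symmetric KL divergence between per-frame posteriors.
def kl_div(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def consistency_loss(posteriors0, posteriors1):
    """Average symmetric KL over frames of the two augmented copies."""
    total = sum(
        0.5 * (kl_div(p, q) + kl_div(q, p))
        for p, q in zip(posteriors0, posteriors1)
    )
    return total / len(posteriors0)


# Identical posteriors incur no penalty; diverging ones are penalized.
assert consistency_loss([[0.7, 0.3]], [[0.7, 0.3]]) == 0.0
assert consistency_loss([[0.9, 0.1]], [[0.1, 0.9]]) > 0.0
```
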

@shylockasr

> > ...noise-augmentation scheme converges; with the tripled-data...
>
> This is mainly meant to be used together with CR-CTC.

I see what you mean, thanks!
