
[wip] Decode WAV with Python backend #1222

Draft
Dan-Flores wants to merge 8 commits into meta-pytorch:main from Dan-Flores:wav


Contributor

@Dan-Flores Dan-Flores commented Feb 4, 2026

To review:

  • See the top-level changes to _audio_decoder.py.
  • Read the definition of class WavDecoder, and the associated try_create.
  • Skim through _parse_wav_chunks; it looks complicated but only handles the WAV file header.
  • Skim through _samples_from_bytes. This conversion is considerably easier to read than the C++ implementation.


pytorch-bot bot commented Feb 4, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/meta-pytorch/torchcodec/1222

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 28 New Failures

As of commit 16472d5 with merge base 377c638 (image):

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-cla bot added the CLA Signed label Feb 4, 2026
# Try fast WAV path
self._wav_decoder = WavDecoder.validate_and_init(
    source, sample_rate, num_channels, stream_index
)
Contributor Author

self._wav_decoder is only populated if WAV decoding was successful.

Contributor

Do we have a sense of how expensive this check is? It'll be run for every single input file so it's important that it's quick.
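
One way to keep this check cheap, sketched under the assumption that the fast path only needs to recognize the container first: inspect the 12-byte RIFF/WAVE signature before doing any other work. The helper name `looks_like_wav` is hypothetical and does not appear in the PR.

```python
def looks_like_wav(header: bytes) -> bool:
    """Return True if `header` starts with a RIFF/WAVE signature.

    Hypothetical helper, not taken from the PR. Only the first 12 bytes
    are inspected: b"RIFF", a 4-byte size field, then b"WAVE".
    """
    return (
        len(header) >= 12
        and header[:4] == b"RIFF"
        and header[8:12] == b"WAVE"
    )
```

A check like this is constant-time regardless of file size, so non-WAV inputs could be rejected after reading a handful of bytes.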

)
if self._wav_decoder is not None:
    self.stream_index = self._wav_decoder.stream_index
    self.metadata = self._wav_decoder.metadata
Contributor Author

Dan-Flores commented Feb 5, 2026

Because WavDecoder is a Python class, we can populate AudioStreamMetadata directly.

Comment on lines 272 to 284
elif isinstance(source, (str, Path)):
    path = Path(source)
    if path.suffix.lower() == ".wav":
        try:
            with open(path, "rb") as f:
                source_bytes = f.read()
        except OSError:
            return None
elif isinstance(source, (io.RawIOBase, io.BufferedReader)) or (
    hasattr(source, "read") and hasattr(source, "seek")
):
    source_bytes = source.read()
    # Will reset seek position below if we can't use fast path
Contributor

So this works, but it also reads / loads / downloads the entire content of a file-like object into memory. We should check how FFmpeg behaves when decoding a very long WAV file, for example. Does it need to load the entire file at once? If not, we might be losing that functionality here, which is something to consider.

Same for the C++ alternative, which I haven't checked yet.

Contributor Author

It does not need to. I've updated both implementations to read only the header at first. Thanks for the suggestion!
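
A minimal sketch of header-only reading for file-like sources, assuming the source supports `tell` and `seek`. The name `peek_header` is hypothetical, and the updated code in the PR may be structured differently.

```python
import io


def peek_header(stream, size: int = 12) -> bytes:
    """Read up to `size` bytes from a seekable stream, then restore its position.

    Hypothetical helper, not taken from the PR. Restoring the position lets a
    fallback decoder start from where the caller left the stream.
    """
    start = stream.tell()
    try:
        return stream.read(size)
    finally:
        stream.seek(start)


# Usage: only 12 bytes are read, and the stream position is left unchanged.
buf = io.BytesIO(b"RIFF\x00\x00\x00\x00WAVEfmt " + b"\x00" * 1024)
header = peek_header(buf)
```

Twelve bytes cover only the RIFF/WAVE signature; the fmt and data chunk headers lie further in, so a real implementation would need to read somewhat more before it can build metadata.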

stream_index: int | None = None,
):
"""
Create a WavDecoder for the given source.
Contributor Author

This function handles all input source types and calls _parse_wav_chunks with a reader function suited to the detected input type.

  • If it succeeds, this function initializes a WavDecoder object.
  • If it fails, it raises an error. This init is designed to be called by try_create below, which catches the error so we can fall back to the FFmpeg backend.
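
The raise-then-catch arrangement described above can be sketched roughly as follows. `WavDecoderSketch` is a simplified stand-in, not the PR's class: it accepts only bytes, omits the sample_rate, num_channels and stream_index parameters, and the exception types caught here are an assumption.

```python
class WavDecoderSketch:
    """Simplified stand-in for the PR's WavDecoder (bytes input only)."""

    def __init__(self, source: bytes):
        # Raise on anything that is not a RIFF/WAVE container.
        if len(source) < 12 or source[:4] != b"RIFF" or source[8:12] != b"WAVE":
            raise ValueError("not a RIFF/WAVE source")
        self.source = source

    @classmethod
    def try_create(cls, source: bytes):
        """Return a decoder, or None so the caller can fall back to FFmpeg."""
        try:
            return cls(source)
        except (ValueError, OSError):  # assumed exception types
            return None
```

With this shape the caller only needs an `is not None` check to decide between the fast path and the FFmpeg backend.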

bytes_per_sample = metadata.bits_per_sample // 8
num_samples = len(audio_bytes) // bytes_per_sample // metadata.num_channels

# Convert to tensor based on format
Contributor Author

Here various WAV formats are normalized to [-1, 1].
Some formats use the full range of a dtype, so torch.iinfo is used to get the maximum value of that type.
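
To illustrate the normalization, here is a plain-Python sketch for integer PCM only. It is not the PR's code, which works on tensors and uses torch.iinfo; the name `pcm_to_float` is hypothetical, and dividing by 2^(bits-1) is one common convention that the PR may not follow exactly.

```python
def pcm_to_float(audio_bytes: bytes, bits_per_sample: int) -> list[float]:
    """Normalize little-endian integer PCM samples to the range [-1, 1).

    Hypothetical helper. Handles 8-bit (unsigned) and 16/24/32-bit (signed)
    integer PCM; IEEE float WAV data is not covered.
    """
    if bits_per_sample == 8:
        # 8-bit WAV PCM is unsigned and centred on 128.
        return [(b - 128) / 128.0 for b in audio_bytes]
    bytes_per_sample = bits_per_sample // 8
    scale = float(1 << (bits_per_sample - 1))  # e.g. 32768.0 for 16-bit
    # The range stops early enough to drop any trailing partial sample.
    return [
        int.from_bytes(audio_bytes[i : i + bytes_per_sample], "little", signed=True)
        / scale
        for i in range(0, len(audio_bytes) - bytes_per_sample + 1, bytes_per_sample)
    ]
```

For interleaved multi-channel data the result is still interleaved; splitting it into per-channel rows would be a separate reshape step.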

audio_format = 0
num_channels = 0
sample_rate = 0
bits_per_sample = 0
Contributor Author

Once we find the RIFF and WAVE signatures, we can look for the fmt chunk which contains metadata, and the data chunk which contains the actual samples. We find the data chunk now to store bytes_per_sample and num_samples, which will be needed later for decoding.
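
The chunk walk described above might look roughly like this. It is a sketch over an in-memory header buffer, not the PR's `_parse_wav_chunks`; `WavInfo` and `parse_wav_header` are hypothetical names.

```python
import struct
from dataclasses import dataclass


@dataclass
class WavInfo:
    audio_format: int
    num_channels: int
    sample_rate: int
    bits_per_sample: int
    data_offset: int  # byte offset of the first sample
    data_size: int  # size of the data chunk body in bytes

    @property
    def num_samples(self) -> int:
        # Samples per channel, mirroring the arithmetic in the diff.
        bytes_per_sample = self.bits_per_sample // 8
        return self.data_size // bytes_per_sample // self.num_channels


def parse_wav_header(buf: bytes) -> WavInfo:
    if len(buf) < 12 or buf[:4] != b"RIFF" or buf[8:12] != b"WAVE":
        raise ValueError("missing RIFF/WAVE signature")
    fmt = None
    pos = 12
    while pos + 8 <= len(buf):
        chunk_id, chunk_size = struct.unpack_from("<4sI", buf, pos)
        body = pos + 8
        if chunk_id == b"fmt ":
            if chunk_size < 16 or body + 16 > len(buf):
                raise ValueError("truncated fmt chunk")
            # format tag, channels, sample rate, byte rate, block align, bits
            fmt = struct.unpack_from("<HHIIHH", buf, body)
        elif chunk_id == b"data":
            if fmt is None:
                raise ValueError("data chunk before fmt chunk")
            audio_format, num_channels, sample_rate, _, _, bits = fmt
            if num_channels == 0 or bits < 8:
                raise ValueError("invalid fmt values")
            return WavInfo(
                audio_format, num_channels, sample_rate, bits, body, chunk_size
            )
        # Chunks are word-aligned: an odd-sized body is followed by a pad byte.
        pos = body + chunk_size + (chunk_size & 1)
    raise ValueError("no data chunk found")


# Usage: a 16-bit stereo header; the sample bytes themselves are not required.
fmt_chunk = struct.pack("<4sIHHIIHH", b"fmt ", 16, 1, 2, 44100, 176400, 4, 16)
data_header = struct.pack("<4sI", b"data", 8)
riff_size = struct.pack("<I", 4 + len(fmt_chunk) + len(data_header) + 8)
info = parse_wav_header(b"RIFF" + riff_size + b"WAVE" + fmt_chunk + data_header)
```

Because the data chunk's size comes from its 8-byte header, the sample count can be computed without reading the samples, which fits the header-only approach discussed earlier in the thread.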

