HowTo100M Data Release

We share additional features/data of the HowTo100M dataset for future research. What is HowTo100M?

WhisperX

It is well known that the ASR output from YouTube is noisy. It suffers from synchronization issues (the timestamps are not precisely aligned with the speech) and translation issues (the spoken language is not recognized, and the speech is transcribed as English).

We process all HowTo100M audio files with the WhisperX package, which provides word-level timestamps and highly accurate language recognition. WhisperX is built on OpenAI's Whisper, with an additional phoneme alignment module that makes the ASR timestamps more accurate.

We used whisper-large-v2, the best Whisper version provided by OpenAI. For non-English audio, we provide the ASR and word-level timestamps in the original language (if supported by WhisperX), as well as an English translation, which has sentence-level timestamps only.

Downloading script: here (64 tar.gz files of JSON, 25GB in total)
Extraction script: here (about 130GB after extraction)
Language detection output: download link
For future reference: our language detection script and whisperx script
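
To illustrate how the word-level timestamps might be consumed, here is a minimal sketch. It assumes the extracted JSON files follow the usual WhisperX output layout (a top-level `segments` list, each segment carrying a `words` list with `word`, `start`, and `end` keys). That schema is an assumption rather than something documented here, and `iter_words` and the sample transcript are hypothetical, so check a real file before relying on it.

```python
import json


def iter_words(transcript):
    """Yield (word, start, end) tuples from a WhisperX-style transcript dict.

    The schema (segments -> words -> word/start/end) is assumed, not confirmed.
    """
    for segment in transcript.get("segments", []):
        for w in segment.get("words", []):
            # Some tokens may carry no timestamps if they could not be aligned;
            # skip them instead of guessing a time.
            if "start" in w and "end" in w:
                yield w["word"], float(w["start"]), float(w["end"])


# Hypothetical transcript that mimics the assumed schema.
sample = json.loads("""
{"language": "en",
 "segments": [
   {"start": 0.5, "end": 2.0, "text": "crack two eggs",
    "words": [
      {"word": "crack", "start": 0.5, "end": 0.9},
      {"word": "two"},
      {"word": "eggs", "start": 1.4, "end": 2.0}
    ]}
 ]}
""")

words = list(iter_words(sample))
for word, start, end in words:
    print(f"{start:6.2f} {end:6.2f} {word}")
```

For a real file, replace `sample` with the result of `json.load` on one of the extracted JSON files.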

Visual Features

We provide more recent and stronger visual features for HowTo100M. Following Miech et al., we provide features at one vector per second. For the original S3D features, please refer to Miech et al.

Currently we provide the following visual features:

InternVideo-MM-L14
- Downloading script: here (64 tar files, 1.4TB in total)
- (Optional) sha256 checksum: here (10KB)
- Our feature extraction script
CLIP-ViT-L-14
- Downloading script: here (64 tar files, 700GB in total)
- Our feature extraction script
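
Given the size of these archives, it may be worth verifying each downloaded tar file against its sha256 checksum before extracting it. The sketch below uses only Python's standard `hashlib`. The helper names `sha256_of_file` and `verify` are hypothetical, and the layout of the provided checksum file is not described here, so the expected digest is passed in as a plain hex string.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex sha256 digest of a file, read in 1 MiB chunks
    so that multi-GB tar files never need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected_hex):
    """Compare a file's digest with an expected hex string (case-insensitive)."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

A call such as `verify("some_archive.tar", expected)` returns `True` only when the file matches, where both the file name and `expected` are placeholders for a downloaded archive and its published digest.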

Feature quality benchmarked on HTM-Align

Without any (learnable) joint visual-language model, we measure the quality of the backbone visual-language features on HTM-Align, in a setting similar to retrieval.

Model	Setting	Recall
MILNCE	global	0.287
MILNCE	overlap-seq	0.342
CLIP ViT/B-32	global	0.175
CLIP ViT/B-32	overlap-seq	0.234
CLIP ViT/B-16	global	0.221
CLIP ViT/B-16	overlap-seq	0.278
CLIP ViT/L-14	global	0.256
CLIP ViT/L-14	overlap-seq	0.309
InternVideo-MM-L14	global	0.406
InternVideo-MM-L14	overlap-seq	0.437
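
As a rough illustration of a retrieval-style recall on per-second features, the sketch below scores each sentence against every one-second visual vector with cosine similarity and counts a hit when the best-matching second falls inside the annotated segment. This is only one plausible reading of the metric: the function names, the inclusive segment boundaries, and the toy vectors are assumptions, and the sketch does not reproduce the "global" or "overlap-seq" settings from the table.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def recall_at_1(text_feats, video_feats, gt_segments):
    """Fraction of sentences whose most similar second lies in the ground truth.

    text_feats:  one vector per sentence
    video_feats: one vector per second of video
    gt_segments: one (start_sec, end_sec) pair per sentence, inclusive (assumed)
    """
    hits = 0
    for sentence, (start, end) in zip(text_feats, gt_segments):
        sims = [cosine(sentence, frame) for frame in video_feats]
        best_sec = max(range(len(sims)), key=sims.__getitem__)
        if start <= best_sec <= end:
            hits += 1
    return hits / len(text_feats) if text_feats else 0.0


# Toy data: 4 seconds of 2-D "visual" features and 2 sentences.
video_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
text_feats = [[1.0, 0.0], [0.0, 1.0]]
gt_segments = [(0, 1), (0, 1)]  # the second sentence is deliberately mislabeled

score = recall_at_1(text_feats, video_feats, gt_segments)
print(score)  # 0.5: the first sentence is a hit, the second is a miss
```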

Reference

If you find these data helpful, please consider citing us:

@InProceedings{han2022align,
  title={Temporal Alignment Networks for Long-term Video},
  author={Tengda Han and Weidi Xie and Andrew Zisserman},
  booktitle={CVPR},
  year={2022}}
