The pre-processed dataset, with train-test splits for all languages, is available on Hugging Face, as described in the paper.
To set up the environment and install necessary dependencies, run:
conda env create -f environment.yml
To generate transcripts using various ASR systems on the TalkBank dataset, use the following scripts:
python src/run_canary_prediction_segment.py # Canary 1b
python src/run_whisper_prediction_segment.py # Whisper large-v3
python src/run_wav2vec2_prediction_segment.py # Wav2vec2
python src/run_wav2vec2multi_prediction_segment.py # Wav2vec2 multilingual
python src/run_canary_prediction_switch.py # Canary 1b
python src/run_whisper_prediction_switch.py # Whisper large-v3
After generating the transcripts, consolidate them into a CSV file for further analysis:
python src/collect_talkbank_segment.py # Produces talkbank_df_segments.csv
python src/collect_talkbank_switch.py # Produces talkbank_df_switch.csv
To evaluate ASR systems on the Librispeech, Fleurs, and CommonVoice datasets, place each dataset in the appropriate directory structure with the following format:
- Add datasets in the commonvoice, fleurs, and libri_speech directories
- Add CSV files (librispeech_dataset.csv, fleurs_dataset.csv, commonvoice_dataset.csv) in each directory, formatted as follows:
Directory Structure Example
fleurs/
├── en/
├── fr/
├── de/
├── ...
└── fleurs_dataset.csv
commonvoice/
└── ...
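Before running the prediction scripts, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the repository: the function name check_layout is hypothetical, and it assumes the three dataset directories sit directly under the directory you pass in (by default, the current working directory). It only checks that each directory and its CSV file exist; it does not validate the CSV contents.

```python
from pathlib import Path

# Directory and CSV names are taken from the list above. Note that the
# libri_speech directory pairs with librispeech_dataset.csv.
EXPECTED = {
    "libri_speech": "librispeech_dataset.csv",
    "fleurs": "fleurs_dataset.csv",
    "commonvoice": "commonvoice_dataset.csv",
}


def check_layout(root="."):
    """Return a list of problems found; an empty list means nothing is missing."""
    problems = []
    for dirname, csv_name in EXPECTED.items():
        dataset_dir = Path(root) / dirname
        if not dataset_dir.is_dir():
            problems.append(f"missing directory: {dataset_dir}")
            continue  # no point looking for the CSV in a missing directory
        csv_path = dataset_dir / csv_name
        if not csv_path.is_file():
            problems.append(f"missing CSV: {csv_path}")
    return problems


if __name__ == "__main__":
    issues = check_layout()
    for issue in issues:
        print(issue)
    if not issues:
        print("layout looks complete")
```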
Generate Transcripts for Librispeech
python src/run_libri_speech_canary_prediction.py # Canary 1b
python src/run_libri_speech_wav2vec2_prediction.py # Wav2vec2
python src/run_libri_speech_wav2vec2multi_prediction.py # Wav2vec2 multilingual
python src/run_libri_speech_whisper_prediction.py # Whisper large
Collect the transcripts generated for Librispeech into one CSV file:
python src/collect_libri_speech.py
Transcripts for Fleurs and CommonVoice can be generated in the same way.
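The Librispeech steps above can also be chained from a single driver. This is a sketch only, not a script shipped with the repository: the names build_commands and run_pipeline and the --dry-run flag are hypothetical, and it assumes the scripts are launched from the repository root and need no command-line arguments, as in the commands shown above.

```python
import subprocess
import sys

# Script paths copied from the commands listed above.
PREDICTION_SCRIPTS = [
    "src/run_libri_speech_canary_prediction.py",
    "src/run_libri_speech_wav2vec2_prediction.py",
    "src/run_libri_speech_wav2vec2multi_prediction.py",
    "src/run_libri_speech_whisper_prediction.py",
]
COLLECT_SCRIPT = "src/collect_libri_speech.py"


def build_commands(python=sys.executable):
    """Return one command per script: all predictions first, then the collect step."""
    return [[python, script] for script in PREDICTION_SCRIPTS + [COLLECT_SCRIPT]]


def run_pipeline(dry_run=False):
    """Run each command in order, stopping at the first script that fails."""
    for command in build_commands():
        print("running:", " ".join(command))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical flag: pass --dry-run to print the commands without executing them.
    run_pipeline(dry_run="--dry-run" in sys.argv)
```

Running the collect step last means it only starts once every prediction script has exited successfully.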
All results and analysis are available in the ResultAnalysis.ipynb file.
The transcript processing code, including speech disfluency normalization and CHAT template parsing, is available in the transcript_processing folder.
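To give a rough idea of what disfluency normalization on CHAT-style transcripts involves, here is an illustrative sketch. It is not the code in transcript_processing, and the function name normalize_chat_utterance is hypothetical. It assumes a handful of common CHAT conventions (filled pauses written as &-um, bracketed codes such as [/] and [//], pause markers such as (.), and xxx for unintelligible speech) and is far from a complete CHAT parser.

```python
import re

# Assumed CHAT conventions handled here (a small, illustrative subset):
FILLED_PAUSE = re.compile(r"&-\w+")         # e.g. &-um, &-uh
BRACKET_CODE = re.compile(r"\[[^\]]*\]")    # e.g. [/], [//]
PAUSE_MARKER = re.compile(r"\(\.+\)")       # e.g. (.), (..)
UNINTELLIGIBLE = re.compile(r"\bxxx\b")     # unintelligible speech


def normalize_chat_utterance(text: str) -> str:
    """Strip the markers listed above, then lowercase and drop punctuation.

    Repeated words are kept; only the annotation codes themselves are removed.
    """
    for pattern in (FILLED_PAUSE, BRACKET_CODE, PAUSE_MARKER, UNINTELLIGIBLE):
        text = pattern.sub(" ", text)
    text = re.sub(r"[^\w\s']", " ", text)  # remaining punctuation becomes a space
    return " ".join(text.lower().split())  # collapse runs of whitespace


if __name__ == "__main__":
    print(normalize_chat_utterance("&-um I [/] I went (.) to the store ."))
```

Whether fillers and repetitions should be removed or kept depends on how the reference transcripts are scored, so treat the choices here as placeholders.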
- Pre-processing code: Coming soon! We will upload scripts for cleaning, formatting, and preparing the TalkBank dataset subset itself. For now, refer to the Hugging Face link to download the already processed dataset.
Stay tuned for updates!