Diabolocom-Research/ConversationalDataset

ConversationalDataset: Benchmarking Conversations

This repository will host benchmarks and datasets related to conversational AI tasks.

🗂 Processed TalkBank Dataset for ASR Benchmarking

The pre-processed dataset, with train-test splits for all languages, is available on Hugging Face and is described in this paper.

🚀 Setting Up the Environment

To set up the environment and install necessary dependencies, run:

conda env create -f environment.yml

📊 Benchmarking over TalkBank

To generate transcripts using various ASR systems on the TalkBank dataset, use the following scripts:

Segment

python src/run_canary_prediction_segment.py # Canary 1b
python src/run_whisper_prediction_segment.py # Whisper large-v3
python src/run_wav2vec2_prediction_segment.py # Wav2vec2
python src/run_wav2vec2multi_prediction_segment.py # Wav2vec2 multilingual

Switch

python src/run_canary_prediction_switch.py # Canary 1b
python src/run_whisper_prediction_switch.py # Whisper large-v3

After generating the transcripts, consolidate them into a CSV file for further analysis:

python src/collect_talkbank_segment.py # Produces talkbank_df_segments.csv
python src/collect_talkbank_switch.py # Produces talkbank_df_switch.csv
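
The segment-level steps above can also be chained from a single driver. The following is a minimal sketch, not part of the repository: the script names are taken from the commands listed above, while the helper names `build_commands` and `run_pipeline`, the execution order, and the `dry_run` flag are assumptions made for illustration.

```python
import subprocess
import sys
from pathlib import Path

# Script names come from the commands listed above; running them in this
# order from one driver is an illustrative assumption.
SEGMENT_SCRIPTS = [
    "run_canary_prediction_segment.py",
    "run_whisper_prediction_segment.py",
    "run_wav2vec2_prediction_segment.py",
    "run_wav2vec2multi_prediction_segment.py",
]
COLLECT_SEGMENT = "collect_talkbank_segment.py"


def build_commands(src_dir="src", python=sys.executable):
    """Return the prediction commands followed by the collection command."""
    scripts = SEGMENT_SCRIPTS + [COLLECT_SEGMENT]
    return [[python, str(Path(src_dir) / name)] for name in scripts]


def run_pipeline(src_dir="src", dry_run=True):
    """Print each command, and execute it unless dry_run is set."""
    for cmd in build_commands(src_dir):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops the pipeline at the first failing script.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical default: only show what would be run.
    run_pipeline(dry_run=True)
```

Calling `run_pipeline(dry_run=False)` from the repository root would execute the scripts one after another; the switch-level scripts could be handled the same way by swapping in their file names.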

📊 Benchmarking with Librispeech, Fleurs, and CommonVoice

To evaluate ASR systems on the Librispeech, Fleurs, and CommonVoice datasets, arrange the data as follows:

  • Place each dataset in its own directory: commonvoice, fleurs, or libri_speech.
  • Add the matching CSV file (commonvoice_dataset.csv, fleurs_dataset.csv, or librispeech_dataset.csv) to each directory, as in the example below.

Directory Structure Example

fleurs/
├── en/
├── fr/
├── de/
├── ...
└── fleurs_dataset.csv

commonvoice/
├── ...
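
Before running the prediction scripts, it can help to confirm that the layout matches the description above. The sketch below is an illustration only: the directory and CSV names come from the text, but pairing each directory with one CSV file, and the helper name `missing_items`, are assumptions. It does not inspect the CSV contents.

```python
from pathlib import Path

# Directory and CSV names come from the text above; pairing each directory
# with exactly one CSV file is an assumption about the intended layout.
EXPECTED = {
    "libri_speech": "librispeech_dataset.csv",
    "fleurs": "fleurs_dataset.csv",
    "commonvoice": "commonvoice_dataset.csv",
}


def missing_items(root="."):
    """Return a list of messages for directories or CSV files not found."""
    root = Path(root)
    problems = []
    for directory, csv_name in EXPECTED.items():
        dataset_dir = root / directory
        if not dataset_dir.is_dir():
            problems.append(f"missing directory: {directory}/")
        elif not (dataset_dir / csv_name).is_file():
            problems.append(f"missing file: {directory}/{csv_name}")
    return problems


if __name__ == "__main__":
    for message in missing_items("."):
        print(message)
```

An empty list from `missing_items()` means all three directories and their CSV files were found under the given root.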

Generate Transcripts

Generate transcripts for Librispeech:

python src/run_libri_speech_canary_prediction.py # Canary 1b
python src/run_libri_speech_wav2vec2_prediction.py # Wav2vec2
python src/run_libri_speech_wav2vec2multi_prediction.py # Wav2vec2 multilingual
python src/run_libri_speech_whisper_prediction.py # Whisper large

Collect the transcripts generated for Librispeech into a single CSV file:

python src/collect_libri_speech.py

Transcripts for Fleurs and CommonVoice can be generated and collected in the same way.

πŸ“ Result And Analysis

All results and analysis are available in the ResultAnalysis.ipynb file.

🧹 Transcript Processing

The transcript processing code, including speech disfluency normalization and CHAT template parsing, is available in the transcript_processing folder.

🚧 Work in Progress

  • Pre-processing code: Coming soon! We will upload the scripts for cleaning, formatting, and preparing the TalkBank dataset subset itself. For now, refer to the Hugging Face link to download the already processed dataset.

Stay tuned for updates!
