ConversationalDataset: Benchmarking Conversations

This repository will host benchmarks and datasets related to conversational AI tasks.

🗂 Processed TalkBank Dataset for ASR benchmarking

The pre-processed dataset, with train-test splits for all languages, is available on Hugging Face, as described in this paper.

🚀 Setting Up the Environment

To set up the environment and install necessary dependencies, run:

conda env create -f environment.yml

📊 Benchmarking over TalkBank

To generate transcripts using various ASR systems on the TalkBank dataset, use the following scripts:

Segment

python src/run_canary_prediction_segment.py # Canary 1b
python src/run_whisper_prediction_segment.py # Whisper large-v3
python src/run_wav2vec2_prediction_segment.py # Wav2vec2
python src/run_wav2vec2multi_prediction_segment.py # Wav2vec2 multilingual

Switch

python src/run_canary_prediction_switch.py # Canary 1b
python src/run_whisper_prediction_switch.py # Whisper large-v3

After generating the transcripts, consolidate them into a CSV file for further analysis:

python src/collect_talkbank_segment.py # Produces talkbank_df_segments.csv
python src/collect_talkbank_switch.py # Produces talkbank_df_switch.csv
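
Once the transcripts are in a CSV, a typical next step is scoring them with word error rate (WER). The sketch below is illustrative only: the column names `reference` and `hypothesis` are assumptions (the actual columns of the consolidated CSV files are not documented here), and the repository's own analysis may compute WER differently.

```python
import csv
import io


def word_errors(reference, hypothesis):
    """Return (word-level edit distance, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    # At the start of iteration i, prev[j] is the distance
    # between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1], len(ref)


def corpus_wer(rows, ref_col="reference", hyp_col="hypothesis"):
    """Total errors divided by total reference words over all rows."""
    errors = words = 0
    for row in rows:
        e, n = word_errors(row[ref_col], row[hyp_col])
        errors += e
        words += n
    return errors / words if words else 0.0


# Tiny in-memory stand-in for a consolidated CSV (hypothetical columns).
sample = io.StringIO(
    "reference,hypothesis\n"
    "the cat sat,the cat sat\n"
    "hello there world,hello world\n"
)
print(corpus_wer(csv.DictReader(sample)))
```

Summing errors and reference words across rows before dividing weights long utterances more heavily than averaging per-row WER would; which of the two is appropriate depends on the analysis.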

📊 Benchmarking with Librispeech, Fleurs, and CommonVoice

To evaluate ASR systems on the Librispeech, Fleurs, and CommonVoice datasets, arrange each dataset as follows:

Add the datasets to the commonvoice, fleurs, and libri_speech directories.
Add the corresponding CSV file (librispeech_dataset.csv, fleurs_dataset.csv, or commonvoice_dataset.csv) to each directory, as in the example below.

Directory Structure Example

fleurs/
├── en/
├── fr/
├── de/
├── ...
└── fleurs_dataset.csv

commonvoice/
├── ...
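
The expected layout can be checked before running any model. This is a hedged sketch rather than repository code: the function `missing_csvs` is hypothetical, the pairing of libri_speech with librispeech_dataset.csv is an assumption based on the names listed above, and the check only verifies that each dataset directory contains its CSV file.

```python
from pathlib import Path

# Directory name -> CSV file name. The fleurs pairing appears in the
# example layout; the other two pairings are assumed by analogy.
EXPECTED = {
    "libri_speech": "librispeech_dataset.csv",
    "fleurs": "fleurs_dataset.csv",
    "commonvoice": "commonvoice_dataset.csv",
}


def missing_csvs(root="."):
    """Return the expected CSV paths that do not exist under root."""
    root = Path(root)
    return [root / directory / name
            for directory, name in EXPECTED.items()
            if not (root / directory / name).is_file()]


if __name__ == "__main__":
    for path in missing_csvs():
        print(f"missing: {path}")
```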

Generate transcription

Generate transcripts for Librispeech:

python src/run_libri_speech_canary_prediction.py  # Canary 1b
python src/run_libri_speech_wav2vec2_prediction.py # Wav2vec2
python src/run_libri_speech_wav2vec2multi_prediction.py # Wav2vec2 multilingual
python src/run_libri_speech_whisper_prediction.py # Whisper large

Collect the transcripts generated over Librispeech into one CSV file:

python src/collect_libri_speech.py

Transcripts for Fleurs and CommonVoice can be generated and collected in the same way.

📝 Results and Analysis

All results and analysis are available in the ResultAnalysis.ipynb file.

🧹 Transcript Processing

Transcript processing, including speech disfluency normalization and CHAT template parsing, is available in the transcript_processing folder.
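
As an illustration of what disfluency normalization can involve, here is a minimal sketch. It is not the code in transcript_processing: the filler list and the function `normalize_disfluencies` are hypothetical, and real CHAT transcripts carry additional markup that this does not handle.

```python
import re

# Hypothetical filler inventory; the repository may use a different one.
FILLERS = ("uh", "um", "er", "hmm")
_FILLER_RE = re.compile(r"\b(?:%s)\b[,.]?" % "|".join(FILLERS), re.IGNORECASE)
# A word followed by one or more immediate repetitions of itself.
_REPEAT_RE = re.compile(r"\b(\w+)(?:\s+\1\b)+", re.IGNORECASE)


def normalize_disfluencies(text):
    """Drop filler words, collapse immediate word repetitions, tidy spaces."""
    text = _FILLER_RE.sub(" ", text)
    text = _REPEAT_RE.sub(r"\1", text)
    return re.sub(r"\s+", " ", text).strip()


print(normalize_disfluencies("um I I think uh we should go"))
```

Collapsing repetitions this way also removes legitimate doubled words (for example "that that"), so a rule like this trades some fidelity for consistency between reference and hypothesis text.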

🚧 Work in Progress

Pre-processing code: Coming soon! We will upload scripts for cleaning, formatting, and preparing the TalkBank dataset subset. For now, refer to the Hugging Face link to download the already processed dataset.

Stay tuned for updates!

Diabolocom-Research/ConversationalDataset

Folders and files

Latest commit

History

Repository files navigation

ConversationalDataset: Benchmarking Conversations

This repository will host benchmarks and datasets related to conversational AI tasks.

🗂 Processed TalkBank Dataset for ASR benchmarking

🚀 Setting Up the Environment

📊 Benchmarking over TalkBank

Segment

Switch

📊 Benchmarking with Librispeech, Fleurs, and CommonVoice

Generate transcription

📝 Result And Analysis

🧹 Transcript Processing

🚧 Work in Progress

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages