The supplementary materials for the paper: Quantifying and Reducing Speaker Heterogeneity within the Common Voice Corpus for Phonetic Analysis (Interspeech 2025)

pacscilab/CV_clientID_cleaning

CV_clientID_cleaning

This repository contains the supplementary materials and links to additional processing code and data for the paper: Quantifying and Reducing Speaker Heterogeneity within the Common Voice Corpus for Phonetic Analysis (Interspeech 2025).

Quick link: download the client ID similarity scores here: Hugging Face

Automatic speaker verification

Automatic speaker verification was conducted for each client ID on the validated portion of each language-specific dataset from the Mozilla Common Voice Corpus. For each client ID within a language, we used the final recording that contained at least three tokenized words; if no recording met that threshold, we used the final recording regardless of length. Client IDs with only one recording are not included in this procedure or its output.
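The selection rule above can be sketched as follows. This is a minimal illustration, not the code used for the paper: the function names are hypothetical, whitespace splitting stands in for whatever tokenizer was actually used, and the recordings are assumed to be listed in chronological order.

```python
from typing import Optional


def count_words(sentence: str) -> int:
    # Assumption: whitespace splitting approximates the tokenization step.
    return len(sentence.split())


def pick_recording(recordings: list[tuple[str, str]], min_words: int = 3) -> Optional[str]:
    """Pick one recording for a client ID.

    recordings: (filename, sentence) pairs for a single client ID,
    assumed to be in chronological order.
    """
    if len(recordings) < 2:
        # Client IDs with only one recording are excluded from the procedure.
        return None
    # Walk backwards to find the final recording with enough words.
    for filename, sentence in reversed(recordings):
        if count_words(sentence) >= min_words:
            return filename
    # No recording is long enough: fall back to the final recording.
    return recordings[-1][0]
```

For the code that was actually run, see the ASV repository linked further down this page.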

The Interspeech 2025 paper used data from 76 languages. The list of those languages and their version numbers is in languages_interspeech2025.txt, and the full data output for that paper is in similarity_scores_interspeech2025.txt (located in the Hugging Face repository described below). In the processing for Interspeech 2025, a very small number of invalidated recordings were inadvertently included for Georgian (ka) and Turkish (tr). Only a very small percentage of the data was affected, and we plan to update these languages in the future (see the Hugging Face repository).

Hugging Face storage

Since Interspeech 2025, we have added a few languages using the same procedure as described above (validated portion only). We therefore recommend downloading the most recent language-specific similarity scores from the similarity_scores folder of our Hugging Face repository, VoxCommunis. The filenames have the structure {Common Voice language ID}_{Common Voice Version Number}. Each file has the following columns:

  • enroll: The enrollment filename
  • test: The test filename
  • score: The cosine similarity score generated by the automatic speaker verification system
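As a rough sketch of how these files might be read, the example below parses a filename stem and loads the three columns. Several details are assumptions, not facts from this page: that the files are tab-separated with a header row naming the columns, the hypothetical stem `tr_v17`, the made-up filenames in the sample, and the 0.4 cutoff, which is purely illustrative.

```python
import csv
import io


def parse_score_filename(stem: str) -> tuple[str, str]:
    # Split on the last underscore: {language ID}_{version number}.
    language_id, _, version = stem.rpartition("_")
    return language_id, version


def load_scores(handle, delimiter: str = "\t") -> list[tuple[str, str, float]]:
    # Assumption: a header row with the columns enroll, test, score.
    reader = csv.DictReader(handle, delimiter=delimiter)
    return [(row["enroll"], row["test"], float(row["score"])) for row in reader]


def below_threshold(rows, threshold: float):
    # Keep trials whose cosine similarity falls under the chosen cutoff.
    return [row for row in rows if row[2] < threshold]


# Made-up sample standing in for a downloaded file.
sample = (
    "enroll\ttest\tscore\n"
    "file_a.mp3\tfile_b.mp3\t0.62\n"
    "file_a.mp3\tfile_c.mp3\t0.08\n"
)
rows = load_scores(io.StringIO(sample))
print(parse_score_filename("tr_v17"))
print(below_threshold(rows, 0.4))
```

If the real files use a different delimiter or lack a header row, adjust `load_scores` accordingly.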

Code repo for ASV

The code used for the automatic speaker verification can be found at areffarhadi/asv-commonvoice.

Data and perceptual analysis

The repository also contains the code and datasets from the perceptual analysis reported in the Interspeech 2025 paper.

auditing_result_analysis.qmd: An annotated R script with the analyses and statistics reported in the Interspeech 2025 paper. The format is R Quarto.

auditing_result_analysis.html: auditing_result_analysis.qmd rendered in HTML format.

auditing_result_analysis_files/: Folder containing the figures and output generated by auditing_result_analysis.qmd that are necessary for viewing the corresponding HTML file.

audit_r1.csv: Dataset read in by auditing_result_analysis.qmd. This dataset contains the initial perceptual evaluation of 30 trials (enrollment + test: 5 files each from the intervals <0.1, [0.1, 0.2), [0.2, 0.3), [0.3, 0.4), [0.4, 0.5), >=0.5) from all 76 languages. Annotators responded with same speaker, different speaker, missing speech, audio quality issue, or not sure.
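The interval scheme described for audit_r1.csv (six score intervals, five trials each, 30 trials in total) can be illustrated with a short sketch. How the trials were actually drawn is not stated here, so the random sampling, the seed, and the function names below are assumptions.

```python
import bisect
import random
from collections import defaultdict

EDGES = [0.1, 0.2, 0.3, 0.4, 0.5]
LABELS = ["<0.1", "[0.1, 0.2)", "[0.2, 0.3)", "[0.3, 0.4)", "[0.4, 0.5)", ">=0.5"]


def interval_label(score: float) -> str:
    # bisect_right places a score equal to an edge in the interval that starts there.
    return LABELS[bisect.bisect_right(EDGES, score)]


def sample_trials(trials, per_interval: int = 5, seed: int = 0):
    """trials: (enroll, test, score) tuples; returns a dict of label -> sampled trials."""
    rng = random.Random(seed)  # Assumption: simple random sampling within each interval.
    grouped = defaultdict(list)
    for trial in trials:
        grouped[interval_label(trial[2])].append(trial)
    return {
        label: rng.sample(grouped[label], min(per_interval, len(grouped[label])))
        for label in LABELS
    }
```

With at least five trials available in every interval, this yields 6 x 5 = 30 trials.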

audit_r2.csv: Dataset read in by auditing_result_analysis.qmd. This dataset contains the second perceptual evaluation on a random sample of 150 trials, completed by all five annotators. Annotators responded with same speaker, different speaker, missing speech, audio quality issue, or not sure. This was used to assess the interannotator agreement.
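The agreement analysis actually reported is in auditing_result_analysis.qmd, and this page does not name the statistic used. Purely as an illustration of how agreement among a fixed number of annotators over categorical responses can be measured, here is a sketch of Fleiss' kappa. The choice of statistic and the input layout are assumptions.

```python
from collections import Counter


def fleiss_kappa(ratings: list[list[str]]) -> float:
    """ratings: one list of labels per trial, with the same number of raters for every trial."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    categories = sorted({label for item in ratings for label in item})
    counts = [Counter(item) for item in ratings]

    # Observed agreement: proportion of agreeing rater pairs, averaged over trials.
    per_item = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters) / (n_raters * (n_raters - 1))
        for c in counts
    ]
    observed = sum(per_item) / n_items

    # Expected agreement from the overall proportion of each category.
    proportions = [sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories]
    expected = sum(p ** 2 for p in proportions)

    if expected == 1.0:
        # Only one category was ever used, so kappa is undefined.
        return float("nan")
    return (observed - expected) / (1 - expected)
```

Each inner list would hold the five annotators' responses for one trial, for example `["same speaker", "same speaker", "different speaker", "same speaker", "not sure"]`.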

Referencing

For use of materials included or linked on this page, please reference:

Zhang, M., Farhadipour, A., Baker, A., Ma, J., Pricop, B., Chodroff, E. (2025) Quantifying and Reducing Speaker Heterogeneity within the Common Voice Corpus for Phonetic Analysis. Proc. Interspeech 2025, 3933-3937, doi: 10.21437/Interspeech.2025-2027

@inproceedings{zhang25s_interspeech,
  title     = {Quantifying and Reducing Speaker Heterogeneity within the {Common Voice Corpus} for Phonetic Analysis},
  author    = {Miao Zhang and Aref Farhadipour and Annie Baker and Jiachen Ma and Bogdan Pricop and Eleanor Chodroff},
  year      = {{2025}},
  booktitle = {{Interspeech 2025}},
  pages     = {{3933--3937}},
  doi       = {{10.21437/Interspeech.2025-2027}},
  issn      = {{2958-1796}}
}
