The Voice Type Classifier is a classification model that, given an input audio file, outputs a segmentation of the audio into speaker types.
The model outputs four classes:
- FEM stands for adult female speech
- MAL stands for adult male speech
- KCHI stands for key-child speech (the child wearing the recorder)
- OCH stands for other child speech
The model has been specifically trained to work with child-centered long-form recordings. These are recordings that can span multiple hours and have been collected using a portable recorder attached to the vest of a child (usually 0 to 5 years of age).
To use the model, you will need a Unix-based machine (Linux or macOS) with Python 3.13 or higher installed. Windows is not supported at the moment. As system dependencies, make sure that uv, ffmpeg, and git-lfs are installed. You can check this by running:
```bash
./check_sys_dependencies.sh
```

You can now clone the repo with:

```bash
git lfs install
git clone --recurse-submodules https://github.com/LAAC-LSCP/VTC.git
cd VTC
```

Finally, you can install the Python dependencies with the following command (recommended):

```bash
uv sync
```

Alternatively, you can also install the Python dependencies using pip and Python 3.13 or higher (not recommended):

```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Inference is done using a checkpoint of the model, the corresponding config file used for training, and the list of audio files to run the model on. Your audio files should be in the .wav format, sampled at 16 000 Hz (16 kHz), and contain a single channel (mono).
If not, you can use the scripts/convert.py script to convert your audio files to 16 000 Hz and average the channels down to mono.
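If you prefer to do the conversion yourself, a standard ffmpeg invocation achieves the same result. This is a minimal sketch; the folder names `raw_audios/` and `audios/` are only illustrative:

```bash
# Convert every .wav in raw_audios/ to 16 kHz mono and write the result to audios/
# (folder names are illustrative; adapt them to your setup)
mkdir -p audios
for f in raw_audios/*.wav; do
  ffmpeg -i "$f" -ac 1 -ar 16000 "audios/$(basename "$f")"
done
```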
Run inference with:

```bash
# --wavs:   path to the folder containing the audio files
# --output: output folder
# --device: device to run the model on ('cpu', 'cuda'/'gpu', or 'mps')
uv run scripts/infer.py \
    --wavs audios \
    --output predictions \
    --device cpu
```

The model outputs are saved to `<output_folder>/` with the following structure:
```
<output_folder>/
├── rttm/           # Final output (with segment merging applied)
├── raw_rttm/       # Raw output (without segment merging)
├── rttm.csv        # CSV version of final speaker segments
└── raw_rttm.csv    # CSV version of raw speaker segments
```

> **Note**
> Segment merging is applied to the main output. See the pyannote.audio description for details.
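RTTM is a plain-text format with one segment per line (`SPEAKER <file> <channel> <onset> <duration> <NA> <NA> <label> <NA> <NA>`), so the predictions can be inspected with standard command-line tools. As a quick sanity check, the sketch below sums the predicted speech duration per voice type; the file name is only an example of what ends up in the `rttm/` folder:

```bash
# Total predicted speech (in seconds) per voice type (KCHI, OCH, FEM, MAL)
# in a single RTTM file; column 8 is the label, column 5 the segment duration.
awk '{dur[$8] += $5} END {for (label in dur) printf "%s\t%.1f s\n", label, dur[label]}' \
  predictions/rttm/example.rttm
```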
An example bash script for running inference is provided in scripts/run.sh. Simply set the correct variables in the script and run it:

```bash
sh scripts/run.sh
```

We tested the inference pipeline on multiple GPUs and CPUs and report the expected speedup factors, which can be used to estimate the total duration needed to process your recordings.
*Table 1: GPU times and Table 2: CPU times (expected speedup factors for the tested devices).*
On a GPU, with a speedup factor of roughly 905, it takes approximately:

- For a $1\text{ h}$ long audio, the inference will run for $\approx 4$ seconds ($3600 / 905$).
- For a $16\text{ h}$ long-form audio, the inference will run for $\approx 1 \text{ minute}$ and $4 \text{ seconds}$ ($16 \times 3600 / 905$).
On an Intel(R) Xeon(R) Silver 4214R CPU with a batch size of 64, the inference pipeline is considerably slower, with a speedup factor of roughly 15:

- For a $1\text{ h}$ long audio, the inference will run for $\approx 4$ minutes ($3600 / 15$).
- For a $16\text{ h}$ long-form audio, the inference will run for $\approx 1 \text{ hour}$ and $4 \text{ minutes}$ ($16 \times 3600 / 15$).
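These estimates follow directly from dividing the audio duration by the speedup factor. A minimal sketch of the back-of-the-envelope calculation, using the two factors quoted above (the 16 h value is just an example input):

```bash
# Rough inference-time estimate from the speedup factors reported above
# (~905x on the tested GPU, ~15x on the tested CPU).
hours=16
awk -v h="$hours" 'BEGIN {
  printf "GPU (~905x): ~%.0f seconds\n", h * 3600 / 905
  printf "CPU (~15x):  ~%.0f minutes\n", h * 3600 / 15 / 60
}'
```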
We evaluate the new model, VTC 2.0, on a held-out set and compare it to the previous models and to human performance (a second annotator, Human 2).
| Model | KCHI | OCH | MAL | FEM | Average F1-score |
|---|---|---|---|---|---|
| VTC 1.0 | 68.2 | 30.5 | 41.2 | 63.7 | 50.9 |
| VTC 1.5 | 68.4 | 20.6 | 56.7 | 68.9 | 53.6 |
| **VTC 2.0** | **71.8** | **51.4** | **60.3** | **74.8** | **64.6** |
| Human 2 | 79.7 | 60.4 | 67.6 | 71.5 | 69.8 |
Table 1: F1-scores (%) obtained on the standard test set by VTC 1.0, VTC 1.5, VTC 2.0, and a second human annotator (Human 2). The best model is indicated in bold.
As displayed in Table 1, our model outperforms the previous iterations, with performance close to that of the human annotator. VTC 2.0 even surpasses human performance on the FEM class.
- OVL: overlap between speakers.
- SIL: sections with silence or noise.
To cite this work, please use the following BibTeX entry:
```bibtex
@misc{charlot2025babyhubertmultilingualselfsupervisedlearning,
  title={BabyHuBERT: Multilingual Self-Supervised Learning for Segmenting Speakers in Child-Centered Long-Form Recordings},
  author={Théo Charlot and Tarek Kunze and Maxime Poli and Alejandrina Cristia and Emmanuel Dupoux and Marvin Lavechin},
  year={2025},
  eprint={2509.15001},
  archivePrefix={arXiv},
  primaryClass={eess.AS},
  url={https://arxiv.org/abs/2509.15001},
}
```

The Voice Type Classifier has benefited from numerous contributions over time; the following publications document its evolution, listed in reverse chronological order.
GitHub repository: github.com/LAAC-LSCP/VTC-IS-25
```bibtex
@inproceedings{kunze25_interspeech,
  title     = {{Challenges in Automated Processing of Speech from Child Wearables: The Case of Voice Type Classifier}},
  author    = {Tarek Kunze and Marianne Métais and Hadrien Titeux and Lucas Elbert and Joseph Coffey and Emmanuel Dupoux and Alejandrina Cristia and Marvin Lavechin},
  year      = {2025},
  booktitle = {{Interspeech 2025}},
  pages     = {2845--2849},
  doi       = {10.21437/Interspeech.2025-1962},
  issn      = {2958-1796},
}
```

GitHub repository: github.com/MarvinLvn/voice-type-classifier
```bibtex
@inproceedings{lavechin20_interspeech,
  title     = {An Open-Source Voice Type Classifier for Child-Centered Daylong Recordings},
  author    = {Marvin Lavechin and Ruben Bousbib and Hervé Bredin and Emmanuel Dupoux and Alejandrina Cristia},
  year      = {2020},
  booktitle = {Interspeech 2020},
  pages     = {3072--3076},
  doi       = {10.21437/Interspeech.2020-1690},
  issn      = {2958-1796},
}
```

This work uses the segma library, which is heavily inspired by pyannote.audio.
This work was performed using HPC resources from GENCI-IDRIS (Grant 2024-AD011015450 and 2025-AD011016414) and was developed as part of the ExELang project funded by the European Union (ERC, ExELang, Grant No 101001095).

