SUPERVISED CONTRASTIVE VARIATIONAL AUTOENCODERS FOR PHONEME-DISENTANGLED SPEECH SYNTHESIS
Abstract
Generating high-quality, natural-sounding speech with text-to-speech (TTS) models is a central goal of audio synthesis, yet these models often suffer from entangled representations that blur the distinctions between phoneme classes, making precise control of phoneme pronunciation difficult. This paper presents PhonemeCVAE, a novel approach that combines contrastive learning with a Variational Autoencoder (VAE) to produce disentangled latent embeddings for distinct phoneme classes. We demonstrate that by carefully integrating supervised contrastive learning into the VAE paradigm and training a phoneme-conditioned VAE with a Gaussian prior per phoneme class, the latent space achieves much stronger phoneme separability and more compact within-class clustering than training without the contrastive loss. Adding the supervised contrastive loss to the training objective enables the per-class Gaussian priors to learn disentangled phonetic representations that can later be used at inference time to generate gradually interpolated phonemes.
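The abstract does not spell out the exact form of the contrastive term, so as a non-authoritative sketch, here is a minimal NumPy implementation of a standard supervised contrastive (SupCon-style) loss that could be added to a VAE objective. It assumes latent embeddings are compared after L2 normalization and that integer phoneme labels are available per sample; the function name and temperature default are illustrative, not taken from the repository.

```python
import numpy as np

def supervised_contrastive_loss(z, labels, temperature=0.1):
    """SupCon-style loss: pull same-phoneme latents together, push others apart.

    z      : (n, d) array of latent embeddings (e.g. VAE posterior means)
    labels : (n,) integer phoneme class labels
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit-normalize embeddings
    sim = z @ z.T / temperature                        # pairwise cosine similarities
    n = len(labels)
    logits_mask = ~np.eye(n, dtype=bool)               # exclude self-comparisons
    pos_mask = (labels[:, None] == labels[None, :]) & logits_mask

    # log-softmax over all other samples (numerically stabilized)
    sim_max = sim.max(axis=1, keepdims=True)
    exp_sim = np.exp(sim - sim_max) * logits_mask
    log_prob = sim - sim_max - np.log(exp_sim.sum(axis=1, keepdims=True))

    # average log-probability over positives, for anchors that have positives
    pos_counts = pos_mask.sum(axis=1)
    valid = pos_counts > 0
    mean_log_prob_pos = (log_prob * pos_mask).sum(axis=1)[valid] / pos_counts[valid]
    return -mean_log_prob_pos.mean()
```

In this sketch the loss is lowest when embeddings of the same phoneme class cluster tightly while different classes stay apart, which is the separability property the abstract describes.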
Architecture overview
Installation
git clone https://github.com/nina-goes/PhonemeCVAE.git
cd PhonemeCVAE
conda env create -f environment.yml
Audio Samples
Audio samples are interpolated between the centroids of two selected phoneme classes. The interpolation factor α controls the weighting between the source and target class centroids: α = 0.0 samples from the source class and α = 1.0 samples from the target class.
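The interpolation described above can be sketched as a simple convex combination of the two class centroids in latent space. This is an assumption about the mechanics, not the repository's exact code; how the centroids are obtained (e.g. per-class prior means or averaged posterior means) and how the decoder consumes the latent are left open here.

```python
import numpy as np

def interpolate_centroids(mu_src, mu_tgt, alpha):
    """Blend two phoneme-class centroids in latent space.

    alpha = 0.0 returns the source centroid, alpha = 1.0 the target centroid;
    intermediate values yield gradually interpolated latents to be decoded.
    """
    mu_src = np.asarray(mu_src, dtype=float)
    mu_tgt = np.asarray(mu_tgt, dtype=float)
    return (1.0 - alpha) * mu_src + alpha * mu_tgt
```

Decoding a sweep of α values (e.g. 0.0, 0.25, 0.5, 0.75, 1.0) through the trained decoder would then produce the gradual phoneme transition the samples demonstrate.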