SUPERVISED CONTRASTIVE VARIATIONAL AUTOENCODERS FOR PHONEME-DISENTANGLED SPEECH SYNTHESIS
Abstract
Generating high-quality, natural-sounding speech with text-to-speech (TTS) models is a central goal of audio synthesis, yet these models often suffer from entangled representations that blur the distinctions between phoneme classes, making precise control of phoneme pronunciation difficult. This paper presents PhonemeCVAE, a novel approach that combines contrastive learning with a Variational Autoencoder (VAE) to produce disentangled latent embeddings for distinct phoneme classes. We demonstrate that by carefully integrating supervised contrastive learning into the VAE paradigm and training a phoneme-conditioned VAE with a Gaussian prior per phoneme class, the latent space achieves much stronger phoneme separability and more compact within-class clustering than training without the contrastive loss. Adding the supervised contrastive loss to the training objective enables the per-class Gaussian priors to learn disentangled phonetic representations that can later be used at inference time to generate gradually interpolated phonemes.
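The abstract does not spell out the exact form of the contrastive term, so as a non-authoritative sketch, here is a minimal NumPy implementation of a standard supervised contrastive (SupCon-style) loss that could be added to a VAE objective. It assumes latent embeddings are compared after L2 normalization and that integer phoneme labels are available per sample; the function name and temperature default are illustrative, not taken from the repository.

```python
import numpy as np

def supervised_contrastive_loss(z, labels, temperature=0.1):
    """SupCon-style loss: pull same-phoneme latents together, push others apart.

    z      : (n, d) array of latent embeddings (e.g. VAE posterior means)
    labels : (n,) integer phoneme class labels
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # unit-normalize embeddings
    sim = z @ z.T / temperature                        # pairwise cosine similarities
    n = len(labels)
    logits_mask = ~np.eye(n, dtype=bool)               # exclude self-comparisons
    pos_mask = (labels[:, None] == labels[None, :]) & logits_mask

    # log-softmax over all other samples (numerically stabilized)
    sim_max = sim.max(axis=1, keepdims=True)
    exp_sim = np.exp(sim - sim_max) * logits_mask
    log_prob = sim - sim_max - np.log(exp_sim.sum(axis=1, keepdims=True))

    # average log-probability over positives, for anchors that have positives
    pos_counts = pos_mask.sum(axis=1)
    valid = pos_counts > 0
    mean_log_prob_pos = (log_prob * pos_mask).sum(axis=1)[valid] / pos_counts[valid]
    return -mean_log_prob_pos.mean()
```

In this sketch the loss is lowest when embeddings of the same phoneme class cluster tightly while different classes stay apart, which is the separability property the abstract describes.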
Architecture overview
Installation
git clone https://github.com/nina-goes/PhonemeCVAE.git
cd PhonemeCVAE
conda env create -f environment.yml
Audio Samples
Audio samples are interpolated between the centroids of two selected phoneme classes. The interpolation factor α controls the weighting between the source and target class centroids: α = 0.0 samples from the source class and α = 1.0 samples from the target class.
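The interpolation described above can be sketched as a simple convex combination of the two class centroids in latent space. This is an assumption about the mechanics, not the repository's exact code; how the centroids are obtained (e.g. per-class prior means or averaged posterior means) and how the decoder consumes the latent are left open here.

```python
import numpy as np

def interpolate_centroids(mu_src, mu_tgt, alpha):
    """Blend two phoneme-class centroids in latent space.

    alpha = 0.0 returns the source centroid, alpha = 1.0 the target centroid;
    intermediate values yield gradually interpolated latents to be decoded.
    """
    mu_src = np.asarray(mu_src, dtype=float)
    mu_tgt = np.asarray(mu_tgt, dtype=float)
    return (1.0 - alpha) * mu_src + alpha * mu_tgt
```

Decoding a sweep of α values (e.g. 0.0, 0.25, 0.5, 0.75, 1.0) through the trained decoder would then produce the gradual phoneme transition the samples demonstrate.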