PhonemeCVAE: Contrastive Latent Clustering and Class-Conditioned Priors for Disentangled Phoneme Interpolation
High-quality TTS systems rely on structured, disentangled speech representations to achieve precise pronunciation control. We propose PhonemeCVAE, a phoneme-conditioned variational autoencoder that learns a structured continuous latent space for phoneme-level representations. The model introduces a class-conditioned Gaussian prior for each phoneme and employs a contrastive objective to promote compact intra-class clustering and clear disentanglement between phoneme classes. The resulting latent space enables controllable phoneme modification via interpolation at inference time, allowing smooth transitions between phonological classes and semantically meaningful edits to synthesized speech. Furthermore, the combination of contrastive regularization and phoneme-conditioned priors yields a structured latent topology that generalizes across English speech datasets, supporting consistent phoneme interpolation without compromising synthesis quality.
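The two regularizers described above can be sketched as follows. This is a minimal illustration, not the paper's exact losses: the function names, the diagonal-Gaussian assumption for the class priors, and the specific supervised-contrastive form are our assumptions for exposition.

```python
import math

def kl_to_class_prior(mu, logvar, mu_c, logvar_c):
    """KL( N(mu, exp(logvar)) || N(mu_c, exp(logvar_c)) ) for diagonal Gaussians.

    mu, logvar: encoder posterior parameters for one sample.
    mu_c, logvar_c: the (hypothetical) learned prior parameters for the
    sample's phoneme class. Summed over latent dimensions.
    """
    total = 0.0
    for m, lv, mc, lvc in zip(mu, logvar, mu_c, logvar_c):
        v, vc = math.exp(lv), math.exp(lvc)
        total += 0.5 * (lvc - lv + (v + (m - mc) ** 2) / vc - 1.0)
    return total

def contrastive_loss(zs, labels, temperature=0.1):
    """One plausible supervised-contrastive term over a batch of latents:
    pull same-phoneme latents together, push different phonemes apart."""
    def norm(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    zs = [norm(z) for z in zs]  # cosine similarity via unit vectors
    n = len(zs)
    loss, count = 0.0, 0
    for i in range(n):
        sims = [dot(zs[i], zs[j]) / temperature for j in range(n)]
        # log-sum-exp denominator over all other samples (self excluded)
        denom = sum(math.exp(s) for j, s in enumerate(sims) if j != i)
        pos = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not pos:
            continue  # no positive pair for this anchor
        loss -= sum(sims[j] - math.log(denom) for j in pos) / len(pos)
        count += 1
    return loss / count
```

A batch whose latents already cluster by phoneme label incurs a lower contrastive loss than the same latents with scrambled labels, which is the pressure that produces compact intra-class clusters.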
At inference, we interpolate audio samples between the centroids of two selected phoneme classes. The interpolation factor α weights the source and target class centroids, with α = 0.0 sampling from the source class and α = 1.0 sampling from the target class.
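The interpolation above amounts to a convex combination of the two class centroids in latent space, with the result fed to the decoder. A minimal sketch (the helper name and plain-list vector representation are assumptions):

```python
def interpolate_centroids(mu_src, mu_tgt, alpha):
    """Convex combination of source and target phoneme class centroids.

    alpha = 0.0 -> source centroid, alpha = 1.0 -> target centroid;
    intermediate alpha values trace a straight path between the classes.
    The returned latent would then be passed to the VAE decoder.
    """
    return [(1.0 - alpha) * s + alpha * t for s, t in zip(mu_src, mu_tgt)]
```

Sweeping α from 0 to 1 yields the smooth transition between phonological classes described above.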
The code will be released after the paper is accepted.