PhonemeCVAE: Contrastive Latent Clustering and Class-Conditioned Priors for Disentangled Phoneme Interpolation
High-quality TTS systems rely on structured, disentangled speech representations to achieve precise pronunciation control. We propose PhonemeCVAE, a phoneme-conditioned variational autoencoder that learns a structured continuous latent space for phoneme-level representations. The model introduces a class-conditioned Gaussian prior for each phoneme and employs a contrastive objective to promote compact intra-class clustering and clear disentanglement between phoneme classes. The resulting latent space enables controllable phoneme modification via interpolation at inference time, allowing smooth transitions between phonological classes and semantically meaningful edits to synthesized speech. Furthermore, the combination of contrastive regularization and phoneme-conditioned priors yields a structured latent topology that generalizes across English speech datasets, supporting consistent phoneme interpolation without compromising synthesis quality.
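The two regularizers described above can be sketched as follows. This is a minimal illustration, not the paper's exact losses: the function names, the diagonal-Gaussian assumption for the class priors, and the specific supervised-contrastive form are our assumptions for exposition.

```python
import math

def kl_to_class_prior(mu, logvar, mu_c, logvar_c):
    """KL( N(mu, exp(logvar)) || N(mu_c, exp(logvar_c)) ) for diagonal Gaussians.

    mu, logvar: encoder posterior parameters for one sample.
    mu_c, logvar_c: the (hypothetical) learned prior parameters for the
    sample's phoneme class. Summed over latent dimensions.
    """
    total = 0.0
    for m, lv, mc, lvc in zip(mu, logvar, mu_c, logvar_c):
        v, vc = math.exp(lv), math.exp(lvc)
        total += 0.5 * (lvc - lv + (v + (m - mc) ** 2) / vc - 1.0)
    return total

def contrastive_loss(zs, labels, temperature=0.1):
    """One plausible supervised-contrastive term over a batch of latents:
    pull same-phoneme latents together, push different phonemes apart."""
    def norm(v):
        n = math.sqrt(sum(x * x for x in v))
        return [x / n for x in v]
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    zs = [norm(z) for z in zs]  # cosine similarity via unit vectors
    n = len(zs)
    loss, count = 0.0, 0
    for i in range(n):
        sims = [dot(zs[i], zs[j]) / temperature for j in range(n)]
        # log-sum-exp denominator over all other samples (self excluded)
        denom = sum(math.exp(s) for j, s in enumerate(sims) if j != i)
        pos = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not pos:
            continue  # no positive pair for this anchor
        loss -= sum(sims[j] - math.log(denom) for j in pos) / len(pos)
        count += 1
    return loss / count
```

A batch whose latents already cluster by phoneme label incurs a lower contrastive loss than the same latents with scrambled labels, which is the pressure that produces compact intra-class clusters.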
At inference, we interpolate audio samples between the centroids of two selected phoneme classes. The interpolation factor α weights the source and target class centroids, with α = 0.0 sampling from the source class and α = 1.0 sampling from the target class.
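The interpolation above amounts to a convex combination of the two class centroids in latent space, with the result fed to the decoder. A minimal sketch (the helper name and plain-list vector representation are assumptions):

```python
def interpolate_centroids(mu_src, mu_tgt, alpha):
    """Convex combination of source and target phoneme class centroids.

    alpha = 0.0 -> source centroid, alpha = 1.0 -> target centroid;
    intermediate alpha values trace a straight path between the classes.
    The returned latent would then be passed to the VAE decoder.
    """
    return [(1.0 - alpha) * s + alpha * t for s, t in zip(mu_src, mu_tgt)]
```

Sweeping α from 0 to 1 yields the smooth transition between phonological classes described above.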
The code will be released after the paper is accepted.