Singing MOS Predictor: a predictor for singing mean opinion score (MOS).
Our paper: [SingMOS-Pro: A Comprehensive Benchmark for Singing Quality Assessment](https://arxiv.org/abs/2510.01812)
The SingMOS repository provides an easy-to-use way to perform singing voice MOS prediction.
Currently we provide the following models:
| Model | Specifier | Train Data | Backbone Model | Paper |
|---|---|---|---|---|
| Singing-SSL-MOS | singmos_pro | SingMOS-Pro | wav2vec2_large_ll60k | Tang (2025) |
| Singing-SSL-MOS | singmos_v1 | SingMOS-v1 | wav2vec2-base-960 | Tang (2024) |
- singmos_pro: Benchmark for Singing MOS Prediction: an SSL-MOS model trained in the South-Twilight/SingMOS-Predictor repository on the SingMOS-Pro dataset.
- singmos_v1: Baseline for the Singing Track of the VoiceMOS Challenge 2024: an SSL-MOS model trained in the nii-yamagishilab/mos-finetune-ssl repository on the SingMOS-v1 dataset.
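Either specifier is passed directly to torch.hub.load; a minimal sketch, assuming both specifiers are exposed by the v1.1.2 hubconf (the full single-clip example is shown below):

```python
import torch

# Load either checkpoint by its specifier string; the call is otherwise identical.
predictor_pro = torch.hub.load("South-Twilight/SingMOS:v1.1.2", "singmos_pro", trust_repo=True)
predictor_v1 = torch.hub.load("South-Twilight/SingMOS:v1.1.2", "singmos_v1", trust_repo=True)
```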
All models were trained at a 16 kHz sampling rate.
- [2025.11.29]: Release SingMOS:v1.1.2 version, fix README.
- [2025.11.11]: Release SingMOS:v1.1.1 version, fix bugs with batch inference.
- [2025.11.06]: Release SingMOS:v1.1.0 version, train with SingMOS-Pro.
- [2025.06.30]: Release SingMOS:v0.3.0 version, train with more data.
- [2024.08.28]: Release SingMOS:v0.2.1 version, support S3PRL models as base models instead of fairseq models.
- [2024.06.28]: Release SingMOS:v0.1.0 version.
Predict the naturalness mean opinion score (MOS) of your audio with Singing-SSL-MOS:
```python
import torch
import librosa

wave, sr = librosa.load("your_audio.wav", sr=None, mono=True)
# If the sample rate is not 16 kHz, resample the wave.
if sr != 16000:
    wave = librosa.resample(wave, orig_sr=sr, target_sr=16000)
    sr = 16000
wave = torch.from_numpy(wave).unsqueeze(0)  # [1, T]
length = torch.tensor([wave.shape[1]], dtype=torch.long)  # [1]

predictor = torch.hub.load("South-Twilight/SingMOS:v1.1.2", "singmos_pro", trust_repo=True)
with torch.no_grad():
    score = predictor(wave, length)
print(f"Pred MOS: {score.item():.4f}")
```

SingMOS uses the torch.hub built-in model loader, so no library import is needed 😉
(As general dependencies, SingMOS requires Python >= 3.8, torch, librosa, and s3prl.)
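torch.hub caches the downloaded repository, so if an older SingMOS tag is already cached it can be refreshed with torch.hub's standard force_reload argument (a minimal sketch, not specific to SingMOS):

```python
import torch

# force_reload=True makes torch.hub re-download the tagged repository
# instead of reusing a previously cached copy.
predictor = torch.hub.load(
    "South-Twilight/SingMOS:v1.1.2", "singmos_pro", trust_repo=True, force_reload=True
)
```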
First, instantiate a MOS predictor with a model specifier string:
```python
import torch

predictor = torch.hub.load("South-Twilight/SingMOS:v1.1.2", "<specifier>", trust_repo=True)
```

Then, pass tensors of singing clips: wave in (Batch, Time) and length in (Batch):
```python
waves = torch.rand((2, 16000))  # Two clips, each 1 sec (sr=16,000)
lengths = []
for i in range(waves.shape[0]):
    lengths.append(waves[i].shape[0])
lengths = torch.tensor(lengths)
# waves: [2, T], lengths: [2]
score = predictor(waves, lengths)
# tensor([4.0321, 2.0943])
```

The returned scores of shape (Batch,) are the predicted MOS of each singing clip.
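The batch above uses equal-length clips. For clips of different lengths, one option is to zero-pad to the longest clip and pass the true lengths; a minimal sketch using torch.nn.utils.rnn.pad_sequence and reusing the predictor instantiated above (the padding strategy is an assumption, not something this README specifies):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

# Hypothetical clips of different lengths (1 s and 2 s at 16 kHz).
clips = [torch.rand(16000), torch.rand(32000)]
lengths = torch.tensor([c.shape[0] for c in clips], dtype=torch.long)  # [2]
waves = pad_sequence(clips, batch_first=True)  # [2, T_max], zero-padded

score = predictor(waves, lengths)  # [2] predicted MOS, one per clip
```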
If you want the MOS averaged over singing clips (e.g., for SVS model evaluation), just average the scores:

```python
average_score = score.mean().item()
# 2.0632
```

- SingMOS-Predictor
- MOS-Finetune-SSL
- SpeechMOS
```bibtex
@misc{tang2025singmosprocomprehensivebenchmarksinging,
      title={SingMOS-Pro: A Comprehensive Benchmark for Singing Quality Assessment},
      author={Yuxun Tang and Lan Liu and Wenhao Feng and Yiwen Zhao and Jionghao Han and Yifeng Yu and Jiatong Shi and Qin Jin},
      year={2025},
      eprint={2510.01812},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2510.01812},
}
```