Estimate the maximum achievable correlation between model predictions and human ratings, given the inherent noise in subjective data.
Reference: Cumlin, F., "Rho-Perfect: Correlation Ceiling for Subjective Evaluation Datasets", ICASSP 2026.
```bash
# From GitHub
pip install git+https://github.com/fcumlin/rho-perfect.git

# With conda (install dependencies first)
conda create -n myenv python=3.10 numpy pandas scipy
conda activate myenv
pip install git+https://github.com/fcumlin/rho-perfect.git
```

```python
import pandas as pd
from rho_perfect import calculate_rho_perfect

# The data: one row per item with aggregated statistics.
ratings = pd.DataFrame({
    'filename': ['item_001', 'item_002', ...],
    'mean': [3.2, 4.1, ...],  # mean rating per item
    'std': [0.5, 0.3, ...],   # sample std per item
    'n': [8, 8, ...]          # number of ratings per item
})

rho_perfect = calculate_rho_perfect(ratings)
print(f"rho-Perfect = {rho_perfect:.3f}")
```
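If only raw per-rating data is available, the aggregated `mean`/`std`/`n` columns can be built with a pandas groupby. A minimal sketch (the raw column names and values here are illustrative assumptions):

```python
import pandas as pd

# Raw data: one row per individual rating (illustrative values).
raw = pd.DataFrame({
    'filename': ['item_001'] * 3 + ['item_002'] * 3,
    'rating':   [3.0, 3.5, 3.1, 4.0, 4.2, 4.1],
})

# Aggregate into the filename/mean/std/n layout shown above.
agg = (
    raw.groupby('filename')['rating']
       .agg(mean='mean', std='std', n='count')
       .reset_index()
)
print(agg)
```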
```python
# Compare to a model on the same data.
model_pcc = 0.85  # PCC: Pearson correlation coefficient
if model_pcc >= 0.95 * rho_perfect:
    print("Model is close to ceiling. Improve data quality for further gains.")
else:
    print(f"Model can improve. Gap to ceiling: {rho_perfect - model_pcc:.3f}")
```

From individual ratings:
```python
import pandas as pd
from rho_perfect import calculate_rho_perfect_from_ratings

# Raw ratings: one row per rating.
ratings = pd.DataFrame({
    'filename': ['item_001', 'item_001', 'item_002', ...],
    'rater_id': ['rater_01', 'rater_02', 'rater_01', ...],
    'rating': [3.0, 3.5, 4.0, ...]
})

rho_perfect = calculate_rho_perfect_from_ratings(ratings)
```

Assumptions:

- Ratings are conditionally independent given an item
- Rating noise may vary across items (heteroscedasticity)
- Each item has at least 3 ratings so that the within-item variance can be estimated (in practice, a few items with fewer ratings may be tolerable, since the per-item variances are averaged across all items)
- The dataset exhibits non-zero between-item variability (i.e., the mean rating varies across items)
Violations of these assumptions may lead to unreliable estimates. The implementation emits warnings when common failure modes are detected.
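The caveat in the third assumption, that averaging the per-item variances stabilizes the noise estimate, can be illustrated with a short stdlib simulation. Everything below is illustrative and independent of the package:

```python
import random
import statistics

random.seed(0)

# Simulate 200 items whose rating noise differs per item (heteroscedastic).
per_item_vars = []
for _ in range(200):
    true_mean = random.uniform(1, 5)
    noise_std = random.uniform(0.2, 0.8)  # noise level varies across items
    ratings = [random.gauss(true_mean, noise_std) for _ in range(8)]
    per_item_vars.append(statistics.variance(ratings))  # sample variance (ddof=1)

# Individual per-item variance estimates are noisy, but their average is a
# stable estimate of the typical within-item noise level.
avg_within_var = statistics.mean(per_item_vars)
print(round(avg_within_var, 3))
```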
**Definition 2.1** ($\rho$-Perfect). $\rho$-Perfect is the maximum Pearson correlation that any predictor can achieve against the observed mean ratings, given the estimated within-item rating noise. Interpretation: a model whose correlation approaches $\rho$-Perfect is limited by the data rather than by the model; see the paper for the formal definition.
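The paper's formal statement is not reproduced here; a ceiling of this kind is commonly written in the attenuation form below, given as a sketch under that assumption rather than the paper's exact formula. The symbols are assumptions: $\hat{\sigma}_s^2$ is the estimated between-item variance of the true mean ratings, $\bar{\sigma}^2$ the average within-item variance, and $n$ the number of ratings per item:

$$
\hat{\rho}_{\text{perfect}} = \sqrt{\frac{\hat{\sigma}_s^2}{\hat{\sigma}_s^2 + \bar{\sigma}^2 / n}}
$$

In this form, $\rho$-Perfect$^2$ equals the reliability of the per-item mean ratings, which is consistent with the split-rater validation comparing $\rho$-Perfect$^2$ to the test-retest correlation.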
`calculate_rho_perfect`

Calculate $\rho$-Perfect from aggregated per-item statistics.

- Input: DataFrame with columns `filename`, `mean`, `std`, `n`
- Output: float (0 < $\rho$ ≤ 1)
- Warnings: emitted for fewer than 50 items or fewer than 3 ratings per item

`calculate_rho_perfect_from_ratings`

Calculate $\rho$-Perfect from individual ratings.

- Input: DataFrame with columns `filename`, `rater_id`, `rating`
- Output: float (0 < $\rho$ ≤ 1)
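The split-rater validation idea, that $\rho$-Perfect$^2$ should track the test-retest correlation, can be sketched in plain NumPy by splitting simulated raters into two halves and correlating the per-item half-means. All names and numbers below are illustrative, not the package's implementation:

```python
import numpy as np

rng = np.random.default_rng(42)

n_items, n_raters = 200, 8
true_means = rng.uniform(1, 5, size=n_items)
# Each rating = true item mean + independent per-rating noise.
ratings = true_means[:, None] + rng.normal(0.0, 0.7, size=(n_items, n_raters))

# Split raters into two halves and correlate the per-item half-means.
half_a = ratings[:, : n_raters // 2].mean(axis=1)
half_b = ratings[:, n_raters // 2:].mean(axis=1)
test_retest = float(np.corrcoef(half_a, half_b)[0, 1])
print(round(test_retest, 3))
```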
```python
from rho_perfect import split_raters_validation, split_ratings_validation

# Validate: rho-Perfect^2 ≈ test-retest correlation (Section 3.1 of the paper).
# df: a ratings DataFrame in one of the formats above.
raters_results = split_raters_validation(df, n_iterations=10, seed=42)
ratings_results = split_ratings_validation(df, n_iterations=10, seed=42)
```

Development install and tests:

```bash
pip install -e ".[dev]"
pytest tests/
```

Citation:

```bibtex
@inproceedings{cumlin2026rhoperfect,
  title={Rho-Perfect: Correlation Ceiling for Subjective Evaluation Datasets},
  author={Cumlin, Fredrik},
  booktitle={ICASSP 2026},
  year={2026}
}
```

MIT License - see LICENSE file for details.