HUPA Voice Disorders Dataset

This repository contains MATLAB and Python scripts for feature extraction and classification of sustained vowel voice signals (pathological vs. healthy) using the HUPA database. The focus is on Perturbation, Regularity, Noise (PRN), and Complexity features.

HUPA Database

To run the scripts, you need the HUPA database, which is available from Zenodo:

https://zenodo.org/uploads/17704572

After downloading and organising the data, the expected structure inside this repository is:

HUPA-Voice-Analysis/
├── toolboxes/
│   ├── AVCA-ByO-master/
│   ├── covarep-master/
│   ├── fastdfa/
│   ├── hctsa-main/
│   ├── hurst estimators/
│   ├── ME-master/
│   └── rpde/
├── data/
│   ├── HUPA_db/
│   │   ├── healthy/
│   │   │   ├── 50 kHz/      ← mono .wav files at (or resampled to) 50 kHz
│   │   │   └── 25 kHz/      ← mono .wav files resampled to 25 kHz
│   │   ├── pathological/
│   │   │   ├── 50 kHz/      ← mono .wav files at (or resampled to) 50 kHz
│   │   │   └── 25 kHz/      ← mono .wav files resampled to 25 kHz
│   │   ├── HUPA_db.xlsx
│   │   └── README.md
│   ├── figures/
│   │   ├── ROC_HUPA_50kHz_MATLAB.pdf
│   │   ├── ROC_HUPA_50kHz_MATLAB.png
│   │   ├── ROC_HUPA_50kHz_Python.pdf
│   │   ├── ROC_HUPA_50kHz_Python.png
│   │   ├── ROC_HUPA_25kHz_MATLAB.pdf
│   │   ├── ROC_HUPA_25kHz_MATLAB.png
│   │   ├── ROC_HUPA_25kHz_Python.pdf
│   │   └── ROC_HUPA_25kHz_Python.png
│   ├── HUPA_Python_Results_Summary_25kHz.csv
│   ├── HUPA_Python_Results_Summary_50kHz.csv
│   ├── HUPA_voice_features_PRN_CPP_25kHz.csv
│   └── HUPA_voice_features_PRN_CPP_50kHz.csv
├── HUPA_Features_Extraction.m
├── HUPA_PRN_GridSearch_ROC.m
├── HUPA_Python_GridSearch.py
├── requirements.txt
└── README.md

The healthy/ folder contains recordings from healthy speakers.
The pathological/ folder contains recordings from patients with different laryngeal pathologies.
Each condition is available at 50 kHz and 25 kHz (all files are mono).
The data/HUPA_db/ folder also contains the spreadsheet HUPA_db.xlsx, which describes all speakers and recordings (age, sex, GRBAS scores, pathology codes, etc.), and a local README.md that documents the database structure and the metadata fields in the spreadsheet.
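
Before running the scripts, it can help to confirm that the audio folders are where the scripts expect them. The following is a minimal sketch, not part of the repository: the function name `count_wav_files` is hypothetical, and the folder names are taken from the tree above.

```python
from pathlib import Path

# Folder names as shown in the expected structure above
CLASSES = ("healthy", "pathological")
RATES = ("50 kHz", "25 kHz")


def count_wav_files(db_root):
    """Return {(class, rate): number of .wav files} for the expected layout.

    A missing folder is reported as None instead of raising an error.
    """
    db_root = Path(db_root)
    counts = {}
    for cls in CLASSES:
        for rate in RATES:
            folder = db_root / cls / rate
            if folder.is_dir():
                counts[(cls, rate)] = sum(
                    1 for p in folder.iterdir() if p.suffix.lower() == ".wav"
                )
            else:
                counts[(cls, rate)] = None
    return counts
```

Calling `count_wav_files("data/HUPA_db")` from the repository root should then give one entry per class and sampling rate, with `None` marking any folder that has not been created yet.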

MATLAB Workflow

1. Feature Extraction (`HUPA_Features_Extraction.m`)

This script:

- Loads .wav files from:
  - data/HUPA_db/healthy/50 kHz/
  - data/HUPA_db/pathological/50 kHz/
  - data/HUPA_db/healthy/25 kHz/
  - data/HUPA_db/pathological/25 kHz/
- Extracts:
  - AVCA PRN features (Perturbation, Regularity, Noise)
  - Nonlinear/complexity features (depending on AVCA configuration)
  - CPP (Cepstral Peak Prominence) using Covarep
- Saves two CSV files, one per sampling frequency, in the data/ folder:
  - HUPA_voice_features_PRN_CPP_50kHz.csv
  - HUPA_voice_features_PRN_CPP_25kHz.csv

Each CSV includes:

- One row per audio file
- Columns:
  - All AVCA PRN (and complexity) features
  - CPP
  - FileName
  - Label (0 = healthy, 1 = pathological)
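
The column layout above can be checked with a few lines of standard-library Python. This is only a sketch: the helper name `summarise_features_csv` is hypothetical, and it assumes a comma-separated file with a header row in which the `FileName` and `Label` columns are spelled exactly as listed.

```python
import csv
from collections import Counter


def summarise_features_csv(csv_path):
    """Return (feature column names, row count per Label value)."""
    with open(csv_path, newline="") as fh:
        reader = csv.DictReader(fh)
        # Every column other than FileName and Label is treated as a feature
        feature_names = [
            c for c in reader.fieldnames if c not in ("FileName", "Label")
        ]
        # int(float(...)) accepts labels written as "1" or "1.0"
        counts = Counter(int(float(row["Label"])) for row in reader)
    return feature_names, counts

# Usage (path from the section above):
# names, counts = summarise_features_csv("data/HUPA_voice_features_PRN_CPP_50kHz.csv")
# print(len(names), "features;", counts[0], "healthy,", counts[1], "pathological")
```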

2. Classification & ROC Analysis (`HUPA_PRN_GridSearch_ROC.m`)

For each CSV:

- Loads HUPA_voice_features_PRN_CPP_50kHz.csv or HUPA_voice_features_PRN_CPP_25kHz.csv.
- Defines feature groups:
  - Noise
  - Perturbation (including CPP and jitter/shimmer)
  - Tremor
  - Complexity / nonlinear measures
- Cleans the data:
  - Removes all-NaN / constant columns
  - Imputes remaining NaNs (median)
- Splits the data:
  - 80% Train (for hyperparameter optimisation via 5-fold CV)
  - 20% independent Test set
- Trains and tunes:
  - Logistic Regression (fitclinear)
  - SVM (RBF) (fitcsvm + fitPosterior)
  - Random Forest (TreeBagger)
  - MLP (fitcnet, if available)
- Evaluates models on the Test set and computes AUC.
- Plots ROC curves for the four feature groups (Noise, Perturbation, Tremor, Complexity).
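
For readers more familiar with Python, the data-cleaning step listed above (dropping all-NaN and constant columns, then median imputation) could be sketched with pandas as follows. This is an illustrative equivalent, not a translation of the MATLAB script; the function name `clean_features` is hypothetical, and the medians are computed over whatever table is passed in.

```python
import numpy as np
import pandas as pd


def clean_features(X):
    """Drop all-NaN and constant columns, then fill NaNs with column medians."""
    # Remove columns in which every value is missing
    X = X.dropna(axis=1, how="all")
    # nunique() ignores NaN, so a column with one distinct value is constant
    X = X.loc[:, X.nunique() > 1]
    # Replace the remaining NaNs with the median of their column
    return X.fillna(X.median())
```

To keep the Test set independent, one option is to compute the medians on the training rows only and reuse them for the test rows.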

The script saves one figure per sampling rate in data/figures/, using the convention:

- For 50 kHz:
  - ROC_HUPA_50kHz_MATLAB.png
  - ROC_HUPA_50kHz_MATLAB.pdf
- For 25 kHz:
  - ROC_HUPA_25kHz_MATLAB.png
  - ROC_HUPA_25kHz_MATLAB.pdf

Python Workflow (`HUPA_Python_GridSearch.py`)

A Python implementation using scikit-learn reproduces the MATLAB analysis for both sampling frequencies.

Inputs

The script expects the two CSVs generated by MATLAB:

data/HUPA_voice_features_PRN_CPP_50kHz.csv
data/HUPA_voice_features_PRN_CPP_25kHz.csv

For each CSV, it runs the full pipeline independently.

Steps

For each sampling frequency (50 kHz, 25 kHz):

- Loads the corresponding CSV.
- Defines the same feature groups:
  - Noise, Perturbation, Tremor, Complexity.
- Uses a common train–test split:
  - 80% Train, 20% Test, stratified by label.
- For each group, runs a GridSearchCV with 5-fold CV and AUC as the scoring metric, over:
  - Logistic Regression
  - SVM (RBF)
  - Random Forest
  - k-NN
  - MLP
- Each model is wrapped in a Pipeline with:
  - SimpleImputer(strategy="median")
  - StandardScaler (except Random Forest, which only uses imputation)
- Evaluates the best model (per algorithm) on the hold-out Test set.
- Plots ROC curves (2×2 subplots for Noise/Perturbation/Tremor/Complexity) and saves them to data/figures/:
  - 50 kHz:
    - ROC_HUPA_50kHz_Python.png
    - ROC_HUPA_50kHz_Python.pdf
  - 25 kHz:
    - ROC_HUPA_25kHz_Python.png
    - ROC_HUPA_25kHz_Python.pdf
- Saves a summary CSV with all models and groups:
  - data/HUPA_Python_Results_Summary_50kHz.csv
  - data/HUPA_Python_Results_Summary_25kHz.csv
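
The steps above can be illustrated for a single model and a single feature group. The sketch below is not taken from HUPA_Python_GridSearch.py: it uses synthetic data from `make_classification` in place of the HUPA features, only Logistic Regression, and a hypothetical grid over `C`, since the script's actual grids and random seeds are not listed here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for one feature group (not HUPA data)
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X[::17, 0] = np.nan  # inject a few missing values for the imputer

# 80% Train / 20% Test, stratified by label
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Imputation and scaling are fitted inside each CV fold via the Pipeline
pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Hypothetical grid, chosen only for illustration
param_grid = {"clf__C": [0.1, 1.0, 10.0]}

# 5-fold CV with AUC as the scoring metric
search = GridSearchCV(pipe, param_grid, cv=5, scoring="roc_auc")
search.fit(X_train, y_train)

# Evaluate the refitted best model on the hold-out Test set
test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print(f"CV AUC: {search.best_score_:.3f}  Test AUC: {test_auc:.3f}")
```

Keeping the imputer and scaler inside the Pipeline means their statistics are learned from the training folds only, so the cross-validated AUC is not inflated by information from the validation folds.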

Each summary file contains, for every combination of feature group and model:

- Group
- Model
- Test_AUC
- CV_AUC_Mean
- Best_Params
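
As a possible way to read a summary file, the sketch below picks the top-scoring model for each feature group. The helper name `best_model_per_group` is hypothetical, and the code assumes the CSV header uses exactly the column names listed above.

```python
import pandas as pd


def best_model_per_group(summary, metric="Test_AUC"):
    """Return, for each Group, the row with the highest value of `metric`."""
    # idxmax gives the index label of the best row within each group
    idx = summary.groupby("Group")[metric].idxmax()
    return summary.loc[idx].reset_index(drop=True)

# Usage (path from the section above):
# summary = pd.read_csv("data/HUPA_Python_Results_Summary_50kHz.csv")
# print(best_model_per_group(summary)[["Group", "Model", "Test_AUC", "CV_AUC_Mean"]])
```

Passing `metric="CV_AUC_Mean"` ranks the models by their cross-validation score instead, which avoids choosing a model on the basis of the Test set.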

Requirements

MATLAB

- MATLAB (R2020b or newer recommended)
- Statistics and Machine Learning Toolbox
- Deep Learning Toolbox (optional, for fitcnet)

External Toolboxes

Place these libraries inside toolboxes/:

- AVCA-ByO: Essential for P, R, N features.
- Covarep: Used for CPP feature extraction.
- Hurst Estimators: Implementation to compute the Hurst exponent.
- RPDE: Code to compute Recurrence Period Density Entropy (Little et al., 2007).
- FastDFA: Implementation to compute Detrended Fluctuation Analysis (Little et al., 2006).
- HCTSA: Highly Comparative Time-Series Analysis (used for D2 and LLE).
- ME (Markovian Entropies): Functions for the computation of entropies from Markov Models.

Compatibility Note for Newer MATLAB Versions

Many of these toolboxes were developed years ago. If you are using a recent version of MATLAB (e.g., R2020b+), please be aware of the following:

Legacy Code: You may need to manually update small parts of the external toolboxes to fix deprecated functions.

Path Conflicts: The script HUPA_Features_Extraction.m already handles a known conflict with Covarep (it removes backcompatibility_2015 to avoid breaking the built-in audioread).

Debugging: If you encounter "function not found" or "input argument" errors inside these toolboxes, check that their internal paths are correctly added and that they support your MATLAB version.

Python

Install dependencies via:

pip install -r requirements.txt

Citation

[Add here the reference to the HUPA database and the related publication, once finalised.]
