This repository contains MATLAB and Python scripts for feature extraction and classification of sustained vowel voice signals (pathological vs. healthy) using the HUPA database. The focus is on Perturbation, Regularity, Noise (PRN), and Complexity features.
To run the scripts, you need the HUPA database.
After downloading and organising the data, the expected structure inside this repository is:
HUPA-Voice-Analysis/
├── toolboxes/
│ ├── AVCA-ByO-master/
│ ├── covarep-master/
│ ├── fastdfa/
│ ├── hctsa-main/
│ ├── hurst estimators/
│ ├── ME-master/
│ └── rpde/
├── data/
│ ├── HUPA_db/
│ │ ├── healthy/
│ │ │ ├── 50 kHz/ ← mono .wav files at (or resampled to) 50 kHz
│ │ │ └── 25 kHz/ ← mono .wav files resampled to 25 kHz
│ │ ├── pathological/
│ │ │ ├── 50 kHz/ ← mono .wav files at (or resampled to) 50 kHz
│ │ │ └── 25 kHz/ ← mono .wav files resampled to 25 kHz
│ │ ├── HUPA_db.xlsx
│ │ └── README.md
│ ├── figures/
│ │ ├── ROC_HUPA_50kHz_MATLAB.pdf
│ │ ├── ROC_HUPA_50kHz_MATLAB.png
│ │ ├── ROC_HUPA_50kHz_Python.pdf
│ │ ├── ROC_HUPA_50kHz_Python.png
│ │ ├── ROC_HUPA_25kHz_MATLAB.pdf
│ │ ├── ROC_HUPA_25kHz_MATLAB.png
│ │ ├── ROC_HUPA_25kHz_Python.pdf
│ │ └── ROC_HUPA_25kHz_Python.png
│ ├── HUPA_Python_Results_Summary_25kHz.csv
│ ├── HUPA_Python_Results_Summary_50kHz.csv
│ ├── HUPA_voice_features_PRN_CPP_25kHz.csv
│ └── HUPA_voice_features_PRN_CPP_50kHz.csv
├── HUPA_Features_Extraction.m
├── HUPA_PRN_GridSearch_ROC.m
├── HUPA_Python_GridSearch.py
├── requirements.txt
└── README.md
- The healthy/ folder contains recordings from healthy speakers.
- The pathological/ folder contains recordings from patients with different laryngeal pathologies.
- Each condition is available at 50 kHz and 25 kHz (all files are mono).
- Inside data/HUPA_db/ there is a spreadsheet, HUPA_db.xlsx, describing all speakers and recordings (age, sex, GRBAS scores, pathology codes, etc.), together with a local README.md in the same folder that documents the database structure and the metadata fields in the Excel file.
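Before running the scripts, it can help to confirm that the data folders are where the layout above expects them. The following is a minimal sketch, not part of the repository: the helper `missing_dirs` and the list `EXPECTED_DIRS` are hypothetical names, and only the folder paths come from the tree above.

```python
from pathlib import Path

# Folder names copied from the layout above; the helper itself is hypothetical.
EXPECTED_DIRS = [
    "data/HUPA_db/healthy/50 kHz",
    "data/HUPA_db/healthy/25 kHz",
    "data/HUPA_db/pathological/50 kHz",
    "data/HUPA_db/pathological/25 kHz",
    "toolboxes",
]


def missing_dirs(repo_root):
    """Return the expected sub-folders that do not exist under repo_root."""
    root = Path(repo_root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
```

Calling `missing_dirs(".")` from the repository root returns an empty list when every expected folder is present.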
This script (HUPA_Features_Extraction.m):
- Loads .wav files from:
  - data/HUPA_db/healthy/50 kHz/
  - data/HUPA_db/pathological/50 kHz/
  - data/HUPA_db/healthy/25 kHz/
  - data/HUPA_db/pathological/25 kHz/
- Extracts:
  - AVCA PRN features (Perturbation, Regularity, Noise)
  - Nonlinear/complexity features (depending on AVCA configuration)
  - CPP (Cepstral Peak Prominence) using Covarep
- Saves two CSV files, one per sampling frequency, in the data/ folder:
  - HUPA_voice_features_PRN_CPP_50kHz.csv
  - HUPA_voice_features_PRN_CPP_25kHz.csv
Each CSV includes:
- One row per audio file
- Columns:
  - All AVCA PRN (and complexity) features
  - CPP
  - FileName
  - Label (0 = healthy, 1 = pathological)
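As an illustration of the CSV layout described above, the sketch below reads one feature file using only the Python standard library. It assumes that every column other than FileName and Label holds a numeric feature and that missing values appear as empty cells or NaN; the function names `load_features` and `_to_float` are hypothetical and not taken from the repository scripts.

```python
import csv

META_COLUMNS = ("FileName", "Label")  # non-feature columns, as listed above


def _to_float(cell):
    """Convert a CSV cell to float; empty or non-numeric cells become NaN."""
    try:
        return float(cell)
    except (TypeError, ValueError):
        return float("nan")


def load_features(csv_path):
    """Read a feature CSV into (feature_names, X, y, file_names)."""
    with open(csv_path, newline="") as fh:
        reader = csv.DictReader(fh)
        feature_names = [c for c in reader.fieldnames if c not in META_COLUMNS]
        X, y, file_names = [], [], []
        for row in reader:
            X.append([_to_float(row[c]) for c in feature_names])
            y.append(int(float(row["Label"])))  # 0 = healthy, 1 = pathological
            file_names.append(row["FileName"])
    return feature_names, X, y, file_names
```

For example, `load_features("data/HUPA_voice_features_PRN_CPP_50kHz.csv")` would return the feature names, one feature row per audio file, the labels, and the file names.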
For each CSV, HUPA_PRN_GridSearch_ROC.m:
- Loads HUPA_voice_features_PRN_CPP_50kHz.csv or HUPA_voice_features_PRN_CPP_25kHz.csv.
- Defines feature groups:
  - Noise
  - Perturbation (including CPP and jitter/shimmer)
  - Tremor
  - Complexity / nonlinear measures
- Cleans the data:
  - Removes all-NaN / constant columns
  - Imputes remaining NaNs (median)
- Splits the data:
  - 80% Train (for hyperparameter optimisation via 5-fold CV)
  - 20% independent Test set
- Trains and tunes:
  - Logistic Regression (fitclinear)
  - SVM (RBF) (fitcsvm + fitPosterior)
  - Random Forest (TreeBagger)
  - MLP (fitcnet, if available)
- Evaluates models on the Test set and computes AUC.
- Plots ROC curves for the four feature groups (Noise, Perturbation, Tremor, Complexity).
The script saves one figure per sampling rate in data/figures/, using the convention:
- For 50 kHz:
  - ROC_HUPA_50kHz_MATLAB.png
  - ROC_HUPA_50kHz_MATLAB.pdf
- For 25 kHz:
  - ROC_HUPA_25kHz_MATLAB.png
  - ROC_HUPA_25kHz_MATLAB.pdf
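The data-cleaning step of the MATLAB pipeline (removing all-NaN or constant columns, then median imputation) can be illustrated in Python. This is a sketch using NumPy under the assumption that the features are already in a numeric matrix; it is not a translation of the MATLAB code, and the function name `clean_features` is hypothetical.

```python
import numpy as np


def clean_features(X):
    """Drop all-NaN and constant columns, then median-impute remaining NaNs.

    Returns the cleaned matrix and the indices of the columns that were kept.
    """
    X = np.asarray(X, dtype=float)
    all_nan = np.isnan(X).all(axis=0)
    # Turn all-NaN columns into constant zeros so one test covers both cases.
    filled = np.where(all_nan, 0.0, X)
    spread = np.nanmax(filled, axis=0) - np.nanmin(filled, axis=0)
    keep = spread > 0  # False for constant and for all-NaN columns
    X = X[:, keep]
    # Replace each remaining NaN with the median of its own column.
    medians = np.nanmedian(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = medians[cols]
    return X, np.flatnonzero(keep)
```

Returning the kept column indices makes it possible to map the cleaned matrix back to the original feature names.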
A Python implementation using scikit-learn (HUPA_Python_GridSearch.py) reproduces the MATLAB analysis for both sampling frequencies.
The script expects the two CSVs generated by MATLAB:
- data/HUPA_voice_features_PRN_CPP_50kHz.csv
- data/HUPA_voice_features_PRN_CPP_25kHz.csv
For each CSV, it runs the full pipeline independently.
For each sampling frequency (50 kHz, 25 kHz):
- Loads the corresponding CSV.
- Defines the same feature groups:
  - Noise, Perturbation, Tremor, Complexity.
- Uses a common train–test split:
  - 80% Train, 20% Test, stratified by label.
- For each group, runs a GridSearchCV with 5-fold CV and AUC as the scoring metric, over:
  - Logistic Regression
  - SVM (RBF)
  - Random Forest
  - k-NN
  - MLP

  Each model is wrapped in a Pipeline with:
  - SimpleImputer(strategy="median")
  - StandardScaler (except Random Forest, which only uses imputation)
- Evaluates the best model (per algorithm) on the hold-out Test set.
- Plots ROC curves (2×2 subplots for Noise/Perturbation/Tremor/Complexity) and saves them to data/figures/:
  - 50 kHz:
    - ROC_HUPA_50kHz_Python.png
    - ROC_HUPA_50kHz_Python.pdf
  - 25 kHz:
    - ROC_HUPA_25kHz_Python.png
    - ROC_HUPA_25kHz_Python.pdf
- Saves a summary CSV with all models and groups:
  - data/HUPA_Python_Results_Summary_50kHz.csv
  - data/HUPA_Python_Results_Summary_25kHz.csv
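The tuning and evaluation steps above can be sketched with scikit-learn as follows. This is an illustration only, not the code of HUPA_Python_GridSearch.py: it covers two of the five model families, the hyperparameter grids are placeholder assumptions, and the function names `build_searches` and `evaluate_group` are hypothetical.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def build_searches(random_state=0):
    """Return {model name: GridSearchCV} for two of the five model families."""
    logreg = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    forest = Pipeline([
        ("impute", SimpleImputer(strategy="median")),  # no scaling for Random Forest
        ("clf", RandomForestClassifier(random_state=random_state)),
    ])
    # Placeholder grids; the grids used by the repository script may differ.
    grids = {
        "LogisticRegression": (logreg, {"clf__C": [0.1, 1.0, 10.0]}),
        "RandomForest": (forest, {"clf__n_estimators": [25, 50]}),
    }
    return {
        name: GridSearchCV(pipe, grid, cv=5, scoring="roc_auc")
        for name, (pipe, grid) in grids.items()
    }


def evaluate_group(X, y, random_state=0):
    """Tune each model on the 80% train split and score it on the 20% test split."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=random_state
    )
    results = {}
    for name, search in build_searches(random_state).items():
        search.fit(X_tr, y_tr)  # 5-fold CV on the training split only
        test_scores = search.predict_proba(X_te)[:, 1]
        results[name] = {
            "CV_AUC_Mean": search.best_score_,
            "Test_AUC": roc_auc_score(y_te, test_scores),
            "Best_Params": search.best_params_,
        }
    return results
```

Keeping the imputer and scaler inside each Pipeline means they are fitted only on the training folds during cross-validation, so no statistics from the validation folds or the test set enter the preprocessing.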
Each summary file contains, for every combination of feature group and model:
- Group
- Model
- Test_AUC
- CV_AUC_Mean
- Best_Params
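One way to use a summary file is to pick, for each feature group, the model with the highest test AUC. The sketch below does this with the standard library, assuming the column names listed above (Group, Model, Test_AUC); the function name `best_model_per_group` is hypothetical.

```python
import csv


def best_model_per_group(summary_csv):
    """Map each feature group to the (model, Test_AUC) pair with the highest Test_AUC."""
    best = {}
    with open(summary_csv, newline="") as fh:
        for row in csv.DictReader(fh):
            group, auc = row["Group"], float(row["Test_AUC"])
            if group not in best or auc > best[group][1]:
                best[group] = (row["Model"], auc)
    return best
```

For example, `best_model_per_group("data/HUPA_Python_Results_Summary_50kHz.csv")` returns a dictionary keyed by feature group.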
- MATLAB (R2020b or newer recommended)
- Statistics and Machine Learning Toolbox
- Deep Learning Toolbox (optional, for fitcnet)
Place these libraries inside toolboxes/:
- AVCA-ByO: Essential for P, R, N features.
- Covarep: Used for CPP feature extraction.
- Hurst Estimators: Implementation to compute the Hurst exponent.
- RPDE: Code to compute Recurrence Period Density Entropy (Little et al., 2007).
- FastDFA: Implementation to compute Detrended Fluctuation Analysis (Little et al., 2006).
- HCTSA: Highly Comparative Time-Series Analysis (used for D2 and LLE).
- ME (Markovian Entropies): Functions for the computation of entropies from Markov Models.
Compatibility Note for Newer MATLAB Versions
Many of these toolboxes were developed years ago. If you are using a recent version of MATLAB (e.g., R2020b+), please be aware of the following:
- Legacy Code: You may need to manually update small parts of the external toolboxes to fix deprecated functions.
- Path Conflicts: The script HUPA_Features_Extraction.m already handles a known conflict with Covarep (it removes backcompatibility_2015 to avoid breaking the built-in audioread).
- Debugging: If you encounter "function not found" or "input argument" errors inside these toolboxes, check that their internal paths are correctly added and that they support your MATLAB version.
Install dependencies via:
pip install -r requirements.txt

[Add here the reference to the HUPA database and the related publication, once finalised.]