BYO-UPM/HUPA_database

HUPA Voice Disorders Dataset

This repository contains MATLAB and Python scripts for feature extraction and classification of sustained vowel voice signals (pathological vs. healthy) using the HUPA database. The focus is on Perturbation, Regularity, Noise (PRN), and Complexity features.


HUPA Database

To run the scripts, you need the HUPA database, which is hosted on Zenodo:

https://zenodo.org/uploads/17704572

After downloading and organising the data, the expected structure inside this repository is:

HUPA-Voice-Analysis/
├── toolboxes/
│   ├── AVCA-ByO-master/
│   ├── covarep-master/
│   ├── fastdfa/
│   ├── hctsa-main/
│   ├── hurst estimators/
│   ├── ME-master/
│   └── rpde/
├── data/
│   ├── HUPA_db/
│   │   ├── healthy/
│   │   │   ├── 50 kHz/      ← mono .wav files at (or resampled to) 50 kHz
│   │   │   └── 25 kHz/      ← mono .wav files resampled to 25 kHz
│   │   ├── pathological/
│   │   │   ├── 50 kHz/      ← mono .wav files at (or resampled to) 50 kHz
│   │   │   └── 25 kHz/      ← mono .wav files resampled to 25 kHz
│   │   ├── HUPA_db.xlsx
│   │   └── README.md
│   ├── figures/
│   │   ├── ROC_HUPA_50kHz_MATLAB.pdf
│   │   ├── ROC_HUPA_50kHz_MATLAB.png
│   │   ├── ROC_HUPA_50kHz_Python.pdf
│   │   ├── ROC_HUPA_50kHz_Python.png
│   │   ├── ROC_HUPA_25kHz_MATLAB.pdf
│   │   ├── ROC_HUPA_25kHz_MATLAB.png
│   │   ├── ROC_HUPA_25kHz_Python.pdf
│   │   └── ROC_HUPA_25kHz_Python.png
│   ├── HUPA_Python_Results_Summary_25kHz.csv
│   ├── HUPA_Python_Results_Summary_50kHz.csv
│   ├── HUPA_voice_features_PRN_CPP_25kHz.csv
│   └── HUPA_voice_features_PRN_CPP_50kHz.csv
├── HUPA_Features_Extraction.m
├── HUPA_PRN_GridSearch_ROC.m
├── HUPA_Python_GridSearch.py
├── requirements.txt
└── README.md
  • The healthy/ folder contains recordings from healthy speakers.
  • The pathological/ folder contains recordings from patients with different laryngeal pathologies.
  • Each condition is available at 50 kHz and 25 kHz (all files are mono).
  • data/HUPA_db/ also contains HUPA_db.xlsx, a spreadsheet describing all speakers and recordings (age, sex, GRBAS scores, pathology codes, etc.), and a local README.md that documents the database structure and the metadata fields of that spreadsheet.
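
As a quick check that the data landed in the expected folders, the layout above can be walked with a few lines of Python. This is a minimal sketch, not part of the repository: the helper name `list_recordings` is hypothetical, and the only things it assumes are the folder names shown in the tree and the 0/1 label convention used later in the feature CSVs.

```python
from pathlib import Path

# Label convention taken from the feature CSVs: 0 = healthy, 1 = pathological.
LABELS = {"healthy": 0, "pathological": 1}


def list_recordings(db_root, rate_dir="50 kHz"):
    """Return (wav_path, label) pairs for one sampling-rate folder.

    Hypothetical helper: db_root is the path to data/HUPA_db/ and rate_dir is
    either "50 kHz" or "25 kHz", matching the folder names in the tree.
    """
    db_root = Path(db_root)
    recordings = []
    for condition, label in LABELS.items():
        folder = db_root / condition / rate_dir
        if not folder.is_dir():
            # Skip missing folders so a partial download can still be listed.
            continue
        for wav in sorted(folder.glob("*.wav")):
            recordings.append((wav, label))
    return recordings
```

Calling `list_recordings("data/HUPA_db", "25 kHz")` from the repository root should then return one entry per .wav file in the two 25 kHz folders.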

MATLAB Workflow

1. Feature Extraction (HUPA_Features_Extraction.m)

This script:

  1. Loads .wav files from:

    • data/HUPA_db/healthy/50 kHz/
    • data/HUPA_db/pathological/50 kHz/
    • data/HUPA_db/healthy/25 kHz/
    • data/HUPA_db/pathological/25 kHz/
  2. Extracts:

    • AVCA PRN features (Perturbation, Regularity, Noise)
    • Nonlinear/complexity features (depending on AVCA configuration)
    • CPP (Cepstral Peak Prominence) using Covarep
  3. Saves two CSV files, one per sampling frequency, in the data/ folder:

    • HUPA_voice_features_PRN_CPP_50kHz.csv
    • HUPA_voice_features_PRN_CPP_25kHz.csv

Each CSV includes:

  • One row per audio file

  • Columns:

    • All AVCA PRN (and complexity) features
    • CPP
    • FileName
    • Label (0 = healthy, 1 = pathological)
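
To illustrate the layout described above, the sketch below reads such a table with Python's standard `csv` module and separates the feature columns from `FileName` and `Label`. The function name and the two feature columns in the demo table are made up for illustration; only the `FileName` and `Label` column names come from the description above.

```python
import csv
import io


def load_feature_table(handle):
    """Split a feature CSV into feature names, feature rows, file names and labels."""
    reader = csv.DictReader(handle)
    feature_names = [c for c in reader.fieldnames if c not in ("FileName", "Label")]
    X, files, y = [], [], []
    for row in reader:
        # float("NaN") parses to nan; empty cells are mapped to nan explicitly.
        X.append([float(row[c]) if row[c].strip() else float("nan") for c in feature_names])
        files.append(row["FileName"])
        y.append(int(float(row["Label"])))
    return feature_names, X, files, y


# Tiny made-up table with hypothetical feature columns, for illustration only.
demo = io.StringIO(
    "jitter,CPP,FileName,Label\n"
    "0.5,12.1,a.wav,0\n"
    "NaN,9.4,b.wav,1\n"
)
names, X, files, y = load_feature_table(demo)
```

With a real file, pass an open handle instead, for example `open("data/HUPA_voice_features_PRN_CPP_50kHz.csv", newline="")`.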

2. Classification & ROC Analysis (HUPA_PRN_GridSearch_ROC.m)

For each CSV:

  1. Loads HUPA_voice_features_PRN_CPP_50kHz.csv or HUPA_voice_features_PRN_CPP_25kHz.csv.

  2. Defines feature groups:

    • Noise
    • Perturbation (including CPP and jitter/shimmer)
    • Tremor
    • Complexity / nonlinear measures
  3. Cleans the data:

    • Removes all-NaN / constant columns
    • Imputes remaining NaNs (median)
  4. Splits the data:

    • 80% Train (for hyperparameter optimisation via 5-fold CV)
    • 20% independent Test set
  5. Trains and tunes:

    • Logistic Regression (fitclinear)
    • SVM (RBF) (fitcsvm + fitPosterior)
    • Random Forest (TreeBagger)
    • MLP (fitcnet, if available)
  6. Evaluates models on the Test set and computes AUC.

  7. Plots ROC curves for the four feature groups (Noise, Perturbation, Tremor, Complexity).
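
The cleaning in step 3 can be sketched in plain Python as follows. This is an illustrative re-implementation under stated assumptions, not the MATLAB code: the function name `clean_columns` is hypothetical, and the medians are computed over whatever rows are passed in.

```python
import math
from statistics import median


def clean_columns(X):
    """Drop all-NaN and constant columns, then median-impute the remaining NaNs.

    X is a list of rows (lists of floats).
    Returns (cleaned_rows, kept_column_indices).
    """
    n_cols = len(X[0])
    kept, medians = [], {}
    for j in range(n_cols):
        observed = [row[j] for row in X if not math.isnan(row[j])]
        if not observed or len(set(observed)) == 1:
            continue  # all-NaN or constant column: nothing to learn from it
        kept.append(j)
        medians[j] = median(observed)
    cleaned = [
        [medians[j] if math.isnan(row[j]) else row[j] for j in kept]
        for row in X
    ]
    return cleaned, kept
```

If the medians should not see test data, call the function on the training rows only and reuse the resulting medians for the test rows.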

The script saves one figure per sampling rate in data/figures/, using the convention:

  • For 50 kHz:

    • ROC_HUPA_50kHz_MATLAB.png
    • ROC_HUPA_50kHz_MATLAB.pdf
  • For 25 kHz:

    • ROC_HUPA_25kHz_MATLAB.png
    • ROC_HUPA_25kHz_MATLAB.pdf
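
For readers less familiar with the AUC reported in step 6, it can be read as the probability that a randomly chosen pathological recording receives a higher score than a randomly chosen healthy one. The sketch below computes it directly from that definition. It is a pure-Python illustration, not the routine used by the MATLAB script, and the function name `roc_auc` is hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as P(score of a positive > score of a negative), ties counted as 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

The pairwise loop is quadratic in the number of samples, which is acceptable for a single test set but slower than rank-based implementations.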

Python Workflow (HUPA_Python_GridSearch.py)

A Python implementation using scikit-learn reproduces the MATLAB analysis for both sampling frequencies.

Inputs

The script expects the two CSVs generated by MATLAB:

  • data/HUPA_voice_features_PRN_CPP_50kHz.csv
  • data/HUPA_voice_features_PRN_CPP_25kHz.csv

For each CSV, it runs the full pipeline independently.

Steps

For each sampling frequency (50 kHz, 25 kHz):

  1. Loads the corresponding CSV.

  2. Defines the same feature groups:

    • Noise, Perturbation, Tremor, Complexity.
  3. Uses a common train–test split:

    • 80% Train, 20% Test, stratified by label.
  4. For each group, runs GridSearchCV with 5-fold CV and AUC as the scoring metric, over:

    • Logistic Regression
    • SVM (RBF)
    • Random Forest
    • k-NN
    • MLP

    Each model is wrapped in a Pipeline with:

    • SimpleImputer(strategy="median")
    • StandardScaler (except Random Forest, which only uses imputation)
  5. Evaluates the best model (per algorithm) on the hold-out Test set.

  6. Plots ROC curves (2×2 subplots for Noise/Perturbation/Tremor/Complexity) and saves them to data/figures/:

    • 50 kHz:

      • ROC_HUPA_50kHz_Python.png
      • ROC_HUPA_50kHz_Python.pdf
    • 25 kHz:

      • ROC_HUPA_25kHz_Python.png
      • ROC_HUPA_25kHz_Python.pdf
  7. Saves a summary CSV with all models and groups:

    • data/HUPA_Python_Results_Summary_50kHz.csv
    • data/HUPA_Python_Results_Summary_25kHz.csv
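
The steps above can be sketched with scikit-learn as follows. This is a minimal illustration rather than the repository's script: it uses synthetic data in place of the feature CSVs, covers only two of the five algorithms, and the hyperparameter grids and random seeds are assumptions chosen to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for one feature group (assumption: the real script
# reads the MATLAB-generated CSVs instead).
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# 80/20 split, stratified by label.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Illustrative grids only; they are not the grids used in the repository.
candidates = {
    "LogReg": (
        Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("scale", StandardScaler()),
            ("clf", LogisticRegression(max_iter=1000)),
        ]),
        {"clf__C": [0.1, 1.0, 10.0]},
    ),
    # Random Forest only uses imputation, no scaling.
    "RandomForest": (
        Pipeline([
            ("impute", SimpleImputer(strategy="median")),
            ("clf", RandomForestClassifier(random_state=0)),
        ]),
        {"clf__n_estimators": [50, 100]},
    ),
}

results = []
for name, (pipe, grid) in candidates.items():
    # 5-fold CV on the training portion, AUC as the selection metric.
    search = GridSearchCV(pipe, grid, cv=5, scoring="roc_auc")
    search.fit(X_train, y_train)
    # The refitted best pipeline is scored once on the held-out test set.
    test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
    results.append({
        "Model": name,
        "Test_AUC": test_auc,
        "CV_AUC_Mean": search.best_score_,
        "Best_Params": search.best_params_,
    })
```

Keeping the imputer and scaler inside the Pipeline means they are refitted on each CV training fold, so the validation folds do not influence the preprocessing.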

Each summary file contains, for every combination of feature group and model:

  • Group
  • Model
  • Test_AUC
  • CV_AUC_Mean
  • Best_Params
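
A summary file with these columns can be inspected without extra dependencies. The sketch below picks the model with the highest Test_AUC in each feature group. The helper name is hypothetical and the demo rows are made-up numbers in the documented column layout, not real results.

```python
import csv
import io


def best_model_per_group(handle):
    """Return {group: (model, test_auc)} with the highest Test_AUC per feature group."""
    best = {}
    for row in csv.DictReader(handle):
        auc = float(row["Test_AUC"])
        if row["Group"] not in best or auc > best[row["Group"]][1]:
            best[row["Group"]] = (row["Model"], auc)
    return best


# Made-up rows for illustration only; the values are not real results.
demo = io.StringIO(
    "Group,Model,Test_AUC,CV_AUC_Mean,Best_Params\n"
    "Noise,SVM,0.70,0.68,{}\n"
    "Noise,MLP,0.75,0.71,{}\n"
    "Tremor,k-NN,0.60,0.62,{}\n"
)
best = best_model_per_group(demo)
```

Ranking by Test_AUC is convenient for reading the table, but choosing a final model this way uses the test set for selection. Ranking by CV_AUC_Mean avoids that.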

Requirements

MATLAB

  • MATLAB (R2020b or newer recommended)
  • Statistics and Machine Learning Toolbox
  • Deep Learning Toolbox (optional, for fitcnet)

External Toolboxes

Place these libraries inside toolboxes/:

  • AVCA-ByO: Essential for P, R, N features.
  • Covarep: Used for CPP feature extraction.
  • Hurst Estimators: Implementation to compute the Hurst exponent.
  • RPDE: Code to compute Recurrence Period Density Entropy (Little et al., 2007).
  • FastDFA: Implementation to compute Detrended Fluctuation Analysis (Little et al., 2006).
  • HCTSA: Highly Comparative Time-Series Analysis (used for D2 and LLE).
  • ME (Markovian Entropies): Functions for the computation of entropies from Markov Models.

Compatibility Note for Newer MATLAB Versions

Many of these toolboxes were developed years ago. If you are using a recent version of MATLAB (e.g., R2020b+), please be aware of the following:

  • Legacy Code: You may need to manually update small parts of the external toolboxes to fix deprecated functions.
  • Path Conflicts: The script HUPA_Features_Extraction.m already handles a known conflict with Covarep (it removes backcompatibility_2015 to avoid breaking the built-in audioread).
  • Debugging: If you encounter "function not found" or "input argument" errors inside these toolboxes, check that their internal paths are correctly added and that they support your MATLAB version.

Python

Install dependencies via:

pip install -r requirements.txt

Citation

[Add here the reference to the HUPA database and the related publication, once finalised.]

About

HUPA: a Castilian Spanish Corpus of Voice Disorders
