This repository contains the code accompanying the research project “Bridging Speech Therapy and Deep Learning: Automatic Sigmatism Detection Using Mel Spectrograms.”
The project investigates how deep learning models can detect sigmatism (the misarticulation of sibilants and related fricatives such as [s], [z], and [x]) from speech recordings. Multiple acoustic representations and model architectures are evaluated, with a focus on interpretable models for speech therapy support.
The best-performing model combines Mel spectrograms with an attention mechanism, achieving high detection performance and interpretable predictions via Grad-CAM.
Sigmatism is a common articulation disorder that affects speech intelligibility and often requires long-term therapy. Traditional speech therapy relies on expert feedback during supervised sessions, making effective self-practice difficult.
This project explores whether deep learning models trained on acoustic speech features can automatically detect sigmatism and provide objective feedback for pronunciation training and therapy support.
- Detect sigmatism from recorded speech
- Compare different acoustic feature representations
- Evaluate attention mechanisms for phoneme-focused learning
- Provide interpretable predictions using visualization techniques
**`Data/`**
Contains datasets and related resources used for training and evaluation.

**`DeeplearningPaper/`**
Includes implementations and notes from relevant deep learning research papers referenced during the project.

**`graphics/`**
Contains visual assets such as plots, graphs, and images generated or used in the project.

**`old_code/`**
Archive of previous versions and deprecated scripts from earlier stages of development.
**`audiodataloader.py`**
Loads and preprocesses raw audio recordings. The script extracts individual words from full recordings and creates structured word lists with metadata, which are then used during training.

**`Dataloader_fixedlist.py`**
Loads datasets from a predefined list of samples. Used in `train_CNN.py`.

**`Dataloader_gradcam.py`**
Handles data loading tailored for Grad-CAM analysis. Used in `train_gmm.py`.

**`create_fixed_list.py`**
Generates the fixed list of data samples used during training.

**`resample_data.py`**
Resamples the audio files in a folder to match the input requirements of the Speech-to-Text model.
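For reference, resampling to a fixed rate can be approximated in pure NumPy with linear interpolation. This is only an illustrative sketch: the actual script may use a band-limited resampler, and the 16 kHz target and the name `resample_linear` are assumptions, not the STT model's documented requirement.

```python
import numpy as np

def resample_linear(audio: np.ndarray, orig_sr: int, target_sr: int) -> np.ndarray:
    """Resample a mono signal by linear interpolation.

    A band-limited (polyphase/sinc) resampler is preferable in practice;
    linear interpolation is only a simple approximation.
    """
    n_out = int(round(len(audio) * target_sr / orig_sr))
    # Time stamps of the output samples, expressed in input-sample units
    t_out = np.arange(n_out) * (orig_sr / target_sr)
    return np.interp(t_out, np.arange(len(audio)), audio)

# Example: downsample a 1-second 44.1 kHz sine tone to 16 kHz
sr_in, sr_out = 44100, 16000
t = np.arange(sr_in) / sr_in
tone = np.sin(2 * np.pi * 440.0 * t)
resampled = resample_linear(tone, sr_in, sr_out)
print(resampled.shape)  # (16000,)
```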
**`SpeechToText.py`**
Generates and visualizes Speech-to-Text (STT) probability heatmaps. Also implements the bimodal AUC evaluation approach.

**`cpp.py`**
Implements additional evaluation metrics used to distinguish the two classes, including CPP and FID.

**`model.py`**
Defines the architectures of the deep learning models used in the project.

**`train_CNN.py`**
Trains the convolutional neural networks on the prepared datasets.

**`train_gmm.py`**
Training script used by `paperimplementation.py`.

**`paperimplementation.py`**
Contains the implementation of the Valentini et al. baseline method used for comparison with the deep learning models.

**`hyperparametertuning.py`**
Performs hyperparameter optimization using Optuna.

**`data_augmentation.py`**
Contains methods for augmenting audio data, such as adding noise or altering pitch.
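As an illustration of the noise-based augmentation, the sketch below adds white Gaussian noise at a target signal-to-noise ratio and applies a random circular time shift. The function names and parameter values are assumptions for illustration, not the script's actual API.

```python
import numpy as np

def add_noise_snr(audio: np.ndarray, snr_db: float, rng=None) -> np.ndarray:
    """Add white Gaussian noise so the result has roughly the given SNR (dB)."""
    rng = np.random.default_rng() if rng is None else rng
    signal_power = np.mean(audio ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=audio.shape)
    return audio + noise

def time_shift(audio: np.ndarray, max_shift: int, rng=None) -> np.ndarray:
    """Shift the waveform circularly in time by up to max_shift samples."""
    rng = np.random.default_rng() if rng is None else rng
    shift = int(rng.integers(-max_shift, max_shift + 1))
    return np.roll(audio, shift)

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 220.0 * np.arange(16000) / 16000)  # 1 s test tone
noisy = add_noise_snr(clean, snr_db=20.0, rng=rng)
shifted = time_shift(clean, max_shift=800, rng=rng)
```

Augmentations like these are typically applied on the fly during training so each epoch sees slightly different inputs.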
**`gradcam.py`**
Implements the Grad-CAM algorithm to visualize which regions of the input spectrogram influence model predictions.
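The heart of Grad-CAM is a simple weighting of convolutional feature maps by their spatially pooled gradients. A framework-agnostic NumPy sketch of that core step is below; the real script additionally needs a trained model and backpropagated gradients, which a deep learning framework provides.

```python
import numpy as np

def grad_cam_map(activations: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Core Grad-CAM weighting.

    activations: feature maps of one conv layer, shape (K, H, W)
    gradients:   d(score)/d(activations), same shape
    Returns an (H, W) heatmap scaled to [0, 1].
    """
    # Channel weights alpha_k: global-average-pooled gradients
    alphas = gradients.mean(axis=(1, 2))                              # (K,)
    # Weighted sum of the feature maps, then ReLU
    cam = np.maximum((alphas[:, None, None] * activations).sum(axis=0), 0.0)
    # Normalize for visualization (guard against an all-zero map)
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam

# Toy example with random activations/gradients standing in for a real network
rng = np.random.default_rng(1)
acts = rng.standard_normal((8, 16, 16))
grads = rng.standard_normal((8, 16, 16))
heatmap = grad_cam_map(acts, grads)
```

The resulting heatmap is upsampled to the input spectrogram's size and overlaid on it to show which time-frequency regions drove the prediction.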
**`plotting.py`**
Provides functions for generating plots and visualizations used for data analysis and result interpretation.

**`config.json`**
Configuration file storing parameters and settings used across the different scripts in the project.

**`jobscript.sh`**
Shell script for submitting jobs to a computing cluster or managing batch processing tasks.
Clone the repository:

```bash
git clone https://github.com/ankilab/sigmatism.git
cd sigmatism
```

Create a Python environment:

```bash
conda create -n sigmatism python=3.10
conda activate sigmatism
```

Install dependencies:

```bash
pip install -r requirements.txt
```

The workflow consists of the following steps:
- Audio preprocessing
- Feature extraction
- Model training
- Evaluation
- Model interpretation
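The feature-extraction step rests on Mel spectrograms. A self-contained NumPy sketch of that computation is given below; the sample rate, frame size, hop, and filter count are illustrative defaults, not the project's exact settings.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr, n_fft, n_mels):
    """Triangular mel filters mapping an FFT power spectrum to n_mels bands."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for k in range(l, c):                  # rising slope
            fb[i, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):                  # falling slope
            fb[i, k] = (r - k) / max(r - c, 1)
    return fb

def mel_spectrogram(audio, sr=16000, n_fft=512, hop=160, n_mels=64):
    """Log-mel spectrogram: frame, window, FFT, mel projection, log."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(audio) - n_fft) // hop
    frames = np.stack([audio[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=n_fft)) ** 2    # (frames, n_fft//2+1)
    mel = power @ mel_filterbank(sr, n_fft, n_mels).T    # (frames, n_mels)
    return np.log(mel + 1e-10)

audio = np.sin(2 * np.pi * 5000.0 * np.arange(16000) / 16000)  # 1 s test tone
spec = mel_spectrogram(audio)
print(spec.shape)  # (97, 64)
```

In practice a library routine (e.g. an off-the-shelf mel spectrogram implementation) would be used; the sketch only makes the steps explicit.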
The best-performing model (Mel spectrogram + Gaussian attention) achieved:
- Recognition rate: ~91%
- AUC: ~0.966
Grad-CAM visualizations confirm that the model focuses on high-frequency regions associated with sibilant articulation, supporting the interpretability of the learned representations.
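The exact form of the Gaussian attention is not spelled out here; one plausible reading, sketched below, is a Gaussian window over the frequency axis that re-weights the Mel bands before classification. In the sketch `mu` and `sigma` are fixed for illustration, whereas in the model they would be learned.

```python
import numpy as np

def gaussian_attention(spec: np.ndarray, mu: float, sigma: float) -> np.ndarray:
    """Weight each frequency band of a (frames, bands) spectrogram with a
    normalized Gaussian window centred on band index mu.

    Illustrative re-reading of 'Gaussian attention'; mu and sigma are
    assumptions here, not the project's learned parameters.
    """
    bands = np.arange(spec.shape[1])
    w = np.exp(-0.5 * ((bands - mu) / sigma) ** 2)
    w = w / w.sum()        # normalized attention profile over frequency
    return spec * w        # broadcast across all time frames

spec = np.ones((10, 64))                                  # dummy spectrogram
attended = gaussian_attention(spec, mu=48.0, sigma=6.0)   # emphasize high bands
```

Centering the window on the upper Mel bands is consistent with the Grad-CAM finding that sibilant cues concentrate in high frequencies.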
pending
This project is released under the MIT License.