This repository contains the official implementation of Saliency-GFT, a novel saliency-driven Gradient-based Feature Tailoring (GFT) architecture for classifying laryngeal and neurological vocal conditions from mel-spectrogram images. Our proposed method achieves state-of-the-art results on our dataset, with the best model reaching 85.14% test accuracy.
The classification of vocal pathologies from audio signals is a critical task in medical diagnostics. This project introduces Saliency-GFT, a novel two-pass, saliency-driven training mechanism for Vision Transformers. Unlike previous GFT methods that rely on the internal spatial structure of attention maps, our approach uses the true loss gradient, back-propagated to the patch tokens, to decide which patches to prune and re-weight.
- Saliency-Driven Patch Selection (GALA+PPS): A novel two-pass training mechanism that uses back-propagated loss gradients to intelligently prune and re-weight transformer patch tokens.
- Hybrid Backbone Architecture: An experimental model that successfully infuses CoAtNet blocks into a DINOv2 backbone to demonstrate architectural synergy.
- Comprehensive Benchmarking: Rigorous comparison against a reference GFT implementation and other strong transformer baselines.
- Reproducible Results: All code for training and evaluation is provided to ensure full reproducibility of our findings.
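The saliency-driven patch selection described above can be sketched in PyTorch. The snippet below is a minimal illustration under stated assumptions (toy token shapes, a simple pooled linear head, and the hypothetical helper name `saliency_topk_indices`), not the repository's actual GALA+PPS implementation:

```python
import torch
import torch.nn as nn

def saliency_topk_indices(patch_tokens, head, labels, keep: int):
    """Pass 1: score each patch token by the L2 norm of the loss
    gradient w.r.t. that token, then return the top-`keep` indices.
    (Hypothetical helper name; the real GALA+PPS code may differ.)"""
    tokens = patch_tokens.detach().requires_grad_(True)
    logits = head(tokens.mean(dim=1))          # pooled linear classifier
    loss = nn.functional.cross_entropy(logits, labels)
    (grad,) = torch.autograd.grad(loss, tokens)
    saliency = grad.norm(dim=-1)               # (batch, num_patches)
    return saliency.topk(keep, dim=1).indices

# Toy demonstration: 2 images, 16 patch tokens of dim 8, 2 classes.
torch.manual_seed(0)
tokens = torch.randn(2, 16, 8)
head = nn.Linear(8, 2)
labels = torch.tensor([0, 1])

idx = saliency_topk_indices(tokens, head, labels, keep=4)

# Pass 2: the training forward pass then uses only the selected tokens.
pruned = torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, 8))
print(pruned.shape)  # torch.Size([2, 4, 8])
```

The key design point is that saliency comes from the back-propagated loss gradient itself, not from attention-map structure.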
Our experiments show a clear performance hierarchy, with our proposed models significantly outperforming the baselines.
| Model Architecture | Backbone | Test Accuracy | Weighted F1-Score |
|---|---|---|---|
| Saliency-GFT (Our Hybrid) | CoAtNet + DINOv2 | 85.14% | 0.8511 |
| Saliency-GFT (Our GALA+PPS) | DINOv2-small | 83.78% | 0.8382 |
| Standalone CoAtNet (Baseline) | CoAtNet | 79.73% | 0.7923 |
| Reference GFT (Baseline) | ViT-B/16 | 78.38% | 0.7821 |
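The accuracy and weighted F1-score columns above follow the standard scikit-learn definitions; a minimal sketch with illustrative labels (not the paper's actual predictions):

```python
from sklearn.metrics import accuracy_score, f1_score

# Illustrative ground-truth labels and model predictions.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

acc = accuracy_score(y_true, y_pred)
# "weighted" averages per-class F1 weighted by class support,
# matching the Weighted F1-Score column in the table.
f1 = f1_score(y_true, y_pred, average="weighted")
print(f"accuracy={acc:.4f}  weighted_f1={f1:.4f}")
```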
To set up the environment, please follow these steps:

1. Clone the repository:

   ```bash
   git clone https://github.com/Sree14hari/DinoNet-Gradient-Focal_transformer.git
   cd DinoNet-Gradient-Focal_transformer
   ```

2. Create a Python virtual environment:

   ```bash
   python -m venv venv
   source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
   ```

3. Install the required packages:

   ```bash
   pip install -r requirements.txt
   ```

   A `requirements.txt` file should contain:

   ```
   torch
   torchvision
   timm
   scikit-learn
   numpy
   matplotlib
   seaborn
   tqdm
   transformers
   ```

4. Dataset: Organize your `melspectrograms_dataset` directory with the following structure:

   ```
   melspectrograms_dataset/
   ├── train/
   │   ├── Dysarthia/
   │   │   └── ...
   │   └── Laryngitis/
   │       └── ...
   ├── validation/
   │   └── ...
   └── test/
       └── ...
   ```
This project is licensed under the MIT License. See the LICENSE file for details.