Saliency-GFT: A Novel Approach to Vocal Condition Classification

This repository contains the official implementation of Saliency-GFT, a novel, saliency-driven Gradient-based Feature Tailoring (GFT) architecture for classifying laryngeal and neurological vocal conditions from mel-spectrogram images. Our proposed method achieves state-of-the-art results on our dataset, with the best model reaching 85.14% test accuracy.


📖 Abstract

The classification of vocal pathologies from audio signals is a critical task in medical diagnostics. This project introduces a novel two-pass, saliency-driven training mechanism for Vision Transformers called Saliency-GFT. Unlike previous GFT methods that rely on the internal spatial structure of attention maps, our approach uses the true loss gradient ($\partial L / \partial \text{attn}$) to identify and prune the least salient patch tokens. This forces the model to focus on the most discriminative regions of the mel-spectrogram. Through comprehensive benchmarking, we demonstrate that our method consistently outperforms standard baselines. Furthermore, we present a novel hybrid backbone fusing CoAtNet and DINOv2, which, when combined with our Saliency-GFT method, achieves the highest performance, highlighting the benefits of both our training methodology and architectural design.
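
The gradient signal described above can be illustrated with a toy example. The sketch below builds a single-head attention map over patch tokens, retains the gradient of a loss with respect to that map via `retain_grad()`, and ranks tokens by its magnitude. The shapes, the 0.7 keep ratio, and the squared-output stand-in loss are illustrative assumptions, not the paper's exact setup.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, N, D = 2, 16, 32                      # batch, patch tokens, embed dim
x = torch.randn(B, N, D, requires_grad=True)

# Toy single-head self-attention over the patch tokens.
attn = F.softmax(x @ x.transpose(-2, -1) / D ** 0.5, dim=-1)  # (B, N, N)
attn.retain_grad()                       # keep grad on this non-leaf tensor
out = attn @ x                           # attention output
loss = out.pow(2).mean()                 # stand-in for the task loss
loss.backward()

# Saliency of each key token: mean |dL/d attn| over the query axis.
saliency = attn.grad.abs().mean(dim=1)   # (B, N)
keep_idx = saliency.topk(int(0.7 * N), dim=1).indices  # most salient tokens
```

Tokens whose indices do not appear in `keep_idx` would be pruned before the second pass.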


✨ Key Features

  • Saliency-Driven Patch Selection (GALA+PPS): A novel two-pass training mechanism that uses back-propagated loss gradients to intelligently prune and re-weight transformer patch tokens.
  • Hybrid Backbone Architecture: An experimental model that successfully infuses CoAtNet blocks into a DINOv2 backbone to demonstrate architectural synergy.
  • Comprehensive Benchmarking: Rigorous comparison against a reference GFT implementation and other strong transformer baselines.
  • Reproducible Results: All code for training and evaluation is provided to ensure full reproducibility of our findings.
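
The prune-and-re-weight step behind GALA+PPS can be sketched as a standalone function. The function name, the default keep ratio, and the softmax re-weighting of the surviving tokens are illustrative assumptions, not the repository's exact implementation.

```python
import torch

def gradient_prune_tokens(tokens, attn_grad, keep_ratio=0.7):
    """Prune the least salient patch tokens and re-weight the survivors.

    tokens:    (B, N, D) patch embeddings (CLS token excluded)
    attn_grad: (B, H, N, N) gradient of the loss w.r.t. the attention maps
    Saliency of key token j = mean |dL/d attn[..., j]| over heads and queries.
    """
    saliency = attn_grad.abs().mean(dim=(1, 2))           # (B, N)
    k = max(1, int(keep_ratio * tokens.shape[1]))
    top = saliency.topk(k, dim=1)
    idx = top.indices.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])
    kept = torch.gather(tokens, 1, idx)                   # (B, k, D)
    # Re-weight survivors by normalized saliency (illustrative choice).
    weights = torch.softmax(top.values, dim=1).unsqueeze(-1)
    return kept * (1.0 + weights)
```

In the two-pass scheme, `attn_grad` would come from the first (probe) backward pass, and the pruned tokens feed the second pass that actually updates the weights.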

📊 Results

Our experiments show a clear performance hierarchy, with our proposed models significantly outperforming the baselines.

| Model Architecture | Backbone | Test Accuracy | Weighted F1-Score |
| --- | --- | --- | --- |
| Saliency-GFT (Our Hybrid) | CoAtNet + DINOv2 | 85.14% | 0.8511 |
| Saliency-GFT (Our GALA+PPS) | DINOv2-small | 83.78% | 0.8382 |
| Standalone CoAtNet (Baseline) | CoAtNet | 79.73% | 0.7923 |
| Reference GFT (Baseline) | ViT-B/16 | 78.38% | 0.7821 |

⚙️ Setup & Installation

To set up the environment, please follow these steps:

  1. Clone the repository:

    git clone https://github.com/Sree14hari/DinoNet-Gradient-Focal_transformer.git
    cd DinoNet-Gradient-Focal_transformer
  2. Create a Python virtual environment:

    python -m venv venv
    source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
  3. Install the required packages:

    pip install -r requirements.txt

    A requirements.txt file should contain:

    torch
    torchvision
    timm
    scikit-learn
    numpy
    matplotlib
    seaborn
    tqdm
    transformers
    
  4. Dataset: Organize your melspectrograms_dataset directory with the following structure:

    melspectrograms_dataset/
    ├── train/
    │   ├── Dysarthia/
    │   │   └── ...
    │   └── Laryngitis/
    │       └── ...
    ├── validation/
    │   └── ...
    └── test/
        └── ...
    

📝 License

This project is licensed under the MIT License. See the LICENSE file for details.
