Spectrogram-VisionTransformer

A project that applies Vision Transformer (ViT) architectures to spectrograms for audio / speech tasks (e.g. classification, recognition).
It transforms audio signals into spectrogram “images” and uses Transformer models to learn from them.

Table of Contents

  • Introduction
  • Features
  • Installation
  • Usage
  • Model / Architecture
  • Experiments & Results
  • Dataset
  • Training
  • Evaluation
  • Contributing
  • License

Introduction

Deep learning for audio often uses convolutional networks on spectrogram inputs.
This project explores using Vision Transformers (ViT) on spectrograms to capture global patterns and long-range dependencies in time-frequency space.

The key idea, sketched in code after the list:

  1. Convert a raw audio waveform → spectrogram (e.g. Mel spectrogram)
  2. Treat the spectrogram as a 2D image
  3. Feed patches / tokens into a Transformer / ViT model
  4. Train / fine-tune for audio tasks (classification, detection, etc.)
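
A minimal sketch of steps 1–3, assuming librosa and torch are installed; the file name and spectrogram parameters are illustrative, not the project's defaults:

import librosa
import numpy as np
import torch

# 1. Load audio and compute a log-Mel spectrogram
waveform, sr = librosa.load("sample.wav", sr=16000)       # mono, 16 kHz
mel = librosa.feature.melspectrogram(y=waveform, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=128)
log_mel = librosa.power_to_db(mel, ref=np.max)             # shape: (n_mels, time)

# 2. Treat the spectrogram as a one-channel image: (batch, channel, freq, time)
spec = torch.from_numpy(log_mel).float().unsqueeze(0).unsqueeze(0)

# 3. Cut into non-overlapping 16x16 patches -> one flattened token per patch
patches = spec.unfold(2, 16, 16).unfold(3, 16, 16)
tokens = patches.reshape(1, -1, 16 * 16)                   # (1, num_patches, 256)
print(tokens.shape)

Each flattened patch becomes one token, so the Transformer can attend across both frequency and time within a single sequence.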

Features

  • Pipeline for audio → spectrogram → transformer input
  • Preprocessing scripts (e.g. STFT, Mel, normalization)
  • Model architectures based on ViT / Transformer
  • Training and evaluation scripts
  • Support for GPU / multi-GPU
  • Logging, checkpoints, and result saving

Installation

# Clone this repository
git clone https://github.com/Sree14hari/Spectrogram-VisionTransformer.git
cd Spectrogram-VisionTransformer

# (Optional) create virtual environment
python -m venv venv
source venv/bin/activate   # or venv\Scripts\activate on Windows

# Install dependencies
pip install -r requirements.txt

You’ll need packages such as torch, numpy, and librosa (or a similar library for computing spectrograms).

Usage

Prepare data

  • Place your audio files / dataset in a folder
  • (Optional) Provide a config or script to convert them into spectrograms offline (a sketch follows below)
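
A hypothetical offline conversion script; the folder layout (data/audio, data/spectrograms) and the parameters are assumptions, not paths defined by this repository:

import pathlib
import librosa
import numpy as np

audio_dir = pathlib.Path("data/audio")
out_dir = pathlib.Path("data/spectrograms")
out_dir.mkdir(parents=True, exist_ok=True)

# Convert every .wav file to a log-Mel spectrogram saved as .npy
for wav_path in sorted(audio_dir.glob("*.wav")):
    y, sr = librosa.load(wav_path, sr=16000)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, hop_length=256)
    log_mel = librosa.power_to_db(mel, ref=np.max)
    np.save(out_dir / (wav_path.stem + ".npy"), log_mel)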

Train a model

python train.py --config configs/your_config.yaml

Evaluate or infer

python evaluate.py --checkpoint path/to/model.pth --data your_test_data

Example script

You could include a sample script, run_demo.py, that loads a sample audio file, converts it to a spectrogram, runs it through the model, and prints the results.
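
A sketch of what such a script could look like; the checkpoint path is a placeholder and the loading step depends on how the repository actually saves its models:

# run_demo.py (sketch)
import librosa
import numpy as np
import torch

def audio_to_logmel(path, sr=16000, n_mels=128):
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)

# (1, 1, freq, time) tensor for a single example
spec = torch.from_numpy(audio_to_logmel("sample.wav")).float()[None, None]

# Placeholder loading step: swap in the repo's model class and load_state_dict if needed
model = torch.load("path/to/model.pth", map_location="cpu")
model.eval()
with torch.no_grad():
    logits = model(spec)
print("Predicted class:", logits.argmax(dim=-1).item())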

Model / Architecture

Describe the architecture you used:

  • Patch size (e.g. 16×16)
  • Number of transformer layers / heads / embedding dimension
  • Any modifications you made for spectrogram data (positional embeddings, time-frequency embedding, masking, etc.)
  • Loss functions, regularization, etc.

If you adopted or adapted ideas from other works (e.g. ASiT: Audio Spectrogram vIsion Transformer, available on arXiv, or similar works), mention them here.
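
As a generic reference point, a ViT-style classifier for single-channel spectrograms can be sketched in plain PyTorch; the patch size, depth, head count, embedding dimension, and number of classes below are illustrative values, not this repository's configuration:

import torch
import torch.nn as nn

class SpectrogramViT(nn.Module):
    def __init__(self, n_mels=128, n_frames=208, patch=16, dim=256,
                 depth=6, heads=8, num_classes=35):
        super().__init__()
        num_patches = (n_mels // patch) * (n_frames // patch)
        # Patch embedding: a strided conv maps each 16x16 patch to one token
        self.patch_embed = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                                    # x: (B, 1, n_mels, n_frames)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        tokens = self.encoder(tokens)
        return self.head(tokens[:, 0])                       # classify from the [CLS] token

model = SpectrogramViT()
print(model(torch.randn(2, 1, 128, 208)).shape)              # torch.Size([2, 35])

A single learned positional embedding over the flattened patch grid is the simplest choice; separate time and frequency embeddings are a common refinement for spectrogram inputs.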

Experiments & Results

Provide a table of your experiments, for example:

| Task / Dataset  | Model Variant | Accuracy / Metric | Notes           |
|-----------------|---------------|-------------------|-----------------|
| Speech Commands | ViT-base      | 95.6%             | baseline        |
| Your dataset    | Your model    | XX.X%             | your experiment |

Include charts, loss curves, confusion matrices, etc.

Dataset

Describe the dataset(s) you used:

  • Name (e.g. Speech Commands, ESC-50, custom)
  • Number of classes
  • Preprocessing (sampling rate, window size, overlap, normalization)
  • Train / val / test splits
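
For reproducible splits, something along these lines works with any torch Dataset of (spectrogram, label) pairs; the dummy tensors and the 80/10/10 ratio are placeholders:

import torch
from torch.utils.data import TensorDataset, random_split

# Dummy stand-in dataset: 1000 spectrograms of shape (1, 128, 208), 35 classes
dataset = TensorDataset(torch.randn(1000, 1, 128, 208),
                        torch.randint(0, 35, (1000,)))

n = len(dataset)
n_train, n_val = int(0.8 * n), int(0.1 * n)
generator = torch.Generator().manual_seed(42)        # fixed seed -> reproducible splits
train_set, val_set, test_set = random_split(
    dataset, [n_train, n_val, n - n_train - n_val], generator=generator)
print(len(train_set), len(val_set), len(test_set))   # 800 100 100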

Training

Detail your training settings, e.g.:

  • Learning rate, scheduler
  • Batch size
  • Number of epochs
  • Optimizer (Adam, SGD, etc.)
  • Data augmentation (if any)
  • Hardware setup
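
An illustrative training setup reusing the model and split from the sketches above; the optimizer, schedule, batch size, epoch count, and label smoothing are assumptions, not this repository's actual settings:

import torch
from torch.utils.data import DataLoader

device = "cuda" if torch.cuda.is_available() else "cpu"
model = SpectrogramViT().to(device)                   # from the architecture sketch above
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)

for epoch in range(50):
    model.train()
    for spec, label in train_loader:
        spec, label = spec.to(device), label.to(device)
        optimizer.zero_grad()
        loss = criterion(model(spec), label)
        loss.backward()
        optimizer.step()
    scheduler.step()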

Evaluation

  • How metrics are computed (accuracy, F1, AUC, etc.)
  • Any special evaluation scripts
  • How to reproduce results
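
An illustrative evaluation loop using scikit-learn metrics, again reusing model, device, and test_set from the sketches above:

import torch
from torch.utils.data import DataLoader
from sklearn.metrics import accuracy_score, f1_score

test_loader = DataLoader(test_set, batch_size=64)
model.eval()
preds, labels = [], []
with torch.no_grad():
    for spec, label in test_loader:
        logits = model(spec.to(device))
        preds.extend(logits.argmax(dim=-1).cpu().tolist())
        labels.extend(label.tolist())

print("Accuracy:", accuracy_score(labels, preds))
print("Macro F1:", f1_score(labels, preds, average="macro"))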

Contributing

If you welcome contributions, you can say:

  • Please open issues or pull requests
  • Follow the code style
  • Add tests / documentation
  • Cite your work

License

State your license (e.g. MIT, Apache).

MIT License
© Sreehari R