Spectrogram-VisionTransformer

A project that applies Vision Transformer (ViT) architectures to spectrograms for audio / speech tasks (e.g. classification, recognition).
It transforms audio signals into spectrogram “images” and uses Transformer models to learn from them.

Table of Contents

  • Introduction
  • Features
  • Installation
  • Usage
  • Model / Architecture
  • Experiments & Results
  • Dataset
  • Training
  • Evaluation
  • Contributing
  • License

Introduction

Deep learning for audio often uses convolutional networks on spectrogram inputs.
This project explores using Vision Transformers (ViT) on spectrograms to capture global patterns and long-range dependencies in time-frequency space.

The key idea, sketched in code after the list:

  1. Convert a raw audio waveform → spectrogram (e.g. Mel spectrogram)
  2. Treat the spectrogram as a 2D image
  3. Feed patches / tokens into a Transformer / ViT model
  4. Train / fine-tune for audio tasks (classification, detection, etc.)
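
A minimal sketch of steps 1–3, assuming librosa and torch are installed; the file name and spectrogram parameters are illustrative, not the project's defaults:

import librosa
import numpy as np
import torch

# 1. Load audio and compute a log-Mel spectrogram
waveform, sr = librosa.load("sample.wav", sr=16000)       # mono, 16 kHz
mel = librosa.feature.melspectrogram(y=waveform, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=128)
log_mel = librosa.power_to_db(mel, ref=np.max)             # shape: (n_mels, time)

# 2. Treat the spectrogram as a one-channel image: (batch, channel, freq, time)
spec = torch.from_numpy(log_mel).float().unsqueeze(0).unsqueeze(0)

# 3. Cut into non-overlapping 16x16 patches -> one flattened token per patch
patches = spec.unfold(2, 16, 16).unfold(3, 16, 16)
tokens = patches.reshape(1, -1, 16 * 16)                   # (1, num_patches, 256)
print(tokens.shape)

Each flattened patch becomes one token, so the Transformer can attend across both frequency and time within a single sequence.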

Features

  • Pipeline for audio → spectrogram → transformer input
  • Preprocessing scripts (e.g. STFT, Mel, normalization)
  • Model architectures based on ViT / Transformer
  • Training and evaluation scripts
  • Support for GPU / multi-GPU
  • Logging, checkpoints, and result saving

Installation

# Clone this repository
git clone https://github.com/Sree14hari/Spectrogram-VisionTransformer.git
cd Spectrogram-VisionTransformer

# (Optional) create virtual environment
python -m venv venv
source venv/bin/activate   # or venv\Scripts\activate on Windows

# Install dependencies
pip install -r requirements.txt

You’ll need packages such as torch, numpy, and librosa (or a similar library for computing spectrograms).

Usage

Prepare data

  • Place your audio files / dataset in a folder
  • (Optional) Provide a config or script to convert them into spectrograms offline (a sketch follows below)
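
A hypothetical offline conversion script; the folder layout (data/audio, data/spectrograms) and the parameters are assumptions, not paths defined by this repository:

import pathlib
import librosa
import numpy as np

audio_dir = pathlib.Path("data/audio")
out_dir = pathlib.Path("data/spectrograms")
out_dir.mkdir(parents=True, exist_ok=True)

# Convert every .wav file to a log-Mel spectrogram saved as .npy
for wav_path in sorted(audio_dir.glob("*.wav")):
    y, sr = librosa.load(wav_path, sr=16000)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128, hop_length=256)
    log_mel = librosa.power_to_db(mel, ref=np.max)
    np.save(out_dir / (wav_path.stem + ".npy"), log_mel)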

Train a model

python train.py --config configs/your_config.yaml

Evaluate or infer

python evaluate.py --checkpoint path/to/model.pth --data your_test_data

Example script

You could include a sample script, run_demo.py, that loads a sample audio file, converts it to a spectrogram, runs it through the model, and prints the results.
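
A sketch of what such a script could look like; the checkpoint path is a placeholder and the loading step depends on how the repository actually saves its models:

# run_demo.py (sketch)
import librosa
import numpy as np
import torch

def audio_to_logmel(path, sr=16000, n_mels=128):
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)

# (1, 1, freq, time) tensor for a single example
spec = torch.from_numpy(audio_to_logmel("sample.wav")).float()[None, None]

# Placeholder loading step: swap in the repo's model class and load_state_dict if needed
model = torch.load("path/to/model.pth", map_location="cpu")
model.eval()
with torch.no_grad():
    logits = model(spec)
print("Predicted class:", logits.argmax(dim=-1).item())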

Model / Architecture

Describe the architecture you used:

  • Patch size (e.g. 16×16)
  • Number of transformer layers / heads / embedding dimension
  • Any modifications you made for spectrogram data (positional embeddings, time-frequency embedding, masking, etc.)
  • Loss functions, regularization, etc.

If you adopted or adapted ideas from other works (e.g. ASiT: Audio Spectrogram vIsion Transformer, available on arXiv, or similar works), mention them here.
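
As a generic reference point, a ViT-style classifier for single-channel spectrograms can be sketched in plain PyTorch; the patch size, depth, head count, embedding dimension, and number of classes below are illustrative values, not this repository's configuration:

import torch
import torch.nn as nn

class SpectrogramViT(nn.Module):
    def __init__(self, n_mels=128, n_frames=208, patch=16, dim=256,
                 depth=6, heads=8, num_classes=35):
        super().__init__()
        num_patches = (n_mels // patch) * (n_frames // patch)
        # Patch embedding: a strided conv maps each 16x16 patch to one token
        self.patch_embed = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                                    # x: (B, 1, n_mels, n_frames)
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        tokens = self.encoder(tokens)
        return self.head(tokens[:, 0])                       # classify from the [CLS] token

model = SpectrogramViT()
print(model(torch.randn(2, 1, 128, 208)).shape)              # torch.Size([2, 35])

A single learned positional embedding over the flattened patch grid is the simplest choice; separate time and frequency embeddings are a common refinement for spectrogram inputs.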

Experiments & Results

Provide a table of your experiments, for example:

| Task / Dataset  | Model Variant | Accuracy / Metric | Notes           |
|-----------------|---------------|-------------------|-----------------|
| Speech Commands | ViT-base      | 95.6%             | baseline        |
| Your dataset    | Your model    | XX.X%             | your experiment |

Include charts, loss curves, confusion matrices, etc.

Dataset

Describe the dataset(s) you used:

  • Name (e.g. Speech Commands, ESC-50, custom)
  • Number of classes
  • Preprocessing (sampling rate, window size, overlap, normalization)
  • Train / val / test splits
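
For reproducible splits, something along these lines works with any torch Dataset of (spectrogram, label) pairs; the dummy tensors and the 80/10/10 ratio are placeholders:

import torch
from torch.utils.data import TensorDataset, random_split

# Dummy stand-in dataset: 1000 spectrograms of shape (1, 128, 208), 35 classes
dataset = TensorDataset(torch.randn(1000, 1, 128, 208),
                        torch.randint(0, 35, (1000,)))

n = len(dataset)
n_train, n_val = int(0.8 * n), int(0.1 * n)
generator = torch.Generator().manual_seed(42)        # fixed seed -> reproducible splits
train_set, val_set, test_set = random_split(
    dataset, [n_train, n_val, n - n_train - n_val], generator=generator)
print(len(train_set), len(val_set), len(test_set))   # 800 100 100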

Training

Detail your training settings, e.g.:

  • Learning rate, scheduler
  • Batch size
  • Number of epochs
  • Optimizer (Adam, SGD, etc.)
  • Data augmentation (if any)
  • Hardware setup
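
An illustrative training setup reusing the model and split from the sketches above; the optimizer, schedule, batch size, epoch count, and label smoothing are assumptions, not this repository's actual settings:

import torch
from torch.utils.data import DataLoader

device = "cuda" if torch.cuda.is_available() else "cpu"
model = SpectrogramViT().to(device)                   # from the architecture sketch above
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)

for epoch in range(50):
    model.train()
    for spec, label in train_loader:
        spec, label = spec.to(device), label.to(device)
        optimizer.zero_grad()
        loss = criterion(model(spec), label)
        loss.backward()
        optimizer.step()
    scheduler.step()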

Evaluation

  • How metrics are computed (accuracy, F1, AUC, etc.)
  • Any special evaluation scripts
  • How to reproduce results
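
An illustrative evaluation loop using scikit-learn metrics, again reusing model, device, and test_set from the sketches above:

import torch
from torch.utils.data import DataLoader
from sklearn.metrics import accuracy_score, f1_score

test_loader = DataLoader(test_set, batch_size=64)
model.eval()
preds, labels = [], []
with torch.no_grad():
    for spec, label in test_loader:
        logits = model(spec.to(device))
        preds.extend(logits.argmax(dim=-1).cpu().tolist())
        labels.extend(label.tolist())

print("Accuracy:", accuracy_score(labels, preds))
print("Macro F1:", f1_score(labels, preds, average="macro"))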

Contributing

If you welcome contributions, you can say:

  • Please open issues or pull requests
  • Follow the code style
  • Add tests / documentation
  • Cite your work

License

State your license (e.g. MIT, Apache).

MIT License
© Sreehari R