# Self-Supervised Speech Representation Learning

Self-supervised speech representation learning from raw waveform input with Transformer-based contextual modeling.

This project explores learning meaningful speech representations without labeled transcripts. The system operates directly on raw audio waveforms, using a convolutional encoder for feature extraction and a Transformer for contextual modeling.
The implementation focuses on:
- Raw waveform processing
- CNN-based feature encoding
- Transformer context modeling
- Masked contrastive learning
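As a rough illustration of how the four components above fit together, here is a miniature end-to-end sketch. It is not the project's implementation: the strided-framing "encoder", the zeroed mask vector, the single-layer "context network", and the InfoNCE-style loss are all toy NumPy stand-ins, with every name and dimension chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_frames(waveform, frame_len=160, hop=160, dim=32):
    """Toy feature encoder: strided framing + a fixed random linear
    projection (a real encoder would be a stack of 1-D convolutions)."""
    n_frames = 1 + (len(waveform) - frame_len) // hop
    frames = np.stack([waveform[i * hop : i * hop + frame_len]
                       for i in range(n_frames)])
    W = rng.standard_normal((frame_len, dim)) / np.sqrt(frame_len)
    return frames @ W                                  # (n_frames, dim)

def mask_frames(latents, mask_prob=0.3):
    """Zero out a random subset of latent frames (a stand-in for a
    learnable mask embedding)."""
    mask = rng.random(len(latents)) < mask_prob
    masked = latents.copy()
    masked[mask] = 0.0
    return masked, mask

def info_nce(context, targets, mask, temperature=0.1):
    """Contrastive loss: at each masked position, identify the true
    latent among all latents of the same utterance (the negatives)."""
    c = context / np.linalg.norm(context, axis=1, keepdims=True)
    t = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    sims = (c @ t.T) / temperature                     # cosine similarities
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    idx = np.where(mask)[0]
    return -log_probs[idx, idx].mean()                 # NLL of true latents

waveform = rng.standard_normal(16000)      # 1 s of fake 16 kHz audio
latents = encode_frames(waveform)          # CNN-style feature encoding
masked, mask = mask_frames(latents)        # masking for self-supervision

# Single random layer as a stand-in for the Transformer context network.
W_ctx = rng.standard_normal((32, 32)) / np.sqrt(32)
b_ctx = 0.1 * rng.standard_normal(32)
context = np.tanh(masked @ W_ctx + b_ctx)

loss = info_nce(context, latents, mask)
print(f"frames: {len(latents)}, masked: {mask.sum()}, loss: {loss:.3f}")
```

In training, the encoder, mask embedding, and context network would all be learned jointly by minimizing this loss, so that context vectors at masked positions become predictive of the latents they cover.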
## Project Structure

```
speech-ssl/
├── src/
├── notebooks/
├── data/
├── graphs/
├── checkpoints/
├── scripts/
├── tests/
├── requirements.txt
└── README.md
```
## Installation

```bash
git clone <repo-url>
cd speech-ssl
python -m venv venv
venv\Scripts\activate      # Windows
source venv/bin/activate   # macOS/Linux
pip install -r requirements.txt
```
## Status

- Project initialization complete
- Model development in progress
## Author

Shivanshu Pal, MSc Data Science. Focus: Speech & Audio AI.