This project demonstrates how raw speech audio is transformed into meaningful features used in Speech and Audio AI systems.
The goal is to understand the signal processing pipeline, not just to use existing libraries.
- Raw waveform (time domain)
- Fast Fourier Transform (frequency domain)
- Spectrogram (time–frequency representation)
- Mel Spectrogram (human auditory scale)
- MFCCs (compact speech representation)
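The middle of the pipeline above (waveform to spectrogram) can be sketched in plain NumPy. The frame length, hop size, and sample rate below are assumed typical values for 16 kHz speech, not settings taken from this project:

```python
import numpy as np

# Assumed, typical parameters for 16 kHz speech (not from this project)
SR = 16000        # sample rate (Hz)
FRAME_LEN = 400   # 25 ms analysis window
HOP = 160         # 10 ms hop between windows

def spectrogram(x, frame_len=FRAME_LEN, hop=HOP):
    """Magnitude spectrogram: slice the signal into overlapping frames,
    apply a Hann window to each, and take the FFT of every frame."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))  # shape: (frames, freq bins)

# One second of a 440 Hz tone as a stand-in for recorded speech
t = np.arange(SR) / SR
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
# Each row of spec is the spectrum of one 25 ms slice of the signal
```

Stacking the per-frame spectra side by side is exactly what the time–frequency spectrogram stage produces.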
The Fast Fourier Transform (FFT) converts a time-domain signal into its frequency components. Speech is a mixture of many frequencies: a fundamental produced by the vibrating vocal cords, shaped by the resonances of the vocal tract.
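A minimal illustration of this decomposition, using a two-tone signal as a stand-in for voiced speech (the 120 Hz and 240 Hz components and the sample rate are arbitrary choices for the example):

```python
import numpy as np

sr = 8000                      # assumed sample rate (Hz)
t = np.arange(sr) / sr         # one second of samples
# Two-tone signal standing in for voiced speech:
# a 120 Hz fundamental plus a weaker 240 Hz harmonic.
x = np.sin(2 * np.pi * 120 * t) + 0.5 * np.sin(2 * np.pi * 240 * t)

spectrum = np.abs(np.fft.rfft(x))                 # magnitude per frequency bin
freqs = np.arange(spectrum.size) * sr / x.size    # bin index -> Hz (1 Hz resolution)

# The two strongest bins fall exactly at the component frequencies
top2 = sorted(freqs[np.argsort(spectrum)[-2:]].tolist())  # [120.0, 240.0]
```

The FFT recovers the ingredients of the mixture: the two largest magnitude bins sit at 120 Hz and 240 Hz, the frequencies the signal was built from.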
Human hearing is nonlinear. We perceive low frequencies more precisely than high frequencies. The Mel scale models this perception, making features more meaningful for speech models.
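The mapping itself is a simple logarithmic formula. A sketch of the widely used HTK-style conversion (one common variant of the Mel scale, not necessarily the one this project uses):

```python
import numpy as np

def hz_to_mel(f):
    # HTK-style Mel scale: roughly linear below ~1 kHz, logarithmic above
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    # Inverse mapping, used when placing mel filterbank edges back in Hz
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

# Equal 1 kHz steps in Hz shrink on the mel axis as frequency rises,
# mirroring our coarser perception of high frequencies.
low_gap = hz_to_mel(1000) - hz_to_mel(0)       # roughly 1000 mel
high_gap = hz_to_mel(4000) - hz_to_mel(3000)   # roughly 270 mel
```

The same 1 kHz step spans far fewer mels at high frequencies, which is exactly the perceptual compression the Mel scale is meant to capture.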
MFCCs represent the spectral envelope of speech, which contains phonetic information. They are compact, robust to noise, and widely used in speech recognition and emotion detection.
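The final compression step can be sketched as a discrete cosine transform (DCT) over log-mel energies. The mel filterbank itself is omitted here, and the energies below are random stand-ins rather than real speech features:

```python
import numpy as np

def dct2_ortho(x):
    """Orthonormal DCT-II along the last axis -- the step that turns
    log-mel energies into cepstral coefficients."""
    n = x.shape[-1]
    k = np.arange(n)
    basis = np.cos(np.pi / n * (np.arange(n)[:, None] + 0.5) * k[None, :])
    y = x @ basis
    y[..., 0] *= np.sqrt(1.0 / n)   # orthonormal scaling for the DC term
    y[..., 1:] *= np.sqrt(2.0 / n)  # and for the remaining coefficients
    return y

# Stand-in log-mel energies: 5 frames x 40 mel bands (not real audio)
rng = np.random.default_rng(0)
log_mel = np.log(rng.uniform(0.1, 1.0, size=(5, 40)))

# Keep only the first 13 coefficients: the smooth spectral envelope
mfcc = dct2_ortho(log_mel)[:, :13]
```

Truncating to the first coefficients discards fine spectral detail and keeps the envelope, which is why MFCCs are both compact and phonetically informative.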
- Automatic Speech Recognition (ASR)
- Speech Emotion Recognition
- Speaker Identification
- Healthcare and Assistive Technologies
Shivanshu Pal
MSc Data Science
Aspiring PhD Researcher — Speech & Audio AI
Email: contactshiva7@gmail.com