End-to-end voice assistant with Whisper STT, intent classification, NLU, and TTS synthesis — built for learning and portfolio demonstration.
This project demonstrates core speech processing concepts with real implementations (not just mocks):
- DSP Fundamentals: FFT, STFT, Mel filter banks, DCT, pre-emphasis, SNR estimation
- Feature Extraction: Real MFCC computation using scipy (DFT → Mel → log → DCT), plus delta and delta-delta temporal features
- Voice Activity Detection: Energy-based silence detection and speech segmentation
- Intent Classification: Multi-strategy pipeline (keyword + regex + fuzzy matching)
- Speech Recognition: Whisper integration with audio preprocessing
- Speech Synthesis: TTS with rate control, pitch scaling, and SSML stubs
- Evaluation Metrics: WER via Levenshtein dynamic programming, CER, intent accuracy, confusion matrix with per-class precision/recall/F1
- Streaming ASR: Chunk-based transcription with overlap handling for real-time audio streams
- Audio Quality: Signal-to-Noise Ratio estimation using VAD-based speech/silence segmentation
- API Design: RESTful API with Pydantic validation, token bucket rate limiting, and 8 endpoints
- Visualization: Interactive Streamlit dashboard with 5 pages
📖 New to speech processing? Check out the Glossary for definitions of all key terms used in this project.
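As a taste of the energy-based VAD and SNR estimation listed above, here is a minimal sketch. It is not the project's actual implementation; frame sizes assume 16 kHz audio, and the threshold ratio is an illustrative choice:

```python
import numpy as np

def frame_energies(x, frame_len=400, hop=160):
    """Short-time energy per frame (25 ms windows, 10 ms hop at 16 kHz)."""
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)
    return np.array([
        np.sum(x[i * hop:i * hop + frame_len] ** 2)
        for i in range(n_frames)
    ])

def estimate_snr_db(x, threshold_ratio=0.1):
    """Estimate SNR by splitting frames into speech/silence via an energy threshold."""
    e = frame_energies(x)
    thresh = threshold_ratio * e.max()          # frames above this count as speech
    speech, silence = e[e >= thresh], e[e < thresh]
    if len(silence) == 0 or silence.mean() == 0:
        return float("inf")                     # no silence found: SNR undefined
    return 10 * np.log10(speech.mean() / silence.mean())
```

The same speech/silence split doubles as a crude VAD: the indices of frames above the threshold mark the speech segments.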
┌──────────────┐ ┌──────────────┐ ┌──────────────────┐ ┌──────────────┐ ┌──────────────┐
│ 🎙️ Audio │───▶│ 📝 STT │───▶│ 🧠 Intent Class. │───▶│ 💬 Response │───▶│ 🔊 TTS │
│ Capture │ │ (Whisper) │ │ (Keyword+Regex) │ │ Generation │ │ Synthesis │
└──────────────┘ └──────────────┘ └──────────────────┘ └──────────────┘ └──────────────┘
│ │ │ │ │
│ ┌─────┴─────┐ ┌────┴────┐ ┌─────┴─────┐ ┌────┴────┐
│ │ MFCC │ │ 10 │ │ Template │ │ Pitch │
│ │ Log-Mel │ │ intents │ │ Filling │ │ Rate │
│ │ Pre-emph │ │ Fuzzy │ │ Context │ │ SSML │
│ │ VAD │ │ Fusion │ │ History │ │ Vocoder │
│ └───────────┘ └─────────┘ └───────────┘ └─────────┘
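The keyword + regex + fuzzy fusion shown above can be sketched roughly as follows. The intent names, patterns, weights, and fuzzy cutoff here are illustrative, not the project's actual configuration:

```python
import re
from difflib import SequenceMatcher

# Hypothetical intent specs for illustration only.
INTENTS = {
    "weather": {"keywords": ["weather", "temperature", "forecast"],
                "patterns": [r"\bhow (hot|cold)\b"]},
    "time":    {"keywords": ["time", "clock"],
                "patterns": [r"\bwhat time\b"]},
}

def classify(text, fuzzy_cutoff=0.8):
    """Score each intent by keyword, regex, and fuzzy hits, then fuse the scores."""
    text_l = text.lower()
    words = re.findall(r"[a-z']+", text_l)
    best, best_score = "unknown", 0.0
    for intent, spec in INTENTS.items():
        kw = sum(w in spec["keywords"] for w in words)                  # exact keyword hits
        rx = sum(bool(re.search(p, text_l)) for p in spec["patterns"])  # regex hits
        fz = sum(                                                       # fuzzy hits catch typos like "wether"
            max((SequenceMatcher(None, w, k).ratio() for k in spec["keywords"]),
                default=0) >= fuzzy_cutoff
            for w in words
        )
        score = 1.0 * kw + 1.5 * rx + 0.5 * fz   # illustrative fusion weights
        if score > best_score:
            best, best_score = intent, score
    return best, best_score
```

For example, `classify("how is the wether")` still lands on `weather` because the fuzzy stage tolerates the misspelling even though keyword and regex matching both miss.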
MFCCs (Mel-Frequency Cepstral Coefficients) are the dominant feature representation in speech processing. Here's the computation pipeline:
Audio Waveform
│
▼
[Pre-emphasis] y[n] = x[n] - α·x[n-1] ← Boost high frequencies
│
▼
[Framing] 25ms windows, 10ms hop ← Short, stationary segments
│
▼
[Hanning Window] Taper edges to zero ← Reduce spectral leakage
│
▼
[FFT] Time → Frequency domain ← Reveal spectral content
│
▼
[Mel Filter Bank] 26 triangular filters ← Match human hearing (Mel scale)
│
▼
[Log] log(filter energies) ← Model loudness perception
│
▼
[DCT] Decorrelate features ← Separate source from filter
│
▼
MFCCs (13 coefficients per frame) ← Vocal tract shape = phoneme identity
Why MFCCs? The DCT step is the key insight — it performs cepstral analysis, which separates the vocal tract filter (what we want for speech recognition) from the excitation source (pitch, which we don't want). Lower coefficients capture the spectral envelope (vocal tract shape), while higher coefficients capture fine pitch details.
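The pipeline above can be sketched end-to-end with numpy/scipy. This is a simplified version; the project's `audio_processor.py` may differ in details such as padding, normalization, and filter-bank construction:

```python
import numpy as np
from scipy.fft import rfft, dct

def mel(f):      # Hz → Mel
    return 2595 * np.log10(1 + f / 700)

def mel_inv(m):  # Mel → Hz
    return 700 * (10 ** (m / 2595) - 1)

def mel_filter_bank(n_filters=26, n_fft=512, sr=16000):
    """Triangular filters with centers evenly spaced on the Mel scale."""
    pts = mel_inv(np.linspace(mel(0), mel(sr / 2), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    return fb

def mfcc(x, sr=16000, n_mfcc=13, frame_len=400, hop=160, n_fft=512, alpha=0.97):
    x = np.append(x[0], x[1:] - alpha * x[:-1])              # pre-emphasis
    n_frames = 1 + max(0, (len(x) - frame_len) // hop)       # 25 ms / 10 ms framing
    win = np.hanning(frame_len)                              # taper edges
    frames = np.stack([x[i*hop:i*hop+frame_len] * win for i in range(n_frames)])
    power = np.abs(rfft(frames, n=n_fft)) ** 2               # FFT → power spectrum
    energies = power @ mel_filter_bank(26, n_fft, sr).T      # Mel filter bank
    feats = dct(np.log(energies + 1e-10), type=2, axis=1, norm="ortho")  # log + DCT
    return feats[:, :n_mfcc]                                 # keep lower coefficients
```

Each line maps to one box in the diagram; keeping only the lowest 13 DCT coefficients is the "discard pitch, keep vocal tract shape" step discussed below.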
08-realtime-voice-assistant/
├── src/
│ ├── config.py # Dataclass-based config with validation & env vars
│ ├── audio_processor.py # Real DSP: MFCC, STFT, Mel filter bank, VAD
│ ├── stt_engine.py # Whisper STT with preprocessing & batch support
│ ├── tts_engine.py # TTS with rate/pitch control & SSML stubs
│ ├── intent_classifier.py # Multi-strategy intent classification pipeline
│ ├── voice_assistant.py # Pipeline orchestrator with context & analytics
│ ├── evaluation.py # WER (Levenshtein), CER, accuracy, benchmarking
│ └── api/
│ └── main.py # FastAPI with 8 endpoints & Pydantic models
├── streamlit_app/
│ ├── app.py # Dashboard entry point (5 pages)
│ └── pages/
│ ├── 1_📊_Overview.py # Architecture diagram & component cards
│ ├── 2_🎤_Chat.py # Chat with intent display & confidence bars
│ ├── 3_📈_Metrics.py # WER/CER calculator & audio analysis
│ ├── 4_🎵_Audio_Playground.py # Interactive DSP demos & signal generator
│ └── 5_📚_Learn.py # Educational deep-dives
├── tests/
│ ├── conftest.py # Shared fixtures (audio signals, mock engines)
│ ├── test_audio_processor.py # 45+ tests: MFCC, spectrogram, VAD, Mel filter
│ ├── test_stt.py # STT engine, preprocessing, batch transcription
│ ├── test_tts.py # TTS synthesis, rate/pitch, format conversion
│ ├── test_intent_classifier.py # All classifiers, fuzzy matching, pipeline
│ ├── test_voice_assistant.py # Pipeline, history, context, analytics
│ ├── test_evaluation.py # WER, CER, Levenshtein, accuracy, benchmarks
│ └── test_api.py # All 8 API endpoints, validation, errors
├── requirements.txt
└── README.md
# Clone the repository
git clone https://github.com/mohamed-elkholy95/realtime-voice-assistant.git
cd realtime-voice-assistant
# Create virtual environment (recommended)
python -m venv .venv
source .venv/bin/activate
# Install dependencies
pip install -r requirements.txt
# Run the test suite (100+ tests)
python -m pytest tests/ -v
# Start the API server
uvicorn src.api.main:app --reload --port 8008
# API docs: http://localhost:8008/docs
# Launch the Streamlit dashboard
streamlit run streamlit_app/app.py

| Method | Endpoint | Description |
|---|---|---|
| GET | /health | Health check |
| POST | /chat | Process text through the assistant pipeline |
| POST | /transcribe | Convert base64 audio to text (STT) |
| POST | /synthesize | Convert text to base64 audio (TTS) |
| GET | /intents | List all supported intent types |
| POST | /evaluate | Compute WER/CER between reference and hypothesis |
| GET | /history | Get conversation history |
| GET | /analytics | Get intent classification analytics |
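The token bucket rate limiting behind these endpoints works by refilling tokens at a steady rate and spending one per request; a minimal sketch (the rate and capacity values are illustrative, not the project's settings):

```python
import time

class TokenBucket:
    """Allow sustained `rate` requests/second with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float):
        self.rate, self.capacity = rate, capacity
        self.tokens = capacity            # start with a full bucket
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1              # spend one token for this request
            return True
        return False                      # bucket empty: reject (HTTP 429)
```

A bucket with `capacity=2` admits two back-to-back requests, then rejects further calls until the refill catches up.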
# Chat
curl -X POST http://localhost:8008/chat \
-H "Content-Type: application/json" \
-d '{"text": "What is the weather today?"}'
# Evaluate WER
curl -X POST http://localhost:8008/evaluate \
-H "Content-Type: application/json" \
-d '{"reference": "hello world", "hypothesis": "hello there"}'

The interactive dashboard has 5 pages:
- 📊 Overview — Architecture diagram, pipeline visualization, component cards
- 🎤 Chat — Voice chat with intent display, confidence bars, session analytics
- 📈 Metrics — Interactive WER/CER calculator, audio analysis, spectrograms
- 🎵 Audio Playground — Signal generator, MFCC explorer, Mel filter bank visualizer, pre-emphasis & normalization demos
- 📚 Learn — Educational deep-dives into STT, MFCCs, intent classification, TTS, and WER
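The WER/CER calculator on the Metrics page is built on Levenshtein dynamic programming over words; a minimal sketch, assuming whitespace tokenization:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # delete all i reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub,           # substitution (or match)
                           dp[i - 1][j] + 1,   # deletion
                           dp[i][j - 1] + 1)   # insertion
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)
```

The `/evaluate` example above ("hello world" vs. "hello there") gives one substitution over two reference words, i.e. WER = 0.5. CER is the same recurrence applied to characters instead of words.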
The project has 100+ tests covering all components:
# Run all tests
python -m pytest tests/ -v
# Run with coverage
python -m pytest tests/ -v --cov=src --cov-report=term-missing
# Run specific test module
python -m pytest tests/test_audio_processor.py -v
# Run specific test
python -m pytest tests/test_evaluation.py::TestWordErrorRate -v

- Fork the repository
- Create a feature branch: `git checkout -b feature/my-feature`
- Write tests for your changes
- Ensure all tests pass: `python -m pytest tests/ -v`
- Commit with conventional commits: `feat: add new feature`
- Push and open a Pull Request
MIT License — see LICENSE for details.
Mohamed Elkholy — GitHub · melkholy@techmatrix.com