Live bidirectional (English↔Spanish) speech-to-text and translation for church outreach at Stark Road Gospel Hall, Farmington Hills, MI.
Real-time mic input, fully on-device transcription and translation, displayed in browser. Supports both English and Spanish speakers via --lang flag. Uses a two-pass pipeline for fast partials and high-quality finals, with A/B model comparison for translation quality analysis.
Two-Pass Pipeline
================
```
Mic (48kHz) ──> Resample 16kHz (<1ms) ──> Silero VAD (<1ms) ──┐
                                                              │
      ┌───────────────────────────────────────────────────────┘
      │
      ├─ PARTIAL (every 0.6s of new speech, while speaker is talking)
      │    Whisper Large-V3-Turbo STT (~500ms)
      │    MarianMT EN↔ES PyTorch (~250ms)           ← italic in UI
      │    Total: ~750ms
      │
      └─ FINAL (on 0.5s silence gap or 8s max utterance)
           Whisper Large-V3-Turbo STT (~500ms)
           TranslateGemma 4B EN↔ES (~550ms)          ← replaces partial
           ├─ Piper TTS (~40ms/word EN, --tts)       ← audio output
           TranslateGemma 12B EN↔ES (~2.1s, --ab)    ← side-by-side
           Total: ~1.1s (4B) / ~2.6s (A/B sequential)

      Pipeline overlap: translation runs on utterance N
      while STT runs on utterance N+1, hiding translation latency.
                      │
                      ▼
         WebSocket (0.0.0.0:8765)
         HTTP      (0.0.0.0:8080)
                      │
      ┌───────────┬───┴─────────┬─────────────┐
      ▼           ▼             ▼             ▼
  Audience      A/B/C         Mobile        CSV +
  Display       Compare       Display       Diagnostics
  (projector)   (operator)    (QR code)     (JSONL)
```
Inference runs on Apple Silicon (MLX) or NVIDIA GPUs (CUDA). No cloud APIs, no internet required at runtime.
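Backend auto-detection can be sketched as below. This is a simplified illustration of the `--backend auto` idea, not the repo's actual factory code; the `detect_backend` name and exact probes are assumptions.

```python
import platform


def detect_backend() -> str:
    """Pick an inference backend (sketch of `--backend auto`).

    Assumes: MLX on Apple Silicon, CUDA when torch sees a GPU, else CPU.
    """
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        return "mlx"  # Apple Silicon -> Metal via MLX
    try:
        import torch  # only imported for the CUDA probe

        if torch.cuda.is_available():
            return "cuda"
    except ImportError:
        pass
    return "cpu"  # last-resort fallback
```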
| Display | File | Purpose |
|---|---|---|
| Audience | `displays/audience_display.html` | Projector-friendly side-by-side EN/ES with fading context, fullscreen toggle, QR code overlay for phones |
| A/B/C Comparison | `displays/ab_display.html` | Operator view showing Gemma 4B / MarianMT / 12B side-by-side with latency stats |
| Mobile | `displays/mobile_display.html` | Responsive phone/tablet view with model toggle and Spanish-only mode, accessible via LAN |
| Church | `displays/church_display.html` | Simplified church-oriented layout |
| OBS Overlay | `displays/obs_overlay.html` | Transparent overlay for OBS Studio / streaming integration |
Phones connect by scanning the QR code on the audience display or navigating to `http://<LAN-IP>:8080/displays/mobile_display.html`.
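Discovering which LAN address to print for phones can be sketched with a standard UDP-socket trick; this is an illustrative helper (the `lan_url` name is an assumption), not code from the repo.

```python
import socket


def lan_url(port: int = 8080) -> str:
    """Best-effort LAN address for the mobile display URL.

    Opens a UDP socket toward a public IP (no packets are actually sent)
    just to learn which local interface the OS would route through.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect(("8.8.8.8", 80))  # no traffic; just selects an interface
        ip = s.getsockname()[0]
    except OSError:
        ip = "127.0.0.1"  # offline fallback
    finally:
        s.close()
    return f"http://{ip}:{port}/displays/mobile_display.html"
```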
```bash
# Prerequisites
brew install ffmpeg portaudio

# Create env
python3.11 -m venv stt_env
source stt_env/bin/activate
pip install -r requirements-mac.txt

# HuggingFace login (required for TranslateGemma)
huggingface-cli login

# Download all models
python setup_models.py

# Run — 4B only (default, ~4.3 GB RAM)
python dry_run_ab.py

# Run — A/B mode with both 4B and 12B (~11.3 GB RAM)
python dry_run_ab.py --ab

# Run — Spanish speaker mode (ES→EN translation)
python dry_run_ab.py --lang es

# Open displays in browser
open displays/audience_display.html
open displays/ab_display.html
```

```bash
# NVIDIA/CUDA setup (RTX 2070 or similar)
pip install -r requirements-nvidia.txt
python dry_run_ab.py --backend=cuda --no-ab
```

| Flag | Default | Description |
|---|---|---|
| `--lang` | `en` | Source language: `en` (English→Spanish) or `es` (Spanish→English) |
| `--ab` | off | Load both 4B and 12B for A/B comparison |
| `--backend` | `auto` | Inference backend: `auto`, `mlx`, `cuda`, `cpu` |
| `--no-ab` | off | Skip 12B model (for low-VRAM devices) |
| `--low-vram` | off | Marian-only translation mode |
| `--dry-run-text` | off | Test with text input instead of mic |
| `--http-port` | 8080 | HTTP server port (serves display pages to phones over LAN) |
| `--ws-port` | 8765 | WebSocket server port |
| `--vad-threshold` | 0.3 | VAD speech detection sensitivity (0-1) |
| `--gain` | auto | Mic gain multiplier (auto-calibrates by default) |
| `--device` | auto | Audio input device index |
| `--chunk-duration` | 2.0 | Seconds of speech to accumulate |
| `--tts` | off | Enable Piper TTS audio synthesis for translated text |
| `--tts-output` | `ws` | TTS output mode: `ws` (WebSocket), `wav` (file), `both` |
| `--log-level` | WARNING | Logging level: DEBUG, INFO, WARNING, ERROR |
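The interaction between `--vad-threshold` and the 0.5s silence gap can be sketched in plain Python. The `probs` input stands in for per-chunk Silero VAD speech probabilities (the real pipeline gets them from the model); the function name and chunk size are illustrative.

```python
def segment_speech(probs, threshold=0.3, chunk_ms=32, silence_ms=500):
    """Yield (start_idx, end_idx) utterance spans from VAD probabilities.

    An utterance ends once `silence_ms` of consecutive sub-threshold
    chunks accumulates -- the trigger for a FINAL pass.
    """
    spans, start, silent = [], None, 0
    for i, p in enumerate(probs):
        if p >= threshold:
            if start is None:
                start = i  # speech onset
            silent = 0
        elif start is not None:
            silent += chunk_ms
            if silent >= silence_ms:
                spans.append((start, i))  # silence gap: finalize utterance
                start, silent = None, 0
    if start is not None:
        spans.append((start, len(probs)))  # flush trailing speech
    return spans
```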
```bash
# Run full test suite (600+ tests, no GPU required)
pytest tests/ -v

# With coverage report
pytest tests/ -v --cov=engines --cov=tools --cov=features --cov-report=term-missing

# Lint
ruff check . && ruff format --check .
mypy engines/ settings.py
```

Tests run on CI (Ubuntu, Python 3.11 + 3.12) without GPU or model downloads. Heavy ML dependencies are mocked.
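The mocking trick the CI relies on can be sketched as follows: heavy ML packages are replaced with stubs in `sys.modules` before the code under test imports them. The stubbed attribute here is illustrative.

```python
import sys
from unittest import mock

# Build a stub that answers the one call the pipeline makes at import time.
fake_torch = mock.MagicMock()
fake_torch.cuda.is_available.return_value = False

# Inside this context, `import torch` resolves to the stub instead of the
# real multi-GB package, so tests run on a GPU-less CI runner.
with mock.patch.dict(sys.modules, {"torch": fake_torch}):
    import torch

    gpu = torch.cuda.is_available()

print(gpu)  # False on the stub, regardless of host hardware
```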
| Component | Model | Framework | Size | Latency (typical) |
|---|---|---|---|---|
| VAD | Silero VAD | PyTorch | ~2 MB | <1ms |
| STT | Whisper Large-V3-Turbo | mlx-whisper / faster-whisper | ~1.5 GB | ~500ms |
| Translate (partials) | MarianMT opus-mt-en-es / es-en | PyTorch (CPU) | ~298 MB | ~250ms |
| Translate A (finals) | TranslateGemma 4B 4-bit | mlx-lm | ~2.5 GB | ~550ms |
| Translate B (finals) | TranslateGemma 12B 4-bit | mlx-lm | ~7 GB | ~2.1s |
| TTS (EN) | Piper en_US-lessac-high | ONNX Runtime | ~63 MB | ~40ms/word |
| TTS (ES) | Piper es_MX-claude-high | ONNX Runtime | ~63 MB | ~8ms/word |
CUDA variants: bitsandbytes 4-bit for TranslateGemma, faster-whisper INT8 for STT.
Pipeline overlap (P7-6C) hides translation latency by running translation on utterance N while STT processes utterance N+1.
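The overlap scheme can be sketched with `asyncio` and `run_in_executor` (which the A/B path also uses, per the feature list). Function names and the tiny sleeps standing in for model latencies are illustrative.

```python
import asyncio
import time


def stt(chunk: str) -> str:
    time.sleep(0.05)  # stand-in for ~500ms Whisper decode
    return f"text({chunk})"


def translate(text: str) -> str:
    time.sleep(0.05)  # stand-in for ~550ms TranslateGemma
    return f"es({text})"


async def pipeline(chunks):
    """Kick off translation of utterance N in a worker thread, then
    immediately start STT on utterance N+1 -- the two overlap in time."""
    loop = asyncio.get_running_loop()
    pending, out = None, []
    for chunk in chunks:
        text = await loop.run_in_executor(None, stt, chunk)
        if pending is not None:
            out.append(await pending)  # collect the previous translation
        pending = loop.run_in_executor(None, translate, text)
    out.append(await pending)
    return out


results = asyncio.run(pipeline(["u1", "u2"]))
print(results)  # ['es(text(u1))', 'es(text(u2))']
```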
- Bidirectional language support -- `--lang en` (English→Spanish, default) or `--lang es` (Spanish→English) with automatic model selection
- Two-pass STT pipeline -- fast italic partials (MarianMT, ~750ms) replaced by high-quality finals (TranslateGemma, ~1.1s) on silence detection
- Pipeline overlap -- translation runs concurrently with next utterance's STT, hiding translation latency
- A/B translation comparison -- 4B and 12B TranslateGemma run in parallel via `run_in_executor`, logged to CSV
- Theological Whisper prompt -- biases STT toward church vocabulary (atonement, propitiation, mediator, etc.) to reduce homophone errors
- Previous-text context -- last transcription fed to Whisper for cross-chunk accuracy
- Profanity filter -- allows biblical terms (e.g., "wrath") that generic filters block
- Speculative decoding support -- 4B model drafts tokens for 12B verification (`--num-draft-tokens`)
- Confidence scoring -- segment-level `avg_logprob` mapped to green/yellow/red indicators
- Translation QE -- length ratio + untranslated content detection per chunk
- Hallucination detection -- flags segments with `compression_ratio > 2.4`
- Word-level timestamps -- per-word confidence logged for fine-tuning prioritization
- Automated diagnostics -- homophone flags, bad sentence splits, Marian/Gemma divergence tracking
- Per-chunk audio saving -- WAV files saved to `stark_data/live_sessions/` for Whisper fine-tuning
- Structured review queue -- JSONL diagnostics with priority scoring for active learning
- Hardware profiling -- per-session CPU/RAM/GPU snapshots for portability planning
- LAN serving -- HTTP server + WebSocket on `0.0.0.0` so phones connect over local network
- 229-term theological glossary -- covers 66 books, 31 proper names, theological concepts, liturgical terms
- Piper TTS audio synthesis -- `--tts` flag enables text-to-speech for translated text (EN + ES voices, ONNX, thread-safe)
- End-to-end roundtrip quality testing -- `tools/roundtrip_test.py` measures STT WER and roundtrip translation accuracy
- Post-session validation pipeline -- YouTube WER comparison with text-anchor alignment (`tools/validate_session.py`)
- Dual-target inference -- runs on Apple Silicon (MLX) or NVIDIA GPUs (CUDA) from a single codebase
- Unified configuration -- pydantic-settings with `STARK_` env prefix, `.env` file support
- STT fallback -- automatic retry with fallback model on low-confidence or hallucinated segments
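The `STARK_` env-prefix convention behind the unified configuration can be illustrated with the stdlib alone; the repo itself uses pydantic-settings, which adds validation and `.env`-file loading on top of this idea. Field names below mirror the CLI flags but are assumptions.

```python
import os
from dataclasses import dataclass, fields


@dataclass
class Settings:
    """Stdlib-only sketch of env-prefixed configuration."""

    ws_port: int = 8765
    http_port: int = 8080
    vad_threshold: float = 0.3

    @classmethod
    def from_env(cls, prefix: str = "STARK_") -> "Settings":
        kwargs = {}
        for f in fields(cls):
            raw = os.environ.get(prefix + f.name.upper())
            if raw is not None:
                kwargs[f.name] = f.type(raw)  # coerce, e.g. int("9000")
        return cls(**kwargs)


os.environ["STARK_WS_PORT"] = "9000"
cfg = Settings.from_env()
print(cfg.ws_port)  # 9000 (overridden); other fields keep their defaults
```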
- Apple Silicon (M1/M2/M3/M4)
- 8 GB+ unified memory for 4B-only mode (~4.3 GB used)
- 18 GB+ unified memory for A/B mode (~11.3 GB used)
- Python 3.11, macOS with Metal support
- NVIDIA GPU with 6 GB+ VRAM
- CUDA 12.x toolkit
- Python 3.11, Windows or Linux
```bash
pip install -r requirements-nvidia.txt
```
- NVIDIA GPU with 16 GB+ VRAM (tested on A2000 Ada)
- 64 GB+ system RAM recommended
- WSL2 with CUDA toolkit
- Used for audio preprocessing (demucs), pseudo-labeling (Whisper large-v3), and LoRA/QLoRA fine-tuning
Fine-tuning runs on the Windows desktop and adapters transfer to the Mac for inference:
- Whisper LoRA (r=32) on 20-50 hours of church sermon audio, with accent-balanced sampling across Midwest/Scottish/British/Canadian accents
- TranslateGemma QLoRA (r=16, 4-bit NF4) on ~155K biblical verse pairs (public domain) for Spanish; Hindi and Chinese adapters planned (~155K-310K pairs each)
- MarianMT full fine-tune as a lightweight fallback (298 MB)
Training data: church audio via yt-dlp + Bible parallel corpus (KJV/ASV/WEB/BBE/YLT paired with RVR1909). Accent-diverse audio is tagged via the `--accent` flag and balanced with temperature-based sampling. See `CLAUDE.md` for the full architecture, training strategy, and compute timeline.
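Temperature-based sampling flattens raw accent counts `n_i` to weights `n_i**(1/T)`, so `T=1` reproduces the natural distribution and larger `T` pushes toward uniform. A minimal sketch, with illustrative hour counts:

```python
def sampling_weights(counts, temperature=3.0):
    """Return normalized per-accent sampling probabilities."""
    flattened = {k: v ** (1.0 / temperature) for k, v in counts.items()}
    total = sum(flattened.values())
    return {k: v / total for k, v in flattened.items()}


# Hypothetical per-accent hours of church audio:
hours = {"midwest": 27.0, "scottish": 1.0, "british": 8.0, "canadian": 8.0}
probs = sampling_weights(hours)
# The rare Scottish bucket is upweighted well above its 1/44 natural share.
```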
```
├── dry_run_ab.py                 # Main pipeline: mic → VAD → STT → translate → WebSocket + HTTP
├── settings.py                   # Unified pydantic-settings config (STARK_ env prefix, .env support)
├── setup_models.py               # One-command model download + verification
├── build_glossary.py             # EN→ES theological glossary (229 terms)
├── download_sermons.py           # yt-dlp sermon downloader
├── requirements-mac.txt          # Mac/MLX pip dependencies
├── requirements-nvidia.txt       # NVIDIA/CUDA inference dependencies
│
├── engines/                      # STT + translation + TTS engine abstraction (MLX + CUDA)
│   ├── base.py                   # ABCs: STTEngine, TranslationEngine, TTSEngine, result dataclasses
│   ├── mlx_engine.py             # MLXWhisperEngine, MLXGemmaEngine, MarianEngine, PiperTTSEngine
│   ├── cuda_engine.py            # FasterWhisperEngine, CUDAGemmaEngine
│   ├── factory.py                # create_stt_engine(), create_translation_engine(), create_tts_engine()
│   └── active_learning.py        # Fallback event JSONL logger
│
├── displays/
│   ├── audience_display.html     # Projector display (EN/ES side-by-side, QR overlay)
│   ├── ab_display.html           # A/B/C operator comparison display
│   ├── mobile_display.html       # Phone/tablet responsive display
│   ├── church_display.html       # Simplified church layout
│   └── obs_overlay.html          # Transparent overlay for OBS Studio
│
├── training/                     # Windows/WSL training scripts (CUDA)
│   ├── preprocess_audio.py       # 10-step audio cleaning pipeline (accent-aware)
│   ├── transcribe_church.py      # Whisper large-v3 pseudo-labeling
│   ├── prepare_bible_corpus.py   # Bible verse pair alignment
│   ├── prepare_whisper_dataset.py # Accent-balanced audiofolder builder
│   ├── prepare_piper_dataset.py  # LJSpeech format conversion for Piper TTS
│   ├── train_whisper.py          # Whisper LoRA fine-tuning (accent-balanced + per-accent WER)
│   ├── train_gemma.py            # TranslateGemma QLoRA fine-tuning
│   ├── train_marian.py           # MarianMT full fine-tune
│   ├── train_piper.py            # Piper TTS voice fine-tuning
│   ├── export_piper_onnx.py      # Piper TTS model export to ONNX
│   ├── evaluate_translation.py   # SacreBLEU/chrF++/COMET scoring
│   ├── evaluate_piper.py         # Piper TTS quality assessment
│   └── assess_quality.py         # Baseline WER assessment
│
├── tools/                        # Mac benchmarking & monitoring
│   ├── live_caption_monitor.py   # YouTube caption comparison (post/live/trend)
│   ├── translation_qe.py         # Reference-free translation QE
│   ├── benchmark_latency.py      # End-to-end latency profiling
│   ├── stt_benchmark.py          # STT-only benchmarking
│   ├── roundtrip_test.py         # End-to-end STT + translation roundtrip quality test
│   ├── validate_session.py       # Post-session validation vs YouTube captions
│   ├── prepare_finetune_data.py  # Fine-tuning data export from live sessions
│   ├── download_roundtrip_texts.py # Download test texts for roundtrip testing
│   ├── convert_models_to_both.py # Model format conversion (MLX ↔ CUDA)
│   └── test_adaptive_model.py    # Adaptive model selection testing
│
├── features/                     # Standalone future features
│   ├── diarize.py                # Speaker diarization (pyannote-audio)
│   ├── extract_verses.py         # Bible verse reference extraction
│   └── summarize_sermon.py       # Post-sermon 5-sentence summary
│
├── docs/
│   ├── immediate_todo.md         # Live demo session notes + session issues
│   ├── todo.md                   # Phased task list
│   ├── release_plan.md           # Release planning
│   ├── optimized.md              # NVIDIA C++ inference optimization plan
│   ├── deploy.md                 # Automated adapter deployment system plan
│   ├── training_plan.md          # Full training schedule + go/no-go gates
│   ├── roadmap.md                # Mac → Windows → RTX 2070 deployment roadmap
│   ├── accent_tuning_plan.md     # 4-week accent-diverse STT tuning plan
│   ├── macos_libomp_fix.md       # libomp conflict diagnosis + fix
│   └── ...                       # 18 docs total
│
├── stark_data/                   # Church audio + transcripts + corrections
├── bible_data/                   # Biblical parallel text corpus (155K pairs)
└── metrics/                      # CSV logs, diagnostics JSONL, hardware profiles
```
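The engine abstraction under `engines/` can be sketched minimally. The class and factory names appear in the tree above; the `translate` method signature and the `MarianEngine` behavior shown here are assumptions, not the repo's actual implementation.

```python
from abc import ABC, abstractmethod


class TranslationEngine(ABC):
    """Minimal shape of the engines/base.py abstraction (illustrative)."""

    @abstractmethod
    def translate(self, text: str, src: str, tgt: str) -> str: ...


class MarianEngine(TranslationEngine):
    def translate(self, text: str, src: str, tgt: str) -> str:
        return f"[marian {src}->{tgt}] {text}"  # stand-in for real inference


def create_translation_engine(backend: str) -> TranslationEngine:
    """Sketch of factory.py dispatch; the real factory also handles
    MLX/CUDA engines and model loading."""
    if backend == "cpu":
        return MarianEngine()
    raise ValueError(f"unsupported backend in this sketch: {backend}")


engine = create_translation_engine("cpu")
print(engine.translate("Grace and peace", "en", "es"))
```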
| Doc | Contents |
|---|---|
| `CLAUDE.md` | Project overview, 6-layer architecture summary, CI/CD, phase checklist |
| `CLAUDE-macbook.md` | Mac inference environment setup |
| `CLAUDE-windows.md` | Windows/WSL training environment setup |
| `docs/optimized.md` | NVIDIA C++ inference optimization plan (llama.cpp / exllamav2) |
| `docs/deploy.md` | Automated adapter deployment, health checks, hot-reload, rollback |
| `docs/immediate_todo.md` | Live demo session notes and immediate action items |
| `docs/roadmap.md` | Full project roadmap: Mac → training → RTX 2070 deployment |
| `docs/training_plan.md` | Training schedule, data sources, go/no-go gates |
| `docs/accent_tuning_plan.md` | 4-week accent-diverse STT tuning plan (code complete) |
| `docs/multi_lingual.md` | Hindi & Chinese actionable todo list |
| `docs/macos_libomp_fix.md` | macOS libomp conflict diagnosis and fix |
| `docs/todo.md` | Phased task list |
| `engines/CLAUDE.md` | Engine layer: MLX thread safety, model IDs, confidence thresholds, critical fixes |
| `training/CLAUDE.md` | Fine-tuning: audio preprocessing, Bible corpus, LoRA/QLoRA configs, compute timeline |
| `tools/CLAUDE.md` | Monitoring: YouTube comparison, text-anchor alignment, translation QE tiers |
| `displays/CLAUDE.md` | Browser displays: WebSocket protocol, HTTP serving, display modes |
| `features/CLAUDE.md` | Post-processing: diarization, sermon summary, verse extraction |
What's done:

- Bidirectional language support: `--lang en` (EN→ES) and `--lang es` (ES→EN) with automatic model selection
- Wholesale swap to Whisper Large-V3-Turbo (both partials and finals)
- `engines/` package: MLX + CUDA engine implementations with factory auto-detection
- `settings.py`: pydantic-settings unified config (`STARK_` env prefix, `.env` support)
- Backend selection (`--backend auto|mlx|cuda`) with CUDA fallback paths
- STT fallback logic (lazy-load fallback model on low confidence / hallucination)
- Piper TTS engine integration (`--tts` flag, WebSocket + WAV output, EN + ES voices)
- Validation pipeline (`tools/validate_session.py`) with text-based anchor alignment (19.6% WER on live session)
- Roundtrip quality test: STT WER 3.8-8.8%, roundtrip WER ~56%, ~134ms/word (`tools/roundtrip_test.py`)
- Fine-tuning data prep tools (review queue + dataset export)
- Piper TTS training scripts: dataset prep, training, ONNX export, evaluation
- CI/CD pipeline: 7 GitHub Actions workflows (lint, test, security, release, label, commitlint, stale) + Codecov
- 600+ tests with coverage threshold (≥18%), pre-commit hooks, CalVer versioning
- Dependabot for automated dependency updates
- Structured logging (`--log-level`, session log files, VAD event logging)
- Rolling session stats (5-min averages broadcast to operator displays)
- Periodic GPU warmup during sustained silence
- New CSV columns: `silence_delay_ms`, `queue_wait_ms`, `partial_stt_ms`
- Reduced silence trigger (0.8s→0.5s) and partial interval (1.0s→0.6s)
What's next:

- NVIDIA C++ inference optimization — llama.cpp/exllamav2 for sub-1s translation (plan)
- Automated adapter deployment with health checks and rollback (plan)
- Fine-tune Whisper + TranslateGemma on church audio (Phase 2-6)
- Active learning feedback loop: flag → correct → retrain
- Hindi & Chinese translation adapters
- See `docs/todo.md` for the full task list
Private project. All Bible translation training data uses public domain or CC-licensed sources only.