Live bidirectional (English↔Spanish) speech-to-text and translation for church outreach at Stark Road Gospel Hall, Farmington Hills, MI.
Real-time mic input, fully on-device transcription and translation, displayed in browser. Supports both English and Spanish speakers via --lang flag. Uses a two-pass pipeline for fast partials and high-quality finals, with A/B model comparison for translation quality analysis.
Two-Pass Pipeline
================
```
Mic (48kHz) ──> Resample 16kHz (<1ms) ──> Silero VAD (<1ms) ──┐
                                                              │
      ┌───────────────────────────────────────────────────────┘
      │
      ├─ PARTIAL (every 0.6s of new speech, while speaker is talking)
      │    Whisper Large-V3-Turbo STT (~500ms)
      │    MarianMT EN↔ES PyTorch (~250ms)           ← italic in UI
      │    Total: ~750ms
      │
      └─ FINAL (on 0.5s silence gap or 8s max utterance)
           Whisper Large-V3-Turbo STT (~500ms)
           TranslateGemma 4B EN↔ES (~550ms)          ← replaces partial
           ├─ Piper TTS (~40ms/word EN, --tts)       ← audio output
           TranslateGemma 12B EN↔ES (~2.1s, --ab)    ← side-by-side
           Total: ~1.1s (4B) / ~2.6s (A/B sequential)

      Pipeline overlap: translation runs on utterance N
      while STT runs on utterance N+1, hiding translation latency.
                      │
                      ▼
         WebSocket (0.0.0.0:8765)
         HTTP      (0.0.0.0:8080)
                      │
      ┌───────────┬───┴─────────┬─────────────┐
      ▼           ▼             ▼             ▼
  Audience      A/B/C         Mobile        CSV +
  Display       Compare       Display       Diagnostics
  (projector)   (operator)    (QR code)     (JSONL)
```
Inference runs on Apple Silicon (MLX) or NVIDIA GPUs (CUDA). No cloud APIs, no internet required at runtime.
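Backend auto-detection can be sketched as below. This is a simplified illustration of the `--backend auto` idea, not the repo's actual factory code; the `detect_backend` name and exact probes are assumptions.

```python
import platform


def detect_backend() -> str:
    """Pick an inference backend (sketch of `--backend auto`).

    Assumes: MLX on Apple Silicon, CUDA when torch sees a GPU, else CPU.
    """
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        return "mlx"  # Apple Silicon -> Metal via MLX
    try:
        import torch  # only imported for the CUDA probe

        if torch.cuda.is_available():
            return "cuda"
    except ImportError:
        pass
    return "cpu"  # last-resort fallback
```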
| Display | File | Purpose |
|---|---|---|
| Audience | `displays/audience_display.html` | Projector-friendly side-by-side EN/ES with fading context, fullscreen toggle, QR code overlay for phones |
| A/B/C Comparison | `displays/ab_display.html` | Operator view showing Gemma 4B / MarianMT / 12B side-by-side with latency stats |
| Mobile | `displays/mobile_display.html` | Responsive phone/tablet view with model toggle and Spanish-only mode, accessible via LAN |
| Church | `displays/church_display.html` | Simplified church-oriented layout |
| OBS Overlay | `displays/obs_overlay.html` | Transparent overlay for OBS Studio / streaming integration |
Phones connect by scanning the QR code on the audience display or navigating to `http://<LAN-IP>:8080/displays/mobile_display.html`.
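Discovering which LAN address to print for phones can be sketched with a standard UDP-socket trick; this is an illustrative helper (the `lan_url` name is an assumption), not code from the repo.

```python
import socket


def lan_url(port: int = 8080) -> str:
    """Best-effort LAN address for the mobile display URL.

    Opens a UDP socket toward a public IP (no packets are actually sent)
    just to learn which local interface the OS would route through.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        s.connect(("8.8.8.8", 80))  # no traffic; just selects an interface
        ip = s.getsockname()[0]
    except OSError:
        ip = "127.0.0.1"  # offline fallback
    finally:
        s.close()
    return f"http://{ip}:{port}/displays/mobile_display.html"
```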
```bash
# Prerequisites
brew install ffmpeg portaudio

# Create env
python3.11 -m venv stt_env
source stt_env/bin/activate
pip install -r requirements-mac.txt

# HuggingFace login (required for TranslateGemma)
huggingface-cli login

# Download all models
python setup_models.py

# Run — 4B only (default, ~4.3 GB RAM)
python dry_run_ab.py

# Run — A/B mode with both 4B and 12B (~11.3 GB RAM)
python dry_run_ab.py --ab

# Run — Spanish speaker mode (ES→EN translation)
python dry_run_ab.py --lang es

# Open displays in browser
open displays/audience_display.html
open displays/ab_display.html
```

```bash
# NVIDIA/CUDA setup (RTX 2070 or similar)
pip install -r requirements-nvidia.txt
python dry_run_ab.py --backend=cuda --no-ab
```

| Flag | Default | Description |
|---|---|---|
| `--lang` | `en` | Source language: `en` (English→Spanish) or `es` (Spanish→English) |
| `--ab` | off | Load both 4B and 12B for A/B comparison |
| `--backend` | `auto` | Inference backend: `auto`, `mlx`, `cuda`, `cpu` |
| `--no-ab` | off | Skip 12B model (for low-VRAM devices) |
| `--low-vram` | off | Marian-only translation mode |
| `--dry-run-text` | off | Test with text input instead of mic |
| `--http-port` | 8080 | HTTP server port (serves display pages to phones over LAN) |
| `--ws-port` | 8765 | WebSocket server port |
| `--vad-threshold` | 0.3 | VAD speech detection sensitivity (0-1) |
| `--gain` | auto | Mic gain multiplier (auto-calibrates by default) |
| `--device` | auto | Audio input device index |
| `--chunk-duration` | 2.0 | Seconds of speech to accumulate |
| `--tts` | off | Enable Piper TTS audio synthesis for translated text |
| `--tts-output` | `ws` | TTS output mode: `ws` (WebSocket), `wav` (file), `both` |
| `--log-level` | WARNING | Logging level: DEBUG, INFO, WARNING, ERROR |
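The interaction between `--vad-threshold` and the 0.5s silence gap can be sketched in plain Python. The `probs` input stands in for per-chunk Silero VAD speech probabilities (the real pipeline gets them from the model); the function name and chunk size are illustrative.

```python
def segment_speech(probs, threshold=0.3, chunk_ms=32, silence_ms=500):
    """Yield (start_idx, end_idx) utterance spans from VAD probabilities.

    An utterance ends once `silence_ms` of consecutive sub-threshold
    chunks accumulates -- the trigger for a FINAL pass.
    """
    spans, start, silent = [], None, 0
    for i, p in enumerate(probs):
        if p >= threshold:
            if start is None:
                start = i  # speech onset
            silent = 0
        elif start is not None:
            silent += chunk_ms
            if silent >= silence_ms:
                spans.append((start, i))  # silence gap: finalize utterance
                start, silent = None, 0
    if start is not None:
        spans.append((start, len(probs)))  # flush trailing speech
    return spans
```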
```bash
# Run full test suite (600+ tests, no GPU required)
pytest tests/ -v

# With coverage report
pytest tests/ -v --cov=engines --cov=tools --cov=features --cov-report=term-missing

# Lint
ruff check . && ruff format --check .
mypy engines/ settings.py
```

Tests run on CI (Ubuntu, Python 3.11 + 3.12) without GPU or model downloads. Heavy ML dependencies are mocked.
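The mocking trick the CI relies on can be sketched as follows: heavy ML packages are replaced with stubs in `sys.modules` before the code under test imports them. The stubbed attribute here is illustrative.

```python
import sys
from unittest import mock

# Build a stub that answers the one call the pipeline makes at import time.
fake_torch = mock.MagicMock()
fake_torch.cuda.is_available.return_value = False

# Inside this context, `import torch` resolves to the stub instead of the
# real multi-GB package, so tests run on a GPU-less CI runner.
with mock.patch.dict(sys.modules, {"torch": fake_torch}):
    import torch

    gpu = torch.cuda.is_available()

print(gpu)  # False on the stub, regardless of host hardware
```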
| Component | Model | Framework | Size | Latency (typical) |
|---|---|---|---|---|
| VAD | Silero VAD | PyTorch | ~2 MB | <1ms |
| STT | Whisper Large-V3-Turbo | mlx-whisper / faster-whisper | ~1.5 GB | ~500ms |
| Translate (partials) | MarianMT opus-mt-en-es / es-en | PyTorch (CPU) | ~298 MB | ~250ms |
| Translate A (finals) | TranslateGemma 4B 4-bit | mlx-lm | ~2.5 GB | ~550ms |
| Translate B (finals) | TranslateGemma 12B 4-bit | mlx-lm | ~7 GB | ~2.1s |
| TTS (EN) | Piper en_US-lessac-high | ONNX Runtime | ~63 MB | ~40ms/word |
| TTS (ES) | Piper es_MX-claude-high | ONNX Runtime | ~63 MB | ~8ms/word |
CUDA variants: bitsandbytes 4-bit for TranslateGemma, faster-whisper INT8 for STT.
Pipeline overlap (P7-6C) hides translation latency by running translation on utterance N while STT processes utterance N+1.
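The overlap scheme can be sketched with `asyncio` and `run_in_executor` (which the A/B path also uses, per the feature list). Function names and the tiny sleeps standing in for model latencies are illustrative.

```python
import asyncio
import time


def stt(chunk: str) -> str:
    time.sleep(0.05)  # stand-in for ~500ms Whisper decode
    return f"text({chunk})"


def translate(text: str) -> str:
    time.sleep(0.05)  # stand-in for ~550ms TranslateGemma
    return f"es({text})"


async def pipeline(chunks):
    """Kick off translation of utterance N in a worker thread, then
    immediately start STT on utterance N+1 -- the two overlap in time."""
    loop = asyncio.get_running_loop()
    pending, out = None, []
    for chunk in chunks:
        text = await loop.run_in_executor(None, stt, chunk)
        if pending is not None:
            out.append(await pending)  # collect the previous translation
        pending = loop.run_in_executor(None, translate, text)
    out.append(await pending)
    return out


results = asyncio.run(pipeline(["u1", "u2"]))
print(results)  # ['es(text(u1))', 'es(text(u2))']
```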
- Bidirectional language support -- `--lang en` (English→Spanish, default) or `--lang es` (Spanish→English) with automatic model selection
- Two-pass STT pipeline -- fast italic partials (MarianMT, ~750ms) replaced by high-quality finals (TranslateGemma, ~1.1s) on silence detection
- Pipeline overlap -- translation runs concurrently with next utterance's STT, hiding translation latency
- A/B translation comparison -- 4B and 12B TranslateGemma run in parallel via `run_in_executor`, logged to CSV
- Theological Whisper prompt -- biases STT toward church vocabulary (atonement, propitiation, mediator, etc.) to reduce homophone errors
- Previous-text context -- last transcription fed to Whisper for cross-chunk accuracy
- Profanity filter -- allows biblical terms (e.g., "wrath") that generic filters block
- Speculative decoding support -- 4B model drafts tokens for 12B verification (`--num-draft-tokens`)
- Confidence scoring -- segment-level `avg_logprob` mapped to green/yellow/red indicators
- Translation QE -- length ratio + untranslated content detection per chunk
- Hallucination detection -- flags segments with `compression_ratio > 2.4`
- Word-level timestamps -- per-word confidence logged for fine-tuning prioritization
- Automated diagnostics -- homophone flags, bad sentence splits, Marian/Gemma divergence tracking
- Per-chunk audio saving -- WAV files saved to `stark_data/live_sessions/` for Whisper fine-tuning
- Structured review queue -- JSONL diagnostics with priority scoring for active learning
- Hardware profiling -- per-session CPU/RAM/GPU snapshots for portability planning
- LAN serving -- HTTP server + WebSocket on `0.0.0.0` so phones connect over local network
- 229-term theological glossary -- covers 66 books, 31 proper names, theological concepts, liturgical terms
- Piper TTS audio synthesis -- `--tts` flag enables text-to-speech for translated text (EN + ES voices, ONNX, thread-safe)
- End-to-end roundtrip quality testing -- `tools/roundtrip_test.py` measures STT WER and roundtrip translation accuracy
- Post-session validation pipeline -- YouTube WER comparison with text-anchor alignment (`tools/validate_session.py`)
- Dual-target inference -- runs on Apple Silicon (MLX) or NVIDIA GPUs (CUDA) from a single codebase
- Unified configuration -- pydantic-settings with `STARK_` env prefix, `.env` file support
- STT fallback -- automatic retry with fallback model on low-confidence or hallucinated segments
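The `STARK_` env-prefix convention behind the unified configuration can be illustrated with the stdlib alone; the repo itself uses pydantic-settings, which adds validation and `.env`-file loading on top of this idea. Field names below mirror the CLI flags but are assumptions.

```python
import os
from dataclasses import dataclass, fields


@dataclass
class Settings:
    """Stdlib-only sketch of env-prefixed configuration."""

    ws_port: int = 8765
    http_port: int = 8080
    vad_threshold: float = 0.3

    @classmethod
    def from_env(cls, prefix: str = "STARK_") -> "Settings":
        kwargs = {}
        for f in fields(cls):
            raw = os.environ.get(prefix + f.name.upper())
            if raw is not None:
                kwargs[f.name] = f.type(raw)  # coerce, e.g. int("9000")
        return cls(**kwargs)


os.environ["STARK_WS_PORT"] = "9000"
cfg = Settings.from_env()
print(cfg.ws_port)  # 9000 (overridden); other fields keep their defaults
```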
- Apple Silicon (M1/M2/M3/M4)
- 8 GB+ unified memory for 4B-only mode (~4.3 GB used)
- 18 GB+ unified memory for A/B mode (~11.3 GB used)
- Python 3.11, macOS with Metal support
- NVIDIA GPU with 6 GB+ VRAM
- CUDA 12.x toolkit
- Python 3.11, Windows or Linux
```bash
pip install -r requirements-nvidia.txt
```
- NVIDIA GPU with 16 GB+ VRAM (tested on A2000 Ada)
- 64 GB+ system RAM recommended
- WSL2 with CUDA toolkit
- Used for audio preprocessing (demucs), pseudo-labeling (Whisper large-v3), and LoRA/QLoRA fine-tuning
Fine-tuning runs on the Windows desktop and adapters transfer to the Mac for inference:
- Whisper LoRA (r=32) on 20-50 hours of church sermon audio, with accent-balanced sampling across Midwest/Scottish/British/Canadian accents
- TranslateGemma QLoRA (r=16, 4-bit NF4) on ~155K biblical verse pairs (public domain) for Spanish; Hindi and Chinese adapters planned (~155K-310K pairs each)
- MarianMT full fine-tune as a lightweight fallback (298 MB)
Training data: church audio via yt-dlp + Bible parallel corpus (KJV/ASV/WEB/BBE/YLT paired with RVR1909). Accent-diverse audio is tagged via the `--accent` flag and balanced with temperature-based sampling. See `CLAUDE.md` for the full architecture, training strategy, and compute timeline.
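Temperature-based sampling flattens raw accent counts `n_i` to weights `n_i**(1/T)`, so `T=1` reproduces the natural distribution and larger `T` pushes toward uniform. A minimal sketch, with illustrative hour counts:

```python
def sampling_weights(counts, temperature=3.0):
    """Return normalized per-accent sampling probabilities."""
    flattened = {k: v ** (1.0 / temperature) for k, v in counts.items()}
    total = sum(flattened.values())
    return {k: v / total for k, v in flattened.items()}


# Hypothetical per-accent hours of church audio:
hours = {"midwest": 27.0, "scottish": 1.0, "british": 8.0, "canadian": 8.0}
probs = sampling_weights(hours)
# The rare Scottish bucket is upweighted well above its 1/44 natural share.
```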
```
├── dry_run_ab.py                 # Main pipeline: mic → VAD → STT → translate → WebSocket + HTTP
├── settings.py                   # Unified pydantic-settings config (STARK_ env prefix, .env support)
├── setup_models.py               # One-command model download + verification
├── build_glossary.py             # EN→ES theological glossary (229 terms)
├── download_sermons.py           # yt-dlp sermon downloader
├── requirements-mac.txt          # Mac/MLX pip dependencies
├── requirements-nvidia.txt       # NVIDIA/CUDA inference dependencies
│
├── engines/                      # STT + translation + TTS engine abstraction (MLX + CUDA)
│   ├── base.py                   # ABCs: STTEngine, TranslationEngine, TTSEngine, result dataclasses
│   ├── mlx_engine.py             # MLXWhisperEngine, MLXGemmaEngine, MarianEngine, PiperTTSEngine
│   ├── cuda_engine.py            # FasterWhisperEngine, CUDAGemmaEngine
│   ├── factory.py                # create_stt_engine(), create_translation_engine(), create_tts_engine()
│   └── active_learning.py        # Fallback event JSONL logger
│
├── displays/
│   ├── audience_display.html     # Projector display (EN/ES side-by-side, QR overlay)
│   ├── ab_display.html           # A/B/C operator comparison display
│   ├── mobile_display.html       # Phone/tablet responsive display
│   ├── church_display.html       # Simplified church layout
│   └── obs_overlay.html          # Transparent overlay for OBS Studio
│
├── training/                     # Windows/WSL training scripts (CUDA)
│   ├── preprocess_audio.py       # 10-step audio cleaning pipeline (accent-aware)
│   ├── transcribe_church.py      # Whisper large-v3 pseudo-labeling
│   ├── prepare_bible_corpus.py   # Bible verse pair alignment
│   ├── prepare_whisper_dataset.py # Accent-balanced audiofolder builder
│   ├── prepare_piper_dataset.py  # LJSpeech format conversion for Piper TTS
│   ├── train_whisper.py          # Whisper LoRA fine-tuning (accent-balanced + per-accent WER)
│   ├── train_gemma.py            # TranslateGemma QLoRA fine-tuning
│   ├── train_marian.py           # MarianMT full fine-tune
│   ├── train_piper.py            # Piper TTS voice fine-tuning
│   ├── export_piper_onnx.py      # Piper TTS model export to ONNX
│   ├── evaluate_translation.py   # SacreBLEU/chrF++/COMET scoring
│   ├── evaluate_piper.py         # Piper TTS quality assessment
│   └── assess_quality.py         # Baseline WER assessment
│
├── tools/                        # Mac benchmarking & monitoring
│   ├── live_caption_monitor.py   # YouTube caption comparison (post/live/trend)
│   ├── translation_qe.py         # Reference-free translation QE
│   ├── benchmark_latency.py      # End-to-end latency profiling
│   ├── stt_benchmark.py          # STT-only benchmarking
│   ├── roundtrip_test.py         # End-to-end STT + translation roundtrip quality test
│   ├── validate_session.py       # Post-session validation vs YouTube captions
│   ├── prepare_finetune_data.py  # Fine-tuning data export from live sessions
│   ├── download_roundtrip_texts.py # Download test texts for roundtrip testing
│   ├── convert_models_to_both.py # Model format conversion (MLX ↔ CUDA)
│   └── test_adaptive_model.py    # Adaptive model selection testing
│
├── features/                     # Standalone future features
│   ├── diarize.py                # Speaker diarization (pyannote-audio)
│   ├── extract_verses.py         # Bible verse reference extraction
│   └── summarize_sermon.py       # Post-sermon 5-sentence summary
│
├── docs/
│   ├── immediate_todo.md         # Live demo session notes + session issues
│   ├── todo.md                   # Phased task list
│   ├── release_plan.md           # Release planning
│   ├── optimized.md              # NVIDIA C++ inference optimization plan
│   ├── deploy.md                 # Automated adapter deployment system plan
│   ├── training_plan.md          # Full training schedule + go/no-go gates
│   ├── roadmap.md                # Mac → Windows → RTX 2070 deployment roadmap
│   ├── accent_tuning_plan.md     # 4-week accent-diverse STT tuning plan
│   ├── macos_libomp_fix.md       # libomp conflict diagnosis + fix
│   └── ...                       # 18 docs total
│
├── stark_data/                   # Church audio + transcripts + corrections
├── bible_data/                   # Biblical parallel text corpus (155K pairs)
└── metrics/                      # CSV logs, diagnostics JSONL, hardware profiles
```
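The engine abstraction under `engines/` can be sketched minimally. The class and factory names appear in the tree above; the `translate` method signature and the `MarianEngine` behavior shown here are assumptions, not the repo's actual implementation.

```python
from abc import ABC, abstractmethod


class TranslationEngine(ABC):
    """Minimal shape of the engines/base.py abstraction (illustrative)."""

    @abstractmethod
    def translate(self, text: str, src: str, tgt: str) -> str: ...


class MarianEngine(TranslationEngine):
    def translate(self, text: str, src: str, tgt: str) -> str:
        return f"[marian {src}->{tgt}] {text}"  # stand-in for real inference


def create_translation_engine(backend: str) -> TranslationEngine:
    """Sketch of factory.py dispatch; the real factory also handles
    MLX/CUDA engines and model loading."""
    if backend == "cpu":
        return MarianEngine()
    raise ValueError(f"unsupported backend in this sketch: {backend}")


engine = create_translation_engine("cpu")
print(engine.translate("Grace and peace", "en", "es"))
```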
| Doc | Contents |
|---|---|
| `CLAUDE.md` | Project overview, 6-layer architecture summary, CI/CD, phase checklist |
| `CLAUDE-macbook.md` | Mac inference environment setup |
| `CLAUDE-windows.md` | Windows/WSL training environment setup |
| `docs/optimized.md` | NVIDIA C++ inference optimization plan (llama.cpp / exllamav2) |
| `docs/deploy.md` | Automated adapter deployment, health checks, hot-reload, rollback |
| `docs/immediate_todo.md` | Live demo session notes and immediate action items |
| `docs/roadmap.md` | Full project roadmap: Mac → training → RTX 2070 deployment |
| `docs/training_plan.md` | Training schedule, data sources, go/no-go gates |
| `docs/accent_tuning_plan.md` | 4-week accent-diverse STT tuning plan (code complete) |
| `docs/multi_lingual.md` | Hindi & Chinese actionable todo list |
| `docs/macos_libomp_fix.md` | macOS libomp conflict diagnosis and fix |
| `docs/todo.md` | Phased task list |
| `engines/CLAUDE.md` | Engine layer: MLX thread safety, model IDs, confidence thresholds, critical fixes |
| `training/CLAUDE.md` | Fine-tuning: audio preprocessing, Bible corpus, LoRA/QLoRA configs, compute timeline |
| `tools/CLAUDE.md` | Monitoring: YouTube comparison, text-anchor alignment, translation QE tiers |
| `displays/CLAUDE.md` | Browser displays: WebSocket protocol, HTTP serving, display modes |
| `features/CLAUDE.md` | Post-processing: diarization, sermon summary, verse extraction |
What's done:

- Bidirectional language support: `--lang en` (EN→ES) and `--lang es` (ES→EN) with automatic model selection
- Wholesale swap to Whisper Large-V3-Turbo (both partials and finals)
- `engines/` package: MLX + CUDA engine implementations with factory auto-detection
- `settings.py`: pydantic-settings unified config (`STARK_` env prefix, `.env` support)
- Backend selection (`--backend auto|mlx|cuda`) with CUDA fallback paths
- STT fallback logic (lazy-load fallback model on low confidence / hallucination)
- Piper TTS engine integration (`--tts` flag, WebSocket + WAV output, EN + ES voices)
- Validation pipeline (`tools/validate_session.py`) with text-based anchor alignment (19.6% WER on live session)
- Roundtrip quality test: STT WER 3.8-8.8%, roundtrip WER ~56%, ~134ms/word (`tools/roundtrip_test.py`)
- Fine-tuning data prep tools (review queue + dataset export)
- Piper TTS training scripts: dataset prep, training, ONNX export, evaluation
- CI/CD pipeline: 7 GitHub Actions workflows (lint, test, security, release, label, commitlint, stale) + Codecov
- 600+ tests with coverage threshold (≥18%), pre-commit hooks, CalVer versioning
- Dependabot for automated dependency updates
- Structured logging (`--log-level`, session log files, VAD event logging)
- Rolling session stats (5-min averages broadcast to operator displays)
- Periodic GPU warmup during sustained silence
- New CSV columns: `silence_delay_ms`, `queue_wait_ms`, `partial_stt_ms`
- Reduced silence trigger (0.8s→0.5s) and partial interval (1.0s→0.6s)
What's next:

- NVIDIA C++ inference optimization — llama.cpp/exllamav2 for sub-1s translation (plan)
- Automated adapter deployment with health checks and rollback (plan)
- Fine-tune Whisper + TranslateGemma on church audio (Phase 2-6)
- Active learning feedback loop: flag → correct → retrain
- Hindi & Chinese translation adapters
- See `docs/todo.md` for the full task list
Private project. All Bible translation training data uses public domain or CC-licensed sources only.