- System Overview
- Architecture
- Component Details
- Data Flow
- Technical Specifications
- Configuration
- Deployment
- Monitoring and Observability
- Performance Characteristics
- Troubleshooting
This is a horizontally scalable, Dockerized microservice pipeline for real-time speech transcription and translation. The system processes live audio streams through multiple worker services, providing low-latency speech-to-text transcription with optional translation capabilities.
- Real-time Processing: Sub-500ms end-to-end latency for speech transcription
- Horizontal Scaling: Independent scaling of STT, translation, and gateway services
- Dual VAD System: WebRTC + Silero VAD for robust speech detection
- Multi-language Support: 15+ languages with automatic language detection
- GPU Acceleration: CUDA support for Faster-Whisper transcription
- Event-Driven Architecture: Redis Streams + Consumer Groups for reliable job distribution
- Session Persistence: Redis-backed session state for gateway horizontal scaling
- Health Monitoring: Comprehensive metrics and health checks for all services
```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Web Clients   │─────│     Gateway     │─────│      Redis      │
│   (WebSocket)   │     │   (WebSocket)   │     │ (Streams/Queue) │
└─────────────────┘     └─────────────────┘     └─────────────────┘
                                 │                       │
                                 │                       │
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   STT Workers   │─────│      Redis      │─────│   Translation   │
│ (Faster-Whisper)│     │    (Pub/Sub)    │     │     Workers     │
└─────────────────┘     └─────────────────┘     └─────────────────┘
                                 │
                                 │
                        ┌─────────────────┐
                        │   Results to    │
                        │     Clients     │
                        └─────────────────┘
```
Gateway Service
- Purpose: WebSocket server handling client connections and audio streaming
- Technology: Python asyncio, websockets library
- Key Features:
- Dual Voice Activity Detection (WebRTC + Silero VAD)
- Session state persistence in Redis
- Audio chunking and streaming to workers
- Flow control with per-client job-in-flight tracking
- Language settings management
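As an illustration of the gateway's entry point, here is a minimal asyncio WebSocket server sketch using the `websockets` library named above; the handler body, client-ID scheme, and port are placeholders, not the project's actual code.

```python
# Minimal sketch of a gateway-style WebSocket entry point, assuming the
# asyncio `websockets` library; names and behavior are illustrative.
import asyncio
import uuid
import websockets

async def handle_client(websocket):
    client_id = str(uuid.uuid4())  # per-client session id (assumed scheme)
    try:
        async for message in websocket:
            if isinstance(message, bytes):
                # Binary frames carry audio chunks; VAD and job publishing go here.
                pass
            else:
                # Text frames carry control messages such as "set_langs".
                await websocket.send('{"type": "ack"}')
    finally:
        # Connection closed: clean up local state (session may persist in Redis).
        pass

async def main():
    async with websockets.serve(handle_client, "0.0.0.0", 5026):  # GATEWAY_PORT
        await asyncio.Future()  # run forever

if __name__ == "__main__":
    asyncio.run(main())
```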
STT Worker Service
- Purpose: Speech-to-text transcription using Faster-Whisper
- Technology: Python, Faster-Whisper, PyTorch
- Key Features:
- GPU-accelerated transcription with CUDA support
- Consumer group-based job distribution
- Base64 audio decoding and normalization
- VAD filtering and beam search optimization
Translation Worker Service
- Purpose: Text translation using EasyNMT
- Technology: Python, EasyNMT, PyTorch
- Key Features:
- ThreadPoolExecutor for memory leak prevention
- Automatic language mapping and detection
- Memory monitoring with periodic garbage collection
- Controlled thread pool to prevent resource exhaustion
Redis
- Purpose: Message queuing and state storage
- Configuration:
- Append-only file persistence
- Memory optimization with LRU eviction
- Stream node limits for performance
- Pub/Sub channel management
Client GUI
- Purpose: Desktop interface for voice typing and live subtitles
- Technology: PyQt5, pyaudio, websockets
- Features:
- Voice typing mode with automated keyboard input
- Live subtitle overlay with semi-transparent display
- Audio device selection (microphone/system audio)
- Multi-language support with real-time switching
Demo Client
- Purpose: Load testing and system validation
- Technology: Python asyncio, websockets, numpy
- Features:
- Concurrent client simulation
- Synthetic audio generation
- Performance metrics collection
- Load testing capabilities
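To give a feel for what the demo client does, the sketch below synthesizes a sine-wave chunk as 16 kHz, 16-bit PCM and streams it to the gateway; the URI and the absence of metadata headers are simplifying assumptions, not the real demo protocol.

```python
# Sketch of a load-test client sending synthetic audio, assuming the gateway
# accepts raw binary PCM frames; the real demo client also attaches metadata.
import asyncio
import numpy as np
import websockets

def synth_chunk(freq_hz: float = 440.0, seconds: float = 0.1, rate: int = 16000) -> bytes:
    t = np.arange(int(rate * seconds)) / rate
    tone = (0.3 * np.sin(2 * np.pi * freq_hz * t) * 32767).astype(np.int16)
    return tone.tobytes()

async def run_client(uri: str = "ws://localhost:5026"):
    async with websockets.connect(uri) as ws:
        for _ in range(50):              # ~5 seconds of audio
            await ws.send(synth_chunk())
            await asyncio.sleep(0.1)     # pace the stream like a live microphone

asyncio.run(run_client())
```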
GatewayService (gateway.py)
```python
class GatewayService:
    def __init__(self):
        self.redis_client = RedisClient(self.instance_id, self.logger)
        self.vad_detector = VoiceActivityDetector()
        self.audio_processor = AudioProcessor()
        self.websocket_handler = WebSocketHandler(self, self.redis_client, self.audio_processor)
        self.health_monitor = HealthMonitor(self.instance_id, self.logger, self.redis_client, self)
```

SpeechSession (session.py)
- Manages per-client speech state (INACTIVE → ACTIVE → SILENCE → INACTIVE)
- Handles audio buffer accumulation with pre-speech buffering
- Redis-backed persistence for horizontal scaling
- Base64 encoding/decoding for audio data serialization
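A minimal sketch of the Redis-backed persistence idea, assuming the redis-py client and a reduced field set; the real SpeechSession serializer may differ.

```python
# Sketch of persisting per-client session state in Redis with a 1-hour TTL,
# assuming redis-py; field names loosely mirror the dataclass shown later.
import base64
import json
from typing import Optional
import redis

r = redis.Redis.from_url("redis://localhost:6379")

def save_session(client_id: str, state: str, audio_buffer: bytes,
                 source_lang: str, target_lang: str) -> None:
    payload = {
        "state": state,
        "audio_buffer": base64.b64encode(audio_buffer).decode("ascii"),  # bytes are not JSON-safe
        "source_lang": source_lang,
        "target_lang": target_lang,
    }
    r.setex(f"session:{client_id}", 3600, json.dumps(payload))  # 1-hour expiration

def load_session(client_id: str) -> Optional[dict]:
    raw = r.get(f"session:{client_id}")
    if raw is None:
        return None
    payload = json.loads(raw)
    payload["audio_buffer"] = base64.b64decode(payload["audio_buffer"])
    return payload
```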
VoiceActivityDetector (vad.py)
- Dual VAD implementation using WebRTC + Silero models
- WebRTC VAD: Fast, lightweight speech detection (10ms frames)
- Silero VAD: Accurate neural network-based detection
- Threaded execution to prevent blocking main event loop
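To make the dual-VAD idea concrete, here is a minimal majority-voting sketch over 10 ms frames, assuming the `webrtcvad` package; the Silero check is left as an injected callable because the project's actual Silero wrapper is not shown here.

```python
# Sketch of dual-VAD voting on 10 ms frames of 16 kHz, 16-bit mono PCM.
from typing import Callable
import webrtcvad

SAMPLE_RATE = 16000
FRAME_BYTES = SAMPLE_RATE // 100 * 2   # 10 ms of 16-bit mono PCM = 320 bytes

def dual_vad_is_speech(chunk: bytes,
                       silero_is_speech: Callable[[bytes], bool],
                       aggressiveness: int = 3) -> bool:
    """Return True only when both detectors consider the chunk speech."""
    vad = webrtcvad.Vad(aggressiveness)                # WEBRTC_SENSITIVITY
    frames = [chunk[i:i + FRAME_BYTES]
              for i in range(0, len(chunk) - FRAME_BYTES + 1, FRAME_BYTES)]
    if not frames:
        return False
    webrtc_votes = sum(vad.is_speech(f, SAMPLE_RATE) for f in frames)
    webrtc_speech = webrtc_votes > len(frames) // 2    # majority of 10 ms frames
    return webrtc_speech and silero_is_speech(chunk)   # SILERO_SENSITIVITY applies inside
```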
- Audio Reception: Binary WebSocket messages with metadata headers
- Resampling: All audio normalized to 16kHz, 16-bit PCM
- Speech Detection: Dual VAD with majority voting
- Buffer Management: Rolling pre-speech buffer (1 second) + active speech accumulation
- Job Publishing: Event-driven publishing to Redis Streams (not interval-based)
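The publishing step can be sketched as a single XADD into the `audio_jobs` stream once a speech segment closes, assuming `redis.asyncio` and illustrative job field names (the actual job schema is not shown here):

```python
# Sketch of event-driven job publishing to the audio_jobs stream, assuming
# redis.asyncio; the field names are illustrative, not the exact schema.
import base64
import time
import redis.asyncio as redis

async def publish_audio_job(r: redis.Redis, client_id: str, audio: bytes,
                            source_lang: str, target_lang: str, translate: bool) -> bytes:
    job = {
        "client_id": client_id,
        "audio_b64": base64.b64encode(audio).decode("ascii"),  # workers decode base64
        "source_lang": source_lang,
        "target_lang": target_lang,
        "translation_enabled": int(translate),
        "published_at": time.time(),
    }
    # XADD appends the job; STT workers read it via the stt_workers consumer group.
    return await r.xadd("audio_jobs", {k: str(v) for k, v in job.items()})
```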
SpeechSession state fields:

```python
@dataclass
class SpeechSession:
    state: SpeechState = SpeechState.INACTIVE
    audio_buffer: bytearray = None
    pre_speech_buffer: bytearray = None
    silence_start_time: Optional[float] = None
    accumulated_audio_bytes: int = 0
    last_published_len: int = 0
    source_lang: str = "en"
    target_lang: str = "vi"
    translation_enabled: bool = True
```

Client → Gateway:
```json
{
  "type": "set_langs",
  "source_language": "en",
  "target_language": "vi"
}
```

Gateway → Client:
```json
{
  "type": "realtime",
  "text": "Hello world",
  "translation": "Xin chào thế giới",
  "segment_id": "1699123456789",
  "processing_time": 0.234
}
```

STTWorker (worker.py)
- Consumer group-based job processing with Redis Streams
- Faster-Whisper model with configurable parameters
- GPU acceleration with CUDA device selection
- Comprehensive error handling and metrics collection
```python
def transcribe_audio(self, audio_data: bytes, language: str = "", use_vad_filter: bool = True) -> Dict[str, Any]:
    # Convert bytes to a numpy array
    audio_array = np.frombuffer(audio_data, dtype=np.int16)
    audio_float = audio_array.astype(np.float32) / 32768.0

    # Normalize audio to -0.95 dBFS
    if NORMALIZE_AUDIO:
        audio_float = self.normalize_audio(audio_float)

    # Transcribe with Faster-Whisper
    segments, info = self.model.transcribe(
        audio_float,
        language=language if language else None,
        beam_size=BEAM_SIZE,
        vad_filter=use_vad_filter
    )
```

- Job Consumption: Redis Stream consumer group reading
- Audio Decoding: Base64 decoding of audio data
- Transcription: Faster-Whisper processing with GPU acceleration
- Result Publishing: Pub/Sub to client-specific channels
- Translation Triggering: Publishing to transcriptions stream when translation enabled
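Putting the consumption and publishing steps together, here is a hedged sketch of the consumer-group loop, assuming redis-py's synchronous client; the transcription call itself and the result schema are elided.

```python
# Sketch of an STT worker's read/ack loop over the audio_jobs stream.
import redis

r = redis.Redis.from_url("redis://localhost:6379")
STREAM, GROUP, CONSUMER = "audio_jobs", "stt_workers", "stt-worker-1"

# Create the consumer group once (id="0" reads the whole stream history).
try:
    r.xgroup_create(STREAM, GROUP, id="0", mkstream=True)
except redis.ResponseError:
    pass  # group already exists

while True:
    # Block up to 5 s waiting for one new job addressed to this group.
    entries = r.xreadgroup(GROUP, CONSUMER, {STREAM: ">"}, count=1, block=5000)
    for _stream, messages in entries or []:
        for message_id, fields in messages:
            client_id = fields[b"client_id"].decode()
            # ... decode base64 audio, run Faster-Whisper, build the result JSON ...
            r.publish(f"results:{client_id}", '{"type": "realtime", "text": "..."}')
            r.xack(STREAM, GROUP, message_id)  # acknowledge so the job is not redelivered
```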
TranslationWorker (worker.py)
- EasyNMT model with automatic language detection
- ThreadPoolExecutor to prevent memory leaks
- Language code mapping for compatibility
- Memory monitoring with periodic GC
```python
LANGUAGE_MAPPING = {
    "en": "english", "vi": "vietnamese", "ja": "japanese",
    "zh": "chinese", "ko": "korean", "fr": "french",
    # ... 100+ language mappings
}
```

```python
async def process_translation_job(self, job_data: Dict[str, Any]) -> Dict[str, Any]:
    # Map language codes to EasyNMT format
    mapped_source_lang = get_mapped_language(source_lang)
    mapped_target_lang = get_mapped_language(target_lang)

    # Execute translation in thread pool
    translation = await asyncio.get_event_loop().run_in_executor(
        self.executor,
        lambda: self.model.translate(text, source_lang=mapped_source_lang, target_lang=mapped_target_lang)
    )
```

- audio_jobs: Gateway → STT Workers (job queuing)
- transcriptions: STT Workers → Translation Workers (translation pipeline)
- stt_workers: Load balancing across STT worker instances
- translation_workers: Load balancing across translation worker instances
- results:{client_id}: Worker results → Gateway → Clients
- Per-client channels for secure result delivery
- session:{client_id}: Gateway session state persistence
- 1-hour expiration for cleanup
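On the gateway side, result delivery over those per-client channels can be sketched roughly as follows, assuming `redis.asyncio` pub/sub; the real gateway's forwarding code may differ.

```python
# Sketch of a gateway listening on a client's results channel and forwarding
# each message to that client's WebSocket connection.
import redis.asyncio as redis

async def forward_results(client_id: str, websocket) -> None:
    r = redis.Redis.from_url("redis://localhost:6379")
    pubsub = r.pubsub()
    await pubsub.subscribe(f"results:{client_id}")   # per-client channel
    async for message in pubsub.listen():
        if message["type"] == "message":
            # Worker results arrive as JSON strings; forward them unchanged.
            await websocket.send(message["data"].decode())
```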
1. Client Connection
   - WebSocket connection established to Gateway
   - Client ID assigned, session initialized
   - Redis pub/sub channel subscribed for results
2. Audio Streaming
   - Client sends binary audio chunks with metadata
   - Gateway resamples audio to 16kHz PCM
   - Dual VAD detects speech activity
   - Audio accumulated during active speech periods
3. Speech Detection & Job Creation
   - Speech state transitions: INACTIVE → ACTIVE → SILENCE
   - Pre-speech buffer (1s) prepended to active speech
   - Audio segments published to the Redis `audio_jobs` stream
   - Job-in-flight tracking prevents flooding
4. STT Processing
   - STT workers consume jobs via consumer groups
   - Base64 audio decoded and normalized
   - Faster-Whisper transcription with GPU acceleration
   - Results published to client-specific pub/sub channels
5. Translation Processing (if enabled)
   - Transcription results published to the `transcriptions` stream
   - Translation workers process via consumer groups
   - EasyNMT translation with language mapping
   - Translated results published to client channels
6. Result Delivery
   - Gateway receives results via pub/sub
   - Results forwarded to appropriate WebSocket clients
   - Flow control ensures proper sequencing
- Network Disconnection: Graceful cleanup, session preservation
- Worker Failure: Job retry via Redis Stream consumer groups (see the reclaim sketch after this list)
- Model Errors: Logged, job marked as failed, metrics updated
- Memory Issues: Periodic GC, thread pool limits, resource monitoring
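The worker-failure path relies on the consumer group's pending entries: jobs that were delivered but never acknowledged can be reclaimed by a healthy worker. Below is a hedged sketch, assuming redis-py and Redis ≥ 6.2 (`XAUTOCLAIM`); the idle threshold and batch size are illustrative.

```python
# Sketch of reclaiming jobs abandoned by a crashed STT worker.
import redis

r = redis.Redis.from_url("redis://localhost:6379")

def reclaim_stale_jobs(stream="audio_jobs", group="stt_workers",
                       consumer="stt-worker-1", min_idle_ms=60_000):
    # Claim entries that have been pending (delivered but unacked) for over a minute.
    result = r.xautoclaim(stream, group, consumer, min_idle_time=min_idle_ms, count=10)
    next_cursor, claimed = result[0], result[1]
    for message_id, fields in claimed:
        # Re-process the job here, then acknowledge it.
        r.xack(stream, group, message_id)
    return next_cursor
```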
| Metric | Target | Current Implementation |
|---|---|---|
| End-to-end Latency | <500ms | ~200-400ms typical |
| STT Processing | <200ms | Faster-Whisper optimized |
| Translation Processing | <100ms | EasyNMT cached models |
| Concurrent Clients | 100+ | Horizontal scaling |
| Audio Buffer Size | 10s max | Configurable limits |
Minimum:
- CPU: 4 cores (for gateway + workers)
- RAM: 4GB (base models)
- Storage: 10GB (models + logs)
- Network: 10Mbps stable connection

Recommended:
- CPU: 8+ cores
- RAM: 16GB+
- GPU: NVIDIA with 4GB+ VRAM (for acceleration)
- Storage: 50GB SSD
- Sample Rate: 16kHz (resampled if different)
- Bit Depth: 16-bit PCM
- Channels: Mono
- Format: Raw PCM or WAV container
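A rough sketch of coercing arbitrary PCM input into that format with numpy follows; the gateway's real resampler may use a higher-quality filter, so treat this as an assumption-laden illustration.

```python
# Sketch: downmix to mono and linearly resample 16-bit PCM to 16 kHz.
import numpy as np

TARGET_RATE = 16000

def to_pipeline_format(pcm: bytes, src_rate: int, channels: int) -> bytes:
    samples = np.frombuffer(pcm, dtype=np.int16).astype(np.float32)
    if channels > 1:
        samples = samples.reshape(-1, channels).mean(axis=1)      # downmix to mono
    if src_rate != TARGET_RATE:
        n_out = int(len(samples) * TARGET_RATE / src_rate)
        x_old = np.linspace(0.0, 1.0, num=len(samples), endpoint=False)
        x_new = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
        samples = np.interp(x_new, x_old, samples)                 # naive resample
    return np.clip(samples, -32768, 32767).astype(np.int16).tobytes()
```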
Transcription (Faster-Whisper):
- English, Spanish, French, German, Italian, Portuguese, Russian
- Japanese, Chinese, Korean, Arabic, Hindi, Thai
- Vietnamese, Dutch, Swedish, Czech, Polish + auto-detection

Translation (EasyNMT):
- 100+ languages via Opus-MT models
- Automatic source language detection
- Custom language code mapping system
Gateway:
```bash
GATEWAY_PORT=5026
HEALTH_PORT=8080
REDIS_URL=redis://localhost:6379
SILENCE_THRESHOLD_SECONDS=2.0
SAMPLE_RATE=16000
WEBRTC_SENSITIVITY=3
SILERO_SENSITIVITY=0.7
PRE_SPEECH_BUFFER_SECONDS=1.0
MAX_QUEUE_DEPTH=100
```

STT Worker:
```bash
MODEL_SIZE=large-v3
DEVICE=cuda
BEAM_SIZE=5
VAD_FILTER=true
NORMALIZE_AUDIO=false
HEALTH_PORT=8081
```

Translation Worker:
```bash
EASYNMT_MODEL=opus-mt
DEVICE=cuda
HEALTH_PORT=8082
```

docker-compose.yml:
```yaml
services:
  gateway:
    build: ./gateway
    ports: ["5026:5026", "8080:8080"]
    environment: {...}
    depends_on: [redis]
  stt_worker:
    build: ./stt_worker
    environment: {...}
    runtime: nvidia  # GPU support
    deploy: {replicas: 1}
  translation_worker:
    build: ./translation_worker
    environment: {...}
    deploy: {replicas: 1}
  redis:
    image: redis:7-alpine
    volumes: [redis_data:/data]
```

redis.conf:
```
# Persistence
appendonly yes
appendfilename "appendonly.aof"
auto-aof-rewrite-percentage 50
# Memory management
maxmemory 512mb
maxmemory-policy allkeys-lru
# Performance optimization
stream-node-max-bytes 4096
stream-node-max-entries 100
activedefrag yes
```

Quick start:
```bash
# Clone repository
git clone <repository>
cd realtime-speech-microservices

# Start infrastructure
cd infra
docker-compose up --build

# Run client GUI
python client_gui.py

# Run demo tests
cd demos
python demo_client.py --clients 5
```

Horizontal scaling:
```bash
# Scale workers horizontally
docker-compose up --scale stt_worker=3 --scale translation_worker=2 --scale gateway=2
```

Kubernetes deployment:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: stt-worker
spec:
  replicas: 3
  template:
    spec:
      containers:
      - name: stt-worker
        resources:
          limits:
            nvidia.com/gpu: 1
```

GPU support:
```yaml
# docker-compose.override.yml
services:
  stt_worker:
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - DEVICE=cuda
```

All services expose health endpoints:
- Gateway: `http://localhost:8080/health`
- STT Worker: `http://localhost:8081/health`
- Translation Worker: `http://localhost:8082/health`
Health checks include:
- Service uptime and responsiveness
- Redis connectivity
- Model loading status
- Processing metrics and error counts
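A minimal sketch of such a health endpoint, assuming `aiohttp` (the HTTP stack actually used by the services is not specified here) and illustrative status fields:

```python
# Sketch of a /health endpoint; replace the hard-coded fields with real checks.
import time
from aiohttp import web

START_TIME = time.time()

async def health(_request: web.Request) -> web.Response:
    return web.json_response({
        "status": "ok",
        "uptime_seconds": round(time.time() - START_TIME, 1),
        "redis_connected": True,   # replace with a real PING check
        "model_loaded": True,      # replace with the worker's model state
        "errors": 0,
    })

app = web.Application()
app.router.add_get("/health", health)

if __name__ == "__main__":
    web.run_app(app, port=8080)    # HEALTH_PORT
```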
Each service exposes comprehensive metrics:
```json
{
  "instance_id": "gateway-abc123",
  "queue_depth": 5,
  "max_queue_depth": 100,
  "metrics": {
    "clients_connected": 12,
    "audio_chunks_processed": 15420,
    "jobs_published": 234,
    "results_forwarded": 228,
    "errors": 2
  },
  "timestamp": 1699123456.789
}
```

- INFO: Normal operations, client connections, job processing
- WARNING: Recoverable errors, retries, resource warnings
- ERROR: Critical failures, model loading issues, connection failures
- DEBUG: Detailed processing information, state transitions
Log format:
```
%(asctime)s [%(name)s] %(levelname)s: %(message)s
```
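With the standard library, that format can be applied in one call; a small sketch:

```python
# Sketch: configuring the standard-library logger with the format shown above.
import logging

logging.basicConfig(
    format="%(asctime)s [%(name)s] %(levelname)s: %(message)s",
    level=logging.INFO,   # raise to DEBUG for state-transition detail
)
logging.getLogger("gateway").info("client connected")
```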
```bash
# Service health checks
curl http://localhost:8080/health
curl http://localhost:8081/health
curl http://localhost:8082/health

# View logs
docker-compose logs -f gateway
docker-compose logs -f stt_worker

# Redis monitoring
docker-compose exec redis redis-cli xlen audio_jobs
docker-compose exec redis redis-cli xinfo stream audio_jobs
```

| Component | Typical Latency | Notes |
|---|---|---|
| Audio Reception | 5-10ms | WebSocket processing |
| VAD Detection | 10-20ms | Dual VAD processing |
| Job Publishing | 5-15ms | Redis Stream write |
| STT Processing | 100-300ms | GPU: 100ms, CPU: 300ms |
| Translation | 50-150ms | Cached model inference |
| Result Delivery | 5-10ms | Pub/Sub forwarding |
| Total E2E | 200-400ms | Real-time performance |
| Service | Base Throughput | Scaling Factor |
|---|---|---|
| Gateway | 50 concurrent clients | Linear with CPU cores |
| STT Worker | 10-20 jobs/sec | GPU acceleration |
| Translation Worker | 20-50 jobs/sec | CPU-bound |
| Service | Base Memory | Scaling Factor |
|---|---|---|
| Gateway | 200MB | +50MB per 100 clients |
| STT Worker | 2-4GB | +1GB per GPU worker |
| Translation Worker | 1-2GB | +500MB per worker |
| Service | Base CPU | Scaling Notes |
|---|---|---|
| Gateway | 20-50% | Event-driven, low overhead |
| STT Worker | 80-100% | GPU offload recommended |
| Translation Worker | 50-80% | CPU intensive, scale horizontally |
Symptoms: Large package downloads fail during build.

Solution:
```bash
# Enable BuildKit
export DOCKER_BUILDKIT=1

# Use increased timeout and retries
docker build --memory=4g --build-arg BUILDKIT_INLINE_CACHE=1 stt_worker/
```

Symptoms: Faster-Whisper model download hangs or fails.

Solution:
```bash
# Pre-download models
docker run --rm speech/stt-worker python -c "import faster_whisper; faster_whisper.WhisperModel('base')"
```

Symptoms: CUDA out of memory errors.

Solutions:
- Reduce `MAX_BATCH_SIZE` in the environment
- Use a smaller model: `MODEL_SIZE=small`
- Scale to fewer concurrent workers
Symptoms: Connection timeouts, lost messages.

Solutions:
- Check Redis container health: `docker-compose ps redis`
- Verify network connectivity
- Check Redis logs: `docker-compose logs redis`
- Adjust Redis `timeout` and `tcp-keepalive` settings
Symptoms: End-to-end latency >500ms.

Solutions:
- Scale STT workers: `docker-compose up --scale stt_worker=3`
- Check GPU utilization and memory
- Monitor Redis queue depth
- Adjust silence threshold for more frequent job sending
Symptoms: Increasing memory usage over time.

Solutions:
- Translation worker automatically runs GC every 5 minutes
- Monitor with `psutil` integration
- Restart workers periodically if needed
- Check thread pool executor usage
```bash
# View all service logs
docker-compose logs -f

# Check container resource usage
docker stats

# Inspect Redis streams
docker-compose exec redis redis-cli xinfo stream audio_jobs
docker-compose exec redis redis-cli xlen audio_jobs

# Monitor Redis commands
docker-compose exec redis redis-cli monitor

# Check service connectivity
curl -f http://localhost:8080/health
curl -f http://localhost:8081/health
curl -f http://localhost:8082/health
```

STT Worker tuning:
```bash
# Increase batch size for GPU efficiency
export MAX_BATCH_SIZE=8

# Adjust beam size (accuracy vs speed tradeoff)
export BEAM_SIZE=3

# Use smaller model for faster processing
export MODEL_SIZE=medium
```

Gateway tuning:
```bash
# Reduce silence threshold for more responsive processing
export SILENCE_THRESHOLD_SECONDS=0.8

# Adjust pre-speech buffer
export PRE_SPEECH_BUFFER_SECONDS=0.5
```

Redis tuning (redis.conf):
```
# Increase memory limit
maxmemory 1gb

# Adjust persistence settings
auto-aof-rewrite-min-size 128mb
```
This technical documentation provides comprehensive coverage of the real-time speech transcription and translation microservices system. The architecture is designed for horizontal scalability, low-latency processing, and robust error handling in production environments.