v0.4.0 — Enterprise Auth, Native Pipeline, Neural Search & Security Hardening
A major release combining enterprise-grade authentication, a native transcription pipeline, neural search, GPU optimizations, cloud ASR providers, comprehensive speaker intelligence, a Progressive Web App, user groups & sharing, and a final frontend hardening sprint — all built from processing 1,400+ real-world recordings over two months of development. 281 commits since v0.3.3.
🔐 Enterprise Authentication
Four authentication methods that can run simultaneously, configured through the admin UI without restarts:
- Local — Username/password with bcrypt, TOTP MFA (RFC 6238 — Google Authenticator, Authy, Microsoft Authenticator), FedRAMP IA-5 password policies (complexity, history, expiration), NIST AC-7 account lockout with progressive thresholds
- LDAP/Active Directory — Enterprise directory integration with auto-provisioning and username-attribute mapping
- OIDC/Keycloak — OpenID Connect with federated identity, social login, and federated logout propagation
- PKI/X.509 — Certificate-based mTLS authentication with OCSP/CRL revocation checking and super-admin local password fallback
Plus: per-IP and per-user rate limiting, audit logging in structured JSON/CEF format with OpenSearch integration, JWT refresh token rotation with concurrent session limits, and database-driven configuration with AES-256-GCM encryption at rest — all manageable from a Super Admin UI without restarts.
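For readers unfamiliar with the TOTP MFA mentioned above, here is a minimal sketch of how an RFC 6238 code is derived and checked. It is not the project's implementation: the function names `totp` and `verify_totp`, the one-step drift window, and the HMAC-SHA1 variant are assumptions chosen for illustration.

```python
import hashlib
import hmac
import struct


def totp(secret: bytes, unix_time: int, step: int = 30, digits: int = 6) -> str:
    """Compute an RFC 6238 time-based one-time password (HMAC-SHA1 variant)."""
    counter = unix_time // step
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation (RFC 4226): the low nibble of the last byte picks the offset.
    offset = mac[-1] & 0x0F
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def verify_totp(secret: bytes, submitted: str, unix_time: int,
                step: int = 30, digits: int = 6, window: int = 1) -> bool:
    """Accept codes from the current step and `window` steps on either side.

    The drift window is a hypothetical default, not a documented setting.
    """
    for drift in range(-window, window + 1):
        expected = totp(secret, unix_time + drift * step, step, digits)
        if hmac.compare_digest(expected, submitted):
            return True
    return False
```

With the RFC 6238 test secret `b"12345678901234567890"` and time 59, the 8-digit code is `94287082`. Authenticator apps such as those listed above compute the same value from the shared secret.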
⚡ Native Transcription Pipeline (2× Faster)
Replaced the legacy WhisperX pipeline with a native engine built on faster-whisper's BatchedInferencePipeline + PyAnnote v4. Cross-attention DTW provides word timestamps during transcription — no separate alignment pass, no wav2vec2 dependency, and native word timestamps for all 100+ languages (previously only ~42 via wav2vec2).
Benchmark (3.3-hour podcast, RTX A6000): 706s → 332s — 2.1× faster
- Unified pipeline replaces the previous `parallel_pipeline`/`whisperx_service` split
- User-configurable VAD — Voice Activity Detection threshold and silence duration exposed as tunable settings
- Word timestamp validation — post-processing ensures monotonicity and prevents drift
- GPU pipeline benchmarks — 40.3× single-file realtime, 54.6× peak at concurrency=8, perfect linear scaling 1–12 workers
- TF32 acceleration enabled at worker startup and after diarization (Ampere+ GPUs)
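The word timestamp validation step listed above can be pictured with a small sketch. This is an illustration only: the `Word` type, the `enforce_monotonic` name, and the 10 ms minimum duration are hypothetical, and the actual post-processing may apply different rules.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds


def enforce_monotonic(words: list[Word], min_duration: float = 0.01) -> list[Word]:
    """Clamp word timings so starts never move backwards and every word has positive length."""
    fixed: list[Word] = []
    cursor = 0.0  # end time of the previous word
    for word in words:
        start = max(word.start, cursor)             # never start before the previous word ends
        end = max(word.end, start + min_duration)   # never end before (or at) the start
        fixed.append(Word(word.text, start, end))
        cursor = end
    return fixed
```

Given words timed `(0.0, 0.5)`, `(0.4, 0.9)`, `(0.8, 0.7)`, the second word is pushed to start at 0.5 and the third is moved to start at 0.9 with a minimal duration, so playback highlighting never jumps backwards.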
🎙️ PyAnnote v4 Migration & Speaker Intelligence
- Automatic migration system — Admin UI with real-time progress bar migrates speaker embeddings from v3 (512-dim) to v4 (256-dim) via atomic alias swap, zero downtime
- Speaker overlap detection — Identifies overlapping speakers with confidence scoring
- Speaker pre-clustering — GPU-accelerated cross-file speaker grouping (#144)
- Global Speaker Management page — Dedicated page for cross-file speaker profile management with avatars
- Gender classification — Apache 2.0 licensed neural network predicts gender from voice; stored on profiles for cross-video consistency
- Gender-informed cluster validation — Cross-gender cluster assignments require higher similarity thresholds; minority members flagged for review
- Speaker metadata parsing — Cross-reference pipeline with metadata hints display for LLM-assisted speaker identification (#141)
- Jump-to-timestamp links in the speaker editor (#147)
- Unassign & blacklist — Remove speaker assignments and blacklist erroneous profiles
- Outlier analysis — Detect and flag outlier embeddings in speaker clusters
- Inline audio playback — Play/pause toggle in speaker cluster views
- OpenSearch cosine score fix — All 8 kNN score read locations now correctly convert `(1+cos)/2` → raw cosine
- Warm model caching eliminates 40-60s cold-start delays by pre-loading models on startup
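The cosine score fix above amounts to inverting the `(1+cos)/2` mapping before comparing against similarity thresholds. A minimal sketch, with the helper name `knn_score_to_cosine` being an assumption rather than the project's function:

```python
def knn_score_to_cosine(score: float) -> float:
    """Invert score = (1 + cos) / 2 to recover the raw cosine similarity."""
    cosine = 2.0 * score - 1.0
    # Clamp to the valid cosine range to absorb floating-point noise.
    return max(-1.0, min(1.0, cosine))
```

Why this matters: a kNN score of 0.75 corresponds to a raw cosine of only 0.5, so comparing unconverted scores against a cosine threshold makes matches look stronger than they are.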
🔍 Hybrid Neural Search
Full-text BM25 combined with semantic vector search via OpenSearch ML Commons. Search for "budget discussion" and find segments about "financial planning" even when those exact words never appear.
- ML Commons integration — Native OpenSearch neural search, server-side embeddings
- RRF hybrid merging — BM25 + vector scores combined via Reciprocal Rank Fusion
- 6 embedding model tiers — from 384-dim MiniLM (fast) to 768-dim mpnet (best quality)
- Hybrid search crash fix — Previously silent fallback to BM25-only on OpenSearch 3.4 due to `ArrayIndexOutOfBoundsException` when combining `aggs` + `hybrid` + `collapse` + RRF
- Soft demotion instead of hard suppression — Semantic results no longer dropped
- Dynamic over-fetch — Cap raised from 200 to 1,000 via `SEARCH_MAX_OVERFETCH` for large indexes
- BM25 tuning — Fuzziness AUTO, cross-fields, phrase slop, rank constant tuned 40→30
- Stop/cancel reindex — Admin UI can cancel in-flight reindex operations
- Offline model downloading for air-gapped deployments
- Dynamic model management via admin UI
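To make the RRF hybrid merging above concrete, here is a rough sketch of Reciprocal Rank Fusion over two ranked result lists. In this release the fusion is described as happening inside OpenSearch, so this client-side version is purely illustrative. The function name `rrf_merge` is hypothetical, and using the tuned value 30 as the default rank constant is an assumption about where that constant applies.

```python
from collections import defaultdict


def rrf_merge(rankings: list[list[str]], rank_constant: int = 30) -> list[tuple[str, float]]:
    """Fuse ranked lists: each document scores sum(1 / (rank_constant + rank))."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (rank_constant + rank)
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


bm25_hits = ["a", "b", "c"]      # keyword ranking
vector_hits = ["c", "a", "d"]    # semantic ranking
merged = [doc_id for doc_id, _ in rrf_merge([bm25_hits, vector_hits])]
```

Documents that appear in both lists ("a" and "c") rise above documents found by only one retriever, which is why a semantic-only hit can still surface without displacing strong keyword matches.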
☁️ Cloud ASR Providers
For deployments without a GPU — 8 cloud speech providers plus cloud diarization (#150):
- Providers: Deepgram, AssemblyAI, OpenAI Whisper API, Google, AWS Transcribe, Azure Speech, Speechmatics, Gladia
- pyannote.ai cloud diarization integration
- Independent diarization provider architecture — `diarization_source` selector: ASR built-in, local PyAnnote GPU, pyannote.ai cloud, or off — independent of transcription provider choice
- API-lite deployment mode — 2 GB CPU-only image vs. 8.9 GB for the full GPU image. Cloud-transcribed files still get local speaker embedding extraction for cross-file matching
- Custom vocabulary — Domain-specific hotwords (medical, legal, corporate, government) used as faster-whisper hotwords and cloud provider keyword boosting
- Admin-pinned ASR model — Admins control local Whisper model selection; model loaded once at startup, shared across all workers
- Per-transcription model override — Users can override the admin-pinned model per upload (#153)
🤝 User Groups, Collection Sharing & Collaboration
- User Groups & Collection Sharing (#148) — Create user groups and share collections with groups or individual users; granular viewer/editor permissions
- Speaker profile sharing via the collection sharing infrastructure
- Config/prompt sharing — Share LLM configs, prompts, media sources, and organization contexts between users
- Per-collection AI prompts (#146) — Different summarization styles for different collection types
- Bidirectional prompt-collection links — Prompts show which collections use them
- Organization context (#142) — Inject domain knowledge into all LLM prompts for context-aware summaries
📤 Upload & Media
- TUS 1.0.0 resumable uploads (#10) — Chunked uploads with MinIO multipart storage that survive network interruptions
- Collection & tag selection at upload (#145) — Organize files during upload, not after
- URL download quality settings (#122) — Configure video resolution, audio-only mode, and bitrate for yt-dlp downloads
- File retention & auto-deletion (#134) — Admin-configurable file retention with automatic deletion and GDPR-compliant audit logging
- Auto-labeling (#140) — AI suggests tags and collections from transcript content with fuzzy deduplication
- Disable AI summary per upload (#152)
- Disable speaker diarization per upload (#151)
- Selective reprocessing (#143) — Stepper UI to re-run specific pipeline stages on existing files
- YouTube bot-bypass — 2026 yt-dlp best practices (Deno JS runtime, client rotation, proper headers) for 1,800+ supported platforms
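The fuzzy deduplication mentioned under auto-labeling (#140) could look roughly like the sketch below. Everything here is an assumption for illustration: the `dedupe_suggestions` and `normalize` names, the 0.85 similarity threshold, and the use of `difflib` rather than whatever matcher the project actually uses.

```python
from difflib import SequenceMatcher


def normalize(label: str) -> str:
    """Lowercase and collapse separators so 'Budget-Review' and 'budget review' compare equal."""
    return " ".join(label.lower().replace("-", " ").replace("_", " ").split())


def dedupe_suggestions(suggested: list[str], existing: list[str],
                       threshold: float = 0.85) -> list[str]:
    """Map each AI-suggested label onto a near-identical known label, else keep it as new."""
    known = list(existing)
    result: list[str] = []
    for label in suggested:
        match = next(
            (k for k in known
             if SequenceMatcher(None, normalize(label), normalize(k)).ratio() >= threshold),
            None,
        )
        chosen = match if match is not None else label
        if match is None:
            known.append(label)  # later suggestions can dedupe against this new label
        if chosen not in result:
            result.append(chosen)
    return result
```

For example, with existing tags `["Budget Review", "Legal"]`, the suggestions `["budget-review", "Budget Reviews", "Onboarding"]` collapse to `["Budget Review", "Onboarding"]` instead of creating two near-duplicate tags.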
🛡️ Frontend Hardening Sprint
A dedicated audit sprint shipped in this release. Full details below under "Security", but the highlights:
- Flash of Authenticated Content (FOAC) fix — Layout now gates protected content during async auth verification
- Centralized user state cleanup (`lib/session/clearUserState.ts`) — 17+ stores, caches, and localStorage keys cleared on every login/logout
- Session-scoped `AbortController` cancels in-flight requests on logout
- bfcache invalidation — Back button after logout forces reload to discard restored snapshots
- DOMPurify sanitization across 8 `{@html}` render sites; replaces a bypassable regex sanitizer
- Production source maps disabled
- Keycloak redirect URL validation
🎨 UX & Frontend Polish
- Upload modal redesign — Replaced the 4,603-line monolith with a 6-step linear stepper (Media → Tags → Collections → Speakers → Options → Submit) plus a conditional Extract step for large videos. All three upload sources (file/URL/recording) now share steps 2-6. "Remember previous values" and "Review with defaults" shortcuts for power users
- Skeleton loaders — Replace generic spinners on home gallery, search results, file detail, and speaker clusters/profiles/inbox (~20% faster perceived load per Nielsen Norman research)
- Gallery click feedback — Instant press state + mousedown prefetch (~50-100ms head start)
- Gallery redesign — Compact Apple-like grid cards, list view, sorting, multi-select bulk actions
- Gallery state persistence — Filters persist across file detail navigation; scroll position restored on back
- Collection & Share modal polish — Intro text, permission reference cards, empty states, backdrop-click data-loss protection
- Manage Collections visual fix — Eliminated the "card in a card" glitch
- Settings redesign — Tabbed navigation, per-user preferences, speaker behavior defaults
- Queue Dashboard — Unified tasks view (formerly File Status) with quick filters, DatePicker, and pagination
- Stepper reprocess UI — Step-by-step reprocessing with stage picker
- Gallery action consolidation — Action buttons moved to header with dropdown groups (#139)
- Multi-select with auto-filter and title normalization
- Unified color system — Gallery toolbar replaced a 7-color rainbow (blue, purple, green, amber, red, gray, purple) with a consistent 2-color system per Apple HIG: primary blue for main actions, surface/gray for secondary actions, red for destructive only. All purple removed from buttons and badges across 9 components. New `--ai-accent-color` CSS variable replaces hardcoded purple. Dark mode hover direction fixed (was lightening instead of darkening)
📱 Progressive Web App & Mobile
- Installable PWA (#155) — 15+ mobile fixes shipped as a comprehensive overhaul
- 2-column mobile grid, hamburger navigation, full-screen modals, scroll locking, touch-optimized UI
- iPad/iOS responsive layout fixes — widened tablet breakpoints to 1200px for iPad landscape
- Mobile settings navigation redesigned with dropdown selector
- Background page scroll lock under all modals
- 44×44pt touch targets throughout (Apple HIG compliance)
⚙️ Infrastructure, Monitoring & Performance
- 3-stage Celery pipeline — Preprocess (CPU) → Transcribe+Diarize (GPU) → Postprocess (CPU) across separate queues so the GPU never idles waiting on CPU work
- Flower monitoring upgrade — Industry-standard Celery/Flower integration with persistent task history, queue visibility, worker status
- Multi-GPU stats with stepper UI — Real-time per-GPU stats display
- GPU concurrent model sharing — NVML profiling, PyAnnote embedding batch optimization (upstream contributions to PyAnnote and WhisperX)
- 273× faster WhisperX speaker assignment — Replaced O(n×m) linear scan with interval tree + NumPy vectorized ops (10.2s → 0.037s for a 3-hour file). Contributed upstream.
- Dual-model transcription architecture — CPU lightweight + GPU primary for workload flexibility
- Embedded documentation container — New `opentranscribe-docs` Docusaurus site served at `/docs/` through NGINX proxy; fully offline-capable for air-gapped deployments
- Progressive Web App service worker with versioned cache purging
- Codebase modularization — 9 new shared backend modules, 6 new UI components, speaker task splits, dead code removal
- Default Whisper model changed from `large-v2` to `large-v3-turbo` (6× faster); use `large-v3` for translation or maximum accuracy
- Intelligent batch sizing based on available VRAM
- Alembic-only database bootstrapping — Linearized migration chain after branch merges
- Configurable TXT export — Persistent export preferences including speaker grouping
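The speaker assignment speedup listed above replaces a scan of every diarization turn for every word with an indexed lookup. The release describes an interval tree plus NumPy; the sketch below swaps in a simpler technique, binary search over sorted turns, which only works under the stated assumption that turns are sorted and do not overlap. The `assign_speakers` name and the tuple layouts are hypothetical.

```python
from bisect import bisect_left, bisect_right


def assign_speakers(words, turns):
    """Label each (start, end) word with the speaker whose turn overlaps it most.

    Assumes `turns` is a list of (start, end, speaker) sorted by start time
    with no overlapping turns, so the end times are sorted as well.
    """
    starts = [t[0] for t in turns]
    ends = [t[1] for t in turns]
    labels = []
    for w_start, w_end in words:
        lo = bisect_right(ends, w_start)   # first turn that ends after the word starts
        hi = bisect_left(starts, w_end)    # first turn that starts at/after the word ends
        best, best_overlap = None, 0.0
        for t_start, t_end, speaker in turns[lo:hi]:  # only the overlapping turns
            overlap = min(w_end, t_end) - max(w_start, t_start)
            if overlap > best_overlap:
                best, best_overlap = speaker, overlap
        labels.append(best)  # None when the word falls in a gap between turns
    return labels
```

Each word costs two binary searches plus the handful of turns it actually touches, instead of a pass over all turns. A true interval tree, as described in the release, is needed once overlapping speech means turns can overlap each other.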
🔒 Security
A comprehensive security hardening pass alongside the feature work:
Infrastructure Hardening
- CSP headers, private MinIO buckets, AES-256-GCM encryption at rest
- Non-root containers throughout the backend and frontend images
- FIPS 140-3 readiness documentation for government deployments
- Apt/apk upgrade on all runtime stages to pull latest base OS patches
- Hadolint + Trivy + Grype + Dockle + SBOM scan pipeline integrated into `docker-build-push.sh`
Frontend Session Hardening
- Flash of Authenticated Content (FOAC) fix — `+layout.svelte` now gates all protected content behind `authReady && isAuthenticated && !isPublicPath`. Previously, unauthenticated users hitting `/` briefly saw the gallery slot render before the redirect, leaking ~1-2 frames of protected UI and triggering `/files` API calls
- Centralized user state cleanup — New `lib/session/clearUserState.ts` is the single source of truth for session teardown. Clears 17+ subsystems on every login/logout transition: toast, websocket, uploads, gallery filters, search results, sharing, LLM status, settings modal, transcript, groups, downloads, notifications, recording (with media track cleanup), thumbnail cache, media URL cache, speaker colors, plus user-scoped localStorage keys. Preferences (theme, locale, view mode, recording settings) are explicitly preserved
- Session-scoped request cancellation — `AbortController` in `lib/axios.ts` attached to every request via interceptor (except auth endpoints). `logout()` calls `abortAllRequests()` before `clearUserState()`, closing the race window where a late response could repopulate a cleared store
- bfcache invalidation on back button — Listens for `pageshow` events with `event.persisted === true` and forces `window.location.reload()`, preventing previously-protected pages from being restored from memory on shared devices
- Toast cross-session leak fixed — `toastStore.clear()` called from every login success path (local, Keycloak, PKI, MFA) and from `logout()`
- Keycloak redirect URL validation — `loginWithKeycloak()` parses and validates the authorization URL protocol (`http:`/`https:` only) before redirecting
XSS Hardening
- DOMPurify-backed HTML sanitization — New `lib/utils/sanitizeHtml.ts` with strict tag whitelist. Added `dompurify` as a dependency
- Defense-in-depth across 8 `{@html}` render sites — TopicsList, TranscriptDisplay, TranscriptModal, SearchTranscriptModal, SearchOccurrence, SearchResultCard, SummaryDisplay
- Bypassable regex sanitizer replaced — The previous `SearchOccurrence` sanitizer `html.replace(/<(?!\/?mark[\s>])[^>]*>/g, '')` was bypassable via `</mark><script>...</script><mark>` payloads (regex only matched opening tags)
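As a general illustration of why tag-stripping regexes make weak sanitizers, the sketch below transcribes the quoted pattern into Python's `re` module and feeds it a `<mark>` tag carrying an event-handler attribute. This is one well-known failure mode of allowlist-by-tag-name patterns, shown as an assumption-laden demo. It is not necessarily the payload found in the audit, and `legacy_sanitize` is a hypothetical name.

```python
import re

# The pattern quoted above, transcribed for Python: strip every tag that is not <mark> or </mark>.
LEGACY_PATTERN = re.compile(r"<(?!/?mark[\s>])[^>]*>")


def legacy_sanitize(html: str) -> str:
    """Regex-only sanitizer: removes non-<mark> tags but never inspects attributes."""
    return LEGACY_PATTERN.sub("", html)


benign = legacy_sanitize("<b>hi</b> <mark>x</mark>")
# The allowed tag name is enough to pass, so the handler attribute survives untouched.
payload = '<mark onmouseover="alert(1)">x</mark>'
survived = legacy_sanitize(payload)
```

Because the pattern only checks the tag name, any attribute on an allowed tag passes through, which is the kind of gap a parser-based sanitizer with an explicit tag and attribute allowlist is designed to close.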
Build & Config
- Production source maps disabled — `sourcemap: mode !== 'production'` prevents shipping variable names, API endpoints, and business logic to DevTools viewers
- Defense-in-depth home page guard — `routes/+page.svelte` early-returns if unauthenticated
🌍 Internationalization
- 8 UI languages: English, Spanish, French, German, Portuguese, Chinese, Japanese, Russian
- AI summary output in 12 languages
- Full i18n compliance audit — added missing translations across all locales
🚢 How to Update
Docker Compose
docker compose pull
docker compose up -d
After upgrading, hard-reload the frontend (Ctrl+Shift+R / Cmd+Shift+R) to pick up the new service worker and clear stale cached assets.
Alembic migrations run automatically on startup — no manual database changes. All existing data is preserved.
Management Script
./opentr.sh stop && ./opentr.sh start prod
To enable new authentication methods
- Log in as super admin
- Navigate to Settings → Authentication
- Enable desired methods (LDAP, Keycloak, PKI)
- Configure each in its dedicated section
Optional: PyAnnote v4 migration
To enable speaker overlap detection and improved performance:
- Navigate to Settings → Embeddings
- Click "Migrate to PyAnnote v4"
- Monitor progress with the real-time progress bar (no restart required)
Optional: reclaim disk space
The wav2vec2 alignment model is no longer used (~360 MB):
rm -rf "${MODEL_CACHE_DIR:-./models}"/torch/hub/checkpoints/wav2vec2_*
Existing word-level timestamps are preserved — no reprocessing needed.
Optional: clean up deprecated env vars
# These can be safely removed from .env:
# ENABLE_ALIGNMENT=true (alignment is now always-on natively)
# TRANSCRIPTION_ENGINE=whisperx (single unified engine, setting ignored)
📝 Breaking Changes
- Authentication Configuration: Auth settings now configured via Super Admin UI (Settings → Authentication) instead of environment variables. Database configuration takes precedence if set.
- PyAnnote Migration: Existing installations may optionally migrate speaker embeddings to v4 for overlap detection and improved voice matching.
- wav2vec2 Alignment Model Removed: Word-level timestamps are now native. `ENABLE_ALIGNMENT` and `TRANSCRIPTION_ENGINE` env vars are deprecated and silently ignored.
- Removed Python modules: `whisperx_service.py`, `parallel_pipeline.py`, `pyannote_compat.py`, `fast_speaker_assignment.py`, `batched_alignment.py` — functionality merged into the unified pipeline.
🐛 Selected Bug Fixes
- Hybrid search silently falling back to BM25-only due to OpenSearch 3.4 crash — fixed
- OpenSearch cosine similarity scores now correctly converted from `(1+cos)/2` to raw cosine
- Speaker profile centroid embeddings now correctly averaged across all constituent embeddings
- GPU memory leaks — CPU worker CUDA context initialization, prefork child VRAM leak, warm cache gating
- HuggingFace gated model authentication for PyAnnote diarization
- 18 N+1 query patterns and ORM hydration waste fixed across services and tasks
- Login flicker, empty-state flash, and navigation glitches eliminated
- YouTube 2026 bot-bypass (Deno JS runtime, client rotation)
- Admin bypass and shared editor access across all API endpoints
- Alembic migration chain linearized after branch merges
- LDAP user bcrypt crash when verifying non-local passwords
- WebSocket notification queue, upload queue, and previous-upload-values localStorage leaks on logout
- Dropdown clipping in upload modal
- Nested card visual glitch in Manage Collections modal
- Debug console.logs removed from production code
- Dead code removed (`Tasks.svelte.old`, unused `AudioExtractionModal.svelte`)
- Avatar lazy-loading on Speakers page
- Dark mode hover direction — `--primary-hover` was lighter than `--primary-color`, making buttons appear to deactivate on hover. Fixed to darken consistently across both themes
👥 Contributors
Special thanks to the community members whose code and feedback shaped this release:
Code contributors:
- @vfilon (Vitali Filon) — Authored the entire LDAP/Active Directory authentication feature (PR #117, 9 commits): auth engine, username attribute support, `auth_type` handling, password change restrictions for non-local users, conditional settings UI, documentation, and migration detection logic. Foundation of the enterprise auth system.
- @imorrish (Ian Morrish) — Submitted PR #117 upstream; contributed the Postgres password reset guide to the troubleshooting docs.
Feature requests and bug reports that shipped in this release:
- @imorrish — Scrollable speaker dropdown (#129), filename in AI summary template (#138), collection/tag selection at upload (#145), per-collection default AI prompt (#146)
- @it-service-gemag — Disable diarization per upload (#151), disable AI summary per upload (#152), per-transcription Whisper model selection (#153)
- @Politiezone-MIDOW — File retention and auto-deletion system (#134)
- @coltrall — Docker daemon detection fix in the installation script (#137)
- @SQLServerIO (Wes Brown) — Pagination for large transcripts, fixing file detail page hang with long recordings (#109)
Thank you to everyone who filed issues, tested pre-releases, and shared their use cases — your feedback directly drives what gets built.
📚 Full Details
- Full changelog: CHANGELOG.md
- Blog post: The story behind v0.4.0
- Commits since v0.3.3: v0.3.3...v0.4.0 (281 commits)
- Docker images: `davidamacey/opentranscribe-backend:v0.4.0` and `davidamacey/opentranscribe-frontend:v0.4.0` on Docker Hub