
Releases: attevon-llc/OpenTranscribe

v0.4.1 - LDAP DN Fix & Keycloak PKI Compliance

15 Apr 01:52



This patch release fixes a critical LDAP group filtering bug reported in #188 and adds government/FedRAMP Keycloak-as-PKI-broker support.

What Was Broken

Active Directory Distinguished Names use commas as internal syntax (e.g. CN=Whisper_Users,CN=Users,DC=domain,DC=local). The previous code split group lists on commas, which shredded full DNs into fragments that could never match what AD returned. Group filtering was silently broken for any installation using full DNs.

Highlights

  • LDAP group DN parsing fixed — group lists now use semicolons as the multi-group separator; full AD DNs work correctly
  • PKI_ADMIN_DNS parsing fixed — same semicolon delimiter fix for certificate admin lists
  • Keycloak X.509 PKI broker — cert claims injected by Keycloak (both cert_* and x509_cert_* forms) are extracted and stored on the user record
  • PKI admin promotion via Keycloak — cert DN in PKI_ADMIN_DNS grants admin access for Keycloak users, matching standalone PKI auth behaviour
  • Government cert CN format — CN=LastName FirstName email/username (space-separated 3-token) parsed and displayed as First Last
  • 116 new unit tests across ldap_auth and keycloak_auth modules
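The government CN handling described above can be sketched as follows. `parse_gov_cn` is a hypothetical helper name for illustration, not the actual function in the codebase; it assumes the space-separated LastName FirstName username token order shown in these notes:

```python
def parse_gov_cn(cn: str) -> str:
    """Parse a government-style certificate CN such as 'Doe John jdoe'
    (LastName FirstName username) into a 'First Last' display name.
    Hypothetical sketch only."""
    tokens = cn.split()
    if len(tokens) != 3:
        # Not the 3-token government format; show the CN unchanged
        return cn
    last, first, _username = tokens
    return f"{first} {last}"

print(parse_gov_cn("Doe John jdoe"))  # → John Doe
print(parse_gov_cn("Jane Smith"))     # not 3 tokens, returned unchanged
```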

Upgrade Notes

LDAP group list format change — update your LDAP_REQUIRED_USER_GROUPS and LDAP_ADMIN_GROUPS environment variables to use semicolons:

# Single full DN: previously shredded on its internal commas, now parsed intact
LDAP_REQUIRED_USER_GROUPS=CN=Whisper_Users,CN=Users,DC=domain,DC=local

# Multiple groups: separate with semicolons, never commas
LDAP_REQUIRED_USER_GROUPS=CN=Group1,DC=domain,DC=local;CN=Group2,DC=domain,DC=local

PKI_ADMIN_DNS — if you have multiple admin DNs, use semicolons:

PKI_ADMIN_DNS=CN=Doe John jdoe,OU=Agency,O=U.S. Government,C=US;CN=Smith Jane jsmith,OU=Agency,O=U.S. Government,C=US
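The delimiter change can be illustrated with a short sketch (`split_dn_list` is a hypothetical helper name; the real implementation may differ):

```python
def split_dn_list(value: str) -> list[str]:
    """Split a semicolon-delimited list of DNs, preserving the commas
    that are part of each DN's internal syntax."""
    return [dn.strip() for dn in value.split(";") if dn.strip()]

admins = split_dn_list(
    "CN=Doe John jdoe,OU=Agency,O=U.S. Government,C=US;"
    "CN=Smith Jane jsmith,OU=Agency,O=U.S. Government,C=US"
)
print(len(admins))  # → 2 full DNs, internal commas intact
# Splitting the same string on commas yields 8 useless fragments,
# none of which would ever match a DN the directory returns.
```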

No database migrations required.

How to Update

Docker Compose:

docker compose pull
docker compose up -d

Full Changelog

See CHANGELOG.md

v0.4.0 — Enterprise Auth, Native Pipeline, Neural Search & Security Hardening

14 Apr 04:21



A major release combining enterprise-grade authentication, a native transcription pipeline, neural search, GPU optimizations, cloud ASR providers, comprehensive speaker intelligence, a Progressive Web App, user groups & sharing, and a final frontend hardening sprint — all built from processing 1,400+ real-world recordings over two months of development. 281 commits since v0.3.3.


🔐 Enterprise Authentication

Four authentication methods that can run simultaneously, configured through the admin UI without restarts:

  • Local — Username/password with bcrypt, TOTP MFA (RFC 6238 — Google Authenticator, Authy, Microsoft Authenticator), FedRAMP IA-5 password policies (complexity, history, expiration), NIST AC-7 account lockout with progressive thresholds
  • LDAP/Active Directory — Enterprise directory integration with auto-provisioning and username-attribute mapping
  • OIDC/Keycloak — OpenID Connect with federated identity, social login, and federated logout propagation
  • PKI/X.509 — Certificate-based mTLS authentication with OCSP/CRL revocation checking and super-admin local password fallback

Plus: per-IP and per-user rate limiting, audit logging in structured JSON/CEF format with OpenSearch integration, JWT refresh token rotation with concurrent session limits, and database-driven configuration with AES-256-GCM encryption at rest — all manageable from a Super Admin UI without restarts.

⚡ Native Transcription Pipeline (2× Faster)

Replaced the legacy WhisperX pipeline with a native engine built on faster-whisper's BatchedInferencePipeline + PyAnnote v4. Cross-attention DTW provides word timestamps during transcription — no separate alignment pass, no wav2vec2 dependency, and native word timestamps for all 100+ languages (previously only ~42 via wav2vec2).

Benchmark (3.3-hour podcast, RTX A6000): 706s → 332s — 2.1× faster

  • Unified pipeline replaces the previous parallel_pipeline / whisperx_service split
  • User-configurable VAD — Voice Activity Detection threshold and silence duration exposed as tunable settings
  • Word timestamp validation — post-processing ensures monotonicity and prevents drift
  • GPU pipeline benchmarks — 40.3× single-file realtime, 54.6× peak at concurrency=8, perfect linear scaling 1–12 workers
  • TF32 acceleration enabled at worker startup and after diarization (Ampere+ GPUs)
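The word-timestamp validation mentioned above can be sketched as a simple post-processing pass (hypothetical helper; the real validator may also apply drift limits):

```python
def enforce_monotonic(words: list[dict]) -> list[dict]:
    """Clamp word timestamps so each word starts no earlier than the
    previous word ended, and never ends before it starts."""
    prev_end = 0.0
    fixed = []
    for w in words:
        start = max(w["start"], prev_end)
        end = max(w["end"], start)
        fixed.append({**w, "start": start, "end": end})
        prev_end = end
    return fixed

words = [{"word": "hello", "start": 0.0, "end": 0.5},
         {"word": "world", "start": 0.4, "end": 0.3}]  # overlaps and inverts
print(enforce_monotonic(words))  # second word clamped to start=end=0.5
```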

🎙️ PyAnnote v4 Migration & Speaker Intelligence

  • Automatic migration system — Admin UI with real-time progress bar migrates speaker embeddings from v3 (512-dim) to v4 (256-dim) via atomic alias swap, zero downtime
  • Speaker overlap detection — Identifies overlapping speakers with confidence scoring
  • Speaker pre-clustering — GPU-accelerated cross-file speaker grouping (#144)
  • Global Speaker Management page — Dedicated page for cross-file speaker profile management with avatars
  • Gender classification — Apache 2.0 licensed neural network predicts gender from voice; stored on profiles for cross-video consistency
  • Gender-informed cluster validation — Cross-gender cluster assignments require higher similarity thresholds; minority members flagged for review
  • Speaker metadata parsing — Cross-reference pipeline with metadata hints display for LLM-assisted speaker identification (#141)
  • Jump-to-timestamp links in the speaker editor (#147)
  • Unassign & blacklist — Remove speaker assignments and blacklist erroneous profiles
  • Outlier analysis — Detect and flag outlier embeddings in speaker clusters
  • Inline audio playback — Play/pause toggle in speaker cluster views
  • OpenSearch cosine score fix — All 8 kNN score read locations now correctly convert (1+cos)/2 → raw cosine
  • Warm model caching eliminates 40-60s cold-start delays by pre-loading models on startup
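On the cosine score fix: OpenSearch kNN reports cosine similarity normalized as (1 + cos) / 2 so scores stay non-negative; recovering the raw cosine at each read site is just the inverse mapping:

```python
def raw_cosine(knn_score: float) -> float:
    """Invert OpenSearch's (1 + cos) / 2 kNN score normalization
    back to raw cosine similarity in [-1, 1]."""
    return 2.0 * knn_score - 1.0

print(raw_cosine(1.0))  # → 1.0   (identical direction)
print(raw_cosine(0.5))  # → 0.0   (orthogonal)
print(raw_cosine(0.0))  # → -1.0  (opposite direction)
```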

🔍 Hybrid Neural Search

Full-text BM25 combined with semantic vector search via OpenSearch ML Commons. Search for "budget discussion" and find segments about "financial planning" even when those exact words never appear.

  • ML Commons integration — Native OpenSearch neural search, server-side embeddings
  • RRF hybrid merging — BM25 + vector scores combined via Reciprocal Rank Fusion
  • 6 embedding model tiers — from 384-dim MiniLM (fast) to 768-dim mpnet (best quality)
  • Hybrid search crash fix — Previously silent fallback to BM25-only on OpenSearch 3.4 due to ArrayIndexOutOfBoundsException when combining aggs + hybrid + collapse + RRF
  • Soft demotion instead of hard suppression — Semantic results no longer dropped
  • Dynamic over-fetch — Cap raised from 200 to 1,000 via SEARCH_MAX_OVERFETCH for large indexes
  • BM25 tuning — Fuzziness AUTO, cross-fields, phrase slop, rank constant tuned 40→30
  • Stop/cancel reindex — Admin UI can cancel in-flight reindex operations
  • Offline/airgapped model downloading for air-gapped deployments
  • Dynamic model management via admin UI
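Reciprocal Rank Fusion merges the BM25 and vector result lists using only ranks, so the two incomparable score scales never need reconciling. A minimal sketch; k = 60 is the conventional constant (these notes mention the rank constant being tuned from 40 to 30):

```python
from collections import defaultdict

def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: score(doc) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25 = ["seg3", "seg1", "seg7"]    # lexical hits
vector = ["seg1", "seg9", "seg3"]  # semantic hits
print(rrf_merge([bm25, vector]))   # seg1 and seg3 rise to the top
```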

☁️ Cloud ASR Providers

For deployments without a GPU — 8 cloud speech providers plus cloud diarization (#150):

  • Providers: Deepgram, AssemblyAI, OpenAI Whisper API, Google, AWS Transcribe, Azure Speech, Speechmatics, Gladia
  • pyannote.ai cloud diarization integration
  • Independent diarization provider architecture — diarization_source selector: ASR built-in, local PyAnnote GPU, pyannote.ai cloud, or off — independent of transcription provider choice
  • API-lite deployment mode — 2 GB CPU-only image vs. 8.9 GB for the full GPU image. Cloud-transcribed files still get local speaker embedding extraction for cross-file matching
  • Custom vocabulary — Domain-specific hotwords (medical, legal, corporate, government) used as faster-whisper hotwords and cloud provider keyword boosting
  • Admin-pinned ASR model — Admins control local Whisper model selection; model loaded once at startup, shared across all workers
  • Per-transcription model override — Users can override the admin-pinned model per upload (#153)

🤝 User Groups, Collection Sharing & Collaboration

  • User Groups & Collection Sharing (#148) — Create user groups and share collections with groups or individual users; granular viewer/editor permissions
  • Speaker profile sharing via the collection sharing infrastructure
  • Config/prompt sharing — Share LLM configs, prompts, media sources, and organization contexts between users
  • Per-collection AI prompts (#146) — Different summarization styles for different collection types
  • Bidirectional prompt-collection links — Prompts show which collections use them
  • Organization context (#142) — Inject domain knowledge into all LLM prompts for context-aware summaries

📤 Upload & Media

  • TUS 1.0.0 resumable uploads (#10) — Chunked uploads with MinIO multipart storage that survive network interruptions
  • Collection & tag selection at upload (#145) — Organize files during upload, not after
  • URL download quality settings (#122) — Configure video resolution, audio-only mode, and bitrate for yt-dlp downloads
  • File retention & auto-deletion (#134) — Admin-configurable file retention with automatic deletion and GDPR-compliant audit logging
  • Auto-labeling (#140) — AI suggests tags and collections from transcript content with fuzzy deduplication
  • Disable AI summary per upload (#152)
  • Disable speaker diarization per upload (#151)
  • Selective reprocessing (#143) — Stepper UI to re-run specific pipeline stages on existing files
  • YouTube bot-bypass — 2026 yt-dlp best practices (Deno JS runtime, client rotation, proper headers) for 1,800+ supported platforms

🛡️ Frontend Hardening Sprint

A dedicated audit sprint shipped in this release. Full details below under "Security", but the highlights:

  • Flash of Authenticated Content (FOAC) fix — Layout now gates protected content during async auth verification
  • Centralized user state cleanup (lib/session/clearUserState.ts) — 17+ stores, caches, and localStorage keys cleared on every login/logout
  • Session-scoped AbortController cancels in-flight requests on logout
  • bfcache invalidation — Back button after logout forces reload to discard restored snapshots
  • DOMPurify sanitization across 8 {@html} render sites; replaces a bypassable regex sanitizer
  • Production source maps disabled
  • Keycloak redirect URL validation

🎨 UX & Frontend Polish

  • Upload modal redesign — Replaced the 4,603-line monolith with a 6-step linear stepper (Media → Tags → Collections → Speakers → Options → Submit) plus a conditional Extract step for large videos. All three upload sources (file/URL/recording) now share steps 2-6. "Remember previous values" and "Review with defaults" shortcuts for power users
  • Skeleton loaders — Replace generic spinners on home gallery, search results, file detail, and speaker clusters/profiles/inbox (~20% faster perceived load per Nielsen Norman research)
  • Gallery click feedback — Instant press state + mousedown prefetch (~50-100ms head start)
  • Gallery redesign — Compact Apple-like grid cards, list view, sorting, multi-select bulk actions
  • Gallery state persistence — Filters persist across file detail navigation; scroll position restored on back
  • Collection & Share modal polish — Intro text, permission reference cards, empty states, backdrop-click data-loss protection
  • Manage Collections visual fix — Eliminated the "card in a card" glitch
  • Settings redesign — Tabbed navigation, per-user preferences, speaker behavior defaults
  • Queue Dashboard — Unified tasks view (formerly File Status) with quick filters, DatePicker, and pagination
  • Stepper reprocess UI — Step-by-step reprocessing with stage picker
  • Gallery action consolidation — Action buttons moved to header with dropdown groups (#139)
  • Multi-select with auto-filter a…

v0.3.3 - Community Contributions & Protected Media Support

14 Jan 05:13



Community-driven release featuring contributions from @vfilon, who submitted all four PRs in this version!

Highlights

  • 🇷🇺 Russian Language Support - 8th supported UI language with 1,600+ translated strings
  • 🔐 Protected Media Authentication - New plugin system for downloading from password-protected corporate video portals (MediaCMS support built-in)
  • 🛠️ Bug Fixes - VRAM monitoring fix for non-CUDA devices, loading screen translation fix
  • 🔧 URL Utilities - Centralized URL construction for consistent dev/production behavior

How to Update

Docker Compose:

docker compose pull
docker compose up -d

Protected Media Setup (Optional)

To enable authenticated downloads from MediaCMS installations:

# Add to .env
MEDIACMS_ALLOWED_HOSTS=media.example.com,mediacms.internal
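The allowlist gates which hosts the plugin will authenticate against; a sketch of the kind of check involved (`is_allowed_mediacms_host` is a hypothetical helper name, not the project's actual code):

```python
from urllib.parse import urlparse

def is_allowed_mediacms_host(url: str, allowed_hosts: str) -> bool:
    """Check a download URL's hostname against the comma-separated
    MEDIACMS_ALLOWED_HOSTS value."""
    allowed = {h.strip().lower() for h in allowed_hosts.split(",") if h.strip()}
    host = (urlparse(url).hostname or "").lower()
    return host in allowed

hosts = "media.example.com,mediacms.internal"
print(is_allowed_mediacms_host("https://media.example.com/view?m=abc", hosts))  # → True
print(is_allowed_mediacms_host("https://evil.example.com/view", hosts))         # → False
```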

Full Changelog

See CHANGELOG.md for complete details.

Thank You

Special thanks to @vfilon for contributing all four PRs in this release!

v0.3.2 - Setup Script Bug Fixes

17 Dec 01:42


Patch release fixing critical bugs in the one-liner installation script that prevented successful setup on fresh installations.

Note: This is a scripts-only release. No Docker container rebuild required.

Fixed

Setup Script Fixes

  • Scripts Directory Creation - Fixed curl error 23 ("Failure writing output to destination") when downloading SSL and permission scripts by creating the scripts/ directory before download attempts
  • PyTorch 2.6+ Compatibility - Applied torch.load patch to download-models.py for PyTorch 2.6+ compatibility, mirroring the fix already present in the backend (from Wes Brown's commit 8929cd6)
    • PyTorch 2.6 changed weights_only default to True, causing omegaconf deserialization errors during model downloads
    • The patch sets weights_only=False for trusted HuggingFace models

Upgrade Notes

For existing installations: No action required - Docker containers already have the PyTorch fix.

For new installations: The one-liner setup script now works correctly:

curl -fsSL https://raw.githubusercontent.com/davidamacey/OpenTranscribe/master/setup-opentranscribe.sh | bash

Full Changelog: https://github.com/davidamacey/OpenTranscribe/blob/master/CHANGELOG.md

v0.3.1 - Script Enhancements & Documentation Updates

16 Dec 13:49



Patch release with enhanced setup scripts for HTTPS/SSL configuration and comprehensive documentation updates covering v0.2.0 and v0.3.0 features.

Highlights

New Management Commands

  • ./opentranscribe.sh setup-ssl - Interactive HTTPS/SSL configuration
  • ./opentranscribe.sh version - Check current version and available updates
  • ./opentranscribe.sh update - Update containers only (quick)
  • ./opentranscribe.sh update-full - Update containers + config files (recommended)

NGINX Improvements

  • Automatic NGINX overlay loading when NGINX_SERVER_NAME is configured
  • NGINX health check added to ./opentr.sh health

Documentation Updates

  • New comprehensive NGINX/SSL setup guide
  • Updated docs for Universal Media URL support (1800+ platforms)
  • Added garbage cleanup feature documentation
  • FAQ entries for system statistics and transcript pagination
  • All Docusaurus and README docs updated for v0.2.0/v0.3.0 features

How to Update

Existing installations:

./opentranscribe.sh update-full

New installations:

curl -fsSL https://raw.githubusercontent.com/davidamacey/OpenTranscribe/master/setup-opentranscribe.sh | bash

Full Changelog

See CHANGELOG.md

OpenTranscribe v0.3.0 - Universal Media URL & NGINX Support

15 Dec 12:24


What's New in v0.3.0

This release integrates valuable contributions from the community fork by @vfilon, bringing major new features including support for 1800+ video platforms and production-ready NGINX reverse proxy with SSL/TLS.

🎬 Universal Media URL Support (1800+ Platforms)

The headline feature expands OpenTranscribe far beyond YouTube:

Supported Platforms:

  • Primary (Best Support): YouTube, Dailymotion, Twitter/X
  • Secondary: Vimeo (public only), TikTok (variable), and 1800+ more via yt-dlp

Features:

  • Dynamic source platform detection from yt-dlp metadata
  • User-friendly error messages for authentication-required platforms
  • Platform guidance for common issues (Vimeo login, Instagram restrictions, etc.)
  • Updated UI with "Supported Platforms" section and limitations warning

Note: Authentication is not currently supported. Videos requiring login will fail with helpful error messages guiding users to publicly accessible alternatives.

🔐 NGINX Reverse Proxy with SSL/TLS

This release closes #72, enabling browser microphone recording on remote network access:

  • docker-compose.nginx.yml overlay for production deployments
  • Full SSL/TLS configuration with HTTP → HTTPS redirect
  • WebSocket proxy support for real-time updates
  • 2GB file upload support for large media files
  • Flower dashboard and MinIO console accessible through NGINX
  • Self-signed certificate generation script

🔧 Critical Bug Fixes: UUID/ID Standardization

Comprehensive fix for UUID/ID handling across 60+ files:

Issues Fixed:

  • Speaker recommendations not showing for new videos
  • Profile embedding service returning wrong ID type
  • Inconsistent ID handling between backend and frontend
  • Comment system UUID issues
  • Password reset flow problems

🏗️ Infrastructure Improvements

GPU Configuration:

  • Separated into optional docker-compose.gpu.yml overlay
  • Better cross-platform support (macOS, CPU-only systems)
  • Auto-detection in opentr.sh script

Task Management:

  • Task status reconciliation before marking files as stuck
  • Multiple timestamp fallbacks for better reliability
  • Auto-refresh analytics when segment speaker changes

LLM Service:

  • Ollama context window configuration (num_ctx parameter)
  • Model-aware temperature handling
  • Better logging with resolved endpoint info
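For reference, Ollama accepts the context window as options.num_ctx in its request body. A hedged example payload for the /api/generate endpoint (the model name and 8192 value here are placeholders, not OpenTranscribe defaults):

```json
{
  "model": "llama3.2:latest",
  "prompt": "Summarize this transcript...",
  "options": {
    "num_ctx": 8192
  }
}
```

OpenTranscribe sets this for you from its LLM configuration; raising num_ctx lets longer transcript sections fit in a single summarization call at the cost of more memory.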

🌐 i18n Updates

All 7 supported languages have been updated:

  • Notification text changed from "YouTube Processing" to "Video Processing"
  • New media URL description and platform limitation strings
  • Updated recommended platforms list

🙏 Acknowledgments

Special thanks to @vfilon for the fork contributions that made this release possible:

  • Universal Media URL support concept
  • NGINX reverse proxy configuration
  • Task status reconciliation improvements
  • GPU overlay separation

How to Update

Docker Compose (Recommended)

# Pull the latest images
docker compose pull

# Restart with new images
docker compose up -d

For NGINX/SSL Setup

# Set NGINX_SERVER_NAME in .env
./scripts/generate-ssl-cert.sh
./opentr.sh start prod

See docs/NGINX_SETUP.md for complete setup instructions.

Full Changelog

See the CHANGELOG for complete details.


Full Changelog: v0.2.1...v0.3.0

v0.2.1 - Security Patch Release

13 Dec 14:55



This release addresses critical container vulnerabilities identified in security scans. All users are encouraged to update.

Resolved Critical CVEs (4 → 0)

| CVE | Package | Severity | Status |
| --- | --- | --- | --- |
| CVE-2025-47917 | libmbedcrypto | CRITICAL | ✅ Fixed |
| CVE-2023-6879 | libaom3 | CRITICAL | ✅ Fixed |
| CVE-2025-7458 | libsqlite3 | CRITICAL | ✅ Fixed |
| CVE-2023-45853 | zlib | CRITICAL | ✅ Fixed |

Container Updates

Frontend:

  • nginx:1.29.3-alpine3.22 → nginx:1.29.4-alpine3.23
  • Fixed 6 vulnerabilities (3 HIGH, 3 MEDIUM) in libpng and busybox
  • Added HEALTHCHECK instruction

Backend:

  • python:3.12-slim-bookworm → python:3.13-slim-trixie
  • Debian 12 → Debian 13 "trixie"
  • Python 3.12 → Python 3.13
  • Added HEALTHCHECK instruction

How to Update

Docker Compose:

docker compose pull
docker compose up -d

Manual:

docker pull davidamacey/opentranscribe-frontend:v0.2.1
docker pull davidamacey/opentranscribe-backend:v0.2.1

Full Changelog

See CHANGELOG.md for complete details.


🔒 Your security is our priority. Thank you for using OpenTranscribe.

v0.2.0 - Community-Driven Multilingual Release

13 Dec 00:43
8851626


We're thrilled to announce OpenTranscribe v0.2.0! This release is special because it marks our first major community-driven update, featuring contributions from real-world users who are actively using OpenTranscribe in production.

Growing Community

In just over a month since our v0.1.0 release, OpenTranscribe has seen exciting growth:

Community Contributions

Wes Brown's Seven Pull Requests

A massive thank you to Wes Brown (@SQLServerIO) who submitted an incredible seven pull requests addressing real-world issues he encountered while using OpenTranscribe:

  1. PR #110: Pagination for large transcripts - Fixes page hanging with thousands of segments
  2. PR #107: Auto-cleanup garbage transcription segments
  3. PR #106: User admin endpoints now use UUID instead of integer ID
  4. PR #105: Speaker merge UI and segment speaker reassignment
  5. PR #104: LLM model discovery for OpenAI-compatible providers
  6. PR #103: Per-file speaker count settings in upload and reprocess UI
  7. PR #102: PyTorch 2.6+ compatibility and speaker diarization settings

The Multilingual Feature Request

Issue #99 from @LaboratorioInternacionalWeb highlighted a critical gap in our product: Spanish audio files were being transcribed to English because WhisperX was hardcoded with language="en" and task="translate".

What's New in v0.2.0

🌍 Multilingual Transcription Support (100+ Languages)

  • Source Language: Auto-detect or specify the audio language (100+ languages supported)
  • Translate to English: Toggle to translate non-English audio (default: OFF - keeps original language)
  • LLM Output Language: Generate AI summaries in 12 different languages
  • ~42 languages have word-level timestamp support via wav2vec2 alignment
  • Settings are stored per-user in the database

🌐 UI Internationalization (7 Languages)

The UI is now available in:

  • English (default)
  • Spanish (Español)
  • French (Français)
  • German (Deutsch)
  • Portuguese (Português)
  • Chinese (中文)
  • Japanese (日本語)

🎙️ Speaker Management Enhancements

  • Speaker Merge UI: New visual interface to combine duplicate speakers with segment preview and reassignment
  • Per-File Speaker Settings: Configure min/max speakers at upload or reprocess time
  • User-Level Preferences: Save default speaker detection settings

🤖 LLM Integration Improvements

  • Model Auto-Discovery: Automatic detection of available models for vLLM, Ollama, and Anthropic providers
  • Anthropic Support Enhanced: Native model discovery via /v1/models API
  • Multilingual Output: Generate AI summaries in 12 different languages
  • Improved Configuration UX: Toast notifications, better API key handling, edit mode with stored keys
  • Updated Default Models: Anthropic uses claude-opus-4-5-20251101, Ollama uses llama3.2:latest

⚡ Performance & Stability

  • Pagination for Large Transcripts: No more browser hanging with thousands of segments
  • Auto-Cleanup Garbage Segments: Automatic detection and removal of erroneous transcription segments
  • PyTorch 2.6+ Compatibility: Support for the latest PyTorch versions
  • Backend Code Quality: Reduced cyclomatic complexity across 47 functions in 27 files

👤 Admin & User Experience

  • System Statistics: CPU, memory, disk, and GPU usage now visible to all users
  • Admin Password Reset: Secure password reset functionality with validation
  • UUID Consistency: Fixed admin endpoints to use UUID instead of integer IDs

Upgrading to v0.2.0

# If using the production installer
cd opentranscribe
./opentranscribe.sh update

# Or pull the latest Docker images
docker compose pull
docker compose up -d

Database migrations run automatically on startup - no manual intervention required.

Resources


Full Changelog: v0.1.0...v0.2.0

Happy transcribing! 🎉
The OpenTranscribe Team

OpenTranscribe v0.1.0 - First Official Release

06 Nov 05:05



Release Date: November 5, 2025
License: GNU Affero General Public License v3.0 (AGPL-3.0)

Overview

We're thrilled to announce the first official release of OpenTranscribe! After 6 months of intensive development starting in May 2025, what began as a weekend experiment has evolved into a production-ready, fully-featured AI transcription platform.

OpenTranscribe is a powerful, self-hosted AI-powered transcription and media analysis platform that combines state-of-the-art AI models with a modern web interface to provide high-accuracy transcription, speaker identification, AI summarization, and advanced search capabilities.

Why AGPL-3.0?

We've chosen the GNU Affero General Public License v3.0 to:

  • Protect open source - Ensure the code remains open and accessible to everyone
  • Prevent proprietary forks - Require that modifications, especially network services, remain open
  • Ensure transparency - Network users have the right to access the source code
  • Build community - Foster collaboration and shared improvements

Key Highlights

🎧 Professional-Grade Transcription

  • 70x realtime speed on GPU with large-v2 model
  • Word-level timestamps using WAV2VEC2 alignment
  • 50+ languages supported with automatic translation
  • Universal format support - Audio and video files up to 4GB

👥 Advanced Speaker Intelligence

  • Automatic speaker diarization using PyAnnote.audio
  • Cross-video speaker recognition with voice fingerprinting
  • AI-powered speaker suggestions using LLM context analysis
  • Global speaker profiles that persist across all recordings
  • Speaker analytics with talk time, pace, and interaction patterns

🤖 AI-Powered Insights

  • LLM integration - Support for OpenAI, Claude, vLLM, Ollama, OpenRouter, and custom providers
  • BLUF format summaries - Bottom Line Up Front structured analysis
  • Custom AI prompts - Unlimited prompts with flexible JSON schemas
  • Intelligent sectioning - Handles transcripts of any length automatically
  • Local or cloud processing - Privacy-first local models or powerful cloud AI

🔍 Powerful Search & Discovery

  • Hybrid search - Keyword + semantic search with OpenSearch 3.3.1
  • 9.5x faster vector search - Significantly improved performance
  • 25% faster queries with 75% lower p90 latency
  • Advanced filtering - Search by speaker, tags, collections, date, duration
  • Interactive navigation - Click-to-seek on transcripts and waveforms

⚡ Enterprise Performance

  • Multi-GPU scaling - Optional parallel processing (4+ workers per GPU)
  • Specialized work queues - GPU, CPU, Download, NLP, and Utility queues
  • Non-blocking architecture - Parallel processing saves 45-75s per 3-hour file
  • Model caching - Efficient ~2.6GB cache with automatic persistence
  • Complete offline support - Full airgapped deployment capability

Installation

Quick Install (Recommended)

curl -fsSL https://raw.githubusercontent.com/davidamacey/OpenTranscribe/master/setup-opentranscribe.sh | bash
cd opentranscribe
./opentranscribe.sh start

Access at: http://localhost:5173

Docker Hub Images

Pre-built multi-platform images (AMD64, ARM64):

  • davidamacey/opentranscribe-backend:v0.1.0
  • davidamacey/opentranscribe-frontend:v0.1.0

From Source

git clone https://github.com/davidamacey/OpenTranscribe.git
cd OpenTranscribe
git checkout v0.1.0
cp .env.example .env
# Edit .env with your settings
./opentr.sh start dev

What's Included

Core Features

  • Transcription - WhisperX with faster-whisper backend
  • Speaker Diarization - PyAnnote.audio integration with auto-labeling and profile generation
  • Media File Upload - Direct upload of audio/video files up to 4GB with drag-and-drop
  • Video File Size Detection - Client-side audio extraction option for large video files
  • YouTube Support - Direct URL and playlist processing for batch transcription
  • Browser Microphone Recording - Built-in recording (localhost or HTTPS) with background operation
  • AI-Powered Summaries - Multi-provider LLM integration with customizable formats
  • AI Topic Generation - Automatic tag and collection suggestions from transcript content
  • Timestamp Comments - User annotations anchored to specific video moments
  • Search Engine - OpenSearch 3.3.1 with hybrid keyword and vector search
  • Collections - Organize media into themed groups with AI suggestions
  • Analytics - Speaker metrics and interaction analysis
  • Waveform Visualization - Interactive audio timeline
  • PWA Support - Installable progressive web app
  • Dark/Light Mode - Full theme support

Infrastructure

  • Docker Compose - Multi-environment orchestration
  • PostgreSQL - Relational database with JSONB
  • MinIO - S3-compatible object storage
  • Redis - Message broker and caching
  • Celery - Distributed task processing
  • NGINX - Production web server
  • Flower - Task monitoring dashboard

Security

  • Non-root containers - Principle of least privilege
  • RBAC - Role-based access control
  • Encrypted secrets - Secure API key storage
  • Security scanning - Trivy and Grype integration
  • Session management - JWT-based authentication

System Requirements

Minimum

  • CPU: 4 cores
  • RAM: 8GB
  • Storage: 50GB (including ~3GB for AI models)
  • GPU: Optional (CPU-only mode available)

Recommended

  • CPU: 8+ cores
  • RAM: 16GB+
  • Storage: 100GB+ SSD
  • GPU: NVIDIA GPU with 8GB+ VRAM (RTX 3070 or better)

Supported Platforms

  • OS: Linux, macOS (including Apple Silicon), Windows (via WSL2)
  • Architectures: AMD64, ARM64
  • GPUs: NVIDIA CUDA, Apple MPS (Metal)

Performance Benchmarks

| Metric | Performance |
| --- | --- |
| Transcription Speed (GPU) | 70x realtime |
| Vector Search Improvement | 9.5x faster |
| Query Performance | 25% faster, 75% lower p90 latency |
| Multi-GPU Throughput | 4 videos simultaneously (4 workers) |
| Model Cache Size | ~2.6GB total |

Documentation

📚 Complete Documentation: https://docs.opentranscribe.app

Roadmap to v1.0.0

We're committed to delivering a stable, production-ready v1.0.0 release. While we'll strive for backwards compatibility, we cannot guarantee it until v1.0.0. Breaking changes will be clearly announced.

Planned features for future releases:

  • Real-time transcription for live streaming
  • Enhanced speaker analytics and visualization
  • Better speaker diarization models
  • Google-style text search
  • LLM powered RAG Chat with transcript text
  • Other refinements along the way!

Known Issues

No critical issues at release time. See GitHub Issues for community-reported items.

Contributing

We welcome contributions from the community! See our Contributing Guide for details.

Ways to contribute:

  • 🐛 Report bugs and issues
  • 💡 Suggest new features
  • 🔧 Submit pull requests
  • 📚 Improve documentation
  • 🌍 Translate the interface
  • ⭐ Star the repository

Support & Community

Acknowledgments

OpenTranscribe builds upon amazing open-source projects:

  • OpenAI Whisper - Foundation speech recognition model
  • WhisperX - Enhanced alignment and diarization
  • PyAnnote.audio - Speaker diarization toolkit
  • FastAPI - Modern Python web framework
  • Svelte - Reactive frontend framework
  • PostgreSQL - Reliable database system
  • OpenSearch - Search and analytics engine
  • Docker - Containerization platform

Special thanks to the AI community and all contributors who helped make this release possible!

License

OpenTranscribe is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0).

See LICENSE for full details.


Built with ❤️ by the OpenTranscribe community

OpenTranscribe demonstrates the power of AI-assisted development while maintaining full local control over your data and processing.

Download: v0.1.0 Release
Docker: Backend | Frontend
Docs: docs.opentranscribe.app