OpenVoice Audio Summarizer is an open-source platform designed to transform meeting recordings and audio files into concise summaries and actionable insights. The system helps improve meeting productivity by organizing key discussion points and reducing the time required to review recordings. It primarily operates using open-source end-to-end models, eliminating the need for external API services. As a result, the entire process runs locally on the user’s machine, ensuring zero API cost and better data privacy.
The OpenVoice Audio Summarizer is built using pre-trained models such as Falcon3 with 1 billion parameters and the Whisper Medium speech recognition model. Whisper is used to convert audio into text, while the Falcon model generates summaries and insights from the transcribed content.
To improve computational efficiency, the system runs on an NVIDIA GeForce RTX 2070 GPU. Additionally, 4-bit quantization is applied to the Falcon model to optimize memory usage and performance while maintaining model accuracy.
- Audio File Processing
- Audio File Preprocssing
- Automatic Transcription
- Automatic Speech Recognition (ASR)
- Intent and Topic Detection
- Context-Aware Summarization
- Real Time Internal Logs
- Local Model Execution
The system is follows a layered, interface-driven architecture, designed to ensure modularity, maintainability, and extensibility. Fundamental design principles include dependency injection, interface-based abstraction, separation of concerns, and a plugin-oriented tool architecture.
Key Architectural Highlights:
- Dependencies are injected via a centralized container, promoting loose coupling and easy testing.
- Core components define interfaces to enforcing contracts and enabling polymorphic behavior.
- Singleton patterns are used for global utilities like logging.
- The architecture defines clear layers: UI/API, Components (business logic), Infrastructure (integrations), Interfaces (contracts), and Container (dependency management).
┌─────────────────────────────────────┐
│ UI Layer (Streamlit) │
│ Backend API (FastAPI) │
├─────────────────────────────────────┤
│ Components (Business Logic) │
│ ├─ Chat Generation Service │
│ ├─ Chat Connection Service │
│ ├─ Transcript Generation Service │
│ └── Voice Connection Service │
├─────────────────────────────────────┤
│ Infrastructure (Integrations) │
│ ├─ FalconAI Service │
| ├─ WhisperAI Service │
| ├─ Audio Service │
│ ├─ Chat Prompt │
│ ├─ Environment & Configuration │
│ └─ AI Models │
├─────────────────────────────────────┤
│ Interfaces (Contracts) │
│ ├─ Chat Interfaces │
│ ├─ Audio Interfaces │
│ ├─ Infrastructure Interfaces │
│ ├─ Logging Interfaces │
│ └─ LLM Interfaces │
├─────────────────────────────────────┤
│ Container (Dependency Injection) │
│ Dependency Wiring & Factory │
├─────────────────────────────────────┤
│ Core (Utilities & Patterns) │
│ Singleton Meta-class │
└─────────────────────────────────────┘
- UI/API Layer: Facilitates user interactions and exposes RESTful endpoints.
- Components Layer: Orchestrates business logic and AI-driven chat operations.
- Infrastructure Layer: Manages external integrations, Audio services, and system utilities.
- Interfaces Layer: Defines abstract contracts to ensure modularity and testability.
- Container LayerL: Oversees dependency injection and management of singleton instances.
open-voice-audio-summarizer/
│
├── audio/ # Audio files for testing
│ ├── meeting_audio.wav
│ └── team_meeting.mp3
│
├── backend/ # FastAPI API Layer
│ ├── dependencies/ # Dependency providers for routes
│ │ └── dependencies.py
│ │
│ ├── routes/ # API endpoints
│ │ └── chat.py
│ │
│ └── main.py # FastAPI app initialization with lifespan
│
├── src/ # Application Core Layer
│ ├── dto/ # Internal DTOs (service layer contracts)
│ │ └── audio_data.py # Audio data representation
│ │
│ │
│ ├── components/ # Business logic orchestration
│ │ ├── chat_generation.py # summary generation
│ │ ├── chat_connection.py # falcon3 client management
│ │ ├── voice_transcription.py # audio to raw text generation
│ │ └── voice_connection.py # whisper client management
│ │
│ │
│ ├── container/ # Dependency Injection
│ │ └── openvoice_container.py # Factory for FalconAI and WhisperAI
│ │
│ │
│ ├── core/ # Core utilities
│ │ └── singleton_meta.py # Singleton pattern implementation
│ │
│ │
│ ├── infrastructure/ # External integrations
│ │ ├── falconai_service.py # Falcon3 client wrapper
│ │ ├── falconai_client.py # Low-level FalconAI integration
│ │ ├── whisperai_service.py # Whisper client wrapper
│ │ ├── whisperai_client.py # Low-level WhisperAI integration
│ │ ├── falcon_prompt.py # Prompt management
│ │ ├── config_provider.py # Configuration loader
│ │ └── audio_processor.py # Audio preprocessing services
│ │
│ ├── interfaces/ # Abstract contracts
│ │ ├── audio/ # Audio interfaces
│ │ │ └── audio_service_interface.py
│ │ │
│ │ ├── chat/ # Chat interfaces
│ │ │ └── chat_service_interface.py
│ │ │
│ │ ├── infra/ # Infrastructure interfaces
│ │ │ └── config_provider_interface.py
│ │ │
│ │ ├── logging/ # Logging interfaces
│ │ │ └── logger_interface.py
│ │ │
│ │ └── llm/ # LLM interfaces
│ │ ├── falcon_interface.py
│ │ └── whisper_interface.py
│ │
│ │
│ └── logs/
│ ├── logger_singleton.py # Singleton logger
│ └── logger_streamlit.py # Streamlit logger
│
│
├── ui/ # Frontend Layer
│ └── app.py # Streamlit openvoice interface
│
│
├── config.yaml # Application configuration
├── setup.py # Package setup
├── requirements.txt # Dependencies
└── README.md # Documentation
User Upload Audio Input (Streamlit UI)
│
▼
FastAPI Backend (async)
│
├─► (/chat/user_upload)
│
│
├─► (/chat/streamlit_logs)
│
│
▼
OpenVoiceContainer ──────► VoiceConnectionService & ChatConnectionService
│
├─► VoiceTranscriptionService
│ (Create and return an audio transcription)
│
├─► ChatGenerationService
│ (Generation and retrun the audio summary)
│
▼
AI Model & Audio Process
│
├─► Audio Processor
│ (Provide preprocess audio waveform)
│ └─ Stream logs to UI
│
├─► Whisper Audio Transcription
│ (Transcribe audio data to text)
│ └─ Stream logs to UI
│
├─► Prompt Provider
│ (Provide system prompts and user prompt with transcribed data)
│ └─ Stream logs to UI
│
├─► Falcon3 Text to Summary
│ (Generate a response for given prompt)
│ └─ Stream logs to UI
│
└─ If response generated:
│ │
│ └─► Return to UI
│ ├─ Display final result to user
│─────────►├─ Streaming logs continue
- A system with GPU and CUDA support
- Minimum GPU memory: 6 GB
- Conda installed (for environment management)
- Python 3.10 recommended
- Create a Conda environment:
conda create -n llm-openvoice python=3.10- Activate the environment:
conda activate llm-openvoice- Install PyTorch, TorchAudio, and CUDA support:
conda install pytorch torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia- Install dependencies
pip install -r requirement.txt- Download required models:
Falcon3-1B-Instruct→ save to a local folder, e.g., G:/LLMs/Falcon3-1B-Instruct- Download
tiiuae/Falcon3-1B-Instruct
- Download
Whisper-Medium→ save to a local folder, e.g., G:/LLMs/whisper_medium_en- Download
openai/whisper-medium.en
- Download
- After downloading the models, update config.yaml with the paths where you saved the models
models:
Falcon3-1B-Instruct:
location: "YOUR_LOCAL_PATH/Falcon3-1B-Instruct"
Whisper-Medium:
location: "YOUR_LOCAL_PATH/whisper_medium_en"We welcome contributions related to:
- Additional language support
- AI & Prompt Engineering
- Architecture Improvements
- Backend Enhancements
- UI Improvements
- Testing & Quality Assurance
- 🍴 Fork the repository
- 🌿 Create a
feature/*branch - 🛠️ Commit changes with clear messages
- 📤 Open a Pull Request
The project follows a Git Flow–inspired workflow:
- 🌿
master— Stable, production-ready releases - 🌱
develop— Active development branch - ✨
feature/*— New feature branches
- Pull latest changes from
develop - Create a
feature/*branch - Implement and test changes
- Open PR → Merge into
develop - Release from
develop→ Merge intomaster
This ensures stability while enabling safe feature development.