This project was born out of my need to convert my class notes into digital files I can store in the cloud. I love writing notes with pen and paper, but after a semester ends I am left with stacks of notebooks and loose sheets that I want to turn into text files. VoxFormat converts my notes into formatted text (bold and italics for now; more formatting features coming soon!) as I read them aloud while revising for exams.
VoxFormat is a command-line interface (CLI) application designed to streamline the process of digitizing handwritten notes or drafting documents through voice dictation. It allows users to speak their text, which is transcribed in real-time, and simultaneously apply formatting like bold and italics using simple voice commands. This tool aims to be a faster alternative to manual typing or potentially error-prone handwriting OCR, providing a direct path from spoken words to a structured digital document.
The application continuously listens to voice input, transcribes the speech, and parses specific voice cues (e.g., "format start bold," "format stop italic") to apply basic Markdown formatting to the output. The resulting document can then be automatically saved as a plain text file.
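The cue-parsing step described above can be illustrated with a minimal sketch. This is not VoxFormat's actual parser (the function name `apply_voice_cues` is hypothetical, and spacing around the Markdown markers is kept naive for brevity); it just shows the idea of scanning the transcript word by word and replacing three-word cues with markers:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical sketch: replace three-word cues such as "format start bold"
// with Markdown markers and pass every other word through unchanged.
std::string apply_voice_cues(const std::string& transcript) {
    std::istringstream in(transcript);
    std::vector<std::string> words;
    for (std::string w; in >> w;) words.push_back(w);

    std::string out;
    auto append = [&](const std::string& s) {
        if (!out.empty()) out += ' ';
        out += s;
    };
    for (std::size_t i = 0; i < words.size(); ++i) {
        // A cue is "format" followed by "start" or "stop" and a style word.
        const bool cue = i + 2 < words.size() && words[i] == "format" &&
                         (words[i + 1] == "start" || words[i + 1] == "stop");
        if (cue && words[i + 2] == "bold")    { append("**"); i += 2; continue; }
        if (cue && words[i + 2] == "italics") { append("*");  i += 2; continue; }
        append(words[i]);
    }
    return out;
}
```

A transcript like `"hello format start bold world format stop bold"` would come out as `"hello ** world **"` under this scheme.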
- Real-time Transcription: Converts spoken audio to text on the fly.
- Voice-Controlled Formatting: Apply bold and italics using voice commands during dictation.
- Continuous Operation: Listens indefinitely until explicitly stopped by a voice command or a period of silence.
- Automatic Saving: Saves the transcribed (raw) text to an `output.txt` file in an `outputs/` directory upon application exit.
- Live Terminal Preview: Displays the Markdown-formatted document in the terminal as it's being created.
- Cross-Platform (Core): Built with C++ and standard libraries, with platform-specific audio handling via PortAudio.
- Core Language: C++ (utilizing C++20 features)
- Speech-to-Text Engine: Whisper.cpp (using the `ggml-small.en.bin` model for a balance of accuracy and performance)
- Audio Capture: PortAudio (cross-platform audio I/O library)
- Audio Resampling: libsamplerate (Secret Rabbit Code)
- Build System: CMake
- Concurrency: `std::thread`, `std::mutex`, `std::condition_variable`, and `std::atomic` for multi-threaded audio capture and processing.
- File I/O: `std::fstream` and `std::filesystem` (C++17) for saving documents and managing paths.
- String Processing: Standard C++ string facilities, with `std::regex` for artifact cleaning.
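The concurrency pieces listed above (mutex, condition variable, atomic) typically combine into a producer/consumer hand-off between the capture thread and the STT worker. The following is an illustrative sketch of that pattern, not VoxFormat's actual classes (the `AudioQueue` name and its methods are assumptions):

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <vector>

// Illustrative hand-off between an audio-capture producer and an STT
// consumer: chunks are pushed under a mutex, the consumer blocks on a
// condition variable, and an atomic flag signals shutdown.
struct AudioQueue {
    std::queue<std::vector<float>> chunks;
    std::mutex m;
    std::condition_variable cv;
    std::atomic<bool> done{false};

    void push(std::vector<float> chunk) {
        {
            std::lock_guard<std::mutex> lk(m);
            chunks.push(std::move(chunk));
        }
        cv.notify_one();
    }

    // Returns false once stop() has been called and the queue is drained.
    bool pop(std::vector<float>& out) {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return !chunks.empty() || done.load(); });
        if (chunks.empty()) return false;
        out = std::move(chunks.front());
        chunks.pop();
        return true;
    }

    void stop() {
        done = true;
        cv.notify_all();
    }
};
```

In use, the capture thread would call `push()` from its audio callback path, the worker thread would loop on `pop()` feeding chunks to Whisper, and shutdown would call `stop()` so the worker's final `pop()` returns `false`.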
1. Clone the Repository: It is crucial to clone with `--recurse-submodules` to fetch the PortAudio dependency. If you have already cloned without it, navigate to the project directory and run `git submodule update --init --recursive`.

   ```
   git clone --recurse-submodules https://github.com/spriha27/voxformat.git
   cd voxformat
   ```

2. Run Setup Script: This script will download the necessary Whisper model (`ggml-small.en.bin`) and attempt to build external dependencies. Ensure you have `bash` and standard build tools (`make`, a C++ compiler) installed.

   ```
   bash setup.sh
   ```

   The script will place the model in `external/whisper.cpp/models/`.
3. Build VoxFormat:
   - Using CMake directly:

     ```
     mkdir build
     cd build
     cmake ..
     make -j$(nproc || sysctl -n hw.ncpu)  # Adjust -j for your number of cores
     ```

   - Using CLion:
     - Open the `voxformat` project directory in CLion.
     - CLion should automatically detect `CMakeLists.txt`.
     - Important for macOS (Metal GPU Acceleration):
       - Ensure the `ggml-metal.metal` file (from `external/whisper.cpp/`) is copied into your CMake build directory (e.g., `cmake-build-debug/` or `build/`). The `setup.sh` script attempts to do this.
       - In CLion, go to `Run > Edit Configurations...`, select the `voxformat` target, and set the "Working directory" to your CMake build directory (e.g., `/path/to/voxformat/cmake-build-debug` or `/path/to/voxformat/build`). This helps `libwhisper` find the Metal shaders at runtime.
     - Build the project using CLion's build button (hammer icon).
4. Run:
   - From the CMake build directory (after `make`): `./voxformat`
   - From CLion: Run the `voxformat` configuration.
- Start the application. You will see a message "--- Listening... ---".
- Begin speaking your text.
- To apply formatting, use the following voice commands clearly:
  - `format start bold` - Subsequent text will be bold.
  - `format stop bold` - Stops bold formatting.
  - `format start italics` - Subsequent text will be italicized.
  - `format stop italics` - Stops italic formatting.
- The application will display a live Markdown preview in the terminal.
- To stop the application and save the document:
  - Say: `format stop application`
  - Alternatively, the application will automatically stop and save after 60 seconds of continuous silence (no detected speech).
- Upon exit, a plain text file named `output.txt` will be saved in an `outputs/` directory within your project's root.
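The save-on-exit step can be sketched with the `std::filesystem` and `std::fstream` facilities listed in the tech stack. This is a minimal illustration, not VoxFormat's actual implementation (the `save_document` helper is hypothetical):

```cpp
#include <filesystem>
#include <fstream>
#include <string>

// Hypothetical sketch of save-on-exit: create the output directory if
// needed, then write the raw transcript to <dir>/<filename>.
bool save_document(const std::string& text,
                   const std::filesystem::path& dir = "outputs",
                   const std::string& filename = "output.txt") {
    std::error_code ec;
    std::filesystem::create_directories(dir, ec);  // no-op if it already exists
    if (ec) return false;
    std::ofstream out(dir / filename);
    if (!out) return false;
    out << text;
    return static_cast<bool>(out);  // false if the write failed
}
```

Using `std::error_code` with `create_directories` keeps the exit path exception-free, which matters when saving happens during shutdown.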
- Phase 0: Project Setup & Foundation (Initial Setup)
- Initialized Git repository and basic C++ project structure with CMake.
- Defined core data structures (`TextSegment`).
- Phase 1: Audio Input & STT Integration (Core Transcription)
- Integrated PortAudio for microphone audio capture.
- Integrated Whisper.cpp library.
- Successfully transcribed raw audio to text using the `ggml-small.en.bin` model.
- Implemented basic audio resampling using libsamplerate.
- Transitioned to a multi-threaded architecture:
  - Dedicated thread for audio capture (`AudioCapturer`).
  - Dedicated worker thread for audio processing and STT (`WhisperProcessor`).
  - Used mutexes and condition variables for thread-safe data exchange.
- Enabled Metal GPU acceleration on macOS for faster Whisper processing.
- Phase 2: Real-time Output & Basic Command Parsing (User Interaction)
- Implemented live preview of transcribed text in the terminal.
- Developed the initial `DocumentFormatter` class.
- Created a basic command parser within `DocumentFormatter` to recognize "format start/stop bold/italic" and "format stop application".
- Managed formatting state (bold, italic active).
- Implemented `print_current_document_preview` to display Markdown.
- Refined string utility functions (`to_lower`, `trim`, artifact cleaning).
- Phase 3: Application Lifecycle & File Output (Usability)
- Implemented continuous listening mode.
- Added a silence detector: application automatically stops and saves after 60 seconds of no detected speech activity.
- Ensured "format stop application" voice command correctly terminates the application.
- Implemented automatic saving of the transcribed document (raw text) to `outputs/output.txt` upon application exit (either by command or silence timeout).
- Phase 4: Cleanup & Documentation (Current)
- Refined code structure and comments.
- Created `README.md` and `setup.sh`.
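The string utilities mentioned in Phase 2 (`to_lower`, `trim`, `std::regex` artifact cleaning) could look roughly like the sketch below. These are illustrative versions, not the project's actual implementations, and the artifact patterns (bracketed tokens such as `[BLANK_AUDIO]` or parenthesized sounds) are assumptions about typical Whisper output:

```cpp
#include <algorithm>
#include <cctype>
#include <regex>
#include <string>

// Illustrative lowercase helper (ASCII only).
std::string to_lower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Illustrative trim of leading/trailing whitespace.
std::string trim(const std::string& s) {
    const auto b = s.find_first_not_of(" \t\r\n");
    if (b == std::string::npos) return "";
    const auto e = s.find_last_not_of(" \t\r\n");
    return s.substr(b, e - b + 1);
}

// Strip bracketed/parenthesized STT artifacts, e.g. "[BLANK_AUDIO]" or
// "(clears throat)". The patterns are assumed, not Whisper's full set.
std::string clean_artifacts(const std::string& s) {
    static const std::regex artifacts(R"(\[[^\]]*\]|\([^)]*\))");
    return std::regex_replace(s, artifacts, "");
}
```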
- Initial STT Inaccuracy & Duplication:
- Early versions using simple, fixed, non-overlapping audio chunks fed to Whisper resulted in poor transcription quality and significant repetition of text.
- Solution: Transitioned to a multi-threaded model where audio capture is decoupled from processing. Implemented a streaming-like approach for `WhisperProcessor` using `whisper_state` and the `new_segment_callback` from Whisper.cpp, which significantly improved transcription continuity and reduced duplication by leveraging Whisper's internal context management. (Based on recent logs, it seems we reverted to a simpler `whisper_full` on independent chunks in the worker thread, which also gave good results once the GPU was active and chunking was managed well.)
- Metal GPU Acceleration Not Engaging:
- Initially, `whisper_full` calls were very slow, indicating CPU-only processing despite `use_gpu = true`.
- Solution: Ensured `ggml-metal.metal` was in the executable's runtime working directory (e.g., `cmake-build-debug/`) and CLion's "Working directory" setting was correct. This allowed `libwhisper` to find and compile the Metal shaders.
- CMake Configuration and Linking:
- Several iterations were needed to correctly configure CMake to build and link PortAudio, libsamplerate, and Whisper.cpp, especially when managing them as submodules or included sources.
- Solution: Careful management of `add_subdirectory`, `FetchContent`, and `target_link_libraries`. Using explicit binary directories for subdirectories helped avoid conflicts.
- Thread Synchronization and Shutdown:
- Ensuring clean shutdown of threads (audio capture, whisper processing) and PortAudio required careful use of atomic flags, condition variables, and thread joining.
- Solution: Implemented a clear stop signal (`g_main_stop_threads`) checked by all threads, and robust shutdown sequences in `main()`.
- Command Word Recognition:
- The initial command keyword "vox" was often misheard by the `base.en` model.
- Solution: Switched to the more common English word "format," which improved STT reliability for commands. Further accuracy improvements came with the `small.en` model.
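The shutdown pattern from the "Thread Synchronization and Shutdown" challenge above can be sketched in a few lines: worker threads poll a shared atomic flag, the main path sets it and joins. The flag name mirrors the one mentioned in the text; the `demo_shutdown` wrapper is purely illustrative:

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Shared stop flag, as named in the text; checked by all worker threads.
std::atomic<bool> g_main_stop_threads{false};

// Illustrative wrapper: start a worker that loops until the flag is set,
// signal shutdown, join, and return how many iterations the worker ran.
int demo_shutdown() {
    g_main_stop_threads = false;
    int iterations = 0;
    std::thread worker([&] {
        while (!g_main_stop_threads.load()) {
            ++iterations;  // stand-in for real audio/STT work
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        }
    });
    std::this_thread::sleep_for(std::chrono::milliseconds(20));
    g_main_stop_threads = true;  // signal shutdown...
    worker.join();               // ...then wait for the worker to finish
    return iterations;
}
```

The key property is that the flag is the only shared shutdown state, so every thread observes the same stop signal and `join()` guarantees no thread outlives `main()`.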
- GUI Implementation: Develop a simple desktop GUI (e.g., using Qt or ImGui) for a more user-friendly experience than the CLI.
- Advanced De-duplication/Segment Merging: Implement timestamp-based segment merging for even smoother and more accurate concatenation of transcribed text from overlapping windows.
- More Formatting Options: Add support for underline, headings, lists, etc., via new voice commands.
- Customizable Output Filename: Allow users to specify the output filename via a voice command (e.g., "format save my_document"). (The parser supported this briefly; it could be re-enabled and made more robust.)
- Real-time Editing: Allow users to pause dictation and manually edit the transcribed text in a GUI.
- Model Selection: Allow users to choose different Whisper models (tiny, small, medium) based on their hardware and accuracy needs.
- Improved Error Handling & User Feedback: More comprehensive error messages and status updates.
- Configuration File: For settings like silence timeout, default model, etc.
- Support for More Languages: Whisper.cpp supports many languages; this could be exposed.
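The timestamp-based segment merging listed under future improvements could take a shape like the sketch below. This is one possible strategy, not a committed design: accept a segment only if it starts at or after the end of the last accepted segment, so text repeated across overlapping windows is dropped.

```cpp
#include <string>
#include <vector>

// Hypothetical transcribed segment with timestamps in seconds.
struct Segment {
    double start_s;
    double end_s;
    std::string text;
};

// Keep segments in order, skipping any that overlap audio we have
// already accepted; overlapping-window duplicates are discarded.
std::string merge_segments(const std::vector<Segment>& segs) {
    std::string out;
    double last_end = -1.0;
    for (const auto& s : segs) {
        if (s.start_s < last_end) continue;  // overlaps accepted audio
        if (!out.empty()) out += ' ';
        out += s.text;
        last_end = s.end_s;
    }
    return out;
}
```

A real implementation would likely also splice partially overlapping segments rather than drop them outright, but the timestamp comparison above is the core idea.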
