Convert markdown H2 sections to individual audio files using multiple TTS (Text-to-Speech) providers including macOS say, Linux espeak-ng, and ElevenLabs API.
- Cross-Platform TTS Providers: macOS
say, Linuxespeak-ng, and ElevenLabs API - Automatic Platform Detection: Uses the best provider for your OS automatically
- Process files or directories recursively with structure mirroring
- Target duration control: Adjust timing with annotations like
(8s) - Multiple formats: AIFF, M4A, and MP3 output
- Voice caching: Fast lookups with SQLite WAL mode
- Developer-friendly: Debug mode, dry-run preview, progress indicators
- macOS (uses built-in
saycommand) - Go 1.25 or later (to build the tool)
-
Linux (Ubuntu, Debian, Fedora, Arch, etc.)
-
Go 1.25 or later (to build the tool)
-
espeak-ngorespeakinstalled:# Ubuntu/Debian sudo apt install espeak-ng ffmpeg # Fedora/RHEL sudo dnf install espeak-ng ffmpeg # Arch Linux sudo pacman -S espeak-ng ffmpeg
-
ffmpegfor audio format conversion (MP3, M4A support)
- Any OS (Windows, macOS, Linux)
- Go 1.25 or later (to build the tool)
- ElevenLabs API key (Get one here)
- Set
ELEVENLABS_API_KEYenvironment variable or create.envfile
go install github.com/indaco/md2audio/cmd/md2audio@latestgit clone https://github.com/indaco/md2audio.git
cd md2audio
go build -o md2audio ./cmd/md2audioThe binary will be created in the current directory. You can move it to a location in your PATH:
sudo mv md2audio /usr/local/bin/md2audio supports multiple Text-to-Speech providers. The best provider for your platform is selected automatically:
- Platform: macOS only
- Cost: Free (built-in)
- Setup: No configuration needed
- Quality: Good for local development and testing
- Formats: AIFF, M4A
- Voices: ~70 voices in various languages
- Platform: Linux only
- Cost: Free (open-source)
- Setup: Install
espeak-ngandffmpeg - Quality: Good for local development and testing
- Formats: WAV, MP3, M4A, AIFF (via ffmpeg)
- Voices: 50+ voices in various languages
- Voice Mapping: Automatically maps macOS voice names (e.g., "Kate" β en-gb)
- Platform: Cross-platform (works on any OS)
- Cost: Paid API (Pricing)
- Setup: Requires API key
- Quality: Premium, highly realistic voices
- Formats: MP3
- Voices: Multiple professional voices with emotional control
-
Get your API key from ElevenLabs
-
Set the environment variable:
export ELEVENLABS_API_KEY='your-api-key'
-
Or create a
.envfile in your project directory:# Copy the example file cp .env.example .env # Then edit .env and add your API key
Or create it directly:
echo 'ELEVENLABS_API_KEY=your-api-key' > .env
-
(Optional) Configure voice settings in
.env:# Voice quality settings (all optional, with sensible defaults) ELEVENLABS_STABILITY=0.5 # Voice consistency (0.0-1.0, default: 0.5) ELEVENLABS_SIMILARITY_BOOST=0.5 # Voice similarity (0.0-1.0, default: 0.5) ELEVENLABS_STYLE=0.0 # Voice style/emotion (0.0-1.0, default: 0.0) ELEVENLABS_USE_SPEAKER_BOOST=true # Boost similarity (true/false, default: true) ELEVENLABS_SPEED=1.0 # Default speed for non-timed sections (0.7-1.2, default: 1.0)
Note:
ELEVENLABS_SPEEDonly applies to sections WITHOUT timing annotations- Sections with
(5s)timing will calculate speed automatically - Higher stability = more consistent but less expressive
- Higher similarity_boost = closer to original voice characteristics
- Style adds emotional range (0 = disabled, higher = more expressive)
-
List available voices:
./md2audio -provider elevenlabs -list-voices
# Check version
./md2audio -version
# List available voices (automatically uses the best provider for your OS)
./md2audio -list-voices
# Process a single markdown file with voice preset
# Works on both macOS (say) and Linux (espeak) automatically!
./md2audio -f script.md -p british-female
# Process entire directory recursively
./md2audio -d ./docs -p british-female
# Use specific voice with slower rate for clarity
# On macOS: uses "Kate" voice directly
# On Linux: maps "Kate" to "en-gb" voice automatically
./md2audio -f script.md -v Kate -r 170
# Generate M4A files instead of default format
# macOS default: AIFF, Linux default: WAV
./md2audio -d ./content -p british-female -format m4a
# Custom output directory and prefix
./md2audio -f script.md -o ./voiceovers -prefix demo
# Preview what would be generated (dry-run mode)
./md2audio -f script.md -p british-female -dry-run
# Enable debug logging to troubleshoot issues
./md2audio -f script.md -p british-female -debug
# Combine dry-run with debug for detailed preview
./md2audio -d ./docs -p british-female -dry-run -debug
# Explicitly use espeak provider (on any Linux system)
./md2audio -f script.md -provider espeak -v en-gb
# Explicitly use say provider (on macOS)
./md2audio -f script.md -provider say -v Kate# List available ElevenLabs voices (cached for faster access)
./md2audio -provider elevenlabs -list-voices
# Refresh voice cache (when new voices are available)
./md2audio -provider elevenlabs -list-voices -refresh-cache
# Export voices to JSON for reference
./md2audio -provider elevenlabs -export-voices elevenlabs_voices.json
# Process a single file with ElevenLabs
./md2audio -provider elevenlabs \
-elevenlabs-voice-id 21m00Tcm4TlvDq8ikWAM \
-f script.md
# Process entire directory with ElevenLabs
./md2audio -provider elevenlabs \
-elevenlabs-voice-id 21m00Tcm4TlvDq8ikWAM \
-d ./docs \
-o ./audio_output
# Use specific ElevenLabs model
./md2audio -provider elevenlabs \
-elevenlabs-voice-id YOUR_VOICE_ID \
-elevenlabs-model eleven_multilingual_v2 \
-f script.mdEnable debug logging to troubleshoot issues or understand what's happening under the hood:
# Enable debug logging
./md2audio -f script.md -p british-female -debugDebug mode shows:
- Cache hits/misses for voice lookups
- API request details (ElevenLabs)
- File processing progress
- Internal operation details
When to use debug mode:
- Troubleshooting API issues with ElevenLabs
- Understanding cache behavior
- Investigating performance problems
- Reporting bugs with detailed logs
Preview what would be generated without creating any audio files:
# Dry-run mode - shows what would be generated
./md2audio -f script.md -p british-female -dry-run
# Combine with debug for maximum visibility
./md2audio -d ./docs -provider elevenlabs -elevenlabs-voice-id YOUR_ID -dry-run -debugDry-run mode shows:
- Which sections would be processed
- Output file paths that would be created
- Timing information for timed sections
- Preview of text content
When to use dry-run mode:
- Testing markdown format before generation
- Verifying output paths and filenames
- Checking section count and structure
- Planning batch processing jobs
Example output:
π‘ DRY-RUN MODE: No files will be created
βΉ Section 1/3:
- title: Introduction
π‘ Target duration: 8.0 seconds
π‘ Text: Welcome to this demonstration...
Would create: ./audio_sections/section_01_introduction.aiff
βΉ Section 2/3:
- title: Main Content
π‘ Text: Here is the main content...
Would create: ./audio_sections/section_02_main_content.aiff
β Would generate 3 audio files
To improve performance, md2audio caches voice lists from providers. This is especially useful for ElevenLabs to avoid repeated API calls:
# First call - fetches from API and caches (slower)
./md2audio -provider elevenlabs -list-voices
# Subsequent calls - uses cache (instant)
./md2audio -provider elevenlabs -list-voices
# Force refresh when new voices are available
./md2audio -provider elevenlabs -list-voices -refresh-cache
# Export cached voices to JSON file for reference
./md2audio -provider elevenlabs -export-voices elevenlabs_voices.json
./md2audio -provider say -export-voices say_voices.jsonCache Details:
- Location:
~/.md2audio/voice_cache.db(SQLite database) - Duration: 30 days (voices don't change frequently)
- Benefits: Instant voice listing, reduced API calls, offline access to voice list
- Refresh: Use
-refresh-cacheflag when you know new voices are available
| Flag | Description | Default |
|---|---|---|
-f |
Input markdown file (use -f or -d) |
- |
-d |
Input directory (recursive, use -f or -d) |
- |
-o |
Output directory | ./audio_sections |
-format |
Output format | aiff |
-prefix |
Filename prefix | section |
-list-voices |
List all available voices (uses cache if available) | - |
-refresh-cache |
Force refresh of voice cache | false |
-export-voices |
Export cached voices to JSON file | - |
-provider |
TTS provider (say, espeak, or elevenlabs) |
Auto-detect by platform |
-version |
Print version and exit | - |
-debug |
Enable debug logging | false |
-dry-run |
Show what would be generated without creating files | false |
These options work for both say (macOS) and espeak (Linux) providers:
| Flag | Description | Default |
|---|---|---|
-p |
Voice preset (see Voice Presets below) | Kate (if not set) |
-v |
Specific voice name (overrides -p) |
- |
-r |
Speaking rate (lower = slower) | 180 |
Note: Voice names are automatically mapped between platforms. For example, "Kate" uses the Kate voice on macOS and en-gb on Linux.
| Flag | Description | Default |
|---|---|---|
-elevenlabs-voice-id |
ElevenLabs voice ID (required) | - |
-elevenlabs-model |
ElevenLabs model ID | eleven_multilingual_v2 |
-elevenlabs-api-key |
ElevenLabs API key (prefer env var) | ELEVENLABS_API_KEY env |
These presets work on both macOS and Linux (automatically mapped):
| Preset | macOS Voice | Linux Voice |
|---|---|---|
british-female |
Kate | en-gb |
british-male |
Daniel | en-gb |
us-female |
Samantha | en-us |
us-male |
Alex | en-us |
australian-female |
Karen | en-au |
indian-female |
Veena | en-in |
Cross-Platform Usage:
# Same command works on both macOS and Linux!
./md2audio -f script.md -p british-female
# Or use specific voices (automatically mapped)
./md2audio -f script.md -v Kate # macOS: Kate, Linux: en-gbElevenLabs voice quality can be fine-tuned using environment variables. All settings are optional and have sensible defaults:
| Setting | Range | Default | Description |
|---|---|---|---|
ELEVENLABS_STABILITY |
0.0-1.0 | 0.5 | Voice consistency. Higher = more consistent but less expressive |
ELEVENLABS_SIMILARITY_BOOST |
0.0-1.0 | 0.5 | Voice similarity to original. Higher = closer to voice characteristics |
ELEVENLABS_STYLE |
0.0-1.0 | 0.0 | Emotional range. 0 = disabled, higher = more expressive |
ELEVENLABS_USE_SPEAKER_BOOST |
true/false | true | Boost similarity of synthesized speech |
ELEVENLABS_SPEED |
0.7-1.2 | 1.0 | Default speaking speed (only for sections without timing annotations) |
Speed Behavior:
- Sections with timing annotations like
## Scene 1 (5s)β Speed is calculated automatically to fit duration - Sections without timing annotations β Uses
ELEVENLABS_SPEEDsetting (default: 1.0)
Example .env configuration:
ELEVENLABS_API_KEY=your-api-key
ELEVENLABS_STABILITY=0.7 # More consistent voice
ELEVENLABS_SIMILARITY_BOOST=0.8 # Closer to original voice
ELEVENLABS_STYLE=0.3 # Slight emotional variation
ELEVENLABS_SPEED=1.1 # 10% faster for non-timed sectionsThe script expects H2 headers (##) to denote sections. You can optionally specify target duration for each section:
## Scene 1: Introduction (8s)
This is the content for scene 1. It will be converted to audio that lasts exactly 8 seconds.
## Scene 2: Main Demo (12s)
This is the content for scene 2. The speaking rate will be automatically adjusted to fit 12 seconds.
## Scene 3: Conclusion
This section has no timing specified, so it will use the default speaking rate (-r flag).(8s)- Target duration of 8 seconds(10.5s)- Target duration of 10.5 seconds(0-8s)- Range format, uses end time (8 seconds)(15 seconds)- Also works with "seconds" spelled out
How it works (macOS say provider only):
- The script counts the words in your text
- Calculates the required words-per-minute (WPM) to fit the target duration
- Automatically adjusts the speaking rate for that section
- Shows you the actual duration vs target after generation
Important Notes:
-
Timing is supported with both providers, but with different accuracy:
-
macOS say provider: Uses
-r(rate) parameter for speed control- Very wide range of speaking rates (90-360 wpm)
- Actual duration may differ from target (typically within 1-3 seconds)
- Applies 0.95 adjustment factor for better accuracy
-
ElevenLabs provider: Uses
speedparameter (NEW!)- Limited range: 0.7x (slower) to 1.2x (faster) of natural pace
- More accurate natural-sounding speech
- If target duration requires speed outside this range, audio will be clamped
- Example: 5s target β 5.75s actual (within 15% for typical content)
-
-
Timing accuracy tip: Test with your content and adjust timing annotations as needed. For very tight timing requirements, consider the say provider's wider speed range.
Process entire directory trees recursively with the -d flag:
./md2audio -d ./docs -p british-female -o ./audio_outputInput structure:
docs/
βββ intro.md
βββ chapter1/
β βββ part1.md
β βββ part2.md
βββ chapter2/
βββ overview.md
Output structure (mirrors input):
audio_output/
βββ intro/
β βββ section_01_welcome.aiff
β βββ section_02_overview.aiff
βββ chapter1/
β βββ part1/
β β βββ section_01_title.aiff
β β βββ section_02_title.aiff
β βββ part2/
β βββ section_01_title.aiff
βββ chapter2/
βββ overview/
βββ section_01_title.aiff
Key features:
- Processes all
.mdfiles recursively - Creates mirror directory structure
- Each markdown file gets its own subdirectory
- Preserves folder hierarchy from input
- Continues processing even if individual files fail
Example with examples folder:
# Process the included examples
./md2audio -d ./examples -p british-female -format m4a
# Results in organized audio files matching the examples structureFiles are named using the pattern:
{prefix}_{number}_{sanitized_title}.{format}
Example outputs:
section_01_scene_1_introduction.aiffsection_02_scene_2_main_demo.aiff
- Generate separate files per section (this is automatic)
- Add timing to your markdown headers to match your screen recording
- Import all audio files into your video editing software
- Place each audio clip on the timeline where needed
- The audio will match your specified durations automatically
- Be realistic: Very short durations with lots of text will sound rushed
- Test first: Generate one section to verify the pacing feels natural
- Adjust if needed: If timing is off, adjust the duration in your markdown and regenerate
- Word count matters: ~2-3 words per second is natural speech
- Override if needed: The
-rflag still works for sections without timing
Voice not found:
- Run
./md2audio -list-voicesto see available voices - Use the exact voice name with
-vflag
No sections found:
- Ensure your markdown uses
##for headers (H2) - Check there's content after each header
Audio quality:
- AIFF format is higher quality but larger
- M4A format is compressed and smaller
- Adjust rate with
-rflag for clarity
# 1. Check your markdown format
cat examples/demo_script.md
# 2. List available voices
./md2audio -list-voices
# 3. Generate audio files
./md2audio -f examples/demo_script.md -p british-female -r 175 -format m4a
# 4. Import the files from ./audio_sections into your video editor- The script automatically cleans markdown formatting (links, bold, italic)
- Empty sections are skipped
- Section titles are sanitized for safe filenames
- Speaking rate default is 180 (macOS default is 200)
Interested in contributing or understanding the codebase?
See the Contributing Guide for detailed information about:
- Project architecture and package organization
- Development tools and workflow
- Code quality standards
- Setting up your development environment
Contributions are welcome!
See the Contributing Guide for setup instructions.
This project is licensed under the MIT License β see the LICENSE file for details.