diff --git a/README.md b/README.md
index 4dcacf6..c8c084f 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,5 @@
-

- md2audio -

+# md2audio +

CI @@ -25,603 +24,281 @@

-Convert markdown H2 sections to individual audio files using multiple TTS (Text-to-Speech) providers including macOS `say`, Linux `espeak-ng`, and ElevenLabs API. - -## Features - -- **Cross-Platform TTS Providers**: macOS `say`, Linux `espeak-ng`, and ElevenLabs API -- **Automatic Platform Detection**: Uses the best provider for your OS automatically -- **Process files or directories** recursively with structure mirroring -- **Target duration control**: Adjust timing with annotations like `(8s)` -- **Multiple formats**: AIFF, M4A, and MP3 output -- **Voice caching**: Fast lookups with SQLite WAL mode -- **Developer-friendly**: Debug mode, dry-run preview, progress indicators - -## Prerequisites - -### For macOS say Provider (Default on macOS) - -- macOS (uses built-in `say` command) -- Go 1.25 or later (to build the tool) - -### For Linux espeak Provider (Default on Linux) - -- Linux (Ubuntu, Debian, Fedora, Arch, etc.) -- Go 1.25 or later (to build the tool) -- `espeak-ng` or `espeak` installed: - - ```bash - # Ubuntu/Debian - sudo apt install espeak-ng ffmpeg +

Convert Markdown H2 sections to individual audio files using multiple TTS providers.

- # Fedora/RHEL - sudo dnf install espeak-ng ffmpeg +> [!WARNING] +> This project is under active development. You may encounter bugs or incomplete features. Please report any issues on the [GitHub issue tracker](https://github.com/indaco/md2audio/issues). - # Arch Linux - sudo pacman -S espeak-ng ffmpeg - ``` - -- `ffmpeg` for audio format conversion (MP3, M4A support) - -### For ElevenLabs Provider (Works on all platforms) +## Features -- Any OS (Windows, macOS, Linux) -- Go 1.25 or later (to build the tool) -- ElevenLabs API key ([Get one here](https://elevenlabs.io/)) -- Set `ELEVENLABS_API_KEY` environment variable or create `.env` file +- **Multiple TTS Providers**: Choose from macOS say, Linux espeak, Google Cloud TTS, or ElevenLabs +- **Cross-Platform**: Works on macOS, Linux, and Windows (with cloud providers) +- **Automatic Platform Detection**: Uses the best provider for your OS by default +- **Timing Control**: Specify target durations with annotations like `(8s)` +- **Batch Processing**: Process files or entire directories recursively +- **Voice Caching**: Fast voice lookups with SQLite-based caching +- **Multiple Formats**: AIFF, M4A, MP3, WAV, OGG output formats +- **Developer Tools**: Debug mode, dry-run preview, progress indicators ## Installation -### Using go install +### 1. Global Install (via go install) ```bash go install github.com/indaco/md2audio/cmd/md2audio@latest ``` -### Building from source +### 2. Prebuilt Binaries + +Download the pre-compiled binaries from the [releases page](https://github.com/indaco/md2audio/releases) and move the binary to a folder in your system's PATH. + +### 3. Build from Source ```bash git clone https://github.com/indaco/md2audio.git cd md2audio -go build -o md2audio ./cmd/md2audio +go build -o md2audio ./cmd/md2audio # move the binary to a folder in your system's PATH ``` -The binary will be created in the current directory. 
You can move it to a location in your PATH: +or with [just](https://just.systems/man/en/) ```bash -sudo mv md2audio /usr/local/bin/ +just install ``` -## TTS Providers +## Basic Usage -md2audio supports multiple Text-to-Speech providers. The best provider for your platform is selected automatically: +```bash +# Process a markdown file (uses default provider for your OS) +./md2audio -f script.md -p british-female -### macOS say (Default on macOS) +# Process entire directory +./md2audio -d ./docs -p british-female -o ./audio -- **Platform**: macOS only -- **Cost**: Free (built-in) -- **Setup**: No configuration needed -- **Quality**: Good for local development and testing -- **Formats**: AIFF, M4A -- **Voices**: ~70 voices in various languages +# List available voices +./md2audio -list-voices +``` -### Linux espeak-ng (Default on Linux) +## TTS Providers -- **Platform**: Linux only -- **Cost**: Free (open-source) -- **Setup**: Install `espeak-ng` and `ffmpeg` -- **Quality**: Good for local development and testing -- **Formats**: WAV, MP3, M4A, AIFF (via ffmpeg) -- **Voices**: 50+ voices in various languages -- **Voice Mapping**: Automatically maps macOS voice names (e.g., "Kate" → en-gb) +md2audio supports multiple text-to-speech providers. 
Choose the one that best fits your needs: -### ElevenLabs +| Provider | Platform | Cost | Quality | Best For | +| ------------------------------------------------ | -------- | ---- | ------- | ------------------------------ | +| **[say](docs/providers/say.md)** | macOS | Free | Good | Local dev/testing | +| **[espeak](docs/providers/espeak.md)** | Linux | Free | Basic | Linux dev/testing | +| **[Google Cloud TTS](docs/providers/google.md)** | All | Paid | Premium | Enterprise, multi-language | +| **[ElevenLabs](docs/providers/elevenlabs.md)** | All | Paid | Premium | Production content, audiobooks | -- **Platform**: Cross-platform (works on any OS) -- **Cost**: Paid API ([Pricing](https://elevenlabs.io/pricing)) -- **Setup**: Requires API key -- **Quality**: Premium, highly realistic voices -- **Formats**: MP3 -- **Voices**: Multiple professional voices with emotional control +**[Compare Providers](docs/provider-comparison.md)** - Detailed comparison to help you choose -#### Setting up ElevenLabs +### Quick Provider Examples -1. Get your API key from [ElevenLabs](https://elevenlabs.io/) +```bash +# macOS say (default on macOS) +./md2audio -f script.md -p british-female -2. Set the environment variable: +# Linux espeak (default on Linux) +./md2audio -f script.md -provider espeak -v en-gb - ```bash - export ELEVENLABS_API_KEY='your-api-key' - ``` +# Google Cloud TTS +./md2audio -provider google -google-voice en-US-Neural2-F -f script.md -3. Or create a `.env` file in your project directory: +# ElevenLabs +./md2audio -provider elevenlabs -elevenlabs-voice-id VOICE_ID -f script.md +``` - ```bash - # Copy the example file - cp .env.example .env - # Then edit .env and add your API key - ``` +## Markdown Format - Or create it directly: +Use H2 headers (`##`) to denote sections. Add optional timing annotations: - ```bash - echo 'ELEVENLABS_API_KEY=your-api-key' > .env - ``` +```markdown +## Introduction (8s) -4. 
(Optional) Configure voice settings in `.env`: +This section will be adjusted to approximately 8 seconds. - ```bash - # Voice quality settings (all optional, with sensible defaults) - ELEVENLABS_STABILITY=0.5 # Voice consistency (0.0-1.0, default: 0.5) - ELEVENLABS_SIMILARITY_BOOST=0.5 # Voice similarity (0.0-1.0, default: 0.5) - ELEVENLABS_STYLE=0.0 # Voice style/emotion (0.0-1.0, default: 0.0) - ELEVENLABS_USE_SPEAKER_BOOST=true # Boost similarity (true/false, default: true) - ELEVENLABS_SPEED=1.0 # Default speed for non-timed sections (0.7-1.2, default: 1.0) - ``` +## Main Content (5-10s) - **Note:** - - `ELEVENLABS_SPEED` only applies to sections WITHOUT timing annotations - - Sections with `(5s)` timing will calculate speed automatically - - Higher stability = more consistent but less expressive - - Higher similarity_boost = closer to original voice characteristics - - Style adds emotional range (0 = disabled, higher = more expressive) +This targets 10 seconds (end time is used). -5. List available voices: +## Conclusion - ```bash - ./md2audio -provider elevenlabs -list-voices - ``` +No timing specified - uses default speaking rate. 
+``` -## Usage +**Supported timing formats**: `(8s)`, `(10.5s)`, `(0-8s)`, `(15 seconds)` -### Basic Examples +## Command Line Options -#### Using Default Provider (say on macOS, espeak on Linux) +### General Options -```bash -# Check version -./md2audio -version +| Flag | Description | Default | +| -------------- | ------------------------------------------------------ | ------------------------------ | +| `-f` | Input markdown file | - | +| `-d` | Input directory (recursive) | - | +| `-o` | Output directory | `./audio_sections` | +| `-provider` | TTS provider (`say`, `espeak`, `elevenlabs`, `google`) | Auto-detect | +| `-format` | Output format (`aiff`, `m4a`, `mp3`, `wav`, `ogg`) | `aiff` (macOS) / `wav` (Linux) | +| `-prefix` | Filename prefix | `section` | +| `-list-voices` | List available voices | - | +| `-version` | Print version | - | +| `-debug` | Enable debug logging | `false` | +| `-dry-run` | Preview without generating files | `false` | -# List available voices (automatically uses the best provider for your OS) -./md2audio -list-voices +### Provider-Specific Options -# Process a single markdown file with voice preset -# Works on both macOS (say) and Linux (espeak) automatically! -./md2audio -f script.md -p british-female +Each provider has its own configuration options. 
See the provider guides for details: -# Process entire directory recursively -./md2audio -d ./docs -p british-female +- **say/espeak**: `-p` (voice preset), `-v` (voice name), `-r` (speaking rate) +- **Google Cloud**: `-google-voice`, `-google-language`, `-google-credentials`, `-google-speed`, `-google-pitch`, `-google-volume` +- **ElevenLabs**: `-elevenlabs-voice-id`, `-elevenlabs-model`, `-elevenlabs-api-key` (voice settings via env vars) -# Use specific voice with slower rate for clarity -# On macOS: uses "Kate" voice directly -# On Linux: maps "Kate" to "en-gb" voice automatically -./md2audio -f script.md -v Kate -r 170 +**[Provider Documentation](docs/providers/)** for complete option lists -# Generate M4A files instead of default format -# macOS default: AIFF, Linux default: WAV -./md2audio -d ./content -p british-female -format m4a +## Examples -# Custom output directory and prefix -./md2audio -f script.md -o ./voiceovers -prefix demo +### Basic Examples -# Preview what would be generated (dry-run mode) +```bash +# Preview what would be generated ./md2audio -f script.md -p british-female -dry-run -# Enable debug logging to troubleshoot issues -./md2audio -f script.md -p british-female -debug - -# Combine dry-run with debug for detailed preview -./md2audio -d ./docs -p british-female -dry-run -debug +# Generate M4A files instead of AIFF +./md2audio -f script.md -p british-female -format m4a -# Explicitly use espeak provider (on any Linux system) -./md2audio -f script.md -provider espeak -v en-gb +# Process directory with custom output location +./md2audio -d ./content -p us-female -o ./voiceovers -# Explicitly use say provider (on macOS) -./md2audio -f script.md -provider say -v Kate +# Enable debug logging +./md2audio -f script.md -debug ``` -#### Using ElevenLabs Provider +### Provider-Specific Examples ```bash -# List available ElevenLabs voices (cached for faster access) -./md2audio -provider elevenlabs -list-voices - -# Refresh voice cache (when new 
voices are available) -./md2audio -provider elevenlabs -list-voices -refresh-cache - -# Export voices to JSON for reference -./md2audio -provider elevenlabs -export-voices elevenlabs_voices.json - -# Process a single file with ElevenLabs -./md2audio -provider elevenlabs \ - -elevenlabs-voice-id 21m00Tcm4TlvDq8ikWAM \ - -f script.md - -# Process entire directory with ElevenLabs +# Google Cloud TTS with Neural2 voice +export GOOGLE_APPLICATION_CREDENTIALS="/path/to/creds.json" +./md2audio -provider google \ + -google-voice en-US-Neural2-F \ + -format mp3 \ + -d ./docs + +# ElevenLabs with custom settings +export ELEVENLABS_API_KEY='your-key' ./md2audio -provider elevenlabs \ -elevenlabs-voice-id 21m00Tcm4TlvDq8ikWAM \ - -d ./docs \ - -o ./audio_output - -# Use specific ElevenLabs model -./md2audio -provider elevenlabs \ - -elevenlabs-voice-id YOUR_VOICE_ID \ -elevenlabs-model eleven_multilingual_v2 \ -f script.md -``` - -### Debug Mode - -Enable debug logging to troubleshoot issues or understand what's happening under the hood: - -```bash -# Enable debug logging -./md2audio -f script.md -p british-female -debug -``` - -**Debug mode shows:** - -- Cache hits/misses for voice lookups -- API request details (ElevenLabs) -- File processing progress -- Internal operation details - -**When to use debug mode:** - -- Troubleshooting API issues with ElevenLabs -- Understanding cache behavior -- Investigating performance problems -- Reporting bugs with detailed logs - -### Dry-Run Mode - -Preview what would be generated without creating any audio files: - -```bash -# Dry-run mode - shows what would be generated -./md2audio -f script.md -p british-female -dry-run - -# Combine with debug for maximum visibility -./md2audio -d ./docs -provider elevenlabs -elevenlabs-voice-id YOUR_ID -dry-run -debug -``` - -**Dry-run mode shows:** - -- Which sections would be processed -- Output file paths that would be created -- Timing information for timed sections -- Preview of text content - 
-**When to use dry-run mode:** - -- Testing markdown format before generation -- Verifying output paths and filenames -- Checking section count and structure -- Planning batch processing jobs - -**Example output:** - -``` -💡 DRY-RUN MODE: No files will be created -ℹ Section 1/3: - - title: Introduction - 💡 Target duration: 8.0 seconds - 💡 Text: Welcome to this demonstration... - Would create: ./audio_sections/section_01_introduction.aiff - -ℹ Section 2/3: - - title: Main Content - 💡 Text: Here is the main content... - Would create: ./audio_sections/section_02_main_content.aiff - -✔ Would generate 3 audio files +# espeak on Linux with MP3 output +./md2audio -provider espeak \ + -v en-gb \ + -format mp3 \ + -d ./docs ``` -### Voice Caching +## Voice Caching -To improve performance, md2audio caches voice lists from providers. This is especially useful for ElevenLabs to avoid repeated API calls: +md2audio caches voice lists locally for faster access: ```bash -# First call - fetches from API and caches (slower) -./md2audio -provider elevenlabs -list-voices +# First call - fetches from provider and caches +./md2audio -provider google -list-voices # Subsequent calls - uses cache (instant) -./md2audio -provider elevenlabs -list-voices +./md2audio -provider google -list-voices -# Force refresh when new voices are available -./md2audio -provider elevenlabs -list-voices -refresh-cache +# Force refresh when new voices available +./md2audio -provider google -list-voices -refresh-cache -# Export cached voices to JSON file for reference -./md2audio -provider elevenlabs -export-voices elevenlabs_voices.json -./md2audio -provider say -export-voices say_voices.json +# Export to JSON for reference +./md2audio -provider google -export-voices voices.json ``` -**Cache Details:** - -- **Location**: `~/.md2audio/voice_cache.db` (SQLite database) -- **Duration**: 30 days (voices don't change frequently) -- **Benefits**: Instant voice listing, reduced API calls, offline access to voice list 
-- **Refresh**: Use `-refresh-cache` flag when you know new voices are available - -### Command Line Options - -#### General Options - -| Flag | Description | Default | -| ---------------- | --------------------------------------------------- | ----------------------- | -| `-f` | Input markdown file (use `-f` or `-d`) | - | -| `-d` | Input directory (recursive, use `-f` or `-d`) | - | -| `-o` | Output directory | `./audio_sections` | -| `-format` | Output format | `aiff` | -| `-prefix` | Filename prefix | `section` | -| `-list-voices` | List all available voices (uses cache if available) | - | -| `-refresh-cache` | Force refresh of voice cache | `false` | -| `-export-voices` | Export cached voices to JSON file | - | -| `-provider` | TTS provider (`say`, `espeak`, or `elevenlabs`) | Auto-detect by platform | -| `-version` | Print version and exit | - | -| `-debug` | Enable debug logging | `false` | -| `-dry-run` | Show what would be generated without creating files | `false` | - -#### say/espeak Provider Options - -These options work for both `say` (macOS) and `espeak` (Linux) providers: - -| Flag | Description | Default | -| ---- | -------------------------------------- | ------------------- | -| `-p` | Voice preset (see Voice Presets below) | `Kate` (if not set) | -| `-v` | Specific voice name (overrides `-p`) | - | -| `-r` | Speaking rate (lower = slower) | `180` | - -**Note:** Voice names are automatically mapped between platforms. For example, "Kate" uses the Kate voice on macOS and en-gb on Linux. 
+- **Cache Location**: `~/.md2audio/voice_cache.db` +- **Cache Duration**: 30 days +- **Supported Providers**: All providers -#### ElevenLabs Provider Options +## Output Structure -| Flag | Description | Default | -| ---------------------- | ----------------------------------- | ------------------------ | -| `-elevenlabs-voice-id` | ElevenLabs voice ID (required) | - | -| `-elevenlabs-model` | ElevenLabs model ID | `eleven_multilingual_v2` | -| `-elevenlabs-api-key` | ElevenLabs API key (prefer env var) | `ELEVENLABS_API_KEY` env | - -### Voice Presets - -These presets work on both macOS and Linux (automatically mapped): - -| Preset | macOS Voice | Linux Voice | -| ------------------- | ----------- | ----------- | -| `british-female` | Kate | en-gb | -| `british-male` | Daniel | en-gb | -| `us-female` | Samantha | en-us | -| `us-male` | Alex | en-us | -| `australian-female` | Karen | en-au | -| `indian-female` | Veena | en-in | - -**Cross-Platform Usage:** +### Single File ```bash -# Same command works on both macOS and Linux! ./md2audio -f script.md -p british-female - -# Or use specific voices (automatically mapped) -./md2audio -f script.md -v Kate # macOS: Kate, Linux: en-gb ``` -### ElevenLabs Voice Settings - -ElevenLabs voice quality can be fine-tuned using environment variables. All settings are optional and have sensible defaults: - -| Setting | Range | Default | Description | -| ------------------------------ | ---------- | ------- | ---------------------------------------------------------------------- | -| `ELEVENLABS_STABILITY` | 0.0-1.0 | 0.5 | Voice consistency. Higher = more consistent but less expressive | -| `ELEVENLABS_SIMILARITY_BOOST` | 0.0-1.0 | 0.5 | Voice similarity to original. Higher = closer to voice characteristics | -| `ELEVENLABS_STYLE` | 0.0-1.0 | 0.0 | Emotional range. 
0 = disabled, higher = more expressive | -| `ELEVENLABS_USE_SPEAKER_BOOST` | true/false | true | Boost similarity of synthesized speech | -| `ELEVENLABS_SPEED` | 0.7-1.2 | 1.0 | Default speaking speed (only for sections without timing annotations) | - -**Speed Behavior:** - -- Sections **with** timing annotations like `## Scene 1 (5s)` → Speed is calculated automatically to fit duration -- Sections **without** timing annotations → Uses `ELEVENLABS_SPEED` setting (default: 1.0) +Output: -**Example `.env` configuration:** - -```bash -ELEVENLABS_API_KEY=your-api-key -ELEVENLABS_STABILITY=0.7 # More consistent voice -ELEVENLABS_SIMILARITY_BOOST=0.8 # Closer to original voice -ELEVENLABS_STYLE=0.3 # Slight emotional variation -ELEVENLABS_SPEED=1.1 # 10% faster for non-timed sections ``` - -## Markdown Format - -The script expects H2 headers (`##`) to denote sections. You can optionally specify target duration for each section: - -```markdown -## Scene 1: Introduction (8s) - -This is the content for scene 1. It will be converted to audio that lasts exactly 8 seconds. - -## Scene 2: Main Demo (12s) - -This is the content for scene 2. The speaking rate will be automatically adjusted to fit 12 seconds. - -## Scene 3: Conclusion - -This section has no timing specified, so it will use the default speaking rate (-r flag). 
+audio_sections/ +├── section_01_introduction.aiff +├── section_02_main_content.aiff +└── section_03_conclusion.aiff ``` -### Timing Formats Supported - -- `(8s)` - Target duration of 8 seconds -- `(10.5s)` - Target duration of 10.5 seconds -- `(0-8s)` - Range format, uses end time (8 seconds) -- `(15 seconds)` - Also works with "seconds" spelled out - -**How it works (macOS say provider only):** - -- The script counts the words in your text -- Calculates the required words-per-minute (WPM) to fit the target duration -- Automatically adjusts the speaking rate for that section -- Shows you the actual duration vs target after generation - -**Important Notes:** - -- **Timing is supported with both providers**, but with different accuracy: - - **macOS say provider**: Uses `-r` (rate) parameter for speed control - - Very wide range of speaking rates (90-360 wpm) - - Actual duration may differ from target (typically within 1-3 seconds) - - Applies 0.95 adjustment factor for better accuracy - - - **ElevenLabs provider**: Uses `speed` parameter (NEW!) - - Limited range: 0.7x (slower) to 1.2x (faster) of natural pace - - More accurate natural-sounding speech - - If target duration requires speed outside this range, audio will be clamped - - Example: 5s target → 5.75s actual (within 15% for typical content) - -- **Timing accuracy tip**: Test with your content and adjust timing annotations as needed. For very tight timing requirements, consider the say provider's wider speed range. 
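The say-provider timing steps described above (count the words, derive the words-per-minute rate needed for the target duration, keep it within the 90-360 WPM range) can be sketched roughly as follows. This is an illustrative Python sketch, not md2audio's actual Go implementation: the function names are hypothetical, words are counted by simple whitespace splitting, and the way the 0.95 adjustment factor is applied (here, by shrinking the effective target duration) is an assumption. The second helper mirrors the ElevenLabs-style clamp to the 0.7x-1.2x speed range.

```python
from typing import Tuple


def required_wpm(text: str, target_seconds: float,
                 min_wpm: int = 90, max_wpm: int = 360,
                 adjustment: float = 0.95) -> int:
    """Estimate a speaking rate (WPM) that fits `text` into `target_seconds`.

    Hypothetical helper: the real tool may count words and apply the
    adjustment factor differently.
    """
    if target_seconds <= 0:
        raise ValueError("target_seconds must be positive")
    words = len(text.split())  # assumption: whitespace-delimited word count
    # Assumption: the 0.95 factor shortens the effective duration slightly,
    # which nudges the rate up to leave headroom for pauses.
    effective_seconds = target_seconds * adjustment
    wpm = words / effective_seconds * 60
    # Keep the rate inside the range the README quotes for `say`.
    return int(round(min(max(wpm, min_wpm), max_wpm)))


def clamp_speed(required: float, low: float = 0.7,
                high: float = 1.2) -> Tuple[float, bool]:
    """Clamp a speed multiplier to [low, high]; also report whether it was clamped."""
    clamped = min(max(required, low), high)
    return clamped, clamped != required


if __name__ == "__main__":
    print(required_wpm("word " * 20, 8))  # 20 words in an 8 second target
    print(clamp_speed(6.9 / 5.0))         # hypothetical 6.9 s clip, 5 s target
```

As a usage illustration, assuming a hypothetical clip that would take 6.9 seconds at natural pace, a 5-second target asks for 1.38x; that is clamped to 1.2x, so the clip plays for roughly 5.75 seconds, in line with the example quoted above.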
- -## Directory Processing - -Process entire directory trees recursively with the `-d` flag: +### Directory Processing ```bash -./md2audio -d ./docs -p british-female -o ./audio_output -``` - -**Input structure:** - -``` -docs/ -├── intro.md -├── chapter1/ -│ ├── part1.md -│ └── part2.md -└── chapter2/ - └── overview.md +./md2audio -d ./docs -p british-female ``` -**Output structure (mirrors input):** +Input structure is mirrored in output: ``` -audio_output/ -├── intro/ -│ ├── section_01_welcome.aiff -│ └── section_02_overview.aiff -├── chapter1/ -│ ├── part1/ -│ │ ├── section_01_title.aiff -│ │ └── section_02_title.aiff -│ └── part2/ -│ └── section_01_title.aiff -└── chapter2/ - └── overview/ - └── section_01_title.aiff +docs/ audio_sections/ +├── intro.md → ├── intro/ +├── chapter1/ → ├── chapter1/ +│ ├── part1.md → │ ├── part1/ +│ └── part2.md → │ └── part2/ +└── chapter2/ → └── chapter2/ + └── overview.md → └── overview/ ``` -**Key features:** - -- Processes all `.md` files recursively -- Creates mirror directory structure -- Each markdown file gets its own subdirectory -- Preserves folder hierarchy from input -- Continues processing even if individual files fail +## Troubleshooting -**Example with examples folder:** +### Voice Not Found ```bash -# Process the included examples -./md2audio -d ./examples -p british-female -format m4a - -# Results in organized audio files matching the examples structure -``` - -## Output - -Files are named using the pattern: +# List all available voices for your provider +./md2audio -list-voices +# Use exact voice name from the list +./md2audio -f script.md -v "Samantha" ``` -{prefix}_{number}_{sanitized_title}.{format} -``` - -Example outputs: - -- `section_01_scene_1_introduction.aiff` -- `section_02_scene_2_main_demo.aiff` - -## Tips for Video Editing - -1. Generate separate files per section (this is automatic) -2. Add timing to your markdown headers to match your screen recording -3. 
Import all audio files into your video editing software -4. Place each audio clip on the timeline where needed -5. The audio will match your specified durations automatically -### Timing Tips +### Provider Setup Issues -- **Be realistic**: Very short durations with lots of text will sound rushed -- **Test first**: Generate one section to verify the pacing feels natural -- **Adjust if needed**: If timing is off, adjust the duration in your markdown and regenerate -- **Word count matters**: ~2-3 words per second is natural speech -- **Override if needed**: The `-r` flag still works for sections without timing +See the provider-specific guide for detailed setup instructions: -## Troubleshooting - -**Voice not found:** - -- Run `./md2audio -list-voices` to see available voices -- Use the exact voice name with `-v` flag - -**No sections found:** - -- Ensure your markdown uses `##` for headers (H2) -- Check there's content after each header +- [say Setup](docs/providers/say.md#setup) - No setup needed +- [espeak Setup](docs/providers/espeak.md#installation) - Install espeak-ng +- [Google Cloud Setup](docs/providers/google.md#setup) - GCP credentials +- [ElevenLabs Setup](docs/providers/elevenlabs.md#setup) - API key -**Audio quality:** - -- AIFF format is higher quality but larger -- M4A format is compressed and smaller -- Adjust rate with `-r` flag for clarity +### Debug Mode -## Example Workflow +Enable debug logging to troubleshoot issues: ```bash -# 1. Check your markdown format -cat examples/demo_script.md - -# 2. List available voices -./md2audio -list-voices - -# 3. Generate audio files -./md2audio -f examples/demo_script.md -p british-female -r 175 -format m4a - -# 4. 
Import the files from ./audio_sections into your video editor +./md2audio -f script.md -p british-female -debug ``` -## Notes - -- The script automatically cleans markdown formatting (links, bold, italic) -- Empty sections are skipped -- Section titles are sanitized for safe filenames -- Speaking rate default is 180 (macOS default is 200) - -## For Developers - -Interested in contributing or understanding the codebase? +Shows: -See the [Contributing Guide](CONTRIBUTING.md) for detailed information about: - -- Project architecture and package organization -- Development tools and workflow -- Code quality standards -- Setting up your development environment +- Cache hits/misses +- API request details +- File processing progress +- Internal operations ## Contributing -Contributions are welcome! +Contributions are welcome! See the [Contributing Guide](CONTRIBUTING.md) for: -See the [Contributing Guide](/CONTRIBUTING.md) for setup instructions. +- Project architecture +- Development setup +- Code quality standards +- Provider implementation guide ## License -This project is licensed under the MIT License – see the [LICENSE](./LICENSE) file for details. +MIT License - see [LICENSE](LICENSE) for details diff --git a/docs/provider-comparison.md b/docs/provider-comparison.md new file mode 100644 index 0000000..6195599 --- /dev/null +++ b/docs/provider-comparison.md @@ -0,0 +1,111 @@ +# TTS Provider Comparison + +This guide helps you choose the best text-to-speech provider for your needs. 
+
+## Quick Comparison Table
+
+| Feature         | [say](providers/say.md) | [espeak](providers/espeak.md) | [ElevenLabs](providers/elevenlabs.md) | [Google Cloud](providers/google.md) |
+| --------------- | ----------------------- | ----------------------------- | ------------------------------------- | ----------------------------------- |
+| **Platform**    | macOS only              | Linux only                    | All platforms                         | All platforms                       |
+| **Cost**        | Free                    | Free                          | Paid ($5-99/mo)                       | Paid ($4-16/M chars)                |
+| **Quality**     | Good                    | Basic                         | Premium                               | Premium                             |
+| **Voices**      | ~70 voices              | ~50 voices                    | 20+ premium                           | 400+ voices                         |
+| **Languages**   | 30+                     | 50+                           | 30+                                   | 50+                                 |
+| **Offline**     | Yes                     | Yes                           | ❌ No                                 | ❌ No                               |
+| **Speed Range** | 90-360 WPM              | Variable                      | 0.7x-1.2x                             | 0.25x-4.0x                          |
+| **Formats**     | AIFF, M4A               | WAV, MP3, M4A, AIFF           | MP3                                   | MP3, WAV, OGG                       |
+| **Setup**       | None                    | Install espeak                | API key                               | GCP credentials                     |
+| **Best For**    | macOS dev/test          | Linux dev/test                | Premium quality                       | Enterprise/Scale                    |
+
+## Detailed Comparison
+
+### Voice Quality
+
+#### say (macOS)
+
+- **Pros**: Natural-sounding, good for local testing
+- **Cons**: Not neural TTS, somewhat robotic
+- **Use Case**: Development, testing, local projects
+
+#### espeak (Linux)
+
+- **Pros**: Lightweight, fast, open source
+- **Cons**: Robotic voice, limited expressiveness
+- **Use Case**: Development, testing, scripting
+
+#### ElevenLabs
+
+- **Pros**: Highly realistic, emotional control, voice cloning
+- **Cons**: Requires paid subscription, limited speed range
+- **Use Case**: Production content, audiobooks, podcasts
+
+#### Google Cloud TTS
+
+- **Pros**: Neural2/WaveNet voices, massive voice library, enterprise SLA
+- **Cons**: Requires GCP setup, costs scale with usage
+- **Use Case**: Enterprise, multi-language, high-volume
+
+### Speed Control Comparison
+
+| Provider     | Speed Range | Timing Accuracy | Notes                           |
+| ------------ | ----------- | --------------- | ------------------------------- |
+| say          | 90-360 WPM  | ±1-3 seconds    | Wide range, good flexibility    |
+| espeak       | Variable    | ±2-4 seconds    | Adjusts rate parameter          |
+| ElevenLabs   | 0.7x-1.2x   | ±15%            | Limited range, natural quality  |
+| Google Cloud | 0.25x-4.0x  | ±10%            | **Widest range**, best accuracy |
+
+**For precise timing control**: Google Cloud TTS (0.25x-4.0x range)
+**For natural quality**: ElevenLabs (limited but realistic)
+**For flexibility**: say (wide WPM range)
+
+### Voice Selection
+
+#### say (macOS)
+
+- ~70 voices across 30+ languages
+- Organized by language/region
+- Good variety, standard quality
+- List with: `./md2audio -list-voices`
+
+#### espeak (Linux)
+
+- ~50 voices across 50+ languages
+- Simple language codes (en-us, en-gb, etc.)
+- Open source voice synthesis
+- List with: `./md2audio -provider espeak -list-voices`
+
+#### ElevenLabs
+
+- 20+ professional voices
+- Highly distinctive personalities
+- Voice cloning available (paid tiers)
+- Emotional range control
+- List with: `./md2audio -provider elevenlabs -list-voices`
+
+#### Google Cloud TTS
+
+- **400+ voices** across 50+ languages
+- Multiple quality tiers:
+  - Standard (basic quality)
+  - WaveNet (high quality)
+  - Neural2 (best quality)
+  - Studio (premium, highest fidelity)
+  - Polyglot (multi-language)
+- List with: `./md2audio -provider google -list-voices`
+
+### Output Formats
+
+| Provider     | Formats             | Notes                          |
+| ------------ | ------------------- | ------------------------------ |
+| say          | AIFF, M4A           | AIFF default, converts to M4A  |
+| espeak       | WAV, MP3, M4A, AIFF | WAV default, ffmpeg for others |
+| ElevenLabs   | MP3                 | MP3 only from API              |
+| Google Cloud | MP3, WAV, OGG       | Multiple formats supported     |
+
+## Next Steps
+
+- [say Provider Guide](providers/say.md)
+- [espeak Provider Guide](providers/espeak.md)
+- [ElevenLabs Provider Guide](providers/elevenlabs.md)
+- [Google Cloud TTS Provider Guide](providers/google.md)
+- [Timing Control Guide](timing-guide.md)
diff --git a/docs/providers/elevenlabs.md b/docs/providers/elevenlabs.md
new file mode 100644
index 0000000..6b15f2a
--- /dev/null
+++ b/docs/providers/elevenlabs.md
@@ -0,0 +1,296 @@
+# ElevenLabs Provider
+
+The ElevenLabs provider uses the premium ElevenLabs AI text-to-speech API for highly realistic voice synthesis.
+
+## Platform
+
+- **Cross-platform** (Works on macOS, Linux, and Windows)
+- Cloud-based API service
+
+## Features
+
+- Premium, highly realistic voices
+- Emotional voice control
+- Voice cloning capabilities
+- Professional-grade quality
+- Multilingual support
+- Fine-grained voice settings
+- Speed control (0.7x - 1.2x)
+- Timing control support
+
+## Prerequisites
+
+- Any operating system (macOS, Linux, Windows)
+- ElevenLabs API key ([Sign up here](https://elevenlabs.io/))
+- Internet connection (cloud-based service)
+
+## Setup
+
+### 1. Get API Key
+
+1. Sign up at [ElevenLabs](https://elevenlabs.io/)
+2. Navigate to your profile settings
+3. Copy your API key
+
+### 2. Configure API Key
+
+**Option A: Environment Variable**
+
+```bash
+export ELEVENLABS_API_KEY='your-api-key-here'
+```
+
+**Option B: .env File**
+
+```bash
+# Create .env file in your project directory
+echo 'ELEVENLABS_API_KEY=your-api-key-here' > .env
+```
+
+**Option C: Command Line Flag**
+
+```bash
+./md2audio -provider elevenlabs \
+  -elevenlabs-api-key your-api-key-here \
+  -elevenlabs-voice-id 21m00Tcm4TlvDq8ikWAM \
+  -f script.md
+```
+
+### 3. (Optional) Configure Voice Settings
+
+Fine-tune voice quality in `.env`:
+
+```bash
+ELEVENLABS_API_KEY=your-api-key-here
+ELEVENLABS_STABILITY=0.5           # 0.0-1.0 (default: 0.5)
+ELEVENLABS_SIMILARITY_BOOST=0.5    # 0.0-1.0 (default: 0.5)
+ELEVENLABS_STYLE=0.0               # 0.0-1.0 (default: 0.0)
+ELEVENLABS_USE_SPEAKER_BOOST=true  # true/false (default: true)
+ELEVENLABS_SPEED=1.0               # 0.7-1.2 (default: 1.0)
+```
+
+## Usage
+
+### List Available Voices
+
+```bash
+# List all ElevenLabs voices (cached)
+./md2audio -provider elevenlabs -list-voices
+
+# Refresh voice cache
+./md2audio -provider elevenlabs -list-voices -refresh-cache
+
+# Export voices to JSON
+./md2audio -provider elevenlabs -export-voices voices.json
+```
+
+### Basic Generation
+
+```bash
+# Generate audio from markdown
+./md2audio -provider elevenlabs \
+  -elevenlabs-voice-id 21m00Tcm4TlvDq8ikWAM \
+  -f script.md
+
+# Process entire directory
+./md2audio -provider elevenlabs \
+  -elevenlabs-voice-id 21m00Tcm4TlvDq8ikWAM \
+  -d ./docs \
+  -o ./audio_output
+```
+
+### Using Specific Models
+
+```bash
+# Use multilingual model (default)
+./md2audio -provider elevenlabs \
+  -elevenlabs-voice-id YOUR_VOICE_ID \
+  -elevenlabs-model eleven_multilingual_v2 \
+  -f script.md
+
+# Use English-only model (lower latency)
+./md2audio -provider elevenlabs \
+  -elevenlabs-voice-id YOUR_VOICE_ID \
+  -elevenlabs-model eleven_monolingual_v1 \
+  -f script.md
+```
+
+## Voice Settings
+
+### Stability (0.0-1.0)
+
+Controls voice consistency:
+
+- **Low (0.0-0.3)**: More expressive, variable
+- **Medium (0.4-0.6)**: Balanced (default: 0.5)
+- **High (0.7-1.0)**: Very consistent, less expressive
+
+```bash
+ELEVENLABS_STABILITY=0.7  # More consistent voice
+```
+
+### Similarity Boost (0.0-1.0)
+
+Controls how closely the voice matches the original:
+
+- **Low (0.0-0.3)**: More creative interpretation
+- **Medium (0.4-0.6)**: Balanced (default: 0.5)
+- **High (0.7-1.0)**: Closer to original voice characteristics
+
+```bash
+ELEVENLABS_SIMILARITY_BOOST=0.8 # Closer to original voice +``` + +### Style (0.0-1.0) + +Controls emotional expression: + +- **0.0**: No style/emotion (default, most stable) +- **0.1-0.5**: Subtle emotional variation +- **0.6-1.0**: High emotional expression + +```bash +ELEVENLABS_STYLE=0.3 # Slight emotional variation +``` + +### Speaker Boost (true/false) + +Enhances voice similarity: + +- **true**: Better voice matching (default) +- **false**: Standard voice synthesis + +```bash +ELEVENLABS_USE_SPEAKER_BOOST=true +``` + +### Speed (0.7-1.2) + +Default speaking speed for non-timed sections: + +- **0.7**: 30% slower +- **1.0**: Natural speed (default) +- **1.2**: 20% faster + +```bash +ELEVENLABS_SPEED=1.1 # 10% faster for non-timed sections +``` + +**Note**: Sections with timing annotations (e.g., `## Intro (5s)`) automatically calculate speed to fit the duration. + +## Timing Control + +ElevenLabs supports timing annotations with automatic speed adjustment: + +```markdown +## Introduction (8s) +This section will be adjusted to fit 8 seconds using speed control. + +## Main Demo (5-10s) +Targets 10 seconds (end time is used). + +## Conclusion +No timing specified - uses ELEVENLABS_SPEED setting (default: 1.0). 
+``` + +**Speed Range**: 0.7x - 1.2x + +- **Accuracy**: Typically within 15% of target duration +- **Quality**: Natural-sounding speech maintained +- **Limitation**: If target requires speed outside range, it will be clamped with a warning + +## Output Format + +- **MP3 only** - ElevenLabs API returns MP3 audio +- High quality compression +- Suitable for all platforms + +## Common Voice IDs + +Popular ElevenLabs voices: + +| Voice ID (2024) | Name | Description | +|---------------------------|-----------|----------------------| +| 21m00Tcm4TlvDq8ikWAM | Rachel | Calm, professional | +| AZnzlk1XvdvUeBnXmlld | Domi | Strong, confident | +| EXAVITQu4vr4xnSDxMaL | Bella | Soft, friendly | +| ErXwobaYiN019PkySvjV | Antoni | Well-rounded, male | +| MF3mGyEYCl7XYWbV9V6O | Elli | Emotional, young | +| TxGEqnHWrfWFTfGW9XjX | Josh | Deep, professional | +| VR6AewLTigWG4xSOukaG | Arnold | Mature, authoritative| +| pNInz6obpgDQGcFmaJgB | Adam | Deep, narrative | +| yoZ06aMxZJJ28mfd3POQ | Sam | Dynamic, energetic | + +Run `-list-voices` to see your account's available voices. + +## Pricing + +See [ElevenLabs Pricing](https://elevenlabs.io/pricing) for current details. + +## Tips + +1. **Start Simple**: Begin with default settings, then fine-tune +2. **Test First**: Generate one section to verify voice and settings +3. **Cache Voices**: First `-list-voices` call caches for 30 days +4. **Timing**: For tight timing needs, test and adjust markdown annotations +5. **Quality vs. 
Cost**: Higher quality settings may use more characters + +## Troubleshooting + +### API Key Errors + +```bash +# Verify API key is set +echo "$ELEVENLABS_API_KEY" + +# Or check .env file +grep ELEVENLABS_API_KEY .env +``` + +### Voice Not Found + +```bash +# List your available voices +./md2audio -provider elevenlabs -list-voices + +# Copy the exact voice ID from the list +``` + +### Timing Issues + +```bash +# Check calculated speed in output +# If speed is clamped (0.7 or 1.2), adjust target duration + +# Example: If "Warning: Required speed (1.5) exceeds maximum" +# Increase target duration: (5s) → (7s) +``` + +## Performance + +- **Latency**: Cloud-based, requires internet +- **Quality**: Premium, highly realistic +- **Rate Limits**: Depends on plan +- **Caching**: Voice list cached locally for 30 days +- **Retry Logic**: Automatic retry on transient failures + +## Limitations + +- MP3 format only (no WAV/AIFF) +- Requires internet connection +- API rate limits apply +- Speed range limited to 0.7x - 1.2x +- Costs scale with usage + +## Best Practices + +1. **Use .env**: Keep API keys out of scripts +2. **Cache Voices**: Run `-list-voices` once, then use the cached list +3. **Batch Processing**: Process multiple files in one run +4. **Monitor Usage**: Check the ElevenLabs dashboard regularly +5. **Test Settings**: Find optimal stability/similarity for your use case + +## Next Steps + +- Try [Google Cloud TTS](google.md) for an even wider speed range +- Check [Provider Comparison](../provider-comparison.md) for a detailed comparison diff --git a/docs/providers/espeak.md b/docs/providers/espeak.md new file mode 100644 index 0000000..afe785b --- /dev/null +++ b/docs/providers/espeak.md @@ -0,0 +1,241 @@ +# Linux espeak Provider + +The `espeak` provider uses the open-source eSpeak NG text-to-speech synthesizer for Linux systems. + +## Platform + +- **Linux only** (Ubuntu, Debian, Fedora, Arch, etc.) 
+- Open source and free + +## Features + +- Free and open source +- 50+ voices in various languages +- Multiple output formats (WAV, MP3, M4A, AIFF via ffmpeg) +- Timing control support +- Offline operation +- Cross-platform voice mapping (macOS voice names work automatically) + +## Prerequisites + +- Linux operating system +- `espeak-ng` or `espeak` installed +- `ffmpeg` for format conversion (MP3, M4A, AIFF) + +## Installation + +### Ubuntu/Debian + +```bash +sudo apt install espeak-ng ffmpeg +``` + +### Fedora/RHEL + +```bash +sudo dnf install espeak-ng ffmpeg +``` + +### Arch Linux + +```bash +sudo pacman -S espeak-ng ffmpeg +``` + +## Setup + +After installation, verify espeak is available: + +```bash +# Check espeak-ng is installed +which espeak-ng + +# Or check for espeak (older version) +which espeak + +# Test voice +espeak-ng "Hello, this is a test" +``` + +## Usage + +### Basic Usage + +```bash +# List available voices +./md2audio -provider espeak -list-voices + +# Generate audio with default voice +./md2audio -f script.md + +# Use voice preset (automatically mapped) +./md2audio -f script.md -p british-female + +# Use specific voice +./md2audio -f script.md -v en-gb +``` + +### Voice Presets (Cross-Platform) + +These presets work the same on Linux and macOS: + +| Preset | macOS Voice | Linux Voice (espeak) | +|---------------------|-------------|----------------------| +| `british-female` | Kate | en-gb | +| `british-male` | Daniel | en-gb | +| `us-female` | Samantha | en-us | +| `us-male` | Alex | en-us | +| `australian-female` | Karen | en-au | +| `indian-female` | Veena | en-in | + +**Cross-platform example:** + +```bash +# Same command works on both macOS and Linux! 
+./md2audio -f script.md -p british-female + +# macOS voice names are automatically mapped +./md2audio -f script.md -v Kate # Becomes en-gb on Linux +``` + +### Advanced Options + +```bash +# Adjust speaking rate (lower = slower) +./md2audio -f script.md -v en-gb -r 170 + +# Generate MP3 instead of WAV +./md2audio -f script.md -v en-gb -format mp3 + +# Generate M4A +./md2audio -f script.md -v en-gb -format m4a + +# Process entire directory +./md2audio -d ./docs -p british-female -o ./audio +``` + +## Output Formats + +- **WAV** (default) - Uncompressed, high quality +- **MP3** - Compressed, good quality (requires ffmpeg) +- **M4A** - Compressed, compatible with Apple devices (requires ffmpeg) +- **AIFF** - Uncompressed, Apple format (requires ffmpeg) + +## Timing Control + +The espeak provider supports timing annotations in H2 headers: + +```markdown +## Introduction (8s) +This section will be adjusted to approximately 8 seconds. + +## Main Content (5-10s) +This will target 10 seconds (uses the end time). +``` + +**How it works:** + +- Similar to macOS say provider +- Adjusts speaking rate to fit target duration +- Uses words-per-minute calculation + +## Common Voice Languages + +Available voices include: + +- **English**: US (en-us), UK (en-gb), Australian (en-au), etc. +- **Spanish**: es, es-la (Latin America) +- **French**: fr, fr-be (Belgian) +- **German**: de +- **Italian**: it +- **Portuguese**: pt, pt-br (Brazilian) +- **Russian**: ru +- **Chinese**: zh (Mandarin) +- **Japanese**: ja +- And many more... + +Run `./md2audio -provider espeak -list-voices` to see all available voices. + +## Voice Mapping + +When you use macOS voice names on Linux, they're automatically mapped: + +| macOS Voice | Linux espeak Voice | +|-------------|-------------------| +| Kate | en-gb | +| Daniel | en-gb | +| Samantha | en-us | +| Alex | en-us | +| Karen | en-au | +| Veena | en-in | + +This allows scripts and commands to work across both platforms without modification. 
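The mapping table above amounts to a lookup with a pass-through fallback. The following is a minimal sketch of that idea, not md2audio's actual code: the names `macToEspeak` and `resolveVoice` are hypothetical, and exact-case matching of voice names is an assumption.

```go
package main

import "fmt"

// macToEspeak mirrors the voice mapping table in this document.
// The variable name is hypothetical and chosen for illustration only.
var macToEspeak = map[string]string{
	"Kate":     "en-gb",
	"Daniel":   "en-gb",
	"Samantha": "en-us",
	"Alex":     "en-us",
	"Karen":    "en-au",
	"Veena":    "en-in",
}

// resolveVoice returns the espeak voice code for a macOS voice name.
// Names that are not in the table (for example a native espeak code
// such as "en-gb") are returned unchanged. Matching is case-sensitive
// here, which is an assumption of this sketch.
func resolveVoice(name string) string {
	if code, ok := macToEspeak[name]; ok {
		return code
	}
	return name
}

func main() {
	for _, name := range []string{"Kate", "Veena", "en-gb"} {
		fmt.Printf("%s -> %s\n", name, resolveVoice(name))
	}
}
```

Falling back to the input keeps native espeak codes working alongside macOS names, which matches the behavior described for `-v Kate` and `-v en-gb`.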
+ +## Tips + +1. **Quality**: WAV provides lossless quality, MP3 is more portable +2. **ffmpeg**: Required for MP3, M4A, and AIFF output formats +3. **Testing**: Use dry-run mode to preview: `-dry-run` +4. **Caching**: Voice list is cached for 30 days for faster lookups +5. **Cross-platform**: Use voice presets for portable scripts + +## Troubleshooting + +### espeak-ng not found + +```bash +# Ubuntu/Debian +sudo apt install espeak-ng + +# Fedora +sudo dnf install espeak-ng + +# Arch +sudo pacman -S espeak-ng +``` + +### Format conversion fails + +```bash +# Install ffmpeg for MP3/M4A/AIFF support +sudo apt install ffmpeg # Ubuntu/Debian +sudo dnf install ffmpeg # Fedora +sudo pacman -S ffmpeg # Arch +``` + +### Voice not found + +```bash +# List all available voices +./md2audio -provider espeak -list-voices + +# Use espeak voice code +./md2audio -f script.md -v en-gb +``` + +### Audio quality issues + +- espeak-ng generally has better quality than legacy espeak +- For higher quality, consider ElevenLabs or Google Cloud TTS +- Adjust rate for clearer speech: `-r 170` + +## Performance + +- Fast generation (local processing) +- No API rate limits +- Works offline +- Voice cache updates instantly +- Lightweight resource usage + +## Limitations + +- Linux only (not available on macOS or Windows) +- Robotic voice quality (not neural TTS) +- Limited voice customization +- Timing accuracy varies + +## Next Steps + +- [ElevenLabs](elevenlabs.md) - Cloud-based, premium quality +- [Google Cloud TTS](google.md) - Enterprise features, Neural2 voices +- Check [Provider Comparison](../provider-comparison.md) for detailed comparison diff --git a/docs/providers/google.md b/docs/providers/google.md new file mode 100644 index 0000000..4159cdf --- /dev/null +++ b/docs/providers/google.md @@ -0,0 +1,196 @@ +# Google Cloud TTS Example + +This example demonstrates using Google Cloud Text-to-Speech with md2audio. + +## Setup + +### 1. Create a Google Cloud Project + +1. 
Go to [Google Cloud Console](https://console.cloud.google.com/) +2. Create a new project or select an existing one +3. Enable the Cloud Text-to-Speech API +4. Create a service account: + - Go to IAM & Admin > Service Accounts + - Click "Create Service Account" + - Grant the "Cloud Text-to-Speech User" role + - Create and download a JSON key file + +### 2. Configure Credentials + +Set the environment variable to point to your service account key: + +```bash +export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your/service-account-key.json" +``` + +Or add it to your `.env` file: + +```bash +# Append (>>) so existing entries in .env are preserved +echo 'GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json' >> .env +``` + +## Usage Examples + +### List Available Voices + +```bash +# List all Google Cloud TTS voices (400+ voices) +./md2audio -provider google -list-voices + +# Export voices to JSON file for reference +./md2audio -provider google -export-voices google-voices.json +``` + +### Generate Audio from Markdown + +```bash +# Process a single file with Neural2 voice (high quality) +./md2audio -provider google -google-voice en-US-Neural2-F -f examples/demo_script.md + +# Process with British English voice +./md2audio -provider google -google-voice en-GB-Neural2-A -f examples/demo_script.md + +# Generate MP3 files +./md2audio -provider google -google-voice en-US-Neural2-F -format mp3 -f examples/demo_script.md + +# Process entire directory +./md2audio -provider google -google-voice en-US-Neural2-C -d ./docs -o ./audio_output +``` + +### Advanced Options + +```bash +# Adjust speaking rate (0.25 = very slow, 4.0 = very fast) +./md2audio -provider google -google-voice en-US-Neural2-F -google-speed 1.5 -f script.md + +# Adjust pitch (-20.0 to 20.0 semitones) +./md2audio -provider google -google-voice en-US-Neural2-F -google-pitch 2.0 -f script.md + +# Adjust volume gain (-96.0 to 16.0 dB) +./md2audio -provider google -google-voice en-US-Neural2-F -google-volume 3.0 -f script.md + +# Use different language +./md2audio -provider 
google -google-voice es-ES-Neural2-A -google-language es-ES -f spanish_script.md +``` + +## Voice Types + +Google Cloud TTS offers multiple voice quality tiers: + +### Neural2 (Recommended - Best Quality) + +- `en-US-Neural2-F` - Female, American English +- `en-US-Neural2-M` - Male, American English +- `en-GB-Neural2-A` - Female, British English +- `en-GB-Neural2-B` - Male, British English + +### WaveNet (High Quality) + +- `en-US-Wavenet-F` - Female, American English +- `en-US-Wavenet-A` - Male, American English + +### Standard (Basic Quality) + +- `en-US-Standard-A` - Female, American English +- `en-US-Standard-D` - Male, American English + +### Studio (Premium Quality - Highest Fidelity) + +- `en-US-Studio-M` - Male voice optimized for studio recordings +- `en-US-Studio-O` - Female voice optimized for studio recordings + +## Output Formats + +Google Cloud TTS supports: + +- **MP3** - Compressed, good for web use +- **WAV** - Uncompressed, high quality (LINEAR16 encoding) +- **OGG** - Compressed with Opus codec + +```bash +# Generate WAV files +./md2audio -provider google -google-voice en-US-Neural2-F -format wav -f script.md + +# Generate OGG files +./md2audio -provider google -google-voice en-US-Neural2-F -format ogg -f script.md +``` + +## Timing Control + +Google Cloud TTS has the widest speaking rate range (0.25x - 4.0x): + +```bash +# Slow speech for learning materials +./md2audio -provider google -google-voice en-US-Neural2-F -google-speed 0.75 -f lesson.md + +# Fast speech for quick reviews +./md2audio -provider google -google-voice en-US-Neural2-F -google-speed 1.5 -f summary.md +``` + +The tool also supports timing annotations in H2 headers: + +```markdown +## Introduction (8s) +This section will be adjusted to speak in approximately 8 seconds. + +## Quick Overview (3-5s) +This will be between 3 and 5 seconds long. 
+``` + +## Multi-Language Support + +Google Cloud TTS supports 50+ languages: + +```bash +# Spanish +./md2audio -provider google -google-voice es-ES-Neural2-A -google-language es-ES -f spanish.md + +# French +./md2audio -provider google -google-voice fr-FR-Neural2-A -google-language fr-FR -f french.md + +# German +./md2audio -provider google -google-voice de-DE-Neural2-F -google-language de-DE -f german.md + +# Japanese +./md2audio -provider google -google-voice ja-JP-Neural2-B -google-language ja-JP -f japanese.md +``` + +## Pricing + +See [Google Cloud TTS Pricing](https://cloud.google.com/text-to-speech/pricing) for current rates. + +## Tips + +1. **Voice Selection**: Start with Neural2 voices for the best quality-to-cost ratio +2. **Caching**: The first `-list-voices` call downloads all voices; subsequent calls use cache (instant) +3. **Credentials**: Keep your service account key secure, never commit it to version control +4. **IAM Permissions**: Ensure your service account has the "Cloud Text-to-Speech User" role +5. **Rate Limits**: Google Cloud TTS has generous rate limits, but for bulk processing consider batching +6. 
**Language Codes**: Use the same language code in voice name and `-google-language` flag + +## Troubleshooting + +### "Google Cloud credentials not found" error + +- Ensure `GOOGLE_APPLICATION_CREDENTIALS` is set correctly +- Check that the service account key file exists and is readable +- Verify the path doesn't contain typos + +### "Permission denied" error + +- Check that your service account has the "Cloud Text-to-Speech User" role +- Ensure the Cloud Text-to-Speech API is enabled in your project + +### Voices not appearing + +- Run with `-refresh-cache` to force update the voice cache +- Check your internet connection +- Verify API access from your network + +## Resources + +- [Google Cloud TTS Documentation](https://cloud.google.com/text-to-speech/docs) +- [Voice List](https://cloud.google.com/text-to-speech/docs/voices) +- [SSML Support](https://cloud.google.com/text-to-speech/docs/ssml) (future feature) +- [Audio Profiles](https://cloud.google.com/text-to-speech/docs/audio-profiles) (future feature) +- Check [Provider Comparison](../provider-comparison.md) for detailed comparison diff --git a/docs/providers/say.md b/docs/providers/say.md new file mode 100644 index 0000000..3c6e1c1 --- /dev/null +++ b/docs/providers/say.md @@ -0,0 +1,171 @@ +# macOS say Provider + +The `say` provider uses the built-in macOS text-to-speech system. + +## Platform + +- **macOS only** +- Built-in, no installation required + +## Features + +- Free (built-in with macOS) +- ~70 voices in various languages +- Multiple output formats (AIFF, M4A) +- Wide speaking rate range (90-360 WPM) +- Timing control support +- Offline operation + +## Prerequisites + +- macOS operating system +- No additional installation needed + +## Setup + +No setup required! The `say` command is built into macOS. 
+ +## Usage + +### Basic Usage + +```bash +# List available voices +./md2audio -provider say -list-voices + +# Generate audio with default voice +./md2audio -f script.md + +# Use voice preset +./md2audio -f script.md -p british-female + +# Use specific voice +./md2audio -f script.md -v Kate +``` + +### Voice Presets + +| Preset | Voice Name | +|---------------------|------------| +| `british-female` | Kate | +| `british-male` | Daniel | +| `us-female` | Samantha | +| `us-male` | Alex | +| `australian-female` | Karen | +| `indian-female` | Veena | + +### Advanced Options + +```bash +# Adjust speaking rate (lower = slower) +./md2audio -f script.md -v Kate -r 170 + +# Generate M4A instead of AIFF +./md2audio -f script.md -v Kate -format m4a + +# Process entire directory +./md2audio -d ./docs -p british-female -o ./audio +``` + +## Output Formats + +- **AIFF** (default) - Uncompressed, high quality +- **M4A** - Compressed, smaller file size + +The tool uses AIFF internally and converts to M4A using `afconvert` if requested. + +## Timing Control + +The say provider supports timing annotations in H2 headers: + +```markdown +## Introduction (8s) +This section will be adjusted to approximately 8 seconds. + +## Main Content (5-10s) +This will target 10 seconds (uses the end time). +``` + +**How it works:** + +- Counts words in the text +- Calculates required WPM to fit target duration +- Applies 0.95 adjustment factor for better accuracy +- Wide range: 90-360 WPM + +**Accuracy:** + +- Typical variance: 1-3 seconds from target +- Best for general-purpose narration +- Use `afinfo` to verify actual duration + +## Common Voice Languages + +Available voices include: + +- **English**: US, UK, Australian, Indian, Irish, South African +- **Spanish**: Spain, Mexico, Argentina +- **French**: France, Canadian +- **German**: Germany +- **Italian**: Italy +- **Japanese**: Japan +- **Korean**: Korea +- **Chinese**: Mandarin, Cantonese +- And many more... 
+ +Run `./md2audio -list-voices` to see all available voices on your system. + +## Tips + +1. **Quality**: AIFF provides the best quality, M4A is more portable +2. **Rate**: Default is 180 WPM; adjust between 90-360 for different pacing +3. **Testing**: Use dry-run mode to preview before generating: `-dry-run` +4. **Caching**: Voice list is cached for 30 days for faster lookups + +## Troubleshooting + +### Voice not found + +```bash +# List all available voices +./md2audio -provider say -list-voices + +# Use exact voice name +./md2audio -f script.md -v "Samantha" +``` + +### Audio too fast/slow + +```bash +# Slower speech (lower rate) +./md2audio -f script.md -v Kate -r 150 + +# Faster speech (higher rate) +./md2audio -f script.md -v Kate -r 200 +``` + +### M4A conversion fails + +- Ensure you have the latest macOS updates +- The `afconvert` command should be available by default +- Try generating AIFF first to verify the issue + +## Performance + +- Fast generation (local processing) +- No API rate limits +- Works offline +- Voice cache updates instantly + +## Limitations + +- macOS only (not available on Windows or Linux) +- Lower quality compared to Neural TTS services +- Limited voice customization (no pitch/volume control) +- Timing accuracy varies (±1-3 seconds typical) + +## Next Steps + +- Try [ElevenLabs](elevenlabs.md) for higher quality voices +- Try [Google Cloud TTS](google.md) for enterprise features +- Check [Provider Comparison](../provider-comparison.md) for detailed comparison