Markdown Podcast Narrator

Converts a Markdown document into a high-quality podcast-style audio file using local TTS. Supports multiple backends — run fully offline on Apple Silicon with Qwen3-TTS or Kokoro, or use the built-in macOS voice as a zero-download fallback.

Please read the medium article for the introduction

What It Does

Reads a .md file
Parses headings, paragraphs, lists, code blocks, tables, and admonitions
Converts each element into natural speech with expressive intros and transitions
ALL-CAPS acronyms (API, HTTP, MAX_RETRIES) are automatically spelled out letter by letter
Synthesizes section by section with consistent narrator tone
Stitches all sections with calibrated silence gaps
Outputs a single .wav file

Audio Examples

Engine	Sample
Qwen3-TTS	▶ podcast-qwen.wav
Kokoro	▶ podcast-kokoro.wav
macOS say	▶ podcast-macos.wav

Supported TTS Engines

Engine	Flag	Quality	Speed	Requires
Qwen3-TTS	`--engine qwen` (default)	Excellent	Very Slow (~30-60s/section on MPS)	`qwen-tts torch` + `brew install sox`
Kokoro	`--engine kokoro`	Very good	Very Fast (~3-5s/section)	`kokoro misaki[en]`
macOS say	`--engine macos` or `--fallback`	Good	Very fast (~3-5s/section)	None (built-in)

Supported Markdown Elements

Element	How It's Narrated
`# H1`	Chapter title with 1.2s pause after
`## H2`	Rotating transitions: "Now let's look at...", "Moving on to...", etc.
`### H3+`	Section title with 0.5s pause
Paragraphs	Read naturally with 0.4s gap
`- bullet list`	Detects if preceding paragraph already introduces it; "First, ..., Second, ..."
`1. ordered list`	Same as bullets with ordinal numbers
```code```	Announces language, reads lines with light symbol cleanup
`\| table \|`	Announces columns + row count; reads rows if 5 or fewer
`!!! info/warning/tip`	Expressive intro per type ("Here's a helpful tip...", "Critical alert...")
`---`	1s pause between major sections
ALL_CAPS words	Spelled out: `API` → "A P I", `MAX_RETRIES` → "M A X underscore R E T R I E S"

Requirements

Python 3.10+
uv (recommended) or pip
macOS for the say fallback; Qwen3-TTS and Kokoro work on any platform

Quick Start

1. Install core dependencies

cd markdown_narrator
uv pip install -r requirements.txt

2. Pick your engine

Option A — Qwen3-TTS (best quality, Apple Silicon)

brew install sox
uv pip install qwen-tts torch torchaudio
uv run main.py your_doc.md

Option B — Kokoro (fast, CPU-based, good quality)

uv pip install kokoro misaki[en]
uv run main.py your_doc.md --engine kokoro

Option C — macOS say (instant, no download)

uv run main.py your_doc.md --fallback

Usage

uv run main.py INPUT_FILE [OPTIONS]

Arguments:
  INPUT_FILE    Path to the Markdown (.md) file to convert

Options:
  -o, --output PATH        Output audio file path (default: output.wav)
  --engine [qwen|kokoro|macos]
                           TTS engine to use (default: qwen)
  --speaker TEXT           Qwen3-TTS speaker name (default: Ryan)
  --rate FLOAT             Speech rate multiplier 0.5-2.0 (default: 0.95)
  --fallback               Use macOS 'say' (same as --engine macos)
  --model TEXT             Override Qwen3-TTS HuggingFace model ID
  --instruct TEXT          Narrator style instruction for Qwen3-TTS
  --kokoro-voice TEXT      Kokoro voice name (default: af_heart)
  --help                   Show this message and exit

Examples

# Default (Qwen3-TTS)
uv run main.py docs/guide.md

# Custom output path
uv run main.py docs/guide.md -o podcasts/guide.wav

# Use Kokoro (fast, CPU)
uv run main.py docs/guide.md --engine kokoro

# Slightly faster narration (~15% shorter audio)
uv run main.py docs/guide.md --rate 1.15

# Test without downloading any model
uv run main.py example.md --fallback

# Use the larger 1.7B Qwen model for higher quality
uv run main.py docs/guide.md --model Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice

# Custom narrator tone
uv run main.py docs/guide.md --instruct "Speak slowly and warmly, like a bedtime story"

# Different Kokoro voice
uv run main.py docs/guide.md --engine kokoro --kokoro-voice af_sky

Architecture

INPUT (.md)
    |
    v
MarkdownParser.parse_to_speech_tokens()
    — uses markdown-it-py + mdit-py-plugins
    — produces typed token list
    |
    v
MarkdownParser.tokens_to_speech_chunks() / tokens_to_section_chunks()
    — converts tokens to [(text, pause_ms)] chunks
    — speech_chunks: fine-grained (per paragraph/item) for macOS say
    — section_chunks: heading-grouped (per H1/H2) for neural TTS
    — pause_ms = silence to insert AFTER this chunk/section
    |
    v
Narrator.synthesize_sections() / synthesize_chunks()
    — Qwen3-TTS: section-level chunks with instruct for consistent tone
      calls model.generate_custom_voice(text, speaker, instruct)
    — Kokoro: section-level chunks via KPipeline
    — macOS fallback: single say call with [[slnc N]] embedded pauses
    — converts output to raw int16 PCM
    — stitches sections with silence frames between them
    — writes PCM incrementally to temp file (memory efficient)
    |
    v
_write_wav()  [stdlib struct only, no pydub]
    — writes RIFF/WAVE header + all PCM
    |
    v
OUTPUT (.wav)

File responsibilities

File	Purpose
`main.py`	CLI entry point. Orchestrates parse → chunk → synthesize pipeline.
`parser.py`	All markdown parsing and speech text generation. No audio logic here.
`narrator.py`	All audio generation. No markdown logic here. Handles all backends.
`pyproject.toml`	uv/pip project config. Core deps only; neural TTS engines are optional.
`requirements.txt`	Flat dependency list for `pip install -r`.
`example.md`	Test document covering all supported elements.
`AGENTS.md`	Instructions and context for AI agents extending this project.

Chunk pause durations

Boundary	Silence
After H1	1200 ms
After H2	800 ms
After H3+	500 ms
After paragraph	400 ms
Between list items	200 ms
After list ends	300 ms
After code/table block	400 ms
After horizontal rule	1000 ms

Available Qwen3-TTS Models

Model ID	Size	Best for
`Qwen/Qwen3-TTS-12Hz-0.6B-CustomVoice`	0.6B	Default. Fast, good quality, named speakers
`Qwen/Qwen3-TTS-12Hz-0.6B-Base`	0.6B	Voice cloning from reference audio
`Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice`	1.7B	Higher quality, slower
`Qwen/Qwen3-TTS-12Hz-1.7B-VoiceDesign`	1.7B	Describe voice in natural language

Available Kokoro Voices

Popular voices for American English (lang_code="a"):

Voice	Character
`af_heart`	Default — warm female
`af_sky`	Light female
`am_adam`	Male
`am_michael`	Male, deeper

Full list: hexgrad/Kokoro-82M on HuggingFace

Tuning for Large Documents

The MAX_SECTION_CHARS constant in parser.py (default: 600) controls how text is grouped into TTS calls:

Lower (300-500) — more calls, more consistent emotion, safer memory usage
Higher (800-1200) — fewer calls, faster overall, slight risk of prosody drift on very long sections

For a 55KB document with default settings, expect ~35-40 sections.

Adding a New TTS Engine

Add _init_<engine>(self) -> bool to Narrator in narrator.py
Add _synth_<engine>(self, text: str) -> Tuple[Optional[bytes], Optional[int]] returning raw int16 PCM + sample rate
Add the engine name to Narrator.ENGINES and the initialize() dispatch
Add _synth_one() dispatch case
Add --engine choice to main.py's click.option

The chunking loop, silence insertion, memory management, and WAV writing are all engine-agnostic.

Other Platforms

Linux

The only macOS-specific parts are the say command fallback and afconvert. Qwen3-TTS and Kokoro both work natively on Linux with CUDA or CPU.

# Install dependencies
apt install sox
pip install qwen-tts torch torchaudio  # or: pip install kokoro misaki[en]
python main.py your_doc.md

For a CPU-only fallback TTS on Linux, add an espeak-ng backend:

# narrator.py — Linux fallback example
def _init_espeak(self) -> bool:
    result = subprocess.run(["which", "espeak-ng"], capture_output=True)
    return result.returncode == 0

def _synth_espeak(self, text: str) -> Tuple[Optional[bytes], Optional[int]]:
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        tmp_wav = Path(tmp.name)
    subprocess.run(["espeak-ng", "-w", str(tmp_wav), "-s", "160", text], check=True)
    # read WAV PCM and return (pcm_bytes, sample_rate)
    ...

Windows

Replace say/afconvert with pyttsx3 or Windows SAPI
All path handling already uses pathlib.Path
Qwen3-TTS and Kokoro work on Windows with CUDA or CPU

Google Colab / Cloud GPU

# Installs
!apt-get install -y sox
!pip install qwen-tts torch torchaudio

# Run
!python main.py your_doc.md

Qwen3-TTS auto-selects CUDA if available.

Troubleshooting

ModuleNotFoundError: No module named 'qwen_tts'

uv pip install qwen-tts

ModuleNotFoundError: No module named 'kokoro'

uv pip install kokoro misaki[en]

sox: command not found during Qwen3-TTS model load

brew install sox

Process killed / semaphore warning with large documents Lower MAX_SECTION_CHARS in parser.py (try 400-600). The process is running out of MPS GPU memory on large sections.

Kokoro words count mismatch warnings Harmless — Kokoro's phonemizer logs these for acronyms and numbers. They are suppressed automatically.

say fallback produces no output Test manually: say -o /tmp/test.aiff "hello". If that fails, go to System Settings → Accessibility → Spoken Content and download a voice.

Large document is slow with Qwen3-TTS

Switch to Kokoro: --engine kokoro (runs on CPU, much faster for batch work)
On Apple Silicon, confirm MPS is active (look for Loading ... on mps in output)
Use --rate 1.1 for shorter audio output (~10% faster synthesis)

Dependencies

Package	Purpose
`markdown-it-py`	Markdown parsing (CommonMark spec)
`mdit-py-plugins`	Admonition (`!!!info`) support
`click`	CLI argument parsing
`soundfile`	Used by Qwen3-TTS internally
`qwen-tts`	Qwen3-TTS model loading and inference
`torch`	PyTorch for Qwen3-TTS inference
`torchaudio`	Audio tensor utilities
`kokoro`	Kokoro TTS engine
`misaki[en]`	English phonemizer for Kokoro

No pydub, ffmpeg, or librosa required. WAV assembly uses Python stdlib struct only.

License

MIT — see LICENSE.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Markdown Podcast Narrator

What It Does

Audio Examples

Supported TTS Engines

Supported Markdown Elements

Requirements

Quick Start

1. Install core dependencies

2. Pick your engine

Usage

Examples

Architecture

File responsibilities

Chunk pause durations

Available Qwen3-TTS Models

Available Kokoro Voices

Tuning for Large Documents

Adding a New TTS Engine

Other Platforms

Linux

Windows

Google Colab / Cloud GPU

Troubleshooting

Dependencies

License

About

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
examples		examples
.gitignore		.gitignore
AGENTS.md		AGENTS.md
LICENSE		LICENSE
README.md		README.md
example.md		example.md
main.py		main.py
narrator.py		narrator.py
parser.py		parser.py
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

Markdown Podcast Narrator

What It Does

Audio Examples

Supported TTS Engines

Supported Markdown Elements

Requirements

Quick Start

1. Install core dependencies

2. Pick your engine

Usage

Examples

Architecture

File responsibilities

Chunk pause durations

Available Qwen3-TTS Models

Available Kokoro Voices

Tuning for Large Documents

Adding a New TTS Engine

Other Platforms

Linux

Windows

Google Colab / Cloud GPU

Troubleshooting

Dependencies

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Contributors

Uh oh!

Languages