Zero-Shot Voice Cloning & TTS Engine
Upload a 5–30 second audio sample of any person speaking, and generate unlimited new speech in that voice from any text.
No training, no GPU required, no cloud dependency. Everything runs locally and privately on your machine.
- How Voice Cloning Works
- Quick Start
- Architecture
- File Structure
- Features
- API Reference
- Text Features
- Python SDK
- Docker
- Performance
- Security & Privacy
- Technology Stack
Traditional TTS requires training a dedicated model for one voice — hours of data, days of compute. Zero-shot cloning skips all of that.
F5-TTS (the model Mimicry uses) works via flow matching:
- The model encodes the reference audio into a learned representation of what that speaker sounds like — their timbre, rhythm, accent, breathing pattern
- It treats speech generation as a diffusion/flow process: starting from noise, it iteratively refines a mel-spectrogram conditioned on both the target text AND the reference speaker signal
- The reference transcript (
ref_text) tells the model which parts of the reference audio correspond to which phonemes, giving it a precise phoneme-to-speaker alignment
Whisper (faster-whisper small.en) auto-transcribes the reference audio so you never have to type it manually.
flowchart LR
REF["🎤 Reference Audio\n5–30 seconds of speech"]
TXT["📝 Reference Transcript\nauto-transcribed by Whisper"]
GEN["✍️ Text to Speak\nany length, any content"]
subgraph F5TTS["F5-TTS — Flow Matching"]
ENC["Speaker Encoder\nextract voice characteristics"]
FLOW["Flow Model\nnoise → mel-spectrogram\nconditioned on speaker + text"]
VOC["Vocoder\nmel → waveform"]
end
OUT["🔊 Synthesized Speech\nin the reference speaker's voice"]
REF --> ENC
TXT --> ENC
GEN --> FLOW
ENC --> FLOW
FLOW --> VOC
VOC --> OUT
- Python 3.10–3.13
- ffmpeg installed and on PATH (for MP3 output)
- ~1 GB free disk space (model weights)
# 1. Install Python dependencies
pip install -r requirements.txt
# 2. Start the server (Windows)
start.bat
# Or directly on any OS
python -m uvicorn backend.app.main:app --host 0.0.0.0 --port 8000
# 3. Open in browser
http://localhost:8000First run downloads model weights automatically:
- Whisper small.en → ~240 MB
- F5-TTS → ~800 MB
Both are cached in ~/.cache/huggingface/hub/. Subsequent starts load in ~5 seconds.
python scripts/setup_check.pydocker compose up
# open http://localhost:8000Model weights are cached in a named Docker volume and survive container rebuilds.
Set MIMICRY_API_KEY before starting to protect write endpoints:
# Linux/macOS
MIMICRY_API_KEY=mysecret python -m uvicorn backend.app.main:app --port 8000
# Windows
set MIMICRY_API_KEY=mysecret && python -m uvicorn backend.app.main:app --port 8000Then pass the key in requests:
Authorization: Bearer mysecret
In the browser, add ?key=mysecret to the URL once — it is stored in localStorage automatically.
graph TD
subgraph BROWSER["🌐 Browser — frontend/"]
UI["index.html · style.css · app.js\n4 Tabs: Synthesize · Batch · Mix · Jobs\nVoice Library · History · Verify"]
end
subgraph FASTAPI["⚡ FastAPI — backend/main.py"]
ROUTES["18 REST Endpoints"]
QUEUE["Job Queue\nThreadPoolExecutor max_workers=1"]
AUTH["Optional Bearer Auth\nMIMICRY_API_KEY"]
STATIC["StaticFiles /api/audio"]
end
subgraph ENGINE["🧠 VoiceEngine — backend/voice_engine.py"]
WHISPER["faster-whisper small.en\nauto-transcription"]
F5["F5-TTS\nflow-matching TTS"]
PROC["noisereduce · pyloudnorm\nsoundfile · pydub"]
end
subgraph STORAGE["💾 Local Storage — backend/storage/"]
VOICES["voices/\nreference WAV + JSON"]
OUTPUTS["outputs/\ngenerated WAV + MP3"]
HIST["history.json\nlast 40 records"]
DB[("jobs.db\nSQLite")]
end
BROWSER -->|"HTTP / JSON fetch API"| FASTAPI
FASTAPI -->|"Python calls"| ENGINE
ENGINE -->|"read / write"| STORAGE
QUEUE -->|"persist every status change"| DB
flowchart TD
A["📝 Input Text"] --> B["Normalize newlines\n\\n\\n → <break time='0.7s'/>"]
B --> C["_parse_segments\nsplit on <break> and emotion tags"]
C --> D{Segment type?}
D -->|"speak"| E["_chunk_text\n≤ 180 chars per chunk"]
D -->|"pause"| P["Insert silence\nexact duration"]
E --> F["F5-TTS.infer\nref_file · ref_text · chunk · speed"]
F --> G["_trim_silence\nremove −38 dB edges"]
G --> H["Concatenate\n280 ms inter-chunk\n220 ms inter-segment"]
P --> H
H --> I["pyloudnorm\n−16 LUFS EBU R128"]
I --> J["_embed_watermark\nLSB: MIMICRY:{job_id}"]
J --> K["soundfile.write\n{job_id}.wav"]
K --> L["pydub.export\n{job_id}.mp3 192k"]
flowchart TD
A["🎤 User Audio\nWAV · MP3 · M4A · OGG"] --> B["pydub\n→ 24 kHz mono WAV"]
B --> C["noisereduce\nspectral subtraction prop_decrease=0.70"]
C --> D["validate_audio\ncheck duration · RMS · clipping · sample rate\n→ warnings list"]
D --> E{ref_text provided?}
E -->|"No"| F["faster-whisper small.en\nauto-transcribe"]
E -->|"Yes"| G["Use provided transcript"]
F --> H["Save voice profile\n{id}.wav + {id}.json"]
G --> H
Synthesis is CPU-bound and takes 15–120 s per request. The queue keeps the server responsive:
sequenceDiagram
participant B as Browser
participant API as FastAPI
participant T as Background Thread
participant DB as SQLite jobs.db
B->>API: POST /api/synthesize
API->>DB: INSERT job status=pending
API->>T: ThreadPoolExecutor.submit(_run_synth)
API-->>B: 202 {job_id}
loop Poll every 2.5 s
B->>API: GET /api/jobs/{id}
API-->>B: {status: "pending"} or {status: "running"}
end
T->>DB: UPDATE status=running
T->>T: engine.synthesize(text, voice_id, ...)
T->>DB: UPDATE status=done audio_url mp3_url
T->>DB: INSERT history record
B->>API: GET /api/jobs/{id}
API-->>B: {status: "done", audio_url, mp3_url, watermark_id}
B->>B: Show audio player + download buttons
stateDiagram-v2
[*] --> pending : POST /api/synthesize\nor /api/batch
pending --> running : background thread starts
running --> done : synthesis complete\naudio written to disk
running --> failed : exception raised
done --> [*]
failed --> [*] : Retry button\nre-submits same params
flowchart LR
VA["🎤 Voice A\nreference WAV"] --> NA["peak normalize"]
VB["🎤 Voice B\nreference WAV"] --> NB["peak normalize"]
NA --> TRIM["trim to same length\nmin of both durations"]
NB --> TRIM
TRIM --> MIX["α · wav_A + (1−α) · wav_B\n\nα = 1.0 → pure A\nα = 0.5 → equal blend\nα = 0.0 → pure B"]
MIX --> NORM["peak normalize result"]
NORM --> SAVE["Save as new voice profile\nis_mix: true\nuse for synthesis like any other voice"]
flowchart LR
subgraph EMBED["Embedding — on every synthesis"]
WM["MIMICRY:{job_id}\\0\nconvert each char → 8 bits"]
LSB["Write each bit into\nLSB of audio sample\n±1/32767 change inaudible"]
WAV["Output WAV\nwatermark survives\nfile copies"]
WM --> LSB --> WAV
end
subgraph EXTRACT["Extraction — POST /api/verify"]
READ["Read first 1024 samples\nextract LSBs → bit string"]
DECODE["Decode bits → chars\nlook for MIMICRY: prefix"]
RESULT["watermark_found: true\nwatermark_id: abc123"]
READ --> DECODE --> RESULT
end
WAV -.->|"upload to /api/verify"| READ
All job state is written to SQLite on every status change — jobs survive server restarts.
Voice Cloning/
│
├── backend/
│ ├── app/
│ │ ├── __init__.py
│ │ ├── main.py # FastAPI app — all routes
│ │ └── voice_engine.py # VoiceEngine singleton — all AI logic
│ └── storage/ # Auto-created at runtime
│ ├── voices/ # {id}.wav + {id}.json per voice
│ ├── outputs/ # {jobid}.wav + {jobid}.mp3
│ │ └── history.json # Last 40 synthesis records
│ └── jobs.db # SQLite persistent job store
│
├── scripts/
│ ├── build_release.py # Release packager
│ └── setup_check.py # Installation verifier
│
├── docs/
│ └── assets/ # README images
│
├── frontend/
│ ├── index.html # Single-page app
│ ├── style.css # Dark UI (CSS custom properties)
│ └── app.js # All frontend logic (no framework)
│
├── sdk/
│ ├── __init__.py
│ ├── client.py # Mimicry — sync client (requests)
│ ├── async_client.py # AsyncMimicry — async client (httpx)
│ └── example.py # Runnable demo
│
├── Dockerfile
├── docker-compose.yml
├── .dockerignore
├── requirements.txt
└── start.bat # Windows launcher
- Upload any WAV, MP3, M4A, or OGG clip (5–30 s recommended)
- Drag-and-drop or click to upload
- Whisper auto-transcribes the reference audio
- Optionally provide your own transcript for better accuracy
- Audio validation warns about short clips, low volume, clipping, low sample rate
- Noise reduction applied automatically on upload
- Edit the transcript at any time via the ✏ button
- Preview the reference audio with the ▶ button
- Export voices as portable
.mimicryfiles (zip of WAV + metadata) - Import
.mimicryfiles shared by others
- Select a voice, language (English / Chinese), and speed (0.5×–2.0×)
- Type up to 5000 characters — long text auto-splits into sentence chunks
- Emotion tags adjust speaking style inline (see Text Features)
- SSML
<break>tags insert precise pauses - Paragraph double-newlines insert 700 ms natural pauses
- Audio player with waveform animation
- Download as MP3 (small) or WAV (lossless, carries watermark)
- Watermark badge (🔏) shows the embedded ID on every output
- Ctrl+Enter to synthesize; Escape to close modals
- Retry button on failed jobs
- One line of text = one audio clip
- Up to 50 lines per batch
- Real-time per-line progress with status icons
- Download all clips as a ZIP archive
- Select two voices and blend them with an α slider
- α=1.0 → pure Voice A, α=0.0 → pure Voice B, α=0.5 → equal blend
- The blend is done at the waveform level (α·wavA + (1-α)·wavB)
- Resulting mixed voice appears in the library and can be used for synthesis
- Table of all synthesis and batch jobs (persisted across restarts)
- Type, status, voice, text preview, speed, age, and play button per job
- Status dots: grey (pending), amber (running), green (done), red (failed)
- Refresh button
- Last 40 synthesis records with voice name, text preview, timestamp
- Play and download each recording
- Watermark ID shown per entry
- Upload any audio file to check if it was generated by Mimicry
- Returns the embedded watermark ID if found
- Works on WAV files (MP3 compression destroys LSB watermarks — use WAV for verification)
All write endpoints (POST, PATCH, DELETE) require authentication only if the MIMICRY_API_KEY environment variable is set on the server.
Authorization: Bearer <your-key>
{"model_loaded": true, "languages": ["en", "zh-cn"], "version": "5.0.0"}Form data: name (string), audio (file), ref_text (string, optional)
{
"id": "a1b2c3d4",
"name": "Alice",
"duration": 8.4,
"ref_text": "The quick brown fox jumps over the lazy dog.",
"warnings": [],
"audio_url": "/api/voices/a1b2c3d4/audio"
}Returns array of voice objects.
Body: {"ref_text": "corrected transcript"}
Form data: file (.mimicry or .zip)
{"voice_id_a": "abc", "voice_id_b": "def", "alpha": 0.6, "name": "Alice×Bob"}{"text": "Hello world", "voice_id": "a1b2c3d4", "language": "en", "speed": 1.0}Returns: {"job_id": "xyz789", "status": "pending"}
{
"id": "xyz789",
"status": "done",
"audio_url": "/api/audio/xyz789.wav",
"mp3_url": "/api/audio/xyz789.mp3",
"watermark_id": "xyz789",
"voice_name": "Alice",
"language": "en",
"speed": 1.0,
"text_preview": "Hello world"
}{"lines": ["Line 1.", "Line 2."], "voice_id": "a1b2c3d4", "language": "en", "speed": 1.0}Returns: {"batch_id": "...", "total": 2, "status": "pending"}
Returns batch job with items[] array and zip_url when complete.
Form data: audio (file). Returns:
{"watermark_found": true, "watermark_id": "xyz789", "generated_by": "Mimicry"}Wrap any text span to change the speaking style for that section only:
| Tag | Effect | Speed multiplier |
|---|---|---|
[fast]…[/fast] |
Speaks faster | 1.35× |
[slow]…[/slow] |
Speaks slower | 0.70× |
[whisper]…[/whisper] |
Quiet, breathy | 0.80× |
[excited]…[/excited] |
Energetic | 1.20× |
[calm]…[/calm] |
Relaxed | 0.85× |
These multiply on top of the global speed slider setting.
Example:
Welcome everyone. [excited]This is incredible news![/excited]
Let me explain slowly. [slow]First, you need to understand the basics.[/slow]
Insert a precise pause anywhere in the text:
First part. <break time="2s"/> Second part after a two-second pause.
Supported range: 0.05s to any duration.
Double newlines automatically create a 700 ms natural pause:
This is the first paragraph.
This is the second paragraph with a natural gap before it.
Single newlines are treated as spaces.
Install dependency: pip install requests (sync) or pip install httpx (async).
from sdk import Mimicry
m = Mimicry("http://localhost:8000", api_key="optional-key")
# Upload a reference voice
voice = m.upload_voice("Alice", "alice.wav")
# or with explicit transcript:
voice = m.upload_voice("Alice", "alice.wav", ref_text="What the audio says.")
# Synthesize — blocks until audio is ready, returns bytes
audio = m.synthesize(voice["id"], "Hello from Mimicry!")
open("output.mp3", "wb").write(audio)
# With emotion tags and speed
audio = m.synthesize(
voice["id"],
"[excited]Incredible![/excited] Back to normal now.",
speed=0.9,
)
# Batch synthesis
items = m.batch(voice["id"], ["Line one.", "Line two.", "Line three."])
for item in items:
print(item["status"], item["audio_url"])
# Download all as ZIP
zip_bytes = m.download_batch_zip(voice["id"], ["Line 1.", "Line 2."])
open("batch.zip", "wb").write(zip_bytes)
# Mix two voices
mixed = m.mix(voice_a_id, voice_b_id, name="Alice×Bob", alpha=0.4)
# Export / import portable voice file
data = m.export_voice(voice["id"])
open("alice.mimicry", "wb").write(data)
restored = m.import_voice("alice.mimicry")
# Verify watermark (WAV files only)
result = m.verify("output.wav")
print(result) # {"watermark_found": True, "watermark_id": "...", ...}
# History and queue
print(m.history())
print(m.queue())
# Context manager
with Mimicry() as m:
voices = m.list_voices()import asyncio
from sdk import AsyncMimicry
async def main():
async with AsyncMimicry("http://localhost:8000") as m:
# Wait for server to be ready after cold start
await m.wait_for_model(timeout=300)
voice = await m.upload_voice("Alice", "alice.wav")
audio = await m.synthesize(voice["id"], "Hello async world!")
open("output.mp3", "wb").write(audio)
# Batch with parallel download
items = await m.batch(
voice["id"],
["First line.", "Second line."],
download=True, # fetches audio_bytes for each item
)
asyncio.run(main())| Method | Description |
|---|---|
status() |
Server status and version |
wait_for_model(timeout) |
Block until model is loaded |
list_voices() |
All saved voice profiles |
upload_voice(name, audio, ref_text?) |
Create voice from audio file |
delete_voice(id) |
Remove voice |
update_ref_text(id, text) |
Correct the reference transcript |
export_voice(id) |
Download as .mimicry bytes |
import_voice(data) |
Restore from .mimicry bytes |
mix(id_a, id_b, name, alpha) |
Blend two voices |
synthesize(id, text, language, speed) |
Generate speech, return bytes |
batch(id, lines, language, speed) |
Multi-line synthesis |
download_batch_zip(id, lines) |
Batch → ZIP bytes |
verify(audio) |
Extract watermark from audio file |
history() |
Last 40 synthesis records |
queue() |
All active + recent jobs |
# Build image
docker build -t mimicry .
# Run with docker compose (recommended — handles volumes)
docker compose up
# Run detached
docker compose up -d
# Rebuild after code changes
docker compose up --builddocker-compose.yml mounts two volumes:
| Volume | Purpose |
|---|---|
hf_cache (named) |
HuggingFace model weights (~1 GB, survives rebuilds) |
./backend/storage (bind) |
Voices, outputs, job history on host disk |
To set the API key in Docker:
# docker-compose.yml
environment:
- MIMICRY_API_KEY=your-secret-keyAll timings are approximate on a modern CPU (no GPU):
| Input | Time |
|---|---|
| 1 short sentence (~15 words) | 15–25 s |
| 1 paragraph (~100 words) | 45–90 s |
| Long text (500 words) | 5–12 min |
| Batch of 10 sentences | ~10× single sentence |
With a CUDA GPU: all times drop by roughly 10×. GPU is auto-detected — no configuration needed.
Synthesis runs in a single background thread (max_workers=1). Multiple requests queue up and run sequentially. This prevents memory exhaustion on CPU.
- Fully local — no audio, text, or metadata ever leaves your machine
- No accounts, no telemetry, no internet required after first model download
- API key auth protects write endpoints when
MIMICRY_API_KEYis set - Audio watermarking — every generated clip carries a hidden ID (LSB steganography) traceable back to the exact synthesis job via
/api/verify- The watermark survives file copies but not MP3 compression — use the WAV file for verification
- CORS is open (
allow_origins=["*"]) — appropriate for local use; restrict in production deployments
| Component | Library | Version |
|---|---|---|
| TTS model | F5-TTS (flow matching) | ≥1.1.0 |
| Transcription | faster-whisper small.en | ≥1.0.0 |
| Noise reduction | noisereduce | ≥3.0.0 |
| Loudness normalization | pyloudnorm (EBU R128) | ≥0.1.1 |
| Audio I/O | soundfile + pydub | — |
| Deep learning | PyTorch | ≥2.1.0 |
| Backend | FastAPI + uvicorn | ≥0.110.0 |
| Job persistence | SQLite (stdlib) | — |
| Sync SDK | requests | ≥2.31.0 |
| Async SDK | httpx | ≥0.27.0 |
| Frontend | Vanilla HTML/CSS/JS | — |
| Container | Docker + docker-compose | — |
| Version | Key additions |
|---|---|
| v1 | Planned with Coqui XTTS v2 → blocked by Python 3.13 incompatibility |
| v2 | Switched to F5-TTS. Basic upload + synthesize + dark UI. Named "Mimicry" |
| v3 | Text chunking, async job queue, ref-text editor modal, history panel, speed slider, silence trimming, audio validator, export/import .mimicry, batch synthesis, emotion tags |
| v4 | Voice mixing (α-blend), LSB audio watermarking, 4-tab UI (Synthesize/Batch/Mix/Jobs), Jobs dashboard, Verify watermark modal, queue badge, Docker, Python SDK v1 |
| v5 | GPU auto-detect, Whisper small.en upgrade, noisereduce denoising, pyloudnorm normalization, MP3 output, SSML <break> tags, paragraph prosody, SQLite job persistence, API key auth, voice preview button, keyboard shortcuts (Ctrl+Enter / Escape), retry UI, mobile-responsive layout, async SDK |