Last Updated: 2024-12-01 | Project: zen-omni | Organization: zenlm | Website: https://zenlm.org
Zen Omni is a hypermodal language model for translation and audio generation, built on Qwen3-Omni-30B-A3B. It is part of the Zen LM model family by Hanzo AI.
| Attribute | Value |
|---|---|
| Base Model | Qwen3-Omni-30B-A3B-Instruct |
| Architecture | Qwen3OmniMoeForConditionalGeneration |
| Total Parameters | 30B |
| Active Parameters | 3B (via MoE) |
| Text Languages | 119 |
| Speech Input | 19 languages |
| Speech Output | 10 languages |
| Context Length | 32,768 tokens |
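The 30B-total / 3B-active split in the table above comes from MoE routing. A back-of-envelope check (the figures are from this file; the explanation of the gap is standard MoE reasoning, not an official breakdown):

```python
# Back-of-envelope check on the 30B-total / 3B-active MoE figures.
total_params = 30e9
active_params = 3e9
experts, active_experts = 128, 8  # Thinker MoE, per the architecture overview

expert_fraction = active_experts / experts        # 0.0625
active_fraction = active_params / total_params    # 0.10
print(f"{expert_fraction:.2%} of experts, {active_fraction:.0%} of parameters active")
# → 6.25% of experts, 10% of parameters active
# The gap is expected: attention, embeddings, and router weights are shared
# across tokens and always active, on top of the 8 routed experts.
```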

| Model | Purpose | HuggingFace |
|---|---|---|
| zen-omni | General multimodal | zenlm/zen-omni |
| zen-omni-30b-instruct | Instruction following | zenlm/zen-omni-30b-instruct |
| zen-omni-30b-thinking | Extended reasoning | zenlm/zen-omni-30b-thinking |
| zen-omni-30b-captioner | Image/video captioning | zenlm/zen-omni-30b-captioner |
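The repo IDs in the table can be addressed programmatically; a minimal sketch (repo IDs are from the table above, the dict and helper name are ours for illustration):

```python
# Map of Zen Omni variants to their HuggingFace repo IDs (from the table above).
ZEN_OMNI_VARIANTS = {
    "general": "zenlm/zen-omni",
    "instruct": "zenlm/zen-omni-30b-instruct",
    "thinking": "zenlm/zen-omni-30b-thinking",
    "captioner": "zenlm/zen-omni-30b-captioner",
}

def hf_download_cmd(variant: str, local_dir: str = "./base-model") -> str:
    """Build the `hf download` CLI invocation for a variant (helper is illustrative)."""
    repo = ZEN_OMNI_VARIANTS[variant]
    return f"hf download {repo} --local-dir {local_dir}"

print(hf_download_cmd("instruct"))
# → hf download zenlm/zen-omni-30b-instruct --local-dir ./base-model
```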
```
zen-omni/
├── base-model/              # Downloaded Qwen3-Omni-30B-A3B weights
├── src/zen_omni/            # Python package
│   ├── __init__.py
│   ├── cli.py               # Command-line interface
│   ├── translator.py        # ZenOmniTranslator class
│   └── pipeline.py          # ZenDubbingPipeline, HanzoOrchestrationLayer
├── training/
│   ├── zen_identity_sft.yaml   # ms-swift training config
│   ├── ds_config_zero2.json    # DeepSpeed config
│   ├── train_identity.sh       # Training script
│   └── data/zen_identity.jsonl # Identity training data
├── hf-cards/                # HuggingFace model cards
│   ├── zen-omni/
│   ├── zen-omni-30b-instruct/
│   ├── zen-omni-30b-thinking/
│   └── zen-omni-30b-captioner/
├── paper/
│   └── zen_omni_technical_report.md
├── scripts/
│   └── upload_hf_cards.sh
├── docs/                    # Website documentation
├── pyproject.toml           # Python package config
├── README.md                # Main readme
└── LLM.md                   # This file
```
```bash
# Install package
pip install -e ".[all]"

# Run CLI
zen-omni translate audio.wav --lang en
zen-omni dub video.mp4 --lang en
zen-omni chat
zen-omni caption image.jpg

# Identity fine-tuning with ms-swift
cd training
./train_identity.sh

# Upload model cards
./scripts/upload_hf_cards.sh

# Upload full model weights
hf upload zenlm/zen-omni ./base-model --repo-type model
```

```
INPUT ENCODERS
├── Audio Encoder (32 layers, 1280 dim)
├── Vision Encoder (27 layers, 1152 dim)
└── Text Embeddings (151,936 vocab)
        ↓
THINKER (Multimodal LLM)
├── 48 transformer layers
├── 128 experts (MoE)
├── 8 experts active per token
└── Cross-modal attention fusion
        ↓
TALKER (Audio Generator)
├── Code2Wav audio codec
├── 16 quantizers, 2048 codebook
└── Streaming synthesis (24kHz)
```
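The Talker's "16 quantizers, 2048 codebook" describes a residual vector quantization scheme: each quantizer picks one codebook entry, and the selected entries are summed to reconstruct a latent audio frame. A toy sketch of the decode side (random stand-in codebooks and an assumed latent dimension of 64, not real Code2Wav weights):

```python
import numpy as np

# Illustrative residual-VQ decode: 16 quantizers, 2048-entry codebooks,
# per the Talker diagram above. Codebooks and DIM=64 are random stand-ins.
rng = np.random.default_rng(0)
NUM_QUANTIZERS, CODEBOOK_SIZE, DIM = 16, 2048, 64
codebooks = rng.normal(size=(NUM_QUANTIZERS, CODEBOOK_SIZE, DIM))

def decode_frame(codes: np.ndarray) -> np.ndarray:
    """Sum each quantizer's selected codebook entry to rebuild one latent frame."""
    assert codes.shape == (NUM_QUANTIZERS,)
    return sum(codebooks[q, codes[q]] for q in range(NUM_QUANTIZERS))

codes = rng.integers(0, CODEBOOK_SIZE, size=NUM_QUANTIZERS)  # one code per quantizer
frame = decode_frame(codes)
print(frame.shape)  # → (64,)
# Each frame costs 16 * log2(2048) = 176 bits before any entropy coding.
```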
Zen Omni integrates with zen-dub for video dubbing:

```python
from zen_omni import ZenDubbingPipeline

pipeline = ZenDubbingPipeline()
pipeline.dub("video.mp4", target_lang="en", output_path="dubbed.mp4")
```

Pipeline stages:
- Extract audio from video
- Translate speech with Zen Omni
- Generate lip-synced video with Zen Dub
- Composite final output
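The four stages above run sequentially; a minimal sketch of the orchestration, where every stage function is a hypothetical stand-in (the real ZenDubbingPipeline wraps Zen Omni and zen-dub, which are not modeled here):

```python
from pathlib import Path

# Hypothetical stand-ins for the four dubbing stages; each just derives a
# filename so the control flow is visible without the real models.
def extract_audio(video): return f"{Path(video).stem}.wav"
def translate_speech(audio, lang): return f"{Path(audio).stem}.{lang}.wav"
def lip_sync(video, audio): return f"{Path(video).stem}.synced.mp4"
def composite(video, output_path): return output_path

def dub(video: str, target_lang: str, output_path: str) -> str:
    audio = extract_audio(video)                   # 1. extract audio from video
    dubbed = translate_speech(audio, target_lang)  # 2. translate speech with Zen Omni
    synced = lip_sync(video, dubbed)               # 3. lip-synced video with Zen Dub
    return composite(synced, output_path)          # 4. composite final output

print(dub("video.mp4", "en", "dubbed.mp4"))  # → dubbed.mp4
```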
- Qwen3-Omni: Base multimodal architecture
- ms-swift: ModelScope fine-tuning framework
- MuseTalk: Neural lip synchronization (zen-dub)
- Whisper: Audio feature extraction
- DeepSpeed: Distributed training
- Download base model: `hf download Qwen/Qwen3-Omni-30B-A3B-Instruct --local-dir ./base-model`
- Identity fine-tuning: `./training/train_identity.sh`
- Test locally: `zen-omni chat`
- Upload to HuggingFace: `./scripts/upload_hf_cards.sh`
This file (LLM.md) is symlinked as:
- .AGENTS.md
- CLAUDE.md
- QWEN.md
- GEMINI.md
All files reference the same knowledge base. Updates here propagate to all AI systems.
- ALWAYS update LLM.md with significant discoveries
- NEVER commit symlinked files (.AGENTS.md, CLAUDE.md, etc.) - they're in .gitignore
- NEVER create random summary files - update THIS file
- Zen models are based on Qwen3 (NOT Qwen2!)
- Use the `hf` CLI for HuggingFace operations
- Test-driven development - always verify before marking complete
Completed:
- README.md with correct architecture
- ms-swift training configuration
- Identity training data
- Python package (src/zen_omni/)
- CLI tool
- ZenOmniTranslator class
- ZenDubbingPipeline integration
- HanzoOrchestrationLayer for real-time streaming
- HuggingFace model cards for all variants
- Technical report
- pyproject.toml
Pending:
- Downloading Qwen3-Omni-30B-A3B-Instruct weights (~66GB)
- Identity fine-tuning execution
- Upload fine-tuned weights to HuggingFace
- Integration testing with zen-dub
- Performance benchmarking
Zen Omni: Hypermodal Language Model for Translation and Audio Generation
Hanzo AI | https://hanzo.ai | Techstars '17
Zoo Labs Foundation | https://zoolabs.io | 501(c)(3)