Accelerated LTX-2.3 (22B) text-to-video+audio generation on Apple Silicon using MLX with quantized inference.
Generate 5-10 second videos with synchronized audio from text prompts or input images, running entirely on-device.
```
git clone https://github.com/appautomaton/MLX-GenAI.git
cd MLX-GenAI
```

Install uv if you haven't, then download the model weights (see Model Weights below). Dependencies are installed automatically on the first `uv run`.
```
# Text-to-video
uv run python generate.py "A serene mountain lake at sunrise, golden light reflecting off calm water as thin mist drifts across the surface. Tripod-locked camera, live action, 4K."

# Image-to-video
uv run python generate.py -i input/photo.jpg "The person slowly turns and smiles at the camera"

# With options
uv run python generate.py -f 121 -b 8 --upscale "your prompt here"
```

Output is saved to `output/<timestamp>/` with `video.mp4`, `audio.wav`, and individual frames.
- Text-to-video (T2V) and image-to-video (I2V) generation
- Joint audio+video through 48-layer DiT transformer (22B params)
- 8-bit / 4-bit quantized inference via MLX `quantized_matmul`
- 8-step distilled Euler diffusion (LoRA-fused)
- 48kHz stereo audio (BigVGAN v2 vocoder + bandwidth extension)
- Optional 2x spatial upscaler
- Aspect-ratio-aware resolution snapping for I2V
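To illustrate the aspect-ratio-aware resolution snapping for I2V, here is a minimal sketch of the idea: pick a resolution near a target pixel budget that preserves the input image's aspect ratio, with both sides rounded to a multiple of 32. The function name `snap_resolution` and the default pixel budget are illustrative, not the actual logic in `generate.py`.

```python
def snap_resolution(src_w: int, src_h: int,
                    target_pixels: int = 768 * 512,
                    multiple: int = 32) -> tuple[int, int]:
    """Pick a generation resolution near `target_pixels` that keeps the
    input image's aspect ratio, with both sides snapped to a multiple of 32."""
    aspect = src_w / src_h
    # Solve w * h ~= target_pixels subject to w / h ~= aspect.
    height = (target_pixels / aspect) ** 0.5
    width = height * aspect
    snap = lambda v: max(multiple, round(v / multiple) * multiple)
    return snap(width), snap(height)

# A 16:9 photo snaps to 832x480 under this pixel budget.
print(snap_resolution(1920, 1080))
```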
- Apple Silicon Mac (M-series, M1 or later)
- macOS with Metal support
- Python 3.12+, uv
- ffmpeg
- ~14 GB unified memory (8-bit) or ~10 GB (4-bit)
Download from HuggingFace and place under models/:
| Model | Source | Path |
|---|---|---|
| LTX-2.3 FP8 (29 GB) | Lightricks/LTX-2.3-fp8 | models/LTX-2.3-fp8/ltx-2.3-22b-dev-fp8.safetensors |
| Distilled LoRA (7.6 GB) | Lightricks/LTX-2.3 | models/LTX-2.3/ltx-2.3-22b-distilled-lora-384.safetensors |
| Spatial Upscaler 2x (1 GB) | Lightricks/LTX-2.3 | models/LTX-2.3/ltx-2.3-spatial-upscaler-x2-1.0.safetensors |
| Gemma 3 12B (~24 GB) | google/gemma-3-12b-pt | models/gemma-3-12b/ |
```
# Download with huggingface-cli
huggingface-cli download Lightricks/LTX-2.3-fp8 --local-dir models/LTX-2.3-fp8
huggingface-cli download Lightricks/LTX-2.3 --local-dir models/LTX-2.3
huggingface-cli download google/gemma-3-12b-pt --local-dir models/gemma-3-12b
```

```
uv run python generate.py [prompt] [options]
```
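After downloading, a quick preflight check can confirm the weights landed where the table above expects them. This is a hypothetical helper, not part of the repo; the paths are taken from the table.

```python
from pathlib import Path

# Weight locations from the Model Weights table (upscaler is optional).
REQUIRED = [
    "models/LTX-2.3-fp8/ltx-2.3-22b-dev-fp8.safetensors",
    "models/LTX-2.3/ltx-2.3-22b-distilled-lora-384.safetensors",
    "models/gemma-3-12b",
]

missing = [p for p in REQUIRED if not Path(p).exists()]
if missing:
    print("Missing weights:")
    for p in missing:
        print("  -", p)
else:
    print("All required weights found.")
```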
```
-p, --prompt-flag  Text prompt (flag form)
-i, --image        Input image for I2V (jpg/jpeg/png)
    --strength     I2V conditioning strength 0.0-1.0 (default: 0.95)
-f, --frames       Frame count, must be 8k+1 (default: 121)
-H, --height       Height, divisible by 32 (default: 512)
-W, --width        Width, divisible by 32 (default: 768)
-b, --bits         Quantization: 4 or 8 (default: 8)
-s, --seed         Random seed
    --fps          Frames per second (default: 24)
    --no-audio     Skip audio generation
    --upscale      2x spatial upscale
-o, --output       Output directory (default: output/)
```
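The frame and dimension constraints above (frames must be 8k+1; height and width divisible by 32) can be checked up front. This is a sketch under the README's stated rules; `check_video_args` is a hypothetical helper, and `generate.py`'s actual validation may differ.

```python
def check_video_args(frames: int, height: int, width: int) -> None:
    """Validate CLI constraints: frames = 8k+1, dims divisible by 32."""
    if frames % 8 != 1:
        nearest = round((frames - 1) / 8) * 8 + 1
        raise ValueError(
            f"frames must be 8k+1, got {frames} (nearest valid: {nearest})")
    for name, dim in (("height", height), ("width", width)):
        if dim % 32 != 0:
            raise ValueError(f"{name} must be divisible by 32, got {dim}")

# The defaults (121 frames, 512x768) satisfy both constraints.
check_video_args(121, 512, 768)
```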
For research use. Model weights are subject to the Lightricks LTX-Video license.