High-performance automatic speech recognition (ASR) for Apple Silicon, implemented in Rust with MLX acceleration.
Rosellas is a native Rust implementation of NVIDIA's Parakeet ASR models, optimized for Apple Silicon via the MLX framework. It achieves token parity with the Python reference implementation (parakeet-mlx) while providing a compiled binary that eliminates Python runtime overhead.
- Native MLX Acceleration: Leverages Apple's MLX framework for GPU-accelerated inference on Apple Silicon
- Multiple Model Architectures: Supports TDT (Token-and-Duration Transducer), RNN-T, and CTC models
- Beam Search Decoding: ALSD++ style beam search with configurable width, LM fusion, and prefix merging
- Accurate Timestamps: Word-level timing information with configurable sentence segmentation
- Multiple Output Formats: TXT, SRT, VTT, and JSON export
- Streaming Support: Real-time transcription with rotating cache architecture
- Long Audio Handling: Automatic chunking with overlap for processing multi-hour recordings
# Clone the repository
git clone https://github.com/nickpaterno/rosellas
cd rosellas
# Build release binary
cargo build --release
# The binary is at target/release/rosellas# Transcribe an audio file (outputs SRT by default)
./target/release/rosellas audio.wav
# Specify output format
./target/release/rosellas --format txt audio.wav
# Use a specific model
./target/release/rosellas --model mlx-community/parakeet-tdt-0.6b-v3 audio.wav
# Show timing metrics
./target/release/rosellas --verbose audio.wavuse rosellas::{Parakeet, TranscribeOptions};
fn main() -> anyhow::Result<()> {
// Load model from HuggingFace Hub
let mut model = Parakeet::from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")?
.bf16()
.build()?;
// Transcribe audio file
let options = TranscribeOptions::new();
let result = model.transcribe("audio.wav", options)?;
// Access transcription
println!("{}", result.text());
// Write to SRT file
result.write("output.srt", rosellas::Format::Srt)?;
Ok(())
}| Model | Type | Parameters | Description |
|---|---|---|---|
parakeet-tdt-0.6b-v3 |
TDT | 600M | Token-and-Duration Transducer, best accuracy |
parakeet-rnnt-* |
RNN-T | Various | Standard RNN-T models |
parakeet-ctc-* |
CTC | Various | CTC models (fastest) |
Models are automatically downloaded from HuggingFace Hub and cached locally.
Rosellas is organized as a Cargo workspace with specialized crates:
rosellas/
├── crates/
│ ├── rosellas-core/ # Configuration, errors, weight loading
│ ├── rosellas-audio/ # Audio I/O, mel spectrogram extraction
│ ├── rosellas-nn/ # Neural network layers (Conformer, attention)
│ ├── rosellas-decode/ # Greedy and beam search decoders
│ ├── rosellas-align/ # Token alignment, sentence segmentation
│ ├── rosellas/ # Main library (facade crate)
│ └── rosellas-cli/ # Command-line interface
Benchmarked on Apple M1 Max with a 10-minute audio file:
| Configuration | Time | Real-time Factor |
|---|---|---|
| BF16 Greedy | 2.1s | 285x |
| BF16 Beam (w=4) | 2.4s | 250x |
| FP32 Greedy | 3.8s | 158x |
Options:
--model <MODEL> HuggingFace model ID or local path
--output-dir, -o <DIR> Output directory for transcripts
--format <FORMAT> Output format: txt, srt, vtt, json, all
--precision <PRECISION> Model precision: bf16, fp32
--chunk-duration <SECS> Chunk duration for long audio (default: 120)
--overlap <SECS> Overlap between chunks (default: 15)
--max-words <N> Maximum words per sentence
--silence-gap <SECS> Split on silence gaps longer than this
--max-duration <SECS> Maximum sentence duration
--verbose, -v Enable verbose output with timing metrics
--info Print model info and exit
Create a JSON configuration file:
{
"precision": "bf16",
"verbose": false,
"chunking": {
"chunk_duration": 120.0,
"overlap": 15.0
},
"sentence": {
"max_words": 50,
"silence_gap": 0.5,
"max_duration": 10.0
}
}Load with --config config.json.
- macOS: 13.0+ (Ventura or later)
- Rust: 1.85+ (2024 edition)
- Hardware: Apple Silicon (M1/M2/M3/M4)
# Debug build
cargo build
# Release build (optimized)
cargo build --release
# Run tests
cargo test
# Build documentation
cargo doc --openLicensed under either of:
- Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
- MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)
at your option.
- NVIDIA for the Parakeet model architecture and pretrained weights
- Apple for the MLX framework
- The mlx-rs project for Rust bindings to MLX