Rosellas

High-performance automatic speech recognition (ASR) for Apple Silicon, implemented in Rust with MLX acceleration.

Rosellas is a native Rust implementation of NVIDIA's Parakeet ASR models, optimized for Apple Silicon via the MLX framework. It achieves token parity with the Python reference implementation (parakeet-mlx) while providing a compiled binary that eliminates Python runtime overhead.

Features

Native MLX Acceleration: Leverages Apple's MLX framework for GPU-accelerated inference on Apple Silicon
Multiple Model Architectures: Supports TDT (Token-and-Duration Transducer), RNN-T, and CTC models
Beam Search Decoding: ALSD++-style beam search with configurable beam width, LM fusion, and prefix merging
Accurate Timestamps: Word-level timing information with configurable sentence segmentation
Multiple Output Formats: TXT, SRT, VTT, and JSON export
Streaming Support: Real-time transcription with rotating cache architecture
Long Audio Handling: Automatic chunking with overlap for processing multi-hour recordings
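
The SRT and VTT formats listed above differ mainly in how they write timestamps: SRT uses `HH:MM:SS,mmm` and WebVTT uses `HH:MM:SS.mmm`. The following is a minimal sketch of that formatting, not Rosellas's own exporter; the function names `srt_timestamp`, `vtt_timestamp`, and `srt_cue` are hypothetical and exist only for illustration.

```rust
/// Shared formatter: hours, minutes, seconds, then milliseconds after `sep`.
fn format_timestamp(seconds: f64, sep: char) -> String {
    // Round to whole milliseconds first so 59.9996 s carries into the next minute.
    let total_ms = (seconds.max(0.0) * 1000.0).round() as u64;
    let ms = total_ms % 1000;
    let total_s = total_ms / 1000;
    let (h, m, s) = (total_s / 3600, (total_s / 60) % 60, total_s % 60);
    format!("{h:02}:{m:02}:{s:02}{sep}{ms:03}")
}

/// SRT timestamps use a comma before the milliseconds.
fn srt_timestamp(seconds: f64) -> String {
    format_timestamp(seconds, ',')
}

/// WebVTT timestamps use a period before the milliseconds.
fn vtt_timestamp(seconds: f64) -> String {
    format_timestamp(seconds, '.')
}

/// One SRT cue: 1-based index, time range, text, then a blank line.
fn srt_cue(index: usize, start: f64, end: f64, text: &str) -> String {
    format!(
        "{index}\n{} --> {}\n{text}\n\n",
        srt_timestamp(start),
        srt_timestamp(end)
    )
}
```

Calling `srt_cue(1, 0.0, 2.5, "hello")` yields a cue whose time line reads `00:00:00,000 --> 00:00:02,500`.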

Quick Start

Installation

# Clone the repository
git clone https://github.com/nickpaterno/rosellas
cd rosellas

# Build release binary
cargo build --release

# The binary is at target/release/rosellas

Basic Usage

# Transcribe an audio file (outputs SRT by default)
./target/release/rosellas audio.wav

# Specify output format
./target/release/rosellas --format txt audio.wav

# Use a specific model
./target/release/rosellas --model mlx-community/parakeet-tdt-0.6b-v3 audio.wav

# Show timing metrics
./target/release/rosellas --verbose audio.wav

Library Usage

use rosellas::{Parakeet, TranscribeOptions};

fn main() -> anyhow::Result<()> {
    // Load model from HuggingFace Hub
    let mut model = Parakeet::from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")?
        .bf16()
        .build()?;

    // Transcribe audio file
    let options = TranscribeOptions::new();
    let result = model.transcribe("audio.wav", options)?;

    // Access transcription
    println!("{}", result.text());

    // Write to SRT file
    result.write("output.srt", rosellas::Format::Srt)?;

    Ok(())
}

Supported Models

Model	Type	Parameters	Description
`parakeet-tdt-0.6b-v3`	TDT	600M	Token-and-Duration Transducer, best accuracy
`parakeet-rnnt-*`	RNN-T	Various	Standard RNN-T models
`parakeet-ctc-*`	CTC	Various	CTC models (fastest)

Models are downloaded automatically from the Hugging Face Hub and cached locally.
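
For readers unfamiliar with the CTC row above, the textbook greedy CTC rule is: take the best label per frame, collapse consecutive repeats, and drop the blank symbol. The sketch below shows that generic rule only. It is not taken from the Rosellas decoder, and the function name and the choice of blank index are assumptions.

```rust
/// Greedy CTC decoding over per-frame scores (generic textbook rule).
/// `frames[t][v]` is the score of vocabulary entry `v` at frame `t`.
/// `blank` is the index of the CTC blank symbol (assumed, model-specific).
fn ctc_greedy_decode(frames: &[Vec<f32>], blank: usize) -> Vec<usize> {
    let mut tokens = Vec::new();
    let mut prev: Option<usize> = None;
    for scores in frames {
        // Argmax over the vocabulary for this frame.
        let best = scores
            .iter()
            .enumerate()
            .max_by(|a, b| a.1.total_cmp(b.1))
            .map(|(i, _)| i);
        // Frames with no scores carry no information, so skip them.
        let Some(best) = best else { continue };
        // Emit only when the label is not blank and differs from the previous frame.
        if best != blank && prev != Some(best) {
            tokens.push(best);
        }
        prev = Some(best);
    }
    tokens
}
```

A blank frame between two identical labels lets the same token be emitted twice, which is how CTC represents repeated tokens.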

Architecture

Rosellas is organized as a Cargo workspace with specialized crates:

rosellas/
├── crates/
│   ├── rosellas-core/    # Configuration, errors, weight loading
│   ├── rosellas-audio/   # Audio I/O, mel spectrogram extraction
│   ├── rosellas-nn/      # Neural network layers (Conformer, attention)
│   ├── rosellas-decode/  # Greedy and beam search decoders
│   ├── rosellas-align/   # Token alignment, sentence segmentation
│   ├── rosellas/         # Main library (facade crate)
│   └── rosellas-cli/     # Command-line interface

Performance

Benchmarked on an Apple M1 Max with a 10-minute (600 s) audio file. The real-time factor is audio duration divided by processing time:

Configuration	Time	Real-time Factor
BF16 Greedy	2.1s	285x
BF16 Beam (w=4)	2.4s	250x
FP32 Greedy	3.8s	158x

Configuration

CLI Options

Options:
  --model <MODEL>          HuggingFace model ID or local path
  --output-dir, -o <DIR>   Output directory for transcripts
  --format <FORMAT>        Output format: txt, srt, vtt, json, all
  --precision <PRECISION>  Model precision: bf16, fp32
  --chunk-duration <SECS>  Chunk duration for long audio (default: 120)
  --overlap <SECS>         Overlap between chunks (default: 15)
  --max-words <N>          Maximum words per sentence
  --silence-gap <SECS>     Split on silence gaps longer than this
  --max-duration <SECS>    Maximum sentence duration
  --verbose, -v            Enable verbose output with timing metrics
  --info                   Print model info and exit
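
To illustrate how `--chunk-duration` and `--overlap` interact, here is a minimal sketch of overlapping window boundaries, assuming each chunk starts `chunk - overlap` seconds after the previous one. How Rosellas actually places chunks and merges the overlapping transcripts is not described in this README, so `chunk_windows` is a hypothetical helper.

```rust
/// Split `total_secs` of audio into windows of `chunk_secs` that overlap by `overlap_secs`.
/// Returns (start, end) pairs in seconds, or an empty list for invalid parameters.
fn chunk_windows(total_secs: f64, chunk_secs: f64, overlap_secs: f64) -> Vec<(f64, f64)> {
    let mut windows = Vec::new();
    // The stride must be positive, otherwise the loop would never advance.
    let stride = chunk_secs - overlap_secs;
    if total_secs <= 0.0 || stride <= 0.0 {
        return windows;
    }
    let mut start = 0.0;
    loop {
        // The final window is clipped to the end of the audio.
        let end = (start + chunk_secs).min(total_secs);
        windows.push((start, end));
        if end >= total_secs {
            break;
        }
        start += stride;
    }
    windows
}
```

With the defaults (120 s chunks, 15 s overlap), a 300-second file yields the windows 0-120, 105-225, and 210-300.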

Runtime Configuration

Create a JSON configuration file:

{
  "precision": "bf16",
  "verbose": false,
  "chunking": {
    "chunk_duration": 120.0,
    "overlap": 15.0
  },
  "sentence": {
    "max_words": 50,
    "silence_gap": 0.5,
    "max_duration": 10.0
  }
}

Load the file with --config config.json.
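
The three `sentence` settings can be read as limits that close the current sentence. The sketch below assumes one plausible interpretation: a new sentence starts when the current one already holds `max_words` words, when the pause before the next word exceeds `silence_gap`, or when adding the word would stretch the sentence past `max_duration`. The `Word` and `SentenceLimits` types and `split_sentences` are hypothetical, and the actual rules in Rosellas may differ.

```rust
/// A word with start and end times in seconds (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
struct Word {
    text: String,
    start: f64,
    end: f64,
}

/// Mirrors the `sentence` block of the JSON config (assumed semantics).
struct SentenceLimits {
    max_words: usize,
    silence_gap: f64,
    max_duration: f64,
}

/// Convenience constructor used in examples.
fn word(text: &str, start: f64, end: f64) -> Word {
    Word { text: text.to_string(), start, end }
}

/// Group words into sentences, closing a sentence when any limit is hit.
fn split_sentences(words: &[Word], limits: &SentenceLimits) -> Vec<Vec<Word>> {
    let mut sentences: Vec<Vec<Word>> = Vec::new();
    let mut current: Vec<Word> = Vec::new();
    for w in words {
        if let (Some(first), Some(last)) = (current.first(), current.last()) {
            let gap = w.start - last.end; // pause before this word
            let duration = w.end - first.start; // length if this word were added
            let full = current.len() >= limits.max_words;
            if full || gap > limits.silence_gap || duration > limits.max_duration {
                sentences.push(std::mem::take(&mut current));
            }
        }
        current.push(w.clone());
    }
    if !current.is_empty() {
        sentences.push(current);
    }
    sentences
}
```

With the values from the example config (50 words, 0.5 s gap, 10 s duration), a 1.1-second pause between two words splits them into separate sentences.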

Requirements

macOS: 13.0+ (Ventura or later)
Rust: 1.85+ (2024 edition)
Hardware: Apple Silicon (M1/M2/M3/M4)

Building from Source

# Debug build
cargo build

# Release build (optimized)
cargo build --release

# Run tests
cargo test

# Build documentation
cargo doc --open

License

Licensed under either of:

Apache License, Version 2.0 (LICENSE-APACHE or http://www.apache.org/licenses/LICENSE-2.0)
MIT license (LICENSE-MIT or http://opensource.org/licenses/MIT)

at your option.

Acknowledgments

NVIDIA for the Parakeet model architecture and pretrained weights
Apple for the MLX framework
The mlx-rs project for Rust bindings to MLX
