Epistates/rosellas

Rosellas

High-performance automatic speech recognition (ASR) for Apple Silicon, implemented in Rust with MLX acceleration.

Rosellas is a native Rust implementation of NVIDIA's Parakeet ASR models, optimized for Apple Silicon via the MLX framework. It achieves token parity with the Python reference implementation (parakeet-mlx) while providing a compiled binary that eliminates Python runtime overhead.

Features

  • Native MLX Acceleration: Leverages Apple's MLX framework for GPU-accelerated inference on Apple Silicon
  • Multiple Model Architectures: Supports TDT (Token-and-Duration Transducer), RNN-T, and CTC models
  • Beam Search Decoding: ALSD++-style beam search with configurable beam width, LM fusion, and prefix merging
  • Accurate Timestamps: Word-level timing information with configurable sentence segmentation
  • Multiple Output Formats: TXT, SRT, VTT, and JSON export
  • Streaming Support: Real-time transcription with rotating cache architecture
  • Long Audio Handling: Automatic chunking with overlap for processing multi-hour recordings

Quick Start

Installation

# Clone the repository
git clone https://github.com/nickpaterno/rosellas
cd rosellas

# Build release binary
cargo build --release

# The binary is at target/release/rosellas

Basic Usage

# Transcribe an audio file (outputs SRT by default)
./target/release/rosellas audio.wav

# Specify output format
./target/release/rosellas --format txt audio.wav

# Use a specific model
./target/release/rosellas --model mlx-community/parakeet-tdt-0.6b-v3 audio.wav

# Show timing metrics
./target/release/rosellas --verbose audio.wav

Library Usage

use rosellas::{Parakeet, TranscribeOptions};

fn main() -> anyhow::Result<()> {
    // Load model from HuggingFace Hub
    let mut model = Parakeet::from_pretrained("mlx-community/parakeet-tdt-0.6b-v3")?
        .bf16()
        .build()?;

    // Transcribe audio file
    let options = TranscribeOptions::new();
    let result = model.transcribe("audio.wav", options)?;

    // Access transcription
    println!("{}", result.text());

    // Write to SRT file
    result.write("output.srt", rosellas::Format::Srt)?;

    Ok(())
}
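
For readers curious what the SRT export involves, the sketch below formats a cue in the SubRip layout (`HH:MM:SS,mmm` timestamps, a numbered cue, a blank line between cues). It is an illustration only and not the code behind `result.write`; the helper names `srt_timestamp` and `srt_cue` are hypothetical.

```rust
/// Format a time offset in seconds as an SRT timestamp (`HH:MM:SS,mmm`).
/// Hypothetical helper, not part of the rosellas API.
fn srt_timestamp(seconds: f64) -> String {
    // Work in whole milliseconds; negative inputs are clamped to zero.
    let total_ms = (seconds.max(0.0) * 1000.0).round() as u64;
    let ms = total_ms % 1000;
    let s = (total_ms / 1000) % 60;
    let m = (total_ms / 60_000) % 60;
    let h = total_ms / 3_600_000;
    format!("{h:02}:{m:02}:{s:02},{ms:03}")
}

/// Render one numbered SRT cue: index, time range, text, then a blank line.
fn srt_cue(index: usize, start: f64, end: f64, text: &str) -> String {
    format!(
        "{index}\n{} --> {}\n{text}\n\n",
        srt_timestamp(start),
        srt_timestamp(end)
    )
}
```

Note that SRT separates seconds from milliseconds with a comma, whereas VTT uses a period.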

Supported Models

| Model | Type | Parameters | Description |
|---|---|---|---|
| parakeet-tdt-0.6b-v3 | TDT | 600M | Token-and-Duration Transducer, best accuracy |
| parakeet-rnnt-* | RNN-T | Various | Standard RNN-T models |
| parakeet-ctc-* | CTC | Various | CTC models (fastest) |

Models are automatically downloaded from HuggingFace Hub and cached locally.
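
As background on why CTC models are the cheapest to decode, here is a minimal sketch of textbook greedy CTC decoding: take the best-scoring token per frame, collapse consecutive repeats, and drop the blank token. This is the standard technique, not necessarily how `rosellas-decode` implements it; the function name and the `blank_id` parameter are assumptions made for illustration.

```rust
/// Greedy CTC decoding over per-frame scores (illustrative sketch).
/// `frames[t][v]` is the score of vocabulary entry `v` at frame `t`;
/// `blank_id` is the index of the CTC blank token (an assumption here).
fn ctc_greedy_decode(frames: &[Vec<f32>], blank_id: usize) -> Vec<usize> {
    let mut tokens = Vec::new();
    let mut prev: Option<usize> = None;
    for scores in frames {
        // Argmax over the vocabulary for this frame.
        let best = scores
            .iter()
            .enumerate()
            .max_by(|a, b| a.1.total_cmp(b.1))
            .map(|(i, _)| i);
        // Skip frames that carry no scores at all.
        let Some(best) = best else { continue };
        // Emit only when the token is not blank and differs from the previous frame.
        if best != blank_id && Some(best) != prev {
            tokens.push(best);
        }
        prev = Some(best);
    }
    tokens
}
```

A blank between two identical tokens keeps them distinct, so a per-frame sequence such as `1 1 blank 1 2 2 blank` decodes to `1 1 2`. There is no prediction network and no search state, which is why this path is fast compared with transducer decoding.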

Architecture

Rosellas is organized as a Cargo workspace with specialized crates:

rosellas/
├── crates/
│   ├── rosellas-core/    # Configuration, errors, weight loading
│   ├── rosellas-audio/   # Audio I/O, mel spectrogram extraction
│   ├── rosellas-nn/      # Neural network layers (Conformer, attention)
│   ├── rosellas-decode/  # Greedy and beam search decoders
│   ├── rosellas-align/   # Token alignment, sentence segmentation
│   ├── rosellas/         # Main library (facade crate)
│   └── rosellas-cli/     # Command-line interface

Performance

Benchmarked on Apple M1 Max with a 10-minute audio file:

| Configuration | Time | Real-time Factor |
|---|---|---|
| BF16 Greedy | 2.1s | 285x |
| BF16 Beam (w=4) | 2.4s | 250x |
| FP32 Greedy | 3.8s | 158x |
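
The real-time factor here appears to be audio duration divided by processing time, so higher is faster. The small sketch below reproduces the arithmetic for the 10-minute (600 s) file; the function name is hypothetical, and the definition is inferred from the figures rather than stated by the project.

```rust
/// Real-time factor: seconds of audio processed per second of wall-clock time.
/// Inferred definition; higher values mean faster transcription.
fn real_time_factor(audio_secs: f64, processing_secs: f64) -> f64 {
    audio_secs / processing_secs
}
```

For example, `real_time_factor(600.0, 2.4)` gives 250, matching the beam-search row. The other two rows agree with this definition up to rounding (600 / 2.1 ≈ 285.7 and 600 / 3.8 ≈ 157.9).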

Configuration

CLI Options

Options:
  --model <MODEL>          HuggingFace model ID or local path
  --output-dir, -o <DIR>   Output directory for transcripts
  --format <FORMAT>        Output format: txt, srt, vtt, json, all
  --precision <PRECISION>  Model precision: bf16, fp32
  --chunk-duration <SECS>  Chunk duration for long audio (default: 120)
  --overlap <SECS>         Overlap between chunks (default: 15)
  --max-words <N>          Maximum words per sentence
  --silence-gap <SECS>     Split on silence gaps longer than this
  --max-duration <SECS>    Maximum sentence duration
  --verbose, -v            Enable verbose output with timing metrics
  --info                   Print model info and exit
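
To make the interaction between `--chunk-duration` and `--overlap` concrete, the sketch below computes the time windows that overlapping chunking would produce. It assumes each window starts `chunk - overlap` seconds after the previous one; how Rosellas actually places windows and merges tokens in the overlapping region is not described here, and `chunk_spans` is a hypothetical helper.

```rust
/// Split `total` seconds of audio into windows of at most `chunk` seconds,
/// where consecutive windows share `overlap` seconds.
/// Illustrative sketch; not the rosellas implementation.
fn chunk_spans(total: f64, chunk: f64, overlap: f64) -> Vec<(f64, f64)> {
    assert!(
        chunk > 0.0 && overlap >= 0.0 && overlap < chunk,
        "need chunk > 0 and 0 <= overlap < chunk"
    );
    let step = chunk - overlap; // distance between consecutive window starts
    let mut spans = Vec::new();
    let mut start = 0.0;
    while start < total {
        let end = (start + chunk).min(total);
        spans.push((start, end));
        if end >= total {
            break; // the last window reached the end of the audio
        }
        start += step;
    }
    spans
}
```

With the defaults (120 s chunks, 15 s overlap), a 10-minute file yields six windows starting at 0, 105, 210, 315, 420, and 525 seconds, the last one ending at 600 seconds. Audio shorter than one chunk produces a single window.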

Runtime Configuration

Create a JSON configuration file:

{
  "precision": "bf16",
  "verbose": false,
  "chunking": {
    "chunk_duration": 120.0,
    "overlap": 15.0
  },
  "sentence": {
    "max_words": 50,
    "silence_gap": 0.5,
    "max_duration": 10.0
  }
}

Load with --config config.json.
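
The three `sentence` settings can be read as independent split conditions. The sketch below shows one plausible interpretation: start a new sentence when the current one already holds `max_words` words, when the pause before the next word exceeds `silence_gap`, or when adding the word would stretch the sentence past `max_duration`. The exact rules in `rosellas-align` are not documented here, so the `Word` and `SentenceLimits` types and the `split_sentences` function are assumptions for illustration.

```rust
/// A word with start and end times in seconds (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
struct Word {
    text: String,
    start: f64,
    end: f64,
}

/// Mirrors the "sentence" object in the JSON config (hypothetical type).
struct SentenceLimits {
    max_words: usize,
    silence_gap: f64,
    max_duration: f64,
}

/// Group timed words into sentences using the three limits.
/// One plausible reading of the settings, not the rosellas-align code.
fn split_sentences(words: &[Word], limits: &SentenceLimits) -> Vec<Vec<Word>> {
    let mut sentences: Vec<Vec<Word>> = Vec::new();
    let mut current: Vec<Word> = Vec::new();
    for word in words {
        // Decide whether this word must open a new sentence.
        let must_split = match (current.first(), current.last()) {
            (Some(first), Some(last)) => {
                let too_many = current.len() >= limits.max_words;
                let long_gap = word.start - last.end > limits.silence_gap;
                let too_long = word.end - first.start > limits.max_duration;
                too_many || long_gap || too_long
            }
            _ => false, // nothing accumulated yet
        };
        if must_split {
            sentences.push(std::mem::take(&mut current));
        }
        current.push(word.clone());
    }
    if !current.is_empty() {
        sentences.push(current);
    }
    sentences
}
```

Under this reading, shorter `silence_gap` and `max_duration` values give shorter subtitle cues, which tends to suit SRT and VTT output; larger values keep more context per sentence.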

Requirements

  • macOS: 13.0+ (Ventura or later)
  • Rust: 1.85+ (2024 edition)
  • Hardware: Apple Silicon (M1/M2/M3/M4)

Building from Source

# Debug build
cargo build

# Release build (optimized)
cargo build --release

# Run tests
cargo test

# Build documentation
cargo doc --open

License

Licensed under either of:

  • Apache License, Version 2.0 (LICENSE-APACHE)
  • MIT License (LICENSE-MIT)

at your option.

Acknowledgments

  • NVIDIA for the Parakeet model architecture and pretrained weights
  • Apple for the MLX framework
  • The mlx-rs project for Rust bindings to MLX


Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages