Skip to content

krafton-ai/Raon-Speech

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

24 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Raon-Speech

Raon-Speech Logo
Raon-SpeechChat Logo

Homepage GitHub Hugging Face X

Links

Raon is a speech model built on HuggingFace Ecosystem.
This repo contains two tracks:

  • Raon-Speech (Offline SpeechLM): TTS, STT, SpeechChat, TextQA
  • Raon-SpeechChat (Offline/Realtime Full-Duplex)

Both tracks share the same core model family and processor stack under src/raon/.

Key Features

  • SpeechLLM Model: Raon-Speech is a 9B bilingual (English/Korean) SpeechLM for speech understanding, answering, and generation.
  • Full-Duplex Model: Raon-SpeechChat extends Raon-Speech to natural real-time full-duplex conversation and shows strong interaction quality, especially on turn-taking, backchanneling, and interruption handling.
  • Training Scale: Raon-Speech is trained on 1M+ hours of curated English-Korean speech-text data and achieves state-of-the-art average performance across 42 speech and text benchmarks against similarly sized baselines. Raon-SpeechChat is continually trained on 116K hours of time-aligned dialogue data.
  • System Design: The system is built with a staged LLM-to-SpeechLM training recipe and a full-duplex design based on causal streaming, interleaved speech-text modeling, explicit interaction-state modeling, and text lookahead.
  • Task Coverage: Unified multi-task support for STT, TTS, TextQA, and SpeechChat, with optional speaker-conditioned TTS and TTS continuation from reference audio.
  • Transformers Integration: Hugging Face Transformers integration via AutoModel.from_pretrained(..., trust_remote_code=True).
  • Open Release: We open-source model checkpoints, the training and inference pipeline, an interactive demo, and three Korean speech benchmarks: KVoiceBench, KOpenAudioBench, and KMMAU.

Benchmark Results

Overall performance comparison of Raon-Speech and Raon-SpeechChat against baseline models across diverse benchmarks.

Raon-Speech Benchmark Results

Raon-Speech is optimized for low-latency, real-time speech generation while maintaining strong performance across ASR, speech generation, spoken QA, audio understanding, and text QA tasks. In the benchmarks above, Raon-Speech shows consistently high cross-domain scores, while Raon-SpeechChat performs strongly on conversational speech capabilities such as pause handling, backchanneling, smooth turn-taking, interruption handling, overlap robustness, and multi-turn dialogue.

On single-GPU streaming TTS setups, the model runs faster than real time on both RTX 6000 Pro and L40S, with sub-second time-to-first-token latency. Measured with LibriSpeech test-clean samples on single-GPU setups via streaming TTS. All values are averaged.

Metric RTX 6000 Pro Blackwell L40S
RTF 0.27 (3.7x real-time) 0.45 (2.2x real-time)
TTFT 617 ms 887 ms
TBT 135 ms 233 ms
  • RTF: Real-Time Factor. Lower is faster; values below 1.0 indicate faster-than-real-time synthesis.
  • TTFT: Time to First Token.
  • TBT: Time Between Tokens.

Requirements

  • Python >=3.11
  • CUDA GPU recommended (bfloat16 / float16)
  • PyTorch + Torchaudio matching your CUDA environment

Model Loading (Local or Hugging Face Hub)

All model entry points accept either:

  • local checkpoint directory
  • Hugging Face repo_id (downloaded automatically by from_pretrained)

Examples:

# Edit preset variables in the script first, then run:
bash scripts/infer.sh
# or
bash scripts/duplex_infer.sh
from raon import RaonPipeline

pipe = RaonPipeline("KRAFTON/Raon-Speech-9B", device="cuda", dtype="bfloat16")
# or
pipe = RaonPipeline("KRAFTON/Raon-SpeechChat-9B", device="cuda", dtype="bfloat16")

If you want to pre-download first:

hf download KRAFTON/Raon-Speech-9B --local-dir /path/to/model_dir
# or
hf download KRAFTON/Raon-SpeechChat-9B --local-dir /path/to/model_dir

Execution Modes

Mode 1: raon installed (recommended)

After pip install -e . (or uv sync), all entry points are supported:

  • scripts/infer.sh, scripts/duplex_infer.sh
  • scripts/train.sh, scripts/duplex_train.sh
  • demo/run_gradio_demo.sh, demo/run_gradio_duplex_demo.sh
  • Python API: from raon import RaonPipeline

Mode 2: without raon install

Supported:

  • Raon-Speech Gradio demo (demo/gradio_demo.py) has a fallback that loads Hub remote code when raon import fails. Run directly from HF repo:
# from repo root
bash demo/run_gradio_demo.sh
  • Pure Transformers flow using Hub remote code (advanced): AutoModel.from_pretrained(..., trust_remote_code=True).

    See examples: examples/message_example.ipynb and examples/duplex_example.ipynb. Minimal pipeline load without raon install:

import torch
from transformers import AutoConfig
from transformers.dynamic_module_utils import get_class_from_dynamic_module

MODEL_ID = "KRAFTON/Raon-Speech-9B"

config = AutoConfig.from_pretrained(MODEL_ID, trust_remote_code=True)
RaonPipeline = get_class_from_dynamic_module(
    "modeling_raon.RaonPipeline",
    MODEL_ID,
    revision=getattr(config, "_commit_hash", None),
)

pipe = RaonPipeline(MODEL_ID, device="cuda", dtype="bfloat16")

Not supported as-is:

  • python -m raon.* module commands (used by scripts/*.sh)
  • Raon-SpeechChat (Full-duplex) realtime demo runtime (demo/gradio_duplex_demo.py) because runtime code imports raon.* modules.

Environment Setup

Option A: venv + pip

git clone https://github.com/krafton-ai/Raon-Speech
cd Raon-Speech
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pip install -e .

Option B: uv

git clone https://github.com/krafton-ai/Raon-Speech
cd Raon-Speech
uv sync

Demo dependencies (optional)

The realtime full-duplex demo needs extra packages (sglang, gradio, fastapi, uvicorn):

pip install -e ".[demo]"
# or
uv sync --extra demo

FlashAttention (optional)

If you want to use FlashAttention during training, install it separately:

pip install flash-attn

Project Layout

Raon-Speech/
├── src/raon/                 # package code
│   ├── models/               # RaonModel / RaonDuplexModel
│   ├── modules/              # audio encoder, tokenizer, speaker encoder, etc.
│   ├── utils/                # processor, datasets, losses, prompts, special tokens
│   ├── train.py              # SpeechLLM training entry
│   ├── duplex_train.py       # Full-duplex training entry
│   ├── generate.py           # SpeechLLM JSONL inference entry
│   ├── duplex_generate.py    # Full-duplex inference entry
│   └── pipeline.py           # high-level API (RaonPipeline)
├── scripts/                  # shell wrappers
├── demo/                     # Gradio demos
├── config/                   # inference configs
├── data/                     # sample datasets
└── examples/                 # notebooks and scripts

Model Architecture

  • One shared backbone: RaonModel (LM backbone + audio encoder + Mimi codec path).
  • Two model types: raon (Raon-Speech) and raon_duplex (Raon-SpeechChat).
  • Main trainable blocks include text/audio alignment and audio code prediction (input_adaptor, output_adaptor, audio_lm_head, proj_code, code_predictor).

SpeechLLM

Data Format (JSONL)

Each line is one sample.

Field Type Description
conversations list[dict] turns with from (human/gpt) and value
audios list[str] audio paths consumed by <audio> tags in order
speaker_ref_audios list[str] optional speaker reference audio for TTS
channel str tts, stt, speech-chat, textqa
system str optional system prompt

Sample eval data is under data/speechllm/eval.

Inference

bash scripts/infer.sh
  • task defaults come from config/infer.yaml
  • edit the preset variables at the top of scripts/infer.sh for quick runs
  • --attn_implementation controls attention backend (default sdpa; use fa for FlashAttention)
  • --data_dir is scanned for JSONL files, and each line is treated as one sample.

Advanced CLI example:

python -m raon.generate \
  --model_path /path/to/model \
  --data_dir /path/to/data_dir \
  --output_dir /path/to/output_dir \
  --config /path/to/config.yaml \
  --batch_size 4 \
  --attn_implementation sdpa

Pipeline API

from raon import RaonPipeline

pipe = RaonPipeline("/path/to/model", device="cuda", dtype="bfloat16", config="/path/to/config.yaml")

text = pipe.stt("/path/to/stt.wav")
audio, sr = pipe.tts("Hello!", speaker_audio="/path/to/speaker_ref.wav")
ans1 = pipe.speech_chat("/path/to/speech-chat.wav")
ans2 = pipe.textqa("What did the speaker say?", audio="/path/to/textqa.wav")

pipe.save_audio((audio, sr), "/path/to/tts.wav")

Continuation TTS is also supported:

audio, sr = pipe.tts_continuation(
    target_text="Continue this sentence.",
    ref_audio="/path/to/speaker_ref.wav",
    ref_text="Optional transcription of ref audio.",
)

See:

  • examples/message_example.py
  • examples/message_example.ipynb

Training

bash scripts/train.sh

Common options:

  • edit the preset variables at the top of scripts/train.sh
  • NPROC_PER_NODE controls multi-GPU torchrun
  • MASTER_PORT sets the torchrun rendezvous port
  • USE_SPEAKER_EMBEDDING toggles speaker-conditioning inputs

Notes:

  • Current training code freezes these modules: audio_encoder, input_adaptor, output_adaptor, audio_lm_head, proj_code, code_predictor, speaker_encoder
  • Default training attention implementation is sdpa; to use FlashAttention, pass --attn_implementation fa

Full-Duplex

Data Format (JSONL)

Training (duplex_train.py) expects one JSON object per line with stereo audio metadata.

Required top-level fields:

Field Type Description
audio_path str Path to stereo wav (2 channels).
language str Language code (for prompt selection), e.g. eng, kor.
channel str Duplex channel type, e.g. full_duplex or duplex_instruct.
speak_first list[int or bool] Per-channel initial speaking mode for the two channels. 1/true means that channel is treated as speaking first; 0/false means listen-first.
include_in_training list[int or bool] Per-channel training mask for the two channels. 1/true includes that channel's targets in loss computation; 0/false excludes them.
turns or scripts list Conversation timing/transcript annotations (one of the two formats below).

Supported annotation format A (turns):

  • Use this format when each utterance is already represented as a turn.
  • Each turn contains channel, start_sample, end_sample, and ipus.
  • Each IPU contains words, and each word has word, start_sample, end_sample.

Supported annotation format B (scripts):

  • Use this format when you have per-channel word timestamps instead of explicit turns.
  • scripts is [ch0_words, ch1_words], and each word item has word, start, end in seconds.
  • You can optionally provide timeline or rough_timeline for utterance boundaries.
  • If those boundaries include channel, they are used directly.
  • Otherwise, the loader derives per-channel utterance boundaries from the word timestamps in scripts.

Optional fields:

  • sample_rate (used for turns timestamps; defaults to 24000 if omitted)
  • system_prompt (if omitted, prompt is built from persona/context/name metadata when available)

Optional inference metadata sidecar:

  • You can place a same-stem .jsonl file next to the input wav for prompt and speaker metadata.
  • This is an input metadata format, not a separate CLI argument schema.
Field Type Description
audio_path str Input audio path metadata (optional for CLI run, used in dataset-style records).
speak_first bool Override initial speaking mode.
language str Optional language hint.
persona str Persona key/text for prompt builder.
context str Extra context appended to system prompt.
system_prompt str Explicit prompt override.
speaker_audio str Speaker reference wav path (preferred key).
speaker_ref_audios list[str] Fallback speaker reference list (first element used).

Sample data:

  • train: data/duplex/train
  • eval: data/duplex/eval
  • persona catalog: data/duplex/personas.json

Training data uses timeline JSONL + stereo audio assets.

Inference

bash scripts/duplex_infer.sh
  • edit the preset variables at the top of scripts/duplex_infer.sh for quick runs
  • --attn_implementation controls attention backend (default sdpa; use fa for FlashAttention)
  • --data_dir is scanned recursively for *.jsonl, and every line is treated as one sample.

Advanced CLI example:

python -m raon.duplex_generate \
  --model_path /path/to/model \
  --data_dir /path/to/data_dir \
  --output_dir /path/to/output_dir \
  --config /path/to/config.yaml \
  --speaker_audio /path/to/speaker_ref.wav \
  --persona /path/or/persona_key \
  --context "Optional extra context"
  • data_dir is scanned recursively for *.jsonl, and CLI arguments override per-record metadata when both are provided.

Pipeline API

from raon import RaonPipeline

pipe = RaonPipeline("/path/to/duplex-model", device="cuda", dtype="bfloat16")
audio = pipe.load_audio("/path/to/input.wav")

summary = pipe.duplex(
    audio_input=audio,
    output_dir="/path/to/output_dir",
    speak_first=False,
    system_prompt="You are engaging in real-time conversation.",
    speaker_audio="/path/to/speaker_ref.wav",
)
print(summary)

See:

  • examples/duplex_example.ipynb

Training

bash scripts/duplex_train.sh

Common options:

  • edit the preset variables at the top of scripts/duplex_train.sh
  • NPROC_PER_NODE controls multi-GPU torchrun
  • MASTER_PORT sets the torchrun rendezvous port
  • BATCH_SIZE is fixed to 1

Notes:

  • same freeze list pattern as SpeechLLM training
  • Default training attention implementation is sdpa; to use FlashAttention, pass --attn_implementation fa

Gradio Demos

Raon-Speech demo

bash demo/run_gradio_demo.sh
  • edit the preset variables at the top of demo/run_gradio_demo.sh

Raon-SpeechChat realtime demo

  1. Export HF checkpoint to SGLang bundle:
bash scripts/export.sh
  • edit the preset variables at the top of scripts/export.sh
  1. Run demo:
bash demo/run_gradio_duplex_demo.sh
  • edit the preset variables at the top of demo/run_gradio_duplex_demo.sh

License

Apache License 2.0. See LICENSE.

Citation

@misc{raonspeech,
    title  = {Raon-Speech Technical Report},
    author = {{KRAFTON}},
    month  = {April},
    year   = {2026}
}

About

Open-source speech AI models from KRAFTON, including Raon-Speech and Raon-SpeechChat for speech understanding, generation, and real-time full-duplex conversation.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors