Links
- GitHub: https://github.com/krafton-ai/Raon-Speech
- Official Demo: https://raon.krafton.ai
- Hugging Face Org: https://huggingface.co/KRAFTON
- Speech model card: https://huggingface.co/KRAFTON/Raon-Speech-9B
- SpeechChat (Full-Duplex) model card: https://huggingface.co/KRAFTON/Raon-SpeechChat-9B
- Technical Report: https://huggingface.co/KRAFTON/Raon-Speech-9B/resolve/main/Technical_Report_Raon_Speech.pdf
Raon is a speech model built on the Hugging Face ecosystem.
This repo contains two tracks:
- Raon-Speech (Offline SpeechLM): TTS, STT, SpeechChat, TextQA
- Raon-SpeechChat (Offline/Realtime Full-Duplex)

Both tracks share the same core model family and processor stack under `src/raon/`.
- SpeechLLM Model: Raon-Speech is a 9B bilingual (English/Korean) SpeechLM for speech understanding, answering, and generation.
- Full-Duplex Model: Raon-SpeechChat extends Raon-Speech to natural real-time full-duplex conversation and shows strong interaction quality, especially on turn-taking, backchanneling, and interruption handling.
- Training Scale: Raon-Speech is trained on 1M+ hours of curated English-Korean speech-text data and achieves state-of-the-art average performance across 42 speech and text benchmarks against similarly sized baselines. Raon-SpeechChat is continually trained on 116K hours of time-aligned dialogue data.
- System Design: The system is built with a staged LLM-to-SpeechLM training recipe and a full-duplex design based on causal streaming, interleaved speech-text modeling, explicit interaction-state modeling, and text lookahead.
- Task Coverage: Unified multi-task support for STT, TTS, TextQA, and SpeechChat, with optional speaker-conditioned TTS and TTS continuation from reference audio.
- Transformers Integration: Hugging Face Transformers integration via `AutoModel.from_pretrained(..., trust_remote_code=True)`.
- Open Release: We open-source model checkpoints, the training and inference pipeline, an interactive demo, and three Korean speech benchmarks: KVoiceBench, KOpenAudioBench, and KMMAU.
Overall performance comparison of Raon-Speech and Raon-SpeechChat against baseline models across diverse benchmarks.
Raon-Speech is optimized for low-latency, real-time speech generation while maintaining strong performance across ASR, speech generation, spoken QA, audio understanding, and text QA tasks. In the benchmarks above, Raon-Speech shows consistently high cross-domain scores, while Raon-SpeechChat performs strongly on conversational speech capabilities such as pause handling, backchanneling, smooth turn-taking, interruption handling, overlap robustness, and multi-turn dialogue.
On single-GPU streaming TTS setups, the model runs faster than real time on both RTX 6000 Pro and L40S, with sub-second time-to-first-token latency. Measured with LibriSpeech test-clean samples on single-GPU setups via streaming TTS. All values are averaged.
| Metric | RTX 6000 Pro Blackwell | L40S |
|---|---|---|
| RTF | 0.27 (3.7x real-time) | 0.45 (2.2x real-time) |
| TTFT | 617 ms | 887 ms |
| TBT | 135 ms | 233 ms |
- RTF: Real-Time Factor. Lower is faster; values below `1.0` indicate faster-than-real-time synthesis.
- TTFT: Time to First Token.
- TBT: Time Between Tokens.
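As a quick check on these definitions, a minimal sketch of the arithmetic (the function name is illustrative, not part of the Raon API):

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock synthesis time / duration of the generated audio.
    Values below 1.0 mean faster-than-real-time synthesis."""
    return synthesis_seconds / audio_seconds

# Reproducing the RTX 6000 Pro row: an RTF of 0.27 corresponds to
# roughly 1 / 0.27 ≈ 3.7x real-time synthesis speed.
rtf = real_time_factor(synthesis_seconds=2.7, audio_seconds=10.0)
speedup = 1.0 / rtf
```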
- Python `>=3.11`
- CUDA GPU recommended (`bfloat16`/`float16`)
- PyTorch + Torchaudio matching your CUDA environment

All model entry points accept either:
- a local checkpoint directory
- a Hugging Face `repo_id` (downloaded automatically by `from_pretrained`)
Examples:

```bash
# Edit preset variables in the script first, then run:
bash scripts/infer.sh
# or
bash scripts/duplex_infer.sh
```

```python
from raon import RaonPipeline

pipe = RaonPipeline("KRAFTON/Raon-Speech-9B", device="cuda", dtype="bfloat16")
# or
pipe = RaonPipeline("KRAFTON/Raon-SpeechChat-9B", device="cuda", dtype="bfloat16")
```

If you want to pre-download first:
```bash
hf download KRAFTON/Raon-Speech-9B --local-dir /path/to/model_dir
# or
hf download KRAFTON/Raon-SpeechChat-9B --local-dir /path/to/model_dir
```

After `pip install -e .` (or `uv sync`), all entry points are supported:
- `scripts/infer.sh`, `scripts/duplex_infer.sh`
- `scripts/train.sh`, `scripts/duplex_train.sh`
- `demo/run_gradio_demo.sh`, `demo/run_gradio_duplex_demo.sh`
- Python API: `from raon import RaonPipeline`
Supported:
- The Raon-Speech Gradio demo (`demo/gradio_demo.py`) has a fallback that loads Hub remote code when the `raon` import fails. Run it directly from the HF repo:

  ```bash
  # from repo root
  bash demo/run_gradio_demo.sh
  ```

- Pure Transformers flow using Hub remote code (advanced): `AutoModel.from_pretrained(..., trust_remote_code=True)`. See the examples: `examples/message_example.ipynb` and `examples/duplex_example.ipynb`.

Minimal pipeline load without a `raon` install:
```python
import torch
from transformers import AutoConfig
from transformers.dynamic_module_utils import get_class_from_dynamic_module

MODEL_ID = "KRAFTON/Raon-Speech-9B"

config = AutoConfig.from_pretrained(MODEL_ID, trust_remote_code=True)
RaonPipeline = get_class_from_dynamic_module(
    "modeling_raon.RaonPipeline",
    MODEL_ID,
    revision=getattr(config, "_commit_hash", None),
)
pipe = RaonPipeline(MODEL_ID, device="cuda", dtype="bfloat16")
```

Not supported as-is:
- `python -m raon.*` module commands (used by `scripts/*.sh`)
- the Raon-SpeechChat (full-duplex) realtime demo runtime (`demo/gradio_duplex_demo.py`), because the runtime code imports `raon.*` modules
```bash
git clone https://github.com/krafton-ai/Raon-Speech
cd Raon-Speech
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
pip install -e .
```

Or with uv:

```bash
git clone https://github.com/krafton-ai/Raon-Speech
cd Raon-Speech
uv sync
```

The realtime full-duplex demo needs extra packages (sglang, gradio, fastapi, uvicorn):

```bash
pip install -e ".[demo]"
# or
uv sync --extra demo
```

If you want to use FlashAttention during training, install it separately:
```bash
pip install flash-attn
```

```
Raon-Speech/
├── src/raon/              # package code
│   ├── models/            # RaonModel / RaonDuplexModel
│   ├── modules/           # audio encoder, tokenizer, speaker encoder, etc.
│   ├── utils/             # processor, datasets, losses, prompts, special tokens
│   ├── train.py           # SpeechLLM training entry
│   ├── duplex_train.py    # Full-duplex training entry
│   ├── generate.py        # SpeechLLM JSONL inference entry
│   ├── duplex_generate.py # Full-duplex inference entry
│   └── pipeline.py        # high-level API (RaonPipeline)
├── scripts/               # shell wrappers
├── demo/                  # Gradio demos
├── config/                # inference configs
├── data/                  # sample datasets
└── examples/              # notebooks and scripts
```
- One shared backbone: `RaonModel` (LM backbone + audio encoder + Mimi codec path).
- Two model types: `raon` (Raon-Speech) and `raon_duplex` (Raon-SpeechChat).
- Main trainable blocks include text/audio alignment and audio code prediction (`input_adaptor`, `output_adaptor`, `audio_lm_head`, `proj_code`, `code_predictor`).
Each JSONL line is one sample.
| Field | Type | Description |
|---|---|---|
| `conversations` | `list[dict]` | turns with `from` (`human`/`gpt`) and `value` |
| `audios` | `list[str]` | audio paths consumed by `<audio>` tags in order |
| `speaker_ref_audios` | `list[str]` | optional speaker reference audio for TTS |
| `channel` | `str` | `tts`, `stt`, `speech-chat`, `textqa` |
| `system` | `str` | optional system prompt |
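Putting the fields together, one JSONL line might be built like the sketch below. Only the field names come from the schema above; paths and text are placeholders.

```python
import json

# One SpeechLLM sample; field names follow the documented schema,
# values are illustrative placeholders.
sample = {
    "conversations": [
        {"from": "human", "value": "<audio> What is being said here?"},
        {"from": "gpt", "value": "The speaker greets the listener."},
    ],
    "audios": ["/path/to/input.wav"],  # consumed by <audio> tags in order
    "speaker_ref_audios": [],          # optional speaker reference for TTS
    "channel": "textqa",               # one of: tts, stt, speech-chat, textqa
    "system": "You are a helpful assistant.",
}

# Each JSONL line is exactly one such object on a single line.
line = json.dumps(sample, ensure_ascii=False)
```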
Sample eval data is under `data/speechllm/eval`.

```bash
bash scripts/infer.sh
```

- task defaults come from `config/infer.yaml`
- edit the preset variables at the top of `scripts/infer.sh` for quick runs
- `--attn_implementation` controls the attention backend (default `sdpa`; use `fa` for FlashAttention)
- `--data_dir` is scanned for JSONL files, and each line is treated as one sample
Advanced CLI example:

```bash
python -m raon.generate \
    --model_path /path/to/model \
    --data_dir /path/to/data_dir \
    --output_dir /path/to/output_dir \
    --config /path/to/config.yaml \
    --batch_size 4 \
    --attn_implementation sdpa
```

Python API:

```python
from raon import RaonPipeline

pipe = RaonPipeline("/path/to/model", device="cuda", dtype="bfloat16", config="/path/to/config.yaml")
text = pipe.stt("/path/to/stt.wav")
audio, sr = pipe.tts("Hello!", speaker_audio="/path/to/speaker_ref.wav")
ans1 = pipe.speech_chat("/path/to/speech-chat.wav")
ans2 = pipe.textqa("What did the speaker say?", audio="/path/to/textqa.wav")
pipe.save_audio((audio, sr), "/path/to/tts.wav")
```

Continuation TTS is also supported:
```python
audio, sr = pipe.tts_continuation(
    target_text="Continue this sentence.",
    ref_audio="/path/to/speaker_ref.wav",
    ref_text="Optional transcription of ref audio.",
)
```

See:
- `examples/message_example.py`
- `examples/message_example.ipynb`
```bash
bash scripts/train.sh
```

Common options:
- edit the preset variables at the top of `scripts/train.sh`
- `NPROC_PER_NODE` controls multi-GPU torchrun
- `MASTER_PORT` sets the torchrun rendezvous port
- `USE_SPEAKER_EMBEDDING` toggles speaker-conditioning inputs

Notes:
- Current training code freezes these modules: `audio_encoder`, `input_adaptor`, `output_adaptor`, `audio_lm_head`, `proj_code`, `code_predictor`, `speaker_encoder`
- Default training attention implementation is `sdpa`; to use FlashAttention, pass `--attn_implementation fa`
Training (`duplex_train.py`) expects one JSON object per line with stereo audio metadata.
Required top-level fields:
| Field | Type | Description |
|---|---|---|
| `audio_path` | `str` | Path to a stereo wav (2 channels). |
| `language` | `str` | Language code (for prompt selection), e.g. `eng`, `kor`. |
| `channel` | `str` | Duplex channel type, e.g. `full_duplex` or `duplex_instruct`. |
| `speak_first` | `list[int or bool]` | Per-channel initial speaking mode for the two channels. `1`/`true` means that channel is treated as speaking first; `0`/`false` means listen-first. |
| `include_in_training` | `list[int or bool]` | Per-channel training mask for the two channels. `1`/`true` includes that channel's targets in loss computation; `0`/`false` excludes them. |
| `turns` or `scripts` | `list` | Conversation timing/transcript annotations (one of the two formats below). |
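For concreteness, a record in annotation format A might look like the sketch below. Field names follow the table and the format A description; the sample indices and words are invented placeholders.

```python
import json

# Minimal full-duplex training record using the "turns" annotation format.
record = {
    "audio_path": "/path/to/stereo.wav",  # 2-channel wav
    "language": "eng",
    "channel": "full_duplex",
    "speak_first": [1, 0],          # channel 0 speaks first; channel 1 listens first
    "include_in_training": [1, 1],  # both channels contribute to the loss
    "sample_rate": 24000,
    "turns": [
        {
            "channel": 0,
            "start_sample": 0,
            "end_sample": 48000,
            "ipus": [
                {
                    "words": [
                        {"word": "hello", "start_sample": 0, "end_sample": 24000},
                        {"word": "there", "start_sample": 24000, "end_sample": 48000},
                    ]
                }
            ],
        }
    ],
}

# One such object per JSONL line.
line = json.dumps(record)
```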
Supported annotation format A (`turns`):
- Use this format when each utterance is already represented as a turn.
- Each turn contains `channel`, `start_sample`, `end_sample`, and `ipus`.
- Each IPU contains `words`, and each word has `word`, `start_sample`, `end_sample`.

Supported annotation format B (`scripts`):
- Use this format when you have per-channel word timestamps instead of explicit turns.
- `scripts` is `[ch0_words, ch1_words]`, and each word item has `word`, `start`, `end` in seconds.
- You can optionally provide `timeline` or `rough_timeline` for utterance boundaries.
- If those boundaries include `channel`, they are used directly.
- Otherwise, the loader derives per-channel utterance boundaries from the word timestamps in `scripts`.
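The fallback derivation can be sketched as below. The loader's actual grouping rule isn't documented here; this illustrative version assumes that words separated by less than a `max_gap` of silence belong to the same utterance.

```python
def words_to_utterances(words, max_gap=0.5):
    """Group one channel's word timestamps into utterance boundaries.

    words: [{"word": str, "start": float, "end": float}, ...] in seconds,
    sorted by start time. `max_gap` is an assumed silence threshold, not
    the value used by the actual raon loader.
    """
    utterances = []
    for w in words:
        if utterances and w["start"] - utterances[-1]["end"] <= max_gap:
            utterances[-1]["end"] = w["end"]  # extend the current utterance
        else:
            utterances.append({"start": w["start"], "end": w["end"]})
    return utterances

ch0_words = [
    {"word": "hi", "start": 0.0, "end": 0.3},
    {"word": "there", "start": 0.4, "end": 0.8},  # 0.1 s gap: same utterance
    {"word": "yes", "start": 2.5, "end": 2.9},    # 1.7 s gap: new utterance
]
boundaries = words_to_utterances(ch0_words)
```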
Optional fields:
- `sample_rate` (used for `turns` timestamps; defaults to `24000` if omitted)
- `system_prompt` (if omitted, the prompt is built from persona/context/name metadata when available)
Optional inference metadata sidecar:
- You can place a same-stem `.jsonl` file next to the input wav for prompt and speaker metadata.
- This is an input metadata format, not a separate CLI argument schema.
| Field | Type | Description |
|---|---|---|
| `audio_path` | `str` | Input audio path metadata (optional for CLI runs; used in dataset-style records). |
| `speak_first` | `bool` | Override for the initial speaking mode. |
| `language` | `str` | Optional language hint. |
| `persona` | `str` | Persona key/text for the prompt builder. |
| `context` | `str` | Extra context appended to the system prompt. |
| `system_prompt` | `str` | Explicit prompt override. |
| `speaker_audio` | `str` | Speaker reference wav path (preferred key). |
| `speaker_ref_audios` | `list[str]` | Fallback speaker reference list (first element used). |
Sample data:
- train: `data/duplex/train`
- eval: `data/duplex/eval`
- persona catalog: `data/duplex/personas.json`
Training data uses timeline JSONL + stereo audio assets.
```bash
bash scripts/duplex_infer.sh
```

- edit the preset variables at the top of `scripts/duplex_infer.sh` for quick runs
- `--attn_implementation` controls the attention backend (default `sdpa`; use `fa` for FlashAttention)
- `--data_dir` is scanned recursively for `*.jsonl`, and every line is treated as one sample
Advanced CLI example:
```bash
python -m raon.duplex_generate \
    --model_path /path/to/model \
    --data_dir /path/to/data_dir \
    --output_dir /path/to/output_dir \
    --config /path/to/config.yaml \
    --speaker_audio /path/to/speaker_ref.wav \
    --persona /path/or/persona_key \
    --context "Optional extra context"
```

`data_dir` is scanned recursively for `*.jsonl`, and CLI arguments override per-record metadata when both are provided.
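That precedence (CLI over per-record metadata) can be sketched as follows; this is not the actual `raon.duplex_generate` code, just an illustration of the merge rule.

```python
def resolve_options(record_meta: dict, cli_args: dict) -> dict:
    """Merge per-record metadata with CLI arguments.

    CLI values win whenever they are explicitly set (non-None),
    matching the documented precedence.
    """
    merged = dict(record_meta)
    merged.update({k: v for k, v in cli_args.items() if v is not None})
    return merged

# Hypothetical values: the record carries a persona and context,
# while the CLI overrides only the persona and adds a speaker reference.
record_meta = {"persona": "casual_friend", "context": "from the record"}
cli_args = {"persona": "narrator", "context": None, "speaker_audio": "/path/to/ref.wav"}
opts = resolve_options(record_meta, cli_args)
```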
```python
from raon import RaonPipeline

pipe = RaonPipeline("/path/to/duplex-model", device="cuda", dtype="bfloat16")
audio = pipe.load_audio("/path/to/input.wav")
summary = pipe.duplex(
    audio_input=audio,
    output_dir="/path/to/output_dir",
    speak_first=False,
    system_prompt="You are engaging in real-time conversation.",
    speaker_audio="/path/to/speaker_ref.wav",
)
print(summary)
```

See: `examples/duplex_example.ipynb`
```bash
bash scripts/duplex_train.sh
```

Common options:
- edit the preset variables at the top of `scripts/duplex_train.sh`
- `NPROC_PER_NODE` controls multi-GPU torchrun
- `MASTER_PORT` sets the torchrun rendezvous port
- `BATCH_SIZE` is fixed to `1`

Notes:
- same freeze-list pattern as SpeechLLM training
- Default training attention implementation is `sdpa`; to use FlashAttention, pass `--attn_implementation fa`
```bash
bash demo/run_gradio_demo.sh
```

- edit the preset variables at the top of `demo/run_gradio_demo.sh`
- Export the HF checkpoint to an SGLang bundle:

  ```bash
  bash scripts/export.sh
  ```

  - edit the preset variables at the top of `scripts/export.sh`

- Run the demo:

  ```bash
  bash demo/run_gradio_duplex_demo.sh
  ```

  - edit the preset variables at the top of `demo/run_gradio_duplex_demo.sh`
Apache License 2.0. See `LICENSE`.
```bibtex
@misc{raonspeech,
  title  = {Raon-Speech Technical Report},
  author = {{KRAFTON}},
  month  = {April},
  year   = {2026}
}
```