Conversation

@SpenserCai SpenserCai commented Jan 15, 2026

Summary

This PR adds support for HunyuanOCR, a Vision-Language Model developed by Tencent and optimized for document OCR tasks. The model combines a Vision Transformer encoder with a Transformer decoder for high-quality text recognition from document images.

Model: tencent/HunyuanOCR

Features

  • Dynamic Resolution Support: Handles variable-sized document images via a smart resize algorithm
  • xDRoPE (Extended Dynamic Rotary Position Embedding): Multi-dimensional position encoding for image patches
  • Flash Attention Support: Accelerated inference on CUDA with --features flash-attn
  • SDPA on Metal: Efficient decode phase on Apple Silicon using Scaled Dot Product Attention
  • Multi-image Support: Process multi-page documents in a single inference
  • Batch Mode: Process multiple images sequentially without reloading the model

Architecture

HunyuanOCR consists of:

  1. Vision Encoder (vision.rs)

    • Patch embedding with bilinear position interpolation
    • 27 Vision Transformer blocks
    • PatchMerger for spatial merging with special tokens (image_begin, image_end, image_newline)
    • Projects vision features to text embedding space
  2. Text Decoder (text.rs)

    • 24-layer Transformer decoder with SwiGLU MLP
    • Q/K LayerNorm (unique to HunyuanOCR)
    • xDRoPE for 4D position encoding (text, width, height, time)
    • GQA (Grouped Query Attention) with 16 heads / 8 KV heads
  3. Configuration (config.rs)

    • Full HuggingFace config.json compatibility
    • Vision config: hidden size 1152, patch size 16, max image size 2048
    • Text config: hidden size 1024, vocab size 120818 (a sketch of these structs follows this list)
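
As a rough illustration of the configuration layout described above, the sketch below shows what serde-deserializable config structs for these values could look like. Field names and defaults are assumptions and may not match the actual config.rs in this PR.

```rust
// Sketch only: serde-deserializable config structs matching the values above.
// Requires the serde crate with the "derive" feature. Field names are
// assumptions and may differ from the actual config.rs.
use serde::Deserialize;

#[derive(Debug, Clone, Deserialize)]
pub struct VisionConfig {
    pub hidden_size: usize,        // 1152
    pub patch_size: usize,         // 16
    pub spatial_merge_size: usize, // 2 (patch_size * spatial_merge_size = 32)
    pub max_image_size: usize,     // 2048
    pub num_hidden_layers: usize,  // 27
}

#[derive(Debug, Clone, Deserialize)]
pub struct TextConfig {
    pub hidden_size: usize,         // 1024
    pub vocab_size: usize,          // 120818
    pub num_hidden_layers: usize,   // 24
    pub num_attention_heads: usize, // 16
    pub num_key_value_heads: usize, // 8 (GQA)
    pub xdrope_section: Vec<usize>, // [16, 16, 16, 16]
}

#[derive(Debug, Clone, Deserialize)]
pub struct Config {
    pub vision_config: VisionConfig,
    pub text_config: TextConfig,
}
```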

Usage

# Basic OCR with default prompt
cargo run --example hunyuan-ocr --release --features cuda -- \
    --image document.png

# Enable Flash Attention for faster inference (CUDA only, requires BF16)
cargo run --example hunyuan-ocr --release --features cuda,flash-attn -- \
    --image document.png \
    --flash-attn \
    --bf16

# Custom prompt
cargo run --example hunyuan-ocr --release --features cuda -- \
    --image document.png \
    --prompt "Extract all text from this image"

# Multi-page document OCR
cargo run --example hunyuan-ocr --release --features cuda -- \
    --image page1.png --image page2.png

# Batch mode - process multiple images sequentially
cargo run --example hunyuan-ocr --release --features cuda -- \
    --batch doc1.png doc2.png doc3.png

# Run on CPU
cargo run --example hunyuan-ocr --release -- \
    --cpu \
    --image document.png

# Run on Metal (Apple Silicon)
cargo run --example hunyuan-ocr --release --features metal -- \
    --image document.png

Implementation Details

xDRoPE Position Encoding

HunyuanOCR uses Extended Dynamic Rotary Position Embedding (xDRoPE), which encodes four position dimensions:

  • Dimension 0: Sequential text position
  • Dimension 1: Width/column position for image patches
  • Dimension 2: Height/row position for image patches
  • Dimension 3: Time/frame position (for video, 0 for images)

The xdrope_section config [16, 16, 16, 16] specifies how the 64-dimensional attention head is split across these four dimensions.
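
To make the section mapping concrete, here is a minimal, dependency-free sketch of how one rotation angle per rotary pair could be derived from the four position streams. The pair-to-section assignment and the base theta are assumptions; the actual implementation in text.rs operates on tensors and may order the frequencies differently.

```rust
// Illustrative sketch of how a multi-section rotary embedding could combine
// four position streams (text, width, height, time) into one angle per
// rotary pair. The exact pair-to-section mapping in text.rs may differ.

/// `positions`: one position value per dimension, e.g. [t, w, h, time].
/// `sections`: number of head dims owned by each position dimension,
///             e.g. [16, 16, 16, 16] for a 64-dim head (32 rotary pairs).
/// Returns (cos, sin) vectors with one entry per rotary pair.
fn xdrope_angles(positions: [usize; 4], sections: [usize; 4], theta: f32) -> (Vec<f32>, Vec<f32>) {
    let head_dim: usize = sections.iter().sum(); // 64
    let num_pairs = head_dim / 2;                // 32
    let mut cos = Vec::with_capacity(num_pairs);
    let mut sin = Vec::with_capacity(num_pairs);
    for pair in 0..num_pairs {
        // Standard RoPE inverse frequency for this pair.
        let inv_freq = 1.0 / theta.powf(2.0 * pair as f32 / head_dim as f32);
        // Decide which position dimension drives this pair: each section of
        // `sections[d]` head dims corresponds to `sections[d] / 2` pairs.
        let mut dim = 0;
        let mut acc = 0;
        for (d, s) in sections.iter().enumerate() {
            acc += *s / 2;
            if pair < acc {
                dim = d;
                break;
            }
        }
        let angle = positions[dim] as f32 * inv_freq;
        cos.push(angle.cos());
        sin.push(angle.sin());
    }
    (cos, sin)
}

fn main() {
    // A text token at sequence position 5 (image dimensions unused -> 0).
    let (cos, sin) = xdrope_angles([5, 0, 0, 0], [16, 16, 16, 16], 10_000.0);
    println!("{} rotary pairs, cos[0]={:.3}, sin[0]={:.3}", cos.len(), cos[0], sin[0]);
}
```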

Smart Resize Algorithm

Images are resized to ensure the following (a sketch of the routine follows the list):

  1. Both dimensions are divisible by the resize factor (patch_size × spatial_merge_size = 32)
  2. Total pixels fall within the [512×512, 2048×2048] range
  3. Aspect ratio is preserved (capped at 200:1)
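
A minimal sketch of a resize routine implied by these constraints is shown below. The rounding direction at each step is an assumption and the 200:1 aspect-ratio check is omitted for brevity; the preprocessing in the example may differ.

```rust
// Sketch of a smart-resize routine following the constraints above:
// dimensions divisible by the factor (32), total pixels clamped to
// [512*512, 2048*2048], aspect ratio preserved. Rounding details are
// illustrative and may not match the preprocessing in this PR exactly.
fn smart_resize(height: u32, width: u32) -> (u32, u32) {
    const FACTOR: f64 = 32.0;              // patch_size * spatial_merge_size
    const MIN_PIXELS: f64 = 512.0 * 512.0;
    const MAX_PIXELS: f64 = 2048.0 * 2048.0;

    let (h, w) = (height as f64, width as f64);

    // Round each side to the nearest multiple of FACTOR.
    let round_to = |x: f64| (x / FACTOR).round().max(1.0) * FACTOR;
    let (mut h2, mut w2) = (round_to(h), round_to(w));

    // Scale down if too many pixels, up if too few, keeping aspect ratio.
    if h2 * w2 > MAX_PIXELS {
        let scale = (MAX_PIXELS / (h * w)).sqrt();
        h2 = ((h * scale) / FACTOR).floor().max(1.0) * FACTOR;
        w2 = ((w * scale) / FACTOR).floor().max(1.0) * FACTOR;
    } else if h2 * w2 < MIN_PIXELS {
        let scale = (MIN_PIXELS / (h * w)).sqrt();
        h2 = ((h * scale) / FACTOR).ceil() * FACTOR;
        w2 = ((w * scale) / FACTOR).ceil() * FACTOR;
    }
    (h2 as u32, w2 as u32)
}

fn main() {
    // A 3000x1200 scanned page gets snapped to multiples of 32.
    let (h, w) = smart_resize(3000, 1200);
    assert_eq!(h % 32, 0);
    assert_eq!(w % 32, 0);
    println!("resized to {h}x{w}");
}
```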

Vision-Text Integration

Image tokens are injected into the text sequence using the following special tokens (a sketch of the layout follows the list):

  • <IM_START> (120118): Marks beginning of image tokens
  • <IMAGE> (120120): Placeholder for each image patch token
  • <IM_END> (120119): Marks end of image tokens
  • <IM_NEWLINE> (120121): Added at end of each patch row
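
The sketch below shows how the placeholder block for a single image could be laid out with these IDs, given the patch grid after spatial merging. The helper name and grid arithmetic are illustrative; at inference the <IMAGE> positions are where the projected vision features are placed.

```rust
// Sketch of how the image token block could be laid out for one image,
// using the special token IDs listed above. Grid dimensions are the patch
// grid *after* spatial merging; the helper name is hypothetical.
const IM_START: u32 = 120118;
const IM_END: u32 = 120119;
const IMAGE: u32 = 120120;
const IM_NEWLINE: u32 = 120121;

/// Build the placeholder token ids for an image whose merged patch grid is
/// `rows` x `cols`: <IM_START>, then `cols` <IMAGE> tokens plus one
/// <IM_NEWLINE> per row, then <IM_END>.
fn image_token_block(rows: usize, cols: usize) -> Vec<u32> {
    let mut ids = Vec::with_capacity(2 + rows * (cols + 1));
    ids.push(IM_START);
    for _ in 0..rows {
        ids.extend(std::iter::repeat(IMAGE).take(cols));
        ids.push(IM_NEWLINE);
    }
    ids.push(IM_END);
    ids
}

fn main() {
    // A 1024x2048 image with factor 32 yields a 32x64 merged grid.
    let block = image_token_block(32, 64);
    assert_eq!(block.len(), 2 + 32 * (64 + 1));
    println!("image block has {} tokens", block.len());
}
```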

Files Changed

  • candle-transformers/src/models/hunyuan_ocr/mod.rs - Main model implementation
  • candle-transformers/src/models/hunyuan_ocr/vision.rs - Vision encoder
  • candle-transformers/src/models/hunyuan_ocr/text.rs - Text decoder with xDRoPE
  • candle-transformers/src/models/hunyuan_ocr/config.rs - Configuration structs
  • candle-transformers/src/models/mod.rs - Module export
  • candle-examples/examples/hunyuan-ocr/main.rs - Example application
  • candle-examples/examples/hunyuan-ocr/README.md - Documentation

Test Results

Metal (Apple Silicon)

(screenshot of test output)

CUDA

(screenshot of test output)

CUDA + Flash Attention

(screenshot of test output)

Testing

The implementation has been validated against the official PyTorch implementation to ensure numerical correctness.

Hugging Face Space

(screenshot)

Tencent Website

(screenshot)

@lucasjinreal

How does the speed compare with PaddleOCR_VL?
