Conversation

@SpenserCai SpenserCai commented Jan 15, 2026

Summary

This PR adds support for HunyuanOCR, a Vision-Language Model developed by Tencent and optimized for document OCR tasks. The model combines a Vision Transformer encoder with a Transformer decoder for high-quality text recognition from document images.

Model: tencent/HunyuanOCR

Features

  • Dynamic Resolution Support: Handles variable-sized document images via a smart resize algorithm
  • xDRoPE (Extended Dynamic Rotary Position Embedding): Multi-dimensional position encoding for image patches
  • Flash Attention Support: Accelerated inference on CUDA with --features flash-attn
  • SDPA on Metal: Efficient decode phase on Apple Silicon using Scaled Dot Product Attention
  • Multi-image Support: Process multi-page documents in a single inference
  • Batch Mode: Process multiple images sequentially without reloading the model

Architecture

HunyuanOCR consists of:

  1. Vision Encoder (vision.rs)

    • Patch embedding with bilinear position interpolation
    • 27 Vision Transformer blocks
    • PatchMerger for spatial merging with special tokens (image_begin, image_end, image_newline)
    • Projects vision features to text embedding space
  2. Text Decoder (text.rs)

    • 24-layer Transformer decoder with SwiGLU MLP
    • Q/K LayerNorm (unique to HunyuanOCR)
    • xDRoPE for 4D position encoding (text, width, height, time)
    • GQA (Grouped Query Attention) with 16 heads / 8 KV heads
  3. Configuration (config.rs)

    • Full HuggingFace config.json compatibility
    • Vision config: hidden size 1152, patch size 16, max image size 2048
    • Text config: hidden size 1024, vocab size 120818 (a sketch of these structs follows this list)
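
As a rough illustration of the configuration layout described above, the sketch below shows what serde-deserializable config structs for these values could look like. Field names and defaults are assumptions and may not match the actual config.rs in this PR.

```rust
// Sketch only: serde-deserializable config structs matching the values above.
// Requires the serde crate with the "derive" feature. Field names are
// assumptions and may differ from the actual config.rs.
use serde::Deserialize;

#[derive(Debug, Clone, Deserialize)]
pub struct VisionConfig {
    pub hidden_size: usize,        // 1152
    pub patch_size: usize,         // 16
    pub spatial_merge_size: usize, // 2 (patch_size * spatial_merge_size = 32)
    pub max_image_size: usize,     // 2048
    pub num_hidden_layers: usize,  // 27
}

#[derive(Debug, Clone, Deserialize)]
pub struct TextConfig {
    pub hidden_size: usize,         // 1024
    pub vocab_size: usize,          // 120818
    pub num_hidden_layers: usize,   // 24
    pub num_attention_heads: usize, // 16
    pub num_key_value_heads: usize, // 8 (GQA)
    pub xdrope_section: Vec<usize>, // [16, 16, 16, 16]
}

#[derive(Debug, Clone, Deserialize)]
pub struct Config {
    pub vision_config: VisionConfig,
    pub text_config: TextConfig,
}
```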

Usage

# Basic OCR with default prompt
cargo run --example hunyuan-ocr --release --features cuda -- \
    --image document.png

# Enable Flash Attention for faster inference (CUDA only, requires BF16)
cargo run --example hunyuan-ocr --release --features cuda,flash-attn -- \
    --image document.png \
    --flash-attn \
    --bf16

# Custom prompt
cargo run --example hunyuan-ocr --release --features cuda -- \
    --image document.png \
    --prompt "Extract all text from this image"

# Multi-page document OCR
cargo run --example hunyuan-ocr --release --features cuda -- \
    --image page1.png --image page2.png

# Batch mode - process multiple images sequentially
cargo run --example hunyuan-ocr --release --features cuda -- \
    --batch doc1.png doc2.png doc3.png

# Run on CPU
cargo run --example hunyuan-ocr --release -- \
    --cpu \
    --image document.png

# Run on Metal (Apple Silicon)
cargo run --example hunyuan-ocr --release --features metal -- \
    --image document.png

Implementation Details

xDRoPE Position Encoding

HunyuanOCR uses Extended Dynamic Rotary Position Embedding (xDRoPE), which encodes four position dimensions:

  • Dimension 0: Sequential text position
  • Dimension 1: Width/column position for image patches
  • Dimension 2: Height/row position for image patches
  • Dimension 3: Time/frame position (for video, 0 for images)

The xdrope_section config [16, 16, 16, 16] specifies how the 64-dimensional attention head is split across these four dimensions.
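
To make the section mapping concrete, here is a minimal, dependency-free sketch of how one rotation angle per rotary pair could be derived from the four position streams. The pair-to-section assignment and the base theta are assumptions; the actual implementation in text.rs operates on tensors and may order the frequencies differently.

```rust
// Illustrative sketch of how a multi-section rotary embedding could combine
// four position streams (text, width, height, time) into one angle per
// rotary pair. The exact pair-to-section mapping in text.rs may differ.

/// `positions`: one position value per dimension, e.g. [t, w, h, time].
/// `sections`: number of head dims owned by each position dimension,
///             e.g. [16, 16, 16, 16] for a 64-dim head (32 rotary pairs).
/// Returns (cos, sin) vectors with one entry per rotary pair.
fn xdrope_angles(positions: [usize; 4], sections: [usize; 4], theta: f32) -> (Vec<f32>, Vec<f32>) {
    let head_dim: usize = sections.iter().sum(); // 64
    let num_pairs = head_dim / 2;                // 32
    let mut cos = Vec::with_capacity(num_pairs);
    let mut sin = Vec::with_capacity(num_pairs);
    for pair in 0..num_pairs {
        // Standard RoPE inverse frequency for this pair.
        let inv_freq = 1.0 / theta.powf(2.0 * pair as f32 / head_dim as f32);
        // Decide which position dimension drives this pair: each section of
        // `sections[d]` head dims corresponds to `sections[d] / 2` pairs.
        let mut dim = 0;
        let mut acc = 0;
        for (d, s) in sections.iter().enumerate() {
            acc += *s / 2;
            if pair < acc {
                dim = d;
                break;
            }
        }
        let angle = positions[dim] as f32 * inv_freq;
        cos.push(angle.cos());
        sin.push(angle.sin());
    }
    (cos, sin)
}

fn main() {
    // A text token at sequence position 5 (image dimensions unused -> 0).
    let (cos, sin) = xdrope_angles([5, 0, 0, 0], [16, 16, 16, 16], 10_000.0);
    println!("{} rotary pairs, cos[0]={:.3}, sin[0]={:.3}", cos.len(), cos[0], sin[0]);
}
```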

Smart Resize Algorithm

Images are resized to ensure the following (a sketch of the routine follows the list):

  1. Both dimensions are divisible by the resize factor (patch_size × spatial_merge_size = 32)
  2. Total pixels fall within the [512×512, 2048×2048] range
  3. Aspect ratio is preserved (capped at 200:1)
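
A minimal sketch of a resize routine implied by these constraints is shown below. The rounding direction at each step is an assumption and the 200:1 aspect-ratio check is omitted for brevity; the preprocessing in the example may differ.

```rust
// Sketch of a smart-resize routine following the constraints above:
// dimensions divisible by the factor (32), total pixels clamped to
// [512*512, 2048*2048], aspect ratio preserved. Rounding details are
// illustrative and may not match the preprocessing in this PR exactly.
fn smart_resize(height: u32, width: u32) -> (u32, u32) {
    const FACTOR: f64 = 32.0;              // patch_size * spatial_merge_size
    const MIN_PIXELS: f64 = 512.0 * 512.0;
    const MAX_PIXELS: f64 = 2048.0 * 2048.0;

    let (h, w) = (height as f64, width as f64);

    // Round each side to the nearest multiple of FACTOR.
    let round_to = |x: f64| (x / FACTOR).round().max(1.0) * FACTOR;
    let (mut h2, mut w2) = (round_to(h), round_to(w));

    // Scale down if too many pixels, up if too few, keeping aspect ratio.
    if h2 * w2 > MAX_PIXELS {
        let scale = (MAX_PIXELS / (h * w)).sqrt();
        h2 = ((h * scale) / FACTOR).floor().max(1.0) * FACTOR;
        w2 = ((w * scale) / FACTOR).floor().max(1.0) * FACTOR;
    } else if h2 * w2 < MIN_PIXELS {
        let scale = (MIN_PIXELS / (h * w)).sqrt();
        h2 = ((h * scale) / FACTOR).ceil() * FACTOR;
        w2 = ((w * scale) / FACTOR).ceil() * FACTOR;
    }
    (h2 as u32, w2 as u32)
}

fn main() {
    // A 3000x1200 scanned page gets snapped to multiples of 32.
    let (h, w) = smart_resize(3000, 1200);
    assert_eq!(h % 32, 0);
    assert_eq!(w % 32, 0);
    println!("resized to {h}x{w}");
}
```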

Vision-Text Integration

Image tokens are injected into the text sequence using the following special tokens (a sketch of the layout follows the list):

  • <IM_START> (120118): Marks beginning of image tokens
  • <IMAGE> (120120): Placeholder for each image patch token
  • <IM_END> (120119): Marks end of image tokens
  • <IM_NEWLINE> (120121): Added at end of each patch row
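
The sketch below shows how the placeholder block for a single image could be laid out with these IDs, given the patch grid after spatial merging. The helper name and grid arithmetic are illustrative; at inference the <IMAGE> positions are where the projected vision features are placed.

```rust
// Sketch of how the image token block could be laid out for one image,
// using the special token IDs listed above. Grid dimensions are the patch
// grid *after* spatial merging; the helper name is hypothetical.
const IM_START: u32 = 120118;
const IM_END: u32 = 120119;
const IMAGE: u32 = 120120;
const IM_NEWLINE: u32 = 120121;

/// Build the placeholder token ids for an image whose merged patch grid is
/// `rows` x `cols`: <IM_START>, then `cols` <IMAGE> tokens plus one
/// <IM_NEWLINE> per row, then <IM_END>.
fn image_token_block(rows: usize, cols: usize) -> Vec<u32> {
    let mut ids = Vec::with_capacity(2 + rows * (cols + 1));
    ids.push(IM_START);
    for _ in 0..rows {
        ids.extend(std::iter::repeat(IMAGE).take(cols));
        ids.push(IM_NEWLINE);
    }
    ids.push(IM_END);
    ids
}

fn main() {
    // A 1024x2048 image with factor 32 yields a 32x64 merged grid.
    let block = image_token_block(32, 64);
    assert_eq!(block.len(), 2 + 32 * (64 + 1));
    println!("image block has {} tokens", block.len());
}
```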

Files Changed

  • candle-transformers/src/models/hunyuan_ocr/mod.rs - Main model implementation
  • candle-transformers/src/models/hunyuan_ocr/vision.rs - Vision encoder
  • candle-transformers/src/models/hunyuan_ocr/text.rs - Text decoder with xDRoPE
  • candle-transformers/src/models/hunyuan_ocr/config.rs - Configuration structs
  • candle-transformers/src/models/mod.rs - Module export
  • candle-examples/examples/hunyuan-ocr/main.rs - Example application
  • candle-examples/examples/hunyuan-ocr/README.md - Documentation

Test Results

Metal (Apple Silicon)

(screenshot of test output)

CUDA

(screenshot of test output)

CUDA + Flash Attention

(screenshot of test output)

Testing

The implementation has been validated against the official PyTorch implementation to ensure numerical correctness.

Hugging Face Space

(screenshot)

Tencent Website

(screenshot)

@lucasjinreal

How does the speed compare with PaddleOCR_VL?
