A tiny, Dockerized FastAPI server around the Hugging Face model `Jalea96/DeepSeek-OCR-bnb-4bit-NF4` that:
- exposes one endpoint: `POST /infer`, which streams raw text (`text/plain`) as the model emits it (use `curl -N`)
- is configurable entirely via a `.env` file (port, model, dtype, preset, etc.)
- ships a lightweight client CLI (`ocr`) that can send an image or let you draw a screen region across any monitor and stream the OCR text; running `ocr` lets you select an area of your screen, which is sent to the model, and the response is streamed in your terminal and then copied to your clipboard.
Hardware target: NVIDIA GPU with CUDA 13.0+ (Docker base: nvidia/cuda:13.0.1-cudnn-runtime-ubuntu22.04) and PyTorch cu130 wheels.
- NVIDIA driver + NVIDIA Container Toolkit (for GPU access)
- Docker & docker-compose
- (Optional, for the client) Python 3.10+ on your host with `requests`, `Pillow`, `mss`, and `tkinter`. On minimal distros you may need to install `python3-tk`.
Create a file `.env` next to your `docker-compose.yml`:

```env
# Server
PORT=8000

# Model/runtime
MODEL_ID=Jalea96/DeepSeek-OCR-bnb-4bit-NF4
ATTN_IMPL=eager        # or flash_attention_2 (if installed)
TORCH_DTYPE=bf16       # bf16|fp16|fp32
DEVICE_MAP=auto
PRESET=gundam
PROMPT=<image>\n<|grounding|>Convert the document to markdown.

# Hugging Face cache inside the container
HF_HOME=/data/hf_cache
HF_TOKEN=

# Optional quantization toggles (if wired in code)
FORCE_BNB_4BIT=1
BNB_DOUBLE_QUANT=1
```

Tip: keep `PRESET=gundam` unless you know you want other sizes.
```bash
docker compose up --build -d
```

This builds the CUDA 13 image, installs PyTorch (cu130) and friends, and starts the API.

If your `docker-compose.yml` maps `${PORT:-8000}:8000`, the container always listens on 8000 inside, while the host port is taken from `.env`.
Use `curl -N` to avoid buffering:

```bash
curl -s -N -F "file=@test.png" http://localhost:8000/infer
```

You should see tokens/lines flowing until inference completes.
- Request: `multipart/form-data` with a single part named `file` (png/jpg/webp/tiff…)
- Response: `text/plain` stream (chunked); the server pipes the model’s raw output as it happens, then exits.

Example:

```bash
curl -s -N -F "file=@docs/sample.png" http://localhost:8000/infer
```

Note: there is intentionally no JSON and no “final file” here; the contract is: stream raw text. If you need markdown files or boxes, build those on the client from the stream.
A tiny Python script that:
- accepts an image path or lets you select a region across any monitor and uploads the crop
- streams cleaned text by default (removes `<|det|>…</|det|>`, `<|ref|>…</|ref|>`, ANSI escapes, tqdm bars, tensor-size logs like `BASE:/PATCHES: torch.Size([...])`, etc.); pass `--raw` to see the unfiltered stream. A sketch of this kind of filtering follows the list below.
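For illustration, the default cleaning can be sketched as follows. The exact patterns live in `scripts/ocr` and may differ; these only cover the noise categories listed above:

```python
# Sketch of the kind of line filter the ocr client applies by default
# (illustrative patterns only; scripts/ocr is the source of truth).
import re

INLINE_NOISE = [
    re.compile(r"<\|det\|>.*?</\|det\|>", re.DOTALL),  # detection/grounding tags
    re.compile(r"<\|ref\|>.*?</\|ref\|>", re.DOTALL),  # reference tags
    re.compile(r"\x1b\[[0-9;]*[A-Za-z]"),              # ANSI escape sequences
]
NOISE_LINES = [
    re.compile(r"\d+%\|.*\|"),                         # tqdm progress bars
    re.compile(r"^(BASE|PATCHES):\s*torch\.Size"),     # tensor-size logs
]

def clean_line(line: str) -> str | None:
    """Return a cleaned line, or None if the whole line is noise."""
    if any(p.search(line) for p in NOISE_LINES):
        return None
    for p in INLINE_NOISE:
        line = p.sub("", line)
    return line if line.strip() else None
```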
```bash
# install deps
python3 -m pip install --upgrade requests Pillow mss

# put the script somewhere in PATH, e.g.
install -Dm755 scripts/ocr ~/.local/bin/ocr

# or just make it executable and add the folder to PATH
chmod +x scripts/ocr
```

On some Linux distros you may need `sudo apt install python3-tk` (for `tkinter`) if it isn’t preinstalled.
```bash
# Select a screen region and stream cleaned OCR
ocr

# Send a file
ocr test.png

# Raw (no cleaning)
ocr --raw test.png

# Different server
ocr --url http://server:8000 test.png
```
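For reference, the screen-region path can be sketched in a few lines with `mss` and `Pillow`; the coordinates below are hypothetical placeholders, while the real client lets you draw the region interactively with `tkinter`:

```python
# Hypothetical sketch of the region-capture path: grab a screen region with
# mss, convert it to PNG in memory, and POST it to /infer. Coordinates are
# hard-coded for illustration; the real client lets you draw the region.
import io
import mss
import requests
from PIL import Image

REGION = {"left": 100, "top": 100, "width": 800, "height": 600}

with mss.mss() as sct:
    shot = sct.grab(REGION)
    image = Image.frombytes("RGB", shot.size, shot.bgra, "raw", "BGRX")

buf = io.BytesIO()
image.save(buf, format="PNG")
buf.seek(0)

with requests.post(
    "http://localhost:8000/infer",
    files={"file": ("region.png", buf, "image/png")},
    stream=True,
) as resp:
    for chunk in resp.iter_content(chunk_size=None, decode_unicode=True):
        if chunk:
            print(chunk, end="", flush=True)
```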
```text
.
├─ Dockerfile            # CUDA 13.0.1 base, PyTorch cu130 wheels, FastAPI server
├─ docker-compose.yml    # uses .env for port/model/dtype/etc.
├─ .env                  # runtime configuration
├─ ocr_api_allinone.py   # FastAPI server; loads env via python-dotenv
├─ scripts/
│  └─ ocr                # Client CLI (upload file or screen selection; stream text)
├─ uploads/              # (mounted) temporary uploads
└─ results/              # (mounted) optional output dir for model side-effects
```
| Key | Meaning | Default |
|---|---|---|
| `PORT` | Host port mapping for the API | `8000` |
| `MODEL_ID` | HF repo id of the model | `Jalea96/DeepSeek-OCR-bnb-4bit-NF4` |
| `ATTN_IMPL` | Attention backend (`eager`, `flash_attention_2`) | `eager` |
| `TORCH_DTYPE` | Torch dtype (`bf16`, `fp16`, `fp32`) | `bf16` |
| `DEVICE_MAP` | `auto` or device mapping string | `auto` |
| `PRESET` | Resolution preset (tiny/small/base/large/gundam) | `gundam` |
| `PROMPT` | Initial prompt text | see `.env` |
| `HF_HOME` | HF cache dir in container | `/data/hf_cache` |
| `HF_TOKEN` | HF token (not required for this public model) | (empty) |
| `FORCE_BNB_4BIT` | (optional) force BitsAndBytes 4-bit quant | `1` |
| `BNB_DOUBLE_QUANT` | (optional) BnB double-quant toggle | `1` |
- GPU access: ensure the host has NVIDIA Container Toolkit installed and your compose service uses `gpus: all`.
- flash-attn: the Dockerfile attempts to install `flash-attn` (2.7.x). For CUDA 13 it may not have prebuilt wheels yet; failure is ignored. If you enable it (`ATTN_IMPL=flash_attention_2`) without a working install, the code falls back to `eager` (see the sketch after this list).
- Quantization messages: warnings like `Unused kwargs: ['_load_in_4bit' ...]` are expected when the model config already contains quantization; they are harmless.
- Streaming client cleaning: by default the `ocr` client removes noisy lines such as `<|det|>...`, `<|ref|>...`, tqdm bars (`other: 0%|...`), and tensor-size logs (`BASE:`/`PATCHES:`). Use `--raw` to see unfiltered output.
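The eager fallback mentioned above could look roughly like this; a sketch under the assumption that the model is loaded through `transformers`, not the repo's actual loading code:

```python
# Sketch of an attention-backend fallback (illustrative; the server's real
# loading code may differ). Try the configured backend, retry with eager.
from transformers import AutoModel

def load_ocr_model(model_id: str, attn_impl: str, dtype, device_map: str):
    try:
        return AutoModel.from_pretrained(
            model_id,
            trust_remote_code=True,
            attn_implementation=attn_impl,
            torch_dtype=dtype,
            device_map=device_map,
        )
    except (ImportError, ValueError):
        # e.g. ATTN_IMPL=flash_attention_2 but flash-attn is not installed
        return AutoModel.from_pretrained(
            model_id,
            trust_remote_code=True,
            attn_implementation="eager",
            torch_dtype=dtype,
            device_map=device_map,
        )
```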
Q: Server says python-multipart required / 422 on upload
A: The Dockerfile installs python-multipart. If you run locally, ensure:
```bash
python3 -m pip install python-multipart
```

Q: Error loading ASGI app. Could not import module
A: Make sure the filename in the container matches the CMD or uvicorn module path (this repo uses ocr_api_allinone.py with CMD ["python3", "-u", "ocr_api_allinone.py"]).
Q: No GPU found in container
A: Install NVIDIA Container Toolkit on the host; run compose with gpus: all.
Q: Client won’t open region selector
A: Install tkinter (often python3-tk package) and Pillow. On Wayland, some DEs restrict global screenshots; try X11 session or grant screen-capture permissions.
Q: Stream shows only raw tokens
A: That’s expected while the model is generating. The client stitches small token lines and removes noise unless --raw is passed.
- This is a development-focused server that accepts arbitrary image uploads and runs a large model in the same process. Place it behind a trusted network or reverse proxy if exposed.
- Use separate volumes for `uploads/` and `results/` if you care about persistence.
- Model: Jalea96/DeepSeek-OCR-bnb-4bit-NF4 on Hugging Face.
- Thanks to the Hugging Face ecosystem, FastAPI, and the PyTorch team.
This repository’s scripts are provided under the MIT License (see LICENSE).
Model weights have their own license—please review the corresponding Hugging Face model card.