Docling Ingest

A web application for PDF document ingestion with interactive preview, content editing, and vector database storage. Built on Docling for PDF parsing and ChromaDB for vector storage.

Features

PDF Upload & Conversion — Upload PDFs, automatically extract text, tables, and images using Docling
Interactive Content Viewer — Preview extracted content page-by-page with visual annotations
Content Editing — Delete, reorder, and edit items before ingestion. Full undo/redo support
Image Classification — Classify images (logo, chart, diagram, photo, etc.) with customizable presets
AI Image Descriptions — Optional vision model integration to auto-generate image alt-text
Embedding Preview — Preview how documents will be chunked before ingestion
Vector DB Ingestion — Ingest processed documents into ChromaDB collections
Multiple Strategies — Choose between per-page, chunked, or custom ingestion strategies
Dark Mode — Automatic theme detection based on system preferences
Offline-First — Document state is managed in the browser via localStorage
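
To make the difference between the per-page and chunked strategies concrete, here is a minimal sketch in plain Python. It is not the project's implementation: the function names, the character-based window, and the max_chars and overlap parameters are assumptions chosen for illustration, and the real engine may chunk by tokens or by Docling document structure instead.

```python
from typing import List


def per_page_chunks(pages: List[str]) -> List[str]:
    """Hypothetical per-page strategy: one chunk per non-blank page."""
    return [page.strip() for page in pages if page.strip()]


def fixed_size_chunks(pages: List[str], max_chars: int = 200, overlap: int = 20) -> List[str]:
    """Hypothetical chunked strategy: join pages, then cut overlapping windows.

    Each window holds at most max_chars characters and repeats the last
    `overlap` characters of the previous window.
    """
    if max_chars <= 0 or not 0 <= overlap < max_chars:
        raise ValueError("need max_chars > 0 and 0 <= overlap < max_chars")
    text = "\n".join(per_page_chunks(pages))
    step = max_chars - overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


if __name__ == "__main__":
    sample_pages = ["First page text.", "", "Second page text."]
    print(len(per_page_chunks(sample_pages)), "per-page chunks")
    print(len(fixed_size_chunks(sample_pages, max_chars=12, overlap=4)), "fixed-size chunks")
```

Per-page chunks keep page boundaries intact, which suits documents whose pages are self-contained. Overlapping windows trade some duplication for context that would otherwise be cut at a chunk boundary.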

Architecture

┌──────────────┐     ┌──────────────┐     ┌──────────────────┐
│   Frontend   │────▶│    Proxy     │────▶│      Engine      │
│  React/Vite  │     │   Node.js    │     │  Python/FastAPI  │
│    :3000     │     │    :4006     │     │      :8000       │
└──────────────┘     └──────────────┘     └────────┬─────────┘
                                                   │
                                             ┌─────┴──────┐
                                             │  ChromaDB  │
                                             │ (embedded) │
                                             └────────────┘

Frontend — React + Vite + Tailwind CSS. Single-page app with the ingestion interface
Proxy — Node.js/Express. Aggregates local and remote config, proxies requests to the engine
Engine — Python/FastAPI. Runs Docling for PDF conversion, manages embeddings and ChromaDB

Quick Start

Docker Compose (Recommended)

git clone https://github.com/liyanfeng129/docling-ingest.git
cd docling-ingest
docker compose up --build

Open http://localhost:3000 in your browser.

With Vision Model (Optional)

To enable AI-powered image descriptions:

docker compose --profile vision up --build

Then pull the vision model:

docker exec -it docling-ingest-ollama-1 ollama pull granite3.2-vision:2b

Finally, set ENABLE_VISION_MODEL=true in the engine's environment and recreate the engine container so the change takes effect.

Manual Setup

Engine (Python):

cd backend/engine
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
uvicorn main:app --host 0.0.0.0 --port 8000

Proxy (Node.js):

cd backend/proxy
npm install
PORT=4006 ENGINE_URL=http://localhost:8000 node index.js

Frontend:

cd frontend
npm install
npm run dev
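
Once all three services are started, a quick port check can confirm that each one is listening. The script below is a minimal sketch using only the Python standard library. It assumes the ports shown in the architecture diagram (3000, 4006, 8000) and that everything runs on localhost; SERVICES, is_listening, and check_services are illustrative names, not part of the project.

```python
import socket
from typing import Dict

# Ports taken from the architecture diagram; adjust if you changed them.
SERVICES: Dict[str, int] = {"frontend": 3000, "proxy": 4006, "engine": 8000}


def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_services(host: str = "localhost") -> Dict[str, bool]:
    """Map each service name to whether its port accepts connections."""
    return {name: is_listening(host, port) for name, port in SERVICES.items()}


if __name__ == "__main__":
    for name, up in check_services().items():
        print(f"{name:<8} {'up' if up else 'down'}")
```

A successful TCP connection only shows that something is bound to the port. It does not prove the service behind it is healthy.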

Configuration

Environment Variables

Variable	Service	Default	Description
`VITE_INGESTION_URL`	Frontend	`http://localhost:4006`	Proxy service URL
`PORT`	Proxy	`4006`	Proxy listen port
`ENGINE_URL`	Proxy	`http://localhost:8000`	Engine service URL
`ENABLE_VISION_MODEL`	Engine	`false`	Enable Ollama vision model
`EMBEDDING_MODEL`	Engine	`Snowflake/snowflake-arctic-embed-l`	Sentence Transformers model
`CHROMA_PERSIST_DIRECTORY`	Engine	`./resources/chroma_db/default`	ChromaDB storage path
`CHROMA_COLLECTION_NAME`	Engine	`documents`	Default collection name
`OLLAMA_BASE_URL`	Engine	`http://localhost:11434`	Ollama server URL
`VISION_MODEL`	Engine	`granite3.2-vision:2b`	Ollama vision model name
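
As an illustration of how the engine-side variables and defaults in the table fit together, the sketch below reads them into a typed settings object. This is an assumption-laden example, not the engine's actual configuration code: EngineConfig and load_engine_config are hypothetical names, and the rule for parsing ENABLE_VISION_MODEL (accepting "1", "true", or "yes" in any case) is a guess at a reasonable convention.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class EngineConfig:
    """Hypothetical container for the engine variables listed in the table."""
    enable_vision_model: bool
    embedding_model: str
    chroma_persist_directory: str
    chroma_collection_name: str
    ollama_base_url: str
    vision_model: str


def load_engine_config(env: Optional[Mapping[str, str]] = None) -> EngineConfig:
    """Build an EngineConfig from env, falling back to the documented defaults."""
    env = os.environ if env is None else env
    # Assumption: these spellings enable the flag; the real engine may differ.
    flag = env.get("ENABLE_VISION_MODEL", "false").strip().lower()
    return EngineConfig(
        enable_vision_model=flag in {"1", "true", "yes"},
        embedding_model=env.get("EMBEDDING_MODEL", "Snowflake/snowflake-arctic-embed-l"),
        chroma_persist_directory=env.get(
            "CHROMA_PERSIST_DIRECTORY", "./resources/chroma_db/default"
        ),
        chroma_collection_name=env.get("CHROMA_COLLECTION_NAME", "documents"),
        ollama_base_url=env.get("OLLAMA_BASE_URL", "http://localhost:11434"),
        vision_model=env.get("VISION_MODEL", "granite3.2-vision:2b"),
    )


if __name__ == "__main__":
    print(load_engine_config())
```

Passing a plain dictionary instead of os.environ makes it easy to try overrides, for example load_engine_config({"CHROMA_COLLECTION_NAME": "reports"}).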

Persistent Data & Docker Volumes

When running with Docker Compose, files the app creates at runtime (such as the ChromaDB vector database) are stored in named volumes managed by Docker, not in your project directory.

Where is the Vector DB?

When you ingest a document, ChromaDB writes its data to the chroma_data volume, mapped to /app/resources/chroma_db inside the engine container.

To browse the data in Docker Desktop:

Open Docker Desktop
Click Volumes in the left sidebar
Click docling-ingest_chroma_data
Click the Data tab to browse the files

The ChromaDB collection files will be under the default/ folder.

To inspect via terminal:

# List files inside the volume
docker exec -it docling-ingest-engine-1 ls /app/resources/chroma_db/default

# Copy the entire database to your local machine
docker cp docling-ingest-engine-1:/app/resources/chroma_db ./chroma_db_backup
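
After copying the database out of the container, it can be useful to confirm that the backup actually contains data. The following is a small sketch that counts files and bytes under the backup folder. It assumes the ./chroma_db_backup destination used in the docker cp command; summarize_dir is a hypothetical helper and makes no assumption about ChromaDB's internal file layout.

```python
from pathlib import Path
from typing import Tuple


def summarize_dir(root: Path) -> Tuple[int, int]:
    """Return (file_count, total_bytes) for all regular files under root."""
    files = [p for p in root.rglob("*") if p.is_file()]
    return len(files), sum(p.stat().st_size for p in files)


if __name__ == "__main__":
    backup = Path("./chroma_db_backup")  # destination of the docker cp command
    if backup.is_dir():
        count, size = summarize_dir(backup)
        print(f"{backup}: {count} files, {size} bytes")
    else:
        print(f"{backup} not found; run the docker cp command first")
```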

To use a local folder instead of a Docker volume (so data appears directly in your project), replace the engine's chroma_data volume entry in docker-compose.yml with a bind mount:

volumes:
  - ./chroma_db:/app/resources/chroma_db   # bind mount — visible in Finder/Explorer

Note: The chroma_data volume persists across container restarts and rebuilds. To delete it permanently, first stop and remove the containers with docker compose down, then run docker volume rm docling-ingest_chroma_data.

License

MIT
