feat: Grpc native converter #504

Open
krickert wants to merge 24 commits into docling-project:main from ai-pipestream:grpc-native-converter

@krickert

@krickert krickert commented Feb 20, 2026

This pull request introduces support and documentation for running and testing a gRPC server in the docling-serve project. The main changes are full gRPC compatibility with the docling project, gRPC-specific test jobs in CI, new Makefile commands for running gRPC tests, improved developer documentation for running tests, and the initial implementation of the gRPC server entrypoint and package structure.

gRPC Server Implementation

  • Introduced docling_serve/grpc/__main__.py as the Typer CLI entrypoint for running the gRPC server, with options for host, port, and artifact path, and a version command.
  • Created docling_serve/grpc/__init__.py to set up the Python import path for generated gRPC code, ensuring it is importable at runtime.
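
As a rough illustration of the import-path setup described above, the sketch below adds a generated-code directory to `sys.path` exactly once. It is not the PR's actual `__init__.py`: the `gen` directory name and the `ensure_generated_on_path` helper are assumptions made for this example.

```python
import sys
from pathlib import Path


def ensure_generated_on_path(package_dir: Path, subdir: str = "gen") -> Path:
    """Put the generated-stub directory on sys.path exactly once.

    `subdir` is a hypothetical directory name; the real layout may differ.
    """
    gen_dir = (package_dir / subdir).resolve()
    gen_str = str(gen_dir)
    if gen_str not in sys.path:
        # Generated *_pb2 modules often import each other by top-level
        # module name, so their directory has to be importable directly.
        sys.path.insert(0, gen_str)
    return gen_dir
```

Calling the helper a second time is a no-op, which keeps repeated imports of the package from growing `sys.path`.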

CI and Testing Improvements

  • Added two new GitHub Actions jobs to .github/workflows/job-checks.yml for running gRPC unit and integration tests separately, ensuring faster feedback and clearer test separation.
  • Added corresponding Makefile commands: test-grpc-unit, test-grpc-integration, and test-grpc-ocr to easily run gRPC unit, integration, and OCR tests locally.

Developer Documentation

  • Updated CONTRIBUTING.md with clear instructions and examples for running different types of tests using pytest markers, including gRPC-specific commands and explanations of what each test suite covers.

Add a gRPC transport layer that mirrors the existing REST API endpoints
for document conversion and chunking. Includes proto definitions, server
implementation with sync/async/streaming RPCs, mapping layer between
protobuf and domain types, codegen script, and tests.
Fix missing await on _check_api_key in WatchChunkHierarchicalSource and
WatchChunkHybridSource. Add missing docling_serve_types_pb2 import and
TaskNotFoundError import. Add tests for NOT_FOUND polling, both Watch
chunk RPCs, and ConvertSourceStream.
…ive mapping tests

Add unit/integration markers to all gRPC test files. Add integration
tests for ChunkHierarchicalSource and ChunkHybridSource with real
pipeline. Add API key enforcement tests for all streaming RPCs. Add
negative mapping tests for UNSPECIFIED and bogus enum values. Harden
json_content serialization to warn instead of crash on malformed data.
Stabilize integration fixtures with orchestrator cache clearing.
Split the single grpc-tests job into grpc-unit-tests (fast, no models)
and grpc-integration-tests (real pipeline, excludes OCR to avoid large
model downloads on CI). Add a testing section to CONTRIBUTING.md
documenting the unit/integration/ocr markers and example commands.
Add test-grpc-unit, test-grpc-integration, and test-grpc-ocr make
targets. Update CONTRIBUTING.md wording to clarify OCR model scope.
…ument converter

Replace JSON bridge (model_dump → ParseDict) with native field-by-field
protobuf conversion. Response messages now carry a structured
DoclingDocument proto as the primary payload, with optional serialized
exports (md/html/text/doctags/json strings) populated only when requested.
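
The "populated only when requested" behaviour can be pictured with the sketch below. It uses a plain dataclass as a stand-in for the `DocumentExports` message; the `Exports` class, the `build_exports` helper, and the format keys are assumptions for illustration, not the PR's code.

```python
from dataclasses import dataclass, fields
from typing import Dict, Iterable, Optional


@dataclass
class Exports:
    """Stand-in for the DocumentExports message; every field is optional."""

    md: Optional[str] = None
    html: Optional[str] = None
    text: Optional[str] = None
    doctags: Optional[str] = None
    json: Optional[str] = None


def build_exports(rendered: Dict[str, str], requested: Iterable[str]) -> Exports:
    """Copy only the formats the caller asked for; leave the rest unset."""
    known = {f.name for f in fields(Exports)}
    wanted = set(requested) & known
    return Exports(**{name: value for name, value in rendered.items() if name in wanted})
```

With this shape, a caller that requests no formats gets back an empty exports object and relies on the structured document alone.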

Proto changes:
- Add DocumentExports message with optional string fields
- Rework DocumentResponse/ExportDocumentResponse: doc + optional exports
- Add FineRef, TrackSource, SourceType messages to core proto
- Add custom_fields (map<string, Value>) to all meta types

Server changes:
- Always inject OutputFormat.JSON so native doc is produced
- Track requested formats per async task for export filtering
- Thread requested_formats through convert/chunk result builders

Converter (new docling_document_converter.py):
- Field-by-field converters for every DoclingDocument structure type
- Enum mapping dicts for ContentLayer, GroupLabel, DocItemLabel, etc.
- Custom fields via google.protobuf.Value maps
- Renamed _to_floating_item_fields → _to_table_item_base
- Widened _to_source_type type hint to BaseSource

Tests:
- 24 unit tests for converter (all item types, meta, custom fields, errors)
- Updated fake/integration tests for new response shape
- Added requested_output_formats mapping test
…ports

- Add schema_validator.py that compares Pydantic DoclingDocument against
  proto descriptors at server startup, failing hard on incompatible type
  mismatches and warning on missing fields
- Replace repeated int32 with proper IntSpan message for charspan and
  FineRef.range fields in proto, converter, and tests
- Teach validator structural equivalences: oneof wrappers (SourceType),
  tuple-to-message (IntSpan), and string-compatible unions (Path)
- Make gRPC exports opt-in: only include JSON/MD/HTML/text/doctags when
  explicitly requested via to_formats
- Remove exclude_none from JSON export to match REST canonical serializer
- Rename response field to .doc for clarity
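
The fail-hard/warn split in the schema validator can be sketched as follows. This is a simplified illustration that compares two flat name-to-type dictionaries instead of real Pydantic models and proto descriptors; `validate_schema`, `SchemaMismatchError`, and the type strings are hypothetical.

```python
import warnings
from typing import Dict, List, Set, Tuple

# Illustrative structural equivalences (model type, proto type).
# The type strings are assumptions for this sketch.
EQUIVALENT: Set[Tuple[str, str]] = {
    ("tuple[int, int]", "IntSpan"),
    ("Path", "string"),
}


class SchemaMismatchError(RuntimeError):
    """A field exists on both sides but with incompatible types."""


def validate_schema(model: Dict[str, str], proto: Dict[str, str]) -> List[str]:
    """Raise on incompatible types; warn about and return missing fields."""
    missing: List[str] = []
    for name, model_type in model.items():
        proto_type = proto.get(name)
        if proto_type is None:
            warnings.warn(f"proto has no field for {name!r}")
            missing.append(name)
        elif proto_type != model_type and (model_type, proto_type) not in EQUIVALENT:
            raise SchemaMismatchError(f"{name}: {model_type} vs {proto_type}")
    return missing
```

Running such a check at startup turns schema drift into an immediate failure instead of a silent data-loss bug at request time.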
Required by test_grpc_mapping.py but was missing from pyproject.toml.
- Add furniture field to DoclingDocument proto and converter
- Add all 10 picture annotation types (description, misc, classification,
  molecule, tabular chart, line/bar/stacked-bar/pie/scatter chart)
- Add 2 table annotation types (description, misc)
- Add FloatPair and StringIntPair helper messages for chart data
- Add PictureAnnotation and TableAnnotation oneof wrappers
- Update schema validator with oneof wrapper and tuple equivalences
- Add json_content naming clarification comments in mapping.py
- Optimize default gRPC path to skip unnecessary Markdown generation
- Suppress union member recursion in schema validator warnings
- Add regression tests for default-path and JSON export parity
…-loss gaps

- Add CoordOrigin and CodeLanguageLabel proto enums with *_raw fallbacks
- Add TrackSource.kind field to proto and converter (was silently dropped)
- Add TableCell.ref field for RichTableCell parity
- Raise TypeError on unknown annotation variants instead of silent empty wrapper
- Replace global _visited with path-scoped recursion stack in validator
- Add map-value recursion, cref/ref aliasing, base-field flattening
- Skip oneof wrapper descriptors and leaf messages to eliminate false warnings
- Validator now passes clean with zero missing-field warnings
- Replace ".grid" in path with _PROTO_ONLY_PREFIXES (explicit path set)
- Replace path.endswith("_raw") with _RAW_FALLBACK_SUFFIXES (named set)
- Remove dead TableRow entry from _BASE_FIELD_WRAPPERS
- Add chart_data.grid prefix for oneof wrapper member validation
- Document all suppression rulesets in grpc_schema_validation.md (§13)
- Add new-field checklist (§14) for contributors
- Update §6 to reflect IntSpan structural equivalences
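
The switch from a global `_visited` set to a path-scoped recursion stack is easiest to see on a small type graph. The sketch below is a generic illustration, not the validator's code: `walk_paths` and the example graph are hypothetical. A global visited set would skip `Ref` the second time it is reached, while a per-path stack stops only on true cycles.

```python
from typing import Dict, List, Tuple


def walk_paths(graph: Dict[str, List[str]], root: str) -> List[str]:
    """Return every field path reachable from `root`, stopping only on cycles."""
    paths: List[str] = []

    def visit(node: str, stack: Tuple[str, ...]) -> None:
        if node in stack:
            # Cycle along this path only; other paths may still reach `node`.
            return
        stack = stack + (node,)
        paths.append(".".join(stack))
        for child in graph.get(node, []):
            visit(child, stack)

    visit(root, ())
    return paths


# Ref is reachable through both Text and Table, and points back to Doc.
graph = {"Doc": ["Text", "Table"], "Text": ["Ref"], "Table": ["Ref"], "Ref": ["Doc"]}
print(walk_paths(graph, "Doc"))
# → ['Doc', 'Doc.Text', 'Doc.Text.Ref', 'Doc.Table', 'Doc.Table.Ref']
```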
…ound-trip test

- HealthResponse now returns package version
- ConvertSource validates non-empty sources list
- ConvertSourceStream comment clarified (single response when complete)
- Docs reorganized into docs/grpc/ directory
- Added round-trip test for DoclingDocument proto conversion
- Added health version and empty sources integration tests
- Proto stubs regenerated for HealthResponse version field
- TestRoundTrip class with 10 tests covering all major document features
- Rich document JSON export round-trip (furniture, texts, tables, pictures, kv, groups)
- Native proto field verification on rich document with furniture assertions
- Custom fields round-trip (Struct encoding and JSON deserialization)
- CoordOrigin enum round-trip and proto mapping
- FormItem with graph data round-trip
- Empty document round-trip (edge case)
- Furniture tree with page-header child properly validated
- to_task_sources now raises ValueError on Source with no oneof variant
  instead of silently dropping it
- Server _parse_sources helper catches ValueError and empty lists,
  returning INVALID_ARGUMENT for all 9 RPC call sites
- Added mapping tests: empty oneof, mixed-with-empty oneof
- Added fake service tests: no-variant rejection (unary + streaming),
  HttpSource, S3Source, mixed file+http end-to-end
- Enable grpc_reflection for grpcurl/client service discovery
- Set max send/receive message size to 2GB for large documents
- Bump grpcio>=1.78.0, grpcio-reflection>=1.78.0, protobuf>=4.29.6
- Add Python e2e gRPC client (tests/e2e_grpc_client.py) covering
  health, convert, stream, watch, async workflow, JSON/MD export,
  error rejection, clear operations (26 assertions)
- Add grpcurl e2e shell script (tests/e2e_grpcurl.sh)
- Run ruff format across all files
- Exclude .sh files from ruff linting
- Add E402 ignore for e2e_grpc_client.py (sys.path before import)
- Fix protobuf descriptor pool import ordering in e2e_grpc_client.py
  (struct.proto must be loaded before docling_document.proto)
- Add ChunkHierarchicalSource and ChunkHybridSource E2E tests
- Fix $B64 undefined variable bug in e2e_grpcurl.sh test 8 (use jq)
- Update docs/grpc/README.md with E2E testing section
- Remove stale E402 per-file-ignore from pyproject.toml
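
The source-validation change (raise `ValueError` on a source with no variant, then map that and empty lists to `INVALID_ARGUMENT` in one helper) might look roughly like this. The sketch uses plain dictionaries instead of protobuf messages and a string instead of `grpc.StatusCode.INVALID_ARGUMENT`; the function bodies and variant names are assumptions.

```python
from typing import Dict, List, Optional, Tuple

# Stand-in for grpc.StatusCode.INVALID_ARGUMENT so the sketch has no dependencies.
INVALID_ARGUMENT = "INVALID_ARGUMENT"

VARIANTS = ("file", "http", "s3")  # hypothetical oneof member names


def to_task_sources(sources: List[Dict[str, object]]) -> List[Tuple[str, object]]:
    """Map each source to (variant, payload); reject sources with no variant set."""
    result: List[Tuple[str, object]] = []
    for index, source in enumerate(sources):
        set_variants = [name for name in VARIANTS if source.get(name) is not None]
        if not set_variants:
            raise ValueError(f"sources[{index}] has no variant set")
        result.append((set_variants[0], source[set_variants[0]]))
    return result


def parse_sources(
    sources: List[Dict[str, object]],
) -> Tuple[Optional[List[Tuple[str, object]]], Optional[Tuple[str, str]]]:
    """Return (parsed, None) on success or (None, (status, message)) on bad input."""
    if not sources:
        return None, (INVALID_ARGUMENT, "at least one source is required")
    try:
        return to_task_sources(sources), None
    except ValueError as exc:
        return None, (INVALID_ARGUMENT, str(exc))
```

Keeping the check in one helper means every RPC that accepts sources rejects bad input the same way.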
@github-actions
Contributor

github-actions bot commented Feb 20, 2026

DCO Check Passed

Thanks @krickert, all your commits are properly signed off. 🎉

@krickert krickert marked this pull request as draft February 20, 2026 17:43
@mergify

mergify bot commented Feb 20, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

@dosubot

dosubot bot commented Feb 20, 2026

Related Documentation

3 document(s) may need updating based on files changed in this PR:

Docling

How can I set up and run docling-serve on a MacBook Pro using Docker, and what performance and stability considerations should I be aware of?
View Suggested Changes
@@ -24,6 +24,26 @@
    - To use more CPU threads for a single document, add `-e DOCLING_NUM_THREADS=4` (or up to your CPU core count).
 
 5. **Access the UI**: Open [http://localhost:5001](http://localhost:5001) in your browser.
+
+**Choosing Between REST and gRPC:**
+
+Docling Serve supports both REST and gRPC protocols. Users can choose the protocol that best fits their application:
+
+- **REST API (default, port 5001)**: Uses JSON over HTTP/1.1. Supports multipart/form-data file uploads and provides a web UI for interactive use. This is the standard option and runs by default when starting the container.
+
+- **gRPC API (port 50051)**: Uses Protocol Buffers over HTTP/2. File uploads use base64 encoding in the `FileSource` message instead of multipart/form-data. gRPC provides server-streaming RPCs for status updates and is suitable for language-agnostic service-to-service communication.
+
+To run the gRPC server instead of REST using Docker:
+```sh
+docker run -p 50051:50051 \
+  -v $(pwd)/models:/opt/app-root/src/models \
+  -e DOCLING_SERVE_ARTIFACTS_PATH="/opt/app-root/src/models" \
+  quay.io/docling-project/docling-serve -- docling-serve-grpc run --host 0.0.0.0 --port 50051
+```
+
+The gRPC and REST servers run independently. If both protocols are needed, run them as separate containers on different ports.
+
+**Note:** gRPC support is experimental. File uploads in gRPC use base64-encoded content in the `FileSource` message rather than multipart/form-data. Refer to the gRPC documentation for details on protobuf schemas and client implementation.
 
 **Performance:**
 - On a modern MacBook Pro (M1/M2, 12 cores, 32GB RAM), expect to process about 5 PDFs (127 pages, 6.2MB total) in roughly 10 minutes (CPU-only). Increasing CPU cores, RAM, or worker count does not significantly improve throughput due to Python's concurrency limitations.
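
The base64 file upload mentioned in the suggested change above can be sketched on the client side as follows. The dictionary keys `filename` and `base64_string` are assumptions standing in for the real `FileSource` fields, which are defined in the proto files and not shown here.

```python
import base64
from typing import Dict


def make_file_source(filename: str, content: bytes) -> Dict[str, str]:
    """Build a FileSource-like payload with base64-encoded content.

    The key names are hypothetical; check the proto definition for the real ones.
    """
    encoded = base64.b64encode(content).decode("ascii")
    return {"filename": filename, "base64_string": encoded}
```

Base64 grows the payload by about one third, which is worth keeping in mind alongside the configured gRPC message size limit.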


How can you mount a PersistentVolumeClaim (PVC) in docling-serve on OpenShift to store EasyOCR models, and what steps are required to ensure docling-serve can access these models?
View Suggested Changes
@@ -1,4 +1,8 @@
-To mount a PVC in docling-serve on OpenShift for EasyOCR models, you need to edit the Deployment manifest directly (there is no Operator-specific CRD for this). Add the PVC as a volume and mount it in the container, then set the `DOCLING_SERVE_ARTIFACTS_PATH` environment variable to the mount path. Here is an example snippet:
+To mount a PVC in docling-serve on OpenShift for EasyOCR models, you need to edit the Deployment manifest directly (there is no Operator-specific CRD for this). Add the PVC as a volume and mount it in the container, then set the `DOCLING_SERVE_ARTIFACTS_PATH` environment variable to the mount path.
+
+**Note:** For the gRPC server, you can also configure the artifacts path using the `--artifacts-path` CLI argument instead of the environment variable. Both methods work with PVC mounts—the path must point to the mounted volume location. The environment variable approach works for both REST and gRPC servers.
+
+Here is an example snippet:
 
 ```yaml
 spec:
@@ -59,6 +63,16 @@
 
 Ensure your EasyOCR models are present in `/modelcache` (or your chosen path) and the directory structure matches what docling expects. If a required model is missing, docling-serve will raise a runtime error. It's recommended to preload models into the PVC using a Kubernetes Job before starting docling-serve. For more details, see the [official docling-serve documentation](https://github.com/docling-project/docling-serve/blob/a179338c785ef9b84696f41b7ab2f2cafe80973d/docs/models.md#L7-L173).
 
+**Example for gRPC server with custom artifacts path:**
+
+If running the gRPC server and using a custom mount path, you can specify it via the CLI:
+
+```sh
+docling-serve-grpc run --artifacts-path /modelcache --host 0.0.0.0 --port 50051
+```
+
+This is equivalent to setting `DOCLING_SERVE_ARTIFACTS_PATH=/modelcache` and works with the same PVC mount configuration shown above.
+
 ## Health probe configuration
 
 The deployment manifest includes health probes to ensure docling-serve is fully ready before accepting traffic:


Models handling in Docling Serve
View Suggested Changes
@@ -4,7 +4,29 @@
 With Docling v2.56.x and later, EasyOCR models are now included in the default set of models downloaded by the auto-ocr feature. You no longer need to explicitly request EasyOCR models unless you are customizing the download set.
 
 ## Model Storage Location
-Docling Serve loads models from the directory specified by the `DOCLING_SERVE_ARTIFACTS_PATH` environment variable. This path must be consistent across model download and runtime. When running with multiple workers or reload enabled, you must use the environment variable rather than the CLI argument for configuration [[source]](https://github.com/docling-project/docling-serve/blob/fd1b987e8dc174f1a6013c003dde33e9acbae39a/docling_serve/settings.py).
+Docling Serve loads models from the directory specified by the `DOCLING_SERVE_ARTIFACTS_PATH` environment variable. This path must be consistent across model download and runtime.
+
+### Configuration Options
+
+**Environment Variable (REST and gRPC)**
+
+Set `DOCLING_SERVE_ARTIFACTS_PATH` to configure the artifacts path for both REST and gRPC servers:
+
+```sh
+export DOCLING_SERVE_ARTIFACTS_PATH=/path/to/models
+```
+
+This is the recommended approach for production deployments and is required when running the REST server with multiple workers or reload enabled [[source]](https://github.com/docling-project/docling-serve/blob/fd1b987e8dc174f1a6013c003dde33e9acbae39a/docling_serve/settings.py).
+
+**CLI Argument (gRPC Only)**
+
+When running the gRPC server via `docling-serve-grpc`, you can also configure the artifacts path using the `--artifacts-path` CLI argument:
+
+```sh
+docling-serve-grpc run --artifacts-path /path/to/models --host 0.0.0.0 --port 50051
+```
+
+The CLI argument is specific to the gRPC server entrypoint and provides a convenient way to set the path without environment variables. For REST server deployments with multiple workers or reload enabled, use the environment variable instead.
 
 ## Approaches for Making Extra Models Available
 There are several ways to ensure required models are present:
@@ -197,7 +219,7 @@
 ## Troubleshooting and Best Practices
 - If a required model is missing from the artifacts path, Docling Serve will raise a runtime error.
 - Always ensure the value of `DOCLING_SERVE_ARTIFACTS_PATH` matches the directory where models are stored and mounted.
-- For multi-worker or reload scenarios, use the environment variable, not the CLI argument, to set the artifacts path.
+- For REST server deployments with multiple workers or reload enabled, use the environment variable to set the artifacts path. For gRPC server deployments, you can use either the environment variable or the `--artifacts-path` CLI argument.
 - For production and cluster environments, prefer persistent storage and pre-loading models via a dedicated job.
 - EasyOCR models are now included by default in auto-ocr; explicit inclusion is only needed for custom workflows.
 - Use the `/ready` endpoint for startupProbe and readinessProbe to prevent traffic before models are loaded and dependencies are available.


I, Kristian Rickert <krickert@gmail.com>, hereby add my Signed-off-by to this commit: 64b1eed
I, Kristian Rickert <krickert@gmail.com>, hereby add my Signed-off-by to this commit: c86db23
I, Kristian Rickert <krickert@gmail.com>, hereby add my Signed-off-by to this commit: bee2d7a
I, Kristian Rickert <krickert@gmail.com>, hereby add my Signed-off-by to this commit: c3a431c
I, Kristian Rickert <krickert@gmail.com>, hereby add my Signed-off-by to this commit: b43a3a8
I, Kristian Rickert <krickert@gmail.com>, hereby add my Signed-off-by to this commit: bc87f01
I, Kristian Rickert <krickert@gmail.com>, hereby add my Signed-off-by to this commit: 2c9a19e
I, Kristian Rickert <krickert@gmail.com>, hereby add my Signed-off-by to this commit: 184f094
I, Kristian Rickert <krickert@gmail.com>, hereby add my Signed-off-by to this commit: d0bc469
I, Kristian Rickert <krickert@gmail.com>, hereby add my Signed-off-by to this commit: 5999b4b
I, Kristian Rickert <krickert@gmail.com>, hereby add my Signed-off-by to this commit: 5884481
I, Kristian Rickert <krickert@gmail.com>, hereby add my Signed-off-by to this commit: e5139a4
I, Kristian Rickert <krickert@gmail.com>, hereby add my Signed-off-by to this commit: da53c79
I, Kristian Rickert <krickert@gmail.com>, hereby add my Signed-off-by to this commit: fdf0563
I, Kristian Rickert <krickert@gmail.com>, hereby add my Signed-off-by to this commit: 4c84538
I, Kristian Rickert <krickert@gmail.com>, hereby add my Signed-off-by to this commit: 52120e1
I, Kristian Rickert <krickert@gmail.com>, hereby add my Signed-off-by to this commit: 4abe905

Signed-off-by: Kristian Rickert <krickert@gmail.com>
@krickert krickert changed the title from "Grpc native converter" to "(feat) Grpc native converter" Feb 20, 2026
@krickert
Author

@dolfim-ibm this is the first attempt at this. It's fully featured now, and this is what I'll run the 768/10K common crawl PDF test against.

I will look into how to move parts of this to the docling-core for the spec and grpc stubs.

@krickert krickert changed the title from "(feat) Grpc native converter" to "feat: Grpc native converter" Feb 23, 2026
I, Kristian Rickert <krickert@gmail.com>, hereby add my Signed-off-by to this commit: c817b49
I, Kristian Rickert <krickert@gmail.com>, hereby add my Signed-off-by to this commit: c33b3a2

Signed-off-by: Kristian Rickert <krickert@gmail.com>
@krickert krickert marked this pull request as ready for review March 16, 2026 02:29
@krickert
Author

This is tied with docling-project/docling-core#546 now - the protobuf definitions and mappers have been moved there while the server functionality is here.
