feat: add gRPC protobuf definitions and conversion utility #546
krickert wants to merge 4 commits into docling-project:main
|
✅ DCO Check Passed Thanks @krickert, all your commits are properly signed off. 🎉 |
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🔴 Require two reviewer for test updates
This rule is failing. When test data is updated, we require two reviewers.
🟢 Enforce conventional commit
Wonderful, this rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
|
I, Kristian Rickert <krickert@gmail.com>, hereby add my Signed-off-by to this commit: e233312 Signed-off-by: Kristian Rickert <krickert@gmail.com>
|
Docling Team, I have updated this implementation and moved the core definitions and mapping here. I debated keeping the mapping in docling-serve, but I feel this might be the best place since it's so model-specific. Specifically, I’ve moved the Protobuf definitions and the Pydantic-to-Proto conversion logic into
I am currently running this implementation through a stress test of 80k+ PDFs over the coming week to verify stability at scale. I would love to start a discussion on how we can expand this into native streaming functionality, and in the meantime I’d be happy to contribute language-specific tutorials. Looking forward to your feedback... you built an incredible product and I'm happy to contribute. |
|
This merge is tied together with docling-project/docling-serve#504 - this is the model definition and mapping while the other project is the gRPC server option. |
|
Bump. Is this enough to start a review? Just an initial pass. I can convert to draft if we need a few rounds of discussion. On my side, I'm getting ready to test this against a large corpus from Common Crawl. @dolfim-ibm |
|
@dolfim-ibm I updated the latest commit with main to keep it up to date. The latest protobufs were synced (the gRPC server was working without them, but the new model changes have been added and properly mapped). |
feat: add gRPC protobuf definitions and Pydantic conversion utility
Description
This PR introduces official Protocol Buffer definitions for the DoclingDocument model and a high-performance conversion utility to map between Docling's Pydantic models and Protobuf representations.
By moving the Protobuf source of truth into docling-core, we enable:
Key Changes
1. Protobuf Definitions (/proto)
ai/docling/core/v1/docling_document.proto.
DoclingDocument Pydantic model, including:
field_regions, field_items, field_heading, and field_value.
2. Conversion Utility (docling_core/utils/conversion.py)
docling_document_to_proto: A surgical, field-by-field mapper.
google.protobuf.Struct for custom metadata.
DocItemLabel, GroupLabel, and CoordOrigin.
3. Tooling & Dependencies
protobuf as a core dependency.
grpcio-tools to the dev dependency group for local development.
scripts/gen_proto.py to automate code generation using uv.
buf linting and formatting standards.
Validation Performed
Unit Tests
test/test_proto_conversion.py to verify:
_root_).
Integration Testing (via docling-serve)
docling-serve gRPC suite.
docling-serve startup schema validator, ensuring 100% parity between Pydantic and Proto schemas.
Related Issues/PRs
docling-serve PR fix(Doclang): fix image URI serialization #504 (feat: Grpc native converter).
Protocol buffer integration:
docling_core/proto/__init__.py to centralize imports for DoclingDocument protocol buffer definitions and conversion utilities.
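For reviewers who want a feel for the "field-by-field mapper" idea from the Conversion Utility section, here is a minimal sketch. It is not the code in this PR. Plain dataclasses stand in for both the Pydantic models and the generated Protobuf classes, and every name in it (SourceItem, ProtoItem, Label, ProtoLabel, item_to_proto) is hypothetical. A plain dict stands in for the custom metadata that the description says is carried in google.protobuf.Struct.

```python
from dataclasses import dataclass, field
from enum import Enum, IntEnum
from typing import Any


class Label(str, Enum):
    """Hypothetical stand-in for a Pydantic-side string enum."""

    TEXT = "text"
    TITLE = "title"


class ProtoLabel(IntEnum):
    """Hypothetical stand-in for a generated proto enum (0 means unspecified)."""

    LABEL_UNSPECIFIED = 0
    LABEL_TEXT = 1
    LABEL_TITLE = 2


@dataclass
class SourceItem:
    """Hypothetical stand-in for a Pydantic document item."""

    label: Label
    text: str
    meta: dict[str, Any] = field(default_factory=dict)


@dataclass
class ProtoItem:
    """Hypothetical stand-in for a generated Protobuf message."""

    label: ProtoLabel = ProtoLabel.LABEL_UNSPECIFIED
    text: str = ""
    meta: dict[str, Any] = field(default_factory=dict)


# Explicit lookup table: a label missing from the table maps to
# LABEL_UNSPECIFIED instead of raising.
_LABEL_MAP = {
    Label.TEXT: ProtoLabel.LABEL_TEXT,
    Label.TITLE: ProtoLabel.LABEL_TITLE,
}


def item_to_proto(item: SourceItem) -> ProtoItem:
    """Copy each field explicitly rather than relying on a generic dict dump."""
    return ProtoItem(
        label=_LABEL_MAP.get(item.label, ProtoLabel.LABEL_UNSPECIFIED),
        text=item.text,
        meta=dict(item.meta),  # shallow copy so the output does not alias the input
    )
```

Writing out each assignment keeps the mapping easy to audit when the source model gains a field, at the cost of having to touch the mapper on every schema change.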