feat: add gRPC protobuf definitions and conversion utility #546

Open
krickert wants to merge 4 commits into docling-project:main from ai-pipestream:feat/add-protobuf

@krickert

feat: add gRPC protobuf definitions and Pydantic conversion utility

Description

This PR introduces official Protocol Buffer definitions for the DoclingDocument model and a high-performance conversion utility to map between Docling's Pydantic models and Protobuf representations.

By moving the Protobuf source of truth into docling-core, we enable:

  1. Cross-language support: Standardized schema for clients in Go, Java, Rust, etc.
  2. Efficient Serialization: Significant reduction in payload size and faster (de)serialization compared to JSON.
  3. Architectural Decoupling: Separation of the document schema from the transport layer (docling-serve).

Key Changes

1. Protobuf Definitions (/proto)

  • Added ai/docling/core/v1/docling_document.proto.
  • Fully mirrors the DoclingDocument Pydantic model, including:
    • Text items (Titles, Headers, Paragraphs, etc.).
    • Structured items (Tables, Pictures, Key-Value pairs).
    • Metadata (Provenance, Bounding Boxes, Image references).
    • New field types: field_regions, field_items, field_heading, and field_value.

2. Conversion Utility (docling_core/utils/conversion.py)

  • Implemented docling_document_to_proto: A surgical, field-by-field mapper.
  • Handles complex types like google.protobuf.Struct for custom metadata.
  • Validates enum mappings for DocItemLabel, GroupLabel, and CoordOrigin.
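
As a rough illustration of what validating an enum mapping can look like, here is a minimal standard-library sketch. It assumes the proto enum follows the buf lint convention of prefixing every value with the upper-snake-case enum name and reserving zero for an `_UNSPECIFIED` value. `Label`, `PROTO_DOC_ITEM_LABEL`, and `check_enum_parity` are hypothetical stand-ins for illustration, not names taken from this PR.

```python
from enum import Enum


class Label(str, Enum):
    """Hypothetical stand-in for a Pydantic-side enum such as DocItemLabel."""

    TITLE = "title"
    SECTION_HEADER = "section_header"
    TEXT = "text"


# Hypothetical stand-in for a generated proto enum's name -> number table.
# TEXT is deliberately left out so the check has something to report.
PROTO_DOC_ITEM_LABEL = {
    "DOC_ITEM_LABEL_UNSPECIFIED": 0,
    "DOC_ITEM_LABEL_TITLE": 1,
    "DOC_ITEM_LABEL_SECTION_HEADER": 2,
}


def check_enum_parity(py_enum, proto_values, prefix):
    """Return the names of Python enum members with no matching proto value."""
    missing = []
    for member in py_enum:
        if f"{prefix}_{member.name}" not in proto_values:
            missing.append(member.name)
    return missing


missing_labels = check_enum_parity(Label, PROTO_DOC_ITEM_LABEL, "DOC_ITEM_LABEL")
```

Here `missing_labels` contains `"TEXT"`, the one member without a proto counterpart. Failing fast on a non-empty result is one way to keep a new label from being silently dropped.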

3. Tooling & Dependencies

  • Added protobuf as a core dependency.
  • Added grpcio-tools to the dev dependency group for local development.
  • Added scripts/gen_proto.py to automate code generation using uv.
  • Integrated buf linting and formatting standards.
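
For readers unfamiliar with grpcio-tools, code generation usually comes down to one `grpc_tools.protoc` invocation. The sketch below only builds that command line and does not reflect the actual contents of scripts/gen_proto.py. The `build_protoc_command` helper and the `docling_core/proto/gen` output directory are assumptions made for illustration.

```python
import sys
from pathlib import Path


def build_protoc_command(proto_root, out_dir, proto_files):
    """Build a grpc_tools.protoc command line without running it."""
    return [
        sys.executable,
        "-m",
        "grpc_tools.protoc",
        f"-I{proto_root}",          # import root for resolving .proto paths
        f"--python_out={out_dir}",  # where the generated *_pb2.py files go
        *[str(Path(proto_root) / name) for name in proto_files],
    ]


cmd = build_protoc_command(
    "proto",
    "docling_core/proto/gen",  # hypothetical output directory
    ["ai/docling/core/v1/docling_document.proto"],
)
# To actually generate code (requires grpcio-tools to be installed):
#   import subprocess; subprocess.run(cmd, check=True)
```

Building the command separately from executing it keeps the script easy to test and to log before anything is written to disk.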

Validation Performed

Unit Tests

  • Added test/test_proto_conversion.py to verify:
    • Minimal document conversion.
    • Rich text and title mapping.
    • Consistency of default field names (e.g., _root_).

Integration Testing (via docling-serve)

  • Verified against the docling-serve gRPC suite.
  • Successfully processed the full array of standard Docling test PDFs via gRPC conversion.
  • Verified schema consistency using the docling-serve startup schema validator, ensuring 100% parity between Pydantic and Proto schemas.
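
The parity check can be pictured as a set comparison between model field names and proto field names. The sketch below uses a dataclass as a stand-in for a Pydantic model and a plain set as a stand-in for a proto descriptor; `TextItemModel`, `PROTO_TEXT_ITEM_FIELDS`, and `schema_deltas` are hypothetical, and the validator in docling-serve may work differently. With real classes, the two name sets would presumably come from Pydantic's `model_fields` and the message's `DESCRIPTOR.fields_by_name`.

```python
import logging
from dataclasses import dataclass, fields

logger = logging.getLogger("schema_check")


@dataclass
class TextItemModel:
    """Hypothetical stand-in for a Pydantic model."""

    self_ref: str
    text: str
    orig: str
    hyperlink: str = ""


# Hypothetical stand-in for the field names found in a proto descriptor.
PROTO_TEXT_ITEM_FIELDS = {"self_ref", "text", "orig", "formatting"}


def schema_deltas(model_cls, proto_fields):
    """Log and return the field names present on only one side."""
    model_fields = {f.name for f in fields(model_cls)}
    only_in_model = sorted(model_fields - proto_fields)
    only_in_proto = sorted(proto_fields - model_fields)
    for name in only_in_model:
        logger.warning("%s.%s has no proto field", model_cls.__name__, name)
    for name in only_in_proto:
        logger.warning("proto field %s is missing from %s", name, model_cls.__name__)
    return only_in_model, only_in_proto


deltas = schema_deltas(TextItemModel, PROTO_TEXT_ITEM_FIELDS)
```

Full parity corresponds to both returned lists being empty; in this toy example each side has one field the other lacks.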

Related Issues/PRs

Protocol buffer integration:

  • Created docling_core/proto/__init__.py to centralize imports for DoclingDocument protocol buffer definitions and conversion utilities.

@github-actions
Contributor

github-actions bot commented Mar 16, 2026

DCO Check Passed

Thanks @krickert, all your commits are properly signed off. 🎉

@mergify

mergify bot commented Mar 16, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

This rule is failing.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

I, Kristian Rickert <krickert@gmail.com>, hereby add my Signed-off-by to this commit: e233312

Signed-off-by: Kristian Rickert <krickert@gmail.com>
@krickert
Author

Docling Team,

I have updated this implementation and moved the core definitions and mapping into this repository. I debated keeping the mapping in docling-serve, but I feel this is the best place for it since it is so model-specific.

Specifically, I’ve moved the Protobuf definitions and the Pydantic-to-Proto conversion logic into docling-core to ensure that the document schema remains the single source of truth for all downstream services.

  • Protobuf Definitions: The schemas strictly follow buf lint conventions and are designed as 1:1 mirrors of the current Pydantic models. I have been very careful to ensure full field parity, and the definitions work with all standard protobuf tools (buf is used only for linting because it has the strictest lint rules).
  • Dynamic Schema Validation: At startup, the server crawls the Pydantic models and validates them against the Protobuf descriptors. Any deltas are logged as warnings, ensuring we never silently drift out of sync.
  • Resilient Mapping: If new fields are added to the Pydantic models during development, the server gracefully maps them to a custom_fields section, ensuring the gRPC server remains operational while the Proto is being updated.
  • Extensive Validation: I have verified this implementation using both Python and Java clients to ensure the generated stubs are idiomatic and performant across languages. I intend to try more... because why not?
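
To make the custom_fields fallback concrete, here is a minimal sketch of the idea: keys the proto schema knows about are mapped normally, and everything else is routed into a catch-all bucket instead of raising. `split_known_fields` and the sample data are hypothetical and are not the server's actual code.

```python
def split_known_fields(data, proto_fields):
    """Separate keys the proto schema knows about from everything else."""
    known = {k: v for k, v in data.items() if k in proto_fields}
    custom = {k: v for k, v in data.items() if k not in proto_fields}
    return known, custom


# "new_flag" plays the role of a field added to the Pydantic model
# before the .proto file has been updated.
dumped_model = {"text": "Hello", "orig": "Hello", "new_flag": True}
known, custom = split_known_fields(dumped_model, {"text", "orig"})
```

In this example `known` holds `text` and `orig`, while `custom` holds only `new_flag`. Combined with the startup warnings, the unknown field stays visible without taking the server down.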

I am currently running this implementation through a stress test of 80k+ PDFs over the coming week to verify stability at scale. I would love to start a discussion on how we can expand this into native streaming functionality, and in the meantime I’d be happy to contribute language-specific tutorials.

Looking forward to your feedback... you built an incredible product and I'm happy to contribute.

@krickert
Author

This PR is tied to docling-project/docling-serve#504: this one contains the model definition and mapping, while the other adds the gRPC server option.

@krickert
Author

Bump. Is this enough to start a review? Just an initial pass. I can convert this to a draft if we need a few rounds of discussion. On my side, I'm getting ready to test this against a large corpus from Common Crawl. @dolfim-ibm

@krickert
Author

@dolfim-ibm I updated the branch with the latest main to keep it up to date. The protobufs are now synced: the new model changes have been added and properly mapped (the gRPC server was working without them, though).
