Conversation
Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewers for test updates. This rule is failing. When test data is updated, we require two reviewers.

🟢 Enforce conventional commit. Wonderful, this rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
Related Documentation

1 document(s) may need updating based on files changed in this PR:

Docling: How can I find page numbers and bounding box information for content in a chunk produced by the hybrid chunker, and what is the structure of the `doc_items` list within a chunk?

Suggested changes:

@@ -41,6 +41,73 @@
)
```
+## Chunk Expansion Methods
+
+The `DocChunk` class provides three methods to expand chunks to include additional context from the document:
+
+### get_top_containing_objects(doc: DoclingDocument) -> list[DocItem] | None
+
+Traverses the document tree upward from the chunk's items to find the top-level containing objects (direct children of the document body). This method maintains the reading order as in the original document.
+
+**Returns:** A list of top-level `DocItem` objects that contain the chunk's items, or `None` if no top objects are found.
+
+**Example:**
+```python
+top_objects = chunk.get_top_containing_objects(dl_doc)
+if top_objects:
+    for obj in top_objects:
+        print(f"Top object: {obj.label}")
+```
+
+### expand_to_object(dl_doc: DoclingDocument, serializer: BaseDocSerializer) -> DocChunk
+
+Expands a chunk to include complete document objects (e.g., full sections, tables, lists) that contain the chunk's items. This prevents partial object inclusion in chunks.
+
+This method retrieves the top-level containing objects using `get_top_containing_objects()`, serializes them completely, and returns a new `DocChunk` with the expanded content. If expansion fails, the original chunk is returned.
+
+**Use case:** When you need to ensure you have complete objects rather than partial content (e.g., the full table instead of just a few rows, or an entire paragraph instead of a fragment).
+
+**Example:**
+```python
+from docling_core.transforms.chunker import HybridChunker
+
+# Create chunks using the hybrid chunker
+chunker = HybridChunker(max_tokens=512)
+chunks = list(chunker.chunk(dl_doc=dl_doc))
+
+# Expand a chunk to include complete objects
+serializer = chunker.serializer_provider.get_serializer(dl_doc)
+expanded_chunk = chunks[0].expand_to_object(dl_doc=dl_doc, serializer=serializer)
+
+print(f"Original text length: {len(chunks[0].text)}")
+print(f"Expanded text length: {len(expanded_chunk.text)}")
+```
+
+### expand_to_page(doc: DoclingDocument, serializer: BaseDocSerializer) -> DocChunk | None
+
+Expands a chunk to include all content from the page(s) where the chunk's items appear. This method uses the `page_no` from the provenance information (`item.prov`) to identify which pages to include, then serializes all content from those pages.
+
+If the document has no pages or page information is unavailable, the original chunk is returned.
+
+**Use case:** When you want page-level context for a chunk, such as retrieving all content from the same page(s) as the chunk.
+
+**Example:**
+```python
+from docling_core.transforms.chunker import HybridChunker
+
+# Create chunks
+chunker = HybridChunker(max_tokens=512)
+chunks = list(chunker.chunk(dl_doc=dl_doc))
+
+# Expand a chunk to include full page content
+serializer = chunker.serializer_provider.get_serializer(dl_doc)
+page_expanded_chunk = chunks[0].expand_to_page(doc=dl_doc, serializer=serializer)
+
+# Check which pages are included (guard against a None return)
+if page_expanded_chunk is not None:
+    page_ids = [
+        prov.page_no
+        for item in page_expanded_chunk.meta.doc_items
+        for prov in item.prov
+    ]
+    print(f"Pages included: {set(page_ids)}")
+```
+
**Limitations:**
- Line numbers are not available.
- DOCX files do not provide page or bounding box metadata; convert to PDF if you need this data.
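As a rough illustration of what `get_top_containing_objects` is described to do, here is a standalone sketch using a toy tree. The `Node` class and `top_containing` function are plain-Python stand-ins invented for this example, not the real docling-core types or implementation:

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity-based equality avoids recursive comparisons
class Node:
    """Toy stand-in for a DocItem/GroupItem; not a real docling-core type."""
    name: str
    parent: "Node | None" = None
    children: list["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


def top_containing(body: Node, items: list[Node]) -> list[Node]:
    """Walk each item's parent chain up to a direct child of `body`,
    deduplicating while preserving first-seen (reading) order."""
    tops: list[Node] = []
    for item in items:
        node = item
        while node.parent is not None and node.parent is not body:
            node = node.parent
        if node.parent is body and node not in tops:
            tops.append(node)
    return tops


# Build: body -> (section -> (para, list_group -> li1), table)
body = Node("body")
section = body.add(Node("section"))
para = section.add(Node("para"))
group = section.add(Node("list_group"))
li1 = group.add(Node("li1"))
table = body.add(Node("table"))

# A list item and a paragraph both collapse to their shared section;
# the table is already a direct child of body.
print([n.name for n in top_containing(body, [li1, para, table])])  # → ['section', 'table']
```

This also shows why reading order is preserved: top-level ancestors are appended in the order their first descendant appears in the chunk.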
✅ DCO Check Passed: Thanks @odelliab, all your commits are properly signed off. 🎉
I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to these commits: 5cc61d9, 91b43f9, 5d17bda, a50392e, e589429, 30c72a9

I, odelliab <91875866+odelliab@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 6aa0019

Signed-off-by: odelliab <odelliab@il.ibm.com>
@odelliab Thanks for suggesting this implementation!
I left some technical comments, but I also wanted to share some high level comments:
Please, sync your fork, rebase to main, and resolve the current conflicts.

All good! It was a temporary issue with Github's CI/CD.

Can we recap the use case to build those expansion functions? I understood that we want to expand the context for the answer generation in RAG from the chunks obtained in the retrieval phase. If this is the case, do we really need to return a chunk with those functions? Or would text do the job? Because I see some drawbacks in this implementation choice:

- We serialize again a document (or part of the document) that was already serialized, since the starting point is a chunk (presumably obtained by a chunker with a serializer).
- The `expand_to_page` and `expand_to_object` methods trigger a new serialization that could be different from the one that generated the original chunk. What would be the consequences? What happens if, for instance, we run those functions with an HTML serializer on a chunk created using the default serializer (`ChunkingDocSerializer` with a markdown serialization)?
- If the given chunk was generated by a `HybridChunker` that respects a maximum token length, wouldn't it be logical that the new chunk from `expand_to_object` also respects that condition?
- In general, we detach the chunk from the chunker that originated it.

Aren't the chunk and the Docling document enough for those expansions? Why do we need a serializer? We can locate the necessary items and then concatenate the text as we do it in the `BaseChunker.contextualize` function. But maybe I'm missing a use case that would require building a new chunk.

I would use the word `item` instead of `object` in the new method names, since that's the terminology we used in the Docling document (`TextItem`, `DocItem`, `ListItem`, ...).

Just wondering if we really need 15 tests (thinking of keeping the `pre-commit` checks agile). Also, I would expect that we really check the content of the expanded chunks, to verify that they cover the expected doc items. For instance, I don't see a test that checks that the output of `expand_to_page` contains all the doc items of the page or that the `text` field is exactly what we expect to get.

Can we reuse artifacts across the tests, like the `HybridChunker`?

We could also let other users know about the potential of these expansions for RAG, by adding a small section in https://docling-project.github.io/docling/examples/advanced_chunking_and_serialization/ (in another PR on the docling library).
```python
doc_items = []
for top_object in top_objects:
    if isinstance(top_object, ListGroup | InlineGroup | DocItem):
```
At this point `top_object` will always be a `DocItem`. Is this a wrong type hint, or maybe you were expecting to serialize groups? In that case you may need to review `get_top_containing_objects`.

Not sure why this is the case. A `DocItem` can be part of a `GroupItem` that resides in `doc.body`, no?
See for instance `test/data/chunker/0_inp_dl_doc.json`: `#/texts/113` -> `#/groups/5` -> `body`.
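The parent chain mentioned in the reply can be pictured with a tiny standalone sketch. The `parents` dict is toy data modeled loosely on the JSON-pointer layout of `0_inp_dl_doc.json`, not the real file contents or the docling-core API:

```python
# Parent references as JSON-pointer strings (toy data for illustration).
parents = {
    "#/texts/113": "#/groups/5",  # a text item nested inside a group ...
    "#/groups/5": "#/body",       # ... where the group sits directly in body
    "#/texts/7": "#/body",        # a text item directly in body
}


def ancestry(ref: str) -> list[str]:
    """Follow parent refs until reaching a node with no recorded parent."""
    chain = [ref]
    while chain[-1] in parents:
        chain.append(parents[chain[-1]])
    return chain


# A DocItem can indeed sit inside a GroupItem that itself resides in doc.body:
print(ancestry("#/texts/113"))  # → ['#/texts/113', '#/groups/5', '#/body']
print(ancestry("#/texts/7"))    # → ['#/texts/7', '#/body']
```

This is why an upward traversal has to stop at the direct child of `body` rather than assuming every chunk item's immediate parent is `body`.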
Codecov Report: ❌ Patch coverage is
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: odelliab <91875866+odelliab@users.noreply.github.com>
Overview
This PR introduces chunk expansion functionality to the `DocChunk` class, enabling chunks to be expanded to include complete document objects or full pages. It also includes a comprehensive test suite with 15 test cases.

Issue #542
Features Added
1. Chunk Expansion Methods

Three new methods added to the `DocChunk` class:

- `get_top_containing_objects(doc: DoclingDocument) -> list[DocItem] | None` (returns `None` if no top objects are found)
- `expand_to_object(dl_doc: DoclingDocument, serializer: BaseDocSerializer) -> DocChunk`
- `expand_to_page(doc: DoclingDocument, serializer: BaseDocSerializer) -> DocChunk | None`

2. Comprehensive Test Suite
Created `test/test_doc_chunk_expansion.py` with 15 test cases.

Dependencies
No new dependencies added.
Note: In the current implementation the expanded chunk has the same headings as the original chunk. This may need revisiting.
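The page-level expansion described for `expand_to_page` can be pictured with a small standalone sketch. Items are represented here as `(name, page_numbers)` tuples standing in for doc items and their provenance; these helpers are invented for illustration and are not the library API:

```python
def pages_touched(chunk_items: list[tuple[str, list[int]]]) -> set[int]:
    """Collect the page numbers from the chunk items' provenance."""
    return {p for _, pages in chunk_items for p in pages}


def expand_to_pages(
    all_items: list[tuple[str, list[int]]],
    chunk_items: list[tuple[str, list[int]]],
) -> list[str]:
    """Return, in reading order, every item that appears on any page the
    chunk touches (mirroring the described expand_to_page behaviour)."""
    target = pages_touched(chunk_items)
    return [name for name, pages in all_items if target.intersection(pages)]


all_items = [
    ("title", [1]),
    ("para1", [1]),
    ("table", [1, 2]),  # spans a page break
    ("para2", [2]),
    ("para3", [3]),
]
chunk = [("para2", [2])]
print(expand_to_pages(all_items, chunk))  # → ['table', 'para2']
```

Note that an item spanning a page break (the table above) is pulled in as soon as any of its pages matches, which is consistent with using per-provenance `page_no` values rather than a single page per item.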