
feat: DocChunk expansion#549

Open
odelliab wants to merge 21 commits into docling-project:main from odelliab:chunk_expansion

Conversation

@odelliab
Contributor

@odelliab odelliab commented Mar 16, 2026

Overview

This PR introduces chunk expansion functionality to the DocChunk class, enabling chunks to be expanded to include complete document objects or full pages. It also includes a comprehensive test suite with 15 test cases.
Issue #542

Features Added

1. Chunk Expansion Methods

Three new methods added to DocChunk class:

get_top_containing_objects(doc: DoclingDocument) -> list[DocItem] | None

  • Traverses the document tree upward from chunk items to find top-level containing objects
  • Maintains reading order as in the original document
  • Returns None if no top objects found
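
The traversal idea can be illustrated with a minimal self-contained sketch (toy classes, not the actual Docling API; `top_containing` is a hypothetical name): walk each chunk item up to the node whose parent is the document body, de-duplicating while keeping first-seen (reading) order.

```python
# Toy sketch of the upward traversal, not the real Docling implementation.
class Node:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent

def top_containing(items, body):
    seen, tops = set(), []
    for item in items:
        # climb until the parent is the document body
        node = item
        while node.parent is not None and node.parent is not body:
            node = node.parent
        # keep each top-level ancestor once, in reading order
        if node.parent is body and node.name not in seen:
            seen.add(node.name)
            tops.append(node)
    return tops or None

body = Node("body")
section = Node("section", parent=body)
para1 = Node("para1", parent=section)
para2 = Node("para2", parent=section)
table = Node("table", parent=body)

assert [n.name for n in top_containing([para1, para2, table], body)] == ["section", "table"]
```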

expand_to_object(dl_doc: DoclingDocument, serializer: BaseDocSerializer) -> DocChunk

  • Expands chunk to include complete top-level document objects (sections, tables, lists)
  • Prevents partial object inclusion in chunks
  • Returns original chunk if expansion fails

expand_to_page(doc: DoclingDocument, serializer: BaseDocSerializer) -> DocChunk | None

  • Expands chunk to include all content from the pages it spans
  • Useful for maintaining page-level context
  • Returns original chunk if expansion not possible
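
The page-expansion idea, sketched with plain dictionaries (toy structures, not the Docling API; `expand_to_pages` is a hypothetical name): collect the page numbers a chunk's items appear on, then gather every document item whose provenance touches one of those pages.

```python
# Toy sketch of page-level expansion, not the real Docling implementation.
def expand_to_pages(chunk_items, all_items):
    # pages the chunk touches, taken from each item's provenance
    pages = {p for item in chunk_items for p in item["prov"]}
    if not pages:
        return None  # no page information available
    # every document item whose provenance overlaps those pages
    return [item for item in all_items if pages & set(item["prov"])]

doc = [
    {"text": "title", "prov": [1]},
    {"text": "para", "prov": [1]},
    {"text": "table", "prov": [1, 2]},
    {"text": "footer", "prov": [2]},
]
chunk = [doc[1]]  # a chunk holding only the paragraph on page 1

expanded = expand_to_pages(chunk, doc)
assert [i["text"] for i in expanded] == ["title", "para", "table"]
```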

2. Comprehensive Test Suite

Created test/test_doc_chunk_expansion.py with 15 test cases covering:

Dependencies

No new dependencies added.

Note:
In the current implementation the expanded chunk has the same headings as the original chunk.
This may need revisiting.

odelliab and others added 11 commits February 25, 2026 21:04
This reverts commit 5cc61d9.
Signed-off-by: odelliab <odelliab@il.ibm.com>
Signed-off-by: odelliab <odelliab@il.ibm.com>
Signed-off-by: odelliab <odelliab@il.ibm.com>
Signed-off-by: odelliab <odelliab@il.ibm.com>
@mergify

mergify bot commented Mar 16, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewers for test updates

This rule is failing.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
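
The title check can be exercised directly with the pattern quoted above (the example titles are illustrative):

```python
import re

# Conventional-commit title pattern, as quoted in the merge protection rule.
pattern = r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"

assert re.match(pattern, "feat: DocChunk expansion")        # this PR's title
assert re.match(pattern, "fix(chunker)!: handle empty pages")  # scope + breaking marker
assert not re.match(pattern, "DocChunk expansion")          # missing type prefix
```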

@dosubot

dosubot bot commented Mar 16, 2026

Related Documentation

1 document(s) may need updating based on files changed in this PR:

Docling

How can I find page numbers and bounding box information for content in a chunk produced by the hybrid chunker, and what is the structure of the doc_items list within a chunk?
View Suggested Changes
@@ -41,6 +41,73 @@
 )
 ```
 
+## Chunk Expansion Methods
+
+The `DocChunk` class provides three methods to expand chunks to include additional context from the document:
+
+### get_top_containing_objects(doc: DoclingDocument) -> list[DocItem] | None
+
+Traverses the document tree upward from the chunk's items to find the top-level containing objects (direct children of the document body). This method maintains the reading order as in the original document.
+
+**Returns:** A list of top-level `DocItem` objects that contain the chunk's items, or `None` if no top objects are found.
+
+**Example:**
+```python
+top_objects = chunk.get_top_containing_objects(dl_doc)
+if top_objects:
+    for obj in top_objects:
+        print(f"Top object: {obj.label}")
+```
+
+### expand_to_object(dl_doc: DoclingDocument, serializer: BaseDocSerializer) -> DocChunk
+
+Expands a chunk to include complete document objects (e.g., full sections, tables, lists) that contain the chunk's items. This prevents partial object inclusion in chunks.
+
+This method retrieves the top-level containing objects using `get_top_containing_objects()`, serializes them completely, and returns a new `DocChunk` with the expanded content. If expansion fails, the original chunk is returned.
+
+**Use case:** When you need to ensure you have complete objects rather than partial content (e.g., the full table instead of just a few rows, or an entire paragraph instead of a fragment).
+
+**Example:**
+```python
+from docling_core.transforms.chunker import HybridChunker
+
+# Create chunks using the hybrid chunker
+chunker = HybridChunker(max_tokens=512)
+chunks = list(chunker.chunk(dl_doc=dl_doc))
+
+# Expand a chunk to include complete objects
+serializer = chunker.serializer_provider.get_serializer(dl_doc)
+expanded_chunk = chunks[0].expand_to_object(dl_doc=dl_doc, serializer=serializer)
+
+print(f"Original text length: {len(chunks[0].text)}")
+print(f"Expanded text length: {len(expanded_chunk.text)}")
+```
+
+### expand_to_page(doc: DoclingDocument, serializer: BaseDocSerializer) -> DocChunk | None
+
+Expands a chunk to include all content from the page(s) where the chunk's items appear. This method uses the `page_no` from the provenance information (`item.prov`) to identify which pages to include, then serializes all content from those pages.
+
+If the document has no pages or page information is unavailable, the original chunk is returned.
+
+**Use case:** When you want page-level context for a chunk, such as retrieving all content from the same page(s) as the chunk.
+
+**Example:**
+```python
+from docling_core.transforms.chunker import HybridChunker
+
+# Create chunks
+chunker = HybridChunker(max_tokens=512)
+chunks = list(chunker.chunk(dl_doc=dl_doc))
+
+# Expand a chunk to include full page content
+serializer = chunker.serializer_provider.get_serializer(dl_doc)
+page_expanded_chunk = chunks[0].expand_to_page(doc=dl_doc, serializer=serializer)
+
+# Check which pages are included
+page_ids = [p.page_no for item in page_expanded_chunk.meta.doc_items for p in item.prov]
+print(f"Pages included: {set(page_ids)}")
+```
+
 **Limitations:**
 - Line numbers are not available.
 - DOCX files do not provide page or bounding box metadata; convert to PDF if you need this data.


@github-actions
Contributor

github-actions bot commented Mar 16, 2026

DCO Check Passed

Thanks @odelliab, all your commits are properly signed off. 🎉

I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: 5cc61d9
I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: 91b43f9
I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: 5d17bda
I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: a50392e
I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: e589429
I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to this commit: 30c72a9
I, odelliab <91875866+odelliab@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 6aa0019

Signed-off-by: odelliab <odelliab@il.ibm.com>
@odelliab odelliab changed the title from "DocChunk expansion" to "feat:/DocChunk expansion" Mar 16, 2026
@odelliab odelliab changed the title from "feat:/DocChunk expansion" to "feat: DocChunk expansion" Mar 16, 2026
@ceberam ceberam self-requested a review March 16, 2026 16:27
@ceberam ceberam self-assigned this Mar 16, 2026
Member

@ceberam ceberam left a comment


@odelliab Thanks for suggesting this implementation!
I left some technical comments, but I also wanted to share some high level comments:

  • ~~Please sync your fork, rebase to main, and resolve the current conflicts.~~ All good! It was a temporary issue with GitHub's CI/CD.
  • Can we recap the use case for building those expansion functions? I understood that we want to expand the context for answer generation in RAG, starting from the chunks obtained in the retrieval phase. If this is the case, do we really need these functions to return a chunk? Or would plain text do the job? I see some drawbacks in this implementation choice:
    • we re-serialize a document (or part of it) that was already serialized, since the starting point is a chunk (presumably obtained from a chunker with a serializer)
    • expand_to_page and expand_to_object trigger a new serialization that could differ from the one that generated the original chunk. What would be the consequences? What happens if, for instance, we run those functions with an HTML serializer on a chunk created using the default serializer (ChunkingDocSerializer with markdown serialization)?
    • if the given chunk was generated by a HybridChunker that respects a maximum token length, wouldn’t it be logical that the new chunk from expand_to_object also respects that condition?
    • in general, we detach the chunk from the chunker that originated it.
  • Aren’t the chunk and the Docling document enough for those expansions? Why do we need a serializer? We can locate the necessary items and then concatenate the text, as we do in the BaseChunker.contextualize function. But maybe I’m missing a use case that would require building a new chunk.
  • I would use the word item instead of object in the new method names, since that’s the terminology used in the Docling document (TextItem, DocItem, ListItem, …)
  • Just wondering if we really need 15 tests (thinking of keeping the pre-commit checks agile). Also, I would expect the tests to really check the content of the expanded chunks, to verify that they cover the expected doc items. For instance, I don’t see a test that checks that the output of expand_to_page contains all the doc items of the page, or that the text field is exactly what we expect to get.
    • Can we reuse artifacts across the tests, like the HybridChunker?
  • We could also let other users know about the potential of these expansions for RAG, by adding a small section in https://docling-project.github.io/docling/examples/advanced_chunking_and_serialization/ (in another PR on docling library).
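
The serializer-free alternative described above (modeled loosely on BaseChunker.contextualize) could be sketched as follows; the structures and the `expand_text` helper are hypothetical, only meant to illustrate joining already-available item text instead of re-serializing:

```python
# Toy sketch: once the items to include are located, concatenate their
# existing text in reading order rather than running a serializer again.
def expand_text(wanted_ids, doc_items, delim="\n"):
    # doc_items: the document's items in reading order, as (id, text) pairs
    wanted = set(wanted_ids)
    return delim.join(text for item_id, text in doc_items if item_id in wanted)

doc_items = [("t1", "Heading"), ("t2", "First paragraph."), ("t3", "Second paragraph.")]
assert expand_text(["t1", "t2"], doc_items) == "Heading\nFirst paragraph."
```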

```python
doc_items = []

for top_object in top_objects:
    if isinstance(top_object, ListGroup | InlineGroup | DocItem):
```
Member


At this point top_object will always be a DocItem. Is this a wrong type hint, or were you expecting to serialize groups? In that case you may need to review get_top_containing_objects.

Contributor Author

@odelliab odelliab Mar 22, 2026


Not sure why this is the case. A DocItem can be part of a GroupItem that resides in doc.body, no?

See for instance test\data\chunker\0_inp_dl_doc.json. #/texts/113 -> #/groups/5 -> body

odelliab and others added 4 commits March 19, 2026 16:09
Signed-off-by: odelliab <odelliab@il.ibm.com>
Signed-off-by: odelliab <odelliab@il.ibm.com>
@codecov

codecov bot commented Mar 19, 2026

Codecov Report

❌ Patch coverage is 92.15686% with 4 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
docling_core/transforms/chunker/doc_chunk.py 92.15% 4 Missing ⚠️


odelliab and others added 5 commits March 22, 2026 13:42
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: odelliab <91875866+odelliab@users.noreply.github.com>
Signed-off-by: odelliab <odelliab@il.ibm.com>
Signed-off-by: odelliab <odelliab@il.ibm.com>
Signed-off-by: odelliab <odelliab@il.ibm.com>
Signed-off-by: odelliab <odelliab@il.ibm.com>