Conversation
Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewers for test updates. This rule is failing. When test data is updated, we require two reviewers.

🟢 Enforce conventional commit. Wonderful, this rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
Related Documentation

1 document(s) may need updating based on files changed in this PR:

Docling: How can I find page numbers and bounding box information for content in a chunk produced by the hybrid chunker, and what is the structure of the `doc_items` list within a chunk?

Suggested changes:

@@ -41,6 +41,73 @@
)
```
+## Chunk Expansion Methods
+
+The `DocChunk` class provides three methods to expand chunks to include additional context from the document:
+
+### get_top_containing_objects(doc: DoclingDocument) -> list[DocItem] | None
+
+Traverses the document tree upward from the chunk's items to find the top-level containing objects (direct children of the document body). This method maintains the reading order as in the original document.
+
+**Returns:** A list of top-level `DocItem` objects that contain the chunk's items, or `None` if no top objects are found.
+
+**Example:**
+```python
+top_objects = chunk.get_top_containing_objects(dl_doc)
+if top_objects:
+    for obj in top_objects:
+        print(f"Top object: {obj.label}")
+```
+
+### expand_to_object(dl_doc: DoclingDocument, serializer: BaseDocSerializer) -> DocChunk
+
+Expands a chunk to include complete document objects (e.g., full sections, tables, lists) that contain the chunk's items. This prevents partial object inclusion in chunks.
+
+This method retrieves the top-level containing objects using `get_top_containing_objects()`, serializes them completely, and returns a new `DocChunk` with the expanded content. If expansion fails, the original chunk is returned.
+
+**Use case:** When you need to ensure you have complete objects rather than partial content (e.g., the full table instead of just a few rows, or an entire paragraph instead of a fragment).
+
+**Example:**
+```python
+from docling_core.transforms.chunker import HybridChunker
+
+# Create chunks using the hybrid chunker
+chunker = HybridChunker(max_tokens=512)
+chunks = list(chunker.chunk(dl_doc=dl_doc))
+
+# Expand a chunk to include complete objects
+serializer = chunker.serializer_provider.get_serializer(dl_doc)
+expanded_chunk = chunks[0].expand_to_object(dl_doc=dl_doc, serializer=serializer)
+
+print(f"Original text length: {len(chunks[0].text)}")
+print(f"Expanded text length: {len(expanded_chunk.text)}")
+```
+
+### expand_to_page(doc: DoclingDocument, serializer: BaseDocSerializer) -> DocChunk | None
+
+Expands a chunk to include all content from the page(s) where the chunk's items appear. This method uses the `page_no` from the provenance information (`item.prov`) to identify which pages to include, then serializes all content from those pages.
+
+If the document has no pages or page information is unavailable, the original chunk is returned.
+
+**Use case:** When you want page-level context for a chunk, such as retrieving all content from the same page(s) as the chunk.
+
+**Example:**
+```python
+from docling_core.transforms.chunker import HybridChunker
+
+# Create chunks
+chunker = HybridChunker(max_tokens=512)
+chunks = list(chunker.chunk(dl_doc=dl_doc))
+
+# Expand a chunk to include full page content
+serializer = chunker.serializer_provider.get_serializer(dl_doc)
+page_expanded_chunk = chunks[0].expand_to_page(doc=dl_doc, serializer=serializer)
+
+# Check which pages are included (guard against a None return)
+if page_expanded_chunk is not None:
+    page_ids = [
+        prov.page_no
+        for item in page_expanded_chunk.meta.doc_items
+        for prov in item.prov
+    ]
+    print(f"Pages included: {set(page_ids)}")
+```
+
**Limitations:**
- Line numbers are not available.
- DOCX files do not provide page or bounding box metadata; convert to PDF if you need this data.
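As a rough illustration of what `get_top_containing_objects` is described to do, here is a standalone sketch using a toy tree. The `Node` class and `top_containing` function are plain-Python stand-ins invented for this example, not the real docling-core types or implementation:

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity-based equality avoids recursive comparisons
class Node:
    """Toy stand-in for a DocItem/GroupItem; not a real docling-core type."""
    name: str
    parent: "Node | None" = None
    children: list["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


def top_containing(body: Node, items: list[Node]) -> list[Node]:
    """Walk each item's parent chain up to a direct child of `body`,
    deduplicating while preserving first-seen (reading) order."""
    tops: list[Node] = []
    for item in items:
        node = item
        while node.parent is not None and node.parent is not body:
            node = node.parent
        if node.parent is body and node not in tops:
            tops.append(node)
    return tops


# Build: body -> (section -> (para, list_group -> li1), table)
body = Node("body")
section = body.add(Node("section"))
para = section.add(Node("para"))
group = section.add(Node("list_group"))
li1 = group.add(Node("li1"))
table = body.add(Node("table"))

# A list item and a paragraph both collapse to their shared section;
# the table is already a direct child of body.
print([n.name for n in top_containing(body, [li1, para, table])])  # → ['section', 'table']
```

This also shows why reading order is preserved: top-level ancestors are appended in the order their first descendant appears in the chunk.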
✅ DCO Check Passed: Thanks @odelliab, all your commits are properly signed off. 🎉
I, odelliab <odelliab@il.ibm.com>, hereby add my Signed-off-by to these commits: 5cc61d9, 91b43f9, 5d17bda, a50392e, e589429, 30c72a9

I, odelliab <91875866+odelliab@users.noreply.github.com>, hereby add my Signed-off-by to this commit: 6aa0019

Signed-off-by: odelliab <odelliab@il.ibm.com>
@odelliab Thanks for suggesting this implementation!
I left some technical comments, but I also wanted to share some high level comments:
Please, sync your fork, rebase to main, and resolve the current conflicts.

All good! It was a temporary issue with Github's CI/CD.

Can we recap the use case to build those expansion functions? I understood that we want to expand the context for the answer generation in RAG from the chunks obtained in the retrieval phase. If this is the case, do we really need to return a chunk with those functions? Or would text do the job? Because I see some drawbacks in this implementation choice:

- We serialize again a document (or part of the document) that was already serialized, since the starting point is a chunk (presumably obtained by a chunker with a serializer).
- The `expand_to_page` and `expand_to_object` methods trigger a new serialization that could be different from the one that generated the original chunk. What would be the consequences? What happens if, for instance, we run those functions with an HTML serializer on a chunk created using the default serializer (`ChunkingDocSerializer` with a markdown serialization)?
- If the given chunk was generated by a `HybridChunker` that respects a maximum token length, wouldn't it be logical that the new chunk from `expand_to_object` also respects that condition?
- In general, we detach the chunk from the chunker that originated it.

Aren't the chunk and the Docling document enough for those expansions? Why do we need a serializer? We can locate the necessary items and then concatenate the text as we do it in the `BaseChunker.contextualize` function. But maybe I'm missing a use case that would require building a new chunk.

I would use the word `item` instead of `object` in the new method names, since that's the terminology we used in the Docling document (`TextItem`, `DocItem`, `ListItem`, ...).

Just wondering if we really need 15 tests (thinking of keeping the `pre-commit` checks agile). Also, I would expect that we really check the content of the expanded chunks, to verify that they cover the expected doc items. For instance, I don't see a test that checks that the output of `expand_to_page` contains all the doc items of the page or that the `text` field is exactly what we expect to get.

Can we reuse artifacts across the tests, like the `HybridChunker`?

We could also let other users know about the potential of these expansions for RAG, by adding a small section in https://docling-project.github.io/docling/examples/advanced_chunking_and_serialization/ (in another PR on the docling library).
```python
doc_items = []
for top_object in top_objects:
    if isinstance(top_object, ListGroup | InlineGroup | DocItem):
```
At this point `top_object` will always be a `DocItem`. Is this a wrong type hint, or maybe you were expecting to serialize groups? In that case you may need to review `get_top_containing_objects`.

Not sure why this is the case. A `DocItem` can be part of a `GroupItem` that resides in `doc.body`, no?
See for instance `test/data/chunker/0_inp_dl_doc.json`: `#/texts/113` -> `#/groups/5` -> `body`.
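The parent chain mentioned in the reply can be pictured with a tiny standalone sketch. The `parents` dict is toy data modeled loosely on the JSON-pointer layout of `0_inp_dl_doc.json`, not the real file contents or the docling-core API:

```python
# Parent references as JSON-pointer strings (toy data for illustration).
parents = {
    "#/texts/113": "#/groups/5",  # a text item nested inside a group ...
    "#/groups/5": "#/body",       # ... where the group sits directly in body
    "#/texts/7": "#/body",        # a text item directly in body
}


def ancestry(ref: str) -> list[str]:
    """Follow parent refs until reaching a node with no recorded parent."""
    chain = [ref]
    while chain[-1] in parents:
        chain.append(parents[chain[-1]])
    return chain


# A DocItem can indeed sit inside a GroupItem that itself resides in doc.body:
print(ancestry("#/texts/113"))  # → ['#/texts/113', '#/groups/5', '#/body']
print(ancestry("#/texts/7"))    # → ['#/texts/7', '#/body']
```

This is why an upward traversal has to stop at the direct child of `body` rather than assuming every chunk item's immediate parent is `body`.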
Codecov Report: ❌ Patch coverage is
Co-authored-by: Cesar Berrospi Ramis <75900930+ceberam@users.noreply.github.com>
Signed-off-by: odelliab <91875866+odelliab@users.noreply.github.com>
Overview
This PR introduces chunk expansion functionality to the `DocChunk` class, enabling chunks to be expanded to include complete document objects or full pages. It also includes a comprehensive test suite with 15 test cases.

Issue #542
Features Added
1. Chunk Expansion Methods

Three new methods added to the `DocChunk` class:

- `get_top_containing_objects(doc: DoclingDocument) -> list[DocItem] | None` (returns `None` if no top objects are found)
- `expand_to_object(dl_doc: DoclingDocument, serializer: BaseDocSerializer) -> DocChunk`
- `expand_to_page(doc: DoclingDocument, serializer: BaseDocSerializer) -> DocChunk | None`

2. Comprehensive Test Suite
Created `test/test_doc_chunk_expansion.py` with 15 test cases.

Dependencies
No new dependencies added.
Note: In the current implementation the expanded chunk has the same headings as the original chunk. This may need revisiting.
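The page-level expansion described for `expand_to_page` can be pictured with a small standalone sketch. Items are represented here as `(name, page_numbers)` tuples standing in for doc items and their provenance; these helpers are invented for illustration and are not the library API:

```python
def pages_touched(chunk_items: list[tuple[str, list[int]]]) -> set[int]:
    """Collect the page numbers from the chunk items' provenance."""
    return {p for _, pages in chunk_items for p in pages}


def expand_to_pages(
    all_items: list[tuple[str, list[int]]],
    chunk_items: list[tuple[str, list[int]]],
) -> list[str]:
    """Return, in reading order, every item that appears on any page the
    chunk touches (mirroring the described expand_to_page behaviour)."""
    target = pages_touched(chunk_items)
    return [name for name, pages in all_items if target.intersection(pages)]


all_items = [
    ("title", [1]),
    ("para1", [1]),
    ("table", [1, 2]),  # spans a page break
    ("para2", [2]),
    ("para3", [3]),
]
chunk = [("para2", [2])]
print(expand_to_pages(all_items, chunk))  # → ['table', 'para2']
```

Note that an item spanning a page break (the table above) is pulled in as soon as any of its pages matches, which is consistent with using per-provenance `page_no` values rather than a single page per item.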