fix: drop malformed bounding boxes by b-hahn · Pull Request #454 · docling-project/docling-core

b-hahn · 2025-12-10T19:46:07Z

This change handles the case where bounding boxes predicted by VLMs have a width or a height <=0. Previously, this would cause the pipeline to crash while saving the PIL image. Now, malformed bounding boxes are simply ignored, resulting in an empty provenance field while keeping the extracted text.

Fixes docling-project/docling#2763
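
The behaviour described above can be sketched as follows. This is a minimal illustration, not the code in the PR: `validate_bbox` and `build_item` are hypothetical names, and the `(left, top, right, bottom)` tuple layout and the dict-shaped item are assumptions made for the example.

```python
from typing import Optional, Tuple

# Assumed layout for this sketch: (left, top, right, bottom).
BBox = Tuple[float, float, float, float]


def validate_bbox(bbox: BBox) -> Optional[BBox]:
    """Return the box unchanged, or None when its width or height is <= 0."""
    left, top, right, bottom = bbox
    if right - left <= 0 or bottom - top <= 0:
        return None
    return bbox


def build_item(text: str, bbox: BBox) -> dict:
    """Keep the extracted text; attach provenance only for a usable box."""
    valid = validate_bbox(bbox)
    prov = [{"bbox": valid}] if valid is not None else []
    return {"text": text, "prov": prov}
```

With a zero-width box such as `(10, 10, 10, 20)`, the item keeps its text and ends up with `prov` set to an empty list.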

github-actions · 2025-12-10T19:46:17Z

✅ DCO Check Passed

Thanks @b-hahn, all your commits are properly signed off. 🎉

mergify · 2025-12-10T19:46:45Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

This rule is failing.

When test data is updated, we require two reviewers

#approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
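
To check a title locally against this rule, the pattern can be tried with Python's `re` module. This sketch assumes that Mergify's `~=` operator behaves like a regular-expression search; `is_conventional` is a hypothetical helper name.

```python
import re

# Pattern copied from the merge protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True when the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None
```

The title of this PR, `fix: drop malformed bounding boxes`, satisfies the pattern, while a title without a type prefix does not.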

codecov · 2025-12-10T20:24:50Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

kklein · 2025-12-11T10:31:43Z

@dolfim-ibm @cau-git:

I think that @b-hahn's changes from yesterday require another approval of the CI runs on your end. :)

dolfim-ibm · 2025-12-11T12:16:55Z

@b-hahn you identified the right function, i.e. load_from_doctags(), to fix the issue. As we discussed, instead of making an empty prov: [] we would like to drop the element completely if it has zero size.

cau-git · 2025-12-11T12:25:33Z

@dolfim-ibm we agreed on the proposed solution here. It’s legit for some elements to have missing prov.

kklein · 2025-12-12T08:44:17Z

@dolfim-ibm and @cau-git: How do you suggest we move forward now?

simon376 · 2025-12-15T15:07:58Z

I think I found another issue that might relate to this. Unfortunately, I can't provide a stack trace: it is hard to reproduce, I didn't fully log everything on the server, and I couldn't find this error message in the code yet. Maybe you can have a look at this, too:
Pipeline VlmPipeline failed: Coordinate 'lower' is less than 'upper'
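
The thread has no stack trace, so the origin of this message is unconfirmed. Its wording reads like a sanity check on an image crop box, where the lower edge must not lie above the upper edge. The sketch below shows that kind of check in plain Python; `check_crop_box` is a hypothetical function written for illustration and is not taken from docling or any imaging library.

```python
from typing import Tuple

# Assumed crop box layout: (left, upper, right, lower), in pixels.
CropBox = Tuple[int, int, int, int]


def check_crop_box(box: CropBox) -> CropBox:
    """Raise ValueError when the box edges are inverted; otherwise return it."""
    left, upper, right, lower = box
    if right < left:
        raise ValueError("Coordinate 'right' is less than 'left'")
    if lower < upper:
        raise ValueError("Coordinate 'lower' is less than 'upper'")
    return box
```

A predicted box whose bottom edge lies above its top edge, for example `(0, 50, 10, 20)`, would trip the second check, which is consistent with a malformed VLM bounding box reaching the cropping step.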

dolfim-ibm · 2025-12-15T15:24:29Z

@b-hahn for the moment we should be good with the option which provides the empty prov: [].

cau-git

@b-hahn @kklein just reflecting the desired solution below. Let's first go only with your change in extract_bounding_box, then observe before introducing conditions that could potentially skip valid content.

cau-git · 2025-12-15T15:26:46Z

docling_core/types/doc/document.py

-                    text=caption_text,
-                    parent=None,
-                )
+                if bbox is not None:


I would remove this condition so captions are created also when the bbox is not present, as before.

cau-git · 2025-12-15T15:27:05Z

docling_core/types/doc/document.py

-                            doc=doc,
-                            parent=inline_group,
-                        )
+                        if common_bbox is not None:


I would remove this condition so that _add_text is called also with a None bounding box, as before.
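
A minimal sketch of the requested behaviour, under stated assumptions: `union_bbox` and `add_text` are hypothetical stand-ins (the real `_add_text` signature is not shown in this thread), and items are modelled as plain dicts. The point is that the text is always added, and only the provenance depends on the bounding box.

```python
from typing import List, Optional, Tuple

BBox = Tuple[float, float, float, float]  # assumed (left, top, right, bottom)


def union_bbox(boxes: List[Optional[BBox]]) -> Optional[BBox]:
    """Merge the valid boxes into one enclosing box; None if there are none."""
    valid = [b for b in boxes if b is not None]
    if not valid:
        return None
    return (
        min(b[0] for b in valid),
        min(b[1] for b in valid),
        max(b[2] for b in valid),
        max(b[3] for b in valid),
    )


def add_text(items: List[dict], text: str, bbox: Optional[BBox]) -> None:
    """Always append the text; leave prov empty when no box is available."""
    prov = [] if bbox is None else [{"bbox": bbox}]
    items.append({"text": text, "prov": prov})
```

Calling `add_text(items, "a", union_bbox([None, None]))` still appends the item, with an empty `prov`.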

dosubot · 2025-12-17T21:32:17Z

Related Documentation

Checked 8 published document(s) in 1 knowledge base(s). No updates required.

ceberam

LGTM, it works as expected, thus approved.

I would only add the following suggestion. Turn this variable into a module constant, so that we remember the value choice we made. We can also add some docstrings. Something like:

_DOCTAGS_BBOX_MIN_DIMENSION: Final = 1 / 500
"""Minimum bounding box dimension threshold for DocTags import.

This constant represents the minimum width or height (as a fraction of the image
dimensions) that a bounding box must have to be considered valid during DocTags
import. The value ``1/500 = 0.002 = 0.2%`` means that any bounding box with width or
height less than 0.2% of the image width or height will be discarded.
"""
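
One way such a constant could be applied is sketched below. It assumes the box coordinates are already normalized to fractions of the image width and height; `filter_normalized_bbox` is a hypothetical helper, not the function changed in this PR.

```python
from typing import Final, Optional, Tuple

_DOCTAGS_BBOX_MIN_DIMENSION: Final = 1 / 500

# Assumed layout: (left, top, right, bottom), each in the range [0, 1].
NormBBox = Tuple[float, float, float, float]


def filter_normalized_bbox(bbox: NormBBox) -> Optional[NormBBox]:
    """Discard boxes narrower or shorter than the minimum dimension."""
    left, top, right, bottom = bbox
    if (right - left) < _DOCTAGS_BBOX_MIN_DIMENSION:
        return None
    if (bottom - top) < _DOCTAGS_BBOX_MIN_DIMENSION:
        return None
    return bbox
```

Because the comparison is against a positive threshold, boxes with zero or negative width or height are discarded by the same check.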

ceberam · 2026-02-03T13:55:28Z

@b-hahn we just need your sign-off in the second commit (af71bb5) to merge this PR.

ceberam · 2026-02-11T12:29:14Z

@b-hahn we just need your sign-off in the second commit (af71bb5) to merge this PR.

@b-hahn ^^^ were you able to have a look at this? ^^^
Also, consider the remark here #454 (review)

b-hahn · 2026-02-20T19:58:04Z

@ceberam Added a module-level constant and signed off all commits. Let me know if you need anything else to see this through.

b-hahn force-pushed the main branch 4 times, most recently from 1c3bd97 to dd586ed Compare December 10, 2025 20:22

cau-git reviewed Dec 15, 2025

b-hahn closed this Dec 17, 2025

b-hahn force-pushed the main branch from 5e23abf to 2920f24 Compare December 17, 2025 21:24

b-hahn reopened this Dec 17, 2025

b-hahn force-pushed the main branch 2 times, most recently from 378b1d6 to 074400f Compare December 17, 2025 21:30

b-hahn marked this pull request as ready for review December 17, 2025 21:32

ceberam previously approved these changes Feb 3, 2026

cau-git previously approved these changes Feb 3, 2026

b-hahn added 3 commits February 20, 2026 17:00

Ignore zero-height and zero-width bounding boxes.

58f3c09

Signed-off-by: Benjamin Hahn <benjamin.hahn1@gmail.com>

Return no element if empty bounding box is extracted.

2867d22

Signed-off-by: Benjamin Hahn <benjamin.hahn1@gmail.com>

Ensure captions always returned irrespective of bbox presence.

5c6c795

Signed-off-by: Benjamin Hahn <benjamin.hahn1@gmail.com>

b-hahn force-pushed the main branch from 074400f to 5c6c795 Compare February 20, 2026 19:52

b-hahn dismissed cau-git’s stale review via b3e77ef February 20, 2026 19:55

b-hahn dismissed ceberam’s stale review via b3e77ef February 20, 2026 19:55

Extract bbox min dimension threshold into a module-level constant.

6c88499

Signed-off-by: Benjamin Hahn <benjamin.hahn1@gmail.com>

b-hahn force-pushed the main branch from b3e77ef to 6c88499 Compare February 20, 2026 19:57
