fix: drop malformed bounding boxes #454
✅ DCO Check Passed. Thanks @b-hahn, all your commits are properly signed off. 🎉
Merge Protections: Your pull request matches the following merge protections and will not be merged until they are valid.
🔴 Require two reviewer for test updates: This rule is failing. When test data is updated, we require two reviewers.
🟢 Enforce conventional commit: Wonderful, this rule succeeded. Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
1c3bd97 to dd586ed
Codecov Report: ✅ All modified and coverable lines are covered by tests.
I think that @b-hahn from yesterday would require another approval of the CI runs on your end. :)
@b-hahn you identified the right function, i.e.
@dolfim-ibm we agreed on the proposed solution here. It’s legit for some elements to have missing prov
@dolfim-ibm and @cau-git: How do you suggest we move forward now?
I think I found another issue that might relate to this. Unfortunately, I can't provide a stack trace: it is hard to reproduce, I didn't fully log everything on the server, and I haven't found this error message in the code yet. Maybe you can have a look at this, too:
@b-hahn for the moment we should be good with the option which provides the empty
docling_core/types/doc/document.py (outdated)
    text=caption_text,
    parent=None,
)
if bbox is not None:
I would remove this condition so that captions are also created when the bbox is not present, as before.
docling_core/types/doc/document.py (outdated)
    doc=doc,
    parent=inline_group,
)
if common_bbox is not None:
I would remove this condition so that _add_text is also called with a None bounding box, as before.
378b1d6 to 074400f
LGTM, it works as expected, thus approved.
I would only add one suggestion: turn this variable into a module constant, so that we remember the value we chose. We can also add a docstring. Something like:
_DOCTAGS_BBOX_MIN_DIMENSION: Final = 1 / 500
"""Minimum bounding box dimension threshold for DocTags import.
This constant represents the minimum width or height (as a fraction of the image
dimensions) that a bounding box must have to be considered valid during DocTags
import. The value ``1/500 = 0.002 = 0.2%`` means that any bounding box with width or
height less than 0.2% of the image width or height will be discarded.
"""
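For illustration, a minimal sketch of how such a threshold could be applied. It assumes bounding box coordinates are already normalized to fractions of the image dimensions; the helper name `bbox_or_none` is hypothetical and not taken from docling-core.

```python
from typing import Final, Optional, Tuple

# Same value as the constant suggested above: 0.2% of the image dimension.
_DOCTAGS_BBOX_MIN_DIMENSION: Final = 1 / 500

# Assumption: (left, top, right, bottom), each a fraction of the image size.
BBox = Tuple[float, float, float, float]


def bbox_or_none(left: float, top: float, right: float, bottom: float) -> Optional[BBox]:
    """Hypothetical helper: return the box, or None if it is too small or inverted."""
    width = right - left
    height = bottom - top
    # Zero or negative sizes fall below the threshold too, so they are dropped.
    if width < _DOCTAGS_BBOX_MIN_DIMENSION or height < _DOCTAGS_BBOX_MIN_DIMENSION:
        return None
    return (left, top, right, bottom)
```

Returning `None` rather than raising lets the caller keep the element and simply skip its provenance.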
@b-hahn ^^^ were you able to have a look at this? ^^^
Signed-off-by: Benjamin Hahn <benjamin.hahn1@gmail.com>
@ceberam Added a module-level constant and signed off all commits. Let me know if you need anything else to see this through.
This change handles the case where bounding boxes predicted by VLMs have a width or height <= 0. Previously, this would cause the pipeline to crash while saving the PIL image. Now, malformed bounding boxes are simply ignored, resulting in an empty provenance field while keeping the extracted text.
Fixes docling-project/docling#2763
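The behavior described above (keep the text, leave provenance empty when the box is malformed or missing) can be sketched roughly as follows. This is a simplified stand-in, not the docling-core implementation: `SketchTextItem` and `build_item` are hypothetical names, and provenance is reduced to a plain list of boxes.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

BBox = Tuple[float, float, float, float]  # assumed order: left, top, right, bottom


@dataclass
class SketchTextItem:
    """Hypothetical stand-in for a document text item."""
    text: str
    prov: List[BBox] = field(default_factory=list)


def build_item(text: str, bbox: Optional[BBox]) -> SketchTextItem:
    """Always create the item; attach provenance only for a well-formed box."""
    item = SketchTextItem(text=text)
    if bbox is not None:
        left, top, right, bottom = bbox
        # A box with width or height <= 0 is treated as malformed and ignored.
        if right - left > 0 and bottom - top > 0:
            item.prov.append(bbox)
    return item
```

With this shape, a caption with no bounding box still produces an item, which matches the review request to keep creating captions when the bbox is not present.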