feat: update OmniDocBench with Parquet support and add test by samiuc · Pull Request #198 · docling-project/docling-eval

samiuc · 2026-01-21T06:46:21Z

The OmniDocBench dataset contains ~2700 individual files. When it is downloaded in raw mode (file by file), each file triggers a separate request to HuggingFace's API. In parallel test environments, this quickly hits HuggingFace's rate limit of 1000 requests per 5 minutes and causes test failures.
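
To make the arithmetic concrete, here is a rough back-of-the-envelope sketch using the numbers quoted above. It assumes exactly one request per file and that all parallel jobs share the same rate-limit budget; both are simplifying assumptions, not measured behavior.

```python
import math

FILES_IN_DATASET = 2700      # approximate file count quoted above
RATE_LIMIT_REQUESTS = 1000   # requests allowed per window, as quoted above
WINDOW_MINUTES = 5


def windows_needed(parallel_jobs: int) -> int:
    """Rate-limit windows consumed if every job fetches every file once."""
    total_requests = FILES_IN_DATASET * parallel_jobs
    return math.ceil(total_requests / RATE_LIMIT_REQUESTS)


for jobs in (1, 4):
    n = windows_needed(jobs)
    # The last window opens only after (n - 1) full windows have elapsed.
    wait = (n - 1) * WINDOW_MINUTES
    print(f"{jobs} job(s): {n} windows, at least {wait} min of waiting")
```

Under these assumptions, even a single raw-mode run cannot complete inside one rate-limit window.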

In this PR, I have:

Added a new use_parquet parameter to OmniDocBenchDatasetBuilder that enables downloading pre-converted Parquet shards instead of individual files.
Implemented an _iterate_parquet() method that uses load_dataset() to stream data from the Parquet files, reducing the number of API requests.
Made Parquet mode download the entire dataset as Parquet shards instead of issuing 2700+ individual file requests.
Maintained backward compatibility with the original raw mode (default behavior unchanged).
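
For readers who want a feel for the shape of this change, the following is a minimal sketch, not the code in this PR. The class SketchOmniDocBenchBuilder, the loader argument, the iterate() method, and the "train" split name are all hypothetical and exist only for illustration; only use_parquet, _iterate_parquet(), load_dataset(), and the repo name come from the description above.

```python
from typing import Callable, Iterable, Iterator

PARQUET_REPO = "samiuc/OmniDocBench-parquet"  # repo named in this PR


def _default_loader(repo_id: str) -> Iterable[dict]:
    # Imported lazily so raw mode does not require the `datasets` package.
    from datasets import load_dataset

    # streaming=True yields rows lazily instead of materialising the dataset.
    # The split name "train" is an assumption about the Parquet repo layout.
    return load_dataset(repo_id, split="train", streaming=True)


class SketchOmniDocBenchBuilder:
    """Hypothetical stand-in for OmniDocBenchDatasetBuilder."""

    def __init__(
        self,
        use_parquet: bool = False,  # default False keeps raw mode unchanged
        loader: Callable[[str], Iterable[dict]] = _default_loader,
    ) -> None:
        self.use_parquet = use_parquet
        self._loader = loader  # injectable so tests need no network access

    def _iterate_raw(self) -> Iterator[dict]:
        # Placeholder for the original file-by-file download path.
        raise NotImplementedError("raw mode is not part of this sketch")

    def _iterate_parquet(self) -> Iterator[dict]:
        # One pass over the Parquet-backed dataset, one row at a time.
        for row in self._loader(PARQUET_REPO):
            yield dict(row)

    def iterate(self) -> Iterator[dict]:
        if self.use_parquet:
            return self._iterate_parquet()
        return self._iterate_raw()
```

Injecting the loader is one way to exercise the Parquet path in tests without touching the network, for example `SketchOmniDocBenchBuilder(use_parquet=True, loader=lambda repo: [{"id": 1}])`.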

The Parquet version of OmniDocBench is available at: https://huggingface.co/datasets/samiuc/OmniDocBench-parquet

Question: can we move it to the docling repo on HF?

Signed-off-by: samiuc <sami.ullah.chat@gmail.com>

github-actions · 2026-01-21T06:46:31Z

✅ DCO Check Passed

Thanks @samiuc, all your commits are properly signed off. 🎉

mergify · 2026-01-21T06:46:56Z

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

This rule is failing.

When test data is updated, we require two reviewers

#approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
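
As a quick local check, the title pattern above can be tried with Python's re module. This sketch assumes the rule's ~= operator behaves like an ordinary regular-expression match; the helper name is_conventional is hypothetical.

```python
import re

# Pattern copied from the merge-protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.match(title) is not None


# The title of this pull request satisfies the pattern.
print(is_conventional("feat: update OmniDocBench with Parquet support and add test"))
# A title without the "type:" prefix does not.
print(is_conventional("fix import"))
```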

divekarsc

LGTM!

samiuc · 2026-03-19T19:27:12Z

@cau-git, could you please review the PR when you have some time? Thank you.

feat: update OmniDocBench with Parquet support and add test

10251dc

Signed-off-by: samiuc <sami.ullah.chat@gmail.com>

fix: build error and import tableformer provider conditionally

457e248

Signed-off-by: samiuc <sami.ullah.chat@gmail.com>

samiuc requested review from cau-git and divekarsc January 22, 2026 00:02

divekarsc previously approved these changes Jan 23, 2026

samiuc mentioned this pull request Feb 12, 2026

fix: PIL Image Memory Leaks in Dataset Builders #194

Open

fix import

2d9994c

Signed-off-by: samiuc <sami.ullah.chat@gmail.com>

samiuc dismissed divekarsc’s stale review via 2d9994c March 19, 2026 19:20

Merge branch 'main' into sami/update-omnidoc-builder

c2657b3

samiuc wants to merge 4 commits into main from sami/update-omnidoc-builder

2 participants
