feat: update OmniDocBench with Parquet support and add test #198

Open
samiuc wants to merge 4 commits into main from sami/update-omnidoc-builder

samiuc (Contributor) commented Jan 21, 2026

The OmniDocBench dataset contains ~2700 individual files. In raw mode (file-by-file download), each file triggers a separate request to the Hugging Face API. In parallel test environments, this quickly exceeds Hugging Face's rate limit of 1000 requests per 5 minutes and causes test failures.
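
To make the numbers concrete, here is a rough back-of-the-envelope sketch. It assumes one request per file and that every parallel worker downloads the full dataset independently; the function name and the worker counts are only illustrative and are not taken from the test setup.

```python
import math

FILES_IN_DATASET = 2700     # approximate file count quoted above
REQUESTS_PER_WINDOW = 1000  # rate limit quoted above
WINDOW_MINUTES = 5


def windows_needed(workers: int, files: int = FILES_IN_DATASET) -> int:
    """Rate-limit windows consumed if each worker fetches every file once (assumption)."""
    total_requests = workers * files
    return math.ceil(total_requests / REQUESTS_PER_WINDOW)


# A single raw download already needs more than one window's budget,
# and each extra parallel worker multiplies the request count.
for workers in (1, 4):
    n = windows_needed(workers)
    print(f"{workers} worker(s): {n} windows, about {n * WINDOW_MINUTES} minutes of quota")
```

Under these assumptions even one worker cannot finish inside a single 5-minute window, which is consistent with the failures seen in parallel runs.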

In this PR, I have:

  • Added a new use_parquet parameter to OmniDocBenchDatasetBuilder that enables downloading pre-converted Parquet shards instead of individual files
  • Implemented an _iterate_parquet() method that uses load_dataset() to stream data from the Parquet files, reducing the number of API requests
  • Made Parquet mode fetch the entire dataset as Parquet shards instead of issuing 2700+ individual requests
  • Maintained backward compatibility with the original raw mode (default behavior unchanged)
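
For readers who have not opened the diff, the changes listed above can be pictured roughly as follows. This is a minimal sketch, not the code in this PR: the class name `ParquetOrRawBuilder`, the `loader` argument, and the `"train"` split are hypothetical stand-ins, and only `use_parquet`, `_iterate_parquet()`, and `load_dataset()` come from the description.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, Optional


class ParquetOrRawBuilder:
    """Hypothetical stand-in for OmniDocBenchDatasetBuilder (not the real class)."""

    def __init__(
        self,
        repo_id: str,
        use_parquet: bool = False,  # default keeps the original raw mode
        split: str = "train",  # assumed split name
        loader: Optional[Callable[..., Iterable[Dict[str, Any]]]] = None,
    ) -> None:
        self.repo_id = repo_id
        self.use_parquet = use_parquet
        self.split = split
        self._loader = loader  # injectable so the sketch can run offline

    def _iterate_raw(self) -> Iterator[Dict[str, Any]]:
        # Placeholder for the existing file-by-file download path.
        raise NotImplementedError("raw mode is not part of this sketch")

    def _iterate_parquet(self) -> Iterator[Dict[str, Any]]:
        loader = self._loader
        if loader is None:
            # Imported lazily so raw mode does not require the `datasets` package.
            from datasets import load_dataset

            loader = load_dataset
        # streaming=True yields records lazily from the Parquet shards.
        yield from loader(self.repo_id, split=self.split, streaming=True)

    def iterate(self) -> Iterator[Dict[str, Any]]:
        # Dispatch on the flag; raw mode stays the default.
        if self.use_parquet:
            return self._iterate_parquet()
        return self._iterate_raw()
```

With the `datasets` package installed, something like `ParquetOrRawBuilder("samiuc/OmniDocBench-parquet", use_parquet=True).iterate()` would stream records from the Parquet copy of the dataset, assuming the split name matches.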

The Parquet version of OmniDocBench is available at: https://huggingface.co/datasets/samiuc/OmniDocBench-parquet

Question: can we move it to the docling repo on HF?

Signed-off-by: samiuc <sami.ullah.chat@gmail.com>
github-actions bot (Contributor) commented Jan 21, 2026

DCO Check Passed

Thanks @samiuc, all your commits are properly signed off. 🎉

mergify bot commented Jan 21, 2026

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

This rule is failing.

When test data is updated, we require two reviewers

  • #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
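
The title rule above can be tried locally with a short sketch like the one below. It assumes the `~=` operator behaves like a Python regular-expression search over the title; the helper name `is_conventional` is hypothetical.

```python
import re

# Pattern copied from the rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None


print(is_conventional("feat: update OmniDocBench with Parquet support and add test"))
print(is_conventional("Update OmniDocBench with Parquet support"))
```

The first title is the one used by this pull request and matches the pattern; the second lacks a type prefix and does not.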

Signed-off-by: samiuc <sami.ullah.chat@gmail.com>
samiuc requested review from cau-git and divekarsc on January 22, 2026 00:02
divekarsc (Contributor) previously approved these changes on Jan 23, 2026 and left a comment:

LGTM!

Signed-off-by: samiuc <sami.ullah.chat@gmail.com>
samiuc (Contributor, Author) commented Mar 19, 2026

@cau-git can you please review the PR when you have some time? Thank you.
