First of all, thank you for building such a powerful and well-structured document intelligence toolkit!
I'm currently using Docling to process PDF documents and extract structured content with metadata via the chunker pipeline. I noticed that the dl_meta output already includes useful provenance information like page_no and bbox under the prov field — which is great!
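For reference, here is a minimal sketch of how I read these fields today. It works on a plain dict and assumes that dl_meta holds a doc_items list whose entries each carry a prov list with page_no and bbox keys. That shape, the l/t/r/b keys, and the sample values are assumptions based on what I see in my own output, not a documented contract, and iter_bboxes is just a hypothetical helper name.

```python
from typing import Any, Dict, Iterator, Optional, Tuple


def iter_bboxes(
    dl_meta: Dict[str, Any],
) -> Iterator[Tuple[Optional[int], Optional[Dict[str, Any]]]]:
    """Yield (page_no, bbox) for every provenance entry; bbox is None if absent."""
    for item in dl_meta.get("doc_items", []):
        for prov in item.get("prov", []):
            yield prov.get("page_no"), prov.get("bbox")


# Hypothetical sample mimicking the assumed dl_meta shape (values are made up).
sample_meta = {
    "doc_items": [
        {
            "label": "text",
            "prov": [
                {
                    "page_no": 1,
                    "bbox": {"l": 72.0, "t": 700.0, "r": 540.0, "b": 680.0},
                }
            ],
        },
        # An item without provenance, which is the case I am asking about.
        {"label": "text", "prov": []},
    ]
}

found = list(iter_bboxes(sample_meta))

# Indices of items that carry no provenance (and therefore no bbox) at all.
missing = [
    index
    for index, item in enumerate(sample_meta["doc_items"])
    if not item.get("prov")
]
```

Using .get() everywhere is deliberate: since I don't know whether prov or bbox can be missing, the helper tolerates both cases instead of raising a KeyError.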
However, I have two related questions:
Bounding Box (bbox):
Is the bbox in prov guaranteed to be present for every text item? And is it possible to get more granular bbox information (e.g., per word or per line)?
Confidence Score (confidence):
Does Docling support exposing a confidence score (e.g., from OCR or layout detection models) in the metadata? This would be extremely helpful for downstream filtering or quality assessment, especially when processing scanned or low-quality documents.
If these fields are not currently exposed but are available internally, would you consider adding them as optional metadata in future versions?