Description
Improving the accuracy of table extraction is an ongoing process.
Currently, ocirs' bordered and borderless table extraction algorithms are adapted from Open-Intelligence, with small modifications tailored to IRS 990 forms.
Here is an example form page:
When the image is cropped down to just the table itself using CascadeTabNet, we see high cell structure accuracy during table extraction.
But when the page is left as is, the table extraction process erroneously includes text that is not in the table. This can reduce cell structure accuracy and leaves the user with significant post-processing on the back end.
In general, the bordered table extraction process should be able to rely on lines detected by OpenCV. However, in cases such as the example page above, extra lines elsewhere on the page cause the extra text to be extracted.
There may be some way to tweak the logic of LineDetector().detect_lines() in ocirs/table_extraction/line_detector/line_detector.py to remove lines whose positions do not conform to an expected table-like layout.
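One possible filter, sketched below, keeps only lines that take part in a grid: a horizontal line survives if it crosses at least two vertical lines, and vice versa. This is only an illustration of the idea, not the existing ocirs logic. It assumes detected lines can be represented as `(x1, y1, x2, y2)` pixel tuples, which may not match what `detect_lines()` actually returns, and the names `filter_table_like_lines`, `crosses`, `min_crossings`, and `tol` are hypothetical.

```python
from typing import List, Tuple

# Assumed representation of a detected line: (x1, y1, x2, y2) in pixels.
Line = Tuple[int, int, int, int]


def is_horizontal(line: Line, tol: int = 2) -> bool:
    x1, y1, x2, y2 = line
    return abs(y1 - y2) <= tol and abs(x2 - x1) > tol


def is_vertical(line: Line, tol: int = 2) -> bool:
    x1, y1, x2, y2 = line
    return abs(x1 - x2) <= tol and abs(y2 - y1) > tol


def crosses(h: Line, v: Line, tol: int = 5) -> bool:
    """True if horizontal line h and vertical line v meet, within tol pixels."""
    hx1, hx2 = sorted((h[0], h[2]))
    hy = (h[1] + h[3]) / 2
    vy1, vy2 = sorted((v[1], v[3]))
    vx = (v[0] + v[2]) / 2
    return hx1 - tol <= vx <= hx2 + tol and vy1 - tol <= hy <= vy2 + tol


def filter_table_like_lines(
    lines: List[Line], min_crossings: int = 2, tol: int = 5
) -> List[Line]:
    """Drop lines that do not cross enough perpendicular lines to form a grid.

    Lines that are neither horizontal nor vertical are discarded.
    """
    horizontals = [ln for ln in lines if is_horizontal(ln)]
    verticals = [ln for ln in lines if is_vertical(ln)]
    # Repeat until stable, since removing one line can orphan another.
    while True:
        kept_h = [
            h for h in horizontals
            if sum(crosses(h, v, tol) for v in verticals) >= min_crossings
        ]
        kept_v = [
            v for v in verticals
            if sum(crosses(h, v, tol) for h in kept_h) >= min_crossings
        ]
        if len(kept_h) == len(horizontals) and len(kept_v) == len(verticals):
            return kept_h + kept_v
        horizontals, verticals = kept_h, kept_v


# A small grid plus two stray lines, like a rule under a page header.
grid = [
    (50, 100, 300, 100), (50, 150, 300, 150), (50, 200, 300, 200),
    (50, 100, 50, 200), (175, 100, 175, 200), (300, 100, 300, 200),
]
strays = [(50, 20, 300, 20), (400, 10, 400, 30)]
filtered = filter_table_like_lines(grid + strays)
```

In this example the two stray lines are removed because they cross no perpendicular lines, while all six grid lines are kept. A filter like this would not help when stray lines happen to intersect the table's own lines, so the thresholds would need tuning against real 990 pages.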
The entry point for editing the current Open-Intelligence-inspired bordered table extraction process is get_bordered_table_OI(), which is called in ocirs/table_extraction/table_extraction.py.
