-
Notifications
You must be signed in to change notification settings - Fork 1
Description
Improving the accuracy of table extraction is an ongoing process.
Currently ocirs' bordered and borderless table extraction algorithm is adopted from Open-Intelligence, with small modifications tailored to work with IRS 990 forms.
Here is an example form page with a borderless table
When the form page is cropped down to just the table using CascadeTabNet we see decent cell structure recognition, but it's not perfect.
There are some issues detecting that multiple lines of text belong to the same row. The algorithm also erroneously detects more columns than there are. This is a fairly dirty page, but we'd like to ensure as little post-processing as possible, especially when CascadeTabNet is applied.
The uncropped page has similar column and row group issues, with the additional challenge of extra text above and below the table.
Moving forward, we'd like to improve our method of grouping rows and columns together for borderless tables. This would involve tweaking the get_clustering_indexes() method called in ocirs/table_extraction/borderless_table_extraction.py.
The starting point for borderless table extraction is get_borderless_table() called in ocirs/table_extraction/table_extraction.py.
