Extraction improvements: bordered tables

Improving the accuracy of table extraction is an ongoing process.

Currently, ocirs' bordered and borderless table extraction algorithms are adapted from [Open-Intelligence](https://github.com/nazarimilad/open-intelligence-backend/tree/61847c5b0153bf431c2bc107a099eb3355d76ba6), with small modifications tailored to IRS 990 forms.

Here is an example form page: 

![Charles Koch Institute_2013_25](https://user-images.githubusercontent.com/35546183/119994465-9f18bf00-bf9a-11eb-8de2-807344f45c3e.jpg)

When the image is cropped down to just the table itself using CascadeTabNet, we see high cell structure accuracy during table extraction:

[bordered_cropped.txt](https://github.com/aaronbrezel/ocirs/files/6561401/bordered_cropped.txt)
 
But when the page is left as is, the table extraction process erroneously includes text from outside the table. This can degrade cell structure accuracy and leaves the user with significant post-processing to do on the back end:

[bordered_uncropped.txt](https://github.com/aaronbrezel/ocirs/files/6561350/bordered_uncropped.txt)

In general, the bordered table extraction process should be able to rely on the lines detected by OpenCV. However, in cases such as the example page above, extra lines elsewhere on the page are picked up as well, and the text around them gets extracted along with the table.

There may be some way to tweak the logic of `LineDetector().detect_lines()` in `ocirs/table_extraction/line_detector/line_detector.py` to remove lines whose positions do not conform to an expected table-like standard.
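One possible "table-like" criterion, sketched below as a starting point: keep only lines that cross at least two perpendicular lines, since grid lines in a bordered table intersect each other while a stray rule elsewhere on the page usually does not. This is a rough sketch, not the current ocirs logic. It assumes the detected lines are available as axis-aligned `(x1, y1, x2, y2)` pixel tuples, which may not match what `detect_lines()` actually returns, and the helper names and thresholds (`filter_table_like_lines`, `min_crossings`, `tol`) are hypothetical.

```python
from typing import List, Tuple

# Assumed representation: (x1, y1, x2, y2) in pixel coordinates.
Line = Tuple[int, int, int, int]


def is_horizontal(line: Line, tol: int = 2) -> bool:
    """A line is treated as horizontal if its endpoints share (almost) the same y."""
    return abs(line[1] - line[3]) <= tol


def is_vertical(line: Line, tol: int = 2) -> bool:
    """A line is treated as vertical if its endpoints share (almost) the same x."""
    return abs(line[0] - line[2]) <= tol


def crosses(h: Line, v: Line, tol: int = 5) -> bool:
    """True if horizontal line h and vertical line v meet, within tol pixels."""
    h_lo, h_hi = sorted((h[0], h[2]))  # x extent of the horizontal line
    v_lo, v_hi = sorted((v[1], v[3]))  # y extent of the vertical line
    x_ok = h_lo - tol <= v[0] <= h_hi + tol
    y_ok = v_lo - tol <= h[1] <= v_hi + tol
    return x_ok and y_ok


def filter_table_like_lines(
    lines: List[Line], min_crossings: int = 2, tol: int = 5
) -> List[Line]:
    """Keep lines that cross at least min_crossings perpendicular lines.

    Lines that are neither horizontal nor vertical are dropped.
    Crossings are counted against all detected lines in a single pass.
    """
    horizontals = [l for l in lines if is_horizontal(l)]
    verticals = [l for l in lines if is_vertical(l) and not is_horizontal(l)]
    kept_h = [
        h for h in horizontals
        if sum(crosses(h, v, tol) for v in verticals) >= min_crossings
    ]
    kept_v = [
        v for v in verticals
        if sum(crosses(h, v, tol) for h in horizontals) >= min_crossings
    ]
    return kept_h + kept_v
```

A filter like this would run on the detected lines before cells are built from them. The thresholds would need tuning against real 990 pages, and a single pass can still keep a stray line that happens to cross the table grid.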

The entry point for editing the current Open-Intelligence-inspired bordered table extraction process is `get_bordered_table_OI()`, which is called in `ocirs/table_extraction/table_extraction.py`.

