Extraction improvements: borderless tables #2

Improving the accuracy of table extraction is an ongoing process.

Currently, ocirs' bordered and borderless table extraction algorithm is adapted from [Open-Intelligence](https://github.com/nazarimilad/open-intelligence-backend/tree/master), with small modifications tailored to IRS 990 forms.

Here is an example form page with a borderless table:

![Sarah Scaife Foundation_2015_36](https://user-images.githubusercontent.com/35546183/120002207-94622800-bfa2-11eb-85ec-c059e9168bbb.jpg)

When the form page is cropped down to just the table using CascadeTabNet, we see decent cell structure recognition, but it's not perfect.

[borderless_cropped.txt](https://github.com/aaronbrezel/ocirs/files/6561708/borderless_cropped.txt)

The algorithm has trouble detecting that multiple lines of text belong to the same row, and it also detects more columns than the table actually contains. This is a fairly dirty page, but we'd like to keep post-processing to a minimum, especially when CascadeTabNet is applied.

The uncropped page has similar column and row grouping issues, with the additional challenge of extra text above and below the table.

[borderless_uncropped.txt](https://github.com/aaronbrezel/ocirs/files/6561714/borderless_uncropped.txt)

Moving forward, we'd like to improve our method of grouping rows and columns together for borderless tables. This would involve tweaking the `get_clustering_indexes()` method called in `ocirs/table_extraction/borderless_table_extraction.py`.

The starting point for borderless table extraction is `get_borderless_table()` called in `ocirs/table_extraction/table_extraction.py`.
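
To make the row/column grouping problem concrete, here is a minimal sketch of gap-based 1D clustering. It is not the existing `get_clustering_indexes()` implementation, whose signature and internals are not shown in this issue. The names `cluster_1d` and `max_gap` and the sample coordinates are hypothetical, chosen only for illustration. The idea is that text boxes whose coordinates (e.g. y-centers for rows, left x-edges for columns) lie within `max_gap` of a sorted neighbor end up in the same group.

```python
from typing import List, Sequence


def cluster_1d(values: Sequence[float], max_gap: float) -> List[int]:
    """Return one cluster index per input value (hypothetical helper).

    Values are sorted, and a new cluster starts whenever the distance
    to the previous sorted value exceeds max_gap.
    """
    if not values:
        return []
    # Indices of the inputs, ordered by their coordinate.
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    current = 0
    for prev, idx in zip(order, order[1:]):
        if values[idx] - values[prev] > max_gap:
            current += 1  # gap too large: start a new row/column
        labels[idx] = current
    return labels


# Illustrative y-centers of five text boxes (made-up numbers).
y_centers = [100, 112, 160, 171, 230]
row_labels_loose = cluster_1d(y_centers, max_gap=15)
row_labels_tight = cluster_1d(y_centers, max_gap=5)

# Illustrative left x-edges of five text boxes (made-up numbers).
x_lefts = [50, 52, 300, 305, 48]
col_labels = cluster_1d(x_lefts, max_gap=20)
```

With the loose threshold, wrapped lines of text merge into one row (`[0, 0, 1, 1, 2]`), while the tight threshold splits every line into its own row (`[0, 1, 2, 3, 4]`). This mirrors the failure described above: a tolerance that is too small over-segments rows and columns, so making it adaptive (for example, relative to median text height) may be one direction worth exploring.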
