
Extraction improvements: bordered tables #1

@aaronbrezel

Improving the accuracy of table extraction is an ongoing process.

Currently, ocirs' bordered and borderless table extraction algorithms are adapted from Open-Intelligence, with small modifications tailored to IRS 990 forms.

Here is an example form page:

Charles Koch Institute_2013_25

When the image is cropped down to just the table itself using CascadeTabNet, we see high cell structure accuracy during table extraction.

bordered_cropped.txt

But when the page is left as is, the table extraction process erroneously includes text that is not in the table. This can degrade cell structure accuracy and leaves the user with significant post-processing to do on the back end.

bordered_uncropped.txt

In general, the bordered table extraction process should be able to rely on lines detected by OpenCV. However, in instances such as the example page above, extra lines on the page that are not part of the table cause the extra text to be extracted.

There may be some way to tweak the logic of LineDetector().detect_lines() in ocirs/table_extraction/line_detector/line_detector.py to remove lines whose positions do not conform to an expected table-like standard.
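
One possible "table-like standard", sketched below, is that a real table border crosses at least two perpendicular lines, while a stray rule elsewhere on the page usually crosses none. This is only an illustration and not the current ocirs logic: the `(x1, y1, x2, y2)` line format, the function names, and the `min_crossings` and `tol` parameters are all assumptions, and the actual return value of detect_lines() may look different.

```python
from typing import List, Tuple

# Assumed line format: (x1, y1, x2, y2) in pixel coordinates.
Line = Tuple[int, int, int, int]


def is_horizontal(line: Line, tol: int = 2) -> bool:
    return abs(line[1] - line[3]) <= tol


def is_vertical(line: Line, tol: int = 2) -> bool:
    return abs(line[0] - line[2]) <= tol


def crosses(h: Line, v: Line, tol: int = 5) -> bool:
    """True if horizontal line h and vertical line v meet, within tol pixels."""
    hx1, hx2 = sorted((h[0], h[2]))
    vy1, vy2 = sorted((v[1], v[3]))
    hy = (h[1] + h[3]) / 2
    vx = (v[0] + v[2]) / 2
    return hx1 - tol <= vx <= hx2 + tol and vy1 - tol <= hy <= vy2 + tol


def filter_table_like_lines(
    lines: List[Line], min_crossings: int = 2, tol: int = 5
) -> List[Line]:
    """Keep only lines that cross at least min_crossings perpendicular lines.

    Filtering repeats until nothing changes, because removing one stray line
    can leave another line with too few crossings.
    """
    horizontals = [l for l in lines if is_horizontal(l)]
    verticals = [l for l in lines if is_vertical(l) and not is_horizontal(l)]

    changed = True
    while changed:
        kept_h = [
            h for h in horizontals
            if sum(crosses(h, v, tol) for v in verticals) >= min_crossings
        ]
        kept_v = [
            v for v in verticals
            if sum(crosses(h, v, tol) for h in kept_h) >= min_crossings
        ]
        changed = len(kept_h) != len(horizontals) or len(kept_v) != len(verticals)
        horizontals, verticals = kept_h, kept_v

    return horizontals + verticals


# A 2x2 grid plus one stray rule near the top of the page.
grid = [
    (50, 100, 350, 100), (50, 150, 350, 150), (50, 200, 350, 200),  # rows
    (50, 100, 50, 200), (200, 100, 200, 200), (350, 100, 350, 200),  # columns
]
stray = (50, 30, 350, 30)
kept = filter_table_like_lines(grid + [stray])
print(stray in kept)  # False: the stray rule crosses no vertical line
```

A filter like this would not help when the extra lines themselves form a box or grid elsewhere on the page, so it would likely need to be combined with other checks, such as keeping only the largest connected group of lines.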

The entry point for editing the current Open-Intelligence-inspired bordered table extraction process is get_bordered_table_OI(), which is called in ocirs/table_extraction/table_extraction.py.
