Skip to content

Extraction improvements: borderless tables #2

@aaronbrezel

Description

@aaronbrezel

Improving the accuracy of table extraction is an ongoing process.

Currently ocirs' bordered and borderless table extraction algorithm is adopted from Open-Intelligence, with small modifications tailored to work with IRS 990 forms.

Here is an example form page with a borderless table

Sarah Scaife Foundation_2015_36

When the form page is cropped down to just the table using CascadeTabNet we see decent cell structure recognition, but it's not perfect.

borderless_cropped.txt

There are some issues detecting that multiple lines of text belong to the same row. The algorithm also erroneously detects more columns than there are. This is a fairly dirty page, but we'd like to ensure as little post-processing as possible, especially when CascadeTabNet is applied.

The uncropped page has similar column and row group issues, with the additional challenge of extra text above and below the table.

borderless_uncropped.txt

Moving forward, we'd like to improve our method of grouping rows and columns together for borderless tables. This would involve tweaking the get_clustering_indexes() method called in ocirs/table_extraction/borderless_table_extraction.py.

The starting point for borderless table extraction is get_borderless_table() called in ocirs/table_extraction/table_extraction.py.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions