
Table post-processing #5

@aaronbrezel


ocirs' primary goal is to make it easier for researchers to pull structured tabular data out of scanned IRS Form 990s.

Currently, the data that ocirs returns from its table extraction methods (NineNinetyPage().extract_tables() and NineNinetyForm().extract_component_tables()) is presented more or less exactly as detected on the page.

This means the output may contain undesired text (especially if the page is not cropped with CascadeTabNet) or the wrong number of columns (especially when extracting borderless tables). See the "Extraction improvements" issues for more information on those errors. And since non-profits can attach their own tables to Form 990s, column names for the same form component vary naturally across different 990s. See for yourself in the two attached tables from two different organizations, which both represent the same form component: 990PF Part XV, Grants and contributions paid during the year.

Screenshot_2021-05-27 912073258_201712_990PF pdf_rotated

Sarah Scaife Foundation_2015_36

In a perfect world, ocirs should be able to standardize output (i.e. consistent column names, consistent column order, consistent data types) for a given form component. In practical terms, when a user requests the form component "PFGrntOrCntrApprvFrFt", they should know ahead of time what the output dataframe is going to look like (recipient name in one column, address in another), even if they're processing forms from multiple non-profits. This would allow users to confidently process larger batches of 990 form PDFs.
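To make the idea of standardized output concrete, here is a rough sketch of what a per-component cleanup could look like using plain pandas. Everything in it is an assumption for illustration: the canonical column names, the alias table, and the `standardize_table` helper are hypothetical and are not part of ocirs or IRSx.

```python
import pandas as pd

# Hypothetical target schema for one form component. These names are
# illustrative only; they are not taken from ocirs or IRSx.
CANONICAL_COLUMNS = ["recipient_name", "recipient_address", "purpose", "amount"]

# Hypothetical mapping from header variants seen on filings to canonical names.
HEADER_ALIASES = {
    "name": "recipient_name",
    "recipient": "recipient_name",
    "address": "recipient_address",
    "purpose of grant": "purpose",
    "grant amount": "amount",
}


def standardize_table(raw: pd.DataFrame) -> pd.DataFrame:
    """Rename, reorder, and type-coerce one extracted table."""
    # Normalize detected headers (trim whitespace, lowercase) before lookup.
    rename_map = {}
    for col in raw.columns:
        key = str(col).strip().lower()
        rename_map[col] = HEADER_ALIASES.get(key, key)
    renamed = raw.rename(columns=rename_map)

    # reindex fixes the column order, adds missing canonical columns as NaN,
    # and drops any column that is not part of the schema.
    out = renamed.reindex(columns=CANONICAL_COLUMNS)

    # Strip currency symbols and thousands separators, then convert to
    # numbers. Values that still cannot be parsed become NaN.
    amount = out["amount"].astype(str).str.replace(r"[$,\s]", "", regex=True)
    out["amount"] = pd.to_numeric(amount, errors="coerce")
    return out


raw = pd.DataFrame(
    {"Grant Amount": ["$1,000", "n/a"], " Name ": ["Org A", "Org B"], "Extra": [1, 2]}
)
clean = standardize_table(raw)
```

With this approach a caller always gets the same four columns in the same order, regardless of which headers the filer used. The sketch does not handle two detected columns mapping to the same canonical name; that case would need its own rule.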

This issue would likely have to be tackled on a component-by-component basis. Any table-extraction post-processing step would likely take place inside NineNinetyForm().extract_component_tables() and be activated by a user-set boolean parameter. This is similar to how the fuzzy merge_dataframes() function is applied now.
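One way the component-by-component, opt-in design could be wired up is a small registry of cleanup functions keyed by component name. This is only a sketch: `POST_PROCESSORS`, `register`, `finalize_tables`, and the `standardize` flag are hypothetical names, and nothing here reflects how extract_component_tables() is actually implemented.

```python
from typing import Callable, Dict, List

import pandas as pd

PostProcessor = Callable[[pd.DataFrame], pd.DataFrame]

# Hypothetical registry: form component name -> cleanup function.
POST_PROCESSORS: Dict[str, PostProcessor] = {}


def register(component: str) -> Callable[[PostProcessor], PostProcessor]:
    """Decorator that registers a cleanup function for one form component."""
    def decorator(func: PostProcessor) -> PostProcessor:
        POST_PROCESSORS[component] = func
        return func
    return decorator


@register("PFGrntOrCntrApprvFrFt")
def clean_grants(table: pd.DataFrame) -> pd.DataFrame:
    # Placeholder cleanup: drop rows where every cell is empty.
    return table.dropna(how="all").reset_index(drop=True)


def finalize_tables(
    component: str, tables: List[pd.DataFrame], standardize: bool = False
) -> List[pd.DataFrame]:
    """Apply the registered cleanup only when the caller opts in."""
    processor = POST_PROCESSORS.get(component)
    if not standardize or processor is None:
        # Flag off, or no cleanup written for this component yet:
        # return the tables exactly as extracted.
        return tables
    return [processor(table) for table in tables]


tables = [pd.DataFrame({"a": [1, None], "b": ["x", None]})]
untouched = finalize_tables("PFGrntOrCntrApprvFrFt", tables)
cleaned = finalize_tables("PFGrntOrCntrApprvFrFt", tables, standardize=True)
```

Defaulting the flag to off keeps current behavior unchanged, and components without a registered function fall through untouched, so cleanup rules can be added one component at a time.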

In terms of column naming conventions, we could leverage IRSx's form index, which provides a dictionary for all the various 990 tables and associated columns.
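Whatever the source of canonical names ends up being, detected headers will rarely match them exactly. Below is a minimal sketch of fuzzy header matching with the standard library's `difflib`. The `CANONICAL` list is a stand-in and is not taken from IRSx's form index, and `match_headers` is a hypothetical helper.

```python
import difflib
from typing import Dict, List, Optional

# Stand-in canonical names; in practice these might be derived from a
# reference such as IRSx's form index (that lookup is not shown here).
CANONICAL = ["recipient_name", "recipient_address", "purpose", "amount"]


def match_headers(
    detected: List[str], canonical: List[str], cutoff: float = 0.6
) -> Dict[str, Optional[str]]:
    """Map each detected header to its closest canonical name, or None."""
    mapping: Dict[str, Optional[str]] = {}
    for header in detected:
        # Normalize so "Recipient Name" compares well with "recipient_name".
        key = header.strip().lower().replace(" ", "_")
        hits = difflib.get_close_matches(key, canonical, n=1, cutoff=cutoff)
        mapping[header] = hits[0] if hits else None
    return mapping


matches = match_headers(["Recipent Name", "Amount", "zzz"], CANONICAL)
```

Headers that fall below the cutoff map to `None` rather than to a guess, which leaves room for a manual review step on unusual filings.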
