PDFPlumber - A filter to remove superscript #289

Recall how PDFPlumber extracts superscript text as a part of the cell text, like "Assignment 11".

```py
# `pdf` is assumed to be an already opened pdfplumber PDF (e.g. from pdfplumber.open).
tables = pdf.pages[1].extract_tables()

for table in tables:
    for row in table:
        print(row)
```

<img width="700" alt="Screenshot 2023-06-28 at 11 57 33 PM" src="https://github.com/techstartucalgary/lifeline/assets/20573623/50f60530-04d4-4cb9-8634-e51752f100d1">

<img width="700" alt="Screenshot 2023-06-28 at 11 57 20 PM" src="https://github.com/techstartucalgary/lifeline/assets/20573623/dd553c70-26fb-4edb-8a02-2666945753d0">

------

Applying a filter to the page programmatically gets rid of the superscript text.


```py
# Named to avoid shadowing Python's built-in `filter`.
def keep_normal_size(obj):
    # Keep every non-char object; keep a char only if it is at least normal size.
    return obj["object_type"] != "char" or obj["size"] >= 7.0
```

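When the object is a character, its `size` tells the two kinds of text apart: in this PDF a normal character has a size of about 7.2 (7.200000099000022), while a superscript character is only about 6.0 (6.000000082499923). A threshold of 7.0 on `size` therefore keeps the normal text and excludes the superscript.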
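To pick such a threshold for a different PDF, it can help to look at how character sizes are distributed first. The snippet below is a minimal sketch, not part of the original investigation: `size_histogram` is a hypothetical helper, and `sample_chars` is synthetic data that only mimics the two sizes quoted above.

```python
from collections import Counter


def size_histogram(chars, ndigits=1):
    """Count characters per font size, rounded to `ndigits` decimals."""
    return Counter(round(char["size"], ndigits) for char in chars)


# Synthetic stand-ins for char dictionaries (assumption: only the keys used here).
sample_chars = (
    [{"object_type": "char", "text": "A", "size": 7.200000099000022}] * 12
    + [{"object_type": "char", "text": "1", "size": 6.000000082499923}] * 2
)

histogram = size_histogram(sample_chars)
print(histogram.most_common())  # [(7.2, 12), (6.0, 2)]
```

On a real page, the same helper could be called with `pdf.pages[1].chars`; the dominant size is the body text, and any cutoff between it and the smaller size would do.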

Dictionary data structure of a "char":

<img width="275" alt="Screenshot 2023-06-29 at 12 02 51 AM" src="https://github.com/techstartucalgary/lifeline/assets/20573623/8e326093-2166-484f-8418-334a4b8a4f9d">


Now, with the filter applied to the page before extraction:

```py
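# Keep non-char objects and normal-size chars; drop the smaller superscript chars.
# Named to avoid shadowing Python's built-in `filter`.
def keep_normal_size(obj):
    return obj["object_type"] != "char" or obj["size"] >= 7.0

# Extract tables from the filtered page and print each row.
tables = pdf.pages[1].filter(keep_normal_size).extract_tables()

for table in tables:
    for row in table:
        print(row)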
```

The table is now extracted without the superscript text:

<img width="700" alt="Screenshot 2023-06-29 at 12 03 53 AM" src="https://github.com/techstartucalgary/lifeline/assets/20573623/f7937acb-33da-40ca-bf37-404a20fbdd56">


This is not the only way to implement the filter; a condition on `y0` and `y1` might work too.
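One way a `y0`/`y1` condition could look is sketched below, under the assumption that superscript characters have a shorter bounding box than normal text. `make_height_filter` and the `min_height` value are hypothetical and untested against the PDF from this issue; the sample dictionaries are synthetic.

```python
def make_height_filter(min_height):
    """Build a test function that drops chars shorter than `min_height`."""

    def keep(obj):
        # Lines, rects and other non-char objects always pass through.
        if obj["object_type"] != "char":
            return True
        # Bounding-box height of the character, from its vertical extent.
        return (obj["y1"] - obj["y0"]) >= min_height

    return keep


keep = make_height_filter(min_height=7.0)  # assumed cutoff, tune per document

# Synthetic objects (assumption: only the keys the filter reads).
normal_char = {"object_type": "char", "text": "A", "y0": 100.0, "y1": 107.2}
super_char = {"object_type": "char", "text": "1", "y0": 103.0, "y1": 109.0}
line_obj = {"object_type": "line", "y0": 90.0, "y1": 90.0}

print(keep(normal_char), keep(super_char), keep(line_obj))
```

The returned function has the same shape as the size-based filter, so it could be passed to `pdf.pages[1].filter(...)` in the same way.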
