Description
Recall that pdfplumber extracts superscript text as part of the cell text, so a cell can come out as "Assignment 11", with the superscript marker run into the regular text.
    # pdf is assumed to be an already opened pdfplumber PDF object
    tables = pdf.pages[1].extract_tables()
    for table in tables:
        for row in table:
            print(row)
Applying a filter to the page programmatically gets rid of the superscript text.
    def keep_normal_size(obj):
        # Keep every non-character object, and keep characters only
        # when their size is at least 7.0 (superscripts are smaller).
        return obj["object_type"] != "char" or obj["size"] >= 7.0

For character objects, regular text has a size of about 7.2 (reported as 7.200000099000022), while the superscript text is only about 6.0 (reported as 6.000000082499923). Excluding characters below a size threshold therefore drops the superscripts.
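One way to pick such a threshold is to tally the character sizes on the page first. The following is only a minimal sketch: `size_histogram` is a hypothetical helper, not part of pdfplumber, and `sample_chars` is made-up stand-in data that reuses the two sizes quoted above in place of real entries from `page.chars`.

```python
from collections import Counter

def size_histogram(chars, ndigits=1):
    """Count characters per font size, rounded to `ndigits` decimals."""
    return Counter(round(c["size"], ndigits) for c in chars)

# Stand-in data (assumption): two regular characters and one superscript,
# shaped like the dictionaries pdfplumber exposes for characters.
sample_chars = [
    {"text": "A", "size": 7.200000099000022},
    {"text": "s", "size": 7.200000099000022},
    {"text": "1", "size": 6.000000082499923},
]

hist = size_histogram(sample_chars)
print(hist.most_common())
```

On a real page, passing `pdf.pages[1].chars` instead of the sample list should show which sizes dominate, and a cutoff such as 7.0 can then be chosen between the regular size and the superscript size.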
Dictionary data structure of a "char":
Now:
    # print table
    def keep_normal_size(obj):
        # Keep non-character objects and characters of size >= 7.0
        return obj["object_type"] != "char" or obj["size"] >= 7.0

    tables = pdf.pages[1].filter(keep_normal_size).extract_tables()
    for table in tables:
        for row in table:
            print(row)

With the filter applied, the table extracts to this cleaner version:
This is not the only way to write the filter; a condition on y0 and y1 might work too.
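As a rough illustration of the y0/y1 idea, the sketch below keeps a character only when its bounding-box height (`y1 - y0`) reaches a minimum. This is an assumption-laden variant, not something tested on the PDF in question: `make_height_filter` is a hypothetical helper, the 7.0 cutoff is borrowed from the size-based filter, and the sample objects use invented coordinates.

```python
def make_height_filter(min_height):
    """Build a filter that drops characters shorter than `min_height`."""
    def keep(obj):
        # Non-character objects (lines, rects, ...) are always kept.
        if obj["object_type"] != "char":
            return True
        # y0 and y1 are the bottom and top of the character's bounding box.
        return (obj["y1"] - obj["y0"]) >= min_height
    return keep

keep = make_height_filter(7.0)

# Invented sample objects (assumption): a regular character,
# a shorter superscript character, and a non-character object.
regular = {"object_type": "char", "y0": 700.0, "y1": 707.5}
superscript = {"object_type": "char", "y0": 703.0, "y1": 709.0}
rule = {"object_type": "line"}

kept = [keep(regular), keep(superscript), keep(rule)]
print(kept)
```

If bounding-box heights in the actual PDF separate regular text from superscripts the way the sizes do, the result of `make_height_filter(...)` could be passed to `pdf.pages[1].filter(...)` in the same way as the size-based filter. Whether the heights separate that cleanly would need to be checked against the page's characters first.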
tim-macphail