Skip to content

Conversation

@bosd
Copy link
Collaborator

@bosd bosd commented Feb 25, 2023

Was running some tests, encountered following error when pdfplumber is not available.
This PR returns and empty value and let invoice2data fail gracefully.

Before:
invoice2data input.pdf --input-reader=pdfplumber

DEBUG:invoice2data.input.pdfplumber: Cannot import pdfplumber
Traceback (most recent call last):
  File "/home/emiel/.local/bin/invoice2data", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/emiel/.local/lib/python3.11/site-packages/invoice2data/main.py", line 312, in main
    res = extract_data(f.name, templates=templates, input_module=input_module)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/emiel/.local/lib/python3.11/site-packages/invoice2data/main.py", line 166, in extract_data
    extracted_str = input_module.to_text(invoicefile).decode("utf-8")
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/emiel/.local/lib/python3.11/site-packages/invoice2data/input/pdfplumber.py", line 29, in to_text
    with pdfplumber.open(path, laparams={"detect_vertical": True}) as pdf:
         ^^^^^^^^^^
UnboundLocalError: cannot access local variable 'pdfplumber' where it is not associated with a value

After:

ERROR:invoice2data.input.pdfplumber: Cannot import pdfplumber
ERROR:root: Failed to extract text from testin.pdf using invoice2data.input.pdfplumber

fixes #362

@bosd bosd force-pushed the importerr-pdfplumber branch from 024d674 to ed9499d Compare February 25, 2023 16:38
except ImportError:
logger.debug("Cannot import pdfplumber")
logger.error("Cannot import pdfplumber")
return "".encode("UTF-8")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returning empty string suggests that invoice was parsed but was empty.
If we want to return some value then make it None please.

If you take a look at pdftotext.py however, you'll see it raises EnvironmentError if pdftotext is missing. So returning None will make pdfplumber.py somehow incompatible with the pdftotext.py.

As for decision what is better: return None or raise EnvironmentError - I have no idea or preference.

Copy link
Collaborator Author

@bosd bosd Mar 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rmilecki I agree, we need to think of an solution for this.

Returning None conflicts with:

extracted_str = input_module.to_text(invoicefile).decode("utf-8")

as Nonetype cannot be decoded
(Maybe we can remove the decode line? I assume it is a python2 leftover)

pdftotext might be a different story, as it is one of the default/main parsers.
So making everything fail when it's unavailable is not a big deal.

Is there a way to raise the error, only if the pdfplumber input module is called?
We don't want the whole lib to fail on this missing requirement.

This pr #491
is currently failing because of the missing pdfplumber.
(the test should not even run when it is unavailable but that's a different sunbject)
In that example, there should be an ImportError or EnvironmentError

@bosd bosd force-pushed the importerr-pdfplumber branch from 1ae80ab to db0ac11 Compare March 30, 2023 21:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Template validator ?

2 participants