This project aims to automate the processing of invoices using PDF extraction and data manipulation techniques.
After setting up all the files, the project structure will look like this:
project/
├── docs/
│   ├── setup.md
│   ├── data_processing.md
│   ├── file_extraction.md
│   ├── zip_data_processing.md
│   ├── pdf_operations.md
│   └── exception.md
├── InvoicesData/
│   └── TestDataSet/
│       └── Pdfs
├── src/
│   ├── logging_utils.py
│   ├── pdf_operations.py
│   ├── file_extraction.py
│   ├── fail_file_extraction.py
│   ├── exception_handler.py
│   ├── zip_data_processing.py
│   ├── data_processing.py
│   └── client_config.json
├── output/
│   ├── failed/
│   │   └── output81.json (File generated after executing fail_file_extraction.py)
│   ├── LogFile.log (File generated after executing file_extraction.py)
│   ├── failed_files.txt (File generated after executing file_extraction.py)
│   ├── invoice.json (File generated after executing file_extraction.py)
│   └── exception.json (File generated after executing exception_handler.py)
├── pdfservices-api-credentials.json (You have to set this up according to setup.md)
├── private.key (You have to set this up according to setup.md)
├── output.csv (File generated after executing data_processing.py)
└── README.md
- Set up the project according to setup.md.
- If you have any questions about the source code, refer to the docs.
- If you have any further questions, post them on GitHub.
- Place the PDF files to be processed in the source folder specified in `file_extraction.py`.
- Run `file_extraction.py` to start processing the PDF files.
- The script extracts the relevant data from the PDF files, updates the master data, and saves it in the `invoice.json` file.
- If any files fail to process initially, the script retries them up to the number of times specified by `MAX_RETRY_LIMIT` in `file_extraction.py`.
- If a problem on the user's end, such as an exhausted API quota or a network issue, interrupts processing, the affected files are written to `failed_files.txt`. After resolving the problem, run `fail_file_extraction.py` directly to process the remaining files.
- If the maximum retry limit is reached and there are still failed files, the script saves the list of failed files in `failed_files.txt` and saves the JSON data in the `failed` folder.
- Run `exception_handler.py` to process the failed files in the `failed` folder separately and generate the data in `exception.json`.
- Finally, run `data_processing.py`, specifying the paths to `invoice.json` and `exception.json`. The output is written to `output.csv`.
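The retry behaviour described in the steps above can be sketched roughly as follows. This is only an illustration, not the code in `file_extraction.py`: the function `process_with_retries`, the `extract_one` callable, and the value `3` for `MAX_RETRY_LIMIT` are assumptions made for the example, and "retry" is taken to mean one initial attempt plus up to `MAX_RETRY_LIMIT` further attempts.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable

# Assumed value for illustration; the real constant is defined in file_extraction.py.
MAX_RETRY_LIMIT = 3


def process_with_retries(
    pdf_paths: Iterable[Path],
    extract_one: Callable[[Path], dict],  # hypothetical per-file extraction function
    failed_list_path: Path,
    max_retries: int = MAX_RETRY_LIMIT,
) -> Dict[str, dict]:
    """Extract every PDF, retrying the failures up to max_retries times."""
    results: Dict[str, dict] = {}
    pending = list(pdf_paths)

    # One initial pass plus at most max_retries retry passes.
    for _attempt in range(1 + max_retries):
        still_failing = []
        for pdf in pending:
            try:
                results[pdf.name] = extract_one(pdf)
            except Exception:
                # Any error (quota, network, bad PDF) sends the file to the next pass.
                still_failing.append(pdf)
        pending = still_failing
        if not pending:
            break

    # Record what is still failing so it can be reprocessed later.
    failed_list_path.write_text("".join(f"{p}\n" for p in pending))
    return results
```

With this shape, an empty `failed_files.txt` after a run means that every file was eventually processed.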
Note: Make sure to set up the necessary credentials and configurations for the Adobe PDF Services API as described in the project documentation.
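As a quick pre-flight check, something like the sketch below could confirm that the two credential files shown in the project tree are present before any script is run. The helper name `missing_credentials` is hypothetical and is not part of the project; the check only tests that the files exist, not that their contents are valid.

```python
from pathlib import Path
from typing import List

# File names taken from the project structure; both live in the project root.
REQUIRED_FILES = ("pdfservices-api-credentials.json", "private.key")


def missing_credentials(project_root: Path) -> List[str]:
    """Return the required credential files that are absent from project_root."""
    return [name for name in REQUIRED_FILES if not (project_root / name).is_file()]


if __name__ == "__main__":
    missing = missing_credentials(Path("."))
    if missing:
        print("Missing credential files:", ", ".join(missing))
    else:
        print("All credential files found.")
```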
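The project relies on the following dependencies:
Python 3.x, adobe-pdfservices-sdk, logging, json, tempfile, csv, re, zipfile
Of these, logging, json, tempfile, csv, re, and zipfile are part of the Python standard library and need no installation. Install the Adobe PDF Services SDK using pip or another appropriate package manager.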
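To verify that the dependencies are importable in the current environment, a small check along these lines may help. The helper `missing_modules` is hypothetical, and the SDK's top-level import name is assumed here to be `adobe`; adjust it if your installed version differs.

```python
import importlib.util
from typing import Iterable, List

# Standard-library modules listed in the dependencies above.
STDLIB_MODULES = ("logging", "json", "tempfile", "csv", "re", "zipfile")

# Assumption: the Adobe PDF Services SDK is imported under the top-level name "adobe".
SDK_MODULE = "adobe"


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(STDLIB_MODULES + (SDK_MODULE,))
    print("Missing modules:", missing if missing else "none")
```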


