Enhance OCR script: dual output (Markdown & Word) and better batch handling #5
…and enhance error messages
…and usage details
Pull Request Overview
This pull request enhances the OCR script by adding dual output format support (Markdown and Word) and improving batch processing capabilities. The changes introduce automatic Word document generation using pypandoc, better error handling, and automatic directory creation.
Key changes:
- Dual output format support: automatically generates both Markdown and Word (.docx) files from PDF inputs
- Enhanced database management: replaced append_to_db with update_db for better record tracking
- Improved robustness: automatic directory creation and better error handling for Word conversion
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| README.md | Updated documentation to reflect dual output support, new pypandoc dependency, and improved batch processing features |
| BatchPdfConv.py | Enhanced script with Word conversion capability, improved database operations, and automatic directory creation |
def update_db(record):
    """Update or insert a processing record into the CSV database."""
    records = []
    found = False

    # Load existing rows, swapping in the new record where the filename matches.
    if os.path.exists(DB_CSV):
        with open(DB_CSV, newline="", encoding="utf-8") as csvfile:
            reader = csv.DictReader(csvfile)
            for row in reader:
                if row["filename"] == record["filename"]:
                    records.append(record)
                    found = True
                else:
                    records.append(row)

    # No existing row for this file: insert it.
    if not found:
        records.append(record)

    # Rewrite the whole file with the merged records.
    with open(DB_CSV, "w", newline="", encoding="utf-8") as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(records)
Copilot
AI
Aug 25, 2025
The update_db function rewrites the entire CSV file every time it's called, which could lead to data loss if the process is interrupted during file writing. Consider using atomic file operations or a more robust database solution for production use.
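One way to act on this comment is to write the merged records to a temporary file and then rename it over the original, so an interrupted run leaves the old CSV intact. The sketch below is an illustration only, not the PR's code: the name update_db_atomic, the db_path parameter, and the two-column FIELDNAMES are assumptions made for the example.

```python
import csv
import os
import tempfile

# Hypothetical column names; the real FIELDNAMES live in BatchPdfConv.py.
FIELDNAMES = ["filename", "status"]


def update_db_atomic(db_path, record, fieldnames=FIELDNAMES):
    """Insert or replace the row matching record['filename'], then swap the file in."""
    records = []
    found = False
    if os.path.exists(db_path):
        with open(db_path, newline="", encoding="utf-8") as csvfile:
            for row in csv.DictReader(csvfile):
                if row["filename"] == record["filename"]:
                    records.append(record)
                    found = True
                else:
                    records.append(row)
    if not found:
        records.append(record)

    # Write to a temp file in the same directory so the rename stays on one filesystem.
    directory = os.path.dirname(os.path.abspath(db_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", newline="", encoding="utf-8") as tmpfile:
            writer = csv.DictWriter(tmpfile, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(records)
            tmpfile.flush()
            os.fsync(tmpfile.fileno())
        # os.replace swaps the new file in as a single rename operation.
        os.replace(tmp_path, db_path)
    except BaseException:
        # Leave the original CSV untouched and remove the partial temp file.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

The read-merge-write logic is unchanged from update_db; only the final write differs. A crash before os.replace leaves at worst a stray .tmp file next to the database.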
if os.path.exists(output_path):
    try:
        docx_path = os.path.join(
            EXPORT_DIR, pdf_filename.rsplit(".", 1)[0] + ".docx"
        )
        pypandoc.convert_file(output_path, "docx", outputfile=docx_path)
        print(f"Converted to Word: {docx_path}")
    except Exception as e:
        logging.error(f"Word conversion failed for {pdf_filename}: {e}")
        print(f"Word conversion failed for {pdf_filename}: {e}")
else:
    print(f"Markdown file not found, skipping Word conversion for {pdf_filename}")
Copilot
AI
Aug 25, 2025
The error handling for pypandoc conversion is too generic. Consider catching specific pypandoc exceptions (like RuntimeError for pandoc not found) to provide more helpful error messages to users.
if os.path.exists(output_path):
    try:
        docx_path = os.path.join(
            EXPORT_DIR, pdf_filename.rsplit(".", 1)[0] + ".docx"
        )
        pypandoc.convert_file(output_path, "docx", outputfile=docx_path)
        print(f"Converted to Word: {docx_path}")
    except RuntimeError as e:
        logging.error(f"Pandoc not found or failed for {pdf_filename}: {e}")
        print(f"Word conversion failed for {pdf_filename}: {e}\nPlease ensure that Pandoc is installed and available in your PATH. See https://pandoc.org/installing.html")
    except Exception as e:
        logging.error(f"Word conversion failed for {pdf_filename}: {e}")
        print(f"Word conversion failed for {pdf_filename}: {e}")
else:
    print(f"Markdown file not found, skipping Word conversion for {pdf_filename}")
    """List all PDF files in the DOC_DIR folder and its subdirectories."""
    if not os.path.isdir(DOC_DIR):
        print(f"Error: Directory '{DOC_DIR}' not found.")
        os.makedirs(DOC_DIR)
Copilot
AI
Aug 25, 2025
Creating the directory and then immediately exiting prevents the user from adding files and running again in the same execution. Consider checking if the directory is empty instead of just checking if it exists, or provide a way to continue execution after directory creation.
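A minimal sketch of the "check for empty instead of missing" idea: create the folder unconditionally, collect the PDFs, and let the caller decide what to do with an empty result. The function name find_pdfs and the doc_dir parameter are placeholders for the example; the PR's own function and its DOC_DIR constant may differ.

```python
import os


def find_pdfs(doc_dir):
    """Create doc_dir if needed and return a sorted list of PDF paths beneath it."""
    # exist_ok=True makes this a no-op when the folder is already there.
    os.makedirs(doc_dir, exist_ok=True)
    pdfs = []
    for root, _dirs, files in os.walk(doc_dir):
        for name in files:
            if name.lower().endswith(".pdf"):
                pdfs.append(os.path.join(root, name))
    return sorted(pdfs)
```

With this shape, a freshly created folder and an existing but empty folder both produce an empty list, so a single "no PDFs found, add files and run again" message covers both cases.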
…Lite for processed files tracking
This pull request adds automatic conversion of PDF files to both Markdown and Word (.docx) formats, improves batch processing robustness, and updates the documentation to reflect these enhancements and new requirements. The changes also introduce better error handling, automatic directory creation, and a more robust processed files database.
Batch processing and conversion improvements:
- Word (.docx) output is generated with pypandoc, with error handling for the Word conversion step.
- Required directories (docs_import and output subfolders) are created automatically if they do not exist, improving usability and reducing setup errors.
- The append_to_db function is replaced with update_db to allow updating or inserting records, ensuring accurate tracking of file statuses.
Code and formatting consistency:
Documentation updates:
- README.md now documents dual output, the new dependencies (pypandoc and pandoc), improved error handling, and automatic directory creation. The project structure and troubleshooting sections are updated accordingly.