@hmdqr hmdqr commented Aug 25, 2025

This pull request adds automatic conversion of PDF files to both Markdown and Word (.docx) formats, improves batch processing robustness, and updates the documentation to reflect these enhancements and new requirements. The changes also introduce better error handling, automatic directory creation, and a more robust processed files database.

Batch processing and conversion improvements:

  • The script now automatically converts each PDF to both Markdown and Word (.docx) formats using pypandoc, with error handling for the Word conversion step.
  • The script creates required directories (docs_import and output subfolders) if they do not exist, improving usability and reducing setup errors.
  • The processed files database logic is improved: the append_to_db function is replaced with update_db to allow updating or inserting records, ensuring accurate tracking of file statuses.

Code and formatting consistency:

  • Consistent use of double quotes for strings and minor formatting improvements throughout the script.

Documentation updates:

  • The README now documents the dual output (Markdown and Word), new dependencies (pypandoc and pandoc), improved error handling, and automatic directory creation. The project structure and troubleshooting sections are updated accordingly.

Copilot AI review requested due to automatic review settings August 25, 2025 23:39
Copilot AI left a comment

Pull Request Overview

This pull request enhances the OCR script by adding dual output format support (Markdown and Word) and improving batch processing capabilities. The changes introduce automatic Word document generation using pypandoc, better error handling, and automatic directory creation.

Key changes:

  • Dual output format support: automatically generates both Markdown and Word (.docx) files from PDF inputs
  • Enhanced database management: replaced append_to_db with update_db for better record tracking
  • Improved robustness: automatic directory creation and better error handling for Word conversion

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| README.md | Updated documentation to reflect dual output support, new pypandoc dependency, and improved batch processing features |
| BatchPdfConv.py | Enhanced script with Word conversion capability, improved database operations, and automatic directory creation |

Comment on lines 61 to 84
def update_db(record):
    """Update or insert a processing record into the CSV database."""
    records = []
    found = False

    # Read existing rows, swapping in the new record when the filename matches.
    if os.path.exists(DB_CSV):
        with open(DB_CSV, newline="", encoding="utf-8") as csvfile:
            reader = csv.DictReader(csvfile)
            for row in reader:
                if row["filename"] == record["filename"]:
                    records.append(record)
                    found = True
                else:
                    records.append(row)

    # No existing row for this file: insert it as a new record.
    if not found:
        records.append(record)

    # Rewrite the whole file with a single header followed by all rows.
    with open(DB_CSV, "w", newline="", encoding="utf-8") as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(records)


Copilot AI Aug 25, 2025

The update_db function rewrites the entire CSV file every time it's called, which could lead to data loss if the process is interrupted during file writing. Consider using atomic file operations or a more robust database solution for production use.
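
One way to act on this comment is to write the updated rows to a temporary file and then swap it into place with os.replace, so an interrupted run leaves the previous CSV untouched. The following is a minimal sketch, not the script's actual code: the function name update_db_atomic, the db_csv parameter, and the two-column FIELDNAMES are assumptions made for illustration, since the real column list is not shown in this thread.

```python
import csv
import os
import tempfile

# Hypothetical column list; the real FIELDNAMES in BatchPdfConv.py is not shown here.
FIELDNAMES = ["filename", "status"]


def update_db_atomic(record, db_csv, fieldnames=FIELDNAMES):
    """Upsert `record` by filename, then replace the CSV in one rename step."""
    records = []
    found = False

    if os.path.exists(db_csv):
        with open(db_csv, newline="", encoding="utf-8") as csvfile:
            for row in csv.DictReader(csvfile):
                if row["filename"] == record["filename"]:
                    records.append(record)
                    found = True
                else:
                    records.append(row)

    if not found:
        records.append(record)

    # Create the temp file next to the target so the rename stays on one filesystem.
    target_dir = os.path.dirname(os.path.abspath(db_csv))
    fd, tmp_path = tempfile.mkstemp(dir=target_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", newline="", encoding="utf-8") as tmpfile:
            writer = csv.DictWriter(tmpfile, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(records)
            tmpfile.flush()
            os.fsync(tmpfile.fileno())
        # The old file stays in place until this call swaps in the new one.
        os.replace(tmp_path, db_csv)
    except BaseException:
        # Clean up the partial temp file and leave the original CSV as it was.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the database grows large or is written by several processes at once, the standard-library sqlite3 module may be a better fit than rewriting a CSV on every update.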

Comment on lines +145 to +156
if os.path.exists(output_path):
    try:
        docx_path = os.path.join(
            EXPORT_DIR, pdf_filename.rsplit(".", 1)[0] + ".docx"
        )
        pypandoc.convert_file(output_path, "docx", outputfile=docx_path)
        print(f"Converted to Word: {docx_path}")
    except Exception as e:
        logging.error(f"Word conversion failed for {pdf_filename}: {e}")
        print(f"Word conversion failed for {pdf_filename}: {e}")
else:
    print(f"Markdown file not found, skipping Word conversion for {pdf_filename}")
Copilot AI Aug 25, 2025

The error handling for pypandoc conversion is too generic. Consider catching specific pypandoc exceptions (like RuntimeError for pandoc not found) to provide more helpful error messages to users.

Suggested change
if os.path.exists(output_path):
    try:
        docx_path = os.path.join(
            EXPORT_DIR, pdf_filename.rsplit(".", 1)[0] + ".docx"
        )
        pypandoc.convert_file(output_path, "docx", outputfile=docx_path)
        print(f"Converted to Word: {docx_path}")
    except RuntimeError as e:
        logging.error(f"Pandoc not found or failed for {pdf_filename}: {e}")
        print(f"Word conversion failed for {pdf_filename}: {e}\nPlease ensure that Pandoc is installed and available in your PATH. See https://pandoc.org/installing.html")
    except Exception as e:
        logging.error(f"Word conversion failed for {pdf_filename}: {e}")
        print(f"Word conversion failed for {pdf_filename}: {e}")
else:
    print(f"Markdown file not found, skipping Word conversion for {pdf_filename}")
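
A possible refinement of the suggestion above is to separate "pandoc is missing" from "pandoc ran but the conversion failed". The sketch below rests on an assumption that should be checked against the installed pypandoc version: that a missing pandoc binary surfaces as OSError, while a failed conversion surfaces as RuntimeError. The helper name convert_markdown_to_docx, its convert parameter, and the returned status strings are hypothetical; the converter is passed in so the control flow can be exercised without pandoc installed.

```python
import logging


def convert_markdown_to_docx(md_path, docx_path, convert):
    """Call `convert` and return a short status string instead of raising.

    `convert` is expected to accept (source, to_format, outputfile=...),
    which matches how pypandoc.convert_file is called in this PR.
    """
    try:
        convert(md_path, "docx", outputfile=docx_path)
    except OSError as e:
        # Assumption: this is the error raised when no pandoc binary is found.
        logging.error(
            "Pandoc appears to be missing for %s: %s. "
            "See https://pandoc.org/installing.html",
            md_path,
            e,
        )
        return "pandoc_missing"
    except RuntimeError as e:
        # Assumption: pandoc was found but the conversion itself failed.
        logging.error("Word conversion failed for %s: %s", md_path, e)
        return "conversion_failed"
    return "ok"
```

In the script this could be called as convert_markdown_to_docx(output_path, docx_path, pypandoc.convert_file), with the returned status recorded in the processed files database.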

"""List all PDF files in the DOC_DIR folder and its subdirectories."""
if not os.path.isdir(DOC_DIR):
    print(f"Error: Directory '{DOC_DIR}' not found.")
    os.makedirs(DOC_DIR)
Copilot AI Aug 25, 2025

Creating the directory and then immediately exiting prevents the user from adding files and running again in the same execution. Consider checking if the directory is empty instead of just checking if it exists, or provide a way to continue execution after directory creation.
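
One way to follow this advice is to create the folder unconditionally and treat "no PDFs found" as the condition to report, rather than "directory missing". This is a rough sketch under assumptions: the function name list_pdfs and its doc_dir parameter are illustrative, and the real function in BatchPdfConv.py is only partly visible in this thread.

```python
import os


def list_pdfs(doc_dir):
    """Return sorted PDF paths under doc_dir, creating the folder if needed."""
    # exist_ok=True makes this a no-op when the folder already exists.
    os.makedirs(doc_dir, exist_ok=True)

    pdfs = []
    for root, _dirs, files in os.walk(doc_dir):
        for name in files:
            # Match the extension case-insensitively (.pdf, .PDF, ...).
            if name.lower().endswith(".pdf"):
                pdfs.append(os.path.join(root, name))

    if not pdfs:
        print(f"No PDF files found in '{doc_dir}'. Add files there and run again.")
    return sorted(pdfs)
```

The caller can then decide what to do with an empty list, for example returning early without treating a freshly created folder as an error.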
