Enhance OCR script: dual output (Markdown & Word) and better batch handling #5

hmdqr · 2025-08-25T23:39:28Z

This pull request adds automatic conversion of PDF files to both Markdown and Word (.docx) formats, improves batch processing robustness, and updates the documentation to reflect these enhancements and new requirements. The changes also introduce better error handling, automatic directory creation, and a more robust processed files database.

Batch processing and conversion improvements:

The script now automatically converts each PDF to both Markdown and Word (.docx) formats using pypandoc, with error handling for the Word conversion step.
The script creates required directories (docs_import and output subfolders) if they do not exist, improving usability and reducing setup errors.
The processed files database logic is improved: the append_to_db function is replaced with update_db to allow updating or inserting records, ensuring accurate tracking of file statuses.

Code and formatting consistency:

Consistent use of double quotes for strings and minor formatting improvements throughout the script.

Documentation updates:

The README now documents the dual output (Markdown and Word), new dependencies (pypandoc and pandoc), improved error handling, and automatic directory creation. The project structure and troubleshooting sections are updated accordingly.

Copilot

Pull Request Overview

This pull request enhances the OCR script by adding dual output format support (Markdown and Word) and improving batch processing capabilities. The changes introduce automatic Word document generation using pypandoc, better error handling, and automatic directory creation.

Key changes:

Dual output format support: automatically generates both Markdown and Word (.docx) files from PDF inputs
Enhanced database management: replaced append_to_db with update_db for better record tracking
Improved robustness: automatic directory creation and better error handling for Word conversion

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File	Description
README.md	Updated documentation to reflect dual output support, new pypandoc dependency, and improved batch processing features
BatchPdfConv.py	Enhanced script with Word conversion capability, improved database operations, and automatic directory creation

Copilot · 2025-08-25T23:40:23Z

BatchPdfConv.py

+def update_db(record):
+    """Update or insert a processing record into the CSV database."""
+    records = []
+    found = False
+
+    if os.path.exists(DB_CSV):
+        with open(DB_CSV, newline="", encoding="utf-8") as csvfile:
+            reader = csv.DictReader(csvfile)
+            for row in reader:
+                if row["filename"] == record["filename"]:
+                    records.append(record)
+                    found = True
+                else:
+                    records.append(row)
+
+    if not found:
+        records.append(record)
+
+    with open(DB_CSV, "w", newline="", encoding="utf-8") as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=FIELDNAMES)
-        if not file_exists:
-            writer.writeheader()
-        writer.writerow(record)
+        writer.writeheader()
+        writer.writerows(records)




The update_db function rewrites the entire CSV file every time it's called, which could lead to data loss if the process is interrupted during file writing. Consider using atomic file operations or a more robust database solution for production use.
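One way to act on this comment while keeping the CSV file is to write the records to a temporary file in the same directory and then swap it into place with os.replace, so an interrupted run leaves the previous database untouched. The following is a minimal sketch under assumptions, not the script's actual code: write_records_atomically and the two-column FIELDNAMES are hypothetical stand-ins for the script's own update_db internals and constants.

```python
import csv
import os
import tempfile

# Hypothetical stand-in for the script's FIELDNAMES constant.
FIELDNAMES = ["filename", "status"]


def write_records_atomically(db_path, records, fieldnames=FIELDNAMES):
    """Write all records to a temp file, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(db_path))
    # The temp file lives next to the target so os.replace stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", newline="", encoding="utf-8") as tmp:
            writer = csv.DictWriter(tmp, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(records)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the data to disk before the swap
        os.replace(tmp_path, db_path)  # replaces the old file in one step
    except BaseException:
        # On any failure, keep the existing database and drop the partial file.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

In update_db, the final "open for writing and rewrite" step could call a helper like this instead. It still rewrites the whole file on every call, so it addresses the interruption risk but not the cost of repeated full rewrites.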

Copilot · 2025-08-25T23:40:24Z

BatchPdfConv.py

+    if os.path.exists(output_path):
+        try:
+            docx_path = os.path.join(
+                EXPORT_DIR, pdf_filename.rsplit(".", 1)[0] + ".docx"
+            )
+            pypandoc.convert_file(output_path, "docx", outputfile=docx_path)
+            print(f"Converted to Word: {docx_path}")
+        except Exception as e:
+            logging.error(f"Word conversion failed for {pdf_filename}: {e}")
+            print(f"Word conversion failed for {pdf_filename}: {e}")
+    else:
+        print(f"Markdown file not found, skipping Word conversion for {pdf_filename}")


The error handling for pypandoc conversion is too generic. Consider catching specific pypandoc exceptions (like RuntimeError for pandoc not found) to provide more helpful error messages to users.

Suggested change

     if os.path.exists(output_path):
         try:
             docx_path = os.path.join(
                 EXPORT_DIR, pdf_filename.rsplit(".", 1)[0] + ".docx"
             )
             pypandoc.convert_file(output_path, "docx", outputfile=docx_path)
             print(f"Converted to Word: {docx_path}")
+        except RuntimeError as e:
+            logging.error(f"Pandoc not found or failed for {pdf_filename}: {e}")
+            print(f"Word conversion failed for {pdf_filename}: {e}\nPlease ensure that Pandoc is installed and available in your PATH. See https://pandoc.org/installing.html")
         except Exception as e:
             logging.error(f"Word conversion failed for {pdf_filename}: {e}")
             print(f"Word conversion failed for {pdf_filename}: {e}")
     else:
         print(f"Markdown file not found, skipping Word conversion for {pdf_filename}")
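A caveat on the exception types: as far as I know, pypandoc reports a missing pandoc binary with OSError and a failed pandoc run with RuntimeError, so catching only RuntimeError may not cover the "pandoc not installed" case. Treat that as an assumption to verify against the installed pypandoc version. The sketch below separates the two cases; convert_markdown_to_docx and its convert parameter are hypothetical names, the latter added only so the helper can be exercised without pandoc present.

```python
import logging


def convert_markdown_to_docx(md_path, docx_path, convert=None):
    """Convert Markdown to .docx; return True on success, False on failure.

    `convert` is a hypothetical injection point; it defaults to
    pypandoc.convert_file.
    """
    if convert is None:
        import pypandoc  # imported lazily so the helper loads without it

        convert = pypandoc.convert_file
    try:
        convert(md_path, "docx", outputfile=docx_path)
    except OSError as e:
        # Assumption: a missing pandoc binary surfaces as OSError.
        logging.error("Pandoc not available for %s: %s", md_path, e)
        print(
            f"Word conversion failed for {md_path}: {e}\n"
            "Please ensure that Pandoc is installed and available in your PATH. "
            "See https://pandoc.org/installing.html"
        )
        return False
    except RuntimeError as e:
        # Assumption: pandoc ran but reported an error (bad format, bad input).
        logging.error("Word conversion failed for %s: %s", md_path, e)
        print(f"Word conversion failed for {md_path}: {e}")
        return False
    return True
```

Note that OSError also covers unrelated I/O problems such as an unwritable output path, so the install hint may occasionally be misleading; the original error text is printed alongside it for that reason.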

Copilot · 2025-08-25T23:40:24Z

BatchPdfConv.py

    """List all PDF files in the DOC_DIR folder and its subdirectories."""
    if not os.path.isdir(DOC_DIR):
-        print(f"Error: Directory '{DOC_DIR}' not found.")
+        os.makedirs(DOC_DIR)


Creating the directory and then immediately exiting prevents the user from adding files and running again in the same execution. Consider checking if the directory is empty instead of just checking if it exists, or provide a way to continue execution after directory creation.
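A possible shape for this, assuming the listing function is free to return an empty result: create the directory if needed, scan it either way, and let the caller decide what to do when nothing is found. find_pdfs is a hypothetical name, not the function in BatchPdfConv.py.

```python
import os


def find_pdfs(doc_dir):
    """Create doc_dir if missing and return the PDFs found beneath it."""
    # exist_ok=True makes this safe whether or not the directory exists.
    os.makedirs(doc_dir, exist_ok=True)
    pdfs = []
    for root, _dirs, files in os.walk(doc_dir):
        for name in sorted(files):
            if name.lower().endswith(".pdf"):
                pdfs.append(os.path.join(root, name))
    return pdfs
```

A caller can then print a message such as "No PDF files found; add files to the folder and run again" when the list is empty, which covers both a freshly created directory and an existing but empty one.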

hmdqr added 2 commits August 26, 2025 01:57

Refactor BatchPdfConv.py: Improve CSV handling, add Word conversion, and enhance error messages

db4720d

Update README.md: Enhance output formats, installation instructions, and usage details

c0f2aea

Copilot AI review requested due to automatic review settings August 25, 2025 23:39

Copilot AI reviewed Aug 25, 2025

Refactor BatchPdfConv.py and update README.md: Migrate from CSV to SQLite for processed files tracking

7b5ac1f
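The diff for this commit is not shown in the conversation, so the following is only a sketch of what SQLite-backed tracking could look like, not the code that was merged. The table name processed, its two columns, and the helper names are all assumptions. With the filename as primary key, INSERT OR REPLACE gives the same update-or-insert behaviour as the CSV version of update_db without rewriting the whole file.

```python
import sqlite3


def open_db(path):
    """Open (or create) the tracking database and ensure the table exists."""
    conn = sqlite3.connect(path)
    # Hypothetical schema: one row per PDF, keyed by filename.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed ("
        "filename TEXT PRIMARY KEY, status TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def update_db(conn, filename, status):
    """Insert a record, or overwrite the existing one for this filename."""
    with conn:  # commits on success, rolls back if the statement fails
        conn.execute(
            "INSERT OR REPLACE INTO processed (filename, status) VALUES (?, ?)",
            (filename, status),
        )
```

Each write is a single transaction, so an interrupted run cannot leave a half-written database in the way a full CSV rewrite can.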

Copilot AI mentioned this pull request Dec 20, 2025

Fix nixpacks deployment: add virtual environment for pip availability srab1980/Mistral-Arabic-OCR-test#3

Merged
