
Add converting pdf to markdown#2

Draft
23prime wants to merge 3 commits into main from feature/add-converting-pdf-to-markdown

Conversation

@23prime
Owner

@23prime 23prime commented Mar 7, 2026

Checklist

  • Status checks are passing
  • Target branch is main

Summary

Reason for change

Changes

Notes

Copilot AI review requested due to automatic review settings March 7, 2026 03:47
@23prime 23prime marked this pull request as draft March 7, 2026 03:47
@coderabbitai

coderabbitai bot commented Mar 7, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7661ca32-1d97-4a58-bbda-40935c615714

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


@23prime
Owner Author

23prime commented Mar 7, 2026

Docling has a higher conversion accuracy, which calls into question the usefulness of this skill.

Copilot AI left a comment

Pull request overview

This PR adds a new skill called "converting-pdf-to-markdown" to the agent skills toolkit. It provides a multi-step workflow for converting PDF files to Markdown using Microsoft's markitdown library, with pymupdf for TOC extraction and heading injection.

Changes:

  • Adds four Python scripts: extract_toc.py (TOC/page count extraction), split_pdf.py (split large PDFs by TOC headings), convert_pdf.py (PDF-to-Markdown conversion with heading injection), and clean_linebreaks.py (post-processing to remove unnecessary line breaks).
  • Adds SKILL.md and SKILL.ja.md documenting a 5-step workflow (check TOC → decide split → convert → clean linebreaks → review).

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| skills/converting-pdf-to-markdown/SKILL.md | English documentation for the skill workflow and script usage |
| skills/converting-pdf-to-markdown/SKILL.ja.md | Japanese translation of the skill documentation |
| skills/converting-pdf-to-markdown/scripts/extract_toc.py | Script to extract TOC entries and page count from a PDF as JSON |
| skills/converting-pdf-to-markdown/scripts/split_pdf.py | Script to split a PDF into multiple files based on TOC headings |
| skills/converting-pdf-to-markdown/scripts/convert_pdf.py | Core conversion script using markitdown with TOC-based heading injection |
| skills/converting-pdf-to-markdown/scripts/clean_linebreaks.py | Post-processing script to join continuation lines in converted Markdown |
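
To make the last row concrete, here is a minimal sketch of what "joining continuation lines" could look like. It is not the code from clean_linebreaks.py, which is not shown in this conversation; the function name and the heuristic (join a line onto the previous one unless the previous line ends a sentence or either line looks like a Markdown block element) are assumptions for illustration only.

```python
def join_continuation_lines(text: str) -> str:
    """Join lines that look like hard-wrapped continuations of a paragraph.

    Hypothetical heuristic: a non-empty line is appended to the previous one
    when the previous line is ordinary prose that does not end a sentence and
    the current line does not start a Markdown block (heading, list, table,
    quote). Indentation is discarded, so this is unsuitable for code blocks.
    """
    out: list[str] = []
    for line in text.splitlines():
        stripped = line.strip()
        prev = out[-1] if out else ""
        starts_block = stripped.startswith(("#", "-", "*", "|", ">"))
        prev_open = (
            bool(prev)
            and not prev.startswith(("#", "|"))
            and not prev.endswith((".", ":", "!", "?"))
        )
        if stripped and prev_open and not starts_block:
            out[-1] = prev + " " + stripped
        else:
            out.append(stripped)
    return "\n".join(out)


sample = "# Title\nThis sentence was wrapped\nby the PDF layout.\n\nNext paragraph."
print(join_continuation_lines(sample))
```

The heading, the blank line, and the final paragraph are left alone; only the wrapped sentence is merged into one line.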

if stripped[li].isspace():
    li += 1
    continue
if stripped[li] == title[ti] if ti < len(title) else False:

Copilot AI Mar 7, 2026

Bug: The character-by-character matching uses title[ti] but ti is an index into normalized_title (which has whitespace removed). When title contains spaces, the index ti will point to the wrong character in title. For example, with title = "Hello World" and normalized_title = "HelloWorld", at ti=5, normalized_title[5]='W' but title[5]=' ', causing the match to fail for any title with spaces.

Additionally, the conditional expression stripped[li] == title[ti] if ti < len(title) else False is hard to read. It is not a precedence bug: a conditional expression binds more loosely than ==, so Python already parses it as (stripped[li] == title[ti]) if ti < len(title) else False, and the bounds check is evaluated before the indexing. Explicit parentheses only make that grouping visible.

The fix should compare against normalized_title[ti] instead of title[ti], bound ti by len(normalized_title), and add parentheses around the comparison for readability.
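
How Python groups a comparison inside a conditional expression can be checked directly rather than assumed. The sketch below uses the standard ast module on a stand-in expression; the names a, b and c are placeholders, not identifiers from the PR.

```python
import ast

# Stand-in names; only the shape of the expression matters here.
expr = "a == b if c else False"
node = ast.parse(expr, mode="eval").body

print(type(node).__name__)       # IfExp: the conditional is the outermost node
print(type(node.body).__name__)  # Compare: `a == b` is the "true" branch

# Because the condition is evaluated first, an out-of-range index is never read.
title, ti = "abc", 3
result = "x" == title[ti] if ti < len(title) else False
print(result)  # False, with no IndexError
```

So the comparison as a whole is the value of the "true" branch, and adding parentheses around it changes readability, not behaviour.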

Suggested change
- if stripped[li] == title[ti] if ti < len(title) else False:
+ if (stripped[li] == normalized_title[ti]) if ti < len(normalized_title) else False:
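
The index mismatch is easy to reproduce in isolation. The following is a simplified sketch, not the loop from convert_pdf.py: the helper name matches_ignoring_whitespace is hypothetical, and it keeps a single index ti that only ever refers to the whitespace-free title.

```python
def matches_ignoring_whitespace(line: str, title: str) -> bool:
    """Return True if `line` starts with `title`, ignoring all whitespace."""
    normalized_title = "".join(ch for ch in title if not ch.isspace())
    ti = 0  # index into normalized_title, never into title
    for ch in line:
        if ti == len(normalized_title):
            break  # whole title consumed
        if ch.isspace():
            continue  # whitespace in the line is skipped, as in the original loop
        if ch != normalized_title[ti]:
            return False
        ti += 1
    return ti == len(normalized_title)


title = "Hello World"
normalized = "".join(title.split())
# The two strings diverge at the first space, which is why indexing `title`
# with an index meant for `normalized` picks the wrong character.
print(repr(title[5]), repr(normalized[5]))
print(matches_ignoring_whitespace("Hello  World and more", title))
print(matches_ignoring_whitespace("Hello Word", title))
```

At index 5 the original title holds a space while the normalized title holds "W", which is exactly the case described in the comment above.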

Copilot uses AI. Check for mistakes.
Comment on lines +45 to +64
def _find_toc_title_in_text(text: str, title: str, search_start: int = 0) -> int:
    """Find the position of a TOC title in the converted text.

    Tries exact match first, then a normalized match (ignoring whitespace
    differences).
    """
    # Exact match
    pos = text.find(title, search_start)
    if pos != -1:
        return pos

    # Normalized match: collapse whitespace in both
    pattern = r"\s*".join(re.escape(ch) for ch in title if not ch.isspace())
    match = re.search(pattern, text[search_start:])
    if match:
        return search_start + match.start()

    return -1
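
For readers who want to see what the helper quoted above does before deciding whether to keep it, here is a standalone copy with the re import added. The function is renamed without the leading underscore and the sample text is made up; both are assumptions for illustration.

```python
import re


def find_toc_title_in_text(text: str, title: str, search_start: int = 0) -> int:
    """Return the index of `title` in `text`, tolerating whitespace differences."""
    # Exact match first.
    pos = text.find(title, search_start)
    if pos != -1:
        return pos

    # Fall back to a regex that allows any whitespace between title characters.
    pattern = r"\s*".join(re.escape(ch) for ch in title if not ch.isspace())
    match = re.search(pattern, text[search_start:])
    if match:
        return search_start + match.start()
    return -1


# A title broken across two lines, as PDF text extraction often produces.
text = "Preface\n1. Intro\nduction\nBody text"
print(find_toc_title_in_text(text, "1. Introduction"))
print(find_toc_title_in_text(text, "Appendix"))
```

The first lookup falls through to the regex branch because the exact string never occurs; the second finds nothing and returns -1.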


Copilot AI Mar 7, 2026

The function _find_toc_title_in_text is defined but never called anywhere in this file or elsewhere in the codebase. This is dead code that should be removed unless it is intended for future use (in which case it should be documented as such).

Suggested change
Delete the whole _find_toc_title_in_text function (the lines +45 to +64 quoted above); the suggestion contains no replacement lines.

part.insert_pdf(doc, from_page=start_idx, to_page=end_page)

# Re-map and embed TOC entries that fall within this page range
page_offset = start_idx # original 0-indexed start

Copilot AI Mar 7, 2026

The variable page_offset is assigned here but never used anywhere in the function. It should be removed to avoid confusion.

Suggested change
- page_offset = start_idx # original 0-indexed start

total_pages = doc.page_count

for i, (title, start_page) in enumerate(entries):
    # Determine end page (exclusive, 0-indexed)

Copilot AI Mar 7, 2026

The comment says "Determine end page (exclusive, 0-indexed)", but end_page is actually used as an inclusive 0-indexed value: insert_pdf(doc, from_page=start_idx, to_page=end_page) treats to_page as the last page copied. The comment should say "inclusive, 0-indexed" to match the actual semantics.

Suggested change
- # Determine end page (exclusive, 0-indexed)
+ # Determine end page (inclusive, 0-indexed)
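
The inclusive convention is easiest to get right when the range arithmetic is written out. Below is a pure-Python sketch of that arithmetic, not the code from split_pdf.py: the helper name page_ranges is hypothetical, and it assumes each entry carries a 1-indexed start page, as PyMuPDF's get_toc() reports them.

```python
def page_ranges(entries, total_pages):
    """Map (title, 1-indexed start page) entries to 0-indexed inclusive ranges."""
    ranges = []
    for i, (title, start_page) in enumerate(entries):
        start_idx = start_page - 1  # 0-indexed first page of this section
        if i + 1 < len(entries):
            # Last page of this section is the page before the next section's
            # first page; in 0-indexed terms that is next_start_page - 2.
            end_page = entries[i + 1][1] - 2
        else:
            end_page = total_pages - 1  # last page of the document
        # Two headings on the same page would otherwise give an empty range.
        end_page = max(end_page, start_idx)
        ranges.append((title, start_idx, end_page))
    return ranges


entries = [("Intro", 1), ("Methods", 4), ("Results", 9)]
print(page_ranges(entries, 12))
```

Each (start_idx, end_page) pair can be passed straight to from_page and to_page, since both bounds are pages that should be copied.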

@@ -0,0 +1,185 @@
#!/usr/bin/env python3

Copilot AI Mar 7, 2026

This script is missing the PEP 723 inline metadata block (# /// script / # ///) that the other three scripts in this directory all include, and that the SKILL.md documentation references when it states "Scripts declare their own dependencies via PEP 723 inline metadata." While this script only uses stdlib modules, adding the metadata block with requires-python would be consistent with the other scripts (see extract_toc.py, split_pdf.py, convert_pdf.py).

Suggested change
- #!/usr/bin/env python3
+ #!/usr/bin/env python3
+ # /// script
+ # requires-python = ">=3.10"
+ # ///
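
To illustrate what a tool sees in such a block, the sketch below pulls the metadata out of a script's source text. The regular expression is adapted from the reference one given in PEP 723; the function name read_script_metadata and the example script are assumptions. It returns the raw TOML text and leaves parsing aside (on Python 3.11+ that text could be handed to tomllib.loads).

```python
import re
from typing import Optional

# Pattern adapted from the reference regular expression in PEP 723.
PEP723_BLOCK = re.compile(
    r"(?m)^# /// (?P<type>[a-zA-Z0-9-]+)$\s(?P<content>(^#(| .*)$\s)+)^# ///$"
)


def read_script_metadata(source: str) -> Optional[str]:
    """Return the TOML text of the `script` metadata block, or None if absent."""
    for match in PEP723_BLOCK.finditer(source):
        if match.group("type") == "script":
            lines = match.group("content").splitlines(keepends=True)
            # Drop the leading "# " (or a bare "#") from every content line.
            return "".join(
                line[2:] if line.startswith("# ") else line[1:] for line in lines
            )
    return None


example = '''#!/usr/bin/env python3
# /// script
# requires-python = ">=3.10"
# ///
print("hello")
'''
print(read_script_metadata(example))
```

A script without the block yields None, which is the situation the comment above describes for the fourth script.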
