Add converting pdf to markdown by 23prime · Pull Request #2 · 23prime/agent-skills

23prime · 2026-03-07T03:47:24Z

Checklist

Status checks are passing
Target branch is main

Summary

Reason for change

Changes

Notes

coderabbitai · 2026-03-07T03:47:36Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7661ca32-1d97-4a58-bbda-40935c615714

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


23prime · 2026-03-07T03:49:06Z

Docling has a higher conversion accuracy, which calls into question the usefulness of this skill.

Copilot

Pull request overview

This PR adds a new skill called "converting-pdf-to-markdown" to the agent skills toolkit. It provides a multi-step workflow for converting PDF files to Markdown using Microsoft's markitdown library, with pymupdf for TOC extraction and heading injection.

Changes:

Adds four Python scripts: extract_toc.py (TOC/page count extraction), split_pdf.py (split large PDFs by TOC headings), convert_pdf.py (PDF-to-Markdown conversion with heading injection), and clean_linebreaks.py (post-processing to remove unnecessary line breaks).
Adds SKILL.md and SKILL.ja.md documenting a 5-step workflow (check TOC → decide split → convert → clean linebreaks → review).

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

File	Description
skills/converting-pdf-to-markdown/SKILL.md	English documentation for the skill workflow and script usage
skills/converting-pdf-to-markdown/SKILL.ja.md	Japanese translation of the skill documentation
skills/converting-pdf-to-markdown/scripts/extract_toc.py	Script to extract TOC entries and page count from a PDF as JSON
skills/converting-pdf-to-markdown/scripts/split_pdf.py	Script to split a PDF into multiple files based on TOC headings
skills/converting-pdf-to-markdown/scripts/convert_pdf.py	Core conversion script using markitdown with TOC-based heading injection
skills/converting-pdf-to-markdown/scripts/clean_linebreaks.py	Post-processing script to join continuation lines in converted Markdown

Copilot · 2026-03-07T03:52:02Z

skills/converting-pdf-to-markdown/scripts/convert_pdf.py

+                    if stripped[li].isspace():
+                        li += 1
+                        continue
+                    if stripped[li] == title[ti] if ti < len(title) else False:


Bug: The character-by-character matching uses title[ti] but ti is an index into normalized_title (which has whitespace removed). When title contains spaces, the index ti will point to the wrong character in title. For example, with title = "Hello World" and normalized_title = "HelloWorld", at ti=5, normalized_title[5]='W' but title[5]=' ', causing the match to fail for any title with spaces.

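Additionally, the ternary expression stripped[li] == title[ti] if ti < len(title) else False is hard to read. A conditional expression binds more loosely than ==, so Python already parses it as (stripped[li] == title[ti]) if ti < len(title) else False, and the bounds check does run before the index is evaluated. Explicit parentheses, or the simpler ti < len(title) and stripped[li] == title[ti], would make that intent obvious.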

The fix should compare against normalized_title[ti] instead of title[ti], and add parentheses around the comparison to clarify precedence.

Suggested change

-                    if stripped[li] == title[ti] if ti < len(title) else False:
+                    if (stripped[li] == normalized_title[ti]) if ti < len(normalized_title) else False:
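
For illustration, the corrected matching can be sketched as a standalone function. This is a minimal sketch, not the code in convert_pdf.py: the function name match_title_prefix and its return convention are hypothetical, and only the variable names stripped, title, normalized_title, li and ti come from the snippet under review.

```python
def match_title_prefix(stripped: str, title: str) -> int:
    """Return the index in `stripped` just past a whitespace-insensitive
    match of `title` at the start of the line, or -1 if it does not match.

    Hypothetical helper: the real loop in convert_pdf.py is only partly
    visible in the review, so the surrounding logic here is an assumption.
    """
    normalized_title = "".join(ch for ch in title if not ch.isspace())
    li = 0  # index into the line being scanned
    ti = 0  # index into normalized_title, never into title
    while li < len(stripped) and ti < len(normalized_title):
        if stripped[li].isspace():
            li += 1  # whitespace in the line is ignored
            continue
        if stripped[li] != normalized_title[ti]:
            return -1
        li += 1
        ti += 1
    # A match requires every non-whitespace title character to be consumed.
    return li if ti == len(normalized_title) else -1
```

Because ti only ever indexes normalized_title, a title such as "Hello World" matches a line that breaks the words differently, for example "Hel lo  World".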

Copilot · 2026-03-07T03:52:03Z

skills/converting-pdf-to-markdown/scripts/convert_pdf.py

+def _find_toc_title_in_text(text: str, title: str, search_start: int = 0) -> int:
+    """Find the position of a TOC title in the converted text.
+
+    Tries exact match first, then a normalized match (ignoring whitespace
+    differences).
+    """
+    # Exact match
+    pos = text.find(title, search_start)
+    if pos != -1:
+        return pos
+
+    # Normalized match: collapse whitespace in both
+    pattern = r"\s*".join(re.escape(ch) for ch in title if not ch.isspace())
+    match = re.search(pattern, text[search_start:])
+    if match:
+        return search_start + match.start()
+
+    return -1
+
+


The function _find_toc_title_in_text is defined but never called anywhere in this file or elsewhere in the codebase. This is dead code that should be removed unless it is intended for future use (in which case it should be documented as such).

Suggested change

-def _find_toc_title_in_text(text: str, title: str, search_start: int = 0) -> int:
-    """Find the position of a TOC title in the converted text.
-
-    Tries exact match first, then a normalized match (ignoring whitespace
-    differences).
-    """
-    # Exact match
-    pos = text.find(title, search_start)
-    if pos != -1:
-        return pos
-
-    # Normalized match: collapse whitespace in both
-    pattern = r"\s*".join(re.escape(ch) for ch in title if not ch.isspace())
-    match = re.search(pattern, text[search_start:])
-    if match:
-        return search_start + match.start()
-
-    return -1

Copilot · 2026-03-07T03:52:03Z

skills/converting-pdf-to-markdown/scripts/split_pdf.py

+        part.insert_pdf(doc, from_page=start_idx, to_page=end_page)
+
+        # Re-map and embed TOC entries that fall within this page range
+        page_offset = start_idx  # original 0-indexed start


The variable page_offset is assigned here but never used anywhere in the function. It should be removed to avoid confusion.

Suggested change

-        page_offset = start_idx  # original 0-indexed start

Copilot · 2026-03-07T03:52:03Z

skills/converting-pdf-to-markdown/scripts/split_pdf.py

+    total_pages = doc.page_count
+
+    for i, (title, start_page) in enumerate(entries):
+        # Determine end page (exclusive, 0-indexed)


The comment says "Determine end page (exclusive, 0-indexed)" but end_page is actually used as an inclusive 0-indexed value — insert_pdf(doc, from_page=start_idx, to_page=end_page) treats to_page as inclusive. The comment should say "inclusive, 0-indexed" to match the actual semantics.

Suggested change

-        # Determine end page (exclusive, 0-indexed)
+        # Determine end page (inclusive, 0-indexed)
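
To make the inclusive convention concrete, here is a minimal sketch of how the page ranges could be computed before they are handed to insert_pdf. It assumes the TOC entries carry 1-based page numbers and that to_page is inclusive, as the comment above states. The function name page_ranges and the same-page clamp are hypothetical and not taken from split_pdf.py.

```python
def page_ranges(
    entries: list[tuple[str, int]], total_pages: int
) -> list[tuple[str, int, int]]:
    """Map (title, 1-based start page) TOC entries to
    (title, from_page, to_page), both bounds 0-indexed and inclusive."""
    ranges = []
    for i, (title, start_page) in enumerate(entries):
        start_idx = start_page - 1  # convert the 1-based TOC page to 0-indexed
        if i + 1 < len(entries):
            # Next section's 1-based start: minus 1 for 0-indexing, minus 1
            # again so this section ends on the page before it.
            end_page = entries[i + 1][1] - 2
        else:
            end_page = total_pages - 1  # last page of the document
        # Assumption: two headings on the same page share that page.
        end_page = max(end_page, start_idx)
        ranges.append((title, start_idx, end_page))
    return ranges
```

With an exclusive end the last section would need to_page=total_pages, which is one past the final valid index, so naming the convention in the comment matters.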

Copilot · 2026-03-07T03:52:03Z

skills/converting-pdf-to-markdown/scripts/clean_linebreaks.py

@@ -0,0 +1,185 @@
+#!/usr/bin/env python3


This script is missing the PEP 723 inline metadata block (# /// script / # ///) that the other three scripts in this directory all include, and that the SKILL.md documentation references when it states "Scripts declare their own dependencies via PEP 723 inline metadata." While this script only uses stdlib modules, adding the metadata block with requires-python would be consistent with the other scripts (see extract_toc.py, split_pdf.py, convert_pdf.py).

Suggested change

-#!/usr/bin/env python3
+#!/usr/bin/env python3
+# /// script
+# requires-python = ">=3.10"
+# ///
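
As a rough illustration of what the metadata block contributes, the sketch below pulls the TOML text out of a script header. It is a simplified line scan under stated assumptions, not the reference regular expression from PEP 723, and it stops at the first closing marker. The names SCRIPT_WITH_METADATA and read_script_block are hypothetical.

```python
from typing import Optional

# Sample script text with the header proposed in the suggestion above.
SCRIPT_WITH_METADATA = '''\
#!/usr/bin/env python3
# /// script
# requires-python = ">=3.10"
# ///
print("hello")
'''


def read_script_block(source: str) -> Optional[str]:
    """Return the TOML text of the first `# /// script` block, or None."""
    lines = source.splitlines()
    try:
        start = lines.index("# /// script")
    except ValueError:
        return None  # no metadata block at all
    body = []
    for line in lines[start + 1:]:
        if line == "# ///":
            return "\n".join(body)  # closing marker reached
        if line == "#":
            body.append("")  # a bare "#" stands for an empty TOML line
        elif line.startswith("# "):
            body.append(line[2:])  # strip the "# " comment prefix
        else:
            return None  # a non-comment line before the closing marker
    return None  # the block was never closed
```

A stdlib-only script such as clean_linebreaks.py would yield just the requires-python line, which is enough for a runner to pick a compatible interpreter.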

23prime added 2 commits March 5, 2026 10:36

skills: add converting-pdf-to-markdown skill

d25bfd1

skills: add resolving-pr-conversations skill

ceaf9a2

Copilot AI review requested due to automatic review settings March 7, 2026 03:47

23prime marked this pull request as draft March 7, 2026 03:47

Merge branch 'main' into feature/add-converting-pdf-to-markdown

335b5dd

Copilot started reviewing on behalf of 23prime March 7, 2026 03:47

Copilot AI reviewed Mar 7, 2026

23prime wants to merge 3 commits into main from feature/add-converting-pdf-to-markdown


2 participants
