Conversation
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Docling has a higher conversion accuracy, which calls into question the usefulness of this skill. |
There was a problem hiding this comment.
Pull request overview
This PR adds a new skill called "converting-pdf-to-markdown" to the agent skills toolkit. It provides a multi-step workflow for converting PDF files to Markdown using Microsoft's markitdown library, with pymupdf for TOC extraction and heading injection.
Changes:
- Adds four Python scripts:
extract_toc.py(TOC/page count extraction),split_pdf.py(split large PDFs by TOC headings),convert_pdf.py(PDF-to-Markdown conversion with heading injection), andclean_linebreaks.py(post-processing to remove unnecessary line breaks). - Adds
SKILL.mdandSKILL.ja.mddocumenting a 5-step workflow (check TOC → decide split → convert → clean linebreaks → review).
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| skills/converting-pdf-to-markdown/SKILL.md | English documentation for the skill workflow and script usage |
| skills/converting-pdf-to-markdown/SKILL.ja.md | Japanese translation of the skill documentation |
| skills/converting-pdf-to-markdown/scripts/extract_toc.py | Script to extract TOC entries and page count from a PDF as JSON |
| skills/converting-pdf-to-markdown/scripts/split_pdf.py | Script to split a PDF into multiple files based on TOC headings |
| skills/converting-pdf-to-markdown/scripts/convert_pdf.py | Core conversion script using markitdown with TOC-based heading injection |
| skills/converting-pdf-to-markdown/scripts/clean_linebreaks.py | Post-processing script to join continuation lines in converted Markdown |
| if stripped[li].isspace(): | ||
| li += 1 | ||
| continue | ||
| if stripped[li] == title[ti] if ti < len(title) else False: |
There was a problem hiding this comment.
Bug: The character-by-character matching uses title[ti] but ti is an index into normalized_title (which has whitespace removed). When title contains spaces, the index ti will point to the wrong character in title. For example, with title = "Hello World" and normalized_title = "HelloWorld", at ti=5, normalized_title[5]='W' but title[5]=' ', causing the match to fail for any title with spaces.
Additionally, the ternary expression stripped[li] == title[ti] if ti < len(title) else False has a precedence issue — Python parses it as stripped[li] == (title[ti] if ti < len(title) else False) rather than the likely intended (stripped[li] == title[ti]) if ti < len(title) else False.
The fix should compare against normalized_title[ti] instead of title[ti], and add parentheses around the comparison to clarify precedence.
| if stripped[li] == title[ti] if ti < len(title) else False: | |
| if (stripped[li] == normalized_title[ti]) if ti < len(normalized_title) else False: |
| def _find_toc_title_in_text(text: str, title: str, search_start: int = 0) -> int: | ||
| """Find the position of a TOC title in the converted text. | ||
|
|
||
| Tries exact match first, then a normalized match (ignoring whitespace | ||
| differences). | ||
| """ | ||
| # Exact match | ||
| pos = text.find(title, search_start) | ||
| if pos != -1: | ||
| return pos | ||
|
|
||
| # Normalized match: collapse whitespace in both | ||
| pattern = r"\s*".join(re.escape(ch) for ch in title if not ch.isspace()) | ||
| match = re.search(pattern, text[search_start:]) | ||
| if match: | ||
| return search_start + match.start() | ||
|
|
||
| return -1 | ||
|
|
||
|
|
There was a problem hiding this comment.
The function _find_toc_title_in_text is defined but never called anywhere in this file or elsewhere in the codebase. This is dead code that should be removed unless it is intended for future use (in which case it should be documented as such).
| def _find_toc_title_in_text(text: str, title: str, search_start: int = 0) -> int: | |
| """Find the position of a TOC title in the converted text. | |
| Tries exact match first, then a normalized match (ignoring whitespace | |
| differences). | |
| """ | |
| # Exact match | |
| pos = text.find(title, search_start) | |
| if pos != -1: | |
| return pos | |
| # Normalized match: collapse whitespace in both | |
| pattern = r"\s*".join(re.escape(ch) for ch in title if not ch.isspace()) | |
| match = re.search(pattern, text[search_start:]) | |
| if match: | |
| return search_start + match.start() | |
| return -1 |
| part.insert_pdf(doc, from_page=start_idx, to_page=end_page) | ||
|
|
||
| # Re-map and embed TOC entries that fall within this page range | ||
| page_offset = start_idx # original 0-indexed start |
There was a problem hiding this comment.
The variable page_offset is assigned here but never used anywhere in the function. It should be removed to avoid confusion.
| page_offset = start_idx # original 0-indexed start |
| total_pages = doc.page_count | ||
|
|
||
| for i, (title, start_page) in enumerate(entries): | ||
| # Determine end page (exclusive, 0-indexed) |
There was a problem hiding this comment.
The comment says "Determine end page (exclusive, 0-indexed)" but end_page is actually used as an inclusive 0-indexed value — insert_pdf(doc, from_page=start_idx, to_page=end_page) treats to_page as inclusive. The comment should say "inclusive, 0-indexed" to match the actual semantics.
| # Determine end page (exclusive, 0-indexed) | |
| # Determine end page (inclusive, 0-indexed) |
| @@ -0,0 +1,185 @@ | |||
| #!/usr/bin/env python3 | |||
There was a problem hiding this comment.
This script is missing the PEP 723 inline metadata block (# /// script / # ///) that the other three scripts in this directory all include, and that the SKILL.md documentation references when it states "Scripts declare their own dependencies via PEP 723 inline metadata." While this script only uses stdlib modules, adding the metadata block with requires-python would be consistent with the other scripts (see extract_toc.py, split_pdf.py, convert_pdf.py).
| #!/usr/bin/env python3 | |
| #!/usr/bin/env python3 | |
| # /// script | |
| # requires-python = ">=3.10" | |
| # /// |
Checklist
mainSummary
Reason for change
Changes
Notes