Add converting pdf to markdown #2
Draft: 23prime wants to merge 3 commits into main from feature/add-converting-pdf-to-markdown
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,71 @@ | ||
| --- | ||
| name: converting-pdf-to-markdown | ||
| description: Convert PDF files to Markdown using markitdown. Use when you want to convert a PDF to Markdown, extract text from a PDF, or turn a PDF document into readable Markdown. | ||
| translated_from: SKILL.md | ||
| --- | ||
| | ||
| # Converting PDF to Markdown | ||
| | ||
| Lightweight PDF → Markdown conversion using Microsoft's markitdown library. | ||
| | ||
| ## Prerequisites | ||
| | ||
| - [uv](https://docs.astral.sh/uv/) must be installed. | ||
| | ||
| The scripts declare their dependencies via PEP 723 inline metadata and are run with `uv run`, | ||
| so no manual package installation is needed. | ||
| | ||
| ## Workflow | ||
| | ||
| ### Step 1: Check page count and TOC | ||
| | ||
| Before converting, run the TOC extraction script: | ||
| | ||
| ```bash | ||
| uv run "${CLAUDE_CONFIG_DIR:-$HOME/.claude}/skills/converting-pdf-to-markdown/scripts/extract_toc.py" input.pdf | ||
| ``` | ||
| | ||
| The output is JSON containing `page_count` (number of pages) and `toc` (bookmark entries with level, title, and page). | ||
| | ||
| ### Step 2: Decide whether to split | ||
| | ||
| - **100 pages or fewer**: Convert the entire file. Go to Step 3. | ||
| - **Over 100 pages with TOC**: Present the TOC structure to the user and recommend splitting by top-level headings. After the user confirms, split with: | ||
| | ||
| ```bash | ||
| uv run "${CLAUDE_CONFIG_DIR:-$HOME/.claude}/skills/converting-pdf-to-markdown/scripts/split_pdf.py" input.pdf | ||
| uv run "${CLAUDE_CONFIG_DIR:-$HOME/.claude}/skills/converting-pdf-to-markdown/scripts/split_pdf.py" input.pdf -o output_dir/ | ||
| uv run "${CLAUDE_CONFIG_DIR:-$HOME/.claude}/skills/converting-pdf-to-markdown/scripts/split_pdf.py" input.pdf -l 2 # split on level-2 headings | ||
| ``` | ||
| | ||
| Numbered PDF files (e.g., `01-introduction.pdf`) are created in a subdirectory. Convert each part separately with Step 3. | ||
| | ||
| - **Over 100 pages without TOC**: Tell the user that there are no bookmarks for automatic splitting. Suggest converting as-is or ask them to specify page ranges manually. | ||
| | ||
| ### Step 3: Convert | ||
| | ||
| ```bash | ||
| uv run "${CLAUDE_CONFIG_DIR:-$HOME/.claude}/skills/converting-pdf-to-markdown/scripts/convert_pdf.py" input.pdf | ||
| uv run "${CLAUDE_CONFIG_DIR:-$HOME/.claude}/skills/converting-pdf-to-markdown/scripts/convert_pdf.py" input.pdf -o output.md | ||
| ``` | ||
| | ||
| The default output is `input.md` in the same directory. Use `-o` to specify a custom path. | ||
| | ||
| ### Step 4: Clean up line breaks | ||
| | ||
| PDF text picks up unnecessary line breaks where it wraps for display. Always run the cleanup script after conversion: | ||
| | ||
| ```bash | ||
| uv run "${CLAUDE_CONFIG_DIR:-$HOME/.claude}/skills/converting-pdf-to-markdown/scripts/clean_linebreaks.py" output.md | ||
| uv run "${CLAUDE_CONFIG_DIR:-$HOME/.claude}/skills/converting-pdf-to-markdown/scripts/clean_linebreaks.py" output.md -o cleaned.md | ||
| ``` | ||
| | ||
| If `-o` is omitted, the file is overwritten. The script joins wrapped lines while preserving headings, lists, code blocks, tables, and paragraph separators. | ||
| | ||
| ### Step 5: Review | ||
| | ||
| Check the output for: | ||
| | ||
| - Broken headings or formatting artifacts | ||
| - Unwanted page numbers or headers/footers | ||
| - Table formatting issues |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,70 @@ | ||
| --- | ||
| name: converting-pdf-to-markdown | ||
| description: Convert PDF files to Markdown using markitdown. Use when users want to convert a PDF to Markdown, extract text from PDFs, or transform PDF documents into readable Markdown format. | ||
| --- | ||
| | ||
| # Converting PDF to Markdown | ||
| | ||
| Lightweight PDF-to-Markdown conversion using Microsoft's markitdown library. | ||
| | ||
| ## Prerequisites | ||
| | ||
| - [uv](https://docs.astral.sh/uv/) must be installed. | ||
| | ||
| Scripts declare their own dependencies via PEP 723 inline metadata and are run with | ||
| `uv run`, so no manual package installation is needed. | ||
| | ||
| ## Workflow | ||
| | ||
| ### Step 1: Check page count and TOC | ||
| | ||
| Before converting, run the TOC extraction script: | ||
| | ||
| ```bash | ||
| uv run "${CLAUDE_CONFIG_DIR:-$HOME/.claude}/skills/converting-pdf-to-markdown/scripts/extract_toc.py" input.pdf | ||
| ``` | ||
| | ||
| Output is JSON with `page_count` and `toc` (bookmark entries with level, title, page). | ||
| | ||
| ### Step 2: Decide whether to split | ||
| | ||
| - **100 pages or fewer**: Convert the entire file. Go to Step 3. | ||
| - **Over 100 pages with TOC**: Present the TOC structure to the user and recommend splitting by top-level headings. After user confirmation, split with: | ||
| | ||
| ```bash | ||
| uv run "${CLAUDE_CONFIG_DIR:-$HOME/.claude}/skills/converting-pdf-to-markdown/scripts/split_pdf.py" input.pdf | ||
| uv run "${CLAUDE_CONFIG_DIR:-$HOME/.claude}/skills/converting-pdf-to-markdown/scripts/split_pdf.py" input.pdf -o output_dir/ | ||
| uv run "${CLAUDE_CONFIG_DIR:-$HOME/.claude}/skills/converting-pdf-to-markdown/scripts/split_pdf.py" input.pdf -l 2 # split on level-2 headings | ||
| ``` | ||
| | ||
| This creates numbered PDF files (e.g., `01-introduction.pdf`) in a subdirectory. Convert each part separately with Step 3. | ||
| | ||
| - **Over 100 pages without TOC**: Inform the user that there are no bookmarks for automatic splitting. Suggest converting as-is or ask for manual page ranges. | ||
| | ||
| ### Step 3: Convert | ||
| | ||
| ```bash | ||
| uv run "${CLAUDE_CONFIG_DIR:-$HOME/.claude}/skills/converting-pdf-to-markdown/scripts/convert_pdf.py" input.pdf | ||
| uv run "${CLAUDE_CONFIG_DIR:-$HOME/.claude}/skills/converting-pdf-to-markdown/scripts/convert_pdf.py" input.pdf -o output.md | ||
| ``` | ||
| | ||
| Default output is `input.md` in the same directory. Use `-o` to specify a custom path. | ||
| | ||
| ### Step 4: Clean up line breaks | ||
| | ||
| PDF text wraps at display boundaries, leaving unnecessary line breaks in the Markdown. Always run the cleanup script after conversion: | ||
| | ||
| ```bash | ||
| uv run "${CLAUDE_CONFIG_DIR:-$HOME/.claude}/skills/converting-pdf-to-markdown/scripts/clean_linebreaks.py" output.md | ||
| uv run "${CLAUDE_CONFIG_DIR:-$HOME/.claude}/skills/converting-pdf-to-markdown/scripts/clean_linebreaks.py" output.md -o cleaned.md | ||
| ``` | ||
| | ||
| Without `-o`, the file is overwritten in-place. The script joins continuation lines while preserving headings, lists, code blocks, tables, and paragraph separators. | ||
| | ||
| ### Step 5: Review | ||
| | ||
| Check the output for remaining issues: | ||
| | ||
| - Broken headings or formatting artifacts | ||
| - Unwanted page numbers or headers/footers | ||
| - Table formatting issues |
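
The Step 2 decision can be sketched in a few lines of Python. This is only an illustration, not part of the PR: the function name `decide_action` and its return labels are hypothetical, and the key names `page_count` and `toc` are taken from the Step 1 description of the `extract_toc.py` output.

```python
import json

PAGE_THRESHOLD = 100  # threshold stated in Step 2 of the workflow


def decide_action(toc_json: str) -> str:
    """Map extract_toc.py-style JSON to 'convert', 'split', or 'ask_user'.

    The function name and the returned labels are hypothetical; only the
    `page_count` and `toc` keys come from the SKILL.md description.
    """
    info = json.loads(toc_json)
    if info["page_count"] <= PAGE_THRESHOLD:
        return "convert"  # small enough to convert in one pass
    if info.get("toc"):
        return "split"  # bookmarks exist, so splitting by heading is possible
    return "ask_user"  # large file without bookmarks: ask for page ranges


sample = json.dumps(
    {"page_count": 240, "toc": [{"level": 1, "title": "Introduction", "page": 1}]}
)
print(decide_action(sample))
```

Even in the `split` case the workflow asks for user confirmation first, so a helper like this would only produce the recommendation.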
skills/converting-pdf-to-markdown/scripts/clean_linebreaks.py (185 changes: 185 additions & 0 deletions)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,185 @@ | ||
| #!/usr/bin/env python3 | ||
| """Remove unnecessary line breaks from converted Markdown. | ||
| | ||
| PDFs often insert line breaks for display purposes that don't represent | ||
| actual paragraph breaks. This script joins continuation lines while | ||
| preserving intentional structure (headings, lists, code blocks, blank lines). | ||
| | ||
| Handles two common patterns: | ||
| 1. Direct continuation: text line immediately followed by another text line. | ||
| 2. Single-blank-line continuation: PDF text where every line is separated | ||
|    by a blank line. Distinguishes mid-sentence breaks from true paragraph | ||
|    breaks by checking whether the preceding line ends mid-sentence. | ||
| """ | ||
| | ||
| import argparse | ||
| import re | ||
| import sys | ||
| from pathlib import Path | ||
| | ||
| | ||
| def is_structural_line(line: str) -> bool: | ||
|     """Check if a line is a Markdown structural element that should not be joined.""" | ||
|     stripped = line.strip() | ||
|     if not stripped: | ||
|         return True  # blank line (paragraph separator) | ||
|     if stripped.startswith("#"): | ||
|         return True  # heading | ||
|     if re.match(r"^[-*+]\s", stripped): | ||
|         return True  # unordered list item | ||
|     if re.match(r"^\d+\.\s", stripped): | ||
|         return True  # ordered list item | ||
|     if stripped.startswith(("```", "~~~")): | ||
|         return True  # code fence | ||
|     if stripped.startswith("|"): | ||
|         return True  # table row | ||
|     if stripped.startswith(">"): | ||
|         return True  # blockquote | ||
|     if re.match(r"^-{3,}$|^\*{3,}$|^_{3,}$", stripped): | ||
|         return True  # horizontal rule | ||
|     return False | ||
| | ||
| | ||
| def _is_page_number(line: str) -> bool: | ||
|     """Check if a line is just a page number.""" | ||
|     return bool(re.match(r"^\s*\d{1,4}\s*$", line)) | ||
| | ||
| | ||
| def _looks_like_continuation(prev: str, curr: str) -> bool: | ||
|     """Determine if curr is a continuation of prev (mid-sentence break). | ||
| | ||
|     Returns True when prev appears to end mid-sentence, meaning the blank | ||
|     line between them was just a PDF display artifact, not a paragraph break. | ||
|     """ | ||
|     prev_stripped = prev.rstrip() | ||
|     curr_stripped = curr.strip() | ||
| | ||
|     if not prev_stripped or not curr_stripped: | ||
|         return False | ||
| | ||
|     # Previous line ends with a sentence-ending punctuation → paragraph break | ||
|     if re.search(r"[。..!!??)\)」』】〉》\]:]$", prev_stripped): | ||
|         return False | ||
| | ||
|     # Previous line looks like a short heading/title (no sentence-ending punct, | ||
|     # relatively short, and not ending with a particle or connective) | ||
|     # e.g. "(1)本ガイドラインの目的", "①対象範囲", "第1章 概要" | ||
|     if len(prev_stripped) <= 60 and re.match( | ||
|         r"^[((①-⑳❶-❿第0-90-9]", prev_stripped | ||
|     ): | ||
|         return False | ||
| | ||
|     # Current line starts with a structural/new-paragraph marker | ||
|     if re.match(r"^[((「『【〈《①-⑳❶-❿※◆●▶■□▸◇★☆]", curr_stripped): | ||
|         return False | ||
| | ||
|     # Current line starts with a heading-like pattern (e.g. "第1章", "1.") | ||
|     if re.match(r"^(第[0-9一-九]+|[0-9]+[..]|[0-9]+[..])", curr_stripped): | ||
|         return False | ||
| | ||
|     # Current line looks like a figure/table caption | ||
|     if re.match(r"^(図表|図|表)\s*\d", curr_stripped): | ||
|         return False | ||
| | ||
|     # Otherwise, prev ended mid-sentence → join | ||
|     return True | ||
| | ||
| | ||
| def clean(text: str) -> str: | ||
|     lines = text.split("\n") | ||
|     result: list[str] = [] | ||
|     in_code_block = False | ||
|     i = 0 | ||
|     n = len(lines) | ||
| | ||
|     while i < n: | ||
|         line = lines[i] | ||
|         stripped = line.strip() | ||
| | ||
|         # Toggle code block state | ||
|         if stripped.startswith(("```", "~~~")): | ||
|             in_code_block = not in_code_block | ||
|             result.append(line) | ||
|             i += 1 | ||
|             continue | ||
| | ||
|         # Never modify lines inside code blocks | ||
|         if in_code_block: | ||
|             result.append(line) | ||
|             i += 1 | ||
|             continue | ||
| | ||
|         # Remove standalone page numbers | ||
|         if _is_page_number(line): | ||
|             i += 1 | ||
|             continue | ||
| | ||
|         # Blank line: check if it's a mid-sentence break or real paragraph break | ||
|         if not stripped: | ||
|             # Look ahead: blank line followed by a text line | ||
|             if ( | ||
|                 i + 1 < n | ||
|                 and lines[i + 1].strip() | ||
|                 and not is_structural_line(lines[i + 1]) | ||
|                 and not _is_page_number(lines[i + 1]) | ||
|                 and result | ||
|                 and result[-1].strip() | ||
|                 and not is_structural_line(result[-1]) | ||
|             ): | ||
|                 # Decide based on whether previous line ended mid-sentence | ||
|                 if _looks_like_continuation(result[-1], lines[i + 1]): | ||
|                     # Skip this blank line; the next iteration will join the text | ||
|                     i += 1 | ||
|                     continue | ||
| | ||
|             result.append(line) | ||
|             i += 1 | ||
|             continue | ||
| | ||
|         # Structural lines are kept as-is | ||
|         if is_structural_line(line): | ||
|             result.append(line) | ||
|             i += 1 | ||
|             continue | ||
| | ||
|         # Text line: join to previous if it's a continuation | ||
|         if result and result[-1].strip() and not is_structural_line(result[-1]): | ||
|             if _looks_like_continuation(result[-1], stripped): | ||
|                 result[-1] = result[-1].rstrip() + stripped | ||
|             else: | ||
|                 result.append(line) | ||
|         else: | ||
|             result.append(line) | ||
| | ||
|         i += 1 | ||
| | ||
|     return "\n".join(result) | ||
| | ||
| | ||
| def main() -> None: | ||
|     parser = argparse.ArgumentParser( | ||
|         description="Remove unnecessary line breaks from Markdown converted from PDF" | ||
|     ) | ||
|     parser.add_argument("file", help="Markdown file to clean") | ||
|     parser.add_argument("-o", "--output", help="Output file (default: overwrite input)") | ||
|     args = parser.parse_args() | ||
| | ||
|     path = Path(args.file) | ||
|     if not path.exists(): | ||
|         print(f"Error: File not found: {path}", file=sys.stderr) | ||
|         sys.exit(1) | ||
| | ||
|     text = path.read_text(encoding="utf-8") | ||
|     cleaned = clean(text) | ||
| | ||
|     out = Path(args.output) if args.output else path | ||
|     out.write_text(cleaned, encoding="utf-8") | ||
| | ||
|     if args.output: | ||
|         print(f"Cleaned: {path} -> {out}") | ||
|     else: | ||
|         print(f"Cleaned: {path} (in-place)") | ||
| | ||
| | ||
| if __name__ == "__main__": | ||
|     main() | ||
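
One behavior worth noting in `clean()`: continuation lines are joined with `result[-1].rstrip() + stripped`, with no separator. That suits Japanese text, which the heuristics above clearly target, but it would fuse words when English lines are joined. A minimal sketch of a script-aware join follows; the helper name `join_wrapped` is hypothetical and not part of this PR, and the ASCII test is a rough stand-in for "space-separated script".

```python
def join_wrapped(prev: str, curr: str) -> str:
    """Join two wrapped lines, inserting a space only between ASCII characters.

    Hypothetical helper: treats an ASCII character on both sides of the
    break as a sign that the text is space-separated (e.g. English).
    """
    left = prev.rstrip()
    right = curr.strip()
    if not left or not right:
        return left + right
    # CJK text carries no inter-word spaces, so join it directly.
    needs_space = left[-1].isascii() and right[0].isascii()
    return left + (" " if needs_space else "") + right


print(join_wrapped("PDF text wraps at", "display boundaries"))
print(join_wrapped("本ガイドラインは", "対象範囲を定める"))
```

Words hyphenated across a line break are not handled here and would need their own rule.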
This script is missing the PEP 723 inline metadata block (`# /// script` / `# ///`) that the other three scripts in this directory all include, and that the SKILL.md documentation references when it states "Scripts declare their own dependencies via PEP 723 inline metadata." While this script only uses stdlib modules, adding the metadata block with `requires-python` would be consistent with the other scripts (see `extract_toc.py`, `split_pdf.py`, `convert_pdf.py`).
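
For reference, a minimal sketch of such a block, placed near the top of `clean_linebreaks.py` after the shebang. The version bound is an assumption (the script uses `list[str]` annotations, which need Python 3.9 or later) and should be matched to whatever the other three scripts declare.

```python
# /// script
# requires-python = ">=3.10"  # assumed bound; align with the other scripts
# dependencies = []           # stdlib only, so nothing to install
# ///
```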