71 changes: 71 additions & 0 deletions skills/converting-pdf-to-markdown/SKILL.ja.md
@@ -0,0 +1,71 @@
---
name: converting-pdf-to-markdown
description: Convert PDF files to Markdown using markitdown. Use when you want to convert a PDF to Markdown, extract text from a PDF, or turn a PDF document into readable Markdown.
translated_from: SKILL.md
---

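# Converting PDF to Markdown

Lightweight PDF → Markdown conversion using Microsoft's markitdown library.

## Prerequisites

- [uv](https://docs.astral.sh/uv/) must be installed.

The scripts declare their dependencies via PEP 723 inline metadata and are run with `uv run`,
so no manual package installation is needed.

## Workflow

### Step 1: Check page count and TOC

Before converting, run the TOC extraction script: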

```bash
uv run "${CLAUDE_CONFIG_DIR:-$HOME/.claude}/skills/converting-pdf-to-markdown/scripts/extract_toc.py" input.pdf
```

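The output is JSON containing `page_count` (number of pages) and `toc` (bookmark entries with level, title, and page).

### Step 2: Decide whether to split

- **100 pages or fewer**: Convert the entire file. Go to Step 3.
- **Over 100 pages with TOC**: Present the TOC structure to the user and recommend splitting at top-level headings. After the user confirms, split with: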

```bash
uv run "${CLAUDE_CONFIG_DIR:-$HOME/.claude}/skills/converting-pdf-to-markdown/scripts/split_pdf.py" input.pdf
uv run "${CLAUDE_CONFIG_DIR:-$HOME/.claude}/skills/converting-pdf-to-markdown/scripts/split_pdf.py" input.pdf -o output_dir/
uv run "${CLAUDE_CONFIG_DIR:-$HOME/.claude}/skills/converting-pdf-to-markdown/scripts/split_pdf.py" input.pdf -l 2 # split on level-2 headings
```

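Numbered PDF files (e.g., `01-introduction.pdf`) are created in a subdirectory. Convert each part separately with Step 3.

- **Over 100 pages without TOC**: Tell the user that there are no bookmarks for automatic splitting. Suggest converting as-is or specifying page ranges manually.

### Step 3: Convert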

```bash
uv run "${CLAUDE_CONFIG_DIR:-$HOME/.claude}/skills/converting-pdf-to-markdown/scripts/convert_pdf.py" input.pdf
uv run "${CLAUDE_CONFIG_DIR:-$HOME/.claude}/skills/converting-pdf-to-markdown/scripts/convert_pdf.py" input.pdf -o output.md
```

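The default output is `input.md` in the same directory. Use `-o` to specify a custom path.

### Step 4: Clean up line breaks

Text extracted from a PDF picks up unnecessary line breaks where lines wrapped for display. Always run the cleanup script after conversion: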

```bash
uv run "${CLAUDE_CONFIG_DIR:-$HOME/.claude}/skills/converting-pdf-to-markdown/scripts/clean_linebreaks.py" output.md
uv run "${CLAUDE_CONFIG_DIR:-$HOME/.claude}/skills/converting-pdf-to-markdown/scripts/clean_linebreaks.py" output.md -o cleaned.md
```

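If `-o` is omitted, the file is overwritten. The script joins wrapped lines while preserving headings, lists, code blocks, tables, and paragraph breaks.

### Step 5: Review

Check the output for:

- Broken headings or formatting artifacts
- Unwanted page numbers or headers/footers
- Table formatting issues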
70 changes: 70 additions & 0 deletions skills/converting-pdf-to-markdown/SKILL.md
@@ -0,0 +1,70 @@
---
name: converting-pdf-to-markdown
description: Convert PDF files to Markdown using markitdown. Use when users want to convert a PDF to Markdown, extract text from PDFs, or transform PDF documents into readable Markdown format.
---

# Converting PDF to Markdown

Lightweight PDF-to-Markdown conversion using Microsoft's markitdown library.

## Prerequisites

- [uv](https://docs.astral.sh/uv/) must be installed.

Scripts declare their own dependencies via PEP 723 inline metadata and are run with
`uv run`, so no manual package installation is needed.
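
For orientation, a PEP 723 block is a TOML fragment inside specially marked comments near the top of a script. The sketch below shows the general shape of such a header and extracts the TOML text from it, using a pattern adapted from the reference implementation in PEP 723. The `markitdown[pdf]` dependency line and the `extract_metadata` helper are illustrative assumptions, not the actual contents of the scripts in this skill.

```python
import re
from typing import Optional

# Hypothetical script header; the dependency list is an assumption for illustration.
SCRIPT = '''\
#!/usr/bin/env python3
# /// script
# requires-python = ">=3.10"
# dependencies = ["markitdown[pdf]"]
# ///
print("hello")
'''

# Pattern adapted from the reference implementation in PEP 723.
BLOCK = re.compile(
    r"(?m)^# /// (?P<type>[a-zA-Z0-9-]+)$\s(?P<content>(^#(| .*)$\s)+)^# ///$"
)


def extract_metadata(script: str) -> Optional[str]:
    """Return the TOML text of the first `script` block, or None if absent."""
    for match in BLOCK.finditer(script):
        if match.group("type") == "script":
            # Strip the leading "# " (or bare "#") from every content line.
            return "".join(
                line[2:] if line.startswith("# ") else line[1:]
                for line in match.group("content").splitlines(keepends=True)
            )
    return None


if __name__ == "__main__":
    print(extract_metadata(SCRIPT))
```

A script that only uses the standard library can still carry a block with just `requires-python`, so that `uv run` selects a suitable interpreter.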

## Workflow

### Step 1: Check page count and TOC

Before converting, run the TOC extraction script:

```bash
uv run "${CLAUDE_CONFIG_DIR:-$HOME/.claude}/skills/converting-pdf-to-markdown/scripts/extract_toc.py" input.pdf
```

Output is JSON with `page_count` and `toc` (bookmark entries with level, title, page).

### Step 2: Decide whether to split

- **100 pages or fewer**: Convert the entire file. Go to Step 3.
- **Over 100 pages with TOC**: Present the TOC structure to the user and recommend splitting by top-level headings. After user confirmation, split with:

```bash
uv run "${CLAUDE_CONFIG_DIR:-$HOME/.claude}/skills/converting-pdf-to-markdown/scripts/split_pdf.py" input.pdf
uv run "${CLAUDE_CONFIG_DIR:-$HOME/.claude}/skills/converting-pdf-to-markdown/scripts/split_pdf.py" input.pdf -o output_dir/
uv run "${CLAUDE_CONFIG_DIR:-$HOME/.claude}/skills/converting-pdf-to-markdown/scripts/split_pdf.py" input.pdf -l 2 # split on level-2 headings
```

This creates numbered PDF files (e.g., `01-introduction.pdf`) in a subdirectory. Convert each part separately with Step 3.

- **Over 100 pages without TOC**: Inform the user that there are no bookmarks for automatic splitting. Suggest converting as-is or ask for manual page ranges.
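
The three branches above can be sketched as a small function over the JSON from Step 1. This is a minimal illustration, assuming the TOC entries are objects keyed `level`, `title`, and `page`; the `decide` helper and its return labels are hypothetical names, not part of the skill's scripts.

```python
import json

PAGE_THRESHOLD = 100  # threshold stated in the workflow


def decide(toc_json: str) -> str:
    """Map extract_toc.py-style JSON to one of the three workflow branches."""
    info = json.loads(toc_json)
    if info["page_count"] <= PAGE_THRESHOLD:
        return "convert-whole"  # go straight to Step 3
    if info["toc"]:
        return "offer-split"    # show the TOC and recommend splitting
    return "ask-user"           # no bookmarks: convert as-is or ask for page ranges


if __name__ == "__main__":
    # Sample payload; the entry key names are assumptions for illustration.
    sample = json.dumps(
        {
            "page_count": 240,
            "toc": [
                {"level": 1, "title": "Introduction", "page": 1},
                {"level": 1, "title": "Methods", "page": 35},
            ],
        }
    )
    print(decide(sample))  # offer-split
```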

### Step 3: Convert

```bash
uv run "${CLAUDE_CONFIG_DIR:-$HOME/.claude}/skills/converting-pdf-to-markdown/scripts/convert_pdf.py" input.pdf
uv run "${CLAUDE_CONFIG_DIR:-$HOME/.claude}/skills/converting-pdf-to-markdown/scripts/convert_pdf.py" input.pdf -o output.md
```

Default output is `input.md` in the same directory. Use `-o` to specify a custom path.
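
The default-path rule can be illustrated with `pathlib`. This is a sketch of the behavior described above, not the code in `convert_pdf.py`; the `resolve_output` helper is a hypothetical name.

```python
from pathlib import Path
from typing import Optional


def resolve_output(pdf_path: str, output: Optional[str] = None) -> Path:
    """Return the explicit -o path if given, else the input path with a .md suffix."""
    if output:
        return Path(output)
    # Same directory, same stem, only the extension changes.
    return Path(pdf_path).with_suffix(".md")


if __name__ == "__main__":
    print(resolve_output("docs/input.pdf"))
    print(resolve_output("docs/input.pdf", "out.md"))
```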

### Step 4: Clean up line breaks

PDF text wraps at display boundaries, leaving unnecessary line breaks in the Markdown. Always run the cleanup script after conversion:

```bash
uv run "${CLAUDE_CONFIG_DIR:-$HOME/.claude}/skills/converting-pdf-to-markdown/scripts/clean_linebreaks.py" output.md
uv run "${CLAUDE_CONFIG_DIR:-$HOME/.claude}/skills/converting-pdf-to-markdown/scripts/clean_linebreaks.py" output.md -o cleaned.md
```

Without `-o`, the file is overwritten in-place. The script joins continuation lines while preserving headings, lists, code blocks, tables, and paragraph separators.
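
To make the joining behavior concrete, here is a deliberately simplified sketch. It differs from `clean_linebreaks.py` in two labeled ways: it joins with a single space (an assumption suited to English text, whereas the script concatenates without a separator), and it skips the script's sentence-ending and page-number heuristics. `join_wrapped` and `STRUCTURAL` are hypothetical names.

```python
import re

# Line prefixes treated as structure in this simplified sketch.
STRUCTURAL = re.compile(r"^(#|[-*+]\s|\d+\.\s|\||>|```|~~~)")


def join_wrapped(text: str) -> str:
    """Join consecutive plain-text lines; leave blank and structural lines alone."""
    out = []
    in_code = False
    for line in text.split("\n"):
        stripped = line.strip()
        if stripped.startswith(("```", "~~~")):
            in_code = not in_code  # fences toggle code-block state
            out.append(line)
            continue
        prev = out[-1].strip() if out else ""
        plain = bool(stripped) and not STRUCTURAL.match(stripped)
        prev_plain = bool(prev) and not STRUCTURAL.match(prev)
        if not in_code and plain and prev_plain:
            # Assumption: a single space is the right joiner for English text.
            out[-1] = out[-1].rstrip() + " " + stripped
        else:
            out.append(line)
    return "\n".join(out)


if __name__ == "__main__":
    sample = "# Title\nThe quick brown\nfox jumps.\n\n- item one"
    print(join_wrapped(sample))
```

Blank lines are never removed here, so paragraph breaks always survive; the real script additionally drops blank lines that fall mid-sentence.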

### Step 5: Review

Check the output for remaining issues:

- Broken headings or formatting artifacts
- Unwanted page numbers or headers/footers
- Table formatting issues
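
Two of these checks lend themselves to a quick automated pass before reading the file by eye. The sketch below flags leftover standalone page numbers and table rows whose cell count differs from the first row of the table. It is an illustrative helper (`find_review_issues` is a hypothetical name), it does not handle escaped pipes inside cells, and it is no substitute for a manual review.

```python
import re


def find_review_issues(markdown: str) -> list:
    """Return (line_number, message) pairs for two easy-to-detect leftovers."""
    issues = []
    table_width = None  # cell count of the current table, if inside one
    for number, line in enumerate(markdown.split("\n"), start=1):
        stripped = line.strip()
        if re.fullmatch(r"\d{1,4}", stripped):
            issues.append((number, "standalone page number"))
        if stripped.startswith("|"):
            width = len(stripped.strip("|").split("|"))
            if table_width is None:
                table_width = width  # first row of the table sets the width
            elif width != table_width:
                issues.append((number, "table row width differs from header"))
        else:
            table_width = None  # any non-table line ends the current table
    return issues


if __name__ == "__main__":
    sample = "Intro\n12\n| a | b |\n|---|---|\n| 1 | 2 | 3 |\nend"
    for number, message in find_review_issues(sample):
        print(f"line {number}: {message}")
```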
185 changes: 185 additions & 0 deletions skills/converting-pdf-to-markdown/scripts/clean_linebreaks.py
@@ -0,0 +1,185 @@
#!/usr/bin/env python3
Copilot AI Mar 7, 2026

This script is missing the PEP 723 inline metadata block (# /// script / # ///) that the other three scripts in this directory all include, and that the SKILL.md documentation references when it states "Scripts declare their own dependencies via PEP 723 inline metadata." While this script only uses stdlib modules, adding the metadata block with requires-python would be consistent with the other scripts (see extract_toc.py, split_pdf.py, convert_pdf.py).

Suggested change
#!/usr/bin/env python3
#!/usr/bin/env python3
# /// script
# requires-python = ">=3.10"
# ///
"""Remove unnecessary line breaks from converted Markdown.

PDFs often insert line breaks for display purposes that don't represent
actual paragraph breaks. This script joins continuation lines while
preserving intentional structure (headings, lists, code blocks, blank lines).

Handles two common patterns:
1. Direct continuation: text line immediately followed by another text line.
2. Single-blank-line continuation: PDF text where every line is separated
by a blank line. Distinguishes mid-sentence breaks from true paragraph
breaks by checking whether the preceding line ends mid-sentence.
"""

import argparse
import re
import sys
from pathlib import Path


def is_structural_line(line: str) -> bool:
    """Check if a line is a Markdown structural element that should not be joined."""
    stripped = line.strip()
    if not stripped:
        return True  # blank line (paragraph separator)
    if stripped.startswith("#"):
        return True  # heading
    if re.match(r"^[-*+]\s", stripped):
        return True  # unordered list item
    if re.match(r"^\d+\.\s", stripped):
        return True  # ordered list item
    if stripped.startswith(("```", "~~~")):
        return True  # code fence
    if stripped.startswith("|"):
        return True  # table row
    if stripped.startswith(">"):
        return True  # blockquote
    if re.match(r"^-{3,}$|^\*{3,}$|^_{3,}$", stripped):
        return True  # horizontal rule
    return False


def _is_page_number(line: str) -> bool:
    """Check if a line is just a page number."""
    return bool(re.match(r"^\s*\d{1,4}\s*$", line))


def _looks_like_continuation(prev: str, curr: str) -> bool:
    """Determine if curr is a continuation of prev (mid-sentence break).

    Returns True when prev appears to end mid-sentence, meaning the blank
    line between them was just a PDF display artifact, not a paragraph break.
    """
    prev_stripped = prev.rstrip()
    curr_stripped = curr.strip()

    if not prev_stripped or not curr_stripped:
        return False

    # Previous line ends with a sentence-ending punctuation → paragraph break
    if re.search(r"[。..!!??)\)」』】〉》\]:]$", prev_stripped):
        return False

    # Previous line looks like a short heading/title (no sentence-ending punct,
    # relatively short, and not ending with a particle or connective)
    # e.g. "(1)本ガイドラインの目的", "①対象範囲", "第1章 概要"
    if len(prev_stripped) <= 60 and re.match(
        r"^[((①-⑳❶-❿第0-90-9]", prev_stripped
    ):
        return False

    # Current line starts with a structural/new-paragraph marker
    if re.match(r"^[((「『【〈《①-⑳❶-❿※◆●▶■□▸◇★☆]", curr_stripped):
        return False

    # Current line starts with a heading-like pattern (e.g. "第1章", "1.")
    if re.match(r"^(第[0-9一-九]+|[0-9]+[..]|[0-9]+[..])", curr_stripped):
        return False

    # Current line looks like a figure/table caption
    if re.match(r"^(図表|図|表)\s*\d", curr_stripped):
        return False

    # Otherwise, prev ended mid-sentence → join
    return True


def clean(text: str) -> str:
    lines = text.split("\n")
    result: list[str] = []
    in_code_block = False
    i = 0
    n = len(lines)

    while i < n:
        line = lines[i]
        stripped = line.strip()

        # Toggle code block state
        if stripped.startswith(("```", "~~~")):
            in_code_block = not in_code_block
            result.append(line)
            i += 1
            continue

        # Never modify lines inside code blocks
        if in_code_block:
            result.append(line)
            i += 1
            continue

        # Remove standalone page numbers
        if _is_page_number(line):
            i += 1
            continue

        # Blank line: check if it's a mid-sentence break or real paragraph break
        if not stripped:
            # Look ahead: blank line followed by a text line
            if (
                i + 1 < n
                and lines[i + 1].strip()
                and not is_structural_line(lines[i + 1])
                and not _is_page_number(lines[i + 1])
                and result
                and result[-1].strip()
                and not is_structural_line(result[-1])
            ):
                # Decide based on whether previous line ended mid-sentence
                if _looks_like_continuation(result[-1], lines[i + 1]):
                    # Skip this blank line; the next iteration will join the text
                    i += 1
                    continue

            result.append(line)
            i += 1
            continue

        # Structural lines are kept as-is
        if is_structural_line(line):
            result.append(line)
            i += 1
            continue

        # Text line: join to previous if it's a continuation
        if result and result[-1].strip() and not is_structural_line(result[-1]):
            if _looks_like_continuation(result[-1], stripped):
                result[-1] = result[-1].rstrip() + stripped
            else:
                result.append(line)
        else:
            result.append(line)

        i += 1

    return "\n".join(result)


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Remove unnecessary line breaks from Markdown converted from PDF"
    )
    parser.add_argument("file", help="Markdown file to clean")
    parser.add_argument("-o", "--output", help="Output file (default: overwrite input)")
    args = parser.parse_args()

    path = Path(args.file)
    if not path.exists():
        print(f"Error: File not found: {path}", file=sys.stderr)
        sys.exit(1)

    text = path.read_text(encoding="utf-8")
    cleaned = clean(text)

    out = Path(args.output) if args.output else path
    out.write_text(cleaned, encoding="utf-8")

    if args.output:
        print(f"Cleaned: {path} -> {out}")
    else:
        print(f"Cleaned: {path} (in-place)")


if __name__ == "__main__":
    main()