23prime · 23prime · Mar 5, 2026 · Mar 7, 2026 · Mar 7, 2026 · Copilot
@@ -0,0 +1,71 @@
+---
+name: converting-pdf-to-markdown
+description: Convert PDF files to Markdown using markitdown. Use when users want to convert a PDF to Markdown, extract text from PDFs, or transform PDF documents into readable Markdown format.
+translated_from: SKILL.md
+---
+
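+# Converting PDF to Markdown
+
+Lightweight PDF-to-Markdown conversion using Microsoft's markitdown library.
+
+## Prerequisites
+
+- [uv](https://docs.astral.sh/uv/) must be installed.
+
+Scripts declare their own dependencies via PEP 723 inline metadata and are run with
+`uv run`, so no manual package installation is needed.
+
+## Workflow
+
+### Step 1: Check page count and TOC
+
+Before converting, run the TOC extraction script: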
+
+```bash
+uv run "${CLAUDE_CONFIG_DIR:-$HOME/.claude}/skills/converting-pdf-to-markdown/scripts/extract_toc.py" input.pdf
+```
+
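+Output is JSON with `page_count` and `toc` (bookmark entries with level, title, page).
+
+### Step 2: Decide whether to split
+
+- **100 pages or fewer**: Convert the entire file. Go to Step 3.
+- **Over 100 pages with TOC**: Present the TOC structure to the user and recommend splitting by top-level headings. After user confirmation, split with: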
+
+```bash
+uv run "${CLAUDE_CONFIG_DIR:-$HOME/.claude}/skills/converting-pdf-to-markdown/scripts/split_pdf.py" input.pdf
+uv run "${CLAUDE_CONFIG_DIR:-$HOME/.claude}/skills/converting-pdf-to-markdown/scripts/split_pdf.py" input.pdf -o output_dir/
+uv run "${CLAUDE_CONFIG_DIR:-$HOME/.claude}/skills/converting-pdf-to-markdown/scripts/split_pdf.py" input.pdf -l 2  # split on level-2 headings
+```
+
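+This creates numbered PDF files (e.g., `01-introduction.pdf`) in a subdirectory. Convert each part separately with Step 3.
+
+- **Over 100 pages without TOC**: Inform the user that there are no bookmarks for automatic splitting. Suggest converting as-is or ask for manual page ranges.
+
+### Step 3: Convert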
+
+```bash
+uv run "${CLAUDE_CONFIG_DIR:-$HOME/.claude}/skills/converting-pdf-to-markdown/scripts/convert_pdf.py" input.pdf
+uv run "${CLAUDE_CONFIG_DIR:-$HOME/.claude}/skills/converting-pdf-to-markdown/scripts/convert_pdf.py" input.pdf -o output.md
+```
+
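+Default output is `input.md` in the same directory. Use `-o` to specify a custom path.
+
+### Step 4: Clean up line breaks
+
+PDF text wraps at display boundaries, leaving unnecessary line breaks in the Markdown. Always run the cleanup script after conversion: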
+
+```bash
+uv run "${CLAUDE_CONFIG_DIR:-$HOME/.claude}/skills/converting-pdf-to-markdown/scripts/clean_linebreaks.py" output.md
+uv run "${CLAUDE_CONFIG_DIR:-$HOME/.claude}/skills/converting-pdf-to-markdown/scripts/clean_linebreaks.py" output.md -o cleaned.md
+```
+
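+Without `-o`, the file is overwritten in-place. The script joins continuation lines while preserving headings, lists, code blocks, tables, and paragraph separators.
+
+### Step 5: Review
+
+Check the output for remaining issues:
+
+- Broken headings or formatting artifacts
+- Unwanted page numbers or headers/footers
+- Table formatting issues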
@@ -0,0 +1,70 @@
+---
+name: converting-pdf-to-markdown
+description: Convert PDF files to Markdown using markitdown. Use when users want to convert a PDF to Markdown, extract text from PDFs, or transform PDF documents into readable Markdown format.
+---
+
+# Converting PDF to Markdown
+
+Lightweight PDF-to-Markdown conversion using Microsoft's markitdown library.
+
+## Prerequisites
+
+- [uv](https://docs.astral.sh/uv/) must be installed.
+
+Scripts declare their own dependencies via PEP 723 inline metadata and are run with
+`uv run`, so no manual package installation is needed.
+
+## Workflow
+
+### Step 1: Check page count and TOC
+
+Before converting, run the TOC extraction script:
+
+```bash
+uv run "${CLAUDE_CONFIG_DIR:-$HOME/.claude}/skills/converting-pdf-to-markdown/scripts/extract_toc.py" input.pdf
+```
+
+Output is JSON with `page_count` and `toc` (bookmark entries with level, title, page).
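As a rough illustration of consuming this output, the sketch below parses a sample payload and lists the top-level bookmarks. The exact JSON layout is an assumption: the per-entry key names `level`, `title`, and `page` are inferred from the description above, and the sample data is made up.

```python
import json

# Hypothetical sample of what extract_toc.py might print. The per-entry key
# names ("level", "title", "page") are assumed, not confirmed by the script.
SAMPLE_OUTPUT = """
{
  "page_count": 240,
  "toc": [
    {"level": 1, "title": "Introduction", "page": 1},
    {"level": 2, "title": "Scope", "page": 3},
    {"level": 1, "title": "Guidelines", "page": 12}
  ]
}
"""


def top_level_entries(raw_json: str) -> tuple[int, list[tuple[str, int]]]:
    """Return the page count and the (title, page) pairs of level-1 bookmarks."""
    data = json.loads(raw_json)
    entries = [
        (entry["title"], entry["page"])
        for entry in data.get("toc", [])
        if entry.get("level") == 1
    ]
    return data["page_count"], entries


if __name__ == "__main__":
    page_count, entries = top_level_entries(SAMPLE_OUTPUT)
    print(f"{page_count} pages")
    for title, page in entries:
        print(f"p.{page}: {title}")
```

A summary like this is one way to present the TOC structure to the user before Step 2.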
+
+### Step 2: Decide whether to split
+
+- **100 pages or fewer**: Convert the entire file. Go to Step 3.
+- **Over 100 pages with TOC**: Present the TOC structure to the user and recommend splitting by top-level headings. After user confirmation, split with:
+
+```bash
+uv run "${CLAUDE_CONFIG_DIR:-$HOME/.claude}/skills/converting-pdf-to-markdown/scripts/split_pdf.py" input.pdf
+uv run "${CLAUDE_CONFIG_DIR:-$HOME/.claude}/skills/converting-pdf-to-markdown/scripts/split_pdf.py" input.pdf -o output_dir/
+uv run "${CLAUDE_CONFIG_DIR:-$HOME/.claude}/skills/converting-pdf-to-markdown/scripts/split_pdf.py" input.pdf -l 2  # split on level-2 headings
+```
+
+This creates numbered PDF files (e.g., `01-introduction.pdf`) in a subdirectory. Convert each part separately with Step 3.
+
+- **Over 100 pages without TOC**: Inform the user that there are no bookmarks for automatic splitting. Suggest converting as-is or ask for manual page ranges.
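The three rules above amount to a small decision function. The following is a minimal sketch, not part of the skill's scripts; `Action`, `decide`, and `PAGE_THRESHOLD` are hypothetical names introduced for illustration.

```python
from enum import Enum

PAGE_THRESHOLD = 100  # threshold taken from the rules above


class Action(Enum):
    """Hypothetical labels for the three outcomes of Step 2."""

    CONVERT_WHOLE = "convert the entire file"
    PROPOSE_SPLIT = "present the TOC and recommend splitting"
    ASK_USER = "no bookmarks: convert as-is or ask for page ranges"


def decide(page_count: int, toc: list) -> Action:
    """Map the page count and TOC entries to one of the Step 2 outcomes."""
    if page_count <= PAGE_THRESHOLD:
        return Action.CONVERT_WHOLE
    if toc:
        return Action.PROPOSE_SPLIT
    return Action.ASK_USER


if __name__ == "__main__":
    for pages, toc in [(80, []), (240, [{"level": 1}]), (240, [])]:
        print(pages, len(toc), decide(pages, toc).value)
```

Note that a file of exactly 100 pages falls into the first branch, matching "100 pages or fewer".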
+
+### Step 3: Convert
+
+```bash
+uv run "${CLAUDE_CONFIG_DIR:-$HOME/.claude}/skills/converting-pdf-to-markdown/scripts/convert_pdf.py" input.pdf
+uv run "${CLAUDE_CONFIG_DIR:-$HOME/.claude}/skills/converting-pdf-to-markdown/scripts/convert_pdf.py" input.pdf -o output.md
+```
+
+Default output is `input.md` in the same directory. Use `-o` to specify a custom path.
+
+### Step 4: Clean up line breaks
+
+PDF text wraps at display boundaries, leaving unnecessary line breaks in the Markdown. Always run the cleanup script after conversion:
+
+```bash
+uv run "${CLAUDE_CONFIG_DIR:-$HOME/.claude}/skills/converting-pdf-to-markdown/scripts/clean_linebreaks.py" output.md
+uv run "${CLAUDE_CONFIG_DIR:-$HOME/.claude}/skills/converting-pdf-to-markdown/scripts/clean_linebreaks.py" output.md -o cleaned.md
+```
+
+Without `-o`, the file is overwritten in-place. The script joins continuation lines while preserving headings, lists, code blocks, tables, and paragraph separators. It also drops lines that consist only of a 1-4 digit number, treating them as page numbers.
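When a PDF has been split into several parts, Steps 3 and 4 have to be repeated for each part. The sketch below builds the two commands per PDF so they can be run in order. It assumes that `convert_pdf.py` writes its default output next to the input with the suffix changed to `.md`, as the `input.md` example suggests; `scripts_dir` and `build_commands` are hypothetical helpers, not part of the skill.

```python
import os
import subprocess
import sys
from pathlib import Path

SKILL_NAME = "converting-pdf-to-markdown"


def scripts_dir() -> Path:
    """Mirror the shell expansion ${CLAUDE_CONFIG_DIR:-$HOME/.claude}."""
    # `or` covers both an unset and an empty variable, like `:-` in the shell.
    config_dir = os.environ.get("CLAUDE_CONFIG_DIR") or str(Path.home() / ".claude")
    return Path(config_dir) / "skills" / SKILL_NAME / "scripts"


def build_commands(pdf: Path) -> list[list[str]]:
    """Return the Step 3 and Step 4 commands for one PDF, in execution order."""
    scripts = scripts_dir()
    # Assumed default output path of convert_pdf.py (same directory, .md suffix).
    markdown = pdf.with_suffix(".md")
    return [
        ["uv", "run", str(scripts / "convert_pdf.py"), str(pdf)],
        ["uv", "run", str(scripts / "clean_linebreaks.py"), str(markdown)],
    ]


if __name__ == "__main__":
    # Usage: python run_steps.py part1.pdf part2.pdf ...
    for pdf_arg in sys.argv[1:]:
        for command in build_commands(Path(pdf_arg)):
            subprocess.run(command, check=True)
```

Because the cleanup command is given no `-o`, each Markdown file is cleaned in place.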
+
+### Step 5: Review
+
+Check the output for remaining issues:
+
+- Broken headings or formatting artifacts
+- Unwanted page numbers or headers/footers
+- Table formatting issues
@@ -0,0 +1,185 @@
+#!/usr/bin/env python3
+# /// script
+# requires-python = ">=3.10"
+# ///
+"""Remove unnecessary line breaks from converted Markdown.
+
+PDFs often insert line breaks for display purposes that don't represent
+actual paragraph breaks. This script joins continuation lines while
+preserving intentional structure (headings, lists, code blocks, blank lines).
+
+Handles two common patterns:
+1. Direct continuation: text line immediately followed by another text line.
+2. Single-blank-line continuation: PDF text where every line is separated
+   by a blank line. Distinguishes mid-sentence breaks from true paragraph
+   breaks by checking whether the preceding line ends mid-sentence.
+"""
+
+import argparse
+import re
+import sys
+from pathlib import Path
+
+
+def is_structural_line(line: str) -> bool:
+    """Check if a line is a Markdown structural element that should not be joined."""
+    stripped = line.strip()
+    if not stripped:
+        return True  # blank line (paragraph separator)
+    if stripped.startswith("#"):
+        return True  # heading
+    if re.match(r"^[-*+]\s", stripped):
+        return True  # unordered list item
+    if re.match(r"^\d+\.\s", stripped):
+        return True  # ordered list item
+    if stripped.startswith(("```", "~~~")):
+        return True  # code fence
+    if stripped.startswith("|"):
+        return True  # table row
+    if stripped.startswith(">"):
+        return True  # blockquote
+    if re.match(r"^-{3,}$|^\*{3,}$|^_{3,}$", stripped):
+        return True  # horizontal rule
+    return False
+
+
+def _is_page_number(line: str) -> bool:
+    """Check if a line consists solely of a 1-4 digit number (treated as a page number)."""
+    return bool(re.match(r"^\s*\d{1,4}\s*$", line))
+
+
+def _looks_like_continuation(prev: str, curr: str) -> bool:
+    """Determine if curr is a continuation of prev (mid-sentence break).
+
+    Returns True when prev appears to end mid-sentence, meaning the blank
+    line between them was just a PDF display artifact, not a paragraph break.
+    """
+    prev_stripped = prev.rstrip()
+    curr_stripped = curr.strip()
+
+    if not prev_stripped or not curr_stripped:
+        return False
+
+    # Previous line ends with a sentence-ending punctuation → paragraph break
+    if re.search(r"[。．.！!？?）\)」』】〉》\]:]$", prev_stripped):
+        return False
+
+    # Previous line looks like a short heading/title (no sentence-ending punct,
+    # relatively short, and not ending with a particle or connective)
+    # e.g. "（１）本ガイドラインの目的", "①対象範囲", "第１章 概要"
+    if len(prev_stripped) <= 60 and re.match(
+        r"^[（(①-⑳❶-❿第０-９0-9]", prev_stripped
+    ):
+        return False
+
+    # Current line starts with a structural/new-paragraph marker
+    if re.match(r"^[（(「『【〈《①-⑳❶-❿※◆●▶■□▸◇★☆]", curr_stripped):
+        return False
+
+    # Current line starts with a heading-like pattern (e.g. "第１章", "１．")
+    if re.match(r"^(第[０-９一-九]+|[０-９]+[．.]|[0-9]+[．.])", curr_stripped):
+        return False
+
+    # Current line looks like a figure/table caption
+    if re.match(r"^(図表|図|表)\s*\d", curr_stripped):
+        return False
+
+    # Otherwise, prev ended mid-sentence → join
+    return True
+
+
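+def clean(text: str) -> str:
+    """Join PDF-wrapped continuation lines and drop standalone page numbers.
+
+    Lines inside fenced code blocks are left untouched.
+    """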
+    lines = text.split("\n")
+    result: list[str] = []
+    in_code_block = False
+    i = 0
+    n = len(lines)
+
+    while i < n:
+        line = lines[i]
+        stripped = line.strip()
+
+        # Toggle code block state
+        if stripped.startswith(("```", "~~~")):
+            in_code_block = not in_code_block
+            result.append(line)
+            i += 1
+            continue
+
+        # Never modify lines inside code blocks
+        if in_code_block:
+            result.append(line)
+            i += 1
+            continue
+
+        # Remove standalone page numbers
+        if _is_page_number(line):
+            i += 1
+            continue
+
+        # Blank line: check if it's a mid-sentence break or real paragraph break
+        if not stripped:
+            # Look ahead: blank line followed by a text line
+            if (
+                i + 1 < n
+                and lines[i + 1].strip()
+                and not is_structural_line(lines[i + 1])
+                and not _is_page_number(lines[i + 1])
+                and result
+                and result[-1].strip()
+                and not is_structural_line(result[-1])
+            ):
+                # Decide based on whether previous line ended mid-sentence
+                if _looks_like_continuation(result[-1], lines[i + 1]):
+                    # Skip this blank line — next iteration will join the text
+                    i += 1
+                    continue
+
+            result.append(line)
+            i += 1
+            continue
+
+        # Structural lines are kept as-is
+        if is_structural_line(line):
+            result.append(line)
+            i += 1
+            continue
+
+        # Text line: join to previous if it's a continuation
+        if result and result[-1].strip() and not is_structural_line(result[-1]):
+            if _looks_like_continuation(result[-1], stripped):
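+                # Join with a space between ASCII characters (e.g. English words);
+                # CJK text is concatenated directly.
+                prev_text = result[-1].rstrip()
+                sep = " " if prev_text[-1].isascii() and stripped[0].isascii() else ""
+                result[-1] = prev_text + sep + stripped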
+            else:
+                result.append(line)
+        else:
+            result.append(line)
+
+        i += 1
+
+    return "\n".join(result)
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser(
+        description="Remove unnecessary line breaks from Markdown converted from PDF"
+    )
+    parser.add_argument("file", help="Markdown file to clean")
+    parser.add_argument("-o", "--output", help="Output file (default: overwrite input)")
+    args = parser.parse_args()
+
+    path = Path(args.file)
+    if not path.exists():
+        print(f"Error: File not found: {path}", file=sys.stderr)
+        sys.exit(1)
+
+    text = path.read_text(encoding="utf-8")
+    cleaned = clean(text)
+
+    out = Path(args.output) if args.output else path
+    out.write_text(cleaned, encoding="utf-8")
+
+    if args.output:
+        print(f"Cleaned: {path} -> {out}")
+    else:
+        print(f"Cleaned: {path} (in-place)")
+
+
+if __name__ == "__main__":
+    main()
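The sentence-ending check in `_looks_like_continuation` can be tried in isolation to see which line endings count as a paragraph break. The sketch below copies the character class from that function; `ends_sentence` is a hypothetical helper name, and the sample lines are made up.

```python
import re

# Same character class as the sentence-ending check in _looks_like_continuation.
SENTENCE_END = re.compile(r"[。．.！!？?）\)」』】〉》\]:]$")


def ends_sentence(line: str) -> bool:
    """True when the line ends with punctuation the script treats as a paragraph end."""
    return bool(SENTENCE_END.search(line.rstrip()))


if __name__ == "__main__":
    samples = [
        "This sentence wraps onto",   # no closing punctuation: candidate for joining
        "and finishes here.",         # ends with a period: kept as a break
        "本ガイドラインは、",            # ends with a comma-like mark: candidate for joining
        "目的を示す。",                 # ends with a full stop: kept as a break
    ]
    for sample in samples:
        print(repr(sample), ends_sentence(sample))
```

A line that does not end a sentence is only a candidate for joining: the full function applies further checks, such as heading-like prefixes and caption patterns, before it returns `True`.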