PDF — Complete Guide

Overview

For advanced features, JavaScript libraries, and detailed examples, see pdf/reference.md. For filling PDF forms, read pdf/forms.md and follow its instructions.

Quick Start

from pypdf import PdfReader

reader = PdfReader("document.pdf")
print(f"Pages: {len(reader.pages)}")

# Concatenate the text of every page; pages without a text layer add nothing
text = ""
for page in reader.pages:
    text += page.extract_text() or ""

Python Libraries

pypdf — Basic Operations

Merge PDFs

from pypdf import PdfWriter, PdfReader

writer = PdfWriter()
for pdf_file in ["doc1.pdf", "doc2.pdf", "doc3.pdf"]:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)

with open("merged.pdf", "wb") as output:
    writer.write(output)

Split PDF

from pypdf import PdfReader, PdfWriter

# Write each page of the input to its own single-page file
reader = PdfReader("input.pdf")
for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    with open(f"page_{i+1}.pdf", "wb") as output:
        writer.write(output)

Extract Metadata

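from pypdf import PdfReader

reader = PdfReader("document.pdf")
meta = reader.metadata  # None when the file has no document information dictionary
if meta is not None:
    print(f"Title: {meta.title}")
    print(f"Author: {meta.author}")
    print(f"Subject: {meta.subject}")
    print(f"Creator: {meta.creator}")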

Rotate Pages

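from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()
page = reader.pages[0]
page.rotate(90)  # Rotate 90 degrees clockwise (angle must be a multiple of 90)
writer.add_page(page)  # Only the first page is written to the output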
with open("rotated.pdf", "wb") as output:
    writer.write(output)

pdfplumber — Text and Table Extraction

Extract Text with Layout

import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text()
        print(text)

Extract Tables

import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    for i, page in enumerate(pdf.pages):
        tables = page.extract_tables()
        for j, table in enumerate(tables):
            print(f"Table {j+1} on page {i+1}:")
            for row in table:
                print(row)

Advanced Table Extraction (to Excel)

import pandas as pd
import pdfplumber

with pdfplumber.open("document.pdf") as pdf:
    all_tables = []
    for page in pdf.pages:
        tables = page.extract_tables()
        for table in tables:
            if table:
                df = pd.DataFrame(table[1:], columns=table[0])
                all_tables.append(df)

if all_tables:
    combined_df = pd.concat(all_tables, ignore_index=True)
    # Writing .xlsx files requires the openpyxl package
    combined_df.to_excel("extracted_tables.xlsx", index=False)
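Header rows extracted from PDFs can contain empty cells, embedded line breaks, or repeated names, and any of these can make the combined DataFrame awkward to work with. The helper below is a minimal sketch of one way to clean a header row before passing it as `columns=`. The function name `normalize_headers` and the `column_N` placeholder naming are assumptions for illustration, not part of pdfplumber or pandas.

```python
def normalize_headers(header):
    """Return unique, non-empty column names for an extracted header row."""
    seen = {}
    names = []
    for idx, cell in enumerate(header):
        # Empty or None cells get a positional placeholder name (hypothetical scheme);
        # internal whitespace and line breaks collapse to single spaces.
        name = " ".join(str(cell).split()) if cell else f"column_{idx + 1}"
        count = seen.get(name, 0)
        seen[name] = count + 1
        # Repeated names get a numeric suffix so every column stays addressable.
        names.append(name if count == 0 else f"{name}_{count + 1}")
    return names
```

With this in place, the DataFrame line in the example above could become `pd.DataFrame(table[1:], columns=normalize_headers(table[0]))`.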

reportlab — Create PDFs

Basic PDF Creation

from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

c = canvas.Canvas("hello.pdf", pagesize=letter)
width, height = letter
c.drawString(100, height - 100, "Hello World!")
c.line(100, height - 140, 400, height - 140)
c.save()

Create PDF with Multiple Pages

from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
from reportlab.lib.styles import getSampleStyleSheet

doc = SimpleDocTemplate("report.pdf", pagesize=letter)
styles = getSampleStyleSheet()
story = []

story.append(Paragraph("Report Title", styles['Title']))
story.append(Spacer(1, 12))
story.append(Paragraph("Body content.", styles['Normal']))
story.append(PageBreak())
story.append(Paragraph("Page 2", styles['Heading1']))

doc.build(story)

Subscripts and Superscripts

IMPORTANT: Never use Unicode subscript/superscript characters (₀₁₂₃, ⁰¹²³) in ReportLab PDFs. Built-in fonts don't include these glyphs — they render as solid black boxes.

Use ReportLab's XML markup tags in Paragraph objects instead:

# Assumes Paragraph and styles are defined as in the multi-page example above
chemical = Paragraph("H<sub>2</sub>O", styles['Normal'])
squared = Paragraph("x<super>2</super> + y<super>2</super>", styles['Normal'])
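
When source text already contains Unicode subscript or superscript digits (for example, text copied from another document), one option is to convert them to the markup tags before building the Paragraph. The following is a minimal sketch that handles digits only. The function name `to_reportlab_markup` is hypothetical, and the function does not escape other XML-special characters such as `&` or `<`.

```python
import re

SUB_DIGITS = "₀₁₂₃₄₅₆₇₈₉"
SUPER_DIGITS = "⁰¹²³⁴⁵⁶⁷⁸⁹"
_SUB_TABLE = str.maketrans(SUB_DIGITS, "0123456789")
_SUPER_TABLE = str.maketrans(SUPER_DIGITS, "0123456789")


def to_reportlab_markup(text):
    """Replace runs of Unicode sub/superscript digits with <sub>/<super> tags."""
    text = re.sub(
        f"[{SUB_DIGITS}]+",
        lambda m: f"<sub>{m.group().translate(_SUB_TABLE)}</sub>",
        text,
    )
    text = re.sub(
        f"[{SUPER_DIGITS}]+",
        lambda m: f"<super>{m.group().translate(_SUPER_TABLE)}</super>",
        text,
    )
    return text
```

The result can be passed straight to `Paragraph(...)`, e.g. `Paragraph(to_reportlab_markup("H₂O"), styles['Normal'])`.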

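For text drawn directly on a canvas (not Paragraph objects), the markup tags are not interpreted. Draw the subscript or superscript as a separate string with a smaller font size and a shifted baseline.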
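
As a rough illustration of the manual approach for canvas text, the sketch below draws a base string followed by a lowered, smaller subscript. It relies only on the canvas methods `setFont`, `drawString`, and `stringWidth`. The function name `draw_with_subscript`, the 0.65 size ratio, and the 0.25 baseline drop are assumptions chosen for illustration, not ReportLab defaults.

```python
def draw_with_subscript(c, x, y, base, sub, font="Helvetica", size=12):
    """Draw `base` at (x, y), then `sub` as a subscript; return the ending x."""
    sub_size = size * 0.65  # assumed ratio: subscript at 65% of the base size
    drop = size * 0.25      # assumed offset: lower the baseline by 25% of the size

    # Base text at the normal size and baseline
    c.setFont(font, size)
    c.drawString(x, y, base)
    x += c.stringWidth(base, font, size)

    # Subscript: smaller font, shifted down
    c.setFont(font, sub_size)
    c.drawString(x, y - drop, sub)
    x += c.stringWidth(sub, font, sub_size)

    c.setFont(font, size)  # restore the base font for subsequent drawing
    return x
```

Here `c` would be a `reportlab.pdfgen.canvas.Canvas` such as the one in the Basic PDF Creation example. Chaining works by feeding the returned x into the next call, for example drawing "H", "2" and then continuing with "O" from the returned position. For a superscript, the same idea applies with the baseline raised instead of lowered.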

Command-Line Tools

pdftotext (poppler-utils)

pdftotext input.pdf output.txt
pdftotext -layout input.pdf output.txt   # preserve layout
pdftotext -f 1 -l 5 input.pdf output.txt  # pages 1-5

qpdf

qpdf --empty --pages file1.pdf file2.pdf -- merged.pdf
qpdf input.pdf --pages . 1-5 -- pages1-5.pdf
qpdf input.pdf --pages . 6-10 -- pages6-10.pdf
qpdf input.pdf output.pdf --rotate=+90:1  # rotate page 1
qpdf --password=mypassword --decrypt encrypted.pdf decrypted.pdf

pdftk (if available)

pdftk file1.pdf file2.pdf cat output merged.pdf
pdftk input.pdf burst
pdftk input.pdf rotate 1east output rotated.pdf

Common Tasks

Extract Text from Scanned PDFs

import pytesseract
from pdf2image import convert_from_path

images = convert_from_path('scanned.pdf')
text = ""
for i, image in enumerate(images):
    text += f"Page {i+1}:\n"
    text += pytesseract.image_to_string(image)
    text += "\n\n"
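
OCR is slow, so it can help to run it only on pages where ordinary text extraction returns little or nothing. The sketch below is one possible heuristic under stated assumptions: the function name `pages_needing_ocr` and the 25-character threshold are hypothetical and should be tuned for the documents at hand.

```python
def pages_needing_ocr(page_texts, min_chars=25):
    """Return 0-based indexes of pages whose extracted text looks empty.

    `page_texts` is a list with one entry per page, e.g. the results of
    calling extract_text() on each page; entries may be None or "".
    """
    flagged = []
    for index, text in enumerate(page_texts):
        # Count only non-whitespace characters so blank lines do not mask a scan
        visible = "".join((text or "").split())
        if len(visible) < min_chars:
            flagged.append(index)
    return flagged
```

The input list could come from `[page.extract_text() for page in PdfReader("scanned.pdf").pages]`. Flagged pages can then be rendered individually with pdf2image's `first_page` and `last_page` arguments, which are 1-based, before being passed to pytesseract.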

Add Watermark

from pypdf import PdfReader, PdfWriter

watermark = PdfReader("watermark.pdf").pages[0]
reader = PdfReader("document.pdf")
writer = PdfWriter()

for page in reader.pages:
    page.merge_page(watermark)
    writer.add_page(page)

with open("watermarked.pdf", "wb") as output:
    writer.write(output)

Extract Images

pdfimages -j input.pdf output_prefix
# JPEG-encoded images are saved as output_prefix-000.jpg, output_prefix-001.jpg, etc.
# Images in other encodings are written as .ppm or .pbm files

Password Protection

from pypdf import PdfReader, PdfWriter

reader = PdfReader("input.pdf")
writer = PdfWriter()
for page in reader.pages:
    writer.add_page(page)
writer.encrypt("userpassword", "ownerpassword")
with open("encrypted.pdf", "wb") as output:
    writer.write(output)

Quick Reference

| Task | Best Tool |
|------|-----------|
| Merge PDFs | pypdf |
| Split PDFs | pypdf |
| Extract text | pdfplumber |
| Extract tables | pdfplumber |
| Create PDFs | reportlab |
| Command line merge | qpdf |
| OCR scanned PDFs | pytesseract |
| Fill PDF forms | See pdf/forms.md |

Visual QA (Render and Verify)

Before delivering any created or modified PDF, render pages to PNG and inspect visually. This catches layout bugs that code-level checks miss.

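# Install Poppler if needed (macOS: brew install poppler)
# pdftoppm does not create the output directory, so make it first
mkdir -p /tmp/pdf-preview
pdftoppm -png input.pdf /tmp/pdf-preview/page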

# Or render a specific page range
pdftoppm -png -f 1 -l 3 input.pdf /tmp/pdf-preview/page

Then inspect PNGs with the image tool. Do not deliver until:

  • Text is readable, not clipped or overlapping
  • Tables are aligned with consistent column widths
  • Headers, footers, and page numbers render correctly
  • Charts and images are sharp and properly placed
  • Margins and spacing are consistent across pages
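
The rendering step can also be scripted so that the output directory is created and the resulting PNG paths are returned for inspection. This is a minimal sketch using only the standard library and the `pdftoppm` flags shown above plus `-r` for resolution. The function names `build_pdftoppm_cmd` and `render_pages` and the 150 DPI default are assumptions for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdftoppm_cmd(pdf_path, out_prefix, first=None, last=None, dpi=150):
    """Build the pdftoppm argument list for rendering pages to PNG."""
    cmd = ["pdftoppm", "-png", "-r", str(dpi)]
    if first is not None:
        cmd += ["-f", str(first)]  # first page to render (1-based)
    if last is not None:
        cmd += ["-l", str(last)]   # last page to render (1-based)
    return cmd + [str(pdf_path), str(out_prefix)]


def render_pages(pdf_path, out_dir, **kwargs):
    """Render pages into out_dir and return the PNG paths in page order."""
    if shutil.which("pdftoppm") is None:
        raise RuntimeError("pdftoppm not found on PATH; install Poppler first")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # pdftoppm will not create it
    subprocess.run(
        build_pdftoppm_cmd(pdf_path, out_dir / "page", **kwargs), check=True
    )
    # pdftoppm zero-pads page numbers, so a plain sort keeps page order
    return sorted(out_dir.glob("page-*.png"))
```

For example, `render_pages("input.pdf", "/tmp/pdf-preview", first=1, last=3)` would render the first three pages and return their paths.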

If pdftoppm is unavailable, the macOS sips tool can serve as a fallback (it typically converts only the first page):

import subprocess
subprocess.run(["sips", "-s", "format", "png", "input.pdf", "--out", "/tmp/preview.png"])

Further Reference

  • For advanced pypdfium2 usage: pdf/reference.md
  • For JavaScript libraries (pdf-lib): pdf/reference.md
  • For form filling: pdf/forms.md