SwiftText

A collection of text utilities, originally built to extract text from various sources for use by LLM agents.

Overview

SwiftText provides Swift libraries and command-line tools for extracting text from various document formats. The extracted text is optimized for use with Large Language Models (LLMs) and AI agents.

Modules

SwiftTextHTML

Extracts text or Markdown from HTML using:

HTMLParser (libxml2-backed)

Features:

Plain text extraction
Markdown conversion (links, lists, tables, code)

SwiftTextOCR

Extracts text from images using:

Vision OCR - Text recognition for bitmap content

Features:

Preserves logical line structure and reading order
Maintains vertical spacing between paragraphs
High-resolution OCR (300 DPI) for accurate text recognition
Optional Markdown output using Vision document segmentation (iOS 26+, macOS 26+)
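To make the first two features concrete, here is a minimal sketch of one way recognized fragments could be grouped into lines, put in reading order, and separated by blank lines at paragraph gaps. The `Fragment` type, the `layoutText` function, and the thresholds are hypothetical and chosen for illustration; they are not SwiftText's actual types or heuristics.

```swift
import Foundation

// Hypothetical stand-in for a recognized text fragment; not SwiftText's actual type.
struct Fragment {
	let text: String
	let x: Double      // left edge
	let y: Double      // top edge, increasing downward
	let height: Double
}

/// Groups fragments into lines, orders them top-to-bottom and left-to-right,
/// and inserts a blank line where the vertical gap suggests a paragraph break.
func layoutText(_ fragments: [Fragment], paragraphGapFactor: Double = 1.0) -> String {
	// Sort by vertical position so fragments on the same line end up adjacent.
	let sorted = fragments.sorted { $0.y < $1.y }
	var lines: [[Fragment]] = []

	for fragment in sorted {
		// A fragment joins the current line when its top edge is within
		// half a line height of that line's first fragment (assumed threshold).
		if let first = lines.last?.first, abs(fragment.y - first.y) < first.height / 2 {
			lines[lines.count - 1].append(fragment)
		} else {
			lines.append([fragment])
		}
	}

	var output: [String] = []
	var previous: Fragment?
	for line in lines {
		// Left-to-right order within a line.
		let ordered = line.sorted { $0.x < $1.x }
		guard let first = ordered.first else { continue }
		if let prev = previous {
			// Gap between the bottom of the previous line and the top of this one.
			let gap = first.y - (prev.y + prev.height)
			if gap > prev.height * paragraphGapFactor {
				output.append("")
			}
		}
		output.append(ordered.map(\.text).joined(separator: " "))
		previous = first
	}
	return output.joined(separator: "\n")
}
```

Comparing the gap to the line height rather than to a fixed point value keeps the heuristic independent of font size and rendering resolution.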

SwiftTextPDF

Extracts text from PDFs using a combination of:

PDFKit text selection - For PDFs with embedded text layers
Vision OCR - Automatic fallback for scanned documents or PDFs without selectable text

Features:

Handles multi-page documents with page break markers
Preserves logical line structure and reading order
Maintains vertical spacing between paragraphs
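The fallback strategy described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the library's implementation: `PageSource`, `text(for:)`, `extractText(from:pageBreak:)`, and the page break marker string are all hypothetical, with `embeddedText` standing in for the PDFKit text layer and `recognize` standing in for a Vision OCR pass.

```swift
import Foundation

// Hypothetical page model; neither property is SwiftText's actual API.
struct PageSource {
	let embeddedText: String?      // stands in for the PDFKit text layer
	let recognize: () -> String    // stands in for a Vision OCR pass
}

/// Returns the trimmed embedded text when it has visible characters,
/// otherwise falls back to OCR.
func text(for page: PageSource) -> String {
	if let embedded = page.embeddedText?.trimmingCharacters(in: .whitespacesAndNewlines),
	   !embedded.isEmpty {
		return embedded
	}
	return page.recognize()
}

/// Joins per-page text with a page break marker (the marker format is an assumption).
func extractText(from pages: [PageSource], pageBreak: String = "\n\n--- Page Break ---\n\n") -> String {
	pages.map(text(for:)).joined(separator: pageBreak)
}
```

Deciding per page rather than per document means a PDF that mixes digital and scanned pages only pays the OCR cost for the pages that need it.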

SwiftTextDOCX

Extracts text and basic structure from DOCX archives using:

ZIPFoundation to read the Word archive
XMLParser to parse document, styles, and numbering

Features:

Plain text paragraph extraction
Markdown output with headings, emphasis, and lists

Installation

Add SwiftText to your Swift package dependencies:

dependencies: [
    .package(url: "https://github.com/your-repo/SwiftText.git", branch: "main")
]

Then pick either specific products or the umbrella module.

Individual products (import only what you need):

.target(
	name: "YourTarget",
	dependencies: [
		.product(name: "SwiftTextHTML", package: "SwiftText"),
		.product(name: "SwiftTextOCR", package: "SwiftText"),
		.product(name: "SwiftTextPDF", package: "SwiftText"),
		.product(name: "SwiftTextDOCX", package: "SwiftText")
	]
)

Umbrella module (single import), with traits:

.package(
	url: "https://github.com/your-repo/SwiftText.git",
	branch: "main",
	traits: [.defaults, "HTML", "PDF", "DOCX"]
),
.target(
	name: "YourTarget",
	dependencies: [
		.product(name: "SwiftText", package: "SwiftText")
	]
)

SwiftText defaults to OCR only. Enable traits as needed:

traits: [.defaults, "HTML", "PDF", "DOCX"]

Usage

Library Usage

HTML (SwiftTextHTML)

import Foundation
import SwiftTextHTML

let url = URL(string: "https://example.com")!
let (data, _) = try await URLSession.shared.data(from: url)
let document = try await HTMLDocument(data: data, baseURL: url)
let markdown = document.markdown()

PDF (SwiftTextPDF)

import PDFKit
import SwiftTextPDF

// Load a PDF document
let pdfURL = URL(fileURLWithPath: "/path/to/document.pdf")
guard let document = PDFDocument(url: pdfURL) else {
	fatalError("Could not load PDF")
}

// Extract all text as a single string
let text = document.extractText()
print(text)

// For more control, access TextLine objects directly
let textLines = document.textLines()
for textLine in textLines {
	print("Position: \(textLine.yPosition), Text: \(textLine.combinedText)")
}

PDF Markdown (SwiftTextOCR + SwiftTextPDF, iOS/macOS 26+)

import PDFKit
import SwiftTextOCR
import SwiftTextPDF

let pdfURL = URL(fileURLWithPath: "/path/to/document.pdf")
guard let document = PDFDocument(url: pdfURL) else {
	fatalError("Could not load PDF")
}

if #available(iOS 26.0, tvOS 26.0, macOS 26.0, *) {
	let allLines = document.textLines()
	var allBlocks: [DocumentBlock] = []

	for pageIndex in 0..<document.pageCount {
		guard let page = document.page(at: pageIndex) else { continue }
		let semantics = try await page.documentSemantics(dpi: 300)
		let layoutSize = page.bounds(for: .mediaBox).size
		let grouped = TextLineSemanticComposer.composeBlocks(
			from: page.textLines(),
			semantics: semantics,
			layoutSize: layoutSize
		)
		allBlocks.append(contentsOf: grouped)
	}

	let markdown = DocumentBlockMarkdownRenderer.markdown(
		from: allBlocks,
		textLines: allLines.map { line in
			// Union all fragment bounds to get the bounding box of the whole line
			let bounds = line.fragments.reduce(line.fragments.first?.bounds ?? .zero) { $0.union($1.bounds) }
			return DocumentBlock.TextLine(text: line.combinedText, bounds: bounds)
		}
	)
	print(markdown)
}

Images (SwiftTextOCR)

import SwiftTextOCR

// `cgImage` is a CGImage you have already loaded
let textLines = cgImage.textLines(imageSize: CGSize(width: cgImage.width, height: cgImage.height))
let text = textLines.string()

DOCX (SwiftTextDOCX)

import Foundation
import SwiftTextDOCX

let url = URL(fileURLWithPath: "/path/to/document.docx")
let docx = try DocxFile(url: url)

let plainText = docx.plainText()
let markdown = docx.markdown()

Command Line Tool

Build and run the CLI:

swift build
swift run swifttext ocr /path/to/document.pdf

Options:

ocr: --markdown/-m (Vision segmentation), --save-images <dir>, --output-path/-o <file>
html: --markdown/-m, --save-images <dir>, --output-path/-o <file>, --webkit, --via-pdf
docx: --markdown/-m (headings and lists), --output-path/-o <file>, --save-images
overlay: --output-path/-o <file>, --dpi <value>, --raw

Examples:

# Extract formatted text from a PDF
swifttext ocr ~/Documents/report.pdf

# Using a relative path
swifttext ocr ../folder/file.pdf

# Save OCR output to a file
swifttext ocr --output-path ./output.txt ~/Documents/report.pdf

# Save images while producing Markdown from a PDF
swifttext ocr --markdown --save-images ./images ~/Documents/report.pdf

# Extract plain text from a Word document
swifttext docx ~/Documents/contract.docx

# Extract Markdown from a Word document
swifttext docx --markdown ~/Documents/contract.docx

# Extract Markdown from HTML (optionally load via WebKit)
swifttext html --markdown https://example.com
swifttext html --markdown --webkit https://example.com

# Save Word output to a file
swifttext docx --output-path ./contract.txt ~/Documents/contract.docx

# Extract embedded images to the output directory or current directory
swifttext docx --save-images ~/Documents/contract.docx

# Render an overlay PDF for inspection
swifttext overlay --dpi 300 ~/Documents/report.pdf

Requirements

Swift 5.9+
Platforms: macOS, iOS, tvOS, watchOS (any version that supports PDFKit)

Note:

PDF text extraction (via PDFKit) works on any platform that supports PDFKit
OCR fallback requires iOS 13.0+, tvOS 13.0+, or macOS 10.15+ and is enabled automatically through availability checks
OCR Markdown segmentation requires iOS 26.0+, tvOS 26.0+, or macOS 26.0+

License

MIT License
