Swift port of miso-belica/jusText
A Swift port of the jusText boilerplate removal library. Extracts the main article content from HTML pages by classifying text blocks as good (content) or bad (boilerplate) using a combination of heuristics: link density, stopword density, block length, and context-sensitive neighbour analysis.
Bundled with stopword lists for 100 languages.
Add the package to your Package.swift:
```swift
dependencies: [
    .package(url: "https://github.com/mrowlinson/jusText-swift.git", from: "1.0.0"),
],
targets: [
    .target(name: "MyTarget", dependencies: ["jusText"]),
]
```

Extract the main content using a bundled stoplist:

```swift
import jusText

let html = "..." // your HTML string
let paragraphs = try justext(htmlText: html, language: "English")
for p in paragraphs where !p.isBoilerplate {
    print(p.text)
}
```

Supply your own stoplist instead of a bundled language:

```swift
let stoplist: Set<String> = ["the", "a", "and", "of", "in"]
let paragraphs = try justext(htmlText: html, stoplist: stoplist)
```

Tune the classifier thresholds:

```swift
var options = ClassifierOptions()
options.maxLinkDensity = 0.2   // blocks with more links than this → bad
options.lengthLow = 70         // chars; below this → short
options.lengthHigh = 200       // chars; above this + high stopwords → good
options.stopwordsLow = 0.30    // stopword density threshold (low)
options.stopwordsHigh = 0.32   // stopword density threshold (high)
options.noHeadings = false     // set true to ignore heading context
let paragraphs = try justext(htmlText: html, language: "English", options: options)
```

List the bundled languages:

```swift
let languages = getStoplists()
// Set of 100 language names, e.g. "English", "German", "French", "Spanish", …
```

jusText classifies each block of text extracted from the HTML DOM using a two-pass algorithm.
Pass 1 — context-free classification
Each paragraph is classified independently. The rules are checked in order and the first match wins:
| Condition | Class |
|---|---|
| Link density > `maxLinkDensity` | `bad` |
| Contains © symbol | `bad` |
| Inside a `<select>` | `bad` |
| Length < `lengthLow` and has link chars | `bad` |
| Length < `lengthLow`, no links | `short` |
| Stopword density ≥ `stopwordsHigh` and length > `lengthHigh` | `good` |
| Stopword density ≥ `stopwordsHigh` and length ≤ `lengthHigh` | `neargood` |
| Stopword density ≥ `stopwordsLow` | `neargood` |
| Otherwise | `bad` |
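
The rule table can be read as a chain of early returns. The following is a minimal, self-contained sketch of that reading, not the package's actual implementation: `BlockClass`, `Thresholds`, `Block`, and `classifyContextFree` are hypothetical names introduced for illustration, and the threshold values are the ones shown in the options example earlier in this README.

```swift
// Hypothetical types for illustration only; they are not part of the jusText API.
enum BlockClass { case good, bad, short, nearGood }

struct Thresholds {
    var maxLinkDensity = 0.2
    var lengthLow = 70
    var lengthHigh = 200
    var stopwordsLow = 0.30
    var stopwordsHigh = 0.32
}

struct Block {
    var text: String
    var linkCharCount: Int        // characters that sit inside <a> tags
    var stopwordDensity: Double   // fraction of words that are stopwords
    var insideSelect = false      // true if the block is under a <select>
}

/// Applies the Pass 1 rules in table order; the first matching rule wins.
func classifyContextFree(_ block: Block, _ t: Thresholds = Thresholds()) -> BlockClass {
    let length = block.text.count
    let linkDensity = length == 0 ? 0.0 : Double(block.linkCharCount) / Double(length)

    if linkDensity > t.maxLinkDensity { return .bad }
    if block.text.contains("©" as Character) { return .bad }
    if block.insideSelect { return .bad }
    if length < t.lengthLow {
        // Short blocks with any link text are boilerplate; the rest wait for Pass 2.
        return block.linkCharCount > 0 ? .bad : .short
    }
    if block.stopwordDensity >= t.stopwordsHigh {
        return length > t.lengthHigh ? .good : .nearGood
    }
    if block.stopwordDensity >= t.stopwordsLow { return .nearGood }
    return .bad
}

// Usage: a short, link-free block stays undecided until the second pass.
let teaser = Block(text: "Read the full story", linkCharCount: 0, stopwordDensity: 0.25)
let teaserClass = classifyContextFree(teaser)   // .short
```

Ordering matters here: the link-density and length checks run before any stopword check, so a long navigation bar full of links is rejected even if it happens to contain many stopwords.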
Pass 2 — context-sensitive revision
- `short` blocks surrounded by `good` neighbours → promoted to `good`
- `short` blocks surrounded by `bad` neighbours → become `bad`
- `neargood` blocks with at least one `good` neighbour → promoted to `good`
- Headings near `good` content → promoted to `good`
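
The revision rules can be sketched as two sweeps over the Pass 1 labels. This is a simplified illustration under stated assumptions, not the package's code: `Label`, `neighbour`, and `reviseWithContext` are hypothetical names, document edges are treated as `bad`, a `short` block with one `good` and one `bad` neighbour falls back to `bad`, and heading promotion is left out.

```swift
// Hypothetical names for illustration only; they are not part of the jusText API.
enum Label { case good, bad, short, nearGood }

/// Nearest label in one direction (step = -1 or +1), skipping any label in `skip`.
/// Assumption: running off either end of the document counts as `.bad`.
func neighbour(of index: Int, step: Int, in labels: [Label], skipping skip: Set<Label>) -> Label {
    var i = index + step
    while i >= 0 && i < labels.count {
        if !skip.contains(labels[i]) { return labels[i] }
        i += step
    }
    return .bad
}

func reviseWithContext(_ contextFree: [Label]) -> [Label] {
    var revised = contextFree

    // Sweep 1: decide short blocks from the nearest good/bad block on each side.
    for i in contextFree.indices where contextFree[i] == .short {
        let prev = neighbour(of: i, step: -1, in: contextFree, skipping: [.short, .nearGood])
        let next = neighbour(of: i, step: 1, in: contextFree, skipping: [.short, .nearGood])
        // Assumption: mixed neighbours are treated as bad in this sketch.
        revised[i] = (prev == .good && next == .good) ? .good : .bad
    }

    // Sweep 2: decide neargood blocks; one good neighbour is enough to promote.
    let afterShort = revised
    for i in afterShort.indices where afterShort[i] == .nearGood {
        let prev = neighbour(of: i, step: -1, in: afterShort, skipping: [.nearGood])
        let next = neighbour(of: i, step: 1, in: afterShort, skipping: [.nearGood])
        revised[i] = (prev == .good || next == .good) ? .good : .bad
    }
    return revised
}

// Usage: a short caption between two content paragraphs is kept as content.
let revisedLabels = reviseWithContext([.good, .short, .good])
```

Resolving `short` blocks first means a `neargood` block can be promoted by a `short` neighbour that was itself promoted in the first sweep.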
Each `Paragraph` in the returned array has:
| Property | Type | Description |
|---|---|---|
| `text` | `String` | Normalised text content |
| `classType` | `ParagraphClass` | `.good`, `.bad`, `.short`, `.nearGood` |
| `isBoilerplate` | `Bool` | `true` if not `.good` |
| `heading` | `Bool` | `true` if the block is a heading element |
| `linksDensity()` | `Double` | Fraction of chars inside `<a>` tags |
| `stopwordsDensity(_:)` | `Double` | Fraction of words that are stopwords |
| `domPath` | `String` | Dot-separated DOM path, e.g. `body.article.p` |
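
To make the two density figures concrete, here is a small standalone sketch of how such fractions could be computed. The functions `stopwordDensity(of:stoplist:)` and `linkDensity(linkCharCount:textLength:)` are hypothetical helpers written for this example; the package's own `stopwordsDensity(_:)` and `linksDensity()` may tokenise and count differently.

```swift
// Hypothetical helpers for illustration only; they are not part of the jusText API.

/// Fraction of whitespace-separated words that appear in `stoplist`.
/// Assumption: matching is case-insensitive and punctuation is not stripped.
func stopwordDensity(of text: String, stoplist: Set<String>) -> Double {
    let words = text.split(whereSeparator: { $0.isWhitespace })
    guard !words.isEmpty else { return 0 }
    let hits = words.filter { stoplist.contains($0.lowercased()) }.count
    return Double(hits) / Double(words.count)
}

/// Fraction of a block's characters that sit inside <a> tags.
func linkDensity(linkCharCount: Int, textLength: Int) -> Double {
    guard textLength > 0 else { return 0 }
    return Double(linkCharCount) / Double(textLength)
}

let sampleStoplist: Set<String> = ["the", "a", "and", "of", "in"]
// Six words, three of them stopwords ("The", "in", "the").
let sampleStopwords = stopwordDensity(of: "The cat sat in the hat", stoplist: sampleStoplist)
// 20 of 100 characters are link text.
let sampleLinks = linkDensity(linkCharCount: 20, textLength: 100)
```

Both helpers return 0 for empty input rather than dividing by zero, which keeps empty blocks from tripping the thresholds.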
Requirements:

- Swift 6.2+
- macOS 13+ / iOS 16+
- Depends on SwiftSoup for HTML parsing
Algorithm and stopword lists by Jan Pomikálek. This is an independent Swift port.