A "soft" 404 page is a page that is served with a 200 status code but whose content says that the requested page is not available.
Note: you may want to check out https://github.com/dogancanbakir/soft-404, which supports newer library and Python versions and includes an independently re-trained model.
pip install soft404
The easiest way is to use the soft404.probability function:
>>> import soft404
>>> soft404.probability('<h1>Page not found</h1>')
0.9736860086882132
You can also create a classifier explicitly:
>>> from soft404 import Soft404Classifier
>>> clf = Soft404Classifier()
>>> clf.predict('<h1>Page not found</h1>')
0.9736860086882132
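The returned probability can be turned into a binary decision by applying a threshold. A minimal sketch of that pattern (the `is_soft404` helper, the 0.5 threshold, and the stub predictor are illustrative choices, not part of the library's API):

```python
def is_soft404(html, predict, threshold=0.5):
    """Return True if `predict` scores the page at or above `threshold`.

    `predict` is any callable mapping an HTML string to a probability,
    e.g. soft404.probability or Soft404Classifier().predict.
    """
    return predict(html) >= threshold

# Stub predictor used here so the sketch is self-contained; in practice,
# pass soft404.probability instead.
stub_predict = lambda html: 0.97 if "not found" in html.lower() else 0.02

print(is_soft404("<h1>Page not found</h1>", stub_predict))  # True
print(is_soft404("<h1>Welcome!</h1>", stub_predict))        # False
```

A suitable threshold depends on whether false positives (live pages flagged as soft 404s) or false negatives are more costly for your crawl.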
You can use the package from the command line:
# Predict from an HTML file
python -m soft404 page.html

# Predict from inline HTML
python -m soft404 --html '<h1>Page not found</h1>'

# Show help
python -m soft404 --help
To install with development dependencies:
pip install -e ".[dev]"
This installs additional tools needed for training and data processing, including console scripts:
soft404-train - Train a new classifier
soft404-convert - Convert HTML pages to text format for training
The classifier is trained on 120k pages from 25k domains, with a 404-page ratio of about 1/3. With 10-fold cross-validation, PR AUC (average precision) is 0.990 ± 0.003 and ROC AUC is 0.995 ± 0.002.
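ROC AUC can be read as the probability that a randomly chosen soft-404 page receives a higher score than a randomly chosen ordinary page. A self-contained sketch of that rank-based interpretation (the sample scores are made up for illustration; the reported numbers come from the actual cross-validation, not from this snippet):

```python
def roc_auc(pos_scores, neg_scores):
    """Fraction of positive/negative pairs where the positive outranks
    the negative, counting ties as half (the Mann-Whitney formulation
    of ROC AUC)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Made-up classifier scores: soft-404 pages vs. ordinary pages.
pos = [0.97, 0.91, 0.88]
neg = [0.05, 0.12, 0.90]
print(roc_auc(pos, neg))  # 8 of 9 pairs ranked correctly -> 8/9
```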
Install dev requirements:
pip install -e ".[dev]"
Run the crawler for a while (results will appear in the pages.jl.gz file):
cd crawler
scrapy crawl spider -o gzip:pages.jl -s JOBDIR=job
First, extract text and structure from the HTML:
soft404-convert pages.jl.gz items
This will produce two files, items.meta.jl.gz and items.items.jl.gz.
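Both outputs are gzipped JSON-lines files, which can be inspected with the standard library alone. A minimal sketch (the record fields written below are illustrative, not the actual keys produced by soft404-convert; inspect a real file to see those):

```python
import gzip
import json

def read_jl_gz(path):
    """Yield one decoded JSON object per non-empty line of a .jl.gz file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

# Write a tiny example file so the reader loop can be demonstrated.
with gzip.open("example.jl.gz", "wt", encoding="utf-8") as f:
    f.write(json.dumps({"url": "http://example.com/missing", "status": 404}) + "\n")

for record in read_jl_gz("example.jl.gz"):
    print(record["url"], record["status"])
```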
Next, train the classifier:
soft404-train items
The vectorizer takes a while to run, but its result is cached (the filename of the cache will be printed on the next run). If you are happy with the results, save the classifier:
soft404-train items --save soft404/clf.joblib
License is MIT.