soft404: a classifier for detecting soft 404 pages

A "soft" 404 page is a page that is served with a 200 status code, but whose content actually says that the requested content is not available.

Note: you may want to check out https://github.com/dogancanbakir/soft-404, which supports newer library and Python versions and includes an independently re-trained model.

Contents

Installation
Usage
- Command-line usage
Development
License

Installation

pip install soft404

Usage

The easiest way is to use the soft404.probability function:

>>> import soft404
>>> soft404.probability('<h1>Page not found</h1>')
0.9736860086882132

You can also create a classifier explicitly:

>>> from soft404 import Soft404Classifier
>>> clf = Soft404Classifier()
>>> clf.predict('<h1>Page not found</h1>')
0.9736860086882132
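
Because the result is a probability, a common next step is to compare it against a cutoff for pages that were served with a 200 status. The following is a minimal sketch and not part of the package: the helper name is_soft_404, its score parameter, and the 0.5 threshold are assumptions made for illustration, and the threshold should be tuned on your own data.

```python
from typing import Callable, Optional


def is_soft_404(status: int, html: str, threshold: float = 0.5,
                score: Optional[Callable[[str], float]] = None) -> bool:
    """Return True if a page served with HTTP 200 looks like a 404 page."""
    if status != 200:
        # A page with a real error status is not a *soft* 404.
        return False
    if score is None:
        # Imported lazily so the helper can also be used with a stub scorer.
        import soft404
        score = soft404.probability
    return score(html) >= threshold
```

Passing a custom score callable (a hypothetical hook, not a soft404 feature) makes it possible to reuse one Soft404Classifier instance via clf.predict, or to substitute a stub in tests.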

Command-line usage

You can use the package from the command line:

# Predict from an HTML file
python -m soft404 page.html

# Predict from inline HTML
python -m soft404 --html '<h1>Page not found</h1>'

# Show help
python -m soft404 --help
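
If you need to call the command line interface from another Python program, something like the sketch below may work. The function names build_cli_command and run_cli are hypothetical, and because the output format of the command is not described here, the raw standard output is returned without parsing.

```python
import subprocess
import sys
from typing import List, Optional


def build_cli_command(path: Optional[str] = None,
                      html: Optional[str] = None) -> List[str]:
    """Build the argv for `python -m soft404` from a file path or inline HTML."""
    if (path is None) == (html is None):
        raise ValueError("pass exactly one of path or html")
    args = [path] if path is not None else ["--html", html]
    # sys.executable keeps the call inside the current Python environment.
    return [sys.executable, "-m", "soft404"] + args


def run_cli(path: Optional[str] = None, html: Optional[str] = None) -> str:
    """Run the CLI and return its raw stdout; raises if the command fails."""
    result = subprocess.run(build_cli_command(path, html),
                            capture_output=True, text=True, check=True)
    return result.stdout
```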

Development

Installing for development

To install with development dependencies:

pip install -e ".[dev]"

This installs additional tools needed for training and data processing, including console scripts:

soft404-train - Train a new classifier
soft404-convert - Convert HTML pages to text format for training

Model Training

The classifier is trained on 120k pages from 25k domains, about one third of which are 404 pages. With 10-fold cross-validation, PR AUC (average precision) is 0.990 ± 0.003 and ROC AUC is 0.995 ± 0.002.
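
For readers unfamiliar with the metric, average precision can be sketched in a few lines of plain Python. This is only an illustration, not the evaluation code behind the numbers above: the function name is hypothetical, and the sketch assumes that no two pages share the same score.

```python
from typing import Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of the precision values at each rank where a positive (404) page occurs."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    # Rank pages from the highest predicted probability to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank  # precision among the top `rank` pages
    return total / n_pos
```

A perfect ranking, with every 404 page scored above every normal page, gives 1.0.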

Getting data for training

Install dev requirements:

pip install -e ".[dev]"

Run the crawler for a while (results will appear in pages.jl.gz file):

cd crawler
scrapy crawl spider -o gzip:pages.jl -s JOBDIR=job
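
To get a quick look at what the crawler collected, the output can be read back with the standard library. This sketch assumes, as the .jl.gz extension suggests, that the file holds gzip-compressed JSON lines; the helper name read_jsonlines_gz is hypothetical, and no particular field names are assumed.

```python
import gzip
import json
from typing import Dict, Iterator


def read_jsonlines_gz(path: str) -> Iterator[Dict]:
    """Yield one decoded JSON object per non-empty line of a gzipped .jl file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```

For example, sum(1 for _ in read_jsonlines_gz("pages.jl.gz")) counts the items crawled so far without loading the whole file into memory.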

Training

Note: Training requires development dependencies. Install with pip install -e ".[dev]"

First, extract text and structure from HTML:

soft404-convert pages.jl.gz items

This will produce two files, items.meta.jl.gz and items.items.jl.gz. Next, train the classifier:

soft404-train items

The vectorizer takes a while to run, but its result is cached (the name of the file where it is cached will be printed on the next run). If you are happy with the results, save the classifier:

soft404-train items --save soft404/clf.joblib
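
The steps above can also be scripted. The sketch below only assembles the commands and the file names described in this section; the helper names conversion_outputs and training_commands are hypothetical, and nothing is executed until a command is passed to something like subprocess.run.

```python
from pathlib import Path
from typing import List, Optional


def conversion_outputs(prefix: str) -> List[Path]:
    """Files that soft404-convert is described as producing for a given prefix."""
    return [Path(f"{prefix}.meta.jl.gz"), Path(f"{prefix}.items.jl.gz")]


def training_commands(pages: str, prefix: str,
                      save_to: Optional[str] = None) -> List[List[str]]:
    """Return the convert and train commands, in the order they should run."""
    train = ["soft404-train", prefix]
    if save_to is not None:
        # Only save the classifier when a target path is given.
        train += ["--save", save_to]
    return [["soft404-convert", pages, prefix], train]
```

Between the two commands, a check such as all(p.exists() for p in conversion_outputs("items")) can confirm that the conversion step produced both files before training starts.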

License

License is MIT.
