PubGuard is a lightweight, CPU-optimized document classifier that screens PDF text to determine whether it represents a genuine scientific publication. It rejects junk (flyers, invoices, non-scholarly PDFs), review articles, posters, and standalone abstracts before expensive downstream processing.
Runs in 3.3ms per document, no GPU needed.
| Head | Classes | Accuracy | What it detects |
|---|---|---|---|
| doc_type | 5 | 96.1% | scientific_paper · literature_review · poster · abstract_only · junk |
| ai_detect | 2 | 84.5% | human · ai_generated |
| toxicity | 2 | 84.2% | clean · toxic |
Each head is a single linear layer stored as a .npz file (5–12 KB). Inference is pure numpy — no torch needed.
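A head of this shape can be sketched in a few lines of numpy. The weight names `W` and `b` and the `.npz` layout below are illustrative, not PubGuard's actual schema:

```python
import io
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 512))  # hypothetical doc_type head: 5 classes x 512-dim embedding
b = np.zeros(5)

# A head round-trips through a tiny .npz bundle (in-memory here).
buf = io.BytesIO()
np.savez(buf, W=W, b=b)
buf.seek(0)
head = np.load(buf)

emb = rng.normal(size=512)  # stand-in for a model2vec embedding
probs = softmax(head["W"] @ emb + head["b"])
label = int(probs.argmax())
```

With weights this small, inference is a single matrix-vector product plus a softmax, which is why no torch is needed.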
Install from GitHub:

```bash
pip install git+https://github.com/jimnoneill/pubguard.git
```

With training dependencies:

```bash
pip install "pubguard[train] @ git+https://github.com/jimnoneill/pubguard.git"
```

Or install locally for development:

```bash
git clone https://github.com/jimnoneill/pubguard.git
cd pubguard
pip install -e ".[train]"
```

Screen raw text in a few lines:

```python
from pubguard import PubGuard

guard = PubGuard()
guard.initialize()

verdict = guard.screen("Introduction: We present a novel deep learning approach...")
print(verdict)
# {
#   'doc_type': {'label': 'scientific_paper', 'score': 0.994},
#   'ai_generated': {'label': 'human', 'score': 0.875},
#   'toxicity': {'label': 'clean', 'score': 0.999},
#   'pass': True
# }
```

To screen a PDF, extract its text first:

```python
import fitz  # PyMuPDF

doc = fitz.open("paper.pdf")
text = " ".join(page.get_text() for page in doc)
doc.close()

verdict = guard.screen(text[:8000])
if verdict["pass"]:
    print("Valid scientific publication — proceed with analysis")
else:
    print(f"Rejected: {verdict['doc_type']['label']}")
```

Batch screening:

```python
verdicts = guard.screen_batch(["text1", "text2", "text3"])
```

Only scientific_paper passes the gate. Everything else — review articles, posters, standalone abstracts, junk — is blocked. The PubVerse pipeline processes original research publications only.
Note: meta-analyses and systematic reviews are classified as scientific_paper (they contain original analysis). Only narrative/scoping reviews are classified as literature_review.
scientific_paper → ✅ PASS
literature_review → ❌ BLOCKED (review article, not original research)
poster → ❌ BLOCKED (classified, but not a publication)
abstract_only → ❌ BLOCKED
junk → ❌ BLOCKED
AI detection and toxicity are informational by default — reported but not blocking.
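The gate therefore reduces to a check on the doc_type label. A minimal sketch, assuming the verdict dict shape shown above (PubGuard's real pass logic may weigh more signals):

```python
def gate(verdict):
    """Only scientific_paper passes; AI detection and toxicity stay informational."""
    return verdict["doc_type"]["label"] == "scientific_paper"

review = {
    "doc_type": {"label": "literature_review", "score": 0.91},
    "ai_generated": {"label": "human", "score": 0.88},
    "toxicity": {"label": "clean", "score": 0.99},
}
paper = {
    "doc_type": {"label": "scientific_paper", "score": 0.99},
    "ai_generated": {"label": "human", "score": 0.92},
    "toxicity": {"label": "clean", "score": 0.99},
}
gate(review)  # False: reviews are blocked
gate(paper)   # True
```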
Drop into any bash pipeline:

```bash
# Extract text from PDF
PDF_TEXT=$(python3 -c "import fitz; d=fitz.open('$PDF'); print(' '.join(p.get_text() for p in d)[:8000])")

# Screen it
echo "$PDF_TEXT" | python3 scripts/pubguard_gate.py 2>/dev/null
if [ $? -ne 0 ]; then
    echo "REJECTED — not a valid scientific publication"
    exit 1
fi
```

To train your own heads, install the training dependencies first:

```bash
pip install -e ".[train]"
```

HuggingFace-only mode (no local PDFs needed):

```bash
python scripts/train_pubguard.py --data-dir ./pubguard_data --n-per-class 15000
```

Train on a real PDF corpus (adds the literature_review class from PubMed labels + OpenAlex):

```bash
python scripts/train_pubguard.py --pdf-corpus /path/to/pdf/corpus --data-dir ./pubguard_data
```

Reuse cached PubMed labels on subsequent runs:

```bash
python scripts/train_pubguard.py --pdf-corpus /path/to/pdf/corpus --skip-pubmed
```

Training embeds with model2vec and fits sklearn LogisticRegression heads; it completes in ~1 minute on CPU.
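Each head is an ordinary sklearn fit over precomputed embeddings. A toy sketch with synthetic vectors standing in for model2vec output (the two-class setup, dimensions, and variable names are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
dim, n_per_class = 512, 200

# Two synthetic "classes" of embeddings with shifted means.
X = np.vstack([
    rng.normal(0.0, 1.0, size=(n_per_class, dim)),
    rng.normal(0.5, 1.0, size=(n_per_class, dim)),
])
y = np.array([0] * n_per_class + [1] * n_per_class)

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Only the linear layer needs to ship: for a binary head,
# coef_ has shape (1, dim) and intercept_ has shape (1,).
W, b = clf.coef_, clf.intercept_
```

Because only `W` and `b` are persisted, the deployed classifier never needs sklearn at inference time.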
| Head | Sources |
|---|---|
| doc_type | Real PDF corpus (PubMed-labeled via NCBI E-utilities), OpenAlex reviews (type:review), armanc/scientific_papers, gfissore/arxiv-abstracts-2021, ag_news, real poster PDFs from posters.science corpus |
| ai_detect | liamdugan/raid, NicolaiSivesind/ChatGPT-Research-Abstracts |
| toxicity | google/civil_comments, skg/toxigen-data |
The scientific_paper and literature_review classes are trained on real PDF-extracted text with PubMed publication-type labels. The literature_review class is supplemented with review abstracts from OpenAlex. The poster class uses real scientific poster PDFs from the posters.science corpus via PosterSentry.
```
         ┌─────────────┐
         │  PDF text   │
         └──────┬──────┘
                │
   model2vec encode ──► emb ∈ R^512
                │
       ├─────────────────┬─────────────────┐
       ▼                 ▼                 ▼
┌───────────┐     ┌───────────┐     ┌───────────┐
│ doc_type  │     │ ai_detect │     │ toxicity  │
│ [emb+feat]│     │   [emb]   │     │   [emb]   │
│ →softmax5 │     │ →softmax2 │     │ →softmax2 │
└───────────┘     └───────────┘     └───────────┘
```
Same embedding backbone as the OpenAlex Topic Classifier — shares the cached model2vec weights.
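The fan-out above can be sketched in numpy: the text is encoded once, and each head reads the shared embedding, with doc_type additionally consuming a few structural features. The weights and feature values here are random placeholders, and the feature choice is an assumption for illustration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(1)
emb = rng.normal(size=512)         # shared model2vec encoding, computed once
feat = np.array([0.12, 3.0, 1.0])  # hypothetical structural features from text.py

heads = {
    "doc_type": (rng.normal(size=(5, 515)), np.concatenate([emb, feat])),  # [emb+feat] -> softmax5
    "ai_detect": (rng.normal(size=(2, 512)), emb),                         # [emb] -> softmax2
    "toxicity": (rng.normal(size=(2, 512)), emb),                          # [emb] -> softmax2
}

verdict = {name: softmax(W @ x) for name, (W, x) in heads.items()}
```

Encoding once and sharing `emb` across heads is what keeps per-document latency in the low milliseconds.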
```
pubguard/
├── src/pubguard/
│   ├── __init__.py        # PubGuard, PubGuardConfig exports
│   ├── classifier.py      # PubGuard class — screen(), screen_batch()
│   ├── config.py          # Configuration + model path resolution
│   ├── text.py            # Text cleaning + structural feature extraction
│   ├── train.py           # Training pipeline (sklearn LogisticRegression)
│   ├── data.py            # Dataset download + preparation
│   ├── errors.py          # PV-XXXX error code system
│   └── cli.py             # CLI interface
├── scripts/
│   ├── pubguard_gate.py   # Bash pipeline integration (exit 0/1)
│   └── train_pubguard.py  # Training entry point
├── ERRORS.md              # Error code reference guide
├── PubGuard.png           # Logo
└── pyproject.toml         # pip-installable package
```
| Resource | Link |
|---|---|
| Trained model | jimnoneill/pubguard-classifier |
| Training data | jimnoneill/pubguard-training-data |
MIT License — See LICENSE for details.
```bibtex
@software{pubguard_2026,
  title  = {PubGuard: Multi-Head Scientific Publication Gatekeeper},
  author = {O'Neill, James},
  year   = {2026},
  url    = {https://github.com/jimnoneill/pubguard}
}
```