🔍 SlangHunter

Automated Semantic Risk Detection for Trust & Safety Teams

From manual keyword blocklists to contextual legal-risk scoring — detecting fraud-indicative slang that basic filters miss.

⚡ See It In Action — 30 Seconds

git clone https://github.com/<your-username>/slanghunter.git
cd slanghunter && pip install -r requirements.txt
python demo.py

  📥 INPUT    → Rolex Submariner 1:1 replica, AAA quality... $175
  ⚙️ PROCESS  → 🔑 10 keywords · 🧬 5 patterns · 💲 ⚠ suspicious range
  🔴 VERDICT  → 🚫 BLOCKED · Score: 100% · 16 flags · Law: 18 U.S.C. § 2320

4 mock Mercari listings → 3 crime categories → explainable verdicts with legal citations.

The Problem

Trust & Safety (T&S) teams at online marketplaces face a losing battle:

| Current Approach | Why It Fails |
| --- | --- |
| Human moderators review listings one by one | Doesn't scale — millions of new listings per day |
| Basic keyword blocklists (`"cocaine"`, `"gun"`) | Scammers evade them in seconds: `"c0ca!ne"`, `"🔫"` |
| Regex filters on exact patterns | Every new evasion requires a manual rule update |

The result: fraudulent listings for drugs, counterfeit goods, and money laundering schemes hide in plain sight using character substitution (p3rcs), emoji encoding (🍃💨), deliberate misspelling (m0ney fl1p), and contextual misdirection ("flour" listed at $100/gram).

The Solution

SlangHunter is a prototype detection engine that goes beyond keyword matching:

                    ┌─────────────────────────────────────┐
   Plain keyword    │  "xanax" → blocked ✅               │
   filter           │  "x@n@x" → passes through ❌       │
                    │  "x 4 n 4 x" → passes through ❌   │
                    └─────────────────────────────────────┘

                    ┌─────────────────────────────────────┐
   SlangHunter      │  "xanax" → 🔴 CRITICAL ✅          │
   engine           │  "x@n@x" → 🔴 CRITICAL ✅          │
                    │  "x 4 n 4 x" → 🟡 WARNING ✅       │
                    │  + price context amplifies score    │
                    │  + legal statute citation included  │
                    └─────────────────────────────────────┘
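
The contrast in the diagram can be reproduced with a few lines of Python. This is a minimal sketch, not SlangHunter's actual code: `BLOCKLIST`, `naive_filter`, `XANAX_PATTERN`, and `pattern_filter` are hypothetical names, and the regex is only one plausible way to tolerate look-alike characters and spacing. It returns a plain boolean and does not model the CRITICAL/WARNING distinction shown above.

```python
import re

# Hypothetical one-word blocklist, standing in for a basic keyword filter.
BLOCKLIST = {"xanax"}


def naive_filter(text: str) -> bool:
    """Return True only if a blocklisted word appears verbatim."""
    return any(word in BLOCKLIST for word in text.lower().split())


# Hypothetical pattern: each vowel slot accepts common look-alikes (a, @, 4),
# and optional whitespace is allowed between characters.
XANAX_PATTERN = re.compile(r"x\s*[a@4]\s*n\s*[a@4]\s*x", re.IGNORECASE)


def pattern_filter(text: str) -> bool:
    """Return True if the substitution-tolerant pattern matches anywhere."""
    return XANAX_PATTERN.search(text) is not None


for sample in ("xanax", "x@n@x", "x 4 n 4 x"):
    print(f"{sample!r}: naive={naive_filter(sample)} pattern={pattern_filter(sample)}")
```

The naive filter only catches the first sample, while the pattern catches all three.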

Key Differentiators

Semantic Detection — Compiled regex patterns catch character substitution, emoji encoding, and deliberate spacing (p3rcs, m 0 l l y, ca$h app, 🍃💨🔌).
Contextual Price Analysis — Price is an amplifier, not a standalone signal. A bookshelf at $45 is fine; "kush" at $45 is suspicious. The engine requires textual evidence before price context can boost the score.
Explainable Verdicts — Every flag traces back to a specific U.S. federal statute. A compliance auditor can ask "why did you flag this?" and get a legal citation, not just a confidence number.
Data ≠ Logic — The knowledge base (crime categories, keywords, patterns, legal references) is a dictionary. The engine is the loop that reads it. If a law changes tomorrow, you update the dictionary and never rewrite the engine.

How It Works

Raw Listing ──▶ _normalize_text() ──▶ _scan_keywords() ──▶ _scan_patterns()
                                                                  │
                                                                  ▼
                                                     _check_price_context()
                                                                  │
                                                                  ▼
                                                      _calculate_score()
                                                                  │
                                                                  ▼
                                              ┌───────────────────────────────┐
                                              │  Verdict:                     │
                                              │   • risk_score (0.0 → 1.0)   │
                                              │   • flags[]                   │
                                              │   • reasoning (legal refs)    │
                                              │   • matched_categories[]      │
                                              └───────────────────────────────┘
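
As a rough illustration of the first pipeline stage and the shape of the result, here is a sketch that assumes normalization means "lowercase and collapse whitespace" (as described in the Architecture section) and that the verdict is a plain dictionary with the four keys shown in the diagram. The `Verdict` type, `normalize_text`, and `empty_verdict` are hypothetical stand-ins; in particular, the type of `reasoning` is assumed to be a string.

```python
import re
from typing import TypedDict


class Verdict(TypedDict):
    """Assumed shape of the verdict; field types are a guess from the diagram."""
    risk_score: float              # 0.0 to 1.0
    flags: list[str]               # e.g. "drugs:pat:p3rcs"
    reasoning: str                 # human-readable text with legal references
    matched_categories: list[str]


def normalize_text(text: str) -> str:
    """Lowercase the text and collapse any run of whitespace to one space."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def empty_verdict() -> Verdict:
    """Verdict for a listing in which no stage found anything."""
    return {"risk_score": 0.0, "flags": [], "reasoning": "", "matched_categories": []}


print(normalize_text("  Got  THEM\tp3rcs \n"))  # got them p3rcs
```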

Quick Start

# 1. Clone & install (30 seconds)
git clone https://github.com/<your-username>/slanghunter.git
cd slanghunter
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt

# 2. Run the demo — this is the star of the show ⭐
python demo.py

# 3. Run 90 tests across 15 test classes
pytest tests/ -v

Requirements: Python 3.10+ · Git

Usage Examples

Basic Analysis

from src.slanghunter import SlangHunter

hunter = SlangHunter()

# Analyze a suspicious listing
verdict = hunter.analyze(
    text="got them p3rcs 💊 real pharma hmu",
    price=30.00
)

# Illustrative output; exact values depend on which keywords and patterns match
print(verdict["risk_score"])          # e.g. 0.8
print(verdict["flags"])               # e.g. ['drugs:pat:p3rcs', 'drugs:price_context']
print(verdict["matched_categories"])  # ['drugs']
print(verdict["reasoning"])           # [DRUGS] ... Legal basis: 21 U.S.C. § 841

Traffic-Light Report

from src.slanghunter import SlangHunter

hunter = SlangHunter()

# Generate a human-readable report for ops / legal teams
report = hunter.generate_report(
    text="Jordan 1 Retro - 1:1 replica, comes in original box 🔥",
    price=65.00
)
print(report)

Output:

============================================================
  🔴  SLANGHUNTER VERDICT: CRITICAL
============================================================
  Listing : Jordan 1 Retro - 1:1 replica, comes in original box 🔥
  Price   : $65.00

  Risk Score : [██████████████████████████████] 100%
  Risk Level : 🔴  CRITICAL
  Action     : AUTOMATIC BLOCK — Escalate to Legal

  ┌─ FLAGS ────────────────────────────────────────────────
  │  ⚑  surikae:kw:1:1
  │  ⚑  surikae:kw:replica
  │  ⚑  surikae:kw:comes in original box
  │  ⚑  surikae:pat:1:1
  │  ⚑  surikae:pat:🔥
  │  ⚑  surikae:price_context
  └─────────────────────────────────────────────────────────

  ┌─ REASONING (Traceability) ─────────────────────────────
  │  [SURIKAE]
  │    Keywords matched: '1:1', 'replica', 'comes in original box'
  │    Slang patterns matched: '1:1', '🔥'
  │    Price falls within suspicious range.
  │    Legal basis: 18 U.S.C. § 2320 — Trafficking in Counterfeit Goods
  └─────────────────────────────────────────────────────────

  Categories : SURIKAE
============================================================

Clean Listing (No False Positive)

verdict = hunter.analyze(
    text="Vintage wooden bookshelf, great condition",
    price=45.00
)
print(verdict["risk_score"])  # 0.0  — price alone never triggers a flag

Architecture

slanghunter/
│
├── demo.py                    # 🌟 Live Mercari feed simulation (python demo.py)
├── src/
│   ├── __init__.py            # Package metadata & exports
│   ├── __main__.py            # CLI demo entry point (python -m src)
│   └── slanghunter.py         # Core engine + RiskLevel enum
│         │
│         ├── SlangHunter           # Main class
│         │   ├── __init__()        # Builds risk_database
│         │   ├── _normalize_text() # Lowercase + whitespace cleanup
│         │   ├── _scan_keywords()  # Word-boundary keyword matching
│         │   ├── _scan_patterns()  # Compiled regex pattern scanning
│         │   ├── _check_price_context()  # Suspicious price range check
│         │   ├── _calculate_score()      # Cumulative weighted scoring
│         │   ├── _build_reasoning()      # Legal citation builder
│         │   ├── analyze()         # Main API → verdict dict
│         │   ├── classify_risk()   # Score → RiskLevel enum
│         │   ├── generate_report() # Full human-readable report
│         │   └── print_report()    # Print + return verdict
│         │
│         └── RiskLevel(Enum)       # 🔴 CRITICAL / 🟡 WARNING / 🟢 SAFE
│
├── tests/
│   ├── __init__.py
│   └── test_slanghunter.py    # 90 tests across 15 test classes
│
├── .gitignore
├── LICENSE                    # MIT License
├── README.md                  # This file
└── requirements.txt           # flake8 + pytest
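
The tree describes `_scan_keywords()` as word-boundary keyword matching. One way this could look is sketched below; the function `scan_keywords` is a hypothetical standalone version, and the use of lookarounds instead of `\b` is an assumption made so that keywords such as `1:1`, which start and end with characters next to punctuation, still match cleanly.

```python
import re


def scan_keywords(text: str, keywords: list[str]) -> list[str]:
    """Return the keywords that occur in text as whole words or phrases."""
    hits = []
    for keyword in keywords:
        # (?<!\w) and (?!\w) reject matches embedded in a longer word,
        # so "lean" does not fire inside "clean".
        pattern = rf"(?<!\w){re.escape(keyword)}(?!\w)"
        if re.search(pattern, text):
            hits.append(keyword)
    return hits


print(scan_keywords("selling lean today", ["lean"]))         # ['lean']
print(scan_keywords("clean wooden bookshelf", ["lean"]))     # []
print(scan_keywords("1:1 replica", ["1:1", "replica"]))      # ['1:1', 'replica']
```

A plain substring test would flag "clean" for the keyword "lean"; the boundary check is what prevents that class of false positive.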

The Knowledge Base

Three crime categories, each with four dimensions:

| Category | Keywords | Regex Patterns | Price Threshold | Legal Basis |
| --- | --- | --- | --- | --- |
| Drugs | 35 | 8 compiled patterns | $0 – $80 | 21 U.S.C. § 841 |
| Money Laundering | 39 | 7 compiled patterns | $0 – $50 | 18 U.S.C. § 1956 |
| Surikae (すり替え) | 35 | 7 compiled patterns | $30 – $250 | 18 U.S.C. § 2320 |
| Total | 109 | 22 | — | — |

Surikae (すり替え) is the Japanese term for "bait-and-switch" — selling counterfeit or misrepresented goods under the guise of authentic products.
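
The price thresholds in the table can be read as per-category ranges that only matter once the text has already raised a flag. The sketch below encodes that reading. `PRICE_RANGES` and `price_in_suspicious_range` are hypothetical names, the category keys are guesses, and treating both bounds as inclusive is an assumption.

```python
# Ranges copied from the table above; keys and inclusive bounds are assumptions.
PRICE_RANGES = {
    "drugs": (0.0, 80.0),
    "money_laundering": (0.0, 50.0),
    "surikae": (30.0, 250.0),
}


def price_in_suspicious_range(category: str, price: float, has_text_evidence: bool) -> bool:
    """Price only counts when the text already matched this category."""
    if not has_text_evidence:
        return False
    low, high = PRICE_RANGES[category]
    return low <= price <= high


print(price_in_suspicious_range("drugs", 45.0, has_text_evidence=True))    # True
print(price_in_suspicious_range("drugs", 45.0, has_text_evidence=False))   # False
print(price_in_suspicious_range("surikae", 20.0, has_text_evidence=True))  # False
```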

Why a Dictionary, Not Hardcoded Logic?

# ❌ Fragile — logic and data are tangled
if "xanax" in text or "p3rc" in text:
    return "drugs"

# ✅ Maintainable — data drives the engine
self.risk_database = {
    "drugs": {
        "keywords": ["xanax", "percocet", ...],
        "slang_patterns": [re.compile(r"p[3e]rc[s0]?", re.IGNORECASE), ...],
        "risk_threshold": {"min": 0.0, "max": 80.0},
        "legal_reference": {"statute": "21 U.S.C. § 841", ...},
    }
}

If a statute such as 18 U.S.C. § 1956 is amended or renumbered tomorrow, you change one string in the dictionary. The engine never knows and never cares.
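
To make the separation concrete, the sketch below uses a heavily trimmed dictionary in the same shape and a generic loop that knows nothing about specific crimes. The function `matched_statutes` and the sample entries are hypothetical, and keyword matching is simplified to a substring test for brevity.

```python
import re

# Hypothetical, trimmed-down knowledge base in the shape shown above.
risk_database = {
    "drugs": {
        "keywords": ["xanax", "percocet"],
        "slang_patterns": [re.compile(r"p[3e]rc[s0]?", re.IGNORECASE)],
        "legal_reference": {"statute": "21 U.S.C. § 841"},
    },
}


def matched_statutes(text: str, database: dict) -> dict[str, str]:
    """Map each matching category to its statute by looping over the data."""
    lowered = text.lower()
    result = {}
    for category, entry in database.items():
        keyword_hit = any(kw in lowered for kw in entry["keywords"])
        pattern_hit = any(p.search(lowered) for p in entry["slang_patterns"])
        if keyword_hit or pattern_hit:
            result[category] = entry["legal_reference"]["statute"]
    return result


# Adding a category is a data change only; matched_statutes() is untouched.
risk_database["surikae"] = {
    "keywords": ["replica"],
    "slang_patterns": [re.compile(r"1\s*:\s*1")],
    "legal_reference": {"statute": "18 U.S.C. § 2320"},
}

print(matched_statutes("got them p3rcs, also a 1:1 replica", risk_database))
```

The same loop now reports both categories, even though it was written before the second one existed.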

Scoring System

| Signal | Weight | Example |
| --- | --- | --- |
| Each keyword match | +0.15 | `"lean"` found → +0.15 |
| Each regex pattern match | +0.25 | `"p3rcs"` via regex → +0.25 |
| Price in suspicious range | +0.20 | $25 + text evidence → +0.20 |
| Combo bonus (text + price) | +0.10 | Both present → extra +0.10 |

Score is clamped to [0.0, 1.0].
Price is an amplifier, not a detector — a $45 bookshelf scores 0.0.
Final score is the max across all categories (a listing is as risky as its most dangerous match).
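
The rules above can be sketched as a small scoring function. This is an interpretation of the table, not the engine's actual `_calculate_score()`: the names are hypothetical, and it assumes the combo bonus is applied whenever the price signal fires, since price only counts alongside text evidence.

```python
KEYWORD_WEIGHT = 0.15
PATTERN_WEIGHT = 0.25
PRICE_WEIGHT = 0.20
COMBO_BONUS = 0.10


def category_score(keyword_hits: int, pattern_hits: int, price_in_range: bool) -> float:
    """Cumulative weighted score for one category, clamped to [0.0, 1.0]."""
    if keyword_hits == 0 and pattern_hits == 0:
        return 0.0  # price alone never contributes
    score = keyword_hits * KEYWORD_WEIGHT + pattern_hits * PATTERN_WEIGHT
    if price_in_range:
        score += PRICE_WEIGHT + COMBO_BONUS
    return max(0.0, min(1.0, score))


def final_score(per_category: list[float]) -> float:
    """A listing is as risky as its most dangerous category."""
    return max(per_category, default=0.0)


# A $45 bookshelf: in a suspicious price range, but no text evidence.
print(category_score(0, 0, price_in_range=True))            # 0.0
# One pattern hit plus price context: 0.25 + 0.20 + 0.10.
print(round(category_score(0, 1, price_in_range=True), 2))  # 0.55
# Three keywords, two patterns, price context: 1.25 before clamping.
print(category_score(3, 2, price_in_range=True))            # 1.0
```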

Risk Levels

| Level | Threshold | Emoji | Action |
| --- | --- | --- | --- |
| CRITICAL | Score > 80% | 🔴 | Automatic block → Escalate to Legal |
| WARNING | 40% < Score ≤ 80% | 🟡 | Manual review → T&S analyst queue |
| SAFE | Score ≤ 40% | 🟢 | Approved → No action required |
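
Read literally, the thresholds map a score to a level as in the sketch below. This is a standalone illustration: the enum values are hypothetical, and the real `RiskLevel` and `classify_risk()` in `src/slanghunter.py` may differ in detail. Note that a score of exactly 0.80 falls into WARNING and exactly 0.40 into SAFE, because both comparisons are strict.

```python
from enum import Enum


class RiskLevel(Enum):
    # Values are illustrative; only the three level names come from the README.
    CRITICAL = "🔴"
    WARNING = "🟡"
    SAFE = "🟢"


def classify_risk(score: float) -> RiskLevel:
    """Map a risk score in [0.0, 1.0] to a traffic-light level."""
    if score > 0.80:
        return RiskLevel.CRITICAL
    if score > 0.40:
        return RiskLevel.WARNING
    return RiskLevel.SAFE


for score in (0.0, 0.40, 0.55, 0.80, 1.0):
    print(score, classify_risk(score).name)
```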

Demo Output

Run python demo.py to launch the full Mercari feed simulation. Each listing flows through three auditor-grade phases:

Phase 1 — 📥 INPUT (what the moderator sees)

  ┌─ 📥 INPUT ──────────────────────────────────────────────┐
  │  Title:      Herbal Supplement Pack 🍃💨                   │
  │  Seller:     rxplug_verified                              │
  │  Category:   Health & Wellness                            │
  │  Price:      $35.00                                       │
  │                                                           │
  │  Description:                                             │
  │    Premium p3rcs and lean combo 💊 real pharma scripts...   │
  └───────────────────────────────────────────────────────────┘

Phase 2 — ⚙️ PROCESSING (step-by-step engine pipeline)

  ├─ 🔤 Normalizing text ······· lowercase + collapse whitespace
  ├─ 🔑 Scanning 109 keywords ··························· 4 hits
  ├─ 🧬 Matching 22 regex patterns ······················ 3 hits
  ├─ 💲 Checking price context ($35.00) ·· ⚠ in suspicious range
  ├─ 📊 Calculating risk score ···························· 100%

Phase 3 — VERDICT (the decision)

  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    🔴  VERDICT: 🚫 BLOCKED
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    Risk Score : [██████████████████████████████] 100%
    Flags      : 8 indicators  ·  Categories: DRUGS
    📜 Legal basis: 21 U.S.C. § 841 — Controlled Substances Act
    👉 ACTION: AUTOMATIC BLOCK — Escalate to Legal

Dashboard Summary

| # | Case | Listing ID | Verdict | Categories |
| --- | --- | --- | --- | --- |
| 1 | Control — Legitimate product | MER-2026-00417 | 🟢 SAFE · ✅ APPROVED (0%) | — |
| 2 | Slang — Drug trafficking | MER-2026-01893 | 🔴 CRITICAL · 🚫 BLOCKED (100%) | DRUGS |
| 3 | Anomaly — Money laundering | MER-2026-03201 | 🔴 CRITICAL · 🚫 BLOCKED (100%) | MONEY_LAUNDERING |
| 4 | Fraud — Surikae counterfeit | MER-2026-05742 | 🔴 CRITICAL · 🚫 BLOCKED (100%) | SURIKAE |

Also available: python -m src for a quick 8-case CLI demo without the 3-phase animation.

Roadmap

Phase 1 — Project scaffolding & repository structure
Phase 2 — Knowledge base architecture (risk_database)
Phase 3 — Inference engine (normalize → scan → score → verdict)
Phase 4 — Report interface & traffic-light system
Phase 5 — Documentation, narrative & portfolio polish
Phase 5.5 — Live simulation demo (demo.py) & repo update
Phase 6 — REST API wrapper (FastAPI + Pydantic models)
Phase 7 — Batch processing & CSV/JSON ingestion
Phase 8 — Dashboard & analytics module

Contributing

This project is in prototype phase. Contributions, ideas, and feedback are welcome.

1. Fork the repository
2. Create a feature branch (git checkout -b feature/amazing-feature)
3. Commit your changes using Conventional Commits
4. Push to the branch (git push origin feature/amazing-feature)
5. Open a Pull Request

Legal Disclaimer

⚠️ This software is a prototype built for educational and demonstration purposes only.

SlangHunter is designed to showcase programmatic legal-risk analysis techniques and is not intended for production deployment without proper legal review, regulatory approval, and human oversight.

The crime categories, keywords, and legal references included are illustrative examples drawn from publicly available U.S. federal statutes. They do not constitute legal advice. The author assumes no liability for decisions made based on this tool's output.

If you're building something like this for real: hire a lawyer, not just an engineer. Better yet — hire a Legal Engineer who can do both. 😉

License

This project is licensed under the MIT License — see the LICENSE file for details.

SlangHunter — Built with 🧠 by a Legal Engineer who believes compliance can be automated.
109 keywords · 22 regex patterns · 3 crime categories · 90 tests · 0 linter warnings
