Automated Semantic Risk Detection for Trust & Safety Teams
From manual keyword blocklists to contextual legal-risk scoring: detecting fraud-indicative slang that basic filters miss.
git clone https://github.com/<your-username>/slanghunter.git
cd slanghunter && pip install -r requirements.txt
python demo.py

📥 INPUT → Rolex Submariner 1:1 replica, AAA quality... $175
⚙️ PROCESS → 10 keywords · 🧬 5 patterns · 💲 suspicious price range
🔴 VERDICT → 🚫 BLOCKED · Score: 100% · 16 flags · Law: 18 U.S.C. § 2320

4 mock Mercari listings → 3 crime categories → explainable verdicts with legal citations.
- See It In Action
- The Problem
- The Solution
- How It Works
- Quick Start
- Usage Examples
- Architecture
- Project Structure
- The Knowledge Base
- Scoring System
- Demo Output
- Roadmap
- Contributing
- Legal Disclaimer
- License
Trust & Safety (T&S) teams at online marketplaces face a losing battle:
| Current Approach | Why It Fails |
|---|---|
| Human moderators review listings one by one | Doesn't scale: millions of new listings per day |
| Basic keyword blocklists ("cocaine", "gun") | Scammers evade them in seconds: "c0ca!ne", "🔫" |
| Regex filters on exact patterns | Every new evasion requires a manual rule update |
The result: fraudulent listings for drugs, counterfeit goods, and money laundering schemes hide in plain sight using character substitution (p3rcs), emoji encoding (🍃💨), deliberate misspelling (m0ney fl1p), and contextual misdirection ("flour" listed at $100/gram).
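To make the evasion problem concrete, here is a minimal sketch of how a single word can be expanded into a regex that tolerates character substitution and deliberate spacing. The `SUBSTITUTIONS` table and the `build_evasion_pattern` helper are hypothetical names invented for this illustration; they are not taken from the SlangHunter source.

```python
import re

# Hypothetical substitution table: each letter maps to the characters
# scammers commonly swap in for it. Not SlangHunter's actual data.
SUBSTITUTIONS = {
    "a": "a@4",
    "e": "e3",
    "i": "i1!",
    "o": "o0",
    "s": "s$5",
}


def build_evasion_pattern(word: str) -> re.Pattern:
    """Compile a regex matching `word` despite substitutions and spacing."""
    parts = []
    for ch in word.lower():
        alternatives = SUBSTITUTIONS.get(ch, ch)
        # A character class covers every known stand-in for this letter.
        parts.append("[" + re.escape(alternatives) + "]")
    # Allow optional whitespace between letters ("x 4 n 4 x").
    body = r"\s*".join(parts)
    # Lookarounds keep the match from firing inside a longer word.
    return re.compile(r"(?<![a-z0-9])" + body + r"(?![a-z0-9])", re.IGNORECASE)


xanax = build_evasion_pattern("xanax")
print(bool(xanax.search("selling x@n@x cheap")))   # True
print(bool(xanax.search("x 4 n 4 x available")))   # True
print(bool(xanax.search("relaxanaxis")))           # False
```

One pattern per word replaces a growing list of hand-written variants, although a real deployment would still need review for false positives.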
SlangHunter is a prototype detection engine that goes beyond keyword matching:
| Input | Plain keyword filter | SlangHunter engine |
|---|---|---|
| "xanax" | blocked ✅ | 🔴 CRITICAL ✅ |
| "x@n@x" | passes through ❌ | 🔴 CRITICAL ✅ |
| "x 4 n 4 x" | passes through ❌ | 🟡 WARNING ✅ |
In addition, price context amplifies the SlangHunter score, and each verdict includes a legal statute citation.
- Semantic Detection: Compiled regex patterns catch character substitution, emoji encoding, and deliberate spacing (`p3rcs`, `m 0 l l y`, `ca$h app`, `🍃💨💊`).
- Contextual Price Analysis: Price is an amplifier, not a standalone signal. A bookshelf at $45 is fine; "kush" at $45 is suspicious. The engine requires textual evidence before price context can boost the score.
- Explainable Verdicts: Every flag traces back to a specific U.S. federal statute. A compliance auditor can ask "why did you flag this?" and get a legal citation, not just a confidence number.
- Data ≠ Logic: The knowledge base (crime categories, keywords, patterns, legal references) is a dictionary. The engine is the loop that reads it. If a law changes tomorrow, you update the dictionary; you never rewrite the engine.
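The explainability point can be illustrated with a short sketch that turns flags into an auditor-readable explanation. It assumes flags follow the `category:kind:value` shape that appears in the usage examples (`drugs:pat:p3rcs`, `surikae:price_context`). The `explain` function and the `LEGAL_BASIS` lookup are hypothetical names for this illustration, not the project's `_build_reasoning()` implementation.

```python
from collections import defaultdict

# Illustrative statute lookup using the citations listed in this README.
LEGAL_BASIS = {
    "drugs": "21 U.S.C. § 841",
    "money_laundering": "18 U.S.C. § 1956",
    "surikae": "18 U.S.C. § 2320",
}


def explain(flags: list[str]) -> str:
    """Group flags by category and attach the statute for each one."""
    grouped = defaultdict(lambda: {"kw": [], "pat": [], "price": False})
    for flag in flags:
        # maxsplit=2 keeps values that contain colons, such as "1:1", intact.
        category, kind, *rest = flag.split(":", 2)
        if kind == "price_context":
            grouped[category]["price"] = True
        elif kind in ("kw", "pat") and rest:
            grouped[category][kind].append(rest[0])

    lines = []
    for category, hits in grouped.items():
        lines.append(f"[{category.upper()}]")
        if hits["kw"]:
            lines.append("  Keywords matched: " + ", ".join(repr(k) for k in hits["kw"]))
        if hits["pat"]:
            lines.append("  Slang patterns matched: " + ", ".join(repr(p) for p in hits["pat"]))
        if hits["price"]:
            lines.append("  Price falls within suspicious range.")
        lines.append("  Legal basis: " + LEGAL_BASIS.get(category, "not on file"))
    return "\n".join(lines)


print(explain(["surikae:kw:1:1", "surikae:kw:replica", "surikae:price_context"]))
```

Because the explanation is derived from the flags alone, every line an auditor reads maps back to a concrete match.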
Raw listing → `_normalize_text()` → `_scan_keywords()` → `_scan_patterns()` → `_check_price_context()` → `_calculate_score()` → verdict

The verdict contains:
- `risk_score` (0.0 to 1.0)
- `flags[]`
- `reasoning` (legal refs)
- `matched_categories[]`
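The first three stages of this flow can be sketched in a few lines. This is a simplified illustration, not the project's code: `RISK_DB` is a toy knowledge base, and the function names only mirror the stage names in the diagram.

```python
import re

# Toy knowledge base with hypothetical entries, far smaller than the real one.
RISK_DB = {
    "drugs": {
        "keywords": ["lean", "xanax"],
        "slang_patterns": [re.compile(r"p[3e]rc[s0]?", re.IGNORECASE)],
    },
}


def normalize_text(text: str) -> str:
    """Lowercase and collapse runs of whitespace into single spaces."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def scan_keywords(text: str, keywords: list[str]) -> list[str]:
    """Return keywords that appear as whole words, not as substrings."""
    return [
        kw for kw in keywords
        if re.search(r"(?<!\w)" + re.escape(kw) + r"(?!\w)", text)
    ]


def scan_patterns(text: str, patterns: list[re.Pattern]) -> list[str]:
    """Return the text fragment matched by each pattern that fires."""
    hits = []
    for pattern in patterns:
        match = pattern.search(text)
        if match:
            hits.append(match.group(0))
    return hits


def collect_flags(raw_text: str) -> list[str]:
    """Run normalize -> keyword scan -> pattern scan for every category."""
    text = normalize_text(raw_text)
    flags = []
    for category, entry in RISK_DB.items():
        flags += [f"{category}:kw:{kw}" for kw in scan_keywords(text, entry["keywords"])]
        flags += [f"{category}:pat:{hit}" for hit in scan_patterns(text, entry["slang_patterns"])]
    return flags


print(collect_flags("Got  them P3RCS   and lean"))
# ['drugs:kw:lean', 'drugs:pat:p3rcs']
print(collect_flags("Clean leather couch"))
# []
```

The word-boundary check is what keeps "lean" from firing inside "Clean".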
# 1. Clone & install (30 seconds)
git clone https://github.com/<your-username>/slanghunter.git
cd slanghunter
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt
# 2. Run the demo (the star of the show)
python demo.py
# 3. Run 90 tests across 15 test classes
pytest tests/ -v

Requirements: Python 3.10+ · Git
from src.slanghunter import SlangHunter

hunter = SlangHunter()

# Analyze a suspicious listing
verdict = hunter.analyze(
    text="got them p3rcs 💊 real pharma hmu",
    price=30.00
)
print(verdict["risk_score"])          # 0.8
print(verdict["flags"])               # ['drugs:pat:p3rcs', 'drugs:price_context']
print(verdict["matched_categories"])  # ['drugs']
print(verdict["reasoning"])           # [DRUGS] ... Legal basis: 21 U.S.C. § 841

from src.slanghunter import SlangHunter

hunter = SlangHunter()

# Generate a human-readable report for ops / legal teams
report = hunter.generate_report(
    text="Jordan 1 Retro - 1:1 replica, comes in original box 🔥",
    price=65.00
)
print(report)

Output:
============================================================
🔴 SLANGHUNTER VERDICT: CRITICAL
============================================================
Listing    : Jordan 1 Retro - 1:1 replica, comes in original box 🔥
Price      : $65.00
Risk Score : [██████████████████████████████] 100%
Risk Level : 🔴 CRITICAL
Action     : AUTOMATIC BLOCK - Escalate to Legal
┌─ FLAGS ─────────────────────────────────────────────────
│ surikae:kw:1:1
│ surikae:kw:replica
│ surikae:kw:comes in original box
│ surikae:pat:1:1
│ surikae:pat:🔥
│ surikae:price_context
└─────────────────────────────────────────────────────────
┌─ REASONING (Traceability) ──────────────────────────────
│ [SURIKAE]
│   Keywords matched: '1:1', 'replica', 'comes in original box'
│   Slang patterns matched: '1:1', '🔥'
│   Price falls within suspicious range.
│   Legal basis: 18 U.S.C. § 2320 - Trafficking in Counterfeit Goods
└─────────────────────────────────────────────────────────
Categories : SURIKAE
============================================================
verdict = hunter.analyze(
text="Vintage wooden bookshelf, great condition",
price=45.00
)
print(verdict["risk_score"])  # 0.0, because price alone never triggers a flag

slanghunter/
│
├── demo.py                          # Live Mercari feed simulation (python demo.py)
├── src/
│   ├── __init__.py                  # Package metadata & exports
│   ├── __main__.py                  # CLI demo entry point (python -m src)
│   └── slanghunter.py               # Core engine + RiskLevel enum
│       │
│       ├── SlangHunter              # Main class
│       │   ├── __init__()               # Builds risk_database
│       │   ├── _normalize_text()        # Lowercase + whitespace cleanup
│       │   ├── _scan_keywords()         # Word-boundary keyword matching
│       │   ├── _scan_patterns()         # Compiled regex pattern scanning
│       │   ├── _check_price_context()   # Suspicious price range check
│       │   ├── _calculate_score()       # Cumulative weighted scoring
│       │   ├── _build_reasoning()       # Legal citation builder
│       │   ├── analyze()                # Main API, returns the verdict dict
│       │   ├── classify_risk()          # Score to RiskLevel enum
│       │   ├── generate_report()        # Full human-readable report
│       │   └── print_report()           # Print + return verdict
│       │
│       └── RiskLevel(Enum)          # 🔴 CRITICAL / 🟡 WARNING / 🟢 SAFE
│
├── tests/
│   ├── __init__.py
│   └── test_slanghunter.py          # 90 tests across 15 test classes
│
├── .gitignore
├── LICENSE                          # MIT License
├── README.md                        # This file
└── requirements.txt                 # flake8 + pytest
Three crime categories, each with four dimensions:
| Category | Keywords | Regex Patterns | Price Threshold | Legal Basis |
|---|---|---|---|---|
| Drugs | 35 | 8 compiled patterns | $0 to $80 | 21 U.S.C. § 841 |
| Money Laundering | 39 | 7 compiled patterns | $0 to $50 | 18 U.S.C. § 1956 |
| Surikae (すり替え) | 35 | 7 compiled patterns | $30 to $250 | 18 U.S.C. § 2320 |
| Total | 109 | 22 | n/a | n/a |
Surikae (すり替え) is the Japanese term for "bait-and-switch": selling counterfeit or misrepresented goods under the guise of authentic products.
# ❌ Fragile: logic and data are tangled
if "xanax" in text or "p3rc" in text:
    return "drugs"

# ✅ Maintainable: data drives the engine
self.risk_database = {
    "drugs": {
        "keywords": ["xanax", "percocet", ...],
        "slang_patterns": [re.compile(r"p[3e]rc[s0]?", re.IGNORECASE), ...],
        "risk_threshold": {"min": 0.0, "max": 80.0},
        "legal_reference": {"statute": "21 U.S.C. § 841", ...},
    }
}

If Mexico updates its money-laundering statute tomorrow, you change one string in the dictionary. The engine never knows and never cares.
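A small sketch makes the data-versus-logic separation tangible: the function below never changes, yet its output follows whatever the dictionary says. `BASE_DB`, `cite`, and the `EXAMPLE-STATUTE-*` strings are placeholders invented for this illustration. The sketch uses plain substring matching for keywords, which is simpler than the word-boundary matching the project describes.

```python
import copy
import re

# Hypothetical miniature of the dictionary shown above.
BASE_DB = {
    "drugs": {
        "keywords": ["xanax", "percocet"],
        "slang_patterns": [re.compile(r"p[3e]rc[s0]?", re.IGNORECASE)],
        "risk_threshold": {"min": 0.0, "max": 80.0},
        "legal_reference": {"statute": "21 U.S.C. § 841"},
    }
}


def cite(text: str, db: dict) -> list[str]:
    """Return the statute of every category whose data matches the text."""
    lowered = text.lower()
    citations = []
    for entry in db.values():
        keyword_hit = any(kw in lowered for kw in entry["keywords"])
        pattern_hit = any(p.search(lowered) for p in entry["slang_patterns"])
        if keyword_hit or pattern_hit:
            citations.append(entry["legal_reference"]["statute"])
    return citations


print(cite("p3rcs available", BASE_DB))   # ['21 U.S.C. § 841']

# A data-only change: swap the citation string and add a made-up category.
updated = copy.deepcopy(BASE_DB)
updated["drugs"]["legal_reference"]["statute"] = "EXAMPLE-STATUTE-1"
updated["example_category"] = {
    "keywords": ["example-term"],
    "slang_patterns": [],
    "risk_threshold": {"min": 0.0, "max": 10.0},
    "legal_reference": {"statute": "EXAMPLE-STATUTE-2"},
}
print(cite("p3rcs and example-term", updated))
# ['EXAMPLE-STATUTE-1', 'EXAMPLE-STATUTE-2']
```

Both the new citation and the new category took effect without touching `cite`.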
| Signal | Weight | Example |
|---|---|---|
| Each keyword match | +0.15 | "lean" found → +0.15 |
| Each regex pattern match | +0.25 | "p3rcs" via regex → +0.25 |
| Price in suspicious range | +0.20 | $25 + text evidence → +0.20 |
| Combo bonus (text + price) | +0.10 | Both present → extra +0.10 |
- Score is clamped to [0.0, 1.0].
- Price is an amplifier, not a detector: a $45 bookshelf scores 0.0.
- Final score is the max across all categories (a listing is as risky as its most dangerous match).
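The weights and rules above can be worked through in code. This sketch is one reading of the table, not the project's `_calculate_score()`: it assumes the price weight and the combo bonus are both added whenever text evidence and a suspicious price coincide. The function and constant names are hypothetical.

```python
KEYWORD_WEIGHT = 0.15
PATTERN_WEIGHT = 0.25
PRICE_WEIGHT = 0.20
COMBO_BONUS = 0.10


def category_score(keyword_hits: int, pattern_hits: int, price_in_range: bool) -> float:
    """Score one category from its hit counts and the price check."""
    text_evidence = (keyword_hits + pattern_hits) > 0
    score = keyword_hits * KEYWORD_WEIGHT + pattern_hits * PATTERN_WEIGHT
    # Price only counts when the text already looks suspicious.
    if text_evidence and price_in_range:
        score += PRICE_WEIGHT + COMBO_BONUS
    # Clamp to [0.0, 1.0]; rounding hides floating-point noise.
    return round(min(max(score, 0.0), 1.0), 2)


def listing_score(per_category: dict[str, tuple[int, int, bool]]) -> float:
    """A listing is as risky as its most dangerous category."""
    return max((category_score(*hits) for hits in per_category.values()), default=0.0)


print(category_score(0, 0, True))    # 0.0  (bookshelf: price alone)
print(category_score(0, 1, True))    # 0.55 (one pattern + price + combo)
print(category_score(4, 3, True))    # 1.0  (1.65 before clamping)
print(listing_score({"drugs": (1, 0, False), "surikae": (0, 1, True)}))  # 0.55
```

Gating the price term on text evidence is what keeps an ordinary $45 listing at zero.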
| Level | Threshold | Emoji | Action |
|---|---|---|---|
| CRITICAL | Score > 80% | 🔴 | Automatic block, escalate to Legal |
| WARNING | Score > 40% | 🟡 | Manual review in the T&S analyst queue |
| SAFE | Score ≤ 40% | 🟢 | Approved, no action required |
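The threshold table maps directly onto a small classifier. The sketch below reuses the `RiskLevel` and `classify_risk` names listed in the project structure, but the bodies are an illustrative reconstruction from the table, not the project's code. Both cut-offs are strict, so a score of exactly 0.80 is a WARNING.

```python
from enum import Enum


class RiskLevel(Enum):
    CRITICAL = "CRITICAL"
    WARNING = "WARNING"
    SAFE = "SAFE"


# Action text paraphrased from the table above.
ACTIONS = {
    RiskLevel.CRITICAL: "Automatic block, escalate to Legal",
    RiskLevel.WARNING: "Manual review in the T&S analyst queue",
    RiskLevel.SAFE: "Approved, no action required",
}


def classify_risk(score: float) -> RiskLevel:
    """Map a 0.0-1.0 risk score onto the traffic-light levels."""
    if score > 0.80:
        return RiskLevel.CRITICAL
    if score > 0.40:
        return RiskLevel.WARNING
    return RiskLevel.SAFE


for score in (1.0, 0.80, 0.55, 0.40, 0.0):
    level = classify_risk(score)
    print(f"{score:.2f} -> {level.value}: {ACTIONS[level]}")
```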
Run `python demo.py` to launch the full Mercari feed simulation. Each listing flows through three auditor-grade phases:
┌─ 📥 INPUT ──────────────────────────────────────────────
│ Title: Herbal Supplement Pack 🍃💨
│ Seller: rxplug_verified
│ Category: Health & Wellness
│ Price: $35.00
│
│ Description:
│ Premium p3rcs and lean combo 💊 real pharma scripts...
└─────────────────────────────────────────────────────────
├─ 🔤 Normalizing text ······· lowercase + collapse whitespace
├─ Scanning 109 keywords ··························· 4 hits
├─ 🧬 Matching 22 regex patterns ···················· 3 hits
├─ 💲 Checking price context ($35.00) ·· in suspicious range
└─ Calculating risk score ···························· 100%
════════════════════════════════════════════════════════════
🔴 VERDICT: 🚫 BLOCKED
════════════════════════════════════════════════════════════
Risk Score : [██████████████████████████████] 100%
Flags : 8 indicators · Categories: DRUGS
Legal basis: 21 U.S.C. § 841 - Controlled Substances Act
ACTION: AUTOMATIC BLOCK - Escalate to Legal
| # | Case | Listing ID | Verdict | Categories |
|---|---|---|---|---|
| 1 | Control: legitimate product | MER-2026-00417 | 🟢 SAFE · ✅ APPROVED (0%) | n/a |
| 2 | Slang: drug trafficking | MER-2026-01893 | 🔴 CRITICAL · 🚫 BLOCKED (100%) | DRUGS |
| 3 | Anomaly: money laundering | MER-2026-03201 | 🔴 CRITICAL · 🚫 BLOCKED (100%) | MONEY_LAUNDERING |
| 4 | Fraud: surikae counterfeit | MER-2026-05742 | 🔴 CRITICAL · 🚫 BLOCKED (100%) | SURIKAE |
Also available: `python -m src` for a quick 8-case CLI demo without the 3-phase animation.
- Phase 1: Project scaffolding & repository structure
- Phase 2: Knowledge base architecture (`risk_database`)
- Phase 3: Inference engine (normalize → scan → score → verdict)
- Phase 4: Report interface & traffic-light system
- Phase 5: Documentation, narrative & portfolio polish
- Phase 5.5: Live simulation demo (`demo.py`) & repo update
- Phase 6: REST API wrapper (FastAPI + Pydantic models)
- Phase 7: Batch processing & CSV/JSON ingestion
- Phase 8: Dashboard & analytics module
This project is in prototype phase. Contributions, ideas, and feedback are welcome.
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes using Conventional Commits
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
⚠️ This software is a prototype built for educational and demonstration purposes only. SlangHunter is designed to showcase programmatic legal-risk analysis techniques and is not intended for production deployment without proper legal review, regulatory approval, and human oversight.
The crime categories, keywords, and legal references included are illustrative examples drawn from publicly available U.S. federal statutes. They do not constitute legal advice. The author assumes no liability for decisions made based on this tool's output.
If you're building something like this for real: hire a lawyer, not just an engineer. Better yet, hire a Legal Engineer who can do both.
This project is licensed under the MIT License; see the LICENSE file for details.
SlangHunter: Built with 🧠 by a Legal Engineer who believes compliance can be automated.
109 keywords · 22 regex patterns · 3 crime categories · 90 tests · 0 linter warnings