L2santos29/slanghunter

🔍 SlangHunter


Automated Semantic Risk Detection for Trust & Safety Teams

From manual keyword blocklists to contextual legal-risk scoring: detecting fraud-indicative slang that basic filters miss.


⚡ See It In Action: 30 Seconds

git clone https://github.com/<your-username>/slanghunter.git
cd slanghunter && pip install -r requirements.txt
python demo.py
  📥 INPUT    → Rolex Submariner 1:1 replica, AAA quality... $175
  ⚙️ PROCESS  → 🔑 10 keywords · 🧬 5 patterns · 💲 ⚠ suspicious range
  🔴 VERDICT  → 🚫 BLOCKED · Score: 100% · 16 flags · Law: 18 U.S.C. § 2320

4 mock Mercari listings → 3 crime categories → explainable verdicts with legal citations.



The Problem

Trust & Safety (T&S) teams at online marketplaces face a losing battle:

| Current Approach | Why It Fails |
| --- | --- |
| Human moderators review listings one by one | Doesn't scale: millions of new listings per day |
| Basic keyword blocklists ("cocaine", "gun") | Scammers evade them in seconds: "c0ca!ne", "🔫" |
| Regex filters on exact patterns | Every new evasion requires a manual rule update |

The result: fraudulent listings for drugs, counterfeit goods, and money laundering schemes hide in plain sight using character substitution (p3rcs), emoji encoding (🍃💨), deliberate misspelling (m0ney fl1p), and contextual misdirection ("flour" listed at $100/gram).

The Solution

SlangHunter is a prototype detection engine that goes beyond keyword matching:

                    ┌─────────────────────────────────────┐
   Plain keyword    │  "xanax" → blocked ✅               │
   filter           │  "x@n@x" → passes through ❌        │
                    │  "x 4 n 4 x" → passes through ❌    │
                    └─────────────────────────────────────┘

                    ┌─────────────────────────────────────┐
   SlangHunter      │  "xanax" → 🔴 CRITICAL ✅           │
   engine           │  "x@n@x" → 🔴 CRITICAL ✅           │
                    │  "x 4 n 4 x" → 🟡 WARNING ✅        │
                    │  + price context amplifies score    │
                    │  + legal statute citation included  │
                    └─────────────────────────────────────┘

Key Differentiators

  1. Semantic Detection: Compiled regex patterns catch character substitution, emoji encoding, and deliberate spacing (p3rcs, m 0 l l y, ca$h app, 🍃💨🔌).

  2. Contextual Price Analysis: Price is an amplifier, not a standalone signal. A bookshelf at $45 is fine; "kush" at $45 is suspicious. The engine requires textual evidence before price context can boost the score.

  3. Explainable Verdicts: Every flag traces back to a specific U.S. federal statute. A compliance auditor can ask "why did you flag this?" and get a legal citation, not just a confidence number.

  4. Data ≠ Logic: The knowledge base (crime categories, keywords, patterns, legal references) is a dictionary. The engine is the loop that reads it. If a law changes tomorrow, you update the dictionary and never rewrite the engine.
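
To make the first differentiator concrete, here is a minimal sketch of how a single compiled pattern can tolerate character substitution and deliberate spacing. The helper `leet_pattern` and the `SUBSTITUTIONS` table are hypothetical names invented for this illustration; they are not SlangHunter's actual patterns.

```python
import re

# Hypothetical substitution table: each letter maps to the characters
# commonly swapped in for it. Not SlangHunter's actual rule set.
SUBSTITUTIONS = {
    "a": "a@4",
    "e": "e3",
    "i": "i1!",
    "o": "o0",
    "s": "s$5",
}


def leet_pattern(word: str) -> re.Pattern:
    """Compile a regex matching `word` with substitutions and inner spacing."""
    parts = []
    for char in word.lower():
        options = SUBSTITUTIONS.get(char, char)
        # One character class per letter, e.g. "a" becomes [a@4].
        parts.append("[" + re.escape(options) + "]")
    # Optional whitespace between letters catches "x 4 n 4 x".
    return re.compile(r"\s*".join(parts), re.IGNORECASE)


xanax = leet_pattern("xanax")
for sample in ["xanax", "x@n@x", "x 4 n 4 x", "wooden bookshelf"]:
    print(f"{sample!r}: {bool(xanax.search(sample))}")
```

This sketch has no word-boundary guard, so a real pattern set would need one to avoid matching inside longer, unrelated words.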

How It Works

Raw Listing ──▶ _normalize_text() ──▶ _scan_keywords() ──▶ _scan_patterns()
                                                                  │
                                                                  ▼
                                                     _check_price_context()
                                                                  │
                                                                  ▼
                                                      _calculate_score()
                                                                  │
                                                                  ▼
                                              ┌───────────────────────────────┐
                                              │  Verdict:                     │
                                              │   • risk_score (0.0 → 1.0)    │
                                              │   • flags[]                   │
                                              │   • reasoning (legal refs)    │
                                              │   • matched_categories[]      │
                                              └───────────────────────────────┘
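
The first two stages of the pipeline above can be sketched as follows. This is an illustration under assumptions, not the engine's real code: the function names drop the leading underscore to mark them as stand-ins, and the lookaround-based boundary check is one possible way to do word-boundary matching that still works for keywords such as "1:1".

```python
import re


def normalize_text(text: str) -> str:
    """Lowercase the text and collapse runs of whitespace into one space."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def scan_keywords(text: str, keywords: list[str]) -> list[str]:
    """Return the keywords that appear as whole words or phrases in `text`."""
    hits = []
    for keyword in keywords:
        # Lookarounds instead of \b so keywords that start or end with
        # punctuation or digits (for example "1:1") are still bounded.
        pattern = r"(?<!\w)" + re.escape(keyword) + r"(?!\w)"
        if re.search(pattern, text):
            hits.append(keyword)
    return hits


normalized = normalize_text("Jordan 1 Retro -   1:1 REPLICA\n")
print(normalized)
print(scan_keywords(normalized, ["replica", "1:1", "lean"]))
print(scan_keywords("clean shoes", ["lean"]))
```

The last call shows why the boundary check matters: "lean" inside "clean" is not reported as a hit.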

Quick Start

# 1. Clone & install (30 seconds)
git clone https://github.com/<your-username>/slanghunter.git
cd slanghunter
python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt

# 2. Run the demo (this is the star of the show ⭐)
python demo.py

# 3. Run 90 tests across 15 test classes
pytest tests/ -v

Requirements: Python 3.10+ · Git

Usage Examples

Basic Analysis

from src.slanghunter import SlangHunter

hunter = SlangHunter()

# Analyze a suspicious listing
verdict = hunter.analyze(
    text="got them p3rcs 💊 real pharma hmu",
    price=30.00
)

# Values in the comments are illustrative; exact output depends on the knowledge base
print(verdict["risk_score"])          # e.g. 0.8
print(verdict["flags"])               # e.g. ['drugs:pat:p3rcs', 'drugs:price_context', ...]
print(verdict["matched_categories"])  # ['drugs']
print(verdict["reasoning"])           # [DRUGS] ... Legal basis: 21 U.S.C. § 841

Traffic-Light Report

from src.slanghunter import SlangHunter

hunter = SlangHunter()

# Generate a human-readable report for ops / legal teams
report = hunter.generate_report(
    text="Jordan 1 Retro - 1:1 replica, comes in original box 🔥",
    price=65.00
)
print(report)

Output:

============================================================
  🔴  SLANGHUNTER VERDICT: CRITICAL
============================================================
  Listing : Jordan 1 Retro - 1:1 replica, comes in original box 🔥
  Price   : $65.00

  Risk Score : [██████████████████████████████] 100%
  Risk Level : 🔴  CRITICAL
  Action     : AUTOMATIC BLOCK - Escalate to Legal

  ┌─ FLAGS ────────────────────────────────────────────────
  │  ⚑  surikae:kw:1:1
  │  ⚑  surikae:kw:replica
  │  ⚑  surikae:kw:comes in original box
  │  ⚑  surikae:pat:1:1
  │  ⚑  surikae:pat:🔥
  │  ⚑  surikae:price_context
  └─────────────────────────────────────────────────────────

  ┌─ REASONING (Traceability) ─────────────────────────────
  │  [SURIKAE]
  │    Keywords matched: '1:1', 'replica', 'comes in original box'
  │    Slang patterns matched: '1:1', '🔥'
  │    Price falls within suspicious range.
  │    Legal basis: 18 U.S.C. § 2320 - Trafficking in Counterfeit Goods
  └─────────────────────────────────────────────────────────

  Categories : SURIKAE
============================================================

Clean Listing (No False Positive)

verdict = hunter.analyze(
    text="Vintage wooden bookshelf, great condition",
    price=45.00
)
print(verdict["risk_score"])  # 0.0: price alone never triggers a flag

Architecture

slanghunter/
│
├── demo.py                    # 🌟 Live Mercari feed simulation (python demo.py)
├── src/
│   ├── __init__.py            # Package metadata & exports
│   ├── __main__.py            # CLI demo entry point (python -m src)
│   └── slanghunter.py         # Core engine + RiskLevel enum
│         │
│         ├── SlangHunter           # Main class
│         │   ├── __init__()        # Builds risk_database
│         │   ├── _normalize_text() # Lowercase + whitespace cleanup
│         │   ├── _scan_keywords()  # Word-boundary keyword matching
│         │   ├── _scan_patterns()  # Compiled regex pattern scanning
│         │   ├── _check_price_context()  # Suspicious price range check
│         │   ├── _calculate_score()      # Cumulative weighted scoring
│         │   ├── _build_reasoning()      # Legal citation builder
│         │   ├── analyze()         # Main API → verdict dict
│         │   ├── classify_risk()   # Score → RiskLevel enum
│         │   ├── generate_report() # Full human-readable report
│         │   └── print_report()    # Print + return verdict
│         │
│         └── RiskLevel(Enum)       # 🔴 CRITICAL / 🟡 WARNING / 🟢 SAFE
│
├── tests/
│   ├── __init__.py
│   └── test_slanghunter.py    # 90 tests across 15 test classes
│
├── .gitignore
├── LICENSE                    # MIT License
├── README.md                  # This file
└── requirements.txt           # flake8 + pytest

The Knowledge Base

Three crime categories, each with four dimensions:

| Category | Keywords | Regex Patterns | Price Threshold | Legal Basis |
| --- | --- | --- | --- | --- |
| Drugs | 35 | 8 compiled patterns | $0 – $80 | 21 U.S.C. § 841 |
| Money Laundering | 39 | 7 compiled patterns | $0 – $50 | 18 U.S.C. § 1956 |
| Surikae (すり替え) | 35 | 7 compiled patterns | $30 – $250 | 18 U.S.C. § 2320 |
| Total | 109 | 22 | - | - |

Surikae (すり替え) is the Japanese term for "bait-and-switch": selling counterfeit or misrepresented goods under the guise of authentic products.

Why a Dictionary, Not Hardcoded Logic?

# ❌ Fragile: logic and data are tangled
if "xanax" in text or "p3rc" in text:
    return "drugs"

# ✅ Maintainable: data drives the engine
self.risk_database = {
    "drugs": {
        "keywords": ["xanax", "percocet", ...],
        "slang_patterns": [re.compile(r"p[3e]rc[s0]?", re.IGNORECASE), ...],
        "risk_threshold": {"min": 0.0, "max": 80.0},
        "legal_reference": {"statute": "21 U.S.C. § 841", ...},
    }
}

If a money-laundering statute is amended tomorrow, you change one string in the dictionary. The engine never knows and never cares.
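
As a rough illustration of that separation, the sketch below runs one generic loop over a toy database in the same shape. `RISK_DATABASE` and `scan` are hypothetical names and the entries are illustrative only; the flag format mirrors the `category:kw:...` and `category:pat:...` strings shown in the report output.

```python
import re

# Toy database in the shape shown above; entries are illustrative only.
RISK_DATABASE = {
    "drugs": {
        "keywords": ["xanax", "percocet"],
        "slang_patterns": [re.compile(r"p[3e]rc[s0]?", re.IGNORECASE)],
        "legal_reference": {"statute": "21 U.S.C. § 841"},
    },
    "surikae": {
        "keywords": ["replica"],
        "slang_patterns": [re.compile(r"1\s*:\s*1")],
        "legal_reference": {"statute": "18 U.S.C. § 2320"},
    },
}


def scan(text: str, database: dict) -> dict[str, list[str]]:
    """Return {category: flags} for every category with at least one hit."""
    lowered = text.lower()
    results = {}
    for category, entry in database.items():
        # Plain substring check keeps the sketch short; it is looser than
        # word-boundary matching.
        flags = [f"{category}:kw:{kw}" for kw in entry["keywords"] if kw in lowered]
        for pattern in entry["slang_patterns"]:
            match = pattern.search(lowered)
            if match:
                flags.append(f"{category}:pat:{match.group(0)}")
        if flags:
            results[category] = flags
    return results


print(scan("got them p3rcs, real pharma", RISK_DATABASE))
print(scan("Jordan 1:1 replica", RISK_DATABASE))
```

Adding a fourth category here means adding one more key to `RISK_DATABASE`; `scan` itself does not change.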

Scoring System

| Signal | Weight | Example |
| --- | --- | --- |
| Each keyword match | +0.15 | "lean" found → +0.15 |
| Each regex pattern match | +0.25 | "p3rcs" via regex → +0.25 |
| Price in suspicious range | +0.20 | $25 + text evidence → +0.20 |
| Combo bonus (text + price) | +0.10 | Both present → extra +0.10 |

  • Score is clamped to [0.0, 1.0].
  • Price is an amplifier, not a detector: a $45 bookshelf scores 0.0.
  • Final score is the max across all categories (a listing is as risky as its most dangerous match).
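
The weights and rules above can be worked through with a small sketch. It assumes one reading of the table: the price weight and the combo bonus are both applied only when there is textual evidence and the price is in range. The function and constant names are hypothetical, not the engine's `_calculate_score()`.

```python
KEYWORD_WEIGHT = 0.15
PATTERN_WEIGHT = 0.25
PRICE_WEIGHT = 0.20
COMBO_BONUS = 0.10


def category_score(keyword_hits: int, pattern_hits: int, price_in_range: bool) -> float:
    """Score one category from its hit counts and price context."""
    has_text_evidence = keyword_hits + pattern_hits > 0
    score = keyword_hits * KEYWORD_WEIGHT + pattern_hits * PATTERN_WEIGHT
    # Price only amplifies: without text evidence it contributes nothing.
    if has_text_evidence and price_in_range:
        score += PRICE_WEIGHT + COMBO_BONUS
    # Clamp to [0.0, 1.0]; rounding hides floating-point noise.
    return round(min(max(score, 0.0), 1.0), 2)


def final_score(per_category: list[float]) -> float:
    """A listing is as risky as its most dangerous category."""
    return max(per_category, default=0.0)


print(category_score(0, 0, True))    # bookshelf at $45: no text evidence
print(category_score(0, 1, True))    # one pattern hit plus price context
print(category_score(3, 2, True))    # the Jordan 1 example, clamped
print(final_score([0.15, 0.55]))
```

Under these assumptions the Jordan 1 listing from the report reaches 3 × 0.15 + 2 × 0.25 + 0.20 + 0.10 = 1.25 before clamping, which matches the 100% shown there.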

Risk Levels

| Level | Threshold | Emoji | Action |
| --- | --- | --- | --- |
| CRITICAL | Score > 80% | 🔴 | Automatic block → Escalate to Legal |
| WARNING | Score > 40% | 🟡 | Manual review → T&S analyst queue |
| SAFE | Score ≤ 40% | 🟢 | Approved → No action required |
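
A minimal sketch of the mapping in this table follows. `RiskLevel` and `classify_risk` are names from the architecture listing, but the bodies below are an assumed implementation written for illustration, with the thresholds taken directly from the table.

```python
from enum import Enum


class RiskLevel(Enum):
    CRITICAL = "Automatic block: escalate to Legal"
    WARNING = "Manual review: T&S analyst queue"
    SAFE = "Approved: no action required"


def classify_risk(score: float) -> RiskLevel:
    """Map a risk score in [0.0, 1.0] to a traffic-light level."""
    # Both thresholds are strict: exactly 0.80 is still WARNING and
    # exactly 0.40 is still SAFE.
    if score > 0.80:
        return RiskLevel.CRITICAL
    if score > 0.40:
        return RiskLevel.WARNING
    return RiskLevel.SAFE


for score in (1.0, 0.8, 0.55, 0.4, 0.0):
    level = classify_risk(score)
    print(f"{score:.2f} -> {level.name}: {level.value}")
```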

Demo Output

Run python demo.py to launch the full Mercari feed simulation. Each listing flows through three auditor-grade phases:

Phase 1: 📥 INPUT (what the moderator sees)

  ┌─ 📥 INPUT ──────────────────────────────────────────────┐
  │  Title:      Herbal Supplement Pack 🍃💨                   │
  │  Seller:     rxplug_verified                              │
  │  Category:   Health & Wellness                            │
  │  Price:      $35.00                                       │
  │                                                           │
  │  Description:                                             │
  │    Premium p3rcs and lean combo 💊 real pharma scripts...   │
  └───────────────────────────────────────────────────────────┘

Phase 2: ⚙️ PROCESSING (step-by-step engine pipeline)

  ├─ 🔤 Normalizing text ······· lowercase + collapse whitespace
  ├─ 🔑 Scanning 109 keywords ··························· 4 hits
  ├─ 🧬 Matching 22 regex patterns ······················ 3 hits
  ├─ 💲 Checking price context ($35.00) ·· ⚠ in suspicious range
  ├─ 📊 Calculating risk score ···························· 100%

Phase 3: VERDICT (the decision)

  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    🔴  VERDICT: 🚫 BLOCKED
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    Risk Score : [██████████████████████████████] 100%
    Flags      : 8 indicators  ·  Categories: DRUGS
    📜 Legal basis: 21 U.S.C. § 841 - Controlled Substances Act
    👉 ACTION: AUTOMATIC BLOCK - Escalate to Legal

Dashboard Summary

| # | Case | Listing ID | Verdict | Categories |
| --- | --- | --- | --- | --- |
| 1 | Control: Legitimate product | MER-2026-00417 | 🟢 SAFE · ✅ APPROVED (0 %) | - |
| 2 | Slang: Drug trafficking | MER-2026-01893 | 🔴 CRITICAL · 🚫 BLOCKED (100 %) | DRUGS |
| 3 | Anomaly: Money laundering | MER-2026-03201 | 🔴 CRITICAL · 🚫 BLOCKED (100 %) | MONEY_LAUNDERING |
| 4 | Fraud: Surikae counterfeit | MER-2026-05742 | 🔴 CRITICAL · 🚫 BLOCKED (100 %) | SURIKAE |

Also available: python -m src for a quick 8-case CLI demo without the 3-phase animation.

Roadmap

  • Phase 1: Project scaffolding & repository structure
  • Phase 2: Knowledge base architecture (risk_database)
  • Phase 3: Inference engine (normalize → scan → score → verdict)
  • Phase 4: Report interface & traffic-light system
  • Phase 5: Documentation, narrative & portfolio polish
  • Phase 5.5: Live simulation demo (demo.py) & repo update
  • Phase 6: REST API wrapper (FastAPI + Pydantic models)
  • Phase 7: Batch processing & CSV/JSON ingestion
  • Phase 8: Dashboard & analytics module

Contributing

This project is in prototype phase. Contributions, ideas, and feedback are welcome.

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes using Conventional Commits
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Legal Disclaimer

⚠️ This software is a prototype built for educational and demonstration purposes only.

SlangHunter is designed to showcase programmatic legal-risk analysis techniques and is not intended for production deployment without proper legal review, regulatory approval, and human oversight.

The crime categories, keywords, and legal references included are illustrative examples drawn from publicly available U.S. federal statutes. They do not constitute legal advice. The author assumes no liability for decisions made based on this tool's output.

If you're building something like this for real: hire a lawyer, not just an engineer. Better yet, hire a Legal Engineer who can do both. 😉

License

This project is licensed under the MIT License; see the LICENSE file for details.


SlangHunter: Built with 🧠 by a Legal Engineer who believes compliance can be automated.
109 keywords · 22 regex patterns · 3 crime categories · 90 tests · 0 linter warnings
