PR-Sentinel

An automated framework that detects, scores, and mitigates spammy pull requests on GitHub repositories in real time.

Features

  • 🎯 Automated Spam Detection: Analyzes PR metadata, file diffs, and text for spam indicators
  • 🔍 Multi-Factor Analysis:
    • Trivial README edits
    • Minimal code changes
    • Generic AI-generated text patterns
    • Suspicious patterns in descriptions
  • 🤖 Auto-Moderation: Automatically comments on and closes spam PRs
  • 💾 Lightweight Storage: JSON-based tracking without database requirements
  • FastAPI Webhook: High-performance webhook listener for GitHub events
  • 📊 Configurable Thresholds: Customizable spam detection scoring

Architecture

PR-Sentinel consists of several key components:

  1. FastAPI Webhook Listener (main.py): Receives GitHub pull_request events
  2. Spam Detector (spam_detector.py): Analyzes PRs using multiple heuristics
  3. GitHub Client (github_client.py): Interacts with GitHub API using PyGithub
  4. Storage (storage.py): Lightweight JSON-based PR tracking
  5. Configuration (config.py): Centralized settings management

Installation

Prerequisites

  • Python 3.8 or higher
  • GitHub Personal Access Token with repo permissions
  • (Optional) GitHub Webhook Secret for secure webhooks

Setup

  1. Clone the repository:
git clone https://github.com/Anorak001/PR-Sentinel.git
cd PR-Sentinel
  2. Install dependencies:
pip install -r requirements.txt
  3. Configure environment variables:
cp .env.example .env
# Edit .env with your GitHub token and settings

Environment variables (only GITHUB_TOKEN is required):

  • GITHUB_TOKEN: Your GitHub Personal Access Token
  • GITHUB_WEBHOOK_SECRET: (Optional) Secret for webhook verification
  • SPAM_SCORE_THRESHOLD: Score threshold for spam detection (default: 70)
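
As a rough illustration of how these variables could be read at startup, here is a minimal standard-library sketch. The `Settings` class and `load_settings` function are hypothetical names chosen for this example; the project's own config.py may be organized differently.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the three documented variables."""
    github_token: str
    webhook_secret: Optional[str]
    spam_score_threshold: float


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from a mapping (the process environment by default)."""
    token = env.get("GITHUB_TOKEN", "")
    if not token:
        raise RuntimeError("GITHUB_TOKEN must be set")
    # A missing or empty secret is treated as "verification disabled" (assumption).
    secret = env.get("GITHUB_WEBHOOK_SECRET") or None
    # Fall back to the documented default of 70.
    threshold = float(env.get("SPAM_SCORE_THRESHOLD", "70"))
    return Settings(token, secret, threshold)
```

Passing the mapping as a parameter keeps the function easy to exercise without touching the real environment, for example `load_settings({"GITHUB_TOKEN": "t"})`.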

Usage

Running the Server

Start the FastAPI webhook listener:

python main.py

Or using uvicorn directly:

uvicorn main:app --host 0.0.0.0 --port 8000

The server listens on all network interfaces at port 8000 (http://0.0.0.0:8000).

Setting Up GitHub Webhook

  1. Go to your repository settings → Webhooks → Add webhook
  2. Set Payload URL to: http://your-server:8000/webhook
  3. Set Content type to: application/json
  4. (Optional) Set Secret to match your GITHUB_WEBHOOK_SECRET
  5. Select "Let me select individual events" and choose:
    • ✅ Pull requests
  6. Click "Add webhook"
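
When a secret is configured, GitHub signs each delivery with HMAC-SHA256 over the raw request body and sends the result in the `X-Hub-Signature-256` header as `sha256=<hex digest>`. The sketch below shows one way a listener might check that header; `verify_signature` is a hypothetical helper name, not necessarily what main.py uses.

```python
import hashlib
import hmac


def verify_signature(secret: str, body: bytes, signature_header: str) -> bool:
    """Return True if signature_header matches the HMAC-SHA256 of body."""
    if not signature_header or not signature_header.startswith("sha256="):
        return False
    digest = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    expected = "sha256=" + digest
    # compare_digest runs in constant time, so a mismatch position is not
    # leaked through response timing.
    return hmac.compare_digest(expected.encode("utf-8"),
                               signature_header.encode("utf-8"))
```

The check must run against the exact bytes received, before any JSON parsing, since re-serializing the payload can change the byte sequence and invalidate the signature.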

API Endpoints

  • GET / - Service information
  • GET /health - Health check
  • GET /stats - Statistics about tracked PRs
  • POST /webhook - GitHub webhook endpoint

Spam Detection Logic

PR-Sentinel uses a weighted scoring system to detect spam:

Detection Criteria

  1. Trivial README Edits (Weight: 30)

    • Only README files modified
    • Fewer than roughly 10 to 50 lines changed
  2. Minimal Code Changes (Weight: 25)

    • Very small number of line changes (≤5 lines: full score)
    • Small changes (≤15 lines: partial score)
  3. Generic AI Text (Weight: 35)

    • Detects common AI-generated text patterns
    • Phrases like "as an AI", "it's worth noting", "delve into", etc.
  4. Suspicious Patterns (Weight: 10)

    • "typo fix", "minor fix" with minimal description
    • Very short PR descriptions
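
To make the criteria above concrete, here is a minimal sketch of three of the heuristics as pure functions. All function names are hypothetical, the phrase list contains only the examples quoted above, the 0.5 "partial score" factor is an assumed value, and the filename test for README files is a guess at how "only README files modified" might be checked.

```python
from typing import Iterable

# Phrases quoted in the criteria above; the project's full list is not shown.
AI_PHRASES = ("as an ai", "it's worth noting", "delve into")


def minimal_change_factor(lines_changed: int) -> float:
    """Fraction (0.0 to 1.0) of the 'minimal changes' weight to apply."""
    if lines_changed <= 5:
        return 1.0  # full score
    if lines_changed <= 15:
        return 0.5  # partial score; 0.5 is an assumed value
    return 0.0


def ai_text_factor(text: str, phrases: Iterable[str] = AI_PHRASES) -> float:
    """1.0 if any listed phrase occurs in the text (case-insensitive), else 0.0."""
    lowered = text.lower()
    return 1.0 if any(p in lowered for p in phrases) else 0.0


def is_readme_only(filenames: Iterable[str]) -> bool:
    """True if at least one file changed and every changed file is a README."""
    names = [name.rsplit("/", 1)[-1].lower() for name in filenames]
    return bool(names) and all(n.startswith("readme") for n in names)
```

Each function returns a factor rather than a final score, so the weights can be applied in one place.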

Scoring

  • Scores range from 0 to 100 (the four default weights sum to 100)
  • Default threshold: 70 (configurable)
  • Scores above threshold trigger auto-moderation
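
The following sketch shows how a weighted sum and threshold check of this kind could be computed. It assumes each criterion yields a factor between 0 and 1 that is multiplied by its weight, and that "above threshold" means a strict comparison; `spam_score` and `classify` are hypothetical names.

```python
from typing import Dict, Mapping, Tuple

# Default weights from the detection criteria.
WEIGHTS: Dict[str, float] = {
    "trivial_readme": 30,
    "minimal_changes": 25,
    "generic_ai_text": 35,
    "suspicious_patterns": 10,
}


def spam_score(factors: Mapping[str, float],
               weights: Mapping[str, float] = WEIGHTS) -> float:
    """Sum weight * factor over all criteria, clamping each factor to [0, 1]."""
    total = 0.0
    for name, weight in weights.items():
        factor = min(1.0, max(0.0, factors.get(name, 0.0)))
        total += weight * factor
    return total


def classify(factors: Mapping[str, float],
             threshold: float = 70) -> Tuple[float, bool]:
    """Return (score, is_spam); a score equal to the threshold is not spam."""
    score = spam_score(factors)
    return score, score > threshold
```

With the default weights, a README-only PR with AI-style text and a suspicious description scores 30 + 35 + 10 = 75 and is flagged, while AI-style text plus minimal changes plus a suspicious description scores exactly 70 and is not.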

Auto-Moderation Actions

When a PR exceeds the spam threshold:

  1. Comment: Posts an automated comment explaining the detection
  2. Close: Automatically closes the PR
  3. Track: Stores the PR data for future reference
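
The three steps above can be sketched as a single function. This is an illustration under assumptions, not the project's code: `moderate` and `PullRequestLike` are hypothetical names, the comment wording is invented, and recording non-spam PRs as well is a guess. The two methods called on `pr` mirror PyGithub's `PullRequest.create_issue_comment` and `PullRequest.edit(state="closed")`.

```python
from typing import Any, Dict, List, Protocol


class PullRequestLike(Protocol):
    """The two PyGithub PullRequest methods this sketch relies on."""

    def create_issue_comment(self, body: str) -> Any: ...

    def edit(self, *, state: str) -> Any: ...


def moderate(pr: PullRequestLike, score: float, threshold: float,
             tracked: List[Dict[str, Any]]) -> bool:
    """Comment on, close, and record a PR whose score exceeds the threshold."""
    is_spam = score > threshold
    if is_spam:
        # 1. Comment: explain why the PR is being closed.
        pr.create_issue_comment(
            "This pull request was flagged as likely spam "
            f"(score {score:.0f}, threshold {threshold:.0f}) and has been closed."
        )
        # 2. Close the PR.
        pr.edit(state="closed")
    # 3. Track: every analyzed PR is recorded here, flagged or not (assumption).
    tracked.append({"spam_score": score, "is_spam": is_spam})
    return is_spam
```

With PyGithub, a suitable `pr` object would typically come from `Github(token).get_repo("owner/repo").get_pull(123)`. Accepting any object with these two methods keeps the function testable with a simple fake.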

Configuration

Edit config.py or use environment variables to customize:

# Spam Detection Thresholds
SPAM_SCORE_THRESHOLD = 70  # Adjust sensitivity

# Detection Weights
WEIGHTS = {
    "trivial_readme": 30,
    "minimal_changes": 25,
    "generic_ai_text": 35,
    "suspicious_patterns": 10
}

# Storage
MAX_TRACKED_PRS = 100  # Number of PRs to keep in memory

Storage

PR-Sentinel uses a simple JSON file (pr_tracking.json) to store recent PR data:

{
  "prs": [
    {
      "repo": "owner/repo",
      "pr_number": 123,
      "user": "username",
      "spam_score": 75.0,
      "is_spam": true,
      "details": {...},
      "tracked_at": "2025-01-01T00:00:00"
    }
  ]
}

The storage automatically:

  • Keeps only the most recent PRs (default: 100)
  • Tracks spam scores and detection reasons
  • Records timestamps for all actions
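
A bounded JSON store of this kind can be sketched in a few lines. The function below is a hypothetical illustration (`track_pr` is not necessarily a name used in storage.py): it appends a record, stamps it if needed, and trims the list to the newest entries.

```python
import json
from datetime import datetime
from pathlib import Path
from typing import Any, Dict, List


def track_pr(path: Path, record: Dict[str, Any],
             max_tracked: int = 100) -> List[Dict[str, Any]]:
    """Append a record and keep only the newest max_tracked entries.

    Assumes max_tracked >= 1.
    """
    if path.exists():
        data = json.loads(path.read_text(encoding="utf-8"))
    else:
        data = {"prs": []}
    entry = dict(record)
    # ISO 8601 timestamp with second precision, as in the example above.
    entry.setdefault("tracked_at", datetime.now().isoformat(timespec="seconds"))
    data["prs"].append(entry)
    # Drop the oldest entries once the list grows past the limit.
    data["prs"] = data["prs"][-max_tracked:]
    path.write_text(json.dumps(data, indent=2), encoding="utf-8")
    return data["prs"]
```

Rewriting the whole file on each call is simple and adequate for about a hundred records. This sketch does no file locking, so concurrent writers could lose updates.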

Development

Project Structure

PR-Sentinel/
├── main.py              # FastAPI webhook listener
├── spam_detector.py     # Spam detection logic
├── github_client.py     # GitHub API client
├── storage.py           # JSON storage
├── config.py            # Configuration
├── requirements.txt     # Dependencies
├── .env.example         # Environment template
├── .gitignore           # Git ignore rules
└── README.md            # Documentation

Running Tests

The system is designed to be lightweight and doesn't require extensive testing infrastructure. Manual testing can be done by:

  1. Starting the server
  2. Triggering test webhooks from GitHub
  3. Checking the /stats endpoint for results
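
For local testing without waiting on real GitHub deliveries, a signed request resembling a pull_request event can be built by hand. The sketch below uses only the standard library. `build_test_webhook`, the `dev-secret` value, and the trimmed sample payload are assumptions for illustration; real deliveries carry many more fields, and the listener may read fields not included here.

```python
import hashlib
import hmac
import json
import urllib.request
from typing import Any, Dict


def build_test_webhook(url: str, secret: str,
                       payload: Dict[str, Any]) -> urllib.request.Request:
    """Build (but do not send) a signed POST that resembles a GitHub delivery."""
    body = json.dumps(payload).encode("utf-8")
    digest = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    headers = {
        "Content-Type": "application/json",
        "X-GitHub-Event": "pull_request",
        "X-Hub-Signature-256": "sha256=" + digest,
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Hypothetical, heavily trimmed payload for a newly opened PR.
sample = {
    "action": "opened",
    "number": 123,
    "pull_request": {"title": "typo fix", "body": "", "user": {"login": "username"}},
    "repository": {"full_name": "owner/repo"},
}
request = build_test_webhook("http://localhost:8000/webhook", "dev-secret", sample)
# To send it against a running server: urllib.request.urlopen(request)
```

The secret passed here has to match the server's GITHUB_WEBHOOK_SECRET, otherwise a listener that verifies signatures should reject the request.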

Security Considerations

  • Webhook Verification: Always use GITHUB_WEBHOOK_SECRET in production
  • Token Security: Never commit your GITHUB_TOKEN to version control
  • Rate Limiting: GitHub API has rate limits; the system handles them gracefully
  • False Positives: Monitor the /stats endpoint and adjust thresholds as needed

Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.

License

MIT License - see LICENSE file for details.

Support

For issues, questions, or contributions, please open an issue on GitHub.