PR-Sentinel Features

Core Functionality

1. FastAPI Webhook Listener

High-performance webhook server using FastAPI and Uvicorn
Signature verification for secure GitHub webhook handling
Event filtering - processes only pull_request events whose action is opened or reopened
RESTful API endpoints for monitoring and health checks
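The event filter described above could look roughly like the following. This is a minimal sketch, not PR-Sentinel's actual code: the function name should_process and the constant HANDLED_ACTIONS are hypothetical, while the X-GitHub-Event header value and the action payload field follow GitHub's webhook format.

```python
# Hypothetical helper; the real listener may structure this differently.
HANDLED_ACTIONS = {"opened", "reopened"}


def should_process(event_name: str, payload: dict) -> bool:
    """Return True only for pull_request events with a handled action.

    event_name is the value of the X-GitHub-Event request header;
    payload is the decoded JSON body of the webhook.
    """
    if event_name != "pull_request":
        return False
    return payload.get("action") in HANDLED_ACTIONS
```

Filtering before any analysis keeps unrelated events (pushes, issue comments, closed PRs) from consuming GitHub API quota.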

2. Multi-Factor Spam Detection

A. Trivial README Edits (Weight: 30/100)

Detects PRs that only modify README files
Scores higher for minimal changes (< 10 lines)
Reduces score for larger README changes (10-50 lines)
Recognizes common documentation file patterns (README.md, README.rst, etc.)
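As an illustration of the rules above, a README-only scorer might be sketched like this. The function names are hypothetical, and the exact reduction for 10-50 line changes is not stated in this document, so half the weight is used here purely as an assumption.

```python
README_WEIGHT = 30  # weight stated in this document


def is_readme(path: str) -> bool:
    """Match README, README.md, README.rst, etc. in any directory."""
    name = path.rsplit("/", 1)[-1].lower()
    return name == "readme" or name.startswith("readme.")


def readme_score(files: list[str], lines_changed: int) -> int:
    """Score a PR that touches only README files; 0 for anything else."""
    if not files or not all(is_readme(f) for f in files):
        return 0
    if lines_changed < 10:
        return README_WEIGHT        # minimal change: full weight
    if lines_changed <= 50:
        return README_WEIGHT // 2   # assumed reduction for 10-50 lines
    return 0                        # assumed: larger rewrites are not flagged
```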

B. Minimal Code Changes (Weight: 25/100)

Flags PRs with very few line changes
Full score for ≤5 lines changed
Partial score for 6-15 lines changed
Considers both additions and deletions
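The thresholds above translate into a small function. This is a sketch under one assumption: the document does not say how large the partial score is, so half the weight is used here.

```python
MINIMAL_WEIGHT = 25  # weight stated in this document


def minimal_changes_score(additions: int, deletions: int) -> int:
    """Score a PR by its total changed lines (additions plus deletions)."""
    total = additions + deletions
    if total <= 5:
        return MINIMAL_WEIGHT        # full score for 5 lines or fewer
    if total <= 15:
        return MINIMAL_WEIGHT // 2   # assumed partial score for 6-15 lines
    return 0
```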

C. Generic AI Text Detection (Weight: 35/100)

Scans PR title, body, and commit messages
Detects common AI-generated phrases:
- "as an AI"
- "it's worth noting"
- "delve into"
- "paradigm", "landscape", "realm"
- Additional phrases from the AI_TEXT_INDICATORS list
Higher score for multiple indicators found
Case-insensitive matching
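A phrase scan of this kind can be sketched as follows. The indicator list contains only the phrases named above; the scaling rule (one third of the weight per distinct indicator, capped at three) is an assumption chosen to illustrate "higher score for multiple indicators", not the project's documented formula.

```python
AI_WEIGHT = 35  # weight stated in this document

# Only the phrases listed in this document; stored in lowercase.
AI_TEXT_INDICATORS = [
    "as an ai",
    "it's worth noting",
    "delve into",
    "paradigm",
    "landscape",
    "realm",
]


def ai_text_score(title: str, body: str, commit_messages: list[str]) -> int:
    """Count distinct indicator phrases across title, body, and commits."""
    text = "\n".join([title or "", body or "", *commit_messages]).lower()
    hits = sum(1 for phrase in AI_TEXT_INDICATORS if phrase in text)
    # Assumed scaling: each hit adds a third of the weight, up to three hits.
    return AI_WEIGHT * min(hits, 3) // 3
```

Plain substring matching is simple but will also match inside longer words, so short indicators deserve extra care when extending the list.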

D. Suspicious Patterns (Weight: 10/100)

Identifies low-effort PR descriptions:
- "typo fix"
- "minor fix"
- "quick fix"
Requires short description (< 50 chars) for positive match
Catches common spam PR titles
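The check above might look like this in code. It is a sketch that assumes the pattern may appear in either the title or the description and that the 50-character limit applies to the description; the document does not spell out either detail.

```python
SUSPICIOUS_WEIGHT = 10  # weight stated in this document
SUSPICIOUS_PATTERNS = ["typo fix", "minor fix", "quick fix"]


def suspicious_score(title: str, body: str) -> int:
    """Flag short, low-effort descriptions that contain a known pattern."""
    description = (body or "").strip()
    if len(description) >= 50:
        return 0  # a longer description suggests real effort
    text = f"{title} {description}".lower()
    if any(pattern in text for pattern in SUSPICIOUS_PATTERNS):
        return SUSPICIOUS_WEIGHT
    return 0
```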

3. Automated Moderation

When spam score exceeds threshold (default: 70/100):

Auto-Comment
- Professional, informative message
- Includes spam score
- Lists specific detection reasons
- Provides guidance for legitimate contributions
Auto-Close
- Automatically closes the PR
- Prevents spam from cluttering repository
- Maintainers can reopen if false positive
Tracking
- Logs all actions to JSON storage
- Maintains history for analysis
- Helps improve detection over time
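The comment-and-close flow can be sketched as below. The function names and the comment wording are hypothetical; create_issue_comment and edit(state="closed") are existing PyGithub PullRequest methods, and any object exposing the same two methods will work here. "Exceeds" is read as strictly greater than the threshold.

```python
def build_comment(score: int, threshold: int, reasons: list[str]) -> str:
    """Compose the moderation comment: score, reasons, and guidance."""
    lines = [
        "This pull request was automatically flagged as likely spam "
        f"(score {score}/100, threshold {threshold}).",
        "",
        "Reasons:",
        *[f"- {reason}" for reason in reasons],
        "",
        "If this is a mistake, please reply here and a maintainer can reopen it.",
    ]
    return "\n".join(lines)


def moderate(pr, score: int, reasons: list[str], threshold: int = 70) -> bool:
    """Comment on and close pr when score exceeds threshold.

    Returns True if the PR was moderated, False if it was left alone.
    """
    if score <= threshold:
        return False
    pr.create_issue_comment(build_comment(score, threshold, reasons))
    pr.edit(state="closed")
    return True
```

Commenting before closing means the contributor sees the explanation even if the close call fails.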

4. Lightweight JSON Storage

No database required - uses simple JSON file
Automatic rotation - keeps only recent PRs (configurable, default: 100)
Structured data - tracks repo, PR number, user, score, reasons, timestamps
Easy to backup - just copy the JSON file
Human-readable - can be manually reviewed/edited if needed
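The rotation behaviour can be illustrated with a few lines of standard-library Python. This is a sketch only: the function name record_pr and the flat-list file layout are assumptions, not the project's actual storage format.

```python
import json
from pathlib import Path


def record_pr(path: str, entry: dict, max_tracked: int = 100) -> list[dict]:
    """Append entry to the JSON file, keeping only the newest max_tracked."""
    file = Path(path)
    records = json.loads(file.read_text()) if file.exists() else []
    records.append(entry)
    # Drop the oldest entries once the limit is exceeded.
    records = records[-max_tracked:] if max_tracked > 0 else []
    file.write_text(json.dumps(records, indent=2))
    return records
```

Rewriting the whole file on every update is acceptable at this size (about 100 entries) but is one reason very active repositories may prefer a database.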

5. GitHub API Integration (PyGithub)

Rich PR data - fetches complete PR information including:
- File changes and diffs
- Commit messages
- PR metadata
- User information
Fallback mechanism - uses webhook data if API fails
Rate limit aware - handles GitHub API limits gracefully
Secure authentication - uses Personal Access Token
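The fetch-then-fall-back pattern can be sketched as follows. All three function names and the returned dictionary layout are hypothetical. The PyGithub calls used (get_repo, get_pull, get_files, get_commits) exist, as do the webhook payload fields read in the fallback; newer PyGithub releases prefer passing auth=Auth.Token(token) over a positional token.

```python
from typing import Callable


def fetch_with_pygithub(token: str, full_name: str, number: int) -> dict:
    """Fetch full PR details through the GitHub API."""
    from github import Github  # PyGithub, imported lazily

    pr = Github(token).get_repo(full_name).get_pull(number)
    return {
        "title": pr.title,
        "body": pr.body or "",
        "additions": pr.additions,
        "deletions": pr.deletions,
        "files": [f.filename for f in pr.get_files()],
        "commit_messages": [c.commit.message for c in pr.get_commits()],
        "source": "api",
    }


def from_webhook(payload: dict) -> dict:
    """Build a reduced record from the webhook payload alone."""
    pr = payload["pull_request"]
    return {
        "title": pr.get("title", ""),
        "body": pr.get("body") or "",
        "additions": pr.get("additions", 0),
        "deletions": pr.get("deletions", 0),
        "files": [],            # file names are not part of the payload
        "commit_messages": [],  # nor are commit messages
        "source": "webhook",
    }


def collect_pr_data(payload: dict, fetch: Callable[[str, int], dict]) -> dict:
    """Try the API first; fall back to webhook data on any failure."""
    try:
        return fetch(
            payload["repository"]["full_name"],
            payload["pull_request"]["number"],
        )
    except Exception:  # broad on purpose in this sketch
        return from_webhook(payload)
```

In use, the token can be bound with functools.partial(fetch_with_pygithub, token) and passed as fetch. The fallback record has no file list, so file-based checks such as README-only detection cannot run on it.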

6. Flexible Configuration

All settings customizable via environment variables:

GITHUB_TOKEN              # GitHub authentication
GITHUB_WEBHOOK_SECRET     # Webhook security
SPAM_SCORE_THRESHOLD      # Detection sensitivity (default: 70)
PR_TRACKING_FILE          # Storage location
MAX_TRACKED_PRS           # Storage size limit (default: 100)
HOST                      # Server host (default: 0.0.0.0)
PORT                      # Server port (default: 8000)
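Reading these variables with their documented defaults might look like the sketch below. The Settings class and load_settings function are hypothetical, and the default file name for PR_TRACKING_FILE is an assumption because this document does not state one.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    github_token: str
    webhook_secret: str
    spam_score_threshold: int
    pr_tracking_file: str
    max_tracked_prs: int
    host: str
    port: int


def load_settings(env=os.environ) -> Settings:
    """Build Settings from a mapping of environment variables."""
    return Settings(
        github_token=env.get("GITHUB_TOKEN", ""),
        webhook_secret=env.get("GITHUB_WEBHOOK_SECRET", ""),
        spam_score_threshold=int(env.get("SPAM_SCORE_THRESHOLD", "70")),
        # Assumed default; the document does not name the file.
        pr_tracking_file=env.get("PR_TRACKING_FILE", "pr_tracking.json"),
        max_tracked_prs=int(env.get("MAX_TRACKED_PRS", "100")),
        host=env.get("HOST", "0.0.0.0"),
        port=int(env.get("PORT", "8000")),
    )
```

Accepting the mapping as a parameter makes the loader easy to test without touching the real process environment.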

API Endpoints

GET /

Service information
Status check
API version

GET /health

Health check endpoint
Returns 200 OK if service is running
Useful for monitoring and load balancers

GET /stats

Statistics dashboard
Total PRs tracked
Spam PRs detected
Recent PR history (last 10)
JSON response format

POST /webhook

Main GitHub webhook endpoint
Verifies signature
Processes pull_request events
Returns analysis results

Deployment Options

Local Development

python main.py
# or
./run.sh

Docker

docker-compose up -d

Cloud Platforms

Heroku
AWS EC2
Google Cloud Run
DigitalOcean App Platform
Railway

See DEPLOYMENT.md for detailed guides.

Customization Options

Adjust Detection Sensitivity

# More lenient (fewer false positives)
SPAM_SCORE_THRESHOLD = 80

# More strict (catches more spam)
SPAM_SCORE_THRESHOLD = 60

Custom Detection Weights

WEIGHTS = {
    "trivial_readme": 40,       # Emphasize README-only checks
    "minimal_changes": 30,      # Emphasize small-diff checks
    "generic_ai_text": 25,      # De-emphasize AI text detection
    "suspicious_patterns": 5,   # De-emphasize pattern matching
}
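When changing weights, it helps to confirm they still sum to 100 so the threshold keeps its meaning. The sketch below assumes that each detector reports a signal between 0 and 1 and that the total is the weighted sum; both function names are hypothetical and the project may combine scores differently.

```python
WEIGHTS = {
    "trivial_readme": 40,
    "minimal_changes": 30,
    "generic_ai_text": 25,
    "suspicious_patterns": 5,
}


def validate_weights(weights: dict[str, int]) -> None:
    """Raise ValueError unless the weights sum to exactly 100."""
    total = sum(weights.values())
    if total != 100:
        raise ValueError(f"weights sum to {total}, expected 100")


def total_score(signals: dict[str, float], weights: dict[str, int] = WEIGHTS) -> float:
    """Combine per-detector signals (0.0 to 1.0) into a 0-100 score."""
    validate_weights(weights)
    return sum(
        weight * min(max(signals.get(name, 0.0), 0.0), 1.0)
        for name, weight in weights.items()
    )
```

With these example weights, a README-only PR with a tiny diff reaches 70 from the first two detectors alone, which does not exceed the default threshold of 70 unless another detector also fires.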

Add Custom Patterns

AI_TEXT_INDICATORS = [
    # Add your observed spam patterns
    "your custom pattern",
    "another pattern"
]

SUSPICIOUS_PATTERNS = [
    # Add suspicious phrases
    "first contribution",
    "test pr"
]

See CUSTOMIZATION.md for detailed customization guide.

Security Features

1. Webhook Signature Verification

HMAC-SHA256 signature validation
Prevents unauthorized webhook calls
Configurable via GITHUB_WEBHOOK_SECRET
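GitHub sends the HMAC-SHA256 digest of the raw request body in the X-Hub-Signature-256 header, prefixed with sha256=. A minimal verification sketch using only the standard library follows; the function name verify_signature is hypothetical.

```python
import hashlib
import hmac
from typing import Optional


def verify_signature(secret: str, body: bytes, signature_header: Optional[str]) -> bool:
    """Check the X-Hub-Signature-256 header against the raw request body."""
    if not signature_header or not signature_header.startswith("sha256="):
        return False
    digest = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    expected = "sha256=" + digest
    # compare_digest runs in constant time to avoid timing attacks.
    return hmac.compare_digest(expected.encode(), signature_header.encode())
```

The digest must be computed over the exact bytes received, before any JSON parsing, because re-serialized JSON can differ from what GitHub signed.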

2. Token Security

Environment variable storage
No hardcoded credentials
.gitignore excludes sensitive files

3. Rate Limiting Ready

Designed for rate limiting middleware
Handles GitHub API limits
Graceful error handling

4. Input Validation

Validates webhook payloads
Type checking with Pydantic
Sanitizes user input

Performance Characteristics

Speed

Fast analysis - typically < 100ms per PR
Async-ready - FastAPI supports async operations
Minimal overhead - lightweight JSON storage

Scalability

Single instance - handles hundreds of PRs per day
Multiple instances - can run behind load balancer
Database-ready - easy to upgrade storage layer

Resource Usage

Memory - ~50-100MB typical
CPU - Low usage, spikes during analysis
Storage - < 1MB for JSON file (100 PRs)

Monitoring & Observability

Logs

Structured logging with timestamps
Info, warning, and error levels
Tracks all PR analysis
Records auto-moderation actions

Statistics API

curl https://your-domain.com/stats

Returns:

{
  "total_tracked": 50,
  "spam_detected": 5,
  "recent_prs": [...]
}
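A response of this shape could be assembled from the tracked records roughly as follows. This is a sketch: the function name build_stats is hypothetical, and it assumes each record carries a score field and counts as spam when that score exceeds the threshold.

```python
def build_stats(records: list[dict], threshold: int = 70, recent: int = 10) -> dict:
    """Summarize tracked PR records for the stats endpoint."""
    return {
        "total_tracked": len(records),
        "spam_detected": sum(
            1 for record in records if record.get("score", 0) > threshold
        ),
        # The newest entries are at the end of the list.
        "recent_prs": records[-recent:],
    }
```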

Health Checks

curl https://your-domain.com/health

Good for:

Load balancer health checks
Monitoring systems (Datadog, New Relic, etc.)
Uptime monitoring

Testing

Unit Tests

python test_spam_detector.py

Tests cover:

Trivial README detection
Minimal changes detection
AI text detection
Legitimate PR handling
Combined spam indicators

Manual Testing

# Start server
python main.py

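# Send test webhook (if GITHUB_WEBHOOK_SECRET is set, the request will likely
# also need a valid X-Hub-Signature-256 header to pass signature verification)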
curl -X POST http://localhost:8000/webhook \
  -H "Content-Type: application/json" \
  -H "X-GitHub-Event: pull_request" \
  -d @test_payload.json

# Check stats
curl http://localhost:8000/stats

Future Enhancement Ideas

While not implemented yet, these would be valuable additions:

Machine Learning - Train ML model on labeled spam/legitimate PRs
User Reputation - Track user history and adjust scoring
Repository-Specific Rules - Different thresholds per repo
Whitelist/Blacklist - Skip or prioritize certain users
Dashboard UI - Web interface for management
Webhook for Alerts - Send notifications to Slack/Discord
A/B Testing - Test different detection strategies
Analytics - Detailed reporting on spam patterns
API for Manual Review - Endpoints to review/override decisions
GitHub App - Convert to GitHub App for better integration

Limitations

Current Limitations

GitHub API rate limits - 5000 requests/hour with token
JSON storage - Keeps only the 100 most recent PRs by default (configurable via MAX_TRACKED_PRS)
Single-threaded - One request at a time (can be scaled by running multiple instances)
English-focused - AI text patterns are English-only
Heuristic-based - Not ML-powered (yet)

Known Edge Cases

False positives possible on legitimate small PRs
Bot accounts may trigger false positives
AI text detection may miss indicators in non-English PRs
Very active repos may need database storage

Mitigation

Adjust thresholds for your repository
Whitelist trusted contributors (not yet implemented; see Future Enhancement Ideas)
Monitor false positive rate
Use dry-run mode for testing

Success Metrics

Track these to measure effectiveness:

True Positives - Spam correctly identified and closed
False Positives - Legitimate PRs incorrectly flagged
False Negatives - Spam that got through
Time Saved - Maintainer hours saved from spam review
Response Time - Speed of spam detection and closure
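The first three counts combine into two standard ratios: precision (how many flagged PRs were really spam) and recall (how much of the spam was caught). A small sketch with a hypothetical function name and made-up example counts:

```python
def precision_recall(true_pos: int, false_pos: int, false_neg: int) -> tuple[float, float]:
    """Return (precision, recall); each is 0.0 when its denominator is zero."""
    flagged = true_pos + false_pos       # everything the tool closed
    actual_spam = true_pos + false_neg   # everything that really was spam
    precision = true_pos / flagged if flagged else 0.0
    recall = true_pos / actual_spam if actual_spam else 0.0
    return precision, recall


# Illustrative numbers only: 45 spam PRs closed, 5 legitimate PRs closed,
# 15 spam PRs missed.
precision, recall = precision_recall(45, 5, 15)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Raising SPAM_SCORE_THRESHOLD generally trades recall for precision, so tracking both shows whether a threshold change helped.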

License

MIT License - see LICENSE file for details.

Support

Documentation: README.md, DEPLOYMENT.md, CUSTOMIZATION.md
Issues: https://github.com/Anorak001/PR-Sentinel/issues
Discussions: Use GitHub Discussions for questions
