- High-performance webhook server using FastAPI and Uvicorn
- Signature verification for secure GitHub webhook handling
- Event filtering - only processes `pull_request` events with actions `opened` and `reopened`
- RESTful API endpoints for monitoring and health checks
- Detects PRs that only modify README files
- Scores higher for minimal changes (< 10 lines)
- Reduces score for larger README changes (10-50 lines)
- Recognizes common documentation file patterns (README.md, README.rst, etc.)
- Flags PRs with very few line changes
- Full score for ≤5 lines changed
- Partial score for 6-15 lines changed
- Considers both additions and deletions
- Scans PR title, body, and commit messages
- Detects common AI-generated phrases:
- "as an AI"
- "it's worth noting"
- "delve into"
- "paradigm", "landscape", "realm"
- Many more indicators
- Higher score for multiple indicators found
- Case-insensitive matching
- Identifies low-effort PR descriptions:
- "typo fix"
- "minor fix"
- "quick fix"
- Requires short description (< 50 chars) for positive match
- Catches common spam PR titles
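
The line-count tiers and phrase matching described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the 0.5 value for "partial score", and the rule that three or more phrase hits saturate the score are all assumptions, and the phrase list is only the subset quoted above.

```python
# Subset of the phrases listed above; the real list is longer.
AI_TEXT_INDICATORS = [
    "as an ai",
    "it's worth noting",
    "delve into",
    "paradigm",
    "landscape",
    "realm",
]


def minimal_changes_score(additions: int, deletions: int) -> float:
    """Return a 0..1 signal based on total changed lines (additions + deletions)."""
    total = additions + deletions
    if total <= 5:
        return 1.0   # full score for <= 5 lines changed
    if total <= 15:
        return 0.5   # partial score for 6-15 lines (0.5 is an assumed value)
    return 0.0


def ai_text_score(title: str, body: str, commit_messages: list) -> float:
    """Return a 0..1 signal that grows with the number of indicator phrases found."""
    # Lower-casing the combined text gives case-insensitive matching.
    text = " ".join([title, body, *commit_messages]).lower()
    hits = sum(1 for phrase in AI_TEXT_INDICATORS if phrase in text)
    # Assumption: three or more distinct phrases count as a full signal.
    return min(hits / 3, 1.0)
```

For example, `minimal_changes_score(2, 1)` falls in the full-score tier, while a PR that adds 10 and deletes 10 lines contributes nothing from this check.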
When the spam score exceeds the threshold (default: 70/100):
- Auto-Comment
  - Professional, informative message
  - Includes spam score
  - Lists specific detection reasons
  - Provides guidance for legitimate contributions
- Auto-Close
  - Automatically closes the PR
  - Prevents spam from cluttering repository
  - Maintainers can reopen if false positive
- Tracking
  - Logs all actions to JSON storage
  - Maintains history for analysis
  - Helps improve detection over time
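
A rough sketch of the threshold decision and the comment it produces is shown below. The function names, the returned dictionary shape, and the comment wording are hypothetical; only the "score exceeds threshold" rule and the comment contents (score, reasons, guidance) come from the description above.

```python
def build_comment(score: int, reasons: list) -> str:
    """Compose the auto-comment: score, detection reasons, and guidance."""
    lines = [
        "This pull request was automatically flagged as likely spam.",
        f"Spam score: {score}/100",
        "Detection reasons:",
        *[f"- {reason}" for reason in reasons],
        "If this is a legitimate contribution, please add more detail; "
        "a maintainer can reopen the PR.",
    ]
    return "\n".join(lines)


def decide(score: int, reasons: list, threshold: int = 70) -> dict:
    """Return the moderation action for a scored PR."""
    # "Exceeds" is read strictly: a score equal to the threshold is not acted on.
    if score <= threshold:
        return {"action": "none"}
    return {
        "action": "comment_and_close",
        "comment": build_comment(score, reasons),
    }
```

Posting the comment and closing the PR through the GitHub API is left out here; the sketch only covers the decision itself.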
- No database required - uses simple JSON file
- Automatic rotation - keeps only recent PRs (configurable, default: 100)
- Structured data - tracks repo, PR number, user, score, reasons, timestamps
- Easy to backup - just copy the JSON file
- Human-readable - can be manually reviewed/edited if needed
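
The JSON tracking with rotation described above might look something like this. It is a sketch under assumptions: `track_pr`, the record fields, and the file layout (a single JSON array) are illustrative and not confirmed by the project.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def track_pr(path: Path, record: dict, max_tracked: int = 100) -> list:
    """Append a record to the JSON file and keep only the most recent entries.

    Assumes max_tracked >= 1 and that the file holds a single JSON array.
    """
    history = json.loads(path.read_text()) if path.exists() else []
    # Stamp each record so the history can be reviewed chronologically.
    stamped = {**record, "timestamp": datetime.now(timezone.utc).isoformat()}
    history.append(stamped)
    # Rotation: drop the oldest entries beyond the configured limit.
    history = history[-max_tracked:]
    path.write_text(json.dumps(history, indent=2))
    return history
```

Because the whole file is rewritten on every call, this approach only suits small histories, which matches the default limit of 100 PRs.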
- Rich PR data - fetches complete PR information including:
- File changes and diffs
- Commit messages
- PR metadata
- User information
- Fallback mechanism - uses webhook data if API fails
- Rate limit aware - handles GitHub API limits gracefully
- Secure authentication - uses Personal Access Token
All settings are customizable via environment variables:
GITHUB_TOKEN # GitHub authentication
GITHUB_WEBHOOK_SECRET # Webhook security
SPAM_SCORE_THRESHOLD # Detection sensitivity (default: 70)
PR_TRACKING_FILE # Storage location
MAX_TRACKED_PRS # Storage size limit (default: 100)
HOST # Server host (default: 0.0.0.0)
PORT # Server port (default: 8000)
- Service information
- Status check
- API version
- Health check endpoint
- Returns 200 OK if service is running
- Useful for monitoring and load balancers
- Statistics dashboard
- Total PRs tracked
- Spam PRs detected
- Recent PR history (last 10)
- JSON response format
- Main GitHub webhook endpoint
- Verifies signature
- Processes pull_request events
- Returns analysis results
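
The first two webhook steps, signature verification and event filtering, can be sketched as below. GitHub sends the HMAC-SHA256 of the raw request body in the `X-Hub-Signature-256` header as `sha256=<hexdigest>`; the function names here are hypothetical, and how the project wires this into its FastAPI route is not shown.

```python
import hashlib
import hmac
from typing import Optional


def verify_signature(secret: str, payload: bytes, signature_header: Optional[str]) -> bool:
    """Check the X-Hub-Signature-256 header against the raw request body."""
    if not signature_header or not signature_header.startswith("sha256="):
        return False
    digest = hmac.new(secret.encode(), payload, hashlib.sha256).hexdigest()
    expected = "sha256=" + digest
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(expected.encode(), signature_header.encode())


def should_process(event: str, action: str) -> bool:
    """Only pull_request events with action opened or reopened are analyzed."""
    return event == "pull_request" and action in {"opened", "reopened"}
```

The signature must be computed over the exact bytes received, before any JSON parsing, since re-serializing the payload can change whitespace and break the comparison.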
python main.py
# or
./run.sh
docker-compose up -d
- Heroku
- AWS EC2
- Google Cloud Run
- DigitalOcean App Platform
- Railway
See DEPLOYMENT.md for detailed guides.
# More lenient (fewer false positives)
SPAM_SCORE_THRESHOLD = 80
# More strict (catches more spam)
SPAM_SCORE_THRESHOLD = 60
WEIGHTS = {
"trivial_readme": 40, # Prioritize README checks
"minimal_changes": 30, # Prioritize code quality
"generic_ai_text": 25, # Reduce AI detection
"suspicious_patterns": 5 # Lower pattern matching
}
AI_TEXT_INDICATORS = [
# Add your observed spam patterns
"your custom pattern",
"another pattern"
]
SUSPICIOUS_PATTERNS = [
# Add suspicious phrases
"first contribution",
"test pr"
]
See CUSTOMIZATION.md for a detailed customization guide.
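
To show how changing the weights shifts the outcome, here is one plausible way they could combine into a 0-100 score. This is an assumption for illustration only: it supposes each detector reports a signal between 0 and 1 and that the total is a weighted sum, which the project does not confirm. The function name `total_spam_score` is hypothetical.

```python
# Example weights from the customization snippet above (they sum to 100).
WEIGHTS = {
    "trivial_readme": 40,
    "minimal_changes": 30,
    "generic_ai_text": 25,
    "suspicious_patterns": 5,
}


def total_spam_score(signals: dict, weights: dict = WEIGHTS) -> float:
    """Weighted sum of detector signals, each clamped to the range 0..1."""
    total = 0.0
    for name, weight in weights.items():
        signal = max(0.0, min(1.0, signals.get(name, 0.0)))
        total += weight * signal
    return total


# A README-only PR with very few changed lines and no other signals:
score = total_spam_score({"trivial_readme": 1.0, "minimal_changes": 1.0})
```

Under these assumed weights that PR scores 70, which does not exceed the default threshold of 70, so a further signal would be needed before any action is taken.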
- HMAC-SHA256 signature validation
- Prevents unauthorized webhook calls
- Configurable via `GITHUB_WEBHOOK_SECRET`
- Environment variable storage
- No hardcoded credentials
- `.gitignore` excludes sensitive files
- Designed for rate limiting middleware
- Handles GitHub API limits
- Graceful error handling
- Validates webhook payloads
- Type checking with Pydantic
- Sanitizes user input
- Fast analysis - typically < 100ms per PR
- Async-ready - FastAPI supports async operations
- Minimal overhead - lightweight JSON storage
- Single instance - handles 100s of PRs/day
- Multiple instances - can run behind load balancer
- Database-ready - easy to upgrade storage layer
- Memory - ~50-100MB typical
- CPU - Low usage, spikes during analysis
- Storage - < 1MB for JSON file (100 PRs)
- Structured logging with timestamps
- Info, warning, and error levels
- Tracks all PR analysis
- Records auto-moderation actions
curl https://your-domain.com/stats
Returns:
{
"total_tracked": 50,
"spam_detected": 5,
"recent_prs": [...]
}
curl https://your-domain.com/health
Good for:
- Load balancer health checks
- Monitoring systems (Datadog, New Relic, etc.)
- Uptime monitoring
python test_spam_detector.py
Tests cover:
- Trivial README detection
- Minimal changes detection
- AI text detection
- Legitimate PR handling
- Combined spam indicators
# Start server
python main.py
# Send test webhook
curl -X POST http://localhost:8000/webhook \
-H "Content-Type: application/json" \
-H "X-GitHub-Event: pull_request" \
-d @test_payload.json
# Check stats
curl http://localhost:8000/stats
While not implemented yet, these would be valuable additions:
- Machine Learning - Train ML model on labeled spam/legitimate PRs
- User Reputation - Track user history and adjust scoring
- Repository-Specific Rules - Different thresholds per repo
- Whitelist/Blacklist - Skip or prioritize certain users
- Dashboard UI - Web interface for management
- Webhook for Alerts - Send notifications to Slack/Discord
- A/B Testing - Test different detection strategies
- Analytics - Detailed reporting on spam patterns
- API for Manual Review - Endpoints to review/override decisions
- GitHub App - Convert to GitHub App for better integration
- GitHub API rate limits - 5000 requests/hour with token
- JSON storage - Limited to ~100 PRs by default
- Single-threaded - One request at a time (can be scaled)
- English-focused - AI text patterns are English-only
- Heuristic-based - Not ML-powered (yet)
- False positives possible on legitimate small PRs
- Bot accounts may trigger false positives
- AI text may not be detected correctly in non-English PRs
- Very active repos may need database storage
- Adjust thresholds for your repository
- Whitelist trusted contributors
- Monitor false positive rate
- Use dry-run mode for testing
Track these to measure effectiveness:
- True Positives - Spam correctly identified and closed
- False Positives - Legitimate PRs incorrectly flagged
- False Negatives - Spam that got through
- Time Saved - Maintainer hours saved from spam review
- Response Time - Speed of spam detection and closure
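
The first three counts above can be summarized as precision and recall. The helper below is a small sketch for that arithmetic; the function name is hypothetical and the counts must come from your own manual review of moderated PRs.

```python
def moderation_metrics(true_pos: int, false_pos: int, false_neg: int) -> dict:
    """Compute precision and recall from reviewed moderation outcomes."""
    flagged = true_pos + false_pos       # everything the tool closed
    actual_spam = true_pos + false_neg   # everything that really was spam
    # Guard against division by zero when there is no data yet.
    precision = true_pos / flagged if flagged else 0.0
    recall = true_pos / actual_spam if actual_spam else 0.0
    return {"precision": precision, "recall": recall}
```

Low precision means legitimate PRs are being closed, which suggests raising `SPAM_SCORE_THRESHOLD`; low recall means spam is getting through, which suggests lowering it.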
MIT License - see LICENSE file for details.
- Documentation: README.md, DEPLOYMENT.md, CUSTOMIZATION.md
- Issues: https://github.com/Anorak001/PR-Sentinel/issues
- Discussions: Use GitHub Discussions for questions