
Sentry Alert Rules & Configuration

Overview

This document describes the Sentry alert configuration for error tracking and monitoring on JudgeFinder.io.

Sentry Setup

Project Configuration

DSN (Data Source Name):

https://[key]@sentry.io/[project-id]

Environment Variables:

SENTRY_DSN=<your-dsn>
NEXT_PUBLIC_SENTRY_DSN=<your-dsn>  # For client-side errors
SENTRY_TRACES_SAMPLE_RATE=0.1      # 10% of transactions
SENTRY_REPLAYS_SESSION_SAMPLE_RATE=0   # Disabled by default
SENTRY_REPLAYS_ON_ERROR_SAMPLE_RATE=1  # 100% on errors
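
The SDK does not read the sample-rate variables automatically in every setup; the sketch below shows one way to resolve them with the defaults documented above (the `resolveSampleRates` helper is hypothetical, not part of the Sentry SDK):

```typescript
interface SampleRates {
  tracesSampleRate: number
  replaysSessionSampleRate: number
  replaysOnErrorSampleRate: number
}

// Resolves sample rates from env vars, falling back to the documented defaults.
function resolveSampleRates(env: Record<string, string | undefined>): SampleRates {
  const num = (v: string | undefined, fallback: number) => {
    const n = Number(v)
    return v !== undefined && v !== '' && Number.isFinite(n) ? n : fallback
  }
  return {
    tracesSampleRate: num(env.SENTRY_TRACES_SAMPLE_RATE, 0.1),      // 10% of transactions
    replaysSessionSampleRate: num(env.SENTRY_REPLAYS_SESSION_SAMPLE_RATE, 0), // disabled
    replaysOnErrorSampleRate: num(env.SENTRY_REPLAYS_ON_ERROR_SAMPLE_RATE, 1), // 100% on errors
  }
}
```

The resolved values can then be passed to `Sentry.init` in the relevant config file.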

Alert Destinations

  1. Email: notifications@judgefinder.io
  2. Slack: #monitoring channel
  3. PagerDuty: Critical incidents (optional)

Alert Rules

Rule 1: Error Rate > 1% (CRITICAL)

Trigger Conditions:

  • When event.type == "error"
  • Frequency: > 1% of all events in last 5 minutes
  • Affected environments: production, staging

Notification Actions:

  • Send to: Slack #monitoring + Email
  • Mention: @channel in Slack
  • Create incident: Yes
  • Page on-call: Yes

Configuration:

name: Error Rate > 1%
filter:
  environment: [production, staging]
  event.type: error
  percentage: 1.0
  timeframe: 5m
actions:
  - slack:
      channel: '#monitoring'
      mention: '@channel'
  - email:
      recipient: 'notifications@judgefinder.io'
  - create_incident: true
  - pagerduty:
      severity: critical

Response SLA: 15 minutes


Rule 2: Unhandled Exceptions (CRITICAL)

Trigger Conditions:

  • When an exception is captured
  • Handled: false
  • First occurrence: Yes
  • Environment: production

Notification Actions:

  • Send to: Email + Slack (immediate)
  • Alert frequency: Once per exception type per hour
  • Create Sentry issue: Yes

Configuration:

name: Unhandled Exceptions
filter:
  environment: production
  exception.handled: false
  first_occurrence: true
actions:
  - slack:
      channel: '#alerts'
      mention: '@developers'
  - email:
      recipient: ['admin@judgefinder.io', 'ops@judgefinder.io']
  - create_issue: true
  - escalate_after: 10m
    to: pagerduty

Response SLA: 5 minutes


Rule 3: API Response Time > 2 Seconds (WARNING)

Trigger Conditions:

  • When measurement.duration > 2000 (milliseconds)
  • Endpoint: /api/*
  • Frequency: > 10 occurrences in 10 minutes
  • Environment: production

Notification Actions:

  • Send to: Slack #performance
  • Trigger performance review
  • Do not alert for first occurrence

Configuration:

name: API Response Time > 2s
filter:
  environment: production
  transaction: /api/*
  measurement.duration: '>2000'
  frequency: '>10:10m'
actions:
  - slack:
      channel: '#performance'
  - metric_alert: response_time_warning
  - log_to_cloudwatch: true

Response SLA: 30 minutes


Rule 4: Database Query > 5 Seconds (WARNING)

Trigger Conditions:

  • When database operation duration > 5000ms
  • Database: Supabase/PostgreSQL
  • Frequency: > 5 occurrences in 15 minutes

Notification Actions:

  • Log to monitoring dashboard
  • Alert engineering team
  • No paging (not critical)

Configuration:

name: Slow Database Query
filter:
  span.op: db.query
  measurement.duration: '>5000'
  frequency: '>5:15m'
actions:
  - slack:
      channel: '#performance'
      message: 'Slow database query detected'
  - log_to_cloudwatch: true
  - create_monitoring_ticket: true

Response SLA: 1 hour


Rule 5: Memory Usage > 80% (WARNING)

Trigger Conditions:

  • When heap memory usage > 80%
  • Frequency: Sustained for 5 minutes
  • Environment: production

Notification Actions:

  • Alert ops team
  • Log memory profile
  • Trigger performance review

Configuration:

name: Memory Usage High
filter:
  environment: production
  memory.heap_usage_percentage: '>80'
  duration: 5m
actions:
  - slack:
      channel: '#ops'
      message: 'High memory usage on production'
  - email:
      recipient: 'ops@judgefinder.io'
  - capture_heap_dump: true

Response SLA: 30 minutes


Rule 6: Authentication Failures (MEDIUM)

Trigger Conditions:

  • When auth-related exceptions occur
  • Exception contains: "auth", "token", "permission"
  • Frequency: > 20 occurrences in 10 minutes

Notification Actions:

  • Alert security team
  • Log all details
  • Review for potential breach

Configuration:

name: Authentication Failures
filter:
  event.type: error
  tags.error_type: [auth, authentication, token]
  frequency: '>20:10m'
actions:
  - slack:
      channel: '#security'
      mention: '@security'
  - email:
      recipient: 'security@judgefinder.io'
  - log_to_cloudwatch:
      log_group: security_events
  - escalate_to: incident_commander

Response SLA: 10 minutes


Rule 7: Third-Party API Failures (MEDIUM)

Trigger Conditions:

  • CourtListener API failures
  • Stripe API failures
  • External service timeouts
  • Frequency: > 5 in 5 minutes

Notification Actions:

  • Alert integration team
  • Log API response
  • Monitor for degradation

Configuration:

name: External API Failures
filter:
  tags.service: [courtlistener, stripe, external]
  event.type: error
  frequency: '>5:5m'
actions:
  - slack:
      channel: '#integrations'
  - email:
      recipient: ['integrations@judgefinder.io']
  - create_monitoring_alert: true
  - auto_disable_feature: false

Response SLA: 30 minutes


Rule 8: Payment Processing Errors (CRITICAL)

Trigger Conditions:

  • Stripe payment failures
  • Transaction declined
  • Webhook processing fails
  • Frequency: > 1 in 5 minutes

Notification Actions:

  • Immediate paging
  • Alert payment ops
  • Create incident
  • Capture transaction details

Configuration:

name: Payment Processing Errors
filter:
  tags.feature: payments
  tags.service: stripe
  event.type: error
  frequency: '>1:5m'
actions:
  - pagerduty:
      severity: critical
      title: 'Payment processing failure'
  - slack:
      channel: '#payments'
      mention: '@payments-team'
  - email:
      recipient: ['payments@judgefinder.io', 'finance@judgefinder.io']
  - create_incident: true
  - capture_transaction_details: true

Response SLA: 5 minutes


Rule 9: Deployment Errors (HIGH)

Trigger Conditions:

  • New release deployment
  • Error rate spike > 5x baseline
  • First hour after deployment
  • Environment: production

Notification Actions:

  • Alert deployment team
  • Suggest rollback
  • Create deployment incident

Configuration:

name: Post-Deployment Error Spike
filter:
  environment: production
  release: release.*
  time_since_release: '0-60m'
  error_rate_increase: '>5x'
actions:
  - slack:
      channel: '#deployments'
      message: 'Error spike detected post-deployment'
      suggest_rollback: true
  - email:
      recipient: 'devops@judgefinder.io'
  - create_deployment_incident: true

Response SLA: 15 minutes


Rule 10: JavaScript Console Errors (LOW)

Trigger Conditions:

  • Client-side JavaScript errors
  • Frequency: > 100 in 1 hour
  • Environment: production
  • Excludes errors on the ignore list

Notification Actions:

  • Log to performance dashboard
  • Weekly summary email
  • No real-time alert

Configuration:

name: JavaScript Console Errors
filter:
  environment: production
  platform: javascript
  event.type: error
  frequency: '>100:1h'
  ignore_tags: [expected_error]
actions:
  - slack:
      channel: '#frontend'
      frequency: weekly
  - email:
      recipient: 'frontend-team@judgefinder.io'
      frequency: daily_digest
  - log_to_cloudwatch: true

Response SLA: No SLA (informational)


Alert Management

Enable/Disable Rules

Via Web Console:

  1. Go to: Sentry Project > Alerts
  2. Toggle rules on/off
  3. Click Save

Via API:

# Disable rule
curl -X PUT https://sentry.io/api/0/projects/{org}/{project}/rules/{rule_id}/ \
  -H "Authorization: Bearer {auth_token}" \
  -H "Content-Type: application/json" \
  -d '{"enabled": false}'

# Enable rule
curl -X PUT https://sentry.io/api/0/projects/{org}/{project}/rules/{rule_id}/ \
  -H "Authorization: Bearer {auth_token}" \
  -H "Content-Type: application/json" \
  -d '{"enabled": true}'

Suppress/Ignore Errors

Ignore specific exception:

filter:
  error.message: "Network error"
  action: ignore

Ignore error by pattern:

filter:
  error.message: "*network*"
  environment: staging
  action: ignore

Resurrecting ignored errors:

  1. Go to: Sentry Project > Settings > Ignore
  2. Click "Restore" for the error
  3. Confirm

Alert Grouping

Smart grouping enabled:

  • Groups similar errors by:
    • Exception type
    • Stack trace
    • Error message fingerprint

Custom grouping:

group_by:
  - exception.type
  - tags.service
  - tags.endpoint
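
In SDK code, the `group_by` keys above correspond to setting a custom fingerprint on the event, typically inside `beforeSend`. A minimal sketch (the event shape here is simplified for illustration, not the SDK's full `Event` type):

```typescript
// Simplified event shape for illustration only.
interface EventLike {
  exceptionType?: string
  tags?: Record<string, string>
}

// Builds a fingerprint matching the group_by keys above:
// exception type, service tag, endpoint tag.
function customFingerprint(event: EventLike): string[] {
  return [
    event.exceptionType ?? '<no-exception>',
    event.tags?.service ?? '<no-service>',
    event.tags?.endpoint ?? '<no-endpoint>',
  ]
}
```

In a real config the returned array would be assigned to `event.fingerprint` before the event is sent.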

Sentry Workflow Integration

Slack Integration

Setup:

  1. Go to: Sentry Project > Integrations > Slack
  2. Authorize Sentry app to Slack workspace
  3. Select channels for notifications
  4. Configure notification frequency

Slack Commands:

@sentry-app ignore
@sentry-app resolve
@sentry-app assign @username
@sentry-app status

Email Notifications

Configure:

  1. Go to: Settings > Notifications > Project Alerts
  2. Add recipients

Digest Email:

  • Frequency: Daily
  • Time: 9:00 AM UTC
  • Include: All alerts from past 24h

GitHub Integration

Auto-create issues:

  1. Go to: Integrations > GitHub
  2. Authorize repository
  3. Configure:
    • Create issue on: Error > threshold
    • Issue template: Use standard template
    • Auto-assign: @dev-team

Issue Template:

## Error: {error.title}

**Severity:** {error.level}
**First seen:** {error.first_seen}
**Last seen:** {error.last_seen}
**Occurrences:** {error.count}

### Stack Trace

\`\`\`
{error.stack_trace}
\`\`\`

### Context

- User: {user.username}
- URL: {request.url}
- Environment: {environment}

### Action Items

- [ ] Investigate root cause
- [ ] Create fix
- [ ] Test fix
- [ ] Deploy to production

Monitoring Dashboard

Sentry Dashboard Setup

Custom dashboard:

// Metrics to display
{
  "widgets": [
    {
      "title": "Error Rate (Last 24h)",
      "type": "stat",
      "query": "event.type:error"
    },
    {
      "title": "Top 10 Errors",
      "type": "table",
      "query": "event.type:error",
      "sort": "-frequency"
    },
    {
      "title": "Error Trend",
      "type": "line_chart",
      "query": "event.type:error",
      "period": "1h"
    },
    {
      "title": "Slowest Transactions",
      "type": "table",
      "query": "measurement.duration:>1000"
    },
    {
      "title": "User Impact",
      "type": "stat",
      "query": "affected_users"
    }
  ]
}

Key Metrics to Track

  1. Error Volume

    • Total errors per day
    • Error growth rate
    • Critical vs. warning errors
  2. Response Time

    • p50, p95, p99 latencies
    • Slow transaction endpoints
    • Database query performance
  3. User Impact

    • Unique users affected
    • Error blast radius
    • Session impact
  4. Error Sources

    • Frontend vs. backend
    • Top error types
    • New errors introduced
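
For reference, the latency figures above are nearest-rank percentile statistics over a window of samples; a minimal sketch of the computation:

```typescript
// Nearest-rank percentile over latency samples in milliseconds.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples')
  const sorted = [...samples].sort((a, b) => a - b)
  // Rank of the p-th percentile, 1-based, converted to a 0-based index.
  const rank = Math.ceil((p / 100) * sorted.length) - 1
  return sorted[Math.max(0, rank)]
}
```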

Alert Testing

Test an Alert Rule

# Create a test event (authenticate with the DSN key via X-Sentry-Auth)
curl -X POST "https://sentry.io/api/[project_id]/store/" \
  -H "Content-Type: application/json" \
  -H "X-Sentry-Auth: Sentry sentry_version=7, sentry_key=[key]" \
  -d '{
    "message": "Test alert",
    "level": "error",
    "exception": {
      "values": [
        {
          "type": "TestException",
          "value": "This is a test error"
        }
      ]
    }
  }'
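
The same payload can also be built in application code and sent with any HTTP client; a small helper (hypothetical, mirroring the curl body above):

```typescript
// Builds the test-event payload sent by the curl command above.
function buildTestEvent(
  message: string,
  errType = 'TestException',
  errValue = 'This is a test error',
) {
  return {
    message,
    level: 'error',
    exception: { values: [{ type: errType, value: errValue }] },
  }
}
```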

Verify Alert Delivery

  1. Create test error in staging
  2. Verify Slack message sent
  3. Verify email received
  4. Confirm PagerDuty incident (if configured)
  5. Review in Sentry dashboard

Best Practices

Alert Fatigue Prevention

  1. Set appropriate thresholds:

    • Avoid alerting on every error
    • Use frequency-based rules
    • Set time windows for context
  2. Use alert grouping:

    • Group similar errors
    • Reduce noise
    • Focus on unique issues
  3. Regular rule reviews:

    • Monthly: Check rule effectiveness
    • Disable unused rules
    • Adjust thresholds based on data
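
A frequency-based rule like Rule 1 reduces to a percentage check over a time window; a sketch of the decision (illustrative only, since Sentry evaluates this server-side):

```typescript
// Returns true when errors exceed thresholdPct percent of all events
// in the window, mirroring Rule 1 (> 1% in 5 minutes).
function shouldAlert(errorCount: number, totalCount: number, thresholdPct: number): boolean {
  if (totalCount === 0) return false
  return (errorCount / totalCount) * 100 > thresholdPct
}
```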

Incident Response

Upon alert:

  1. Acknowledge alert in Slack
  2. Create incident ticket
  3. Assign to on-call engineer
  4. Begin investigation
  5. Implement fix
  6. Verify resolution
  7. Document post-mortem

Performance Monitoring

Track these metrics:

  • Frontend performance (LCP, FID, CLS)
  • API endpoint latency
  • Database query duration
  • Third-party API response times
  • Memory and CPU usage

Integration with Monitoring Stack

Connect to UptimeRobot

When UptimeRobot detects downtime:

  1. Automatically create Sentry alert
  2. Tag with service and endpoint
  3. Correlate with error spike
  4. Provide context for response

Connect to CloudWatch

Stream Sentry errors to CloudWatch:

Sentry -> Webhook -> Lambda -> CloudWatch

Configuration:

webhook:
  url: https://lambda.amazonaws.com/webhook
  events: [error, transaction]
  include_details: true
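
The Lambda step in the pipeline above mostly reshapes the webhook payload into a CloudWatch Logs event; a sketch of that transform (the webhook field names are assumptions and should be checked against Sentry's webhook documentation):

```typescript
// Assumed (simplified) shape of a Sentry webhook payload.
interface SentryWebhookPayload {
  event: { event_id: string; level: string; title: string; timestamp: number }
}

// Maps a Sentry webhook payload to a CloudWatch Logs input event.
function toLogEvent(payload: SentryWebhookPayload) {
  return {
    // Sentry timestamps are epoch seconds; CloudWatch expects milliseconds.
    timestamp: Math.round(payload.event.timestamp * 1000),
    message: JSON.stringify({
      event_id: payload.event.event_id,
      level: payload.event.level,
      title: payload.event.title,
    }),
  }
}
```

The Lambda handler would pass the result to `PutLogEvents` on the `security_events` or application log group.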

Cost Optimization

Event Quota Management

Sentry Plan: Business tier - $9/month per project

Quota allocation:

  • Errors: 10M events/month
  • Transactions: 50M events/month
  • Replays: 1000 sessions/month

Cost optimization:

  1. Use beforeSend to filter events
  2. Reduce sample rates in non-critical envs
  3. Ignore known harmless errors
  4. Archive old issues

Sample Configuration

// sentry.client.config.ts (assumes the Next.js SDK)
import * as Sentry from '@sentry/nextjs'

Sentry.init({
  dsn: process.env.NEXT_PUBLIC_SENTRY_DSN,
  beforeSend(event, hint) {
    // Ignore known errors
    if (event.message?.includes('Network error')) {
      return null
    }

    // Ignore in non-production
    if (process.env.NODE_ENV !== 'production') {
      return null
    }

    return event
  },
  tracesSampleRate: process.env.NODE_ENV === 'production' ? 0.1 : 1.0,
})

Troubleshooting

Events not appearing in Sentry

  1. Verify DSN is correct
  2. Check browser console for errors
  3. Verify beforeSend isn't filtering
  4. Check network requests (DevTools)
  5. Review Sentry project settings

Duplicate alerts

  1. Check rule conditions
  2. Verify grouping is working
  3. Adjust frequency thresholds
  4. Review filter conditions

Missing context

  1. Enable breadcrumbs
  2. Attach user context
  3. Add custom tags
  4. Include HTTP request details

Next Steps

  1. Set up Slack integration
  2. Configure critical rules (Rules 1, 2, 8)
  3. Enable warning rules (Rules 3-7)
  4. Create dashboard
  5. Configure team notifications
  6. Document runbook for each rule
  7. Schedule monthly review

Resources