Monitoring & Observability

Comprehensive monitoring and observability setup for VideoGen Messenger.

Overview

Multi-layered monitoring approach covering:

  • Application performance
  • Infrastructure health
  • Business metrics
  • User experience
  • Security events

Monitoring Stack

AWS CloudWatch

  • Logs: Centralized log aggregation
  • Metrics: Infrastructure and application metrics
  • Alarms: Automated alerting
  • Dashboards: Real-time visualization

Sentry

  • Error Tracking: Application errors and exceptions
  • Performance Monitoring: Transaction traces
  • Release Tracking: Errors tied to the release that introduced them
  • Issue Management: Error grouping and assignment

New Relic (Optional)

  • APM: Application Performance Monitoring
  • Infrastructure: Server monitoring
  • Browser: Real User Monitoring (RUM)
  • Synthetics: Uptime monitoring

Application Metrics

Key Performance Indicators (KPIs)

Response Time:

  • Target: p50 < 200ms, p95 < 1s, p99 < 2s
  • Monitor: API endpoint latency
  • Alert: p95 > 2s for 5 minutes
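As a rough illustration of how the percentile targets above could be checked, the sketch below computes nearest-rank percentiles over a window of latency samples. The function names and the sample data are hypothetical; in practice the percentiles would normally come from CloudWatch or the APM tool rather than application code.

```javascript
// Nearest-rank percentile over a list of latency samples (milliseconds).
function percentile(samples, p) {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  // Multiply before dividing to avoid floating-point drift in the rank.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Targets from the list above, in milliseconds.
const TARGETS_MS = { p50: 200, p95: 1000, p99: 2000 };

// Compares each percentile of the window against its target.
function checkLatencyTargets(samples) {
  const result = {};
  for (const [name, target] of Object.entries(TARGETS_MS)) {
    const value = percentile(samples, Number(name.slice(1)));
    result[name] = { value, target, ok: value < target };
  }
  return result;
}

// Hypothetical window of 100 requests: 90 fast, 8 slower, 2 very slow.
const samples = Array.from({ length: 100 }, (_, i) =>
  i < 90 ? 120 : i < 98 ? 800 : 2500
);
const report = checkLatencyTargets(samples);
console.log(report);
```

With this sample window, p50 and p95 meet their targets while p99 does not, because the two slowest requests fall inside the top 1%.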

Error Rate:

  • Target: < 0.1%
  • Monitor: 4xx and 5xx responses
  • Alert: Error rate > 1% for 5 minutes
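The "for 5 minutes" part of the alert means the threshold must be breached in every evaluation period, not just once. A minimal sketch of that logic, assuming one bucket of request counts per minute (the bucket shape and function names are illustrative, not part of the project):

```javascript
// Error rate for one bucket: share of requests that returned 4xx or 5xx.
function errorRate(bucket) {
  return bucket.total === 0 ? 0 : bucket.errors / bucket.total;
}

// True only when each of the last `minutes` buckets exceeds the threshold.
function sustainedBreach(buckets, threshold, minutes) {
  if (buckets.length < minutes) return false;
  return buckets.slice(-minutes).every((b) => errorRate(b) > threshold);
}

// Hypothetical per-minute buckets, oldest first.
const buckets = [
  { total: 1000, errors: 2 },
  { total: 1000, errors: 15 },
  { total: 1000, errors: 20 },
  { total: 1000, errors: 12 },
  { total: 1000, errors: 30 },
  { total: 1000, errors: 11 },
];

console.log(sustainedBreach(buckets, 0.01, 5));
```

A single recovered minute resets the condition, which keeps short spikes from paging anyone.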

Throughput:

  • Monitor: Requests per second
  • Alert: Sudden drops > 50%

Availability:

  • Target: 99.9% uptime
  • Monitor: Health check endpoint
  • Alert: Health check failure
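To make the 99.9% target concrete, it helps to translate it into an allowed-downtime budget. The helper names below are hypothetical; the arithmetic is simply the period length multiplied by the allowed failure fraction.

```javascript
// Allowed downtime, in minutes, for an availability target over a period.
function downtimeBudgetMinutes(targetPercent, periodDays) {
  const totalMinutes = periodDays * 24 * 60;
  return (totalMinutes * (100 - targetPercent)) / 100;
}

// Availability (in percent) measured from a list of health check results.
function measuredAvailability(checks) {
  if (checks.length === 0) return null;
  const passed = checks.filter((ok) => ok).length;
  return (passed / checks.length) * 100;
}

// Roughly 43.2 minutes of downtime are allowed per 30 days at 99.9%.
console.log(downtimeBudgetMinutes(99.9, 30).toFixed(1));
```

Over a 365-day year the same target allows about 8.76 hours of downtime.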

Custom Metrics

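// Example: Track generation metrics
// Assumes the prom-client package, whose metric constructors take a
// configuration object with `name` and `help`.
const { Counter, Histogram, Gauge } = require('prom-client');

const metrics = {
  generationRequests: new Counter({
    name: 'generation_requests_total',
    help: 'Total video generation requests'
  }),
  generationDuration: new Histogram({
    name: 'generation_duration_seconds',
    help: 'Video generation duration in seconds'
  }),
  generationErrors: new Counter({
    name: 'generation_errors_total',
    help: 'Total failed video generations'
  }),
  activeJobs: new Gauge({
    name: 'active_generation_jobs',
    help: 'Generation jobs currently in progress'
  })
};

// Increment counters
metrics.generationRequests.inc();

// Record duration: startTimer() returns a function that records the elapsed seconds
const timer = metrics.generationDuration.startTimer();
// ... do work ...
timer();

// Track active jobs (queueSize is the current number of jobs, supplied by the caller)
metrics.activeJobs.set(queueSize);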

Infrastructure Metrics

ECS Metrics

CPU Utilization:

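# Replace CLUSTER_NAME and SERVICE_NAME with your own values
aws cloudwatch put-metric-alarm \
  --alarm-name high-cpu \
  --metric-name CPUUtilization \
  --namespace AWS/ECS \
  --dimensions Name=ClusterName,Value=CLUSTER_NAME Name=ServiceName,Value=SERVICE_NAME \
  --statistic Average \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold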

Memory Utilization:

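# Replace CLUSTER_NAME and SERVICE_NAME with your own values
aws cloudwatch put-metric-alarm \
  --alarm-name high-memory \
  --metric-name MemoryUtilization \
  --namespace AWS/ECS \
  --dimensions Name=ClusterName,Value=CLUSTER_NAME Name=ServiceName,Value=SERVICE_NAME \
  --statistic Average \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 80 \
  --comparison-operator GreaterThanThreshold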

RDS Metrics

Database Connections:

  • Monitor: DatabaseConnections
  • Alert: > 90% of max connections

Read/Write Latency:

  • Monitor: ReadLatency, WriteLatency
  • Alert: > 100ms sustained

CPU & Memory:

  • Monitor: CPUUtilization, FreeableMemory
  • Alert: CPU > 80%, Memory < 20%

ElastiCache Metrics

Cache Hit Rate:

  • Monitor: CacheHitRate
  • Target: > 80%
  • Alert: < 60% for 10 minutes

Evictions:

  • Monitor: Evictions
  • Alert: Spike in evictions

Network:

  • Monitor: NetworkBytesIn, NetworkBytesOut
  • Alert: Approaching max bandwidth

Logging

Log Levels

// Use appropriate log levels
logger.error('Critical error', { error });  // Production issues
logger.warn('Warning', { data });           // Potential issues
logger.info('Info', { data });              // Important events
logger.debug('Debug', { data });            // Detailed debugging

Structured Logging

logger.info('Video generated', {
  jobId: 'job-123',
  userId: 'user-456',
  duration: 5,
  quality: 'hd',
  provider: 'veo3',
  generationTime: 45.2,
  timestamp: new Date().toISOString()
});
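The `logger` object used above is not defined in this document. As a minimal sketch of what a structured logger with these four levels could look like, the following writes one JSON object per line using only the Node.js standard library. `createLogger` and its options are hypothetical; a production setup would more likely use an established logging library.

```javascript
// Lower rank means higher severity.
const LEVELS = { error: 0, warn: 1, info: 2, debug: 3 };

// JSON.stringify drops Error fields, so expand them explicitly.
function serialize(value) {
  return value instanceof Error
    ? { name: value.name, message: value.message, stack: value.stack }
    : value;
}

function createLogger({
  level = 'info',
  write = (line) => process.stdout.write(line + '\n'),
} = {}) {
  const threshold = LEVELS[level];
  const logger = {};
  for (const [name, rank] of Object.entries(LEVELS)) {
    logger[name] = (message, meta = {}) => {
      if (rank > threshold) return; // below the configured verbosity
      const fields = Object.fromEntries(
        Object.entries(meta).map(([key, value]) => [key, serialize(value)])
      );
      write(
        JSON.stringify({
          timestamp: new Date().toISOString(),
          level: name,
          message,
          ...fields,
        })
      );
    };
  }
  return logger;
}

const logger = createLogger({ level: 'info' });
logger.info('Video generated', { jobId: 'job-123', provider: 'veo3' });
logger.debug('Not written at info level');
```

One JSON object per line keeps the output easy to query in CloudWatch Logs Insights, since JSON fields are discovered automatically.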

Log Aggregation

CloudWatch Logs Insights Queries:

# Error count by endpoint
fields @timestamp, @message
| filter @message like /ERROR/
| stats count(*) as errorCount by endpoint
| sort errorCount desc

# Slow requests (duration in milliseconds)
fields @timestamp, duration, endpoint
| filter duration > 1000
| sort duration desc
| limit 20

# Generation outcomes by status
fields @timestamp, status
| filter operation = "generation"
| stats count(*) by status

Log Retention

  • Production: 30 days
  • Staging: 7 days
  • Development: 3 days
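The retention periods above can be applied with `aws logs put-retention-policy`. The sketch below only generates the commands; the log group naming scheme (`/videogen/<environment>/api`) is an assumption made for illustration and should be replaced with the real log group names.

```javascript
// Retention per environment, in days, as listed above.
const RETENTION_DAYS = { production: 30, staging: 7, development: 3 };

// Builds the AWS CLI command for one environment.
// The log group name pattern is hypothetical.
function retentionCommand(environment) {
  const days = RETENTION_DAYS[environment];
  if (days === undefined) {
    throw new Error(`unknown environment: ${environment}`);
  }
  return [
    'aws logs put-retention-policy',
    `--log-group-name /videogen/${environment}/api`,
    `--retention-in-days ${days}`,
  ].join(' ');
}

for (const environment of Object.keys(RETENTION_DAYS)) {
  console.log(retentionCommand(environment));
}
```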

Dashboards

CloudWatch Dashboard

{
  "widgets": [
    {
      "type": "metric",
      "properties": {
        "metrics": [
          ["AWS/ECS", "CPUUtilization"],
          [".", "MemoryUtilization"]
        ],
        "period": 300,
        "stat": "Average",
        "region": "us-east-1",
        "title": "ECS Resources"
      }
    },
    {
      "type": "metric",
      "properties": {
        "metrics": [
          ["AWS/ApplicationELB", "TargetResponseTime"],
          [".", "RequestCount", { "stat": "Sum" }],
          [".", "HTTPCode_Target_5XX_Count", { "stat": "Sum" }]
        ],
        "period": 300,
        "stat": "Average",
        "region": "us-east-1",
        "title": "API Performance"
      }
    }
  ]
}

Key Dashboards

  1. Overview Dashboard:

    • Request rate
    • Error rate
    • Response time
    • System health
  2. Infrastructure Dashboard:

    • CPU/Memory usage
    • Database metrics
    • Cache metrics
    • Queue depth
  3. Business Dashboard:

    • Active users
    • Videos generated
    • Search queries
    • Popular content
  4. Security Dashboard:

    • Failed auth attempts
    • Rate limit violations
    • Unusual patterns
    • API abuse

Alerting

Alert Channels

  • PagerDuty: Critical production issues
  • Slack: General alerts and warnings
  • Email: Non-urgent notifications
  • SNS: AWS service alerts

Alert Rules

Critical Alerts (PagerDuty):

  • Service down (health check fails)
  • Error rate > 5%
  • Database connection failures
  • Queue processing stopped

Warning Alerts (Slack):

  • High latency (p95 > 2s)
  • High CPU/Memory (> 80%)
  • Low cache hit rate (< 60%)
  • Disk space low (< 20%)

Info Alerts (Email):

  • Deployment completed
  • Scheduled tasks completed
  • Daily/weekly reports
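The severity-to-channel mapping above can be expressed as a small routing table. This is a sketch only: the `senders` object stands in for real PagerDuty, Slack, and email integrations, which are not shown in this document.

```javascript
// Severity levels mapped to the channels listed above.
const ROUTES = {
  critical: ['pagerduty'],
  warning: ['slack'],
  info: ['email'],
};

// Sends the alert to every channel configured for its severity.
function routeAlert(alert, senders) {
  const channels = ROUTES[alert.severity];
  if (!channels) throw new Error(`unknown severity: ${alert.severity}`);
  for (const channel of channels) {
    senders[channel](alert);
  }
  return channels;
}

// Stand-in senders that only record what they would deliver.
const delivered = [];
const senders = {
  pagerduty: (alert) => delivered.push(`pagerduty: ${alert.title}`),
  slack: (alert) => delivered.push(`slack: ${alert.title}`),
  email: (alert) => delivered.push(`email: ${alert.title}`),
};

routeAlert({ severity: 'critical', title: 'Health check failing' }, senders);
routeAlert({ severity: 'warning', title: 'p95 latency above 2s' }, senders);
console.log(delivered);
```

Keeping the mapping in one table makes it easy to review which conditions can page someone.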

Alert Configuration

# Create SNS topic
aws sns create-topic --name videogen-alerts

# Subscribe to topic
aws sns subscribe \
  --topic-arn arn:aws:sns:region:account:videogen-alerts \
  --protocol email \
  --notification-endpoint alerts@example.com

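# Create alarm with SNS action
# (replace LOAD_BALANCER with the load balancer's dimension value, e.g. app/<name>/<id>)
aws cloudwatch put-metric-alarm \
  --alarm-name api-errors-high \
  --alarm-description "High API error rate" \
  --metric-name HTTPCode_Target_5XX_Count \
  --namespace AWS/ApplicationELB \
  --dimensions Name=LoadBalancer,Value=LOAD_BALANCER \
  --statistic Sum \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 10 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:region:account:videogen-alerts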

Health Checks

Application Health Endpoint

// GET /health
{
  "status": "ok",
  "timestamp": "2024-01-15T10:30:00Z",
  "uptime": 86400,
  "services": {
    "database": "ok",
    "redis": "ok",
    "elasticsearch": "ok",
    "s3": "ok"
  }
}

Service Health Checks

async function healthCheck() {
  const health = {
    status: 'ok',
    services: {}
  };

  // Database
  try {
    await db.query('SELECT 1');
    health.services.database = 'ok';
  } catch (error) {
    health.services.database = 'error';
    health.status = 'degraded';
  }

  // Redis
  try {
    await redis.ping();
    health.services.redis = 'ok';
  } catch (error) {
    health.services.redis = 'error';
    health.status = 'degraded';
  }

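  // Elasticsearch (cluster status is 'green', 'yellow', or 'red')
  try {
    // Assumes a client that resolves to the response body (v8 client);
    // with the v7 client, read esHealth.body.status instead.
    const esHealth = await es.cluster.health();
    const esOk = esHealth.status !== 'red';
    health.services.elasticsearch = esOk ? 'ok' : 'error';
    if (!esOk) health.status = 'degraded';
  } catch (error) {
    health.services.elasticsearch = 'error';
    health.status = 'degraded';
  }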

  return health;
}
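One way to expose the result of `healthCheck()` over HTTP is sketched below with the built-in `http` module. `createHealthServer` and `statusCodeFor` are hypothetical names, and returning 503 for any non-ok status is a design choice, not something the project prescribes.

```javascript
const http = require('node:http');

// Map the aggregate status to an HTTP status code so that load balancers
// and uptime checks can react without parsing the body.
function statusCodeFor(health) {
  return health.status === 'ok' ? 200 : 503;
}

// healthCheck is any async function returning { status, services }.
function createHealthServer(healthCheck) {
  return http.createServer(async (req, res) => {
    if (req.method !== 'GET' || req.url !== '/health') {
      res.statusCode = 404;
      res.end();
      return;
    }
    let health;
    try {
      health = await healthCheck();
    } catch (error) {
      // The check itself failed; report it rather than hanging the request.
      health = { status: 'error', services: {} };
    }
    const body = JSON.stringify({
      ...health,
      timestamp: new Date().toISOString(),
      uptime: Math.floor(process.uptime()),
    });
    res.writeHead(statusCodeFor(health), { 'Content-Type': 'application/json' });
    res.end(body);
  });
}

// Usage (port 3000 is an arbitrary example):
// createHealthServer(healthCheck).listen(3000);
```

If a degraded dependency should not take the instance out of rotation, `statusCodeFor` could return 200 for `degraded` and reserve 503 for a hard failure.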

Performance Monitoring

Request Tracing

const { v4: uuidv4 } = require('uuid');

// Add request ID to all logs
app.use((req, res, next) => {
  req.id = uuidv4();
  logger.info('Request started', {
    requestId: req.id,
    method: req.method,
    path: req.path,
    ip: req.ip
  });
  next();
});

// Log request completion
app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = Date.now() - start;
    logger.info('Request completed', {
      requestId: req.id,
      statusCode: res.statusCode,
      duration
    });
  });
  next();
});
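The middleware above attaches the ID to `req`, so code that does not receive `req` cannot log it. A possible extension, sketched here under the assumption of Express-style middleware, uses Node's `AsyncLocalStorage` to make the request ID available anywhere in the call chain. The `X-Request-Id` header handling and the function names are illustrative choices.

```javascript
const { AsyncLocalStorage } = require('node:async_hooks');
const { randomUUID } = require('node:crypto');

const requestContext = new AsyncLocalStorage();

// Express-style middleware: reuse an incoming X-Request-Id if present,
// otherwise generate one, then run the rest of the chain in that context.
function requestIdMiddleware(req, res, next) {
  const incoming = req.headers['x-request-id'];
  req.id =
    typeof incoming === 'string' && incoming.length > 0
      ? incoming
      : randomUUID();
  res.setHeader('X-Request-Id', req.id);
  requestContext.run({ requestId: req.id }, next);
}

// Returns the ID of the request being handled, or undefined outside one.
function currentRequestId() {
  const store = requestContext.getStore();
  return store ? store.requestId : undefined;
}

// Usage with an Express app (not created here):
// app.use(requestIdMiddleware);
// logger.info('Cache miss', { requestId: currentRequestId() });
```

Echoing the ID back in the response header lets a client quote it in a bug report, which makes the matching log lines easy to find.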

Slow Query Detection

// Log slow database queries (threshold: 100 ms).
// The 'query' event and its fields depend on the database client in use.
db.on('query', (query) => {
  const duration = query.duration;
  if (duration > 100) {
    logger.warn('Slow query detected', {
      sql: query.sql,
      duration,
      params: query.params
    });
  }
});

User Experience Monitoring

Real User Monitoring (RUM)

Track client-side metrics:

  • Page load time
  • API response time
  • Error rate
  • User flows

Synthetic Monitoring

Automated checks from different locations:

  • Uptime monitoring
  • Performance testing
  • Functionality checks

Security Monitoring

Security Events

Monitor and alert on:

  • Failed login attempts (> 5 in 5 minutes)
  • Privilege escalation attempts
  • Unusual API patterns
  • Large data exports
  • Rate limit violations
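The "more than 5 failed logins in 5 minutes" rule is a sliding-window count. A minimal in-memory sketch follows; the class name is hypothetical, and a multi-instance deployment would need shared storage such as Redis instead of a local `Map`.

```javascript
// Tracks failed logins per key (for example, user ID or IP address).
class FailedLoginTracker {
  constructor({ maxFailures = 5, windowMs = 5 * 60 * 1000 } = {}) {
    this.maxFailures = maxFailures;
    this.windowMs = windowMs;
    this.failures = new Map(); // key -> timestamps (ms) of recent failures
  }

  // Records a failure and returns true once the key exceeds the limit
  // within the window.
  recordFailure(key, now = Date.now()) {
    const cutoff = now - this.windowMs;
    const recent = (this.failures.get(key) || []).filter((t) => t > cutoff);
    recent.push(now);
    this.failures.set(key, recent);
    return recent.length > this.maxFailures;
  }
}

const tracker = new FailedLoginTracker();
let shouldAlert = false;
// Six failures one second apart: the sixth crosses the "> 5" threshold.
for (let i = 0; i < 6; i += 1) {
  shouldAlert = tracker.recordFailure('user-456', i * 1000);
}
console.log(shouldAlert);
```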

Security Metrics

// Track authentication events
logger.info('Login attempt', {
  userId: user.id,
  success: true,
  ip: req.ip,
  userAgent: req.headers['user-agent']
});

// Track sensitive operations
logger.warn('Data export requested', {
  userId: user.id,
  dataType: 'user_videos',
  count: 150
});

Cost Monitoring

AWS Cost Explorer

Monitor costs by:

  • Service (ECS, RDS, S3, etc.)
  • Tag (environment, team, etc.)
  • Time period (daily, weekly, monthly)

Budget Alerts

# Create budget
aws budgets create-budget \
  --account-id 123456789012 \
  --budget file://budget.json \
  --notifications-with-subscribers file://notifications.json
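The command above reads `budget.json` and `notifications.json`, which are not shown. The sketch below generates both files with the field names used by the AWS Budgets API. The budget name, the 1000 USD limit, the 80% threshold, and the email address are placeholder values chosen for illustration.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Monthly cost budget (name and amount are placeholders).
const budget = {
  BudgetName: 'videogen-monthly',
  BudgetLimit: { Amount: '1000', Unit: 'USD' },
  TimeUnit: 'MONTHLY',
  BudgetType: 'COST',
};

// Notify by email when actual spend passes 80% of the limit.
const notifications = [
  {
    Notification: {
      NotificationType: 'ACTUAL',
      ComparisonOperator: 'GREATER_THAN',
      Threshold: 80,
      ThresholdType: 'PERCENTAGE',
    },
    Subscribers: [{ SubscriptionType: 'EMAIL', Address: 'alerts@example.com' }],
  },
];

// Writes both files into `dir` and returns their paths.
function writeBudgetFiles(dir) {
  const budgetPath = path.join(dir, 'budget.json');
  const notificationsPath = path.join(dir, 'notifications.json');
  fs.writeFileSync(budgetPath, JSON.stringify(budget, null, 2));
  fs.writeFileSync(notificationsPath, JSON.stringify(notifications, null, 2));
  return { budgetPath, notificationsPath };
}

// Written to a temporary directory here; run the CLI from that directory.
const outDir = fs.mkdtempSync(path.join(os.tmpdir(), 'videogen-budget-'));
const written = writeBudgetFiles(outDir);
console.log(written);
```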

Incident Response

Runbook

When an alert triggers:

  1. Acknowledge: Acknowledge alert in PagerDuty/Slack
  2. Assess: Check dashboards and logs
  3. Diagnose: Identify root cause
  4. Fix: Apply fix or rollback
  5. Verify: Confirm resolution
  6. Document: Update incident log

Post-Mortem

After incidents:

  • Timeline of events
  • Root cause analysis
  • Impact assessment
  • Action items
  • Prevention measures

Monitoring Checklist

  • CloudWatch logs configured
  • CloudWatch metrics enabled
  • Alarms configured
  • Dashboards created
  • Sentry integrated
  • Health checks working
  • Log retention set
  • Alert channels configured
  • On-call rotation set up
  • Runbooks documented

Resources