Comprehensive monitoring and observability setup for VideoGen Messenger.
Multi-layered monitoring approach covering:
- Application performance
- Infrastructure health
- Business metrics
- User experience
- Security events
- Logs: Centralized log aggregation
- Metrics: Infrastructure and application metrics
- Alarms: Automated alerting
- Dashboards: Real-time visualization
- Error Tracking: Application errors and exceptions
- Performance Monitoring: Transaction traces
- Release Tracking: Version tracking
- Issue Management: Error grouping and assignment
- APM: Application Performance Monitoring
- Infrastructure: Server monitoring
- Browser: Real User Monitoring (RUM)
- Synthetics: Uptime monitoring
Response Time:
- Target: p50 < 200ms, p95 < 1s, p99 < 2s
- Monitor: API endpoint latency
- Alert: p95 > 2s for 5 minutes
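The percentile targets above can be checked against raw latency samples. The following is a minimal sketch using the nearest-rank percentile method; the function names and the sample data are hypothetical, and a real setup would more likely read percentiles from the metrics backend.

```javascript
// Nearest-rank percentile: the smallest sample such that at least p% of samples are <= it.
function percentile(samples, p) {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Targets from the list above, in milliseconds.
const TARGETS_MS = { p50: 200, p95: 1000, p99: 2000 };

// Compares each percentile of the samples with its target.
function checkLatencyTargets(samplesMs) {
  const report = {};
  for (const [name, target] of Object.entries(TARGETS_MS)) {
    const value = percentile(samplesMs, Number(name.slice(1)));
    report[name] = { value, target, ok: value < target };
  }
  return report;
}

// Hypothetical request latencies (ms) with one slow outlier.
const latencies = [120, 90, 150, 180, 300, 110, 95, 2500, 130, 160];
const latencyReport = checkLatencyTargets(latencies);
console.log(latencyReport);
```

With only ten samples, a single outlier dominates p95 and p99, which is one reason to evaluate percentiles over a large window before alerting.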
Error Rate:
- Target: < 0.1%
- Monitor: 4xx and 5xx responses
- Alert: Error rate > 1% for 5 minutes
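As an illustration of the error rate thresholds above, this sketch counts 4xx and 5xx responses in a window and classifies the result against the 0.1% target and the 1% alert level. The function names and the sample window are assumptions, and the "for 5 minutes" condition is left out for brevity.

```javascript
// Counts 4xx and 5xx responses as errors, matching the "Monitor" line above.
function errorRate(statusCodes) {
  if (statusCodes.length === 0) return 0;
  const errors = statusCodes.filter((code) => code >= 400 && code <= 599).length;
  return errors / statusCodes.length;
}

const TARGET_RATE = 0.001; // target: < 0.1%
const ALERT_RATE = 0.01;   // alert: > 1%

function classifyErrorRate(rate) {
  if (rate > ALERT_RATE) return 'alert';
  if (rate >= TARGET_RATE) return 'above-target';
  return 'ok';
}

// Hypothetical window: 1000 responses, 15 of them errors.
const windowCodes = [
  ...Array(985).fill(200),
  ...Array(10).fill(500),
  ...Array(5).fill(404),
];
const rate = errorRate(windowCodes);
console.log(rate, classifyErrorRate(rate));
```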
Throughput:
- Monitor: Requests per second
- Alert: Sudden drops > 50%
Availability:
- Target: 99.9% uptime
- Monitor: Health check endpoint
- Alert: Health check failure
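To make the 99.9% uptime target concrete, it can be converted into a downtime budget. This sketch assumes a 30-day measurement window, which the target above does not specify; the function names are hypothetical.

```javascript
// Allowed downtime (minutes) for an availability target over a period.
function downtimeBudgetMinutes(availabilityPercent, periodDays) {
  const totalMinutes = periodDays * 24 * 60;
  return totalMinutes * (1 - availabilityPercent / 100);
}

// Budget left after subtracting downtime already recorded in the period.
function budgetRemainingMinutes(availabilityPercent, periodDays, downtimeSoFarMinutes) {
  return downtimeBudgetMinutes(availabilityPercent, periodDays) - downtimeSoFarMinutes;
}

const budget = downtimeBudgetMinutes(99.9, 30);
console.log(budget.toFixed(1)); // "43.2" minutes per 30 days
console.log(budgetRemainingMinutes(99.9, 30, 12).toFixed(1)); // "31.2"
```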
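// Example: track generation metrics
// (sketch assuming the prom-client library, which takes a config object with name and help)
const { Counter, Histogram, Gauge } = require('prom-client');
const metrics = {
generationRequests: new Counter({ name: 'generation_requests_total', help: 'Total generation requests' }),
generationDuration: new Histogram({ name: 'generation_duration_seconds', help: 'Generation duration in seconds' }),
generationErrors: new Counter({ name: 'generation_errors_total', help: 'Total generation errors' }),
activeJobs: new Gauge({ name: 'active_generation_jobs', help: 'Generation jobs currently active' })
};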
// Increment counters
metrics.generationRequests.inc();
// Record duration
const timer = metrics.generationDuration.startTimer();
// ... do work ...
timer();
// Track active jobs
metrics.activeJobs.set(queueSize);
CPU Utilization:
aws cloudwatch put-metric-alarm \
--alarm-name high-cpu \
--metric-name CPUUtilization \
--namespace AWS/ECS \
--statistic Average \
--period 300 \
--evaluation-periods 1 \
--threshold 80 \
--comparison-operator GreaterThanThreshold
Memory Utilization:
aws cloudwatch put-metric-alarm \
--alarm-name high-memory \
--metric-name MemoryUtilization \
--namespace AWS/ECS \
--statistic Average \
--period 300 \
--evaluation-periods 1 \
--threshold 80 \
--comparison-operator GreaterThanThreshold
Database Connections:
- Monitor: DatabaseConnections
- Alert: > 90% of max connections
Read/Write Latency:
- Monitor: ReadLatency, WriteLatency
- Alert: > 100ms sustained
CPU & Memory:
- Monitor: CPUUtilization, FreeableMemory
- Alert: CPU > 80%, Memory < 20%
Cache Hit Rate:
- Monitor: CacheHitRate
- Target: > 80%
- Alert: < 60% for 10 minutes
Evictions:
- Monitor: Evictions
- Alert: Spike in evictions
Network:
- Monitor: NetworkBytesIn, NetworkBytesOut
- Alert: Approaching max bandwidth
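Several alerts above only fire when a condition is sustained, for example "< 60% for 10 minutes". CloudWatch alarms express this through the period and the number of evaluation periods. The sketch below shows the same idea in plain code; the function name and the sample datapoints are hypothetical.

```javascript
// Returns true when the most recent `periods` datapoints all breach the threshold.
function sustainedBreach(datapoints, periods, isBreaching) {
  if (datapoints.length < periods) return false;
  return datapoints.slice(-periods).every((value) => isBreaching(value));
}

// Hypothetical cache hit rate samples (%), one per 5-minute period, oldest first.
const hitRates = [82, 75, 58, 55];
// "< 60% for 10 minutes" means two consecutive 5-minute periods below 60.
const cacheAlert = sustainedBreach(hitRates, 2, (v) => v < 60);

// Hypothetical read latency samples (ms): a single spike is not sustained.
const readLatencyMs = [20, 140, 30];
const latencyAlert = sustainedBreach(readLatencyMs, 2, (v) => v > 100);

console.log({ cacheAlert, latencyAlert });
```

Requiring consecutive breaching periods trades a slower alert for fewer false alarms caused by short spikes.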
// Use appropriate log levels
logger.error('Critical error', { error }); // Production issues
logger.warn('Warning', { data }); // Potential issues
logger.info('Info', { data }); // Important events
logger.debug('Debug', { data }); // Detailed debugging
logger.info('Video generated', {
jobId: 'job-123',
userId: 'user-456',
duration: 5,
quality: 'hd',
provider: 'veo3',
generationTime: 45.2,
timestamp: new Date().toISOString()
});
CloudWatch Logs Insights Queries:
# Error count by endpoint
fields @timestamp, @message
| filter @message like /ERROR/
| stats count(*) as errorCount by endpoint
| sort errorCount desc
# Slow requests (duration over 1000 ms)
fields @timestamp, duration, endpoint
| filter duration > 1000
| sort duration desc
| limit 20
# Generation outcomes by status
fields @timestamp, status
| filter operation = "generation"
| stats count(*) as total by status
Log Retention:
- Production: 30 days
- Staging: 7 days
- Development: 3 days
{
"widgets": [
{
"type": "metric",
"properties": {
"metrics": [
["AWS/ECS", "CPUUtilization"],
[".", "MemoryUtilization"]
],
"period": 300,
"stat": "Average",
"region": "us-east-1",
"title": "ECS Resources"
}
},
{
"type": "metric",
"properties": {
"metrics": [
["AWS/ApplicationELB", "TargetResponseTime"],
[".", "RequestCount"],
[".", "HTTPCode_Target_5XX_Count"]
],
"period": 300,
"stat": "Average",
"title": "API Performance"
}
}
]
}
Overview Dashboard:
- Request rate
- Error rate
- Response time
- System health
Infrastructure Dashboard:
- CPU/Memory usage
- Database metrics
- Cache metrics
- Queue depth
Business Dashboard:
- Active users
- Videos generated
- Search queries
- Popular content
Security Dashboard:
- Failed auth attempts
- Rate limit violations
- Unusual patterns
- API abuse
- PagerDuty: Critical production issues
- Slack: General alerts and warnings
- Email: Non-urgent notifications
- SNS: AWS service alerts
Critical Alerts (PagerDuty):
- Service down (health check fails)
- Error rate > 5%
- Database connection failures
- Queue processing stopped
Warning Alerts (Slack):
- High latency (p95 > 2s)
- High CPU/Memory (> 80%)
- Low cache hit rate (< 60%)
- Disk space low (< 20%)
Info Alerts (Email):
- Deployment completed
- Scheduled tasks completed
- Daily/weekly reports
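The severity tiers above amount to a routing table from rule to channel. The following sketch evaluates a metrics snapshot against a subset of those rules and picks the channel by severity; the rule names, the snapshot field names, and the channel identifiers are hypothetical.

```javascript
// Channel per severity, following the rule groups above.
const CHANNEL_BY_SEVERITY = {
  critical: 'pagerduty',
  warning: 'slack',
  info: 'email',
};

// Hypothetical rule table; thresholds are taken from the lists above.
const RULES = [
  { name: 'service-down', severity: 'critical', test: (m) => m.healthCheckOk === false },
  { name: 'error-rate-critical', severity: 'critical', test: (m) => m.errorRatePercent > 5 },
  { name: 'high-latency', severity: 'warning', test: (m) => m.p95LatencyMs > 2000 },
  { name: 'high-cpu', severity: 'warning', test: (m) => m.cpuPercent > 80 },
  { name: 'low-cache-hit-rate', severity: 'warning', test: (m) => m.cacheHitRatePercent < 60 },
  { name: 'low-disk', severity: 'warning', test: (m) => m.diskFreePercent < 20 },
];

// Returns the alerts that fire for a snapshot, each tagged with its channel.
function evaluateAlerts(metrics) {
  return RULES
    .filter((rule) => rule.test(metrics))
    .map((rule) => ({
      name: rule.name,
      severity: rule.severity,
      channel: CHANNEL_BY_SEVERITY[rule.severity],
    }));
}

// Hypothetical snapshot: latency and disk space breach their warning thresholds.
const snapshot = {
  healthCheckOk: true,
  errorRatePercent: 0.4,
  p95LatencyMs: 2300,
  cpuPercent: 65,
  cacheHitRatePercent: 72,
  diskFreePercent: 15,
};
const fired = evaluateAlerts(snapshot);
console.log(fired);
```

Keeping the rules as data makes it straightforward to review thresholds in one place and to add a rule without touching the routing logic.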
# Create SNS topic
aws sns create-topic --name videogen-alerts
# Subscribe to topic
aws sns subscribe \
--topic-arn arn:aws:sns:region:account:videogen-alerts \
--protocol email \
--notification-endpoint alerts@example.com
# Create alarm with SNS action
aws cloudwatch put-metric-alarm \
--alarm-name api-errors-high \
--alarm-description "High API error rate" \
--metric-name HTTPCode_Target_5XX_Count \
--namespace AWS/ApplicationELB \
--statistic Sum \
--period 300 \
--evaluation-periods 1 \
--threshold 10 \
--comparison-operator GreaterThanThreshold \
--alarm-actions arn:aws:sns:region:account:videogen-alerts
// GET /health
{
"status": "ok",
"timestamp": "2024-01-15T10:30:00Z",
"uptime": 86400,
"services": {
"database": "ok",
"redis": "ok",
"elasticsearch": "ok",
"s3": "ok"
}
}
async function healthCheck() {
const health = {
status: 'ok',
services: {}
};
// Database
try {
await db.query('SELECT 1');
health.services.database = 'ok';
} catch (error) {
health.services.database = 'error';
health.status = 'degraded';
}
// Redis
try {
await redis.ping();
health.services.redis = 'ok';
} catch (error) {
health.services.redis = 'error';
health.status = 'degraded';
}
// Elasticsearch
try {
const esHealth = await es.cluster.health();
health.services.elasticsearch = esHealth.status;
} catch (error) {
health.services.elasticsearch = 'error';
health.status = 'degraded';
}
return health;
}
// Add request ID to all logs
app.use((req, res, next) => {
req.id = uuidv4();
logger.info('Request started', {
requestId: req.id,
method: req.method,
path: req.path,
ip: req.ip
});
next();
});
// Log request completion
app.use((req, res, next) => {
const start = Date.now();
res.on('finish', () => {
const duration = Date.now() - start;
logger.info('Request completed', {
requestId: req.id,
statusCode: res.statusCode,
duration
});
});
next();
});
// Log slow database queries
db.on('query', (query) => {
const duration = query.duration;
if (duration > 100) {
logger.warn('Slow query detected', {
sql: query.sql,
duration,
params: query.params
});
}
});
Track client-side metrics:
- Page load time
- API response time
- Error rate
- User flows
Automated checks from different locations:
- Uptime monitoring
- Performance testing
- Functionality checks
Monitor and alert on:
- Failed login attempts (> 5 in 5 minutes)
- Privilege escalation attempts
- Unusual API patterns
- Large data exports
- Rate limit violations
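The "more than 5 failed logins in 5 minutes" rule above can be sketched as a sliding-window counter. The class name, the in-memory storage, and the choice of key (user ID or IP address) are assumptions; a multi-instance deployment would need a shared store instead of a local Map.

```javascript
class FailedLoginTracker {
  constructor({ maxFailures = 5, windowMs = 5 * 60 * 1000 } = {}) {
    this.maxFailures = maxFailures;
    this.windowMs = windowMs;
    this.failures = new Map(); // key -> timestamps (ms) of recent failures
  }

  // Records a failed attempt and returns true once the key exceeds the limit.
  recordFailure(key, now = Date.now()) {
    const cutoff = now - this.windowMs;
    const recent = (this.failures.get(key) || []).filter((t) => t > cutoff);
    recent.push(now);
    this.failures.set(key, recent);
    return recent.length > this.maxFailures;
  }
}

// Hypothetical burst: six failures, ten seconds apart, for the same user.
const tracker = new FailedLoginTracker();
const burstStart = 1700000000000;
let flagged = false;
for (let i = 0; i < 6; i++) {
  flagged = tracker.recordFailure('user-456', burstStart + i * 10000);
}
console.log(flagged); // true: the sixth failure exceeds the limit of 5
```

Old timestamps are dropped on each call, so failures spread over a longer time never accumulate past the window.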
// Track authentication events
logger.info('Login attempt', {
userId: user.id,
success: true,
ip: req.ip,
userAgent: req.headers['user-agent']
});
// Track sensitive operations
logger.warn('Data export requested', {
userId: user.id,
dataType: 'user_videos',
count: 150
});
Monitor costs by:
- Service (ECS, RDS, S3, etc.)
- Tag (environment, team, etc.)
- Time period (daily, weekly, monthly)
# Create budget
aws budgets create-budget \
--account-id 123456789012 \
--budget file://budget.json \
--notifications-with-subscribers file://notifications.json
When an alert triggers:
- Acknowledge: Acknowledge alert in PagerDuty/Slack
- Assess: Check dashboards and logs
- Diagnose: Identify root cause
- Fix: Apply fix or rollback
- Verify: Confirm resolution
- Document: Update incident log
After incidents:
- Timeline of events
- Root cause analysis
- Impact assessment
- Action items
- Prevention measures
- CloudWatch logs configured
- CloudWatch metrics enabled
- Alarms configured
- Dashboards created
- Sentry integrated
- Health checks working
- Log retention set
- Alert channels configured
- On-call rotation set up
- Runbooks documented