
Troubleshooting Guide

Common issues and solutions for VideoGen Messenger.

Table of Contents

  • Server Issues
  • Database Issues
  • Redis Issues
  • Elasticsearch Issues
  • API Issues
  • Video Generation Issues
  • Performance Issues
  • Deployment Issues
  • Monitoring & Debugging
  • Getting Help
  • Emergency Procedures

Server Issues

Server Won't Start

Symptom: Server crashes immediately or won't start

Common Causes:

  1. Port already in use
  2. Missing environment variables
  3. Database connection failure
  4. Node version mismatch

Solutions:

# Check if port is in use
lsof -i :3000
# Kill process if needed
kill -9 <PID>

# Verify environment variables
cat .env
# Ensure all required vars are set

# Check Node version
node --version
# Should be 18.0.0 or higher

# Check logs for detailed error
npm run dev 2>&1 | tee server.log
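The missing-variable case can be caught at startup with a small guard. This is a minimal sketch; the variable names below are illustrative and should be matched to your own .env:

```javascript
// Fail fast at startup if required environment variables are missing.
// The variable names below are examples -- adjust them to your .env.
const REQUIRED_VARS = ['DATABASE_URL', 'REDIS_URL', 'JWT_SECRET'];

function checkEnv(env = process.env) {
  const missing = REQUIRED_VARS.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(', ')}`);
  }
  return true;
}
```

Calling a check like this before the server binds its port turns a cryptic crash into a one-line error naming exactly what is missing.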

ECONNREFUSED Errors

Symptom: Connection refused errors

Solution:

# Verify services are running
docker ps

# Restart services
docker-compose restart

# Check service health
pg_isready -h localhost -p 5432     # PostgreSQL (not HTTP, so curl won't work)
redis-cli -p 6379 ping              # Redis (should return PONG)
curl http://localhost:9200          # Elasticsearch

Module Not Found Errors

Symptom: Cannot find module errors

Solution:

# Clear node_modules and reinstall
rm -rf node_modules package-lock.json
npm install

# Clear npm cache
npm cache clean --force

Database Issues

Connection Timeout

Symptom: Database connection timeouts

Solutions:

# Check database is running
docker ps | grep postgres

# Test connection
psql postgresql://postgres:postgres@localhost:5432/videogen_dev

# Check connection pool settings
# In .env:
DATABASE_POOL_MAX=10
DATABASE_POOL_MIN=2

# Check for hanging connections
SELECT pid, usename, application_name, state, query
FROM pg_stat_activity
WHERE datname = 'videogen_dev';

# Kill hanging connections if needed
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE datname = 'videogen_dev' AND pid <> pg_backend_pid();
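Timeouts are easier to diagnose when every query has an explicit deadline. A minimal wrapper, shown here as a sketch (the helper name is an assumption, not part of the codebase):

```javascript
// Wrap any promise (e.g. pool.query(...)) so a hang surfaces as an explicit
// error instead of a silent stall.
function withTimeout(promise, ms, label = 'query') {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms}ms`)),
      ms
    );
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

A query that exceeds the deadline then shows up in the logs with a clear label, which narrows the hunt in pg_stat_activity considerably.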

Migration Failures

Symptom: Database migrations fail

Solutions:

# Check migration status
npm run migrate:status

# Rollback failed migration
npm run migrate:rollback

# Run migrations with verbose logging
NODE_ENV=development npm run migrate

# Manually run SQL if needed
psql videogen_dev < migrations/001_initial.sql

Slow Queries

Symptom: Database queries taking too long

Solutions:

-- Enable query logging
ALTER DATABASE videogen_dev SET log_min_duration_statement = 100;

-- Find slow queries
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

-- Add missing indexes
CREATE INDEX idx_videos_created_at ON videos(created_at DESC);
CREATE INDEX idx_videos_user_status ON videos(user_id, status);

-- Analyze table statistics
ANALYZE videos;
VACUUM ANALYZE videos;

Redis Issues

Redis Connection Failed

Symptom: Cannot connect to Redis

Solutions:

# Check Redis is running
docker ps | grep redis

# Test connection
redis-cli ping
# Should return "PONG"

# Check Redis logs
docker logs videogen-redis

# Restart Redis
docker restart videogen-redis

# Check Redis configuration
redis-cli CONFIG GET maxmemory
redis-cli CONFIG GET maxmemory-policy

Redis Memory Full

Symptom: OOM errors from Redis

Solutions:

# Check memory usage
redis-cli INFO memory

# Set eviction policy
redis-cli CONFIG SET maxmemory-policy allkeys-lru

# Increase memory limit
redis-cli CONFIG SET maxmemory 2gb

# Clear cache if needed
redis-cli FLUSHDB

Slow Redis Operations

Symptom: Redis commands timing out

Solutions:

# Check slow log
redis-cli SLOWLOG GET 10

# Monitor commands in real-time
redis-cli MONITOR

# Check for large keys
redis-cli --bigkeys

# Use pipeline for bulk operations
# Instead of multiple SET commands, use MSET
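The round-trip savings can be sketched with an in-memory stand-in for the client; with ioredis, the real calls would be `mset` or a `pipeline()`:

```javascript
// Batched writes make one round trip where a loop of SETs makes N.
// FakeRedis is an in-memory stand-in used only to count round trips.
class FakeRedis {
  constructor() {
    this.store = new Map();
    this.roundTrips = 0;
  }
  set(key, value) {
    this.roundTrips += 1;
    this.store.set(key, value);
  }
  mset(pairs) {
    this.roundTrips += 1; // one command, one round trip
    for (const [key, value] of pairs) this.store.set(key, value);
  }
}

const slow = new FakeRedis();
for (let i = 0; i < 100; i++) slow.set(`key:${i}`, i); // 100 round trips

const fast = new FakeRedis();
fast.mset(Array.from({ length: 100 }, (_, i) => [`key:${i}`, i])); // 1 round trip
```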

Elasticsearch Issues

Elasticsearch Not Starting

Symptom: Elasticsearch fails to start

Solutions:

# Check logs
docker logs videogen-elasticsearch

# Increase memory
# In docker-compose.yml:
ES_JAVA_OPTS=-Xms1g -Xmx1g

# Check disk space
df -h

# Reset Elasticsearch
docker-compose down
docker volume rm videogen_es_data
docker-compose up -d

Index Not Found

Symptom: index_not_found_exception

Solutions:

# Check indexes
curl http://localhost:9200/_cat/indices?v

# Create index
curl -X PUT http://localhost:9200/videos_dev \
  -H 'Content-Type: application/json' \
  -d @index-mapping.json

# Or use the service method
node -e "
const SearchService = require('./services/search/SearchService.js');
const service = new SearchService();
service.createIndex().then(() => console.log('Index created'));
"

Search Queries Slow

Symptom: Elasticsearch queries taking too long

Solutions:

# Check cluster health
curl http://localhost:9200/_cluster/health?pretty

# Profile slow queries
curl -X GET "http://localhost:9200/videos/_search?pretty" \
  -H 'Content-Type: application/json' \
  -d '{ "profile": true, "query": {...} }'

# Optimize index
curl -X POST "http://localhost:9200/videos/_forcemerge?max_num_segments=1"

# Increase refresh interval
curl -X PUT "http://localhost:9200/videos/_settings" \
  -H 'Content-Type: application/json' \
  -d '{ "index": { "refresh_interval": "30s" } }'

API Issues

401 Unauthorized

Symptom: Authentication failures

Solutions:

# Verify JWT token is valid
# Use jwt.io to decode token

# Check JWT_SECRET matches
echo $JWT_SECRET

# Generate new token
curl -X POST http://localhost:3000/api/v1/auth/login \
  -H 'Content-Type: application/json' \
  -d '{"email":"user@example.com","password":"password"}'

# Check token expiration
# Tokens expire after JWT_EXPIRY (default: 24h)
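To check expiration locally, the payload can be decoded without verifying the signature. This is for inspection only, never for authorization decisions:

```javascript
// Decode a JWT payload without verifying the signature -- debugging only.
function decodeJwtPayload(token) {
  const payload = token.split('.')[1];
  return JSON.parse(Buffer.from(payload, 'base64url').toString('utf8'));
}

// exp is a Unix timestamp in seconds; Date.now() is milliseconds.
function isExpired(token, nowMs = Date.now()) {
  const { exp } = decodeJwtPayload(token);
  return exp !== undefined && exp * 1000 < nowMs;
}
```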

429 Too Many Requests

Symptom: Rate limit exceeded

Solutions:

# Check rate limit settings
# In .env:
RATE_LIMIT_WINDOW_MS=900000  # 15 minutes
RATE_LIMIT_MAX_REQUESTS=100

# Clear rate limit for user (Redis)
redis-cli DEL ratelimit:user123:/api/v1/generate

# Increase limits for development
RATE_LIMIT_MAX_REQUESTS=10000
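The limiter behind these settings is conceptually a fixed-window counter. An in-memory sketch of the idea (in the real service the counter lives in Redis, keyed per user and route, so it is shared across instances):

```javascript
// Fixed-window rate limiter sketch. Returns an allow(key) function that
// permits maxRequests per windowMs per key.
function makeRateLimiter(maxRequests, windowMs) {
  const windows = new Map(); // key -> { start, count }
  return function allow(key, now = Date.now()) {
    const w = windows.get(key);
    if (!w || now - w.start >= windowMs) {
      windows.set(key, { start: now, count: 1 });
      return true;
    }
    if (w.count < maxRequests) {
      w.count += 1;
      return true;
    }
    return false; // caller should respond with 429
  };
}
```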

500 Internal Server Error

Symptom: Unexpected server errors

Solutions:

# Check server logs
tail -f logs/error.log

# Enable debug logging
LOG_LEVEL=debug npm run dev

# Check Sentry for error details
# Or review CloudWatch logs in production

# Common causes:
# - Uncaught exceptions
# - Database connection issues
# - External API failures
# - Missing environment variables

Video Generation Issues

Generation Stuck in Processing

Symptom: Video generation never completes

Solutions:

# Check job status in Redis
redis-cli HGETALL generation:job:JOB_ID

# Check BullMQ queue
redis-cli LRANGE bull:generation:active 0 -1

# Restart workers
docker restart videogen-workers

# Check provider API status
# Google Veo: https://status.google.com
# Runway: https://status.runway.ml
# Minimax: Check their status page

# Manually fail stuck job
redis-cli HSET generation:job:JOB_ID status failed
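A watchdog that flags long-running jobs can automate the manual step above. A minimal sketch (the job object shape is an assumption for illustration):

```javascript
// Find jobs that have been "processing" longer than maxAgeMs so they can be
// failed and retried. The job shape here is illustrative.
function findStuckJobs(jobs, maxAgeMs, now = Date.now()) {
  return jobs.filter(
    (job) => job.status === 'processing' && now - job.startedAt > maxAgeMs
  );
}
```

Run on a timer against the active job list, this finds candidates for the manual `HSET ... status failed` step without eyeballing the queue.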

Provider API Errors

Symptom: AI provider returns errors

Solutions:

# Check API keys are valid
echo $GOOGLE_VEO_API_KEY
echo $RUNWAY_API_KEY
echo $MINIMAX_API_KEY

# Test API connectivity
curl -H "Authorization: Bearer $GOOGLE_VEO_API_KEY" \
  https://api.veo.google.com/v1/status

# Check rate limits
# Verify not exceeding provider limits

# Switch to different provider
# System automatically falls back if available

# Check provider-specific error codes
# Consult provider documentation
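The automatic fallback mentioned above amounts to trying providers in priority order. A minimal sketch, where the provider objects and their `generate()` method are assumed interfaces:

```javascript
// Try each provider in order; collect errors so the final failure explains
// what happened at every step. generate() is an assumed interface.
async function generateWithFallback(providers, prompt) {
  const errors = [];
  for (const provider of providers) {
    try {
      return await provider.generate(prompt);
    } catch (err) {
      errors.push(`${provider.name}: ${err.message}`);
    }
  }
  throw new Error(`All providers failed: ${errors.join('; ')}`);
}
```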

Video Download Failed

Symptom: Cannot download generated video

Solutions:

# Check S3 credentials
aws s3 ls s3://videogen-videos-dev/

# Verify S3 bucket exists
aws s3api head-bucket --bucket videogen-videos-dev

# Check CloudFront distribution
aws cloudfront get-distribution --id DISTRIBUTION_ID

# Test video URL directly
curl -I https://cdn.yourdomain.com/video.mp4

# Check CORS configuration
# Ensure S3 bucket allows your domain

Performance Issues

High Latency

Symptom: API responses are slow

Solutions:

  1. Enable Caching:

    // Cache-aside: return from Redis on a hit, otherwise compute and store
    const cached = await redis.get(cacheKey);
    if (cached) return JSON.parse(cached);
    const result = await expensiveOperation(); // hypothetical expensive call
    await redis.set(cacheKey, JSON.stringify(result), 'EX', 300); // 5 min TTL
    return result;
  2. Optimize Database Queries:

    -- Use EXPLAIN ANALYZE
    EXPLAIN ANALYZE
    SELECT * FROM videos WHERE user_id = '123';
    
    -- Add indexes
    CREATE INDEX idx_videos_user_id ON videos(user_id);
  3. Connection Pooling:

    # Increase pool size in .env
    DATABASE_POOL_MAX=20
  4. Enable Compression:

    // Already enabled via compression middleware
    // Verify in response headers: Content-Encoding: gzip

Memory Leaks

Symptom: Memory usage constantly increasing

Solutions:

# Monitor memory usage
node --inspect api/server.js
# Open chrome://inspect

# Take heap snapshots
curl http://localhost:9229/json/list

# Common causes:
# - Event listeners not removed
# - Large objects in memory
# - Unclosed connections

# Use weak references for caches
const cache = new WeakMap();

High CPU Usage

Symptom: CPU at 100%

Solutions:

# Profile CPU usage
node --prof api/server.js
node --prof-process isolate-*.log

# Common causes:
# - Inefficient loops
# - RegEx operations
# - JSON parsing large objects
# - Synchronous operations

# Use worker threads for CPU-intensive tasks
const { Worker } = require('worker_threads');

Deployment Issues

ECS Task Failing

Symptom: ECS tasks keep restarting

Solutions:

# Check task logs
aws logs tail /ecs/videogen-backend --follow

# Check task definition
aws ecs describe-tasks --cluster videogen --tasks TASK_ARN

# Common causes:
# - Health check failing
# - Insufficient memory
# - Environment variables missing
# - Container port mismatch

# Update task definition
aws ecs update-service --cluster videogen \
  --service videogen-backend \
  --force-new-deployment

Load Balancer 502/503

Symptom: Bad Gateway or Service Unavailable

Solutions:

# Check target health
aws elbv2 describe-target-health \
  --target-group-arn TARGET_GROUP_ARN

# Common causes:
# - Application not responding to health checks
# - Security group blocking traffic
# - Server taking too long to start

# Adjust health check settings
aws elbv2 modify-target-group \
  --target-group-arn TARGET_GROUP_ARN \
  --health-check-interval-seconds 30 \
  --healthy-threshold-count 2

Database Migration Failure in Production

Symptom: Migration fails on deployment

Solutions:

# Always test migrations in staging first
# Never run migrations directly in production

# Use migration-specific deployment
# 1. Deploy code without running migrations
# 2. Test manually in production
# 3. Run migrations separately
# 4. Monitor for errors

# Rollback procedure
npm run migrate:rollback
# Redeploy previous version

Monitoring & Debugging

Enable Debug Mode

# Local development
DEBUG=* npm run dev

# Production (temporary)
LOG_LEVEL=debug
# Remember to revert to 'info' after debugging

Check System Health

# Health endpoint
curl http://localhost:3000/health

# Database connection
curl http://localhost:3000/health/db

# Redis connection
curl http://localhost:3000/health/redis

# Elasticsearch connection
curl http://localhost:3000/health/search
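These endpoints typically sit on a handler that aggregates per-dependency checks. A sketch of that pattern (the checker functions are assumptions; each should resolve quickly on success or throw on failure):

```javascript
// Aggregate dependency checks into one health report. Each check is an
// async function that resolves when healthy and throws when not.
async function healthCheck(checks) {
  const results = {};
  let healthy = true;
  for (const [name, check] of Object.entries(checks)) {
    try {
      await check();
      results[name] = 'ok';
    } catch (err) {
      results[name] = err.message;
      healthy = false;
    }
  }
  return { status: healthy ? 'ok' : 'degraded', checks: results };
}
```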

Application Metrics

# Check CloudWatch metrics
aws cloudwatch get-metric-statistics \
  --namespace AWS/ECS \
  --metric-name CPUUtilization \
  --dimensions Name=ServiceName,Value=videogen-backend \
  --start-time 2024-01-01T00:00:00Z \
  --end-time 2024-01-01T23:59:59Z \
  --period 300 \
  --statistics Average

Getting Help

If you're still experiencing issues:

  1. Check Logs: Always start with application logs
  2. Search Issues: Check GitHub issues for similar problems
  3. Ask Community: Post in discussions with:
    • Error messages
    • Logs
    • Environment details
    • Steps to reproduce
  4. Create Issue: If it's a bug, create a detailed issue

Emergency Procedures

Production Outage

  1. Immediate Actions:

    # Check all services
    aws ecs describe-services --cluster videogen
    
    # Rollback to last known good version
    aws ecs update-service --cluster videogen \
      --service videogen-backend \
      --task-definition videogen-backend:PREVIOUS_VERSION
  2. Communication:

    • Post status update
    • Notify stakeholders
    • Update status page
  3. Investigation:

    • Collect logs
    • Check metrics
    • Review recent changes

Data Loss Prevention

# Emergency database backup
BACKUP_FILE="emergency_backup_$(date +%Y%m%d_%H%M%S).sql"
pg_dump videogen_prod > "$BACKUP_FILE"

# Copy the backup to S3
aws s3 cp "$BACKUP_FILE" s3://videogen-backups/emergency/