Implement Redis job queue with separate heavy/light workers #154

@berntpopp

Summary

Implement a Redis-based job queue architecture to isolate memory-intensive clustering operations from the main API. This addresses the memory pressure observed when heavy analysis endpoints (network clustering, phenotype clustering) spike to 3GB+ and hit the container's 4.5GB limit.

Problem

Currently, all requests are handled by a single API container, with mirai workers running inside that same container:

  • Heavy clustering jobs spike memory to 3GB+ in the mirai worker
  • Container hits 4.5GB memory limit, causing heavy swapping (observed 6-8GB block I/O)
  • All requests (heavy and light) compete for the same memory pool
  • Job state is stored in memory (jobs_env), so the API cannot be scaled horizontally

Proposed Solution

Replace the in-memory job manager with a Redis-based queue using the rrq package:

┌──────────┐  HTTP   ┌─────────┐      ┌─────────┐
│  Browser │────────▶│ Traefik │─────▶│   API   │
└──────────┘         └─────────┘      └────┬────┘
                                           │ Redis
                                      ┌────▼────┐
                                      │  Redis  │
                                      └────┬────┘
                           ┌───────────────┼───────────────┐
                      ┌────▼────┐     ┌────▼────┐     ┌────▼────┐
                      │ Worker  │     │ Worker  │     │ Worker  │
                      │ Heavy   │     │ Light   │     │ Light   │
                      │ (6GB)   │     │ (1GB)   │     │ (1GB)   │
                      └─────────┘     └─────────┘     └─────────┘

Benefits

| Aspect | Current | Proposed |
| --- | --- | --- |
| Memory isolation | None (shared container) | Full (separate containers) |
| Heavy job memory | 4.5GB limit (shared) | 6GB dedicated |
| Light request latency | Blocked by heavy jobs | Unaffected |
| Horizontal scaling | Not possible (in-memory state) | Add workers anytime |
| Job persistence | Lost on container restart | Survives restarts |

Implementation Tasks

Phase 1: Infrastructure

  • Add Redis service to docker-compose.yml
  • Add redux and rrq to API dependencies (renv.lock)
  • Create Redis connection helper with health checks
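
The connection helper with a health check could look roughly like the sketch below. The function name `redis_health_check` and the injectable `connect` argument are assumptions for illustration, not existing project code. The default connector assumes `redux::hiredis()` accepts a `url` argument and that the connection object exposes `PING()`.

```r
# Hypothetical helper: TRUE if Redis answers PING with PONG, FALSE otherwise.
# `connect` is injectable so the check can be exercised without a live server;
# the default is only evaluated when no connector is supplied.
redis_health_check <- function(
    connect = function() {
      redux::hiredis(url = Sys.getenv("REDIS_URL", "redis://redis:6379"))
    }) {
  tryCatch({
    con <- connect()
    # as.character() drops any status class so the comparison is on the raw reply
    identical(as.character(con$PING()), "PONG")
  }, error = function(e) {
    message("Redis health check failed: ", conditionMessage(e))
    FALSE
  })
}
```

Returning FALSE instead of raising lets an API health endpoint report a degraded state without crashing the request handler.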

Phase 2: Job Manager Refactor

  • Create job-manager-redis.R using rrq
  • Implement queue routing logic (heavy vs light based on operation)
  • Migrate create_job() to enqueue to Redis
  • Migrate get_job_status() to read from Redis
  • Update get_job_history() to query Redis

Phase 3: Worker Implementation

  • Create worker.R entrypoint script
  • Implement queue-specific worker startup (--queue=heavy|light)
  • Add graceful shutdown handling
  • Configure worker health checks
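
As a minimal sketch of the queue-specific startup, worker.R could parse the `--queue=` flag with base R before handing the queue name to the rrq worker. The helper name `parse_queue_arg` is hypothetical, and the rrq worker startup itself is deliberately left out here.

```r
# Hypothetical helper for worker.R: extract and validate the --queue=<name> flag.
parse_queue_arg <- function(args = commandArgs(trailingOnly = TRUE),
                            allowed = c("heavy", "light")) {
  hit <- grep("^--queue=", args, value = TRUE)
  if (length(hit) != 1L) {
    stop("Expected exactly one --queue=<name> argument", call. = FALSE)
  }
  queue <- sub("^--queue=", "", hit)
  if (!queue %in% allowed) {
    stop("Unknown queue '", queue, "'; expected one of: ",
         paste(allowed, collapse = ", "), call. = FALSE)
  }
  queue
}
```

Failing fast on a missing or unknown queue name keeps a misconfigured container from starting and silently consuming from the wrong queue.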

Phase 4: Docker Configuration

  • Add worker-heavy service (6GB memory limit)
  • Add worker-light service (1GB memory limit, replicas: 2)
  • Remove sticky sessions from API (no longer needed)
  • Update Traefik labels (simplified routing)

Phase 5: Testing & Migration

  • Add integration tests for Redis job flow
  • Test job persistence across container restarts
  • Test worker crash recovery
  • Document migration path from current architecture

R Packages Required

| Package | Version | Purpose |
| --- | --- | --- |
| redux | 1.1.5 | Redis client (hiredis bindings) |
| rrq | latest | Task queue on Redis |

Docker Compose Changes

services:
  redis:
    image: redis:7-alpine
    mem_limit: 256m
    volumes:
      - redis_data:/data
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]

  api:
    mem_limit: 1g  # Reduced - no heavy processing
    environment:
      - REDIS_URL=redis://redis:6379

  worker-heavy:
    image: sysndd-api
    mem_limit: 6g
    command: ["Rscript", "worker.R", "--queue=heavy"]
    environment:
      - REDIS_URL=redis://redis:6379

  worker-light:
    image: sysndd-api
    mem_limit: 1g
    command: ["Rscript", "worker.R", "--queue=light"]
    deploy:
      replicas: 2
    environment:
      - REDIS_URL=redis://redis:6379

volumes:
  redis_data:

Job Routing Logic

| Operation | Queue | Worker Memory |
| --- | --- | --- |
| clustering | heavy | 6GB |
| phenotype_clustering | heavy | 6GB |
| llm_generation | heavy | 6GB |
| pubtator_update | light | 1GB |
| comparisons_update | light | 1GB |
| backup_create | light | 1GB |
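
The routing table above can be sketched as a plain lookup in base R. The names `QUEUE_BY_OPERATION` and `route_operation` are hypothetical. Falling back to the light queue for unlisted operations is an assumption; the issue does not say how unknown operations should be handled.

```r
# Operation -> queue mapping, taken from the routing table above.
QUEUE_BY_OPERATION <- c(
  clustering           = "heavy",
  phenotype_clustering = "heavy",
  llm_generation       = "heavy",
  pubtator_update      = "light",
  comparisons_update   = "light",
  backup_create        = "light"
)

# Hypothetical router: return the queue name for a single operation.
# Assumption: operations missing from the table go to `default`.
route_operation <- function(operation, default = "light") {
  stopifnot(is.character(operation), length(operation) == 1L)
  queue <- QUEUE_BY_OPERATION[operation]
  if (is.na(queue)) default else unname(queue)
}
```

For example, `route_operation("clustering")` gives `"heavy"` and `route_operation("backup_create")` gives `"light"`. Keeping the mapping in a single named vector means a new operation is one added line, and create_job() only needs the returned queue name.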

Success Criteria

  1. Heavy clustering jobs run in isolated 6GB container
  2. API container stays under 1GB during heavy job execution
  3. Job status survives API container restart
  4. Light requests (health, status, queries) respond in under 100 ms while heavy jobs are running
  5. Workers can be scaled independently (docker-compose up -d --scale worker-light=4)
