## Summary
Implement a Redis-based job queue architecture to isolate memory-intensive clustering operations from the main API. This addresses the memory pressure observed when heavy analysis endpoints (network clustering, phenotype clustering) spike to 3GB+ and hit the container's 4.5GB limit.
## Problem
Currently, all requests are handled by a single API container with mirai workers running in the same process space:
- Heavy clustering jobs spike memory to 3GB+ in the mirai worker
- Container hits 4.5GB memory limit, causing heavy swapping (observed 6-8GB block I/O)
- All requests (heavy and light) compete for the same memory pool
- Job state is stored in-memory (`jobs_env`), making it impossible to scale the API horizontally
## Proposed Solution
Replace the in-memory job manager with a Redis-based queue using the `rrq` package:
```
┌──────────┐  HTTP   ┌─────────┐       ┌─────────┐
│ Browser  │────────▶│ Traefik │──────▶│   API   │
└──────────┘         └─────────┘       └────┬────┘
                                            │ Redis
                                       ┌────▼────┐
                                       │  Redis  │
                                       └────┬────┘
                    ┌───────────────────────┼───────────────────────┐
               ┌────▼────┐             ┌────▼────┐             ┌────▼────┐
               │ Worker  │             │ Worker  │             │ Worker  │
               │ Heavy   │             │ Light   │             │ Light   │
               │ (6GB)   │             │ (1GB)   │             │ (1GB)   │
               └─────────┘             └─────────┘             └─────────┘
```
## Benefits
| Aspect | Current | Proposed |
|---|---|---|
| Memory isolation | None (shared container) | Full (separate containers) |
| Heavy job memory | 4.5GB limit (shared) | 6GB dedicated |
| Light request latency | Blocked by heavy jobs | Unaffected |
| Horizontal scaling | Not possible (in-memory state) | Add workers anytime |
| Job persistence | Lost on container restart | Survives restarts |
## Implementation Tasks
### Phase 1: Infrastructure
- Add Redis service to docker-compose.yml
- Add `redux` and `rrq` to API dependencies (renv.lock)
- Create Redis connection helper with health checks
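The connection helper could be sketched as below. `redis_connect()` and its retry parameters are illustrative names, not existing code; the sketch assumes the `redux` client and the `REDIS_URL` environment variable used elsewhere in this issue.

```r
# Sketch of a Redis connection helper with a health check.
# redis_connect() is a hypothetical name; retry counts are illustrative.
library(redux)

redis_connect <- function(url = Sys.getenv("REDIS_URL", "redis://127.0.0.1:6379"),
                          retries = 5, wait = 1) {
  for (i in seq_len(retries)) {
    con <- tryCatch(redux::hiredis(url = url), error = function(e) NULL)
    # PING fails while Redis is still starting up; treat any error as "not ready"
    ok <- !is.null(con) &&
      !inherits(tryCatch(con$PING(), error = identity), "error")
    if (ok) return(con)
    Sys.sleep(wait)
  }
  stop("Redis not reachable at ", url, " after ", retries, " attempts")
}
```

The retry loop matters at container startup, when the API may come up a few seconds before Redis passes its own healthcheck.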
### Phase 2: Job Manager Refactor
- Create `job-manager-redis.R` using rrq
- Implement queue routing logic (heavy vs light based on operation)
- Migrate `create_job()` to enqueue to Redis
- Migrate `get_job_status()` to read from Redis
- Update `get_job_history()` to query Redis
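A minimal sketch of the routing and enqueue/status pieces, assuming rrq's function-based controller API (`rrq_controller`, `rrq_task_create_expr`, `rrq_task_status`); `run_operation()` stands in for the worker-side dispatcher and does not exist yet.

```r
# Sketch of job-manager-redis.R; rrq calls assume the function-based API,
# and run_operation() is a hypothetical worker-side dispatcher.
library(rrq)

HEAVY_OPS <- c("clustering", "phenotype_clustering", "llm_generation")

# Route memory-intensive operations to the dedicated heavy queue
route_queue <- function(operation) {
  if (operation %in% HEAVY_OPS) "heavy" else "light"
}

con <- rrq::rrq_controller("sysndd")  # connects using redux defaults (REDIS_URL)

create_job <- function(operation, payload) {
  rrq::rrq_task_create_expr(
    run_operation(operation, payload),
    queue = route_queue(operation),
    controller = con
  )
}

get_job_status <- function(task_id) {
  rrq::rrq_task_status(task_id, controller = con)
}
```

Keeping `route_queue()` as a pure function makes the heavy/light split trivially unit-testable without a Redis instance.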
### Phase 3: Worker Implementation
- Create `worker.R` entrypoint script
- Implement queue-specific worker startup (`--queue=heavy|light`)
- Add graceful shutdown handling
- Configure worker health checks
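The entrypoint could look roughly like this; the argument parsing is plain base R, while the `rrq_worker` constructor arguments are an assumption about rrq's R6 API (a matching worker configuration per queue would need to be registered separately) and should be checked against the pinned rrq version.

```r
# Sketch of worker.R; the rrq_worker constructor arguments are an
# assumption and may need adjusting to the rrq version in renv.lock.
library(rrq)

args  <- commandArgs(trailingOnly = TRUE)
queue <- sub("^--queue=", "", grep("^--queue=", args, value = TRUE))
if (length(queue) != 1 || !queue %in% c("heavy", "light")) {
  stop("Usage: Rscript worker.R --queue=heavy|light")
}

# name_config is assumed to select a saved per-queue worker configuration
w <- rrq::rrq_worker$new("sysndd", name_config = queue)
w$loop()  # blocks; an interrupt lets the current task finish before exit
```

For graceful shutdown under `docker stop`, the container must deliver SIGTERM to the R process itself (exec-form `command`, as in the compose file below, does this).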
### Phase 4: Docker Configuration
- Add `worker-heavy` service (6GB memory limit)
- Add `worker-light` service (1GB memory limit, replicas: 2)
- Remove sticky sessions from API (no longer needed)
- Update Traefik labels (simplified routing)
### Phase 5: Testing & Migration
- Add integration tests for Redis job flow
- Test job persistence across container restarts
- Test worker crash recovery
- Document migration path from current architecture
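An integration test for the job flow might start like this, assuming a reachable Redis (skipped otherwise) and the hypothetical `create_job()`/`get_job_status()` helpers from Phase 2; the status labels are assumptions about what rrq reports.

```r
# Sketch of an integration test for the Redis job flow; status labels
# are assumptions and may differ in the real implementation.
library(testthat)

test_that("a light job can be enqueued and queried", {
  skip_if(Sys.getenv("REDIS_URL") == "", "no Redis configured")
  id <- create_job("backup_create", list())
  expect_true(is.character(id) && nzchar(id))
  status <- as.character(get_job_status(id))
  expect_true(status %in% c("PENDING", "RUNNING", "COMPLETE", "ERROR"))
})
```

The restart-persistence criterion can reuse the same shape: enqueue, restart the Redis container, then assert `get_job_status(id)` still resolves.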
## R Packages Required
| Package | Version | Purpose |
|---|---|---|
| redux | 1.1.5 | Redis client (hiredis bindings) |
| rrq | latest | Task queue on Redis |
## Docker Compose Changes
```yaml
services:
  redis:
    image: redis:7-alpine
    mem_limit: 256m
    volumes:
      - redis_data:/data
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]

  api:
    mem_limit: 1g  # Reduced - no heavy processing
    environment:
      - REDIS_URL=redis://redis:6379

  worker-heavy:
    image: sysndd-api
    mem_limit: 6g
    command: ["Rscript", "worker.R", "--queue=heavy"]
    environment:
      - REDIS_URL=redis://redis:6379

  worker-light:
    image: sysndd-api
    mem_limit: 1g
    command: ["Rscript", "worker.R", "--queue=light"]
    deploy:
      replicas: 2
    environment:
      - REDIS_URL=redis://redis:6379

volumes:
  redis_data:
```
## Job Routing Logic
| Operation | Queue | Worker Memory |
|---|---|---|
| `clustering` | heavy | 6GB |
| `phenotype_clustering` | heavy | 6GB |
| `llm_generation` | heavy | 6GB |
| `pubtator_update` | light | 1GB |
| `comparisons_update` | light | 1GB |
| `backup_create` | light | 1GB |
## Success Criteria
- Heavy clustering jobs run in isolated 6GB container
- API container stays under 1GB during heavy job execution
- Job status survives API container restart
- Light requests (health, status, queries) respond < 100ms during heavy jobs
- Workers can be scaled independently (`docker-compose up -d --scale worker-light=4`)
## Related Issues
- #150: Optimize mirai worker configuration for memory-constrained servers
- #152: Bug: ViewLogs page fails to load - logging endpoint loads entire table into memory