Implement Redis job queue with separate heavy/light workers #154

## Summary

Implement a Redis-based job queue architecture to isolate memory-intensive clustering operations from the main API. This addresses the memory pressure observed when heavy analysis endpoints (network clustering, phenotype clustering) spike to 3GB+ and hit the container's 4.5GB limit.

## Problem

Currently, all requests are handled by a single API container, with the mirai workers running inside that same container:

- Heavy clustering jobs spike memory to 3GB+ in the mirai worker
- Container hits 4.5GB memory limit, causing heavy swapping (observed 6-8GB block I/O)
- All requests (heavy and light) compete for the same memory pool
- Job state is stored in-memory (`jobs_env`), making it impossible to scale the API horizontally

## Proposed Solution

Replace the in-memory job manager with a Redis-based queue using the `rrq` package:

```
┌──────────┐  HTTP   ┌─────────┐      ┌─────────┐
│  Browser │────────▶│ Traefik │─────▶│   API   │
└──────────┘         └─────────┘      └────┬────┘
                                           │ Redis
                                      ┌────▼────┐
                                      │  Redis  │
                                      └────┬────┘
                           ┌───────────────┼───────────────┐
                      ┌────▼────┐     ┌────▼────┐     ┌────▼────┐
                      │ Worker  │     │ Worker  │     │ Worker  │
                      │ Heavy   │     │ Light   │     │ Light   │
                      │ (6GB)   │     │ (1GB)   │     │ (1GB)   │
                      └─────────┘     └─────────┘     └─────────┘
```

### Benefits

| Aspect | Current | Proposed |
|--------|---------|----------|
| Memory isolation | None (shared container) | Full (separate containers) |
| Heavy job memory | 4.5GB limit (shared) | 6GB dedicated |
| Light request latency | Blocked by heavy jobs | Unaffected |
| Horizontal scaling | Not possible (in-memory state) | Add workers anytime |
| Job persistence | Lost on container restart | Survives restarts |

## Implementation Tasks

### Phase 1: Infrastructure
- [ ] Add Redis service to docker-compose.yml
- [ ] Add `redux` and `rrq` to API dependencies (renv.lock)
- [ ] Create Redis connection helper with health checks
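
The connection helper in Phase 1 could start from something like the sketch below. It assumes `redux::hiredis()` accepts a `url` argument and raises an R error when the server cannot be reached. The names `redis_healthy()` and `connect` are hypothetical and do not exist in the current codebase.

```r
# Hypothetical health check: TRUE when Redis answers PING, FALSE otherwise.
# `connect` is injectable so the check can be exercised without a live server.
redis_healthy <- function(url = Sys.getenv("REDIS_URL", "redis://redis:6379"),
                          connect = function(u) redux::hiredis(url = u)) {
  tryCatch({
    con <- connect(url)  # assumption: errors if the server is unreachable
    con$PING()           # any non-error reply is treated as healthy
    TRUE
  }, error = function(e) FALSE)
}
```

Returning `FALSE` instead of propagating the error keeps the helper usable from a container health check or an API readiness endpoint.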

### Phase 2: Job Manager Refactor
- [ ] Create `job-manager-redis.R` using rrq
- [ ] Implement queue routing logic (heavy vs light based on operation)
- [ ] Migrate `create_job()` to enqueue to Redis
- [ ] Migrate `get_job_status()` to read from Redis
- [ ] Update `get_job_history()` to query Redis

### Phase 3: Worker Implementation
- [ ] Create `worker.R` entrypoint script
- [ ] Implement queue-specific worker startup (`--queue=heavy|light`)
- [ ] Add graceful shutdown handling
- [ ] Configure worker health checks
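
The `--queue=heavy|light` startup flag could be parsed in `worker.R` with base R alone, roughly as sketched here. `parse_queue_arg()` is a hypothetical name, and failing on a missing or unknown queue (rather than falling back to a default) is an assumption about the desired behaviour.

```r
# Hypothetical parser for worker.R: extracts the queue name from CLI arguments.
parse_queue_arg <- function(args = commandArgs(trailingOnly = TRUE),
                            allowed = c("heavy", "light")) {
  flag <- grep("^--queue=", args, value = TRUE)
  if (length(flag) != 1L) {
    stop("expected exactly one --queue=<name> argument", call. = FALSE)
  }
  queue <- sub("^--queue=", "", flag)
  if (!queue %in% allowed) {
    stop(sprintf("unknown queue '%s' (allowed: %s)",
                 queue, paste(allowed, collapse = ", ")), call. = FALSE)
  }
  queue
}
```

When started as `Rscript worker.R --queue=heavy`, calling `parse_queue_arg()` with no arguments would return `"heavy"`; the worker can then subscribe only to that queue.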

### Phase 4: Docker Configuration
- [ ] Add `worker-heavy` service (6GB memory limit)
- [ ] Add `worker-light` service (1GB memory limit, replicas: 2)
- [ ] Remove sticky sessions from API (no longer needed)
- [ ] Update Traefik labels (simplified routing)

### Phase 5: Testing & Migration
- [ ] Add integration tests for Redis job flow
- [ ] Test job persistence across container restarts
- [ ] Test worker crash recovery
- [ ] Document migration path from current architecture

## R Packages Required

| Package | Version | Purpose |
|---------|---------|---------|
| [redux](https://cran.r-project.org/package=redux) | 1.1.5 | Redis client (hiredis bindings) |
| [rrq](https://mrc-ide.github.io/rrq/) | latest | Task queue on Redis |

## Docker Compose Changes

```yaml
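services:
  # Queue backend; job state lives here instead of in the API process.
  redis:
    image: redis:7-alpine
    mem_limit: 256m
    volumes:
      - redis_data:/data  # named volume so Redis data outlives the container
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]

  # Existing service; only the changed keys are shown.
  api:
    mem_limit: 1g  # Reduced - no heavy processing
    environment:
      - REDIS_URL=redis://redis:6379

  # Single worker for memory-intensive jobs (see Job Routing Logic).
  worker-heavy:
    image: sysndd-api
    mem_limit: 6g
    command: ["Rscript", "worker.R", "--queue=heavy"]
    environment:
      - REDIS_URL=redis://redis:6379

  # Two replicas for short, low-memory jobs.
  worker-light:
    image: sysndd-api
    mem_limit: 1g
    command: ["Rscript", "worker.R", "--queue=light"]
    deploy:
      replicas: 2
    environment:
      - REDIS_URL=redis://redis:6379

volumes:
  redis_data: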
```
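
For capacity planning, the limits in the fragment above can be added up to a worst-case host budget. The sketch below only covers the four services shown here; any other containers on the host (Traefik, a database, and so on) are not included, and `total_limit_gib()` is a hypothetical helper name.

```r
# Memory limits in MiB, taken from the compose fragment above.
limits_mib <- c(redis = 256, api = 1024, worker_heavy = 6144, worker_light = 1024)

# Hypothetical helper: total if every container reaches its limit at once.
total_limit_gib <- function(light_replicas = 2) {
  replicas <- c(redis = 1, api = 1, worker_heavy = 1,
                worker_light = light_replicas)
  sum(limits_mib * replicas[names(limits_mib)]) / 1024
}

total_limit_gib()   # 9.25 GiB with two light workers
total_limit_gib(4)  # 11.25 GiB when scaled to four light workers
```

This is roughly double the current single 4.5GB container, which matters on memory-constrained hosts.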

## Job Routing Logic

| Operation | Queue | Worker Memory |
|-----------|-------|---------------|
| `clustering` | heavy | 6GB |
| `phenotype_clustering` | heavy | 6GB |
| `llm_generation` | heavy | 6GB |
| `pubtator_update` | light | 1GB |
| `comparisons_update` | light | 1GB |
| `backup_create` | light | 1GB |
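
The routing table above translates directly into a lookup. This is a minimal base R sketch: `QUEUE_BY_OPERATION` and `route_job()` are hypothetical names, and raising an error for an unlisted operation (instead of defaulting to one of the queues) is an assumption.

```r
# Operation-to-queue mapping, copied from the routing table above.
QUEUE_BY_OPERATION <- c(
  clustering           = "heavy",
  phenotype_clustering = "heavy",
  llm_generation       = "heavy",
  pubtator_update      = "light",
  comparisons_update   = "light",
  backup_create        = "light"
)

# Hypothetical helper: return the queue name for a single operation.
route_job <- function(operation) {
  known <- is.character(operation) && length(operation) == 1L &&
    operation %in% names(QUEUE_BY_OPERATION)
  if (!known) {
    stop("no queue configured for this operation", call. = FALSE)
  }
  QUEUE_BY_OPERATION[[operation]]
}
```

Keeping the mapping in one named vector means a new operation only needs a single added entry, and `create_job()` can call `route_job()` before enqueueing.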

## Success Criteria

1. Heavy clustering jobs run in an isolated 6GB container
2. API container stays under 1GB during heavy job execution
3. Job status survives API container restart
4. Light requests (health, status, queries) respond in under 100 ms while heavy jobs are running
5. Workers can be scaled independently (`docker-compose up -d --scale worker-light=4`)

## Related Issues

- #150 - Optimize mirai worker configuration for memory-constrained servers
- #152 - ViewLogs endpoint loads entire table into memory before filtering

## References

- [rrq documentation](https://mrc-ide.github.io/rrq/)
- [redux package](https://cran.r-project.org/web/packages/redux/vignettes/redux.html)
- [Docker resource constraints](https://docs.docker.com/engine/containers/resource_constraints/)
