redis-cluster-support#1

Open
krhoward-amd wants to merge 1 commit into main from feat/redis-cluster-support


Implements Redis Cluster client support in slurm-web agent to enable high-availability caching.

Summary

  • Adds optional Redis Cluster mode (opt-in, default: disabled)
  • Maintains 100% backwards compatibility with standalone Redis
  • Production-tested on Ubuntu 24.04 with a 3-node Redis cluster

Changes

  • slurmweb/cache.py: RedisCluster client support
  • slurmweb/apps/agent.py: Pass cluster parameters
  • conf/vendor/agent.yml: Schema definitions

Implements Redis Cluster client support in slurm-web agent to enable
high-availability caching across distributed Redis clusters.

## Problem

Slurm-web currently supports only standalone Redis instances for caching.
In high-availability deployments that use Redis Cluster (three or more nodes),
slurm-web agents fail to connect because they use the standard redis.Redis()
client instead of the cluster-aware redis.cluster.RedisCluster() client.

## Solution

This commit adds optional Redis Cluster support while maintaining full
backwards compatibility with standalone Redis deployments.

### Core Changes

**slurmweb/cache.py**:
- Import RedisCluster and ClusterNode from redis.cluster
- Add cluster_mode and cluster_nodes optional parameters to CachingService
- Implement cluster mode initialization with RedisCluster client
- Parse cluster_nodes from "host:port" string format
- Add connection validation with fail-fast error handling
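
The "host:port" parsing step listed above could look roughly like the following. This is a minimal sketch, not the code from this PR: the function name `parse_cluster_nodes` and the choice to reject malformed entries with `ValueError` are assumptions made for illustration.

```python
from typing import Iterable, List, Tuple


def parse_cluster_nodes(entries: Iterable[str]) -> List[Tuple[str, int]]:
    """Turn "host:port" strings into (host, port) tuples.

    Hypothetical helper: the name and the error handling are illustrative.
    """
    nodes = []
    for entry in entries:
        # Split on the last colon so the port is always the final component.
        host, sep, port = entry.strip().rpartition(":")
        if not sep or not host or not port.isdigit():
            raise ValueError(f"Invalid cluster node {entry!r}, expected 'host:port'")
        nodes.append((host, int(port)))
    return nodes
```

Each resulting tuple can then be passed to `ClusterNode(host, port)` when building the `startup_nodes` list for `RedisCluster`.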

**slurmweb/apps/agent.py**:
- Pass cluster_mode and cluster_nodes parameters to CachingService
- Use getattr() with defaults for backwards compatibility
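
The `getattr()` fallback can be illustrated with a small stand-in. In this sketch, `SimpleNamespace` objects play the role of the loaded cache settings and `cache_cluster_kwargs` is a hypothetical helper; the actual settings object and call site in agent.py may differ.

```python
from types import SimpleNamespace


def cache_cluster_kwargs(cache_settings):
    """Collect the cluster parameters, falling back to standalone defaults.

    Hypothetical helper: settings objects loaded from older configurations
    may not carry the new attributes, so getattr() supplies defaults.
    """
    return {
        "cluster_mode": getattr(cache_settings, "cluster_mode", False),
        "cluster_nodes": getattr(cache_settings, "cluster_nodes", None),
    }


# A settings object without the new keys still yields a valid standalone setup.
legacy = SimpleNamespace(enabled=True, host="localhost", port=6379)
clustered = SimpleNamespace(
    enabled=True, cluster_mode=True, cluster_nodes=["10.0.0.1:6379"]
)
```

With this pattern, `cache_cluster_kwargs(legacy)` resolves to cluster mode disabled, which is what keeps existing standalone deployments unchanged.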

**conf/vendor/agent.yml**:
- Add cluster_mode boolean parameter (default: false)
- Add cluster_nodes list parameter with string content type
- Document configuration with examples

## Features

- **Opt-in design**: Cluster mode disabled by default (cluster_mode=false)
- **Automatic failover**: The cache keeps working if a Redis node fails, as long as the cluster itself stays available
- **Load distribution**: Requests are distributed across cluster nodes according to each key's hash slot
- **Backwards compatible**: Existing standalone configurations work unchanged
- **Fail-fast validation**: Connection tested at initialization
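
The fail-fast validation can be sketched as follows. This is an illustration under stated assumptions rather than the PR's implementation: `connect_or_fail` and `redis_cluster_factory` are hypothetical names, and the lazy import assumes a redis-py release that ships `redis.cluster`.

```python
from typing import Any, Callable, Iterable, Tuple


def connect_or_fail(factory: Callable[[], Any]) -> Any:
    """Build a client and ping it once, raising immediately on failure."""
    try:
        client = factory()
        client.ping()
    except Exception as err:
        # Surface the problem at startup instead of on the first cache access.
        raise RuntimeError(f"Unable to connect to Redis: {err}") from err
    return client


def redis_cluster_factory(nodes: Iterable[Tuple[str, int]]) -> Callable[[], Any]:
    """Return a factory for a cluster-aware client (redis-py imported lazily)."""
    def factory():
        from redis.cluster import ClusterNode, RedisCluster

        return RedisCluster(
            startup_nodes=[ClusterNode(host, port) for host, port in nodes],
            decode_responses=False,  # keep raw bytes for pickled values
        )
    return factory
```

A call such as `connect_or_fail(redis_cluster_factory([("10.0.0.1", 6379)]))` would then either return a usable client or raise during initialization.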

## Configuration Example

```ini
[cache]
enabled = yes
cluster_mode = yes
cluster_nodes =
    10.0.0.1:6379
    10.0.0.2:6379
    10.0.0.3:6379
jobs = 30
nodes = 30
```

## Testing

Tested in a production environment:
- Slurm-web 6.0.0
- Redis cluster: 3 nodes
- Slurm controllers: 2 nodes
- OS: Ubuntu 24.04
- Verified backward compatibility with standalone mode

## Implementation Notes

- Uses "host:port" string format for RFL schema compatibility (list content type must be str, not dict)
- skip_full_coverage_check=True allows partial cluster visibility
- decode_responses=False maintains pickle serialization compatibility
- Connection validated with ping() at initialization
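
The decode_responses note can be demonstrated without a Redis server. The sketch below uses only the standard library and shows why pickled values need to travel as raw bytes; the sample payload and the helper name `survives_text_decoding` are illustrative assumptions.

```python
import pickle

# Binary pickle protocols (2 and later) start with the byte 0x80, which is
# not a valid first byte in UTF-8.
payload = pickle.dumps({"jobs": [1, 2, 3]}, protocol=4)


def survives_text_decoding(data: bytes) -> bool:
    """Report whether the bytes could be decoded as UTF-8 text."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Kept as bytes, the value round-trips unchanged.
restored = pickle.loads(payload)
```

Because a client created with decode_responses=True tries to decode replies to text, reading such a payload back would fail, so leaving the option at False keeps existing pickled cache entries readable.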

Closes: #[issue-number]