diff --git a/docs/PRD.md b/docs/PRD.md index ecd686c..64f4030 100644 --- a/docs/PRD.md +++ b/docs/PRD.md @@ -39,42 +39,47 @@ Designed for growth to 100+ nodes. ## Core Features -### 1. Automated RKE2 Kubernetes Deployment +### Automated RKE2 Kubernetes Deployment Automated deployment of production-ready RKE2 clusters with first node initialization, additional node joining, Cilium CNI integration, and compliance-ready audit logging. -**[πŸ“„ Detailed Documentation](./01-rke2-deployment.md)** +**[πŸ“„ Detailed Documentation](./rke2-deployment.md)** -### 2. AMD GPU Support with ROCm +### AMD GPU Support with ROCm Automated AMD GPU driver installation, device detection, permission configuration, and Kubernetes GPU resource integration for AI/ML workloads. -**[πŸ“„ Detailed Documentation](./02-rocm-support.md)** +**[πŸ“„ Detailed Documentation](./rocm-support.md)** -### 3. Storage Management with Longhorn +### Storage Management with Longhorn Distributed block storage with automatic disk detection, interactive selection, persistent mounting, and Longhorn CSI integration for reliable persistent volumes. -**[πŸ“„ Detailed Documentation](./03-storage-management.md)** +**[πŸ“„ Detailed Documentation](./storage-management.md)** -### 4. Network Configuration +### Longhorn Drive Setup and Recovery +Comprehensive drive recovery procedures including RAID detection and removal, disk space analysis, automated formatting and mounting, and troubleshooting for storage issues after node reboots. + +**[πŸ“„ Detailed Documentation](./longhorn-drive-setup-and-recovery.md)** + +### Network Configuration Comprehensive networking with MetalLB load balancing, firewall configuration, multipath storage networking, and time synchronization across cluster nodes. -**[πŸ“„ Detailed Documentation](./04-network-configuration.md)** +**[πŸ“„ Detailed Documentation](./network-configuration.md)** -### 5. Interactive Terminal UI +### Interactive Terminal UI Rich terminal interface with real-time progress tracking, live log streaming, interactive configuration wizards, and comprehensive error handling and recovery options. -**[πŸ“„ Detailed Documentation](./06-terminal-ui.md)** +**[πŸ“„ Detailed Documentation](./terminal-ui.md)** -### 6. Configuration Management +### Configuration Management Flexible configuration system supporting YAML files, environment variables, and CLI flags with comprehensive validation and an interactive wizard for guided setup. -**[πŸ“„ Configuration Reference](./10-configuration-reference.md)** +**[πŸ“„ Configuration Reference](./configuration-reference.md)** -### 7. Node Validation and Testing +### Node Validation and Testing Comprehensive pre-deployment validation ensures node readiness, connectivity, GPU availability, and proper firewall configuration before any system modifications. -**[πŸ“„ Installation Guide](./08-installation-guide.md)** +**[πŸ“„ Installation Guide](./installation-guide.md)** -### 8. TLS Certificate Management +### TLS Certificate Management Flexible certificate management with three deployment options: @@ -95,25 +100,25 @@ Flexible certificate management with three deployment options: All certificates are stored as Kubernetes secrets in the `kgateway-system` namespace and integrated with the cluster's ingress controller for HTTPS traffic. -**[πŸ“„ Certificate Management Details](./05-certificate-management.md)** +**[πŸ“„ Certificate Management Details](./certificate-management.md)** -### 9. 
Web UI and Monitoring Interface +### Web UI and Monitoring Interface Browser-based configuration wizard with real-time monitoring dashboard, error recovery interface, and responsive design for remote cluster management from any device. -**[πŸ“„ Technical Architecture](./07-technical-architecture.md)** +**[πŸ“„ Technical Architecture](./technical-architecture.md)** -### 10. Comprehensive Configuration Validation +### Comprehensive Configuration Validation Pre-flight validation system checks all configuration, resources, and system requirements before making any changes, providing clear error messages with actionable fixes. -**[πŸ“„ Configuration Reference](./10-configuration-reference.md)** +**[πŸ“„ Configuration Reference](./configuration-reference.md)** ## Technical Architecture ClusterBloom uses a modular architecture with command-based interfaces, sequential installation pipelines, and multiple interaction modes (CLI, TUI, Web UI). The system executes in three phases: pre-Kubernetes system preparation, Kubernetes cluster setup, and post-Kubernetes add-on deployment. -**[πŸ“„ Technical Architecture Documentation](./07-technical-architecture.md)** +**[πŸ“„ Technical Architecture Documentation](./technical-architecture.md)** -**[πŸ“„ Configuration Reference](./10-configuration-reference.md)** +**[πŸ“„ Configuration Reference](./configuration-reference.md)** ## User Experience @@ -270,16 +275,16 @@ Browser-based testing with chromedp and comprehensive mock system: ### For Developers and Operators -**[πŸ“„ Manual Installation Guide](./08-installation-guide.md)** +**[πŸ“„ Manual Installation Guide](./installation-guide.md)** Complete manual installation procedures for understanding automation or performing custom installations. -**[πŸ“„ Cloud Platform Compatibility](./09-cloud-compatibility.md)** +**[πŸ“„ Cloud Platform Compatibility](./cloud-compatibility.md)** Infrastructure dependencies, migration strategies, and configuration for multi-platform deployments (EKS, AKS, GKE). -**[πŸ“„ Configuration Reference](./10-configuration-reference.md)** +**[πŸ“„ Configuration Reference](./configuration-reference.md)** Comprehensive configuration variable reference with examples and validation rules. -**[πŸ“„ Technical Architecture](./07-technical-architecture.md)** +**[πŸ“„ Technical Architecture](./technical-architecture.md)** Detailed technical architecture, component organization, and implementation patterns. ## Conclusion diff --git a/docs/README.md b/docs/README.md new file mode 100644 index 0000000..045cb6c --- /dev/null +++ b/docs/README.md @@ -0,0 +1,79 @@ +# Cluster-Bloom Documentation + +Welcome to the comprehensive documentation for Cluster-Bloom, an enterprise-ready AI/ML cluster deployment platform built on RKE2 and Kubernetes. + +## Documentation Overview + +This documentation provides complete guidance for deploying, configuring, and managing Cluster-Bloom environments. Each document covers specific aspects of the platform, from initial sizing to advanced configuration. 
+ +## Documentation Index + +### Getting Started +- [**Cluster Sizing and Configurations**](cluster-sizing-configurations.md) - Hardware requirements, sizing guidelines, and deployment planning +- [**Manual Steps Quick Reference**](manual-steps-quick-reference.md) - Essential commands and procedures for cluster management + +### Core Deployment +- [**RKE2 Deployment**](rke2-deployment.md) - Kubernetes cluster foundation setup and configuration +- [**ROCm Support**](rocm-support.md) - AMD GPU support and ROCm integration for AI workloads +- [**Storage Management**](storage-management.md) - Longhorn distributed storage configuration and management +- [**Longhorn Drive Setup and Recovery**](longhorn-drive-setup-and-recovery.md) - Detailed drive recovery, RAID handling, and storage troubleshooting + +### Infrastructure Configuration +- [**Network Configuration**](network-configuration.md) - Networking setup, load balancing, and connectivity +- [**Certificate Management**](certificate-management.md) - TLS/SSL certificate handling and automation +- [**Terminal UI**](terminal-ui.md) - Interactive command-line interface and user experience +- [**Technical Architecture**](technical-architecture.md) - System design, component interactions, and architectural decisions + +### Operations and Maintenance +- [**Installation Guide**](installation-guide.md) - Complete step-by-step installation procedures +- [**Cloud Compatibility**](cloud-compatibility.md) - Multi-cloud deployment strategies and platform-specific considerations +- [**Configuration Reference**](configuration-reference.md) - Comprehensive configuration options and parameters +- [**OIDC Authentication**](oidc-authentication.md) - Single sign-on integration and identity management + +## Quick Navigation + +### For New Users +1. Start with [Cluster Sizing and Configurations](cluster-sizing-configurations.md) to plan your deployment +2. Follow the [Installation Guide](installation-guide.md) for step-by-step setup +3. Reference [Manual Steps Quick Reference](manual-steps-quick-reference.md) for common operations + +### For System Administrators +- [Technical Architecture](technical-architecture.md) - Understand system design +- [Storage Management](storage-management.md) + [Longhorn Drive Setup and Recovery](longhorn-drive-setup-and-recovery.md) - Complete storage configuration +- [Configuration Reference](configuration-reference.md) - Detailed parameter documentation + +### For DevOps Engineers +- [RKE2 Deployment](rke2-deployment.md) - Kubernetes foundation +- [Network Configuration](network-configuration.md) - Infrastructure networking +- [Certificate Management](certificate-management.md) - Security configuration + +### Troubleshooting and Recovery +- [Longhorn Drive Setup and Recovery](longhorn-drive-setup-and-recovery.md) - Storage troubleshooting and RAID handling +- [Manual Steps Quick Reference](manual-steps-quick-reference.md) - Emergency procedures and common fixes + +## Documentation Standards + +- **Comprehensive Coverage**: Each document provides complete information for its topic area +- **Practical Examples**: Real-world configurations and command examples +- **Cross-References**: Links between related topics for easy navigation +- **Version Compatibility**: All procedures tested with current platform versions + +## Contributing + +This documentation is maintained as part of the Cluster-Bloom project. For updates, corrections, or additions: + +1. Follow the established documentation patterns +2. 
Include practical examples and command snippets
+3. Test all procedures before documenting them
+4. Maintain cross-references between related topics
+
+## Support
+
+For questions about the documentation or Cluster-Bloom platform:
+- Reference the [Configuration Reference](configuration-reference.md) for parameter details
+- Check [Technical Architecture](technical-architecture.md) for design questions
+- Use [Manual Steps Quick Reference](manual-steps-quick-reference.md) for operational procedures
+
+---
+
+*This is the way to build enterprise-grade AI infrastructure that eliminates impurities.*
\ No newline at end of file
diff --git a/docs/05-certificate-management.md b/docs/certificate-management.md
similarity index 100%
rename from docs/05-certificate-management.md
rename to docs/certificate-management.md
diff --git a/docs/09-cloud-compatibility.md b/docs/cloud-compatibility.md
similarity index 100%
rename from docs/09-cloud-compatibility.md
rename to docs/cloud-compatibility.md
diff --git a/docs/00-cluster-sizing-configurations.md b/docs/cluster-sizing-configurations.md
similarity index 100%
rename from docs/00-cluster-sizing-configurations.md
rename to docs/cluster-sizing-configurations.md
diff --git a/docs/10-configuration-reference.md b/docs/configuration-reference.md
similarity index 100%
rename from docs/10-configuration-reference.md
rename to docs/configuration-reference.md
diff --git a/docs/08-installation-guide.md b/docs/installation-guide.md
similarity index 100%
rename from docs/08-installation-guide.md
rename to docs/installation-guide.md
diff --git a/docs/longhorn-drive-setup-and-recovery.md b/docs/longhorn-drive-setup-and-recovery.md
new file mode 100644
index 0000000..a5c50eb
--- /dev/null
+++ b/docs/longhorn-drive-setup-and-recovery.md
@@ -0,0 +1,485 @@
+# Longhorn Drive Setup and Recovery Documentation
+
+This documentation provides comprehensive instructions for setting up, recovering, and managing Longhorn drives on cluster-bloom nodes. It includes both manual step-by-step procedures and a sample (not officially supported) script that serves as an automation example.
+
+## Table of Contents
+
+1. [Overview](#overview)
+2. [Prerequisites](#prerequisites)
+3. [Disk Space Requirements](#disk-space-requirements)
+4. [RAID Considerations](#raid-considerations)
+5. [Reboot Checklist](#reboot-checklist)
+6. [Manual Disk Setup Procedure](#manual-disk-setup-procedure)
+7. [Automation Script](#automation-script)
+8. [Longhorn UI Configuration](#longhorn-ui-configuration)
+9. [Troubleshooting](#troubleshooting)
+10. [Reference](#reference)
+
+## Overview
+
+Longhorn is a distributed block storage system for Kubernetes that requires proper disk configuration to ensure data persistence across node reboots.
This documentation covers:
+
+- **Drive Priority**: NVMe drives (preferred) β†’ SSD drives β†’ HDD drives
+- **RAID Restriction**: Longhorn explicitly does NOT support RAID configurations
+- **Special Requirements**: `/var/lib/rancher` needs a dedicated mount point only if the root partition is space-constrained
+- **Mount Pattern**: Disks mounted at `/mnt/diskX`, where X starts from 0 and increments by one for each additional disk
+- **Filesystem**: ext4 with UUID-based mounting for reliability
+
+## Prerequisites
+
+- Root access to the cluster node
+- Understanding of Linux disk management
+- Backup of important data (formatting operations are destructive)
+- Basic knowledge of Longhorn concepts
+
+### Required Packages
+
+Ensure these utilities are available:
+```bash
+sudo apt update
+sudo apt install -y util-linux e2fsprogs mdadm
+```
+
+## Disk Space Requirements
+
+Based on cluster-bloom requirements, ensure adequate disk space:
+
+### Space Guidelines
+- **Root partition**: Minimum 10GB required, 20GB recommended
+- **Available space**: Minimum 10GB required
+- **/var partition**: 5GB recommended for container images
+- **/var/lib/rancher**: dedicated partition only if the root partition is space-constrained (and no separate /var or /var/lib mounts exist)
+
+### Space Validation
+```bash
+# Check current disk usage
+df -h /
+df -h /var
+df -h /var/lib/rancher 2>/dev/null || echo "/var/lib/rancher not separately mounted"
+
+# Check available space
+df -h | awk '$6=="/" {print "Root partition: " $4 " available"}'
+```
+
+## RAID Considerations
+
+**⚠️ CRITICAL**: Longhorn documentation explicitly states that **RAID configurations are NOT supported**. Longhorn provides its own replication and high availability mechanisms.
+
+### Detecting RAID Configuration
+
+Check if your system has software RAID that needs to be removed:
+
+```bash
+# Check for software RAID arrays
+cat /proc/mdstat
+
+# List RAID arrays
+sudo mdadm --detail --scan
+```
+
+**Example `lsblk` output showing a problematic RAID setup that must be removed:**
+```
+NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS UUID
+nvme0n1 259:0 0 3.5T 0 disk
+└─nvme0n1p1 259:1 0 3.5T 0 part
+ └─md0 9:0 0 14T 0 raid0 /
+nvme1n1 259:2 0 894G 0 disk
+β”œβ”€nvme1n1p1 259:3 0 512M 0 part /boot/efi
+└─nvme1n1p2 259:4 0 893G 0 part [SWAP]
+nvme2n1 259:5 0 3.5T 0 disk
+└─nvme2n1p1 259:6 0 3.5T 0 part
+ └─md0 9:0 0 14T 0 raid0 /
+nvme3n1 259:7 0 3.5T 0 disk
+└─nvme3n1p1 259:8 0 3.5T 0 part
+ └─md0 9:0 0 14T 0 raid0 /
+nvme4n1 259:9 0 3.5T 0 disk
+└─nvme4n1p1 259:10 0 3.5T 0 part
+ └─md0 9:0 0 14T 0 raid0 /
+nvme5n1 259:11 0 894G 0 disk
+```
+
+In the above example, **md0** is a RAID0 array spanning multiple NVMe drives - it must be broken apart for Longhorn use.
+
+### RAID Removal Process
+
+**⚠️ WARNING**: The automation [script](../experimental/longhorn-disk-setup.sh) is not robustly tested; it serves as a starting point for your particular use case.
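+
+For reference, the manual equivalent of what the script automates is roughly the following (a minimal sketch assuming a single array `/dev/md0` whose member partitions you substitute from your own `lsblk` output; note that an array mounted as `/`, as in the example above, cannot be dismantled from the running system):
+
+```bash
+# Back up the current RAID layout before touching anything
+sudo mkdir -p /root/longhorn-raid-backup
+sudo mdadm --detail --scan | sudo tee /root/longhorn-raid-backup/mdadm.conf.backup
+sudo cp /proc/mdstat /root/longhorn-raid-backup/mdstat.backup
+
+# Unmount (if mounted) and stop the array
+sudo umount /dev/md0 || true
+sudo mdadm --stop /dev/md0
+
+# Clear the RAID superblock on each former member partition (DESTRUCTIVE)
+sudo mdadm --zero-superblock /dev/nvme2n1p1 /dev/nvme3n1p1 /dev/nvme4n1p1
+```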
+
+The automation script (`cluster-bloom/experimental/longhorn-disk-setup.sh`) can back up, remove, and optionally restore RAID configurations:
+
+```bash
+# Check if RAID is present
+cat /proc/mdstat
+
+# Backup and remove RAID (interactive)
+sudo bash experimental/longhorn-disk-setup.sh --remove-raid
+
+# Force RAID removal without confirmation
+sudo bash experimental/longhorn-disk-setup.sh --force-raid-removal
+```
+
+#### RAID Backup and Restore
+
+The script automatically backs up RAID configurations before removal:
+
+**Backup Location**: `/root/longhorn-raid-backup/`
+**Backup Contents**:
+- `mdadm.conf.backup` - RAID configuration
+- `mdstat.backup` - RAID status at backup time
+- `md*_detail.backup` - Individual array details
+
+**Manual RAID Restoration** (if needed):
+```bash
+# List backups
+ls -la /root/longhorn-raid-backup/
+
+# View original configuration
+cat /root/longhorn-raid-backup/mdadm.conf.backup
+
+# Restore RAID (DESTRUCTIVE - will recreate arrays)
+sudo mdadm --assemble --scan --config=/root/longhorn-raid-backup/mdadm.conf.backup
+```
+
+**⚠️ Important**: RAID restoration will destroy any data written to individual disks after RAID removal.
+
+## Reboot Checklist
+
+After any node reboot, verify that all Longhorn storage disks are properly mounted:
+
+### Quick Validation Commands
+
+```bash
+# 1. Check current fstab entries for Longhorn disks
+sudo cat /etc/fstab | grep -E "/mnt/disk[0-9]+"
+
+# 2. Check currently mounted disks
+df -h | grep -E "/mnt/disk[0-9]+"
+
+# 3. List all disks with UUIDs
+lsblk -o +UUID
+
+# 4. Verify all fstab entries mount correctly
+sudo mount -a && echo "All mounts successful" || echo "Mount errors detected"
+```
+
+### Expected fstab Format
+
+Your `/etc/fstab` should contain entries like:
+
+```bash
+UUID=f9134cf2-0205-4012-8e8b-ac44757a0d15 /mnt/disk0 ext4 defaults,nofail 0 2
+UUID=9111f9b3-e4e5-4a50-a9cc-3258d40786f3 /mnt/disk1 ext4 defaults,nofail 0 2
+UUID=e27fc7cd-356a-40de-89ae-ea1f0af59d24 /mnt/disk2 ext4 defaults,nofail 0 2
+UUID=489f3576-cf3b-4319-ba9d-a07427225f81 /mnt/disk3 ext4 defaults,nofail 0 2
+UUID=3206db8b-109e-4b9f-8320-7db4cca5210d /mnt/disk4 ext4 defaults,nofail 0 2
+```
+
+**Note**: The `nofail` option ensures the system boots even if a disk is unavailable.
+
+## Manual Disk Setup Procedure
+
+### Step 1: Identify Candidate Disks
+
+First, examine your system's storage layout:
+
+```bash
+# List all block devices with UUIDs
+lsblk -o +UUID
+
+# Example output with 5 NVMe drives:
+NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINTS UUID
+nvme0n1 259:0 0 3.5T 0 disk
+β”œβ”€nvme0n1p1 259:1 0 3.5T 0 part / a1b2c3d4-e5f6-7890-abcd-ef1234567890
+nvme1n1 259:2 0 894G 0 disk
+β”œβ”€nvme1n1p1 259:3 0 512M 0 part /boot/efi 1234-5678
+└─nvme1n1p2 259:4 0 893G 0 part [SWAP]
+nvme2n1 259:5 0 3.5T 0 disk
+nvme3n1 259:6 0 3.5T 0 disk
+nvme4n1 259:7 0 3.5T 0 disk
+nvme5n1 259:8 0 894G 0 disk
+sdb 8:16 0 256G 0 disk
+```
+
+**Disk Priority for Longhorn Storage**:
+1. **NVMe drives** (nvme2n1, nvme3n1, nvme4n1, nvme5n1) - Highest priority
+2. **SSD drives** (typically sdb, sdc, etc.) - Medium priority
+3. **HDD drives** (sda usually excluded as boot drive) - Lowest priority
+
+### Step 2: Check Current Mount Status
+
+```bash
+# Check what's currently mounted
+mount | grep -E "/mnt/disk|/var/lib/rancher"
+
+# Compare with fstab entries
+sudo cat /etc/fstab
+```
+
+### Step 3: Identify Unmounted Candidate Disks
+
+Look for disks that:
+- Are not currently mounted
+- Don't have a UUID (indicating they need formatting)
+- Are suitable for Longhorn storage
+
+Example identification process:
+
+```bash
+# Check if disk has UUID (formatted)
+sudo blkid /dev/nvme2n1
+# If no output, disk needs formatting
+
+# Check if disk is mounted
+mount | grep /dev/nvme2n1
+# If no output, disk is not mounted
+```
+
+### Step 4: Format Unmounted Disks
+
+**⚠️ WARNING**: This will destroy all data on the disk!
+
+For each unformatted disk:
+
+```bash
+# Format with ext4 filesystem
+sudo mkfs.ext4 /dev/nvme2n1
+
+# Verify UUID was assigned
+sudo blkid /dev/nvme2n1
+# Output: /dev/nvme2n1: UUID="e27fc7cd-356a-40de-89ae-ea1f0af59d24" TYPE="ext4"
+```
+
+### Step 5: Create Mount Points
+
+```bash
+# Create mount directories
+sudo mkdir -p /mnt/disk0
+sudo mkdir -p /mnt/disk1
+sudo mkdir -p /mnt/disk2
+sudo mkdir -p /mnt/disk3
+sudo mkdir -p /mnt/disk4
+# Continue for additional disks
+```
+
+### Step 6: Add Disks to fstab
+
+For each formatted disk, add an entry to `/etc/fstab`:
+
+```bash
+# Get the UUID for the disk
+UUID=$(sudo blkid -s UUID -o value /dev/nvme2n1)
+
+# Add entry to fstab (adjust the mount point for each disk)
+echo "UUID=$UUID /mnt/disk2 ext4 defaults,nofail 0 2" | sudo tee -a /etc/fstab
+```
+
+### Step 7: Mount All Disks
+
+```bash
+# Mount all entries in fstab
+sudo mount -a
+
+# Verify successful mounting
+df -h | grep "/mnt/disk"
+```
+
+Expected output:
+```
+/dev/nvme2n1 3.4T 89M 3.2T 1% /mnt/disk0
+/dev/nvme3n1 3.4T 89M 3.2T 1% /mnt/disk1
+/dev/nvme4n1 3.4T 89M 3.2T 1% /mnt/disk2
+/dev/nvme5n1 894G 77M 848G 1% /mnt/disk3
+/dev/sdb 251G 65M 238G 1% /mnt/disk4
+```
+
+## Automation Script
+
+The automation [script](../experimental/longhorn-disk-setup.sh) at `cluster-bloom/experimental/longhorn-disk-setup.sh` provides comprehensive disk management capabilities, including RAID handling, disk discovery, formatting, mounting, and fstab configuration.
+
+### Script Usage
+
+```bash
+# The automation script is available in the experimental folder
+# Script location: experimental/longhorn-disk-setup.sh
+
+# Dry run to see recommendations without making changes
+sudo bash experimental/longhorn-disk-setup.sh --dry-run
+
+# Full interactive setup (with RAID handling if needed)
+sudo bash experimental/longhorn-disk-setup.sh
+
+# Force RAID backup and removal (if detected)
+sudo bash experimental/longhorn-disk-setup.sh --remove-raid
+```
+
+### Script Capabilities
+
+The script will:
+1. **Check disk space requirements** and recommend `/var/lib/rancher` setup if needed
+2. **Detect and handle software RAID** configurations safely
+3. **Discover candidate disks** (prioritized by type: NVMe β†’ SSD β†’ HDD)
+4. **Identify unformatted disks** and prompt for formatting
+5. **Create mount points** with proper permissions
+6. **Add fstab entries** with UUID-based mounting
+7. **Validate mounts** and test reboot safety
+8. 
**Provide summary** and next steps + +### RAID Handling Features + +- **RAID Detection**: Automatically detects software RAID arrays +- **Configuration Backup**: Saves RAID configuration for potential restoration +- **Safe Removal**: Properly stops and removes RAID arrays +- **Restoration Capability**: Can restore original RAID if needed + +## Longhorn UI Configuration + +After disks are mounted and persistent, configure them in Longhorn: + +### Step 1: Access Longhorn UI + +```bash +# Access the Longhorn dashboard +https://longhorn.cluster-name +``` + +### Step 2: Add Disks to Nodes + +1. **Navigate to Nodes**: Click on the "Node" tab in the Longhorn UI +2. **Select Node**: Choose the node you want to configure +3. **Edit Disks**: In the "Operations" column (far right), click the dropdown menu and select "Edit node and disks" +4. **Add Disk**: Scroll to the bottom of the form and click "Add disk" +5. **Configure Disk**: + - **Name**: Descriptive name (e.g., "nvme-disk-0") + - **Disk Type**: "filesystem" + - **Path**: Mount path (e.g., "/mnt/disk0") + - **Storage Reserved**: Amount to reserve (bytes) - optional +6. **Enable Scheduling**: Click the "Enable" button under "Scheduling" +7. **Save**: Click "Save" to apply changes + +### Step 3: Verify Disk Addition + +- Check that the disk appears in the node's disk list +- Verify "Schedulable" status is "True" +- Monitor disk space and usage + +## Special Requirements + +### /var/lib/rancher Partition + +Based on cluster-bloom requirements and available space: + +- **Conditional Requirement**: `/var/lib/rancher` should have its own dedicated mount point **only if** root partition is space-constrained +- **Size Guidelines**: Refer to disk space requirements above +- **Configuration**: Can be specified via `CLUSTER_DISKS` or `CLUSTER_PREMOUNTED_DISKS` in bloom.yaml + +#### When to Create Separate /var/lib/rancher: +```bash +# Check if root partition needs dedicated /var/lib/rancher +ROOT_AVAILABLE=$(df --output=avail / | tail -1) +if [ "$ROOT_AVAILABLE" -lt 20971520 ]; then # Less than 20GB in KB + echo "Root partition space-constrained, recommend separate /var/lib/rancher" +else + echo "Root partition has sufficient space" +fi +``` + +Example setup if needed: +```bash +# If using a dedicated disk for /var/lib/rancher +UUID=12345678-90ab-cdef-1234-567890abcdef /var/lib/rancher ext4 defaults,nofail 0 2 +``` + +## Troubleshooting + +### Common Issues + +**1. Disk not mounting after reboot** +```bash +# Check fstab entry syntax +sudo cat /etc/fstab | grep UUID + +# Test mount manually +sudo mount UUID=your-uuid-here /mnt/disk0 + +# Check filesystem health +sudo fsck -f /dev/nvme2n1 +``` + +**2. UUID not found** +```bash +# Regenerate UUID if filesystem is corrupted +sudo tune2fs -U random /dev/nvme2n1 + +# Update fstab with new UUID +sudo blkid /dev/nvme2n1 +``` + +**3. Mount point permission issues** +```bash +# Fix ownership and permissions +sudo chown root:root /mnt/disk0 +sudo chmod 755 /mnt/disk0 +``` + +**4. Longhorn not detecting disks** +- Ensure disk path matches exactly in Longhorn UI +- Verify disk has sufficient space +- Check Longhorn logs for errors +```bash +kubectl logs -n longhorn-system deployment/longhorn-manager +``` + +### Validation Commands + +```bash +# Comprehensive disk check +echo "=== Disk Status ===" +lsblk -o +UUID + +echo -e "\n=== RAID Check ===" +if [[ -f /proc/mdstat ]]; then + cat /proc/mdstat + if grep -q "^md" /proc/mdstat; then + echo "⚠️ RAID arrays detected - Longhorn does not support RAID!" 
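+        # Optional extra diagnostic (assumes mdadm is installed): list the arrays
+        # and their members so you know which disks must be freed for Longhorn
+        sudo mdadm --detail --scan 2>/dev/null || true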
+    else
+        echo "βœ“ No RAID arrays found"
+    fi
+else
+    echo "βœ“ No RAID support"
+fi
+
+echo -e "\n=== Mount Status ==="
+df -h | grep "/mnt/disk"
+
+echo -e "\n=== fstab Entries ==="
+grep "/mnt/disk" /etc/fstab
+
+echo -e "\n=== Mount Test ==="
+sudo mount -a && echo "βœ“ All mounts successful" || echo "βœ— Mount errors"
+
+echo -e "\n=== Disk Space Check ==="
+df -h / | tail -1 | awk '{print "Root: " $4 " available (" $5 " used)"}'
+```
+
+## Reference
+
+### Longhorn Documentation
+- [Multiple Disk Support](https://longhorn.io/docs/1.8.0/nodes-and-volumes/nodes/multidisk/)
+- [Node Space Usage](https://longhorn.io/docs/1.8.0/nodes-and-volumes/nodes/node-space-usage/)
+
+### Cluster-Bloom Configuration
+- Storage options: `NO_DISKS_FOR_CLUSTER`, `CLUSTER_DISKS`, `CLUSTER_PREMOUNTED_DISKS`
+- Device path format: `/dev/nvme0n1,/dev/nvme1n1` (comma-separated)
+- Premounted disk format: `/mnt/disk1,/mnt/disk2` (comma-separated)
+
+### Setup Script
+- `cluster-bloom/experimental/longhorn-disk-setup.sh`
+
+### Best Practices
+1. **Always back up data** before disk operations
+2. **Use UUID-based mounting** for reliability
+3. **Test mount operations** before rebooting
+4. **Monitor disk space** regularly
+5. **Keep fstab entries simple** and well-documented
+6. **Use the `nofail` option** to prevent boot issues
diff --git a/docs/11-manual-steps-quick-reference.md b/docs/manual-steps-quick-reference.md
similarity index 100%
rename from docs/11-manual-steps-quick-reference.md
rename to docs/manual-steps-quick-reference.md
diff --git a/docs/04-network-configuration.md b/docs/network-configuration.md
similarity index 100%
rename from docs/04-network-configuration.md
rename to docs/network-configuration.md
diff --git a/docs/12-oidc-authentication.md b/docs/oidc-authentication.md
similarity index 100%
rename from docs/12-oidc-authentication.md
rename to docs/oidc-authentication.md
diff --git a/docs/01-rke2-deployment.md b/docs/rke2-deployment.md
similarity index 100%
rename from docs/01-rke2-deployment.md
rename to docs/rke2-deployment.md
diff --git a/docs/02-rocm-support.md b/docs/rocm-support.md
similarity index 100%
rename from docs/02-rocm-support.md
rename to docs/rocm-support.md
diff --git a/docs/03-storage-management.md b/docs/storage-management.md
similarity index 100%
rename from docs/03-storage-management.md
rename to docs/storage-management.md
diff --git a/docs/07-technical-architecture.md b/docs/technical-architecture.md
similarity index 100%
rename from docs/07-technical-architecture.md
rename to docs/technical-architecture.md
diff --git a/docs/06-terminal-ui.md b/docs/terminal-ui.md
similarity index 100%
rename from docs/06-terminal-ui.md
rename to docs/terminal-ui.md
diff --git a/experimental/longhorn-disk-setup.sh b/experimental/longhorn-disk-setup.sh
new file mode 100755
index 0000000..df7f7f5
--- /dev/null
+++ b/experimental/longhorn-disk-setup.sh
@@ -0,0 +1,597 @@
+#!/bin/bash
+
+# Longhorn Disk Setup Automation Script
+# This script automates the disk formatting, mounting, and fstab configuration
+# for Longhorn storage on cluster-bloom nodes
+#
+# Features:
+# - RAID detection and removal (with backup/restore capability)
+# - Disk space analysis and recommendations
+# - Dry-run mode for planning
+# - Safe disk formatting and mounting
+
+set -euo pipefail
+
+# Configuration
+MOUNT_BASE="/mnt/disk"
+FILESYSTEM_TYPE="ext4"
+FSTAB_OPTIONS="defaults,nofail 0 2"
+BACKUP_DIR="/root/longhorn-raid-backup"
+
+# Runtime flags
+DRY_RUN=false
+REMOVE_RAID=false
+FORCE_RAID_REMOVAL=false
+
+# 
Colors for output +RED='\033[0;31m' +GREEN='\033[0;32m' +YELLOW='\033[1;33m' +BLUE='\033[0;34m' +NC='\033[0m' # No Color + +# Logging functions +log_info() { + echo -e "${BLUE}[INFO]${NC} $1" +} + +log_success() { + echo -e "${GREEN}[SUCCESS]${NC} $1" +} + +log_warning() { + echo -e "${YELLOW}[WARNING]${NC} $1" +} + +log_error() { + echo -e "${RED}[ERROR]${NC} $1" +} + +# Parse command line arguments +parse_args() { + while [[ $# -gt 0 ]]; do + case $1 in + --dry-run) + DRY_RUN=true + log_info "Running in dry-run mode - no changes will be made" + shift + ;; + --remove-raid) + REMOVE_RAID=true + shift + ;; + --force-raid-removal) + FORCE_RAID_REMOVAL=true + REMOVE_RAID=true + shift + ;; + -h|--help) + show_help + exit 0 + ;; + *) + log_error "Unknown option: $1" + show_help + exit 1 + ;; + esac + done +} + +# Show help +show_help() { + cat << EOF +Longhorn Disk Setup Automation Script + +Usage: $0 [OPTIONS] + +OPTIONS: + --dry-run Show what would be done without making changes + --remove-raid Remove detected RAID configurations + --force-raid-removal Force RAID removal without confirmation + -h, --help Show this help message + +Examples: + $0 --dry-run # See recommendations without changes + $0 # Interactive setup + $0 --remove-raid # Handle RAID removal if needed + +This script will: +1. Check disk space requirements +2. Detect and optionally remove software RAID +3. Set up Longhorn-compatible disk configuration +4. Create proper fstab entries for persistent mounting + +EOF +} + +# Check if running as root +check_root() { + if [[ $EUID -ne 0 ]]; then + log_error "This script must be run as root (use sudo)" + exit 1 + fi +} + +# Check disk space requirements +check_disk_space() { + log_info "Checking disk space requirements..." + + local root_available_kb root_available_gb var_available_kb var_available_gb + + # Get available space in KB + root_available_kb=$(df --output=avail / | tail -1) + root_available_gb=$((root_available_kb / 1024 / 1024)) + + log_info "Root partition available space: ${root_available_gb}GB" + + # Check minimum requirements + if [[ $root_available_kb -lt 10485760 ]]; then # Less than 10GB + log_error "Root partition has less than 10GB available (minimum requirement)" + return 1 + elif [[ $root_available_kb -lt 20971520 ]]; then # Less than 20GB + log_warning "Root partition has less than 20GB available (recommended)" + log_warning "Consider creating dedicated /var/lib/rancher partition" + echo "RECOMMEND_VAR_LIB_RANCHER=true" + else + log_success "Root partition has sufficient space (${root_available_gb}GB available)" + echo "RECOMMEND_VAR_LIB_RANCHER=false" + fi + + # Check /var if separately mounted + if mountpoint -q /var; then + var_available_kb=$(df --output=avail /var | tail -1) + var_available_gb=$((var_available_kb / 1024 / 1024)) + log_info "/var partition available space: ${var_available_gb}GB" + + if [[ $var_available_kb -lt 5242880 ]]; then # Less than 5GB + log_warning "/var partition has less than 5GB available (recommended for container images)" + fi + fi +} + +# Detect RAID arrays +detect_raid() { + log_info "Checking for software RAID configurations..." + + if [[ ! -f /proc/mdstat ]]; then + log_info "No software RAID support detected" + return 1 + fi + + local md_arrays + md_arrays=$(awk '/^md/ {print $1}' /proc/mdstat) + + if [[ -z "$md_arrays" ]]; then + log_info "No active RAID arrays found" + return 1 + fi + + log_warning "Found active RAID arrays:" + cat /proc/mdstat + echo "" + + log_warning "⚠️ Longhorn does NOT support RAID configurations!" 
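+    # Note: returning 0 below signals "RAID found" so that the caller's
+    # `if detect_raid; then` branch runs the backup/removal flow.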
+ log_warning "RAID arrays must be removed before configuring Longhorn storage" + + return 0 +} + +# Backup RAID configuration +backup_raid_config() { + log_info "Backing up RAID configuration..." + + if [[ $DRY_RUN == true ]]; then + log_info "[DRY-RUN] Would backup RAID config to $BACKUP_DIR" + return 0 + fi + + # Create backup directory + mkdir -p "$BACKUP_DIR" + + # Backup mdadm config + if command -v mdadm >/dev/null 2>&1; then + mdadm --detail --scan > "$BACKUP_DIR/mdadm.conf.backup" + cp /proc/mdstat "$BACKUP_DIR/mdstat.backup" + + # Backup individual array details + for md_device in /dev/md*; do + if [[ -b "$md_device" ]]; then + local md_name + md_name=$(basename "$md_device") + mdadm --detail "$md_device" > "$BACKUP_DIR/${md_name}_detail.backup" 2>/dev/null || true + fi + done + + log_success "RAID configuration backed up to $BACKUP_DIR" + log_info "Backup includes: mdadm.conf, mdstat, and individual array details" + else + log_error "mdadm not found - cannot backup RAID configuration" + return 1 + fi +} + +# Remove RAID arrays +remove_raid_arrays() { + log_warning "Removing RAID arrays..." + + if [[ $DRY_RUN == true ]]; then + log_info "[DRY-RUN] Would remove the following RAID arrays:" + awk '/^md/ {print " /dev/" $1}' /proc/mdstat + return 0 + fi + + # Get list of RAID arrays + local md_arrays + md_arrays=$(awk '/^md/ {print "/dev/" $1}' /proc/mdstat) + + if [[ -z "$md_arrays" ]]; then + log_info "No RAID arrays to remove" + return 0 + fi + + # Confirm removal unless forced + if [[ $FORCE_RAID_REMOVAL != true ]]; then + echo "" + log_warning "This will DESTROY the following RAID arrays:" + printf '%s\n' $md_arrays + echo "" + read -p "Are you sure you want to proceed? (yes/no): " -r + + if [[ $REPLY != "yes" ]]; then + log_info "RAID removal cancelled by user" + return 1 + fi + fi + + # Stop and remove arrays + for md_array in $md_arrays; do + log_info "Stopping RAID array: $md_array" + + # Unmount if mounted + if mount | grep -q "$md_array"; then + log_info "Unmounting $md_array" + umount "$md_array" || log_warning "Failed to unmount $md_array" + fi + + # Stop the array + mdadm --stop "$md_array" || log_warning "Failed to stop $md_array" + + # Remove the array + mdadm --remove "$md_array" 2>/dev/null || true + done + + # Zero superblocks on member disks + log_info "Clearing RAID superblocks on member disks..." + for disk in /dev/sd* /dev/nvme*n1; do + if [[ -b "$disk" ]]; then + mdadm --zero-superblock "$disk" 2>/dev/null || true + fi + done + + log_success "RAID arrays removed successfully" + log_info "Individual disks are now available for Longhorn use" +} + +# Restore RAID configuration (if needed) +restore_raid_config() { + local backup_file="$BACKUP_DIR/mdadm.conf.backup" + + if [[ ! -f "$backup_file" ]]; then + log_error "No RAID backup found at $backup_file" + return 1 + fi + + log_warning "Restoring RAID configuration from backup..." + log_warning "This will recreate the original RAID arrays" + + read -p "Are you sure you want to restore RAID? 
(yes/no): " -r + if [[ $REPLY != "yes" ]]; then + log_info "RAID restoration cancelled" + return 1 + fi + + # Restore using backed up configuration + while IFS= read -r line; do + if [[ $line == ARRAY* ]]; then + log_info "Restoring: $line" + eval "mdadm --assemble $line" + fi + done < "$backup_file" + + log_success "RAID configuration restored" + log_info "Check /proc/mdstat to verify arrays" +} + +# Discover candidate disks (prioritized: NVMe > SSD > HDD) +discover_disks() { + local disks=() + + # Priority 1: NVMe drives (excluding those in RAID) + for disk in /dev/nvme*n1; do + if [[ -b "$disk" ]] && ! is_disk_in_raid "$disk"; then + disks+=("$disk") + fi + done + + # Priority 2: SATA/SCSI drives (excluding sda and those in RAID) + for disk in /dev/sd[b-z]; do + if [[ -b "$disk" ]] && ! is_disk_in_raid "$disk"; then + disks+=("$disk") + fi + done + + printf '%s\n' "${disks[@]}" | sort +} + +# Check if disk is part of a RAID array +is_disk_in_raid() { + local disk="$1" + + # Check if disk or its partitions are part of any md array + if [[ -f /proc/mdstat ]]; then + local disk_base + disk_base=$(basename "$disk") + grep -q "$disk_base" /proc/mdstat 2>/dev/null + else + return 1 + fi +} + +# Check if disk is formatted +is_disk_formatted() { + local disk="$1" + blkid -s UUID -o value "$disk" >/dev/null 2>&1 +} + +# Get disk UUID +get_disk_uuid() { + local disk="$1" + blkid -s UUID -o value "$disk" +} + +# Check if disk is mounted +is_disk_mounted() { + local disk="$1" + mount | grep -q "^$disk" +} + +# Get next available mount point +get_next_mount_point() { + local counter=0 + while [[ -d "${MOUNT_BASE}${counter}" ]]; do + if mount | grep -q "${MOUNT_BASE}${counter}"; then + ((counter++)) + else + break + fi + done + echo "${MOUNT_BASE}${counter}" +} + +# Format disk with ext4 +format_disk() { + local disk="$1" + + if [[ $DRY_RUN == true ]]; then + log_info "[DRY-RUN] Would format $disk with $FILESYSTEM_TYPE" + return 0 + fi + + log_warning "Formatting $disk with $FILESYSTEM_TYPE (THIS WILL DESTROY ALL DATA)" + read -p "Are you sure you want to format $disk? (yes/no): " -r + + if [[ $REPLY == "yes" ]]; then + mkfs.ext4 -F "$disk" + log_success "Formatted $disk successfully" + else + log_info "Skipping formatting of $disk" + return 1 + fi +} + +# Create mount point +create_mount_point() { + local mount_point="$1" + + if [[ $DRY_RUN == true ]]; then + if [[ ! -d "$mount_point" ]]; then + log_info "[DRY-RUN] Would create mount point: $mount_point" + else + log_info "[DRY-RUN] Mount point already exists: $mount_point" + fi + return 0 + fi + + if [[ ! -d "$mount_point" ]]; then + mkdir -p "$mount_point" + log_success "Created mount point: $mount_point" + else + log_info "Mount point already exists: $mount_point" + fi +} + +# Add entry to fstab +add_to_fstab() { + local uuid="$1" + local mount_point="$2" + + local fstab_entry="UUID=$uuid $mount_point $FILESYSTEM_TYPE $FSTAB_OPTIONS" + + if [[ $DRY_RUN == true ]]; then + log_info "[DRY-RUN] Would add to /etc/fstab: $fstab_entry" + return 0 + fi + + # Check if entry already exists + if grep -q "$uuid" /etc/fstab; then + log_warning "Entry for UUID $uuid already exists in /etc/fstab" + return 0 + fi + + # Add entry to fstab + echo "$fstab_entry" >> /etc/fstab + log_success "Added to /etc/fstab: $fstab_entry" +} + +# Validate mounts +validate_mounts() { + log_info "Validating mounts..." 
+ + if [[ $DRY_RUN == true ]]; then + log_info "[DRY-RUN] Would validate all fstab entries can mount" + log_info "[DRY-RUN] Current Longhorn mounts:" + df -h | grep "${MOUNT_BASE}" || log_info "[DRY-RUN] No Longhorn disks currently mounted" + return 0 + fi + + # Test mount all + if mount -a; then + log_success "All fstab entries mounted successfully" + else + log_error "Failed to mount some entries from fstab" + return 1 + fi + + # Show mounted disks + echo "" + log_info "Currently mounted Longhorn storage disks:" + df -h | grep "${MOUNT_BASE}" || log_warning "No Longhorn disks currently mounted" +} + +# Display summary +display_summary() { + echo "" + echo "==========================================" + log_info "Longhorn Disk Setup Summary" + echo "==========================================" + + echo "" + log_info "Mounted disks:" + df -h | grep "${MOUNT_BASE}" || echo "None" + + echo "" + log_info "fstab entries for Longhorn disks:" + grep "${MOUNT_BASE}" /etc/fstab || echo "None" + + echo "" + log_info "Next steps:" + echo "1. Access Longhorn UI at: https://longhorn.cluster-name" + echo "2. Navigate to Node tab" + echo "3. For each node, select 'Edit node and disks'" + echo "4. Add each mounted disk path (e.g., /mnt/disk0, /mnt/disk1)" + echo "5. Enable scheduling for each disk" +} + +# Main function +main() { + # Parse command line arguments + parse_args "$@" + + log_info "Starting Longhorn disk setup automation..." + if [[ $DRY_RUN == true ]]; then + log_info "=== DRY RUN MODE - No changes will be made ===" + fi + + # Check prerequisites + check_root + + # Check disk space requirements + check_disk_space + echo "" + + # Check for RAID and handle if necessary + if detect_raid; then + echo "" + if [[ $REMOVE_RAID == true ]] || [[ $DRY_RUN == true ]]; then + backup_raid_config + remove_raid_arrays + else + log_warning "RAID detected but not removing. Use --remove-raid to handle this." + log_warning "Longhorn requires individual disks, not RAID arrays." + echo "" + log_info "Options:" + echo " 1. Run with --remove-raid to safely remove RAID" + echo " 2. Run with --dry-run to see what would be done" + echo " 3. Manually remove RAID configuration first" + exit 1 + fi + echo "" + fi + + # Discover available disks + log_info "Discovering candidate disks..." + mapfile -t candidate_disks < <(discover_disks) + + if [[ ${#candidate_disks[@]} -eq 0 ]]; then + log_warning "No candidate disks found" + if [[ -f /proc/mdstat ]] && grep -q "^md" /proc/mdstat; then + log_info "Note: Disks may be in RAID arrays. Use --remove-raid to make them available." + fi + exit 0 + fi + + log_info "Found ${#candidate_disks[@]} candidate disk(s):" + printf ' %s\n' "${candidate_disks[@]}" + + echo "" + + # Process each disk + for disk in "${candidate_disks[@]}"; do + log_info "Processing disk: $disk" + + # Skip if already mounted + if is_disk_mounted "$disk"; then + log_info "Disk $disk is already mounted, skipping" + continue + fi + + # Check if formatted + if ! is_disk_formatted "$disk"; then + log_warning "Disk $disk appears to be unformatted" + if ! format_disk "$disk"; then + continue + fi + else + log_info "Disk $disk is already formatted" + fi + + # Get UUID (skip in dry run if not formatted) + if [[ $DRY_RUN == true ]] && ! 
is_disk_formatted "$disk"; then + log_info "[DRY-RUN] Would assign UUID after formatting" + uuid="" + else + uuid=$(get_disk_uuid "$disk") + log_info "Disk UUID: $uuid" + fi + + # Get mount point + mount_point=$(get_next_mount_point) + log_info "Mount point: $mount_point" + + # Create mount point + create_mount_point "$mount_point" + + # Add to fstab + add_to_fstab "$uuid" "$mount_point" + + echo "" + done + + # Validate mounts + validate_mounts + + # Display summary + display_summary + + if [[ $DRY_RUN == true ]]; then + log_info "=== DRY RUN COMPLETE - No changes were made ===" + log_info "Run without --dry-run to apply these changes" + else + log_success "Longhorn disk setup completed successfully!" + log_info "This is the way to configure storage that eliminates impurities." + fi +} + +# Run main function +main "$@" \ No newline at end of file