Merged
57 changes: 31 additions & 26 deletions docs/PRD.md
@@ -39,42 +39,47 @@ Designed for growth to 100+ nodes.

## Core Features

### 1. Automated RKE2 Kubernetes Deployment
### Automated RKE2 Kubernetes Deployment
Automated deployment of production-ready RKE2 clusters with first node initialization, additional node joining, Cilium CNI integration, and compliance-ready audit logging.

**[📄 Detailed Documentation](./01-rke2-deployment.md)**
**[📄 Detailed Documentation](./rke2-deployment.md)**

### 2. AMD GPU Support with ROCm
### AMD GPU Support with ROCm
Automated AMD GPU driver installation, device detection, permission configuration, and Kubernetes GPU resource integration for AI/ML workloads.

**[📄 Detailed Documentation](./02-rocm-support.md)**
**[📄 Detailed Documentation](./rocm-support.md)**

### 3. Storage Management with Longhorn
### Storage Management with Longhorn
Distributed block storage with automatic disk detection, interactive selection, persistent mounting, and Longhorn CSI integration for reliable persistent volumes.

**[📄 Detailed Documentation](./03-storage-management.md)**
**[📄 Detailed Documentation](./storage-management.md)**

### 4. Network Configuration
### Longhorn Drive Setup and Recovery
Comprehensive drive recovery procedures including RAID detection and removal, disk space analysis, automated formatting and mounting, and troubleshooting for storage issues after node reboots.

**[📄 Detailed Documentation](./longhorn-drive-setup-and-recovery.md)**

### Network Configuration
Comprehensive networking with MetalLB load balancing, firewall configuration, multipath storage networking, and time synchronization across cluster nodes.

**[📄 Detailed Documentation](./04-network-configuration.md)**
**[📄 Detailed Documentation](./network-configuration.md)**

### 5. Interactive Terminal UI
### Interactive Terminal UI
Rich terminal interface with real-time progress tracking, live log streaming, interactive configuration wizards, and comprehensive error handling and recovery options.

**[📄 Detailed Documentation](./06-terminal-ui.md)**
**[📄 Detailed Documentation](./terminal-ui.md)**

### 6. Configuration Management
### Configuration Management
Flexible configuration system supporting YAML files, environment variables, and CLI flags with comprehensive validation and an interactive wizard for guided setup.

**[📄 Configuration Reference](./10-configuration-reference.md)**
**[📄 Configuration Reference](./configuration-reference.md)**

### 7. Node Validation and Testing
### Node Validation and Testing
Comprehensive pre-deployment validation ensures node readiness, connectivity, GPU availability, and proper firewall configuration before any system modifications.

**[📄 Installation Guide](./08-installation-guide.md)**
**[📄 Installation Guide](./installation-guide.md)**

### 8. TLS Certificate Management
### TLS Certificate Management

Flexible certificate management with three deployment options:

@@ -95,25 +100,25 @@ Flexible certificate management with three deployment options:

All certificates are stored as Kubernetes secrets in the `kgateway-system` namespace and integrated with the cluster's ingress controller for HTTPS traffic.

**[📄 Certificate Management Details](./05-certificate-management.md)**
**[📄 Certificate Management Details](./certificate-management.md)**

### 9. Web UI and Monitoring Interface
### Web UI and Monitoring Interface
Browser-based configuration wizard with real-time monitoring dashboard, error recovery interface, and responsive design for remote cluster management from any device.

**[📄 Technical Architecture](./07-technical-architecture.md)**
**[📄 Technical Architecture](./technical-architecture.md)**

### 10. Comprehensive Configuration Validation
### Comprehensive Configuration Validation
Pre-flight validation system checks all configuration, resources, and system requirements before making any changes, providing clear error messages with actionable fixes.

**[📄 Configuration Reference](./10-configuration-reference.md)**
**[📄 Configuration Reference](./configuration-reference.md)**

## Technical Architecture

ClusterBloom uses a modular architecture with command-based interfaces, sequential installation pipelines, and multiple interaction modes (CLI, TUI, Web UI). The system executes in three phases: pre-Kubernetes system preparation, Kubernetes cluster setup, and post-Kubernetes add-on deployment.

**[📄 Technical Architecture Documentation](./07-technical-architecture.md)**
**[📄 Technical Architecture Documentation](./technical-architecture.md)**

**[📄 Configuration Reference](./10-configuration-reference.md)**
**[📄 Configuration Reference](./configuration-reference.md)**

## User Experience

@@ -270,16 +275,16 @@ Browser-based testing with chromedp and comprehensive mock system:

### For Developers and Operators

**[📄 Manual Installation Guide](./08-installation-guide.md)**
**[📄 Manual Installation Guide](./installation-guide.md)**
Complete manual installation procedures for understanding automation or performing custom installations.

**[📄 Cloud Platform Compatibility](./09-cloud-compatibility.md)**
**[📄 Cloud Platform Compatibility](./cloud-compatibility.md)**
Infrastructure dependencies, migration strategies, and configuration for multi-platform deployments (EKS, AKS, GKE).

**[📄 Configuration Reference](./10-configuration-reference.md)**
**[📄 Configuration Reference](./configuration-reference.md)**
Comprehensive configuration variable reference with examples and validation rules.

**[📄 Technical Architecture](./07-technical-architecture.md)**
**[📄 Technical Architecture](./technical-architecture.md)**
Detailed technical architecture, component organization, and implementation patterns.

## Conclusion
79 changes: 79 additions & 0 deletions docs/README.md
@@ -0,0 +1,79 @@
# Cluster-Bloom Documentation

Welcome to the comprehensive documentation for Cluster-Bloom, an enterprise-ready AI/ML cluster deployment platform built on RKE2 and Kubernetes.

## Documentation Overview

This documentation provides complete guidance for deploying, configuring, and managing Cluster-Bloom environments. Each document covers specific aspects of the platform, from initial sizing to advanced configuration.

## Documentation Index

### Getting Started
- [**Cluster Sizing and Configurations**](cluster-sizing-configurations.md) - Hardware requirements, sizing guidelines, and deployment planning
- [**Manual Steps Quick Reference**](manual-steps-quick-reference.md) - Essential commands and procedures for cluster management

### Core Deployment
- [**RKE2 Deployment**](rke2-deployment.md) - Kubernetes cluster foundation setup and configuration
- [**ROCm Support**](rocm-support.md) - AMD GPU support and ROCm integration for AI workloads
- [**Storage Management**](storage-management.md) - Longhorn distributed storage configuration and management
- [**Longhorn Drive Setup and Recovery**](longhorn-drive-setup-and-recovery.md) - Detailed drive recovery, RAID handling, and storage troubleshooting

### Infrastructure Configuration
- [**Network Configuration**](network-configuration.md) - Networking setup, load balancing, and connectivity
- [**Certificate Management**](certificate-management.md) - TLS/SSL certificate handling and automation
- [**Terminal UI**](terminal-ui.md) - Interactive command-line interface and user experience
- [**Technical Architecture**](technical-architecture.md) - System design, component interactions, and architectural decisions

### Operations and Maintenance
- [**Installation Guide**](installation-guide.md) - Complete step-by-step installation procedures
- [**Cloud Compatibility**](cloud-compatibility.md) - Multi-cloud deployment strategies and platform-specific considerations
- [**Configuration Reference**](configuration-reference.md) - Comprehensive configuration options and parameters
- [**OIDC Authentication**](oidc-authentication.md) - Single sign-on integration and identity management

## Quick Navigation

### For New Users
1. Start with [Cluster Sizing and Configurations](cluster-sizing-configurations.md) to plan your deployment
2. Follow the [Installation Guide](installation-guide.md) for step-by-step setup
3. Reference [Manual Steps Quick Reference](manual-steps-quick-reference.md) for common operations

### For System Administrators
- [Technical Architecture](technical-architecture.md) - Understand system design
- [Storage Management](storage-management.md) + [Longhorn Drive Setup and Recovery](longhorn-drive-setup-and-recovery.md) - Complete storage configuration
- [Configuration Reference](configuration-reference.md) - Detailed parameter documentation

### For DevOps Engineers
- [RKE2 Deployment](rke2-deployment.md) - Kubernetes foundation
- [Network Configuration](network-configuration.md) - Infrastructure networking
- [Certificate Management](certificate-management.md) - Security configuration

### Troubleshooting and Recovery
- [Longhorn Drive Setup and Recovery](longhorn-drive-setup-and-recovery.md) - Storage troubleshooting and RAID handling
- [Manual Steps Quick Reference](manual-steps-quick-reference.md) - Emergency procedures and common fixes

## Documentation Standards

- **Comprehensive Coverage**: Each document provides complete information for its topic area
- **Practical Examples**: Real-world configurations and command examples
- **Cross-References**: Links between related topics for easy navigation
- **Version Compatibility**: All procedures are tested against current platform versions

## Contributing

This documentation is maintained as part of the Cluster-Bloom project. For updates, corrections, or additions:

1. Follow the established documentation patterns
2. Include practical examples and command snippets
3. Test all procedures before documenting them
4. Maintain cross-references between related topics

## Support

For questions about the documentation or Cluster-Bloom platform:
- Reference the [Configuration Reference](configuration-reference.md) for parameter details
- Check [Technical Architecture](technical-architecture.md) for design questions
- Use [Manual Steps Quick Reference](manual-steps-quick-reference.md) for operational procedures

---

*This is the way to build enterprise-grade AI infrastructure that eliminates impurities.*
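
The index above links to sibling documents by relative path, so a renamed or missing file silently breaks navigation. The following is a minimal sketch of a check for that, not part of this change: the function name `missing_targets` and the idea of running it against the `docs/` directory are assumptions for illustration.

```python
import re
from pathlib import Path

# Captures the target of a Markdown link that ends in ".md",
# ignoring any "#anchor" suffix, e.g. "(installation-guide.md#steps)".
LINK_TARGET = re.compile(r"\]\(([^)#\s]+\.md)(?:#[^)]*)?\)")


def missing_targets(markdown: str, docs_dir: Path) -> list[str]:
    """Return relative .md link targets that do not exist under docs_dir."""
    targets = sorted(set(LINK_TARGET.findall(markdown)))
    # Absolute URLs are out of scope for a local file check.
    local = [t for t in targets if "://" not in t]
    return [t for t in local if not (docs_dir / t).is_file()]
```

Calling `missing_targets(Path("docs/README.md").read_text(), Path("docs"))` would list any index entry whose file is absent; an empty list means every relative link resolves.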
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
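
The link edits in `docs/PRD.md` follow one mechanical pattern: the two-digit ordering prefix is dropped from each target (for example `./01-rke2-deployment.md` becomes `./rke2-deployment.md`). A minimal sketch of that rewrite follows. It is an illustration under assumptions, not the tooling used for this change; `strip_numeric_prefix` is a hypothetical helper name.

```python
import re

# Matches a relative Markdown link target with a two-digit prefix,
# such as "(./07-technical-architecture.md)" or "(07-technical-architecture.md)".
NUMBERED_LINK = re.compile(r"\((\./)?\d{2}-([A-Za-z0-9-]+\.md)\)")


def strip_numeric_prefix(markdown: str) -> str:
    """Rewrite links like (./01-rke2-deployment.md) to (./rke2-deployment.md)."""
    return NUMBERED_LINK.sub(
        lambda m: f"({m.group(1) or ''}{m.group(2)})",
        markdown,
    )
```

Links that already lack a prefix are left untouched, so the function can be applied repeatedly to the same file without further changes.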