
System Overview

genesluna edited this page Jan 2, 2026 · 1 revision

This page documents the architecture of the k8s-ephemeral-environments platform: its infrastructure components, namespace organization, and the PR environment lifecycle.

High-Level Architecture

The platform runs on a single VPS with k3s, hosting both permanent infrastructure components and ephemeral PR environments.

+---------------------------------------------------------------------+
|                     VPS (4 vCPU, 24GB RAM, 100GB NVMe)              |
|  +---------------------------------------------------------------+  |
|  |                        k3s Cluster                            |  |
|  |                                                               |  |
|  | +-----------------+  +-----------------+  +-----------------+ |  |
|  | |  observability  |  |   arc-runners   |  |   app-pr-123    | |  |
|  | |                 |  |                 |  |   (ephemeral)   | |  |
|  | | - Prometheus    |  | - Runner x2     |  |                 | |  |
|  | | - Loki          |  |                 |  | - App Pod       | |  |
|  | | - Grafana       |  |                 |  | - DB Pod        | |  |
|  | +-----------------+  +-----------------+  +-----------------+ |  |
|  |                                                               |  |
|  | +-----------------+  +-----------------+  +-----------------+ |  |
|  | |   app-pr-456    |  |   app-pr-789    |  |    platform     | |  |
|  | |   (ephemeral)   |  |   (ephemeral)   |  |    (system)     | |  |
|  | +-----------------+  +-----------------+  +-----------------+ |  |
|  +---------------------------------------------------------------+  |
+---------------------------------------------------------------------+

Infrastructure Details

Attribute     Value
------------  ---------------------------------
Provider      Oracle Cloud Infrastructure (OCI)
Public IP     168.138.151.63
Hostname      genilda
OS            Ubuntu 24.04.3 LTS (Noble Numbat)
Architecture  ARM64 (aarch64)
vCPUs         4
RAM           24 GB
Disk          96 GB NVMe

Important: All container images must support the linux/arm64 platform.
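One way to check this requirement before deploying is to inspect a multi-arch image's index for a linux/arm64 entry. The sketch below assumes the index JSON has already been fetched (for example with `docker manifest inspect <image>`) and parsed into a dict; the function name and the sample data are illustrative, not part of the platform.

```python
def supports_linux_arm64(image_index: dict) -> bool:
    """Return True if a multi-arch image index lists a linux/arm64 manifest."""
    for manifest in image_index.get("manifests", []):
        platform = manifest.get("platform", {})
        if platform.get("os") == "linux" and platform.get("architecture") == "arm64":
            return True
    return False


# Illustrative index with two platforms (not taken from a real image).
sample_index = {
    "manifests": [
        {"platform": {"os": "linux", "architecture": "amd64"}},
        {"platform": {"os": "linux", "architecture": "arm64", "variant": "v8"}},
    ]
}
print(supports_linux_arm64(sample_index))  # True
```

A single-architecture image has no `manifests` list, so this check returns False for it; its architecture would have to be read from the image config instead.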

Namespace Structure

The cluster organizes workloads into permanent system namespaces and ephemeral PR namespaces.

Namespace                 Purpose                                    Lifecycle
------------------------  -----------------------------------------  ------------------------
kube-system               k3s core components, Traefik ingress       Permanent
observability             Prometheus, Loki, Grafana                  Permanent
arc-systems               ARC controller (manages runner lifecycle)  Permanent
arc-runners               GitHub Actions self-hosted runner pods     Permanent
platform                  Shared base components, CronJobs           Permanent
{project-id}-pr-{number}  Ephemeral environment per PR               Ephemeral (PR lifecycle)

Namespace Naming Convention

Ephemeral namespaces follow the pattern: {project-id}-pr-{number}

Examples:

  • k8s-ee-pr-28 - PR #28 in the k8s-ee project
  • my-app-pr-156 - PR #156 in the my-app project
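The naming pattern can be sketched as a pair of helpers that build and parse namespace names. The function names are hypothetical; the validation reflects the general Kubernetes rule that a namespace name must be a DNS label (lowercase alphanumerics and hyphens, at most 63 characters), not a check the platform is known to perform.

```python
import re
from typing import Optional, Tuple

# DNS label rule that Kubernetes applies to namespace names.
_LABEL_RE = re.compile(r"^[a-z0-9]([-a-z0-9]*[a-z0-9])?$")
_PR_NAMESPACE_RE = re.compile(
    r"^(?P<project>[a-z0-9]([-a-z0-9]*[a-z0-9])?)-pr-(?P<number>[1-9][0-9]*)$"
)


def pr_namespace(project_id: str, pr_number: int) -> str:
    """Build the {project-id}-pr-{number} namespace name and validate it."""
    if pr_number < 1:
        raise ValueError("PR number must be positive")
    name = f"{project_id}-pr-{pr_number}"
    if len(name) > 63 or not _LABEL_RE.match(name):
        raise ValueError(f"not a valid namespace name: {name!r}")
    return name


def parse_pr_namespace(namespace: str) -> Optional[Tuple[str, int]]:
    """Return (project_id, pr_number) for an ephemeral namespace, else None."""
    match = _PR_NAMESPACE_RE.match(namespace)
    if match is None:
        return None
    return match.group("project"), int(match.group("number"))


print(pr_namespace("k8s-ee", 28))           # k8s-ee-pr-28
print(parse_pr_namespace("my-app-pr-156"))  # ('my-app', 156)
print(parse_pr_namespace("kube-system"))    # None
```

Parsing also gives a simple way to tell ephemeral namespaces apart from the permanent ones when listing the cluster.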

Technology Stack

Component          Technology                       Justification
-----------------  -------------------------------  ----------------------------------------------------
Kubernetes         k3s                              Lightweight, production-ready, ideal for single-node
Ingress            Traefik                          Included in k3s, native Let's Encrypt support
CI/CD              GitHub Actions                   Native integration, familiar to developers
Logs               Loki + Promtail                  Lightweight, native Grafana integration
Metrics            Prometheus                       Industry standard, broad ecosystem
Dashboards         Grafana                          Unified interface for logs and metrics
Runners            actions-runner-controller (ARC)  Ephemeral and scalable runners in cluster
PostgreSQL         CloudNativePG                    Manages PostgreSQL lifecycle automatically
MariaDB            mariadb:11                       Simple deployment for MySQL-compatible needs
MongoDB            MongoDB Community Operator       Replica set management for NoSQL needs
Redis              redis:7-alpine                   High-performance caching
Object Storage     MinIO                            S3-compatible file storage
Secrets            Sealed Secrets                   Encrypted secrets in git
Storage            Local Path Provisioner           Simple, adequate for MVP
DNS                Wildcard                         *.k8s-ee.genesluna.dev resolves to VPS IP
Network Isolation  NetworkPolicies (kube-router)    Isolation between PR namespaces
Priority Classes   platform-critical, default-app   Workload prioritization

PR Environment Lifecycle

The complete flow from PR creation to environment destruction:

+------------+     +------------+     +------------+     +------------+
|  PR Open   |---->|  GitHub    |---->|  Create    |---->|  Deploy    |
|            |     |  Action    |     | Namespace  |     | App + DB   |
+------------+     +------------+     +------------+     +------------+
                                                               |
                                                               v
+------------+     +------------+     +------------+     +------------+
|  PR Close  |---->|  GitHub    |---->|  Delete    |<----|  Preview   |
|  or Merge  |     |  Action    |     | Namespace  |     |    URL     |
+------------+     +------------+     +------------+     +------------+

Detailed Lifecycle Steps

  1. PR Opened - Developer opens a pull request
  2. Organization Validated - PR author's org checked against allowlist
  3. Namespace Created - {project-id}-pr-{number} namespace provisioned
  4. Resource Quotas Applied - Dynamic quotas based on enabled databases
  5. NetworkPolicies Applied - Isolation rules for the namespace
  6. Application Deployed - App + configured databases deployed via Helm
  7. Ingress Created - Public URL becomes available
  8. Bot Comments - PR receives comment with preview URL
  9. Push to PR - New commits trigger automatic re-deployment
  10. Optional Preserve - /preserve command extends environment life
  11. PR Closed/Merged - Namespace destroyed (unless preserved)
  12. Preserve Expiry - Hourly CronJob removes expired preserve labels
  13. Orphan Cleanup - 6-hour CronJob catches any missed namespaces
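The branching in these steps can be summarized as a small decision function. This is a minimal sketch, not the platform's workflow code: the action names are the standard GitHub `pull_request` webhook actions, while `PullRequestEvent`, `plan`, and the allowlist contents are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical allowlist; the real list of organizations is not given on this page.
ALLOWED_ORGS = frozenset({"example-org"})

# `pull_request` webhook actions that create or refresh an environment.
DEPLOY_ACTIONS = frozenset({"opened", "reopened", "synchronize"})


@dataclass(frozen=True)
class PullRequestEvent:
    action: str              # value of the webhook's `action` field
    author_org: str          # organization the PR author belongs to
    preserved: bool = False  # True once /preserve has been accepted


def plan(event: PullRequestEvent) -> str:
    """Map an event to one of: deploy, reject, destroy, keep, ignore."""
    if event.action in DEPLOY_ACTIONS:
        if event.author_org not in ALLOWED_ORGS:
            return "reject"   # step 2: organization not on the allowlist
        return "deploy"       # steps 3-8, repeated on every push (step 9)
    if event.action == "closed":  # sent for both closed and merged PRs
        return "keep" if event.preserved else "destroy"  # steps 10-11
    return "ignore"


print(plan(PullRequestEvent("opened", "example-org")))                  # deploy
print(plan(PullRequestEvent("closed", "example-org", preserved=True)))  # keep
```

The two CronJobs in steps 12 and 13 sit outside this event-driven path; they act as a safety net for anything the event handlers miss.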

Preserve Environment Feature

Developers can keep an environment alive after its PR is closed by using the /preserve command:

Constraint              Value
----------------------  --------------
Default duration        48 hours
Maximum duration        48 hours
Max preserved per user  3 environments
Expiry check            Hourly CronJob
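These constraints can be expressed as a short expiry calculation. The sketch below uses the values from the table; the function names are hypothetical, and clamping an over-long request to the maximum (rather than rejecting it) is an assumption.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_HOURS = 48
MAX_HOURS = 48
MAX_PRESERVED_PER_USER = 3


def preserve_until(now: datetime, requested_hours: int = DEFAULT_HOURS,
                   active_for_user: int = 0) -> datetime:
    """Return the expiry time for a new /preserve request, or raise ValueError."""
    if active_for_user >= MAX_PRESERVED_PER_USER:
        raise ValueError("user already has the maximum number of preserved environments")
    if requested_hours <= 0:
        raise ValueError("requested_hours must be positive")
    # Assumption: requests above the maximum are clamped, not rejected.
    return now + timedelta(hours=min(requested_hours, MAX_HOURS))


def is_expired(expires_at: datetime, now: datetime) -> bool:
    """Check an hourly expiry job could run for each preserved namespace."""
    return now >= expires_at


now = datetime(2026, 1, 2, 12, 0, tzinfo=timezone.utc)
expires_at = preserve_until(now)
print(expires_at.isoformat())                              # 2026-01-04T12:00:00+00:00
print(is_expired(expires_at, now + timedelta(hours=49)))   # True
```

Because the check runs hourly, an environment may outlive its expiry time by up to about an hour before it is removed.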

Dynamic Resource Quotas

The ResourceQuota for each PR namespace is calculated automatically from the databases the project enables; no manual configuration is required.

Base Resources

Base (app only):    300m CPU,  512Mi memory,  1Gi storage

Per-Database Additions

Database    CPU               Memory   Storage
----------  ----------------  -------  -------
PostgreSQL  +500m             +512Mi   +2Gi
MongoDB     +1000m (init)     +640Mi   +3Gi
Redis       +200m             +128Mi   -
MinIO       +1000m (sidecar)  +1024Mi  +2Gi
MariaDB     +300m             +256Mi   +2Gi

Rolling Update Headroom

The platform adds a buffer for rolling updates, during which old and new pods run simultaneously:

+ Rolling update buffer: +100m CPU requests, +256Mi memory requests
+ Limits buffer: +15% on CPU limits, +15% on memory limits

Example Quota Calculations

Configuration             CPU Limit  Memory Limit  Storage
------------------------  ---------  ------------  -------
App only                  300m       512Mi         1Gi
App + PostgreSQL          800m       1Gi           3Gi
App + PostgreSQL + Redis  1000m      1.1Gi         3Gi
All databases enabled     2100m      2.4Gi         9Gi
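The additive part of this calculation can be sketched as a plain sum of the base figures and the per-database additions. The names below are hypothetical, and the rolling-update and limits buffers are deliberately left out, because this page does not say how they combine with the totals.

```python
from typing import Dict, Iterable

# Figures from the tables above: (CPU millicores, memory MiB, storage GiB).
BASE = (300, 512, 1)
PER_DATABASE = {
    "postgresql": (500, 512, 2),
    "mongodb": (1000, 640, 3),
    "redis": (200, 128, 0),
    "minio": (1000, 1024, 2),
    "mariadb": (300, 256, 2),
}


def estimate_quota(enabled: Iterable[str]) -> Dict[str, int]:
    """Sum the base resources and the additions for each enabled database."""
    cpu_m, memory_mi, storage_gi = BASE
    for name in sorted(set(enabled)):  # count each database at most once
        if name not in PER_DATABASE:
            raise ValueError(f"unknown database: {name}")
        add_cpu, add_memory, add_storage = PER_DATABASE[name]
        cpu_m += add_cpu
        memory_mi += add_memory
        storage_gi += add_storage
    return {"cpu_m": cpu_m, "memory_mi": memory_mi, "storage_gi": storage_gi}


print(estimate_quota(["postgresql", "redis"]))
# {'cpu_m': 1000, 'memory_mi': 1152, 'storage_gi': 3}
```

This plain sum reproduces the first three rows of the example table (1152Mi is about 1.1Gi). For all five databases it gives 3300m, 3072Mi, and 10Gi, which does not match the last row, so the real calculation presumably applies rules that this page does not describe.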

Network Architecture

DNS Configuration

  • Wildcard DNS: *.k8s-ee.genesluna.dev resolves to VPS IP
  • Preview URLs: {project-id}-pr-{number}.k8s-ee.genesluna.dev
  • Service URLs:
    • Grafana: grafana.k8s-ee.genesluna.dev
    • Prometheus: prometheus.k8s-ee.genesluna.dev
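Since preview hostnames use the same `{project-id}-pr-{number}` pattern as namespaces, the mapping between the two can be sketched in a few lines. The helper names are hypothetical; the base domain is the one listed above.

```python
import re
from typing import Optional

BASE_DOMAIN = "k8s-ee.genesluna.dev"


def preview_host(project_id: str, pr_number: int) -> str:
    """Hostname under the wildcard domain for a given PR."""
    return f"{project_id}-pr-{pr_number}.{BASE_DOMAIN}"


def namespace_for_host(host: str) -> Optional[str]:
    """Return the PR namespace a preview hostname refers to, else None."""
    suffix = "." + BASE_DOMAIN
    if not host.endswith(suffix):
        return None
    label = host[: -len(suffix)]
    if re.fullmatch(r"[a-z0-9-]+-pr-[1-9][0-9]*", label):
        return label
    return None  # e.g. shared service hosts such as grafana


print(preview_host("k8s-ee", 28))                          # k8s-ee-pr-28.k8s-ee.genesluna.dev
print(namespace_for_host("grafana.k8s-ee.genesluna.dev"))  # None
```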

Network Isolation

Each PR namespace has NetworkPolicies that:

  1. Deny all ingress by default - No cross-namespace traffic
  2. Allow ingress from Traefik - Only through the ingress controller
  3. Allow egress to DNS - Required for service discovery
  4. Allow egress to Kubernetes API - Required for operators
  5. Allow internal namespace traffic - App can reach its databases
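As a rough illustration, rules 1, 2, 3, and 5 can be written as `networking.k8s.io/v1` NetworkPolicy manifests built as Python dicts. This is a sketch under assumptions: the policy names are invented, Traefik is assumed to run in kube-system with the label `app.kubernetes.io/name: traefik`, and rule 4 (egress to the Kubernetes API) is omitted because it depends on the cluster's API server address.

```python
from typing import List


def network_policies(namespace: str) -> List[dict]:
    """Build illustrative NetworkPolicy manifests for one PR namespace."""

    def policy(name: str, spec: dict) -> dict:
        return {
            "apiVersion": "networking.k8s.io/v1",
            "kind": "NetworkPolicy",
            "metadata": {"name": name, "namespace": namespace},
            "spec": spec,
        }

    every_pod: dict = {}  # an empty podSelector selects all pods in the namespace
    return [
        # Rule 1: selecting pods for Ingress with no ingress rules denies all ingress.
        policy("default-deny-ingress",
               {"podSelector": every_pod, "policyTypes": ["Ingress"]}),
        # Rule 2: allow ingress from Traefik pods (labels are assumptions).
        policy("allow-traefik-ingress", {
            "podSelector": every_pod,
            "policyTypes": ["Ingress"],
            "ingress": [{"from": [{
                "namespaceSelector": {"matchLabels": {
                    "kubernetes.io/metadata.name": "kube-system"}},
                "podSelector": {"matchLabels": {
                    "app.kubernetes.io/name": "traefik"}},
            }]}],
        }),
        # Rule 3: allow DNS lookups over UDP and TCP port 53.
        policy("allow-dns-egress", {
            "podSelector": every_pod,
            "policyTypes": ["Egress"],
            "egress": [{"ports": [{"protocol": "UDP", "port": 53},
                                  {"protocol": "TCP", "port": 53}]}],
        }),
        # Rule 5: allow traffic between pods in the same namespace.
        policy("allow-same-namespace", {
            "podSelector": every_pod,
            "policyTypes": ["Ingress", "Egress"],
            "ingress": [{"from": [{"podSelector": every_pod}]}],
            "egress": [{"to": [{"podSelector": every_pod}]}],
        }),
    ]


for manifest in network_policies("k8s-ee-pr-28"):
    print(manifest["metadata"]["name"])
```

NetworkPolicies are additive, so the deny policy sets the baseline and each allow policy opens one specific path on top of it.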

Service Level Objectives

SLI                                        Target
-----------------------------------------  ------
PRs with URL delivered in < 10 min         >= 95%
Namespaces removed in < 5 min after close  >= 98%
Pods with metrics/logs collected           >= 95%
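Each duration-based objective amounts to a ratio check over measured samples. The sketch below shows that arithmetic for the first SLI; the function name and the sample durations are made up for illustration, and how the platform actually collects these measurements is not covered here.

```python
from typing import Sequence


def meets_slo(durations_s: Sequence[float], threshold_s: float,
              target_percent: int) -> bool:
    """True if at least target_percent of samples finished under threshold_s."""
    if not durations_s:
        raise ValueError("no samples to evaluate")
    good = sum(1 for d in durations_s if d < threshold_s)
    # Integer comparison avoids floating-point rounding at the boundary.
    return good * 100 >= target_percent * len(durations_s)


# Illustrative data: 19 of 20 environments had a URL in under 10 minutes.
url_delivery_s = [240.0] * 19 + [900.0]
print(meets_slo(url_delivery_s, threshold_s=600, target_percent=95))  # True
```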

Non-Functional Requirements

Performance

Metric                        Target
----------------------------  ---------------
Environment creation time     <= 10 min (p95)
Namespace destruction time    < 5 min
Observability stack overhead  < 6 GB RAM

Capacity

Metric                      Target
--------------------------  ------
Simultaneous PRs supported  >= 5
Log retention               7 days
Metric retention            7 days

Availability

Metric                               Target
-----------------------------------  ------
Cluster uptime (business hours)      >= 95%
Automatic recovery after VPS reboot  Yes
