sgour-akamai/my-work

      1 +  # Complete Onboarding Guide: terraform-observability-team Repository
      2 +  
      3 +  ## Welcome to the SRE Observability Team!
      4 +  
      5 +  This guide will teach you everything you need to know about the `terraform-observability-team` 
        + repository as a new team member. No prior knowledge assumed - we'll start from the basics and go 
        + deep.
      6 +  
      7 +  **Team Motto**: *Mors melius quam Nagios* ("Death is better than Nagios")
      8 +  
      9 +  ---
     10 +  
     11 +  ## Table of Contents
     12 +  
     13 +  1. [What is This Repository?](#1-what-is-this-repository)
     14 +  2. [What Does the Observability Team Do?](#2-what-does-the-observability-team-do)
     15 +  3. [The Big Picture: Observability Architecture](#3-the-big-picture-observability-architecture)
     16 +  4. [Repository Structure Explained](#4-repository-structure-explained)
     17 +  5. [Key Services Deep Dive](#5-key-services-deep-dive)
     18 +  6. [How Deployments Work](#6-how-deployments-work)
     19 +  7. [Important Files You Need to Know](#7-important-files-you-need-to-know)
     20 +  8. [Documentation Structure](#8-documentation-structure)
     21 +  9. [Day-to-Day Operations](#9-day-to-day-operations)
     22 +  10. [Common Tasks with Examples](#10-common-tasks-with-examples)
     23 +  11. [Things to Be Careful About](#11-things-to-be-careful-about)
     24 +  12. [Getting Started Checklist](#12-getting-started-checklist)
     25 +  13. [Glossary](#13-glossary)
     26 +  
     27 +  ---
     28 +  
     29 +  ## 1. What is This Repository?
     30 +  
     31 +  The `terraform-observability-team` repository serves **two main purposes**:
     32 +  
     33 +  ### Purpose 1: Team Permissions Management
     34 +  
     35 +  This repository manages who has access to what across multiple platforms using 
        + **Infrastructure-as-Code (Terraform)**:
     36 +  
     37 +  - **GitHub** (public `linode-obs` organization)
     38 +  - **Bits** (internal Akamai/Linode GitHub - `sre-o11y` teams across the ops, Linode, devcloud,
        + and ansible-collections orgs)
     39 +  - **PagerDuty** (team memberships, on-call schedules, escalation policies)
     40 +  - **Vault** (secret storage for team applications)
     41 +  
     42 +  **Why Terraform?** Because managing permissions manually across 4+ platforms for 10+ people is 
        + error-prone and time-consuming. Terraform allows us to:
     43 +  - Define team membership once
     44 +  - Apply changes consistently everywhere
     45 +  - Track permission changes in git history
     46 +  - Easily onboard/offboard team members
     47 +  
     48 +  ### Purpose 2: Team Documentation Hub
     49 +  
     50 +  This repository contains all team documentation using **Hugo** (a static site generator):
     51 +  
     52 +  - **Handbooks**: How we work (on-call, tools, git conventions)
     53 +  - **Service Documentation**: VictoriaMetrics, Prometheus, Grafana, ArgoCD, etc.
     54 +  - **MOPs** (Manual Operations Procedures): Step-by-step guides for complex tasks
     55 +  - **Proposals**: Design documents for major changes (OP-## format)
     56 +  - **On-call Logs**: Weekly logs of incidents and work completed
     57 +  - **Runbooks**: How to respond to specific alerts
     58 +  
     59 +  The documentation is published to **Bits Pages** (internal documentation hosting) at:
     60 +  `https://bits.linode.com/pages/ops/terraform-observability-team/`
     61 +  
     62 +  ---
     63 +  
     64 +  ## 2. What Does the Observability Team Do?
     65 +  
     66 +  The **SRE Observability team** owns and operates the entire observability infrastructure for 
        + Linode/Akamai. Here's what that means:
     67 +  
     68 +  ### Metrics Collection & Storage
     69 +  
     70 +  **What**: Collecting and storing performance metrics from all infrastructure.
     71 +  
     72 +  **Services We Manage**:
     73 +  - **100+ Prometheus instances** globally (one or more per datacenter)
     74 +  - **VictoriaMetrics clusters** for long-term storage (LTS) of metrics
     75 +  - **Thanos** for query federation across Prometheus instances
     76 +  
     77 +  **Example Metrics**:
     78 +  - CPU/memory/disk usage on servers
     79 +  - Network traffic and errors
     80 +  - Application response times and error rates
     81 +  - Database query performance
     82 +  - Kubernetes pod health
     83 +  
     84 +  ### Logging Infrastructure
     85 +  
     86 +  **What**: Collecting, storing, and querying logs from all systems.
     87 +  
     88 +  **Services We Manage**:
     89 +  - **Loki** clusters for centralized log storage
     90 +  - **OpenTelemetry Collectors** (otelgw) for log aggregation
     91 +  - Integration with Grafana for log querying
     92 +  
     93 +  ### Visualization
     94 +  
     95 +  **What**: Providing dashboards and graphs for engineers to understand system behavior.
     96 +  
     97 +  **Services We Manage**:
     98 +  - **Grafana** instances
     99 +  - **Dashboard management** (via Terraform)
    100 +  - **User permissions** for Grafana
    101 +  
    102 +  ### Alerting
    103 +  
    104 +  **What**: Notifying engineers when things go wrong.
    105 +  
    106 +  **Services We Manage**:
    107 +  - **Alertmanager** (part of the Prometheus ecosystem, run as a separate service)
    108 +  - **PagerDuty** integrations
    109 +  - **Alert rules** (prometheus_rules repository)
    110 +  - **Alert routing** (who gets paged for what)
    111 +  
    112 +  ### Deployment & Orchestration
    113 +  
    114 +  **What**: Managing how observability services are deployed and updated.
    115 +  
    116 +  **Services We Manage**:
    117 +  - **ArgoCD** for GitOps deployment to Kubernetes
    118 +  - **Kubernetes clusters** running observability workloads
    119 +  - **Helm charts** for application configuration
    120 +  
    121 +  ### Legacy Systems (Being Phased Out)
    122 +  
    123 +  - **Nagios** - Old monitoring system (being replaced by Prometheus)
    124 +  
    125 +  ### Network Observability
    126 +  
    127 +  **What**: Special monitoring for network devices and traffic.
    128 +  
    129 +  **Services We Manage**:
    130 +  - **Linmon** - Network monitoring platform
    131 +  - Specialized VictoriaMetrics clusters for network metrics
    132 +  
    133 +  ---
    134 +  
    135 +  ## 3. The Big Picture: Observability Architecture
    136 +  
    137 +  Let me explain how all these pieces fit together with a real-world example:
    138 +  
    139 +  ### Example: Monitoring a Linode Customer's VM
    140 +  
    141 +  ```
    142 +  1. DATA COLLECTION
    143 +     ┌────────────────────────────────────────────────────────────┐
    144 +     │  Customer's Linode VM (in Newark datacenter - ewr1)        │
    145 +     │  ├─ node_exporter (exposes system metrics)                 │
    146 +     │  └─ Metrics: CPU, memory, disk, network                    │
    147 +     └────────────────────────┬───────────────────────────────────┘
    148 +                              │ (HTTP scrape every 30s)
    149 +                              ▼
    150 +     ┌────────────────────────────────────────────────────────────┐
    151 +     │  Prometheus Shard (prometheus-1a in ewr1)                  │
    152 +     │  ├─ Scrapes 1000s of targets in this datacenter           │
    153 +     │  ├─ Stores metrics locally (15 days retention)            │
    154 +     │  └─ Evaluates alert rules                                 │
    155 +     └────────────────────────┬───────────────────────────────────┘
    156 +                              │ (remote_write)
    157 +                              ▼
    158 +  2. LONG-TERM STORAGE
    159 +     ┌────────────────────────────────────────────────────────────┐
    160 +     │  VictoriaMetrics LTS (North America - ord2/lax3)           │
    161 +     │  ├─ Receives metrics from all NA datacenters              │
    162 +     │  ├─ Compresses and stores metrics (13 months retention)   │
    163 +     │  └─ Fast queries for historical data                      │
    164 +     └────────────────────────┬───────────────────────────────────┘
    165 +                              │
    166 +                              ▼
    167 +  3. QUERYING & FEDERATION
    168 +     ┌────────────────────────────────────────────────────────────┐
    169 +     │  VictoriaMetrics Global Select                             │
    170 +     │  ├─ Federates queries across all regions (NA, EU, AP)     │
    171 +     │  └─ Single query point for worldwide data                 │
    172 +     └────────────────────────┬───────────────────────────────────┘
    173 +                              │ (queries)
    174 +                              ▼
    175 +  4. VISUALIZATION
    176 +     ┌────────────────────────────────────────────────────────────┐
    177 +     │  Grafana                                                    │
    178 +     │  ├─ Dashboards showing VM performance                      │
    179 +     │  └─ Engineers use this to troubleshoot issues             │
    180 +     └────────────────────────────────────────────────────────────┘
    181 +  
    182 +  5. ALERTING (if something goes wrong)
    183 +     ┌────────────────────────────────────────────────────────────┐
    184 +     │  Prometheus Alert Rules                                    │
    185 +     │  └─ "High CPU usage on VM for 5 minutes"                  │
    186 +     └────────────────────────┬───────────────────────────────────┘
    187 +                              │ (fires alert)
    188 +                              ▼
    189 +     ┌────────────────────────────────────────────────────────────┐
    190 +     │  Alertmanager                                               │
    191 +     │  └─ Routes alert based on severity and team               │
    192 +     └────────────────────────┬───────────────────────────────────┘
    193 +                              │
    194 +                              ▼
    195 +     ┌────────────────────────────────────────────────────────────┐
    196 +     │  PagerDuty                                                  │
    197 +     │  └─ Pages on-call engineer                                │
    198 +     └────────────────────────────────────────────────────────────┘
    199 +  ```
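
The alerting step above only pages someone once a condition has held for a sustained period ("High CPU usage on VM for 5 minutes"). Here is a minimal Python sketch of that pending-then-firing behaviour. The 90% threshold, the 30-second evaluation interval, and the class name are assumptions made for illustration; this is not how Prometheus itself is implemented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ForDurationAlert:
    """Toy model of an alert rule with a 'for' duration (illustrative only)."""
    threshold: float                     # hypothetical CPU threshold, in percent
    for_seconds: int                     # how long the condition must hold before firing
    pending_since: Optional[int] = None  # timestamp when the condition first became true

    def evaluate(self, now: int, cpu_percent: float) -> str:
        """Return 'inactive', 'pending', or 'firing' for one evaluation."""
        if cpu_percent <= self.threshold:
            self.pending_since = None    # condition cleared: reset the timer
            return "inactive"
        if self.pending_since is None:
            self.pending_since = now     # condition just became true
        if now - self.pending_since >= self.for_seconds:
            return "firing"
        return "pending"


# Evaluate every 30 seconds with CPU stuck at 95% (made-up numbers).
rule = ForDurationAlert(threshold=90.0, for_seconds=300)
states = [rule.evaluate(now=t, cpu_percent=95.0) for t in range(0, 331, 30)]
```

The rule stays `pending` until the condition has held for the full five minutes, and a single healthy sample resets the timer. That is why short CPU spikes do not page anyone.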
    200 +  
    201 +  ### Regional Architecture
    202 +  
    203 +  We split the world into **three regions** for scalability:
    204 +  
    205 +  ```
    206 +  ┌───────────────────────────────────────────────────────────────────┐
    207 +  │                        GLOBAL ARCHITECTURE                        │
    208 +  │                                                                   │
    209 +  │  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────┐    │
    210 +  │  │ North America   │  │ Europe          │  │ Asia Pacific    │    │
    211 +  │  ├─────────────────┤  ├─────────────────┤  ├─────────────────┤    │
    212 +  │  │ VictoriaMetrics │  │ VictoriaMetrics │  │ VictoriaMetrics │    │
    213 +  │  │ LTS Clusters    │  │ LTS Clusters    │  │ LTS Clusters    │    │
    214 +  │  │                 │  │                 │  │                 │    │
    215 +  │  │ ord2 (primary)  │  │ mad2 (primary)  │  │ osa1 (primary)  │    │
    216 +  │  │ lax3 (backup)   │  │ sto2 (backup)   │  │ cgk1 (backup)   │    │
    217 +  │  │                 │  │                 │  │ sea1 (backup)   │    │
    218 +  │  └─────────────────┘  └─────────────────┘  └─────────────────┘    │
    219 +  │           │                    │                    │             │
    220 +  │           └────────────────────┼────────────────────┘             │
    221 +  │                                │                                  │
    222 +  │                                ▼                                  │
    223 +  │                    ┌────────────────────────┐                     │
    224 +  │                    │ VictoriaMetrics        │                     │
    225 +  │                    │ Global Select          │                     │
    226 +  │                    │ (Query Federation)     │                     │
    227 +  │                    └────────────────────────┘                     │
    228 +  └───────────────────────────────────────────────────────────────────┘
    229 +  ```
    230 +  
    231 +  **Why Regional?**
    232 +  - **Reduced latency**: Data stays close to where it's generated
    233 +  - **Compliance**: Some regions require data to stay in-country
    234 +  - **Scalability**: Spreading load across multiple clusters
    235 +  - **Resilience**: Regional failure doesn't impact other regions
    236 +  
    237 +  ---
    238 +  
    239 +  ## 4. Repository Structure Explained
    240 +  
    241 +  Let's walk through the repository structure directory by directory:
    242 +  
    243 +  ```
    244 +  terraform-observability-team/
    245 +  ├── .github/                    # GitHub-specific files
    246 +  │   ├── workflows/             # GitHub Actions CI/CD pipelines
    247 +  │   └── CODEOWNERS             # Who reviews PRs for which files
    248 +  ├── .vale/                      # Vale (prose linting) configuration
    249 +  │   └── styles/                # Custom style rules for docs
    250 +  ├── docs/                       # Hugo documentation site (explained later)
    251 +  ├── hack/                       # Utility scripts
    252 +  ├── img/                        # Images for README
    253 +  ├── modules/                    # Terraform modules (explained below)
    254 +  │   ├── pagerduty/             # PagerDuty configuration
    255 +  │   ├── github/                # GitHub (public) configuration
    256 +  │   └── bits/                  # Bits (internal GitHub) configuration
    257 +  ├── scripts/                    # Helper scripts
    258 +  ├── main.tf                     # Main Terraform configuration
    259 +  ├── provider.tf                 # Terraform provider setup
    260 +  ├── vars.tf                     # Variable definitions
    261 +  ├── team.auto.tfvars           # IMPORTANT: Team member definitions
    262 +  ├── pagerduty_suppression.auto.tfvars  # Alert suppression config
    263 +  ├── atlantis.yaml              # Atlantis (Terraform automation) config
    264 +  ├── .envrc                     # Environment variables (Vault auth)
    265 +  ├── .pre-commit-config.yaml    # Pre-commit hooks for code quality
    266 +  ├── .mise.toml                 # Mise task runner configuration
    267 +  └── README.md                  # Repository overview
    268 +  ```
    269 +  
    270 +  ### The `/modules/` Directory (Terraform Modules)
    271 +  
    272 +  Think of Terraform modules as reusable "functions" for infrastructure. Each module manages a 
        + specific platform:
    273 +  
    274 +  #### `/modules/pagerduty/`
    275 +  
    276 +  Manages all PagerDuty configuration:
    277 +  
    278 +  **Files**:
    279 +  - `team.tf` - Team membership (who's on the team)
    280 +  - `escalation_policy.tf` - Who gets alerted when (primary → secondary → round-robin)
    281 +  - `services.tf` - PagerDuty services (e.g., "Prometheus Alerts")
    282 +  - `schedules.tf` - On-call schedules (primary, secondary)
    283 +  - `orchestration.tf` - Event routing and processing
    284 +  - `slack_connections.tf` - Slack channel integrations
    285 +  - `rules.tf` - Alert suppression rules (for maintenance)
    286 +  
    287 +  **What It Does**:
    288 +  - Creates PagerDuty team
    289 +  - Sets up on-call rotation schedule
    290 +  - Creates escalation policies
    291 +  - Configures alert routing
    292 +  - Manages alert suppression during maintenance
    293 +  
    294 +  **Example**: When you're added to the team, Terraform creates your PagerDuty user and adds you to 
        + schedules.
    295 +  
    296 +  #### `/modules/github/`
    297 +  
    298 +  Manages the **public** `linode-obs` GitHub organization:
    299 +  
    300 +  **What It Does**:
    301 +  - Creates repositories
    302 +  - Manages team memberships
    303 +  - Sets repository permissions
    304 +  - Configures branch protection
    305 +  
    306 +  **Example Repos Managed**:
    307 +  - `linode-obs/victoriametrics-operator`
    308 +  - `linode-obs/prometheus-salt-formula`
    309 +  
    310 +  #### `/modules/bits/`
    311 +  
    312 +  Manages the **internal** Bits (Akamai GitHub) teams across multiple organizations:
    313 +  
    314 +  **Organizations Managed**:
    315 +  - `ops` - Main SRE organization
    316 +  - `Linode` - Linode-specific repos
    317 +  - `devcloud` - Devcloud infrastructure
    318 +  - `ansible-collections` - Ansible collections
    319 +  
    320 +  **What It Does**:
    321 +  - Creates `sre-o11y` team in each org
    322 +  - Manages team memberships (member vs maintainer roles)
    323 +  - Sets repository permissions
    324 +  - Configures webhooks (Atlantis, Slack notifications)
    325 +  - Sets up branch protection
    326 +  
    327 +  **Important Repos Managed**:
    328 +  - `ops/prometheus_rules` - Alert and recording rules
    329 +  - `ops/prometheus-formula` - Prometheus Salt configuration
    330 +  - `ops/o11y-helm-charts` - Helm charts for Kubernetes apps
    331 +  - `ops/palantir` - Kubernetes manifests
    332 +  - `ops/loki_rules` - Loki alert rules
    333 +  - `ops/terraform-grafana-config` - Grafana configuration
    334 +  
    335 +  ### The `/docs/` Directory (Documentation Site)
    336 +  
    337 +  This is a Hugo-based documentation site. Let's break it down:
    338 +  
    339 +  ```
    340 +  docs/
    341 +  ├── archetypes/                # Templates for new content
    342 +  │   ├── default.md
    343 +  │   ├── on-call.md            # Template for on-call logs
    344 +  │   └── proposals.md          # Template for proposals
    345 +  ├── content/                   # ACTUAL DOCUMENTATION (main content)
    346 +  │   ├── _index.md             # Homepage
    347 +  │   ├── Handbooks/            # How the team operates
    348 +  │   │   ├── on-call.md        # On-call guide
    349 +  │   │   ├── tools.md          # Tooling standards
    350 +  │   │   ├── git-conventions.md
    351 +  │   │   └── docs/             # How to write docs
    352 +  │   ├── Services/             # Service documentation
    353 +  │   │   ├── ArgoCD/
    354 +  │   │   ├── VictoriaMetrics/
    355 +  │   │   ├── Prometheus/
    356 +  │   │   ├── Grafana/
    357 +  │   │   ├── Kubernetes Clusters/
    358 +  │   │   ├── Centralized Logging/
    359 +  │   │   ├── Network Observability/
    360 +  │   │   ├── Nagios/
    361 +  │   │   └── ...
    362 +  │   ├── mops/                 # Manual Operations Procedures
    363 +  │   │   ├── prometheus-shard.md
    364 +  │   │   ├── victoriametrics-cluster-upgrade.md
    365 +  │   │   └── ...
    366 +  │   ├── on-call/              # On-call logs by year
    367 +  │   │   ├── 2024/
    368 +  │   │   └── 2025/
    369 +  │   ├── projects/             # Active projects
    370 +  │   ├── proposals/            # Design proposals (OP-##)
    371 +  │   │   ├── OP-01-prometheus-sharding.md
    372 +  │   │   ├── OP-03-o11y-helm-charts.md
    373 +  │   │   └── ...
    374 +  │   └── Runbooks/             # Alert response guides
    375 +  │       ├── HighMemoryUsage.md
    376 +  │       └── ...
    377 +  ├── layouts/                  # Custom Hugo templates
    378 +  ├── static/                   # Static files (images, CSS)
    379 +  ├── config.toml               # Hugo configuration
    380 +  └── go.mod                    # Hugo module dependencies
    381 +  ```
    382 +  
    383 +  **How It Works**:
    384 +  1. You write documentation in Markdown in `/docs/content/`
    385 +  2. Hugo builds it into HTML
    386 +  3. GitHub Actions deploys it to Bits Pages
    387 +  4. Team members access it at `https://bits.linode.com/pages/ops/terraform-observability-team/`
    388 +  
    389 +  ---
    390 +  
    391 +  ## 5. Key Services Deep Dive
    392 +  
    393 +  Now let's understand the major services the team manages:
    394 +  
    395 +  ### VictoriaMetrics (Long-Term Metrics Storage)
    396 +  
    397 +  **What is it?**
    398 +  VictoriaMetrics is a time-series database optimized for storing Prometheus metrics long-term. Think
        +  of it as "Prometheus but faster and with more storage capacity."
    399 +  
    400 +  **Why do we use it?**
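    401 +  - **Compression**: Stores data much more compactly than Prometheus (roughly 10x by this guide's
        + estimate; the actual ratio depends on the workload)
    402 +  - **Fast queries**: Queries historical data much faster
    403 +  - **Long retention**: We keep metrics for 13 months, versus 15 days of local retention on our
        + Prometheus shards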
    404 +  - **Compatible**: Works with Prometheus query language (PromQL)
    405 +  
    406 +  **Architecture**:
    407 +  
    408 +  VictoriaMetrics runs as a **cluster** with three components:
    409 +  
    410 +  ```
    411 +  ┌────────────────────────────────────────────────────────────────┐
    412 +  │              VictoriaMetrics Cluster Architecture               │
    413 +  │                                                                 │
    414 +  │  ┌─────────────┐                                               │
    415 +  │  │  vminsert   │ ← Receives metrics from Prometheus            │
    416 +  │  │  (3 replicas)│    (via remote_write)                        │
    417 +  │  └──────┬──────┘                                               │
    418 +  │         │ Distributes data across storage nodes                │
    419 +  │         ▼                                                       │
    420 +  │  ┌─────────────┐                                               │
    421 +  │  │  vmstorage  │ ← Stores compressed metrics on disk           │
    422 +  │  │  (3+ replicas)│   (each node holds part of the data)        │
    423 +  │  └──────┬──────┘                                               │
    424 +  │         │ Serves data to query nodes                           │
    425 +  │         ▼                                                       │
    426 +  │  ┌─────────────┐                                               │
    427 +  │  │  vmselect   │ ← Handles queries from Grafana                │
    428 +  │  │  (2 replicas)│   (PromQL compatible)                        │
    429 +  │  └─────────────┘                                               │
    430 +  └────────────────────────────────────────────────────────────────┘
    431 +  ```
    432 +  
    433 +  **Components Explained**:
    434 +  
    435 +  1. **vminsert** - Ingestion
    436 +     - Receives metrics from Prometheus instances
    437 +     - Validates and processes data
    438 +     - Distributes data across vmstorage nodes
    439 +     - **Replicas**: 3 (for redundancy)
    440 +  
    441 +  2. **vmstorage** - Storage
    442 +     - Stores compressed metrics on disk
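    443 +     - Each node holds only part of the data, because vminsert spreads series across nodes (how
        + many copies exist depends on the cluster's `-replicationFactor` setting)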
    444 +     - Handles compaction and retention
    445 +     - **Replicas**: 3-6 depending on cluster size
    446 +  
    447 +  3. **vmselect** - Queries
    448 +     - Receives queries from Grafana
    449 +     - Fetches data from vmstorage nodes
    450 +     - Aggregates results
    451 +     - **Replicas**: 2 (for load balancing)
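
To make "distributes data across vmstorage nodes" concrete, here is a minimal Python sketch of hash-based placement. It is an illustration under stated assumptions, not VictoriaMetrics' real algorithm: the node names, the `replication_factor` parameter, and the SHA-256 modulo placement are all hypothetical stand-ins.

```python
import hashlib
from typing import List


def pick_storage_nodes(series_key: str, nodes: List[str], replication_factor: int = 1) -> List[str]:
    """Choose which storage nodes receive a series (simplified modulo placement)."""
    if not 1 <= replication_factor <= len(nodes):
        raise ValueError("replication_factor must be between 1 and len(nodes)")
    # A stable hash means the same series always lands on the same nodes.
    digest = hashlib.sha256(series_key.encode("utf-8")).digest()
    start = int.from_bytes(digest[:8], "big") % len(nodes)
    # Take consecutive nodes, wrapping around, so copies land on distinct nodes.
    return [nodes[(start + i) % len(nodes)] for i in range(replication_factor)]


nodes = ["vmstorage-0", "vmstorage-1", "vmstorage-2"]  # hypothetical node names
targets = pick_storage_nodes('node_cpu_seconds_total{instance="host1"}', nodes, replication_factor=2)
```

In this toy model, a replication factor equal to the node count puts every series on every node, while a lower value leaves each node with only a share of the data.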
    452 +  
    453 +  **Cluster Types**:
    454 +  
    455 +  We have two types of VictoriaMetrics clusters:
    456 +  
    457 +  1. **LTS (Long-Term Storage) Clusters** - Regional
    458 +     - Grouped by region (NA, EU, AP), each with a primary cluster and at least one backup
    459 +     - Stores metrics from Prometheus instances in that region
    460 +     - **Retention**: 13 months
    461 +     - **Examples**:
    462 +       - `victoriametrics-ord2-us-prod` (North America primary)
    463 +       - `victoriametrics-lax3-us-prod` (North America backup)
    464 +       - `victoriametrics-mad2-es-prod` (Europe primary)
    465 +  
    466 +  2. **Global Select Cluster** - Worldwide
    467 +     - Federates queries across all LTS clusters
    468 +     - Allows querying global data from one place
    469 +     - **Does NOT store data** - just proxies queries
    470 +     - **Example**: `victoriametrics-global-select-iad3-us-prod`
    471 +  
    472 +  **Data Flow Example**:
    473 +  
    474 +  ```
    475 +  Prometheus in Newark (ewr1)
    476 +      │ remote_write
    477 +      ▼
    478 +  VictoriaMetrics LTS (ord2) - North America
    479 +      │
    480 +      │ Query from Grafana
    481 +      ▼
    482 +  VictoriaMetrics Global Select
    483 +      │ Queries all regions
    484 +      ├─── ord2 (North America)
    485 +      ├─── mad2 (Europe)
    486 +      └─── osa1 (Asia Pacific)
    487 +      │
    488 +      ▼ Combined results
    489 +  Grafana Dashboard
    490 +  ```
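
The fan-out in the data flow above can be sketched in a few lines of Python. This is a simplified illustration, not the real Global Select implementation. The regional backends are hypothetical in-memory functions standing in for HTTP queries to each LTS cluster, and the result rows are made-up data.

```python
from typing import Callable, Dict, List

Sample = Dict[str, object]                  # one result row: labels plus a value
RegionQuery = Callable[[str], List[Sample]]


def global_select(query: str, regions: Dict[str, RegionQuery]) -> List[Sample]:
    """Fan a query out to every regional backend and concatenate the results."""
    combined: List[Sample] = []
    for region_name, run_query in regions.items():
        for sample in run_query(query):
            # Tag each row with its origin so the merged result stays traceable.
            combined.append({**sample, "region": region_name})
    return combined


# Hypothetical stand-ins for the regional LTS clusters.
regions = {
    "ord2": lambda q: [{"instance": "host-na", "value": 0.42}],
    "mad2": lambda q: [{"instance": "host-eu", "value": 0.17}],
    "osa1": lambda q: [],
}
rows = global_select("up", regions)
```

The point to notice is that the federating layer holds no data of its own. It only forwards the query and merges what comes back, which matches the "Does NOT store data" note above.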
    491 +  
    492 +  **How It's Deployed**:
    493 +  - Runs in **Kubernetes**
    494 +  - Managed by **ArgoCD** (GitOps)
    495 +  - Configuration in `ops/o11y-helm-charts`
    496 +  - Manifests in `ops/palantir`
    497 +  
    498 +  **Important Files**:
    499 +  - Configuration: `o11y-helm-charts/values-files/victoriametrics-{dc}-{country}-{env}.yaml`
    500 +  - Documentation: `terraform-observability-team/docs/content/Services/VictoriaMetrics/`
    501 +  
    502 +  ### Prometheus (Real-Time Metrics Collection)
    503 +  
    504 +  **What is it?**
    505 +  Prometheus is a time-series database that **scrapes** (pulls) metrics from servers and 
        + applications.
    506 +  
    507 +  **Why do we use it?**
    508 +  - **Industry standard** for metrics collection
    509 +  - **Pull-based model** - Prometheus scrapes targets, they don't push to it
    510 +  - **Service discovery** - Automatically finds what to monitor
    511 +  - **Alert evaluation** - Runs alert rules in real-time
    512 +  
    513 +  **Deployment Model - Sharding**:
    514 +  
    515 +  We run **multiple Prometheus instances per datacenter** to handle scale. This is called 
        + **sharding**.
    516 +  
    517 +  ```
    518 +  Datacenter: Newark (ewr1)
    519 +     │
    520 +     ├─ prometheus-1a  (shard 0) ─ Monitors targets with hash % 3 == 0
    521 +     ├─ prometheus-1b  (shard 1) ─ Monitors targets with hash % 3 == 1
    522 +     └─ prometheus-1c  (shard 2) ─ Monitors targets with hash % 3 == 2
    523 +  ```
    524 +  
    525 +  **How Sharding Works**:
    526 +  1. Each Prometheus instance is assigned a **shard number**
    527 +  2. Targets (servers, apps) are distributed across shards using a hash function
    528 +  3. With three shards, as in the example above, each shard monitors roughly 1/3 of the targets
    529 +  
    530 +  **Shard Naming Convention**:
    531 +  - `prometheus-1a` → Shard 0
    532 +  - `prometheus-1b` → Shard 1
    533 +  - `prometheus-1c` → Shard 2
    534 +  - Pattern: `a=0, b=1, c=2, d=3, ...`
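
The hash-and-modulo assignment and the letter naming convention can be sketched together in Python. This is a minimal illustration under assumptions: the guide does not say which hash function is used, so SHA-256 is a stand-in, and the target addresses and helper names are hypothetical.

```python
import hashlib
import string


def shard_for_target(target: str, shard_count: int) -> int:
    """Map a target address to a shard index with a stable hash."""
    digest = hashlib.sha256(target.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def shard_name(shard_index: int, prefix: str = "prometheus-1") -> str:
    """Translate a shard index into the letter-suffixed name (0 -> a, 1 -> b, ...)."""
    if not 0 <= shard_index < len(string.ascii_lowercase):
        raise ValueError("shard_index must be between 0 and 25")
    return f"{prefix}{string.ascii_lowercase[shard_index]}"


# Assign a few made-up targets to three shards.
for target in ["host1.example:9100", "host2.example:9100", "host3.example:9100"]:
    index = shard_for_target(target, shard_count=3)
    print(f"{target} -> {shard_name(index)} (shard {index})")
```

Because the hash is stable, a target always lands on the same shard as long as the shard count stays the same. Changing the shard count reshuffles most targets, which is worth remembering before adding a shard.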
    535 +  
    536 +  **High Availability**:
    537 +  - Each shard runs **2 replicas**
    538 +  - If one replica fails, the other keeps working
    539 +  - Both replicas scrape the same targets (duplicate data, but ensures availability)
    540 +  
    541 +  **Configuration**:
    542 +  - Managed by **Salt formula**: `ops/prometheus-formula`
    543 +  - Alert rules in: `ops/prometheus_rules`
    544 +  - Recording rules in: `ops/prometheus_rules`
    545 +  
    546 +  **Where Data Goes**:
    547 +  - **Local storage**: 15 days retention
    548 +  - **Remote write**: Sent to VictoriaMetrics for long-term storage
    549 +  
    550 +  **Important Concepts**:
    551 +  
    552 +  1. **Scraping**: Prometheus pulls metrics from targets every 15-30 seconds
    553 +     ```
    554 +     Prometheus → HTTP GET http://server:9100/metrics → node_exporter
    555 +     ```
    556 +  
    557 +  2. **Service Discovery**: Prometheus automatically finds what to monitor
    558 +     - **Salt grains** - Server metadata tells Prometheus what the server does
    559 +     - **Kubernetes SD** - Auto-discovers pods in Kubernetes
    560 +  
    561 +  3. **Relabeling**: Modifying metric labels before storage
    562 +     - Add datacenter label
    563 +     - Add team ownership label
    564 +     - Filter out unwanted metrics
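
The three relabeling actions listed above (add a datacenter label, add a team label, filter out unwanted metrics) can be modelled as a small Python function. This is a conceptual sketch only: real relabeling is configured in Prometheus, not written in Python, and the label names, the `go_gc_` prefix, and the function name are assumptions chosen for illustration.

```python
from typing import Dict, Optional, Tuple

Labels = Dict[str, str]


def relabel(
    labels: Labels,
    datacenter: str,
    team: str,
    drop_prefixes: Tuple[str, ...] = ("go_gc_",),  # hypothetical "unwanted" metrics
) -> Optional[Labels]:
    """Return the rewritten label set, or None if the metric should be dropped."""
    metric_name = labels.get("__name__", "")
    if any(metric_name.startswith(prefix) for prefix in drop_prefixes):
        return None  # filtered out before storage
    # Keep the original labels and attach ownership metadata.
    return {**labels, "datacenter": datacenter, "team": team}


kept = relabel({"__name__": "node_cpu_seconds_total", "instance": "host1:9100"}, "ewr1", "sre-o11y")
dropped = relabel({"__name__": "go_gc_duration_seconds"}, "ewr1", "sre-o11y")
```

Here `kept` carries the extra `datacenter` and `team` labels, while `dropped` is `None` because its name matches a filtered prefix.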
    565 +  
    566 +  **Important Files**:
    567 +  - Documentation: `terraform-observability-team/docs/content/Services/Prometheus/`
    568 +  - Sharding guide: `terraform-observability-team/docs/content/Services/Prometheus/sharding.md`
    569 +  
    570 +  ### Grafana (Visualization)
    571 +  
    572 +  **What is it?**
    573 +  Grafana is a web application for creating dashboards and visualizations from metrics and logs.
    574 +  
    575 +  **What do we manage?**
    576 +  - **Grafana instances** (production, staging, dev)
    577 +  - **Datasources** (connections to Prometheus, VictoriaMetrics, Loki)
    578 +  - **Dashboards** (via Terraform)
    579 +  - **User permissions** (teams, folders, access control)
    580 +  
    581 +  **How Users Access It**:
    582 +  - **Production**: https://grafana.linode.com
    583 +  - **Staging**: https://grafana-staging.linode.com
    584 +  
    585 +  **Datasource Types**:
    586 +  
    587 +  1. **Prometheus** - Real-time data (last 15 days)
    588 +     - Example: `prometheus-ewr1-1a-us-prod`
    589 +  
    590 +  2. **VictoriaMetrics** - Historical data (13 months)
    591 +     - Example: `victoriametrics-ord2-us-prod`
    592 +  
    593 +  3. **Loki** - Logs
    594 +     - Example: `loki-na-prod`
    595 +  
    596 +  4. **Thanos Query** - Federated Prometheus queries
    597 +     - Queries across multiple Prometheus instances
    598 +  
    599 +  **Dashboard Management**:
    600 +  - Dashboards stored as JSON
    601 +  - Managed in: `ops/terraform-grafana-config`
    602 +  - Changes deployed via Terraform
    603 +  
    604 +  **Permissions**:
    605 +  - Users organized into **teams**
    606 +  - Teams granted access to **folders**
    607 +  - Folders contain dashboards
    608 +  - We manage permissions via Terraform
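
The users, teams, and folders chain above boils down to a simple lookup. Here is a minimal Python sketch of that check. The team names, folder names, and usernames are made up, and real Grafana permissions have more levels (view, edit, admin) than this yes/no model.

```python
from typing import Dict, Set


def can_access_folder(
    user: str,
    folder: str,
    team_members: Dict[str, Set[str]],   # team name -> users in that team
    folder_teams: Dict[str, Set[str]],   # folder name -> teams granted access
) -> bool:
    """A user can open a folder if any of their teams has been granted access to it."""
    allowed_teams = folder_teams.get(folder, set())
    return any(user in team_members.get(team, set()) for team in allowed_teams)


# Hypothetical data for illustration.
team_members = {"sre-o11y": {"alice", "bob"}, "network": {"carol"}}
folder_teams = {"Observability": {"sre-o11y"}, "Network": {"network", "sre-o11y"}}
```

With this data, `alice` can open both folders, while `carol` can only open `Network`.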
    609 +  
    610 +  **Important Files**:
    611 +  - Configuration: `ops/terraform-grafana-config`
    612 +  - Documentation: `terraform-observability-team/docs/content/Services/Grafana/`
    613 +  
    614 +  ### ArgoCD (GitOps Deployment)
    615 +  
    616 +  **What is it?**
    617 +  ArgoCD is a GitOps tool that deploys applications to Kubernetes by syncing with Git repositories.
    618 +  
    619 +  **GitOps Concept**:
    620 +  - **Desired state** is defined in Git
    621 +  - ArgoCD **continuously monitors** Git
    622 +  - When Git changes, ArgoCD **automatically syncs** to Kubernetes
    623 +  - Git is the **single source of truth**
    624 +  
    625 +  **How It Works**:
    626 +  
    627 +  ```
    628 +  1. Engineer makes change
    629 +     ↓
    630 +  2. Git repository updated (ops/palantir)
    631 +     ↓
    632 +  3. ArgoCD detects change
    633 +     ↓
    634 +  4. ArgoCD compares Git vs Kubernetes
    635 +     ↓
    636 +  5. ArgoCD applies changes to Kubernetes
    637 +     ↓
    638 +  6. Application updated
    639 +  ```
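
Steps 4 and 5 above (compare Git against Kubernetes, then apply the difference) can be sketched as a small diff function. This is a simplified Python model of the idea, not ArgoCD's actual code. Manifests are reduced to plain dictionaries, and the resource names and replica counts are hypothetical.

```python
from typing import Dict, List, Tuple

Manifest = Dict[str, object]


def plan_sync(desired: Dict[str, Manifest], live: Dict[str, Manifest]) -> List[Tuple[str, str]]:
    """Compare desired (Git) and live (cluster) state and list the actions needed."""
    actions: List[Tuple[str, str]] = []
    for name, manifest in desired.items():
        if name not in live:
            actions.append(("create", name))   # in Git but missing from the cluster
        elif live[name] != manifest:
            actions.append(("update", name))   # present but drifted from Git
    for name in live:
        if name not in desired:
            actions.append(("prune", name))    # in the cluster but no longer in Git
    return sorted(actions)


# Hypothetical state: Git wants 3 vminsert replicas and a vmselect deployment.
desired = {"vminsert": {"replicas": 3}, "vmselect": {"replicas": 2}}
live = {"vminsert": {"replicas": 2}, "old-job": {"replicas": 1}}
plan = plan_sync(desired, live)
```

When `plan` comes back empty, the cluster already matches Git and there is nothing to sync.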
    640 +  
    641 +  **ArgoCD Instances**:
    642 +  
    643 +  We run one ArgoCD per environment:
    644 +  
    645 +  - **Production**: https://argocd.infra-o11y-apps.iad3.us.prod.linode.com
    646 +    - Manages ~20 production Kubernetes clusters
    647 +    - Uses Git **release tags** (stable)
    648 +  
    649 +  - **Staging**: https://argocd.infra-o11y-apps.rin1.us.staging.linode.com
    650 +    - Manages ~10 staging clusters
    651 +    - Uses Git **main branch** (latest)
    652 +  
    653 +  - **Dev**: https://argocd.infra-o11y-apps.rin1.us.dev.linode.com
    654 +    - Manages ~5 dev clusters
    655 +    - Uses Git **main branch**
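
As a rough illustration of the list above, the Git revision each instance tracks could be modeled as below. The function name `target_revision` and the sample tag `v1.2.3` are hypothetical; the actual values live in the ArgoCD Application definitions.

```python
def target_revision(env: str, release_tag: str) -> str:
    """Return the Git ref an environment tracks, per the list above."""
    if env == "prod":
        return release_tag          # production pins a stable release tag
    if env in ("staging", "dev"):
        return "main"               # staging and dev follow the main branch
    raise ValueError(f"unknown environment: {env}")


for env in ("prod", "staging", "dev"):
    print(env, "->", target_revision(env, "v1.2.3"))
```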
    656 +  
    657 +  **Key Concepts**:
    658 +  
    659 +  1. **Application** - A deployed workload
    660 +     - Example: VictoriaMetrics cluster in ord2
    661 +     - Defined by: Name, Git repo, path, target cluster
    662 +  
    663 +  2. **ApplicationSet** - Template for creating multiple similar Applications
    664 +     - Example: Deploy VictoriaMetrics to all LTS clusters
    665 +     - Uses generators to create Applications dynamically
    666 +  
    667 +  3. **Sync Policy**
    668 +     - **Manual**: Changes require clicking "Sync" button
    669 +     - **Automatic**: Changes deploy automatically
    670 +  
    671 +  4. **Health**
    672 +     - **Healthy**: All resources running correctly
    673 +     - **Degraded**: Some resources have issues
    674 +     - **Progressing**: Deployment in progress
    675 +  
    676 +  **Repositories ArgoCD Watches**:
    677 +  
    678 +  1. **ops/palantir** - Primary repo for Kubernetes manifests
    679 +     - Kustomize-based overlays
    680 +     - Bootstrap configs
    681 +     - Core components
    682 +  
    683 +  2. **ops/o11y-helm-charts** - Generates Application manifests
    684 +     - Helm chart with values files per cluster
    685 +     - Generates ArgoCD Application YAML
    686 +  
    687 +  **Important Files**:
    688 +  - Documentation: `terraform-observability-team/docs/content/Services/ArgoCD/`
    689 +  
    690 +  ### Kubernetes Clusters
    691 +  
    692 +  **What Clusters Do We Manage?**
    693 +  
    694 +  The team manages **30+ Kubernetes clusters** across three types:
    695 +  
    696 +  #### 1. infra-o11y-apps Clusters
    697 +  
    698 +  **Purpose**: Run centralized observability services
    699 +  
    700 +  **What Runs Here**:
    701 +  - ArgoCD
    702 +  - Grafana
    703 +  - Some Prometheus instances
    704 +  - Support services (cert-manager, external-dns)
    705 +  
    706 +  **Naming**: `infra-o11y-apps-{dc}-{country}-{env}`
    707 +  - Example: `infra-o11y-apps-iad3-us-prod`
    708 +  
    709 +  **Count**: ~5 clusters (1 per environment + extras)
    710 +  
    711 +  #### 2. VictoriaMetrics Clusters
    712 +  
    713 +  **Purpose**: Run VictoriaMetrics LTS clusters
    714 +  
    715 +  **What Runs Here**:
    716 +  - vminsert pods
    717 +  - vmstorage pods
    718 +  - vmselect pods
    719 +  - Monitoring exporters
    720 +  
    721 +  **Naming**: `victoriametrics-{dc}-{country}-{env}`
    722 +  - Example: `victoriametrics-ord2-us-prod`
    723 +  
    724 +  **Count**: ~15 clusters (NA, EU, AP regions × staging/prod)
    725 +  
    726 +  #### 3. infra-logging Clusters
    727 +  
    728 +  **Purpose**: Run centralized logging infrastructure
    729 +  
    730 +  **What Runs Here**:
    731 +  - Loki distributors
    732 +  - Loki ingesters
    733 +  - Loki queriers
    734 +  - OpenTelemetry collectors (otelgw)
    735 +  
    736 +  **Naming**: `infra-logging-{dc}-{country}-{env}`
    737 +  - Example: `infra-logging-cjj1-us-staging`
    738 +  
    739 +  **Count**: ~10 clusters
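
All three cluster types share the `{type}-{dc}-{country}-{env}` naming pattern, so a name can be split into its parts mechanically. The sketch below is illustrative only: the `ClusterName` tuple and `parse_cluster_name` are hypothetical helpers, and the sketch assumes a two-letter country code and the environments `prod`, `staging`, and `dev`.

```python
import re
from typing import NamedTuple, Optional


class ClusterName(NamedTuple):
    kind: str
    dc: str
    country: str
    env: str


# Assumption: two-letter country code; environments are prod, staging, dev.
_PATTERN = re.compile(
    r"^(?P<kind>infra-o11y-apps|victoriametrics|infra-logging)"
    r"-(?P<dc>[a-z]+[0-9]+)-(?P<country>[a-z]{2})-(?P<env>prod|staging|dev)$"
)


def parse_cluster_name(name: str) -> Optional[ClusterName]:
    """Split a cluster name into its parts, or return None if it does not match."""
    match = _PATTERN.match(name)
    if match is None:
        return None
    return ClusterName(**match.groupdict())


print(parse_cluster_name("infra-logging-cjj1-us-staging"))
```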
    740 +  
    741 +  **Cluster Deployment Process**:
    742 +  
    743 +  Deploying a new cluster involves multiple teams and repositories:
    744 +  
    745 +  ```
    746 +  1. Terraform (Infrastructure)
    747 +     ├─ Repo: Linode/terraform-module-infra
    748 +     ├─ Action: Provision Linodes, IPs, DNS
    749 +     └─ Output: Server nodes created
    750 +  
    751 +  2. Salt (Server Configuration)
    752 +     ├─ Repo: ops/salt-pillar
    753 +     ├─ Action: Accept minion keys, set grains
    754 +     └─ Output: Servers configured
    755 +  
    756 +  3. Vault (Secrets Management)
    757 +     ├─ Action: Store approle secret, kubeconfig
    758 +     ├─ PKI: Add cluster domain to allowed list
    759 +     └─ Output: Secrets available for apps
    760 +  
    761 +  4. Ansible (Kubernetes Installation)
    762 +     ├─ Repo: ops/ansible-playbooks
    763 +     ├─ Playbook: Kubespray
    764 +     └─ Output: Kubernetes cluster running
    765 +  
    766 +  5. ArgoCD (Join Cluster)
    767 +     ├─ Repo: ops/palantir
    768 +     ├─ Command: make join-cluster
    769 +     └─ Output: ArgoCD can deploy to cluster
    770 +  
    771 +  6. Application Configuration
    772 +     ├─ Repo: ops/o11y-helm-charts
    773 +     ├─ Action: Create values file
    774 +     └─ Repo: ops/palantir
    775 +         ├─ Action: Create overlays
    776 +         └─ Output: Apps deployed via ArgoCD
    777 +  ```
    778 +  
    779 +  **Important Files**:
    780 +  - Documentation: `terraform-observability-team/docs/content/Services/Kubernetes Clusters/cluster-deployment.md`
    781 +  
    782 +  ### Centralized Logging
    783 +  
    784 +  **Purpose**: Collect logs from all infrastructure in one place for searching and alerting.
    785 +  
    786 +  **Architecture**:
    787 +  
    788 +  ```
    789 +  Servers (all DCs)
    790 +      │ Send logs via Fluent Bit / Promtail
    791 +      ▼
    792 +  OpenTelemetry Collector (otelgw) - Per DC
    793 +      │ Aggregates logs, adds labels
    794 +      ▼
    795 +  Loki Distributor - Regional
    796 +      │ Receives logs, hashes by labels
    797 +      ▼
    798 +  Loki Ingester - Regional
    799 +      │ Buffers logs, creates chunks
    800 +      ▼
    801 +  Object Storage (S3)
    802 +      │ Long-term log storage
    803 +      ▼
    804 +  Loki Querier - Regional
    805 +      │ Queries logs from ingesters + S3
    806 +      ▼
    807 +  Grafana - Global
    808 +      │ User searches logs
    809 +  ```
    810 +  
    811 +  **Components**:
    812 +  
    813 +  1. **otelgw** (OpenTelemetry Gateway)
    814 +     - One per datacenter
    815 +     - Aggregates logs from local servers
    816 +     - Adds metadata (datacenter, environment)
    817 +     - Forwards to Loki
    818 +  
    819 +  2. **Loki**
    820 +     - Distributed log aggregation system
    821 +     - Like "Prometheus but for logs"
    822 +     - **Doesn't index log content** (indexes labels only)
    823 +     - Cost-effective, because only labels are indexed and the index stays small
    824 +  
    825 +  **Loki Components**:
    826 +  - **Distributor**: Receives logs, validates, forwards
    827 +  - **Ingester**: Buffers logs, creates chunks, stores to S3
    828 +  - **Querier**: Queries logs from ingesters and S3
    829 +  - **Query Frontend**: Splits large queries, caching
    830 +  
    831 +  **Log Flow Example**:
    832 +  
    833 +  ```
    834 +  1. Web server generates log:
    835 +     "2025-11-20 10:00:00 GET /api/v4/linodes 200 45ms"
    836 +  
    837 +  2. Fluent Bit on server sends to otelgw:
    838 +     {
    839 +       "log": "GET /api/v4/linodes 200 45ms",
    840 +       "timestamp": "2025-11-20T10:00:00Z"
    841 +     }
    842 +  
    843 +  3. otelgw adds labels:
    844 +     {
    845 +       "log": "GET /api/v4/linodes 200 45ms",
    846 +       "datacenter": "ewr1",
    847 +       "service": "api",
    848 +       "environment": "prod"
    849 +     }
    850 +  
    851 +  4. Loki stores with labels:
    852 +     Labels: {datacenter="ewr1", service="api", environment="prod"}
    853 +     Log Line: "GET /api/v4/linodes 200 45ms"
    854 +  
    855 +  5. Engineer searches in Grafana:
    856 +     Query: {service="api", datacenter="ewr1"}
    857 +  ```
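
Steps 3 and 5 of the flow above can be sketched in a few lines. This is a simplified model, not the otelgw or Loki code: `add_labels` and `logql_selector` are hypothetical helpers that merge labels into a record and build a LogQL stream selector from a label set.

```python
def add_labels(record: dict, labels: dict) -> dict:
    """Return a copy of the log record with gateway labels merged in (step 3)."""
    return {**record, **labels}


def logql_selector(labels: dict) -> str:
    """Build a LogQL stream selector such as {service="api"} (step 5)."""
    def escape(value: str) -> str:
        # Escape backslashes and double quotes inside label values
        return value.replace("\\", "\\\\").replace('"', '\\"')

    pairs = ", ".join(f'{key}="{escape(value)}"' for key, value in labels.items())
    return "{" + pairs + "}"


record = {"log": "GET /api/v4/linodes 200 45ms"}
labels = {"datacenter": "ewr1", "service": "api", "environment": "prod"}
print(add_labels(record, labels))
print(logql_selector({"service": "api", "datacenter": "ewr1"}))
```

Because Loki indexes only labels, the selector narrows the search to matching streams first; any filtering on the log line itself happens afterwards.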
    858 +  
    859 +  **Important Files**:
    860 +  - Documentation: `terraform-observability-team/docs/content/Services/Centralized Logging/`
    861 +  
    862 +  ---
    863 +  
    864 +  ## 6. How Deployments Work
    865 +  
    866 +  This section explains how changes move from your laptop to production.
    867 +  
    868 +  ### Deployment Workflow: This Repository
    869 +  
    870 +  **For terraform-observability-team**:
    871 +  
    872 +  ```
    873 +  ┌─────────────────────────────────────────────────────────────────┐
    874 +  │ Step 1: Developer Creates PR                                    │
    875 +  ├─────────────────────────────────────────────────────────────────┤
    876 +  │  git checkout -b add-new-team-member                            │
    877 +  │  vim team.auto.tfvars  # Add new person                         │
    878 +  │  git commit -m "Add Jane Doe to team"                           │
    879 +  │  git push origin add-new-team-member                            │
    880 +  │  # Create PR on Bits                                            │
    881 +  └─────────────────────────────────────────────────────────────────┘
    882 +                              ▼
    883 +  ┌─────────────────────────────────────────────────────────────────┐
    884 +  │ Step 2: Atlantis Automatically Runs Plan                        │
    885 +  ├─────────────────────────────────────────────────────────────────┤
    886 +  │  Atlantis (bot) comments on PR:                                 │
    887 +  │                                                                  │
    888 +  │  Terraform Plan:                                                │
    889 +  │  + pagerduty_user.jane_doe                                      │
    890 +  │  + github_team_membership.jane_doe                              │
    891 +  │  + bits_team_membership.jane_doe                                │
    892 +  │                                                                  │
    893 +  │  Plan: 3 to add, 0 to change, 0 to destroy                      │
    894 +  └─────────────────────────────────────────────────────────────────┘
    895 +                              ▼
    896 +  ┌─────────────────────────────────────────────────────────────────┐
    897 +  │ Step 3: Team Reviews PR                                         │
    898 +  ├─────────────────────────────────────────────────────────────────┤
    899 +  │  Required Reviewers: ops/sre-o11y team                          │
    900 +  │  ✅ Check Terraform plan looks correct                          │
    901 +  │  ✅ Approve PR                                                  │
    902 +  └─────────────────────────────────────────────────────────────────┘
    903 +                              ▼
    904 +  ┌─────────────────────────────────────────────────────────────────┐
    905 +  │ Step 4: Merge PR                                                │
    906 +  ├─────────────────────────────────────────────────────────────────┤
    907 +  │  PR is merged to main branch                                    │
    908 +  │  (Atlantis does NOT apply automatically)                        │
    909 +  └─────────────────────────────────────────────────────────────────┘
    910 +                              ▼
    911 +  ┌─────────────────────────────────────────────────────────────────┐
    912 +  │ Step 5: Apply Changes                                           │
    913 +  ├─────────────────────────────────────────────────────────────────┤
    914 +  │  Comment on merged PR: "atlantis apply"                         │
    915 +  │                                                                  │
    916 +  │  Atlantis runs:                                                 │
    917 +  │  + Creates PagerDuty user                                       │
    918 +  │  + Adds to GitHub teams                                         │
    919 +  │  + Adds to Bits teams                                           │
    920 +  │                                                                  │
    921 +  │  Apply Complete! Resources: 3 added, 0 changed, 0 destroyed     │
    922 +  └─────────────────────────────────────────────────────────────────┘
    923 +                              ▼
    924 +  ┌─────────────────────────────────────────────────────────────────┐
    925 +  │ Step 6: Verify                                                  │
    926 +  ├─────────────────────────────────────────────────────────────────┤
    927 +  │  ✅ Jane receives PagerDuty invite                              │
    928 +  │  ✅ Jane appears in GitHub org                                  │
    929 +  │  ✅ Jane can access Bits repos                                  │
    930 +  └─────────────────────────────────────────────────────────────────┘
    931 +  ```
    932 +  
    933 +  **Key Points**:
    934 +  - **Atlantis automates Terraform**
    935 +  - **Plan is automatic**, apply is manual
    936 +  - **Always review the plan** before applying
    937 +  - **Vault secrets** are automatically loaded
    938 +  - **Apply happens via comment**: `atlantis apply`
    939 +  
    940 +  ### Deployment Workflow: Application Changes (Kubernetes)
    941 +  
    942 +  **For ops/o11y-helm-charts → ops/palantir → ArgoCD**:
    943 +  
    944 +  ```
    945 +  ┌─────────────────────────────────────────────────────────────────┐
    946 +  │ Step 1: Change Application Configuration                        │
    947 +  ├─────────────────────────────────────────────────────────────────┤
    948 +  │  Repo: ops/o11y-helm-charts                                     │
    949 +  │                                                                  │
    950 +  │  Example: Upgrade VictoriaMetrics version                       │
    951 +  │                                                                  │
    952 +  │  Edit: values-files/victoriametrics-ord2-us-staging.yaml        │
    953 +  │  Change:                                                         │
    954 +  │    victoriametrics:                                             │
    955 +  │      version: v1.99.0  →  v1.100.0                              │
    956 +  │                                                                  │
    957 +  │  git commit -m "Upgrade VictoriaMetrics to v1.100.0"            │
    958 +  │  git push, create PR, get review, merge                         │
    959 +  └─────────────────────────────────────────────────────────────────┘
    960 +                              ▼
    961 +  ┌─────────────────────────────────────────────────────────────────┐
    962 +  │ Step 2: GitHub Action Auto-Creates Palantir PR                  │
    963 +  ├─────────────────────────────────────────────────────────────────┤
    964 +  │  GitHub Action in o11y-helm-charts runs:                        │
    965 +  │    1. Renders Helm chart with new values                        │
    966 +  │    2. Generates updated Application manifests                   │
    967 +  │    3. Creates PR in ops/palantir with changes                   │
    968 +  │                                                                  │
    969 +  │  Palantir PR shows:                                             │
    970 +  │    - Updated VictoriaMetrics Application YAML                   │
    971 +  │    - New container image version                                │
    972 +  └─────────────────────────────────────────────────────────────────┘
    973 +                              ▼
    974 +  ┌─────────────────────────────────────────────────────────────────┐
    975 +  │ Step 3: Review and Merge Palantir PR                            │
    976 +  ├─────────────────────────────────────────────────────────────────┤
    977 +  │  Team reviews Palantir PR:                                      │
    978 +  │    ✅ Check image version is correct                            │
    979 +  │    ✅ Verify only staging cluster affected                      │
    980 +  │    ✅ Approve and merge                                         │
    981 +  └─────────────────────────────────────────────────────────────────┘
    982 +                              ▼
    983 +  ┌─────────────────────────────────────────────────────────────────┐
    984 +  │ Step 4: ArgoCD Detects Change                                   │
    985 +  ├─────────────────────────────────────────────────────────────────┤
    986 +  │  ArgoCD polls ops/palantir every 3 minutes                      │
    987 +  │  Detects: VictoriaMetrics Application changed                  │
    988 +  │  Status: "OutOfSync"                                            │
    989 +  └─────────────────────────────────────────────────────────────────┘
    990 +                              ▼
    991 +  ┌─────────────────────────────────────────────────────────────────┐
    992 +  │ Step 5: Manual Sync in ArgoCD                                   │
    993 +  ├─────────────────────────────────────────────────────────────────┤
    994 +  │  Engineer logs into ArgoCD UI                                   │
    995 +  │  Clicks "Sync" on VictoriaMetrics Application                   │
    996 +  │  ArgoCD applies changes to Kubernetes:                          │
    997 +  │    - Rolls out new vminsert pods (v1.100.0)                     │
    998 +  │    - Rolls out new vmselect pods (v1.100.0)                     │
    999 +  │    - Rolls out new vmstorage pods (v1.100.0)                    │
   1000 +  │  Status: "Synced" + "Healthy"                                   │
   1001 +  └─────────────────────────────────────────────────────────────────┘
   1002 +                              ▼
   1003 +  ┌─────────────────────────────────────────────────────────────────┐
   1004 +  │ Step 6: Verify Deployment                                       │
   1005 +  ├─────────────────────────────────────────────────────────────────┤
   1006 +  │  kubectl get pods -n victoriametrics                            │
   1007 +  │  → All pods running with new image                              │
   1008 +  │                                                                  │
   1009 +  │  Check Grafana dashboards                                       │
   1010 +  │  → Metrics still flowing                                        │
   1011 +  │  → No errors in logs                                            │
   1012 +  └─────────────────────────────────────────────────────────────────┘
   1013 +  ```
   1014 +  
   1015 +  **Key Points**:
   1016 +  - **Two PRs required**: One in o11y-helm-charts, one auto-generated in palantir
   1017 +  - **Staging uses main branch**, production uses release tags
   1018 +  - **Manual sync in ArgoCD** (staging/dev auto-sync, prod is manual)
   1019 +  - **Always verify** after deployment
   1020 +  
   1021 +  ### Repository Relationship Diagram
   1022 +  
   1023 +  ```
   1024 +  ┌────────────────────────────────────────────────────────────────┐
   1025 +  │                    SOURCE OF TRUTH                              │
   1026 +  │                                                                 │
   1027 +  │  ┌──────────────────────┐                                      │
   1028 +  │  │ ops/o11y-helm-charts │ ← Engineers make changes here        │
   1029 +  │  │ (Helm Chart + Values)│                                      │
   1030 +  │  └──────────┬───────────┘                                      │
   1031 +  │             │                                                   │
   1032 +  │             │ GitHub Action                                    │
   1033 +  │             │ (helm template + render)                         │
   1034 +  │             ▼                                                   │
   1035 +  │  ┌──────────────────────┐                                      │
   1036 +  │  │  ops/palantir        │ ← Auto-generated manifests           │
   1037 +  │  │  (Kubernetes YAML)   │                                      │
   1038 +  │  └──────────┬───────────┘                                      │
   1039 +  │             │                                                   │
   1040 +  │             │ ArgoCD polls every 3min                          │
   1041 +  │             ▼                                                   │
   1042 +  │  ┌──────────────────────┐                                      │
   1043 +  │  │  ArgoCD              │ ← Detects changes, syncs             │
   1044 +  │  └──────────┬───────────┘                                      │
   1045 +  │             │                                                   │
   1046 +  │             │ kubectl apply                                    │
   1047 +  │             ▼                                                   │
   1048 +  │  ┌──────────────────────┐                                      │
   1049 +  │  │  Kubernetes Cluster  │ ← Applications running               │
   1050 +  │  └──────────────────────┘                                      │
   1051 +  └────────────────────────────────────────────────────────────────┘
   1052 +  ```
   1053 +  
   1054 +  ---
   1055 +  
   1056 +  ## 7. Important Files You Need to Know
   1057 +  
   1058 +  ### Critical Files (Edit With Care!)
   1059 +  
   1060 +  #### `/team.auto.tfvars`
   1061 +  
   1062 +  **What**: Single source of truth for team membership
   1063 +  
   1064 +  **When to Edit**:
   1065 +  - Adding/removing team members
   1066 +  - Changing on-call rotation
   1067 +  - Modifying repository permissions
   1068 +  
   1069 +  **Structure**:
   1070 +  
   1071 +  ```hcl
   1072 +  # Team Members
   1073 +  observability_members = {
   1074 +    jdoe = {
   1075 +      name                 = "Jane Doe"
   1076 +      email                = "jane.doe@akamai.com"
   1077 +      job_title            = "Senior SRE"
   1078 +      github_username      = "jdoe"
   1079 +      github_admin         = false          # Admin on linode-obs GitHub org
   1080 +      bits_team_maintainer = true           # Maintainer role on Bits teams
   1081 +      bits_orgs            = ["ops", "Linode"]  # Which Bits orgs
   1082 +      pd_enabled           = true           # Add to PagerDuty
   1083 +    }
   1084 +    # ... more team members
   1085 +  }
   1086 +  
   1087 +  # On-call Rotation (Primary)
   1088 +  observability_oncall_primary = [
   1089 +    "current_oncall",   # MUST be first (currently on-call)
   1090 +    "person2",
   1091 +    "person3",
   1092 +    "new_person"        # Add new people at END
   1093 +  ]
   1094 +  
   1095 +  # On-call Schedule Start Time
   1096 +  # IMPORTANT: Set to Monday 11am ET of current on-call's shift
   1097 +  pagerduty_schedule_starttime = "2025-11-17T11:00:00-05:00"
   1098 +  
   1099 +  # Repository Configurations
   1100 +  observability_bits_repos = {
   1101 +    ops = [
   1102 +      {
   1103 +        name  = "prometheus_rules"
   1104 +        hooks = ["notification-o11y-prs", "atlantis"]
   1105 +        permissions = {
   1106 +          pull = false
   1107 +          push = true
   1108 +          admin = false
   1109 +        }
   1110 +        branch_protection = {
   1111 +          pattern = "main"
   1112 +          require_code_owners_review = true
   1113 +          required_approving_review_count = 1
   1114 +        }
   1115 +      }
   1116 +      # ... more repos
   1117 +    ]
   1118 +  }
   1119 +  ```
   1120 +  
   1121 +  **Important Notes**:
   1122 +  - **On-call order matters!** Current on-call must be first
   1123 +  - **Adding to on-call**: Put new person LAST
   1124 +  - **Update `pagerduty_schedule_starttime`** when changing on-call
   1125 +  - **Review carefully** - affects real permissions
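
The ordering rules are easier to follow with a small model. The sketch below assumes one-week shifts that follow list order from the schedule start; that is an assumption for illustration, and the real schedule is whatever the PagerDuty Terraform resources define. `validate_start` and `oncall_at` are hypothetical helpers.

```python
from datetime import datetime, timedelta


def validate_start(start_iso: str) -> datetime:
    """Parse the start time and check it is a Monday at 11:00 in its own offset."""
    start = datetime.fromisoformat(start_iso)
    if start.weekday() != 0 or (start.hour, start.minute) != (11, 0):
        raise ValueError(f"{start_iso} is not a Monday 11:00 start")
    return start


def oncall_at(rotation: list, start: datetime, when: datetime) -> str:
    """Return who is on call at `when`, assuming one-week shifts in list order."""
    if when < start:
        raise ValueError("time is before the schedule start")
    weeks_elapsed = (when - start) // timedelta(weeks=1)
    return rotation[weeks_elapsed % len(rotation)]


rotation = ["current_oncall", "person2", "person3", "new_person"]
start = validate_start("2025-11-17T11:00:00-05:00")   # a Monday
print(oncall_at(rotation, start, start))                        # first entry
print(oncall_at(rotation, start, start + timedelta(days=8)))    # second entry
```

Under this model, the first list entry is on call at the start time, which is why the current on-call must come first and new people are appended at the end.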
   1126 +  
   1127 +  #### `/pagerduty_suppression.auto.tfvars`
   1128 +  
   1129 +  **What**: Alert suppression during maintenance
   1130 +  
   1131 +  **When to Edit**:
   1132 +  - DC scaling events (temporarily suppress alerts)
   1133 +  - Maintenance windows
   1134 +  
   1135 +  **Structure**:
   1136 +  
   1137 +  ```hcl
   1138 +  # DCs with alerts suppressed
   1139 +  observability_suppressed_dcs = [
   1140 +    "iad5",    # Include both formats
   1141 +    "iad05",   # (some alerts use zero-padded)
   1142 +  ]
   1143 +  
   1144 +  # Services with suppression rules
   1145 +  observability_suppressed_services = [
   1146 +    "PJ95HJI",  # Prometheus Alerts service ID
   1147 +    # ... more service IDs
   1148 +  ]
   1149 +  ```
   1150 +  
   1151 +  **Process**:
   1152 +  1. Edit the file to add the DC (both name formats) and open a PR
   1153 +  2. `atlantis plan` to verify
   1154 +  3. Merge and `atlantis apply`
   1155 +  4. Suppression active
   1156 +  5. **IMPORTANT**: Remove DC after maintenance!
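
Since the list needs both the plain and the zero-padded DC name, a helper can derive the pair from either form. This is a sketch only: `dc_variants` is a hypothetical function, and it assumes DC names are lowercase letters followed by a number that is padded to two digits.

```python
import re
from typing import List


def dc_variants(dc: str) -> List[str]:
    """Return the plain and zero-padded forms of a DC name, e.g. iad5 and iad05."""
    match = re.fullmatch(r"([a-z]+)(\d+)", dc)
    if match is None:
        raise ValueError(f"unexpected DC name: {dc}")
    prefix, number = match.group(1), int(match.group(2))
    variants = [f"{prefix}{number}", f"{prefix}{number:02d}"]
    # Drop the duplicate when the number already has two or more digits
    return list(dict.fromkeys(variants))


print(dc_variants("iad5"))
```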
   1157 +  
   1158 +  #### `/atlantis.yaml`
   1159 +  
   1160 +  **What**: Atlantis automation configuration
   1161 +  
   1162 +  **When to Edit**: Rarely (only if changing Terraform workflow)
   1163 +  
   1164 +  **Key Settings**:
   1165 +  
   1166 +  ```yaml
   1167 +  version: 3
   1168 +  automerge: false
   1169 +  delete_source_branch_on_merge: false
   1170 +  
   1171 +  projects:
   1172 +  - name: terraform-observability-team
   1173 +    dir: .
   1174 +    workspace: default
   1175 +    terraform_version: v1.5.7
   1176 +  
   1177 +    # Automatic plan on PR
   1178 +    autoplan:
   1179 +      when_modified: ["*.tf", "*.tfvars"]
   1180 +      enabled: true
   1181 +  
   1182 +    # Requirements before apply
   1183 +    apply_requirements:
   1184 +      - approved      # PR must be approved
   1185 +      - mergeable     # PR must be mergeable
   1186 +  
   1187 +    workflow: observability-team
   1188 +  
   1189 +  workflows:
   1190 +    observability-team:
   1191 +      plan:
   1192 +        steps:
   1193 +        - run: vault login -method=oidc  # Authenticate to Vault
   1194 +        - init
   1195 +        - plan
   1196 +      apply:
   1197 +        steps:
   1198 +        - run: vault login -method=oidc
   1199 +        - apply
   1200 +  ```
   1201 +  
   1202 +  **Don't Touch Unless**: You know what you're doing with Atlantis
   1203 +  
   1204 +  ### Configuration Files
   1205 +  
   1206 +  #### `/.envrc`
   1207 +  
   1208 +  **What**: Environment variables for local development (loaded by direnv)
   1209 +  
   1210 +  **Contents**:
   1211 +  
   1212 +  ```bash
   1213 +  export VAULT_ADDR="https://vault.infra.linode.com"
   1214 +  export VAULT_NAMESPACE="infra"
   1215 +  
   1216 +  # Auto-login to Vault when entering directory
   1217 +  vault login -method=oidc username=$USER
   1218 +  ```
   1219 +  
   1220 +  **Usage**:
   1221 +  1. Install direnv: `brew install direnv`
   1222 +  2. Add the hook to your shell config (for zsh, `~/.zshrc`): `eval "$(direnv hook zsh)"`
   1223 +  3. Allow in this directory: `direnv allow`
   1224 +  4. Now auto-authenticated to Vault when you `cd` here
   1225 +  
   1226 +  #### `/.pre-commit-config.yaml`
   1227 +  
   1228 +  **What**: Automated code quality checks before commits
   1229 +  
   1230 +  **Hooks**:
   1231 +  - `terraform fmt` - Format Terraform code
   1232 +  - `terraform validate` - Validate Terraform syntax
   1233 +  - `markdownlint` - Lint Markdown docs
   1234 +  - `vale` - Prose style checking
   1235 +  - `prettier` - Format YAML/JSON/Markdown
   1236 +  - `trailing-whitespace` - Remove trailing whitespace
   1237 +  - `end-of-file-fixer` - Ensure files end with newline
   1238 +  
   1239 +  **Setup**:
   1240 +  ```bash
   1241 +  pre-commit install --install-hooks
   1242 +  ```
   1243 +  
   1244 +  **Usage**: Runs automatically on `git commit`
   1245 +  
   1246 +  **Skip if needed** (not recommended):
   1247 +  ```bash
   1248 +  git commit --no-verify
   1249 +  ```
   1250 +  
   1251 +  #### `/.mise.toml`
   1252 +  
   1253 +  **What**: Task runner configuration for mise (fills the role of a Makefile)
   1254 +  
   1255 +  **Common Tasks**:
   1256 +  
   1257 +  ```bash
   1258 +  # Create new on-call log for this week
   1259 +  mise run oncall
   1260 +  
   1261 +  # Create new proposal
   1262 +  mise run proposal
   1263 +  
   1264 +  # Run pre-commit on all files
   1265 +  mise run pre-commit-all
   1266 +  
   1267 +  # Serve documentation locally
   1268 +  mise run docs-serve
   1269 +  ```
   1270 +  
   1271 +  **View all tasks**:
   1272 +  ```bash
   1273 +  mise tasks
   1274 +  ```
   1275 +  
   1276 +  ### Documentation Files
   1277 +  
   1278 +  #### `/docs/config.toml`
   1279 +  
   1280 +  **What**: Hugo site configuration
   1281 +  
   1282 +  **Key Settings**:
   1283 +  
   1284 +  ```toml
   1285 +  baseURL = "https://bits.linode.com/pages/ops/terraform-observability-team/"
   1286 +  title = "SRE Observability Team"
   1287 +  theme = "docsy"
   1288 +  
   1289 +  [params]
   1290 +    description = "SRE Observability Team Documentation"
   1291 +    github_repo = "https://bits.linode.com/ops/terraform-observability-team"
   1292 +    github_branch = "main"
   1293 +  ```
   1294 +  
   1295 +  **When to Edit**: Changing site metadata, theme settings
   1296 +  
   1297 +  ---
   1298 +  
   1299 +  ## 8. Documentation Structure
   1300 +  
   1301 +  ### How Documentation is Organized
   1302 +  
   1303 +  The team uses **Hugo** with the **Docsy** theme for documentation.
   1304 +  
   1305 +  **Why Hugo?**
   1306 +  - Static site generator (fast, secure)
   1307 +  - Markdown-based (easy to write)
   1308 +  - Version controlled (in git)
   1309 +  - Searchable
   1310 +  - Navigation sidebar auto-generated
   1311 +  
   1312 +  ### Content Types
   1313 +  
   1314 +  #### Handbooks (`/docs/content/Handbooks/`)
   1315 +  
   1316 +  **Purpose**: How the team operates
   1317 +  
   1318 +  **Files**:
   1319 +  - `on-call.md` - On-call guide (schedule, PagerDuty setup, responsibilities)
   1320 +  - `tools.md` - Standardized tooling (Go, Jsonnet, Kubernetes, etc.)
   1321 +  - `git-conventions.md` - Commit message format, branching strategy
   1322 +  - `docs/` - How to write documentation
   1323 +  
   1324 +  **When to Update**: Process changes, new tools adopted
   1325 +  
   1326 +  #### Services (`/docs/content/Services/`)
   1327 +  
   1328 +  **Purpose**: Documentation for each service we manage
   1329 +  
   1330 +  **Structure**:
   1331 +  ```
   1332 +  Services/
   1333 +  ├── ArgoCD/
   1334 +  │   ├── _index.md              # Overview
   1335 +  │   ├── repositories.md        # Repo management
   1336 +  │   └── troubleshooting.md
   1337 +  ├── VictoriaMetrics/
   1338 +  │   ├── _index.md
   1339 +  │   ├── architecture.md
   1340 +  │   ├── upgrade.md
   1341 +  │   └── troubleshooting.md
   1342 +  ├── Prometheus/
   1343 +  │   ├── _index.md
   1344 +  │   ├── sharding.md
   1345 +  │   ├── Updating/
   1346 +  │   │   └── updating.md
   1347 +  │   └── troubleshooting.md
   1348 +  └── ...
   1349 +  ```
   1350 +  
   1351 +  **When to Update**: Service upgrades, architecture changes, new troubleshooting steps
   1352 +  
   1353 +  #### MOPs (`/docs/content/mops/`)
   1354 +  
   1355 +  **Purpose**: Manual Operations Procedures - step-by-step guides for complex tasks
   1356 +  
   1357 +  **MOP Structure**:
   1358 +  
   1359 +  ```markdown
   1360 +  # MOP: Prometheus Shard
   1361 +  
   1362 +  ## Overview
   1363 +  High-level description of the procedure.
   1364 +  
   1365 +  ## Prerequisites
   1366 +  - [ ] Access to Salt master
   1367 +  - [ ] 2-4 hours of focused time
   1368 +  - [ ] Approval from team lead
   1369 +  
   1370 +  ## Procedure
   1371 +  
   1372 +  ### Step 1: Prepare
   1373 +  Detailed instructions...
   1374 +  
   1375 +  ### Step 2: Execute
   1376 +  More instructions...
   1377 +  
   1378 +  ## Verification
   1379 +  How to verify the procedure succeeded.
   1380 +  
   1381 +  ## Rollback Plan
   1382 +  How to undo changes if something goes wrong.
   1383 +  
   1384 +  ## References
   1385 +  - [Related Documentation](link)
   1386 +  ```
   1387 +  
   1388 +  **Examples**:
   1389 +  - `prometheus-shard.md` - Adding a new Prometheus shard
   1390 +  - `victoriametrics-cluster-upgrade.md` - Upgrading VictoriaMetrics
   1391 +  
   1392 +  **When to Create**: Complex multi-step procedures that are done infrequently
   1393 +  
   1394 +  #### Proposals (`/docs/content/proposals/`)
   1395 +  
   1396 +  **Purpose**: Design documents for major changes
   1397 +  
   1398 +  **Proposal Naming**: `OP-##-descriptive-name.md`
   1399 +  - OP = Observability Proposal
   1400 +  - ## = Sequential number (01, 02, 03...)
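
As a rough illustration of the naming rule above, the sketch below picks the next sequential number from a list of existing filenames. The helper name `next_proposal_filename`, the two-digit zero padding, and the slug rule (lowercase words joined by hyphens) are assumptions for illustration; `mise run proposal` may behave differently.

```python
import re

PROPOSAL_RE = re.compile(r"^OP-(\d+)-.+\.md$")


def next_proposal_filename(existing: list[str], title: str) -> str:
    """Return the next OP-##-descriptive-name.md filename (hypothetical helper)."""
    # Collect the numbers already used by existing proposal files.
    numbers = [int(m.group(1)) for name in existing if (m := PROPOSAL_RE.match(name))]
    next_number = max(numbers, default=0) + 1
    # Assumed slug rule: lowercase alphanumeric words joined by hyphens.
    slug = "-".join(re.findall(r"[a-z0-9]+", title.lower()))
    return f"OP-{next_number:02d}-{slug}.md"


print(next_proposal_filename(["OP-01-loki.md", "OP-04-sharding.md"], "My Proposal"))
# OP-05-my-proposal.md
```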
   1401 +  
   1402 +  **Proposal Template**:
   1403 +  
   1404 +  ```markdown
   1405 +  ---
   1406 +  title: "OP-05: My Proposal Title"
   1407 +  status: "accepted"  # draft | accepted | rejected | done
   1408 +  date: 2025-11-20
   1409 +  ---
   1410 +  
   1411 +  ## Summary
   1412 +  One paragraph summary.
   1413 +  
   1414 +  ## Motivation
   1415 +  Why are we doing this?
   1416 +  
   1417 +  ## Proposal
   1418 +  Detailed design.
   1419 +  
   1420 +  ## Alternatives Considered
   1421 +  What else did we think about?
   1422 +  
   1423 +  ## Implementation Plan
   1424 +  How will we do this?
   1425 +  
   1426 +  ## Success Metrics
   1427 +  How do we know it worked?
   1428 +  ```
   1429 +  
   1430 +  **Statuses**:
   1431 +  - `draft` - Being written
   1432 +  - `accepted` - Approved, not yet implemented
   1433 +  - `done` - Implemented
   1434 +  - `rejected` - Not approved
   1435 +  
   1436 +  **Creating a Proposal**:
   1437 +  ```bash
   1438 +  cd docs/
   1439 +  mise run proposal
   1440 +  # OR
   1441 +  hugo new proposals/OP-##-my-proposal.md
   1442 +  ```
   1443 +  
   1444 +  **Approval Process**:
   1445 +  1. Create proposal as draft
   1446 +  2. Share with team for feedback
   1447 +  3. Present in team meeting
   1448 +  4. Minimum 2 approvals required
   1449 +  5. Update status to "accepted"
   1450 +  
   1451 +  #### On-call Logs (`/docs/content/on-call/YYYY/`)
   1452 +  
   1453 +  **Purpose**: Weekly logs of on-call work
   1454 +  
   1455 +  **Structure**:
   1456 +  
   1457 +  ```markdown
   1458 +  ---
   1459 +  title: "On-call: 2025-11-18"
   1460 +  date: 2025-11-18
   1461 +  author: "Jane Doe"
   1462 +  ---
   1463 +  
   1464 +  ## Summary
   1465 +  Brief summary of the week.
   1466 +  
   1467 +  ## Incidents
   1468 +  ### [INC-1234] Production Prometheus Down
   1469 +  - **When**: 2025-11-18 14:00 UTC
   1470 +  - **Impact**: 5 minutes of data loss
   1471 +  - **Root Cause**: Out of disk space
   1472 +  - **Resolution**: Cleaned up old WAL files
   1473 +  - **Follow-up**: Created ticket to increase disk size
   1474 +  
   1475 +  ## Reliability Improvements
   1476 +  - Automated disk cleanup script
   1477 +  - Added disk space alerting
   1478 +  
   1479 +  ## Intake & Requests
   1480 +  - Granted Grafana access to 3 new users
   1481 +  - Helped Product team with dashboard creation
   1482 +  
   1483 +  ## Notes
   1484 +  - VictoriaMetrics upgrade planned for next week
   1485 +  - Need to review Prometheus sharding in iad5
   1486 +  ```
   1487 +  
   1488 +  **Creating On-call Log**:
   1489 +  ```bash
   1490 +  cd docs/
   1491 +  mise run oncall
   1492 +  # Creates: on-call/2025/2025-11-18.md (for current Monday)
   1493 +  ```
   1494 +  
   1495 +  **Handoff Process**:
   1496 +  1. Monday 11am ET - on-call shift changes
   1497 +  2. Outgoing on-call fills out summary
   1498 +  3. Posts link in #o11y-core Slack thread
   1499 +  4. Incoming on-call reads to catch up
   1500 +  
   1501 +  #### Runbooks (`/docs/content/Runbooks/`)
   1502 +  
   1503 +  **Purpose**: How to respond to specific alerts
   1504 +  
   1505 +  **Runbook Structure**:
   1506 +  
   1507 +  ````markdown
   1508 +  # Alert: HighMemoryUsage
   1509 +  
   1510 +  ## Summary
   1511 +  This alert fires when a server's memory usage exceeds 90% for 5 minutes.
   1512 +  
   1513 +  ## Impact
   1514 +  - Potential performance degradation
   1515 +  - Risk of OOM killer terminating processes
   1516 +  
   1517 +  ## Investigation Steps
   1518 +  1. Check which process is using memory:
   1519 +     ```bash
   1520 +     ssh server
   1521 +     top -o %MEM
   1522 +     ```
   1523 +  
   1524 +  2. List the top memory consumers (steady growth over time suggests a leak):
   1525 +     ```bash
   1526 +     ps aux --sort=-%mem | head -n 10
   1527 +     ```
   1528 +  
   1529 +  ## Resolution
   1530 +  - Restart offending process
   1531 +  - Increase memory if consistently high
   1532 +  - Check for memory leak in application
   1533 +  
   1534 +  ## Escalation
   1535 +  If you can't resolve in 30 minutes, escalate to:
   1536 +  - #team-infrastructure
   1537 +  ````
   1538 +  
   1539 +  **When to Create**: For any alert that pages
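
The sample runbook describes an alert that fires only when a condition holds for a sustained period ("exceeds 90% for 5 minutes"). A minimal sketch of that idea follows. The sample format and the function name `is_firing` are hypothetical, and a real alerting system evaluates rules on its own schedule rather than on raw samples like these.

```python
from datetime import datetime, timedelta


def is_firing(samples, threshold=90.0, hold=timedelta(minutes=5)):
    """Return True if usage stayed above `threshold` for at least `hold`.

    `samples` is a time-ordered list of (timestamp, memory_percent) tuples.
    """
    breach_start = None
    for timestamp, percent in samples:
        if percent > threshold:
            # Remember when the current breach began.
            breach_start = breach_start or timestamp
            if timestamp - breach_start >= hold:
                return True
        else:
            # Dropping below the threshold resets the timer.
            breach_start = None
    return False


t0 = datetime(2025, 1, 1, 12, 0)
spike = [(t0 + timedelta(minutes=i), 95.0) for i in range(3)]
sustained = [(t0 + timedelta(minutes=i), 95.0) for i in range(6)]
print(is_firing(spike), is_firing(sustained))  # False True
```

A short spike never reaches the hold time, so it does not page anyone; only the sustained breach does.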
   1540 +  
   1541 +  ### Writing Documentation
   1542 +  
   1543 +  **Creating New Pages**:
   1544 +  
   1545 +  ```bash
   1546 +  cd docs/
   1547 +  
   1548 +  # New service documentation
   1549 +  hugo new Services/MyService/_index.md
   1550 +  
   1551 +  # New MOP
   1552 +  hugo new mops/my-procedure.md
   1553 +  
   1554 +  # New proposal
   1555 +  mise run proposal
   1556 +  
   1557 +  # New on-call log
   1558 +  mise run oncall
   1559 +  ```
   1560 +  
   1561 +  **Markdown Tips**:
   1562 +  
   1563 +  ````markdown
   1564 +  # Headers
   1565 +  Use # for title, ## for sections, ### for subsections
   1566 +  
   1567 +  # Links
   1568 +  [Link Text](https://example.com)
   1569 +  [Internal Link]({{< ref "path/to/page.md" >}})
   1570 +  
   1571 +  # Code Blocks
   1572 +  ```bash
   1573 +  command here
   1574 +  ```
   1575 +  
   1576 +  # Images
   1577 +  ![Alt Text](/static/images/diagram.png)
   1578 +  
   1579 +  # Admonitions (special boxes)
   1580 +  {{< alert title="Warning" >}}
   1581 +  This is important!
   1582 +  {{< /alert >}}
   1583 +  ````
   1584 +  
   1585 +  **Documentation Standards**:
   1586 +  1. **Clarity**: Write for someone unfamiliar with the topic
   1587 +  2. **Examples**: Include real examples, not just theory
   1588 +  3. **Up-to-date**: Update docs when processes change
   1589 +  4. **Searchable**: Use descriptive headers and titles
   1590 +  
   1591 +  **Building Locally**:
   1592 +  
   1593 +  ```bash
   1594 +  cd docs/
   1595 +  mise run docs-serve
   1596 +  # Open http://localhost:1313
   1597 +  ```
   1598 +  
   1599 +  **Publishing**:
   1600 +  - Merged to `main` branch → GitHub Action builds → Published to Bits Pages
   1601 +  
   1602 +  ---
   1603 +  
   1604 +  ## 9. Day-to-Day Operations
   1605 +  
   1606 +  ### On-call Responsibilities
   1607 +  
   1608 +  **On-call Shift**: Monday 11am ET → Next Monday 11am ET
   1609 +  
   1610 +  **Primary On-call Duties**:
   1611 +  
   1612 +  1. **Respond to Pages** (Critical alerts)
   1613 +     - Acknowledgment: ≤ 5 minutes
   1614 +     - Begin investigation immediately
   1615 +     - Update incident ticket
   1616 +     - Escalate if needed
   1617 +  
   1618 +  2. **Monitor Warnings** (Non-critical alerts)
   1619 +     - Acknowledgment: ≤ 12 hours
   1620 +     - Investigate during business hours
   1621 +     - Create tickets for follow-up
   1622 +  
   1623 +  3. **Handle Intake Requests**
   1624 +     - Grafana permission requests
   1625 +     - Nagios user management
   1626 +     - Quick questions in Slack
   1627 +  
   1628 +  4. **PR Reviews**
   1629 +     - Review PRs for ops/sre-o11y repos
   1630 +     - Priority: Blocking changes first
   1631 +  
   1632 +  5. **Reliability Improvement**
   1633 +     - Choose ONE reliability task per week
   1634 +     - Examples: Automate toil, improve documentation, fix flaky alerts
   1635 +  
   1636 +  6. **Attend Post-Mortems**
   1637 +     - For incidents you responded to
   1638 +     - Share learnings with team
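
The acknowledgment targets in the list above can be expressed as a small lookup. This is only a sketch: the severity keys `critical` and `warning` and the helper names are assumptions, not names taken from the team's alerting configuration.

```python
from datetime import datetime, timedelta, timezone

# Acknowledgment targets from the duties list above.
ACK_TARGETS = {
    "critical": timedelta(minutes=5),
    "warning": timedelta(hours=12),
}


def ack_deadline(severity: str, fired_at: datetime) -> datetime:
    """Return the time by which an alert should be acknowledged."""
    return fired_at + ACK_TARGETS[severity]


def ack_overdue(severity: str, fired_at: datetime, now: datetime) -> bool:
    """Return True if the acknowledgment target has already passed."""
    return now > ack_deadline(severity, fired_at)


fired = datetime(2025, 11, 18, 14, 0, tzinfo=timezone.utc)
print(ack_deadline("critical", fired).isoformat())  # 2025-11-18T14:05:00+00:00
```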
   1639 +  
   1640 +  **On-call Handoff**:
   1641 +  1. Fill out on-call log summary
   1642 +  2. Post in #o11y-core Slack thread
   1643 +  3. Highlight any ongoing issues
   1644 +  4. Transfer any active incidents
   1645 +  
   1646 +  ### Common Daily Tasks
   1647 +  
   1648 +  #### Reviewing PRs
   1649 +  
   1650 +  **Repositories We Review**:
   1651 +  - `ops/prometheus_rules`
   1652 +  - `ops/prometheus-formula`
   1653 +  - `ops/o11y-helm-charts`
   1654 +  - `ops/palantir`
   1655 +  - `ops/loki_rules`
   1656 +  - `ops/terraform-grafana-config`
   1657 +  - `terraform-observability-team`
   1658 +  
   1659 +  **Review Checklist**:
   1660 +  - [ ] Read PR description
   1661 +  - [ ] Check CI passes
   1662 +  - [ ] Review code changes
   1663 +  - [ ] Verify Terraform plan (if applicable)
   1664 +  - [ ] Check for secrets in diff
   1665 +  - [ ] Ensure tests added (if applicable)
   1666 +  - [ ] Approve or request changes
   1667 +  
   1668 +  **Slack Notifications**: #notification-o11y-prs
   1669 +  
   1670 +  #### Granting Grafana Access
   1671 +  
   1672 +  **Request Format**: Usually in #sre-observability or Jira (OY project)
   1673 +  
   1674 +  **Process**:
   1675 +  1. Determine access level needed
   1676 +     - Viewer: Read dashboards
   1677 +     - Editor: Create/edit dashboards
   1678 +     - Admin: Manage users (rare)
   1679 +  
   1680 +  2. Add user in Terraform:
   1681 +     ```bash
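   1682 +     cd ~/path/to/terraform-grafana-config
   1683 +     git checkout -b add-jane-doe-grafana   # Example branch name
   1684 +     vim users.tf                           # Add the user definition
   1685 +     git commit -am "Add Jane Doe to Grafana"
   1686 +     git push origin add-jane-doe-grafana   # Then create a PR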
   1687 +     ```
   1688 +  
   1689 +  3. After PR merged:
   1690 +     - Atlantis applies
   1691 +     - User receives invite email
   1692 +  
   1693 +  4. Notify requester
   1694 +  
   1695 +  **Time-sensitive**: Try to complete within 1 business day
   1696 +  
   1697 +  #### Monitoring Alerts
   1698 +  
   1699 +  **Alert Channels**:
   1700 +  - **#notification-o11y** - All observability alerts
   1701 +  - **PagerDuty** - Critical alerts (pages)
   1702 +  - **Email** - Low-priority warnings
   1703 +  
   1704 +  **Triage Process**:
   1705 +  
   1706 +  1. **Check Alert Severity**
   1707 +     - **Critical**: Pages on-call; respond immediately
   1708 +     - **Warning**: Investigate during business hours
   1709 +     - **Info**: Log for awareness
   1710 +  
   1711 +  2. **Check Runbook**
   1712 +     - Most alerts link to runbook
   1713 +     - Follow investigation steps
   1714 +  
   1715 +  3. **Create Incident Ticket** (if needed)
   1716 +     - Jira project: OY
   1717 +     - Include alert details
   1718 +     - Track resolution
   1719 +  
   1720 +  4. **Silence if Necessary**
   1721 +     ```bash
   1722 +     # Silence alert for maintenance
   1723 +     amtool silence add \
   1724 +       --alertmanager.url=https://alertmanager.linode.com \
   1725 +       --comment="Datacenter maintenance" \
   1726 +       --duration=2h \
   1727 +       alertname="HighCPU" \
   1728 +       datacenter="iad5"
   1729 +     ```
   1730 +  
   1731 +  5. **Post-Incident**
   1732 +     - Document in on-call log
   1733 +     - Create follow-up tickets
   1734 +     - Update runbook if needed
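
The silence in step 4 can also be expressed as a request body for the Alertmanager v2 silences API (`POST /api/v2/silences`). The sketch below only builds the JSON body and does not send it. The helper name `build_silence` and the `created_by` value are illustrative assumptions; the matchers mirror the `amtool` example above.

```python
import json
from datetime import datetime, timedelta, timezone


def build_silence(matchers: dict[str, str], comment: str, created_by: str,
                  duration: timedelta, start: datetime) -> dict:
    """Build a silence body in the shape the Alertmanager v2 API expects."""
    return {
        "matchers": [
            # Exact-match (non-regex) label matchers, like name="value" in amtool.
            {"name": name, "value": value, "isRegex": False, "isEqual": True}
            for name, value in matchers.items()
        ],
        "startsAt": start.isoformat(),
        "endsAt": (start + duration).isoformat(),
        "createdBy": created_by,
        "comment": comment,
    }


silence = build_silence(
    {"alertname": "HighCPU", "datacenter": "iad5"},
    comment="Datacenter maintenance",
    created_by="jdoe",  # hypothetical author
    duration=timedelta(hours=2),
    start=datetime(2025, 11, 18, 14, 0, tzinfo=timezone.utc),
)
print(json.dumps(silence, indent=2))
```

Building the body separately from sending it makes the silence easy to review before it is applied.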
   1735 +  
   1736 +  ### Weekly Team Rituals
   1737 +  
   1738 +  **Monday 11am ET: On-call Handoff**
   1739 +  - Outgoing on-call posts summary
   1740 +  - Incoming on-call reviews
   1741 +  
   1742 +  **Wednesday 10am ET: Team Sync** (if scheduled)
   1743 +  - Current work updates
   1744 +  - Blocker discussion
   1745 +  - Knowledge sharing
   1746 +  
   1747 +  **Fridays: Reliability Improvement Time**
   1748 +  - Work on tech debt
   1749 +  - Improve automation
   1750 +  - Update documentation
   1751 +  
   1752 +  ---
   1753 +  
   1754 +  ## 10. Common Tasks with Examples
   1755 +  
   1756 +  Let me show you how to do common tasks step-by-step.
   1757 +  
   1758 +  ### Task 1: Add a New Team Member
   1759 +  
   1760 +  **Scenario**: Jane Doe is joining the team.
   1761 +  
   1762 +  **Steps**:
   1763 +  
   1764 +  ```bash
   1765 +  # 1. Clone repo (if not already)
   1766 +  cd ~/repos
   1767 +  git clone bits.linode.com:ops/terraform-observability-team
   1768 +  cd terraform-observability-team
   1769 +  
   1770 +  # 2. Create branch
   1771 +  git checkout -b add-jane-doe
   1772 +  
   1773 +  # 3. Edit team.auto.tfvars
   1774 +  vim team.auto.tfvars
   1775 +  
   1776 +  # Add to observability_members:
   1777 +  observability_members = {
   1778 +    # ... existing members ...
   1779 +  
   1780 +    jdoe = {
   1781 +      name                 = "Jane Doe"
   1782 +      email                = "jane.doe@akamai.com"
   1783 +      job_title            = "SRE II"
   1784 +      github_username      = "jdoe-akamai"
   1785 +      github_admin         = false
   1786 +      bits_team_maintainer = true
   1787 +      bits_orgs            = ["ops", "Linode"]
   1788 +      pd_enabled           = true
   1789 +    }
   1790 +  }
   1791 +  
   1792 +  # 4. Commit and push
   1793 +  git add team.auto.tfvars
   1794 +  git commit -m "Add Jane Doe to SRE O11y team"
   1795 +  git push origin add-jane-doe
   1796 +  
   1797 +  # 5. Create PR on Bits
   1798 +  # Visit: bits.linode.com/ops/terraform-observability-team
   1799 +  # Click "Create Pull Request"
   1800 +  
   1801 +  # 6. Wait for Atlantis to run plan
   1802 +  # Review plan in PR comments
   1803 +  
   1804 +  # 7. Get PR approved by team member
   1805 +  
   1806 +  # 8. Merge PR
   1807 +  
   1808 +  # 9. Apply changes
   1809 +  # Comment on merged PR: "atlantis apply"
   1810 +  
   1811 +  # 10. Verify
   1812 +  # - Check PagerDuty: Jane should appear in team
   1813 +  # - Check GitHub: Jane should be in linode-obs org
   1814 +  # - Check Bits: Jane should be in ops/sre-o11y team
   1815 +  ```
   1816 +  
   1817 +  ### Task 2: Add Someone to On-call Rotation
   1818 +  
   1819 +  **Scenario**: Jane is trained and ready for on-call.
   1820 +  
   1821 +  **Important**: On-call rotation order matters!
   1822 +  
   1823 +  **Steps**:
   1824 +  
   1825 +  ```bash
   1826 +  # 1. Determine current on-call
   1827 +  # Check PagerDuty schedule or ask in Slack
   1828 +  
   1829 +  # 2. Create branch
   1830 +  git checkout -b add-jane-oncall
   1831 +  
   1832 +  # 3. Edit team.auto.tfvars
   1833 +  vim team.auto.tfvars
   1834 +  
   1835 +  # BEFORE:
   1836 +  observability_oncall_primary = [
   1837 +    "current_oncall",
   1838 +    "person2",
   1839 +    "person3",
   1840 +  ]
   1841 +  pagerduty_schedule_starttime = "2025-11-18T11:00:00-05:00"
   1842 +  
   1843 +  # AFTER:
   1844 +  observability_oncall_primary = [
   1845 +    "current_oncall",   # Must be first!
   1846 +    "person2",
   1847 +    "person3",
   1848 +    "jdoe"              # Add new person LAST
   1849 +  ]
   1850 +  # Update to Monday of current on-call's shift:
   1851 +  pagerduty_schedule_starttime = "2025-11-18T11:00:00-05:00"
   1852 +  
   1853 +  # 4. Commit, push, PR, merge, apply (same as Task 1)
   1854 +  ```
   1855 +  
   1856 +  **Why Order Matters**:
   1857 +  - PagerDuty schedule starts at `pagerduty_schedule_starttime`
   1858 +  - Rotates through the list in order
   1859 +  - If the order changes, PagerDuty reassigns shifts and the wrong person can end up on-call
   1860 +  - **Always**: Current on-call first, new person last
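
To see why the list order and start time are coupled, here is a minimal round-robin sketch. It assumes one-week shifts, as described in the on-call section, and ignores overrides and daylight-saving adjustments; `on_call_at` is a hypothetical helper, not part of PagerDuty or this repository.

```python
from datetime import datetime, timedelta


def on_call_at(rotation: list[str], start: datetime, when: datetime,
               shift: timedelta = timedelta(weeks=1)) -> str:
    """Return who is on-call at `when` for a simple round-robin schedule."""
    if when < start:
        raise ValueError("schedule has not started yet")
    # Number of whole shifts elapsed since the schedule start.
    shifts_elapsed = (when - start) // shift
    return rotation[shifts_elapsed % len(rotation)]


rotation = ["current_oncall", "person2", "person3", "jdoe"]
start = datetime.fromisoformat("2025-11-18T11:00:00-05:00")

print(on_call_at(rotation, start, start))                       # current_oncall
print(on_call_at(rotation, start, start + timedelta(weeks=3)))  # jdoe
```

Reordering the list while keeping the same start time changes the answer for every date, which is why the current on-call must stay first and the new person goes last.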
   1861 +  
   1862 +  ### Task 3: Suppress Alerts for Datacenter Maintenance
   1863 +  
   1864 +  **Scenario**: Datacenter iad5 is being scaled, so alerts are expected.
   1865 +  
   1866 +  **Steps**:
   1867 +  
   1868 +  ```bash
   1869 +  # 1. Create branch
   1870 +  git checkout -b suppress-iad5
   1871 +  
   1872 +  # 2. Edit pagerduty_suppression.auto.tfvars
   1873 +  vim pagerduty_suppression.auto.tfvars
   1874 +  
   1875 +  # Add DCs (include both formats!):
   1876 +  observability_suppressed_dcs = [
   1877 +    "iad5",
   1878 +    "iad05",   # Some alerts use zero-padded
   1879 +  ]
   1880 +  
   1881 +  # 3. Commit and push
   1882 +  git add pagerduty_suppression.auto.tfvars
   1883 +  git commit -m "Suppress alerts for iad5 during scaling"
   1884 +  git push origin suppress-iad5
   1885 +  
   1886 +  # 4. Create PR, get approved, merge
   1887 +  
   1888 +  # 5. Apply immediately
   1889 +  # Comment: "atlantis apply"
   1890 +  
   1891 +  # 6. After maintenance completes, REMOVE suppression
   1892 +  git checkout -b unsuppress-iad5
   1893 +  vim pagerduty_suppression.auto.tfvars
   1894 +  # Remove iad5, iad05 from list
   1895 +  git commit -am "Remove iad5 alert suppression"
   1896 +  # Push, PR, merge, apply
   1897 +  ```
   1898 +  
   1899 +  **Important**: Don't forget to remove suppression after!
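
Because the suppression list needs both spellings of a datacenter name, a small helper can generate them from either form. This sketch assumes names consist of lowercase letters followed by a number and that the padded form uses two digits; `dc_name_variants` is a hypothetical name.

```python
import re


def dc_name_variants(dc: str) -> list[str]:
    """Return the plain and zero-padded spellings of a datacenter name."""
    match = re.fullmatch(r"([a-z]+)0*(\d+)", dc)
    if match is None:
        raise ValueError(f"unexpected datacenter name: {dc!r}")
    prefix, number = match.group(1), int(match.group(2))
    variants = [f"{prefix}{number}", f"{prefix}{number:02d}"]
    # Two-digit numbers yield the same string twice; keep one copy.
    return list(dict.fromkeys(variants))


print(dc_name_variants("iad5"))   # ['iad5', 'iad05']
print(dc_name_variants("iad05"))  # ['iad5', 'iad05']
```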
   1900 +  
   1901 +  ### Task 4: Upgrade VictoriaMetrics
   1902 +  
   1903 +  **Scenario**: A new VictoriaMetrics version has been released and staging needs to be upgraded.
   1904 +  
   1905 +  **Steps**:
   1906 +  
   1907 +  ```bash
   1908 +  # 1. Review changelog
   1909 +  # Check: github.com/VictoriaMetrics/VictoriaMetrics/releases
   1910 +  
   1911 +  # 2. Clone o11y-helm-charts
   1912 +  cd ~/repos
   1913 +  git clone bits.linode.com:ops/o11y-helm-charts
   1914 +  cd o11y-helm-charts
   1915 +  
   1916 +  # 3. Create branch
   1917 +  git checkout -b victoriametrics-v1.100.0
   1918 +  
   1919 +  # 4. Update staging cluster values
   1920 +  vim values-files/victoriametrics-ord2-us-staging.yaml
   1921 +  
   1922 +  # Change version (v1.99.0 → v1.100.0):
   1923 +  victoriametrics:
   1924 +    version: v1.100.0
   1925 +  
   1926 +  # 5. Commit and push
   1927 +  git add values-files/victoriametrics-ord2-us-staging.yaml
   1928 +  git commit -m "Upgrade VictoriaMetrics staging to v1.100.0"
   1929 +  git push origin victoriametrics-v1.100.0
   1930 +  
   1931 +  # 6. Create PR, get reviewed, merge
   1932 +  
   1933 +  # 7. Wait for GitHub Action to create Palantir PR
   1934 +  # Check: bits.linode.com/ops/palantir/pulls
   1935 +  
   1936 +  # 8. Review and merge Palantir PR
   1937 +  
   1938 +  # 9. Sync in ArgoCD
   1939 +  # - Visit ArgoCD staging: argocd.infra-o11y-apps.rin1.us.staging.linode.com
   1940 +  # - Find victoriametrics-ord2 application
   1941 +  # - Click "Sync"
   1942 +  # - Wait for deployment to complete
   1943 +  
   1944 +  # 10. Verify
   1945 +  kubectl --context victoriametrics-ord2-us-staging get pods -n victoriametrics -o custom-columns='NAME:.metadata.name,IMAGE:.spec.containers[*].image'
   1946 +  # All pods should show the new image version
   1947 +  
   1948 +  # Check Grafana dashboards for errors
   1949 +  
   1950 +  # 11. If successful, repeat for production clusters
   1951 +  ```
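
The verification in step 10 can be scripted against the output of `kubectl get pods -o json`. The sketch below is one possible check, not the team's tooling: `pods_not_on_version` is a hypothetical helper, the image names are made up, and it assumes the image tag starts with the target version (for example `v1.100.0` or `v1.100.0-cluster`).

```python
def pods_not_on_version(pod_list: dict, expected_tag: str) -> list[str]:
    """Return names of pods with a container image not on `expected_tag`.

    `pod_list` is the parsed output of `kubectl get pods -o json`.
    """
    outdated = []
    for pod in pod_list.get("items", []):
        images = [c["image"] for c in pod["spec"]["containers"]]
        # Assumption: the tag is the text after the last colon in the image name.
        tags = [image.rsplit(":", 1)[-1] for image in images]
        if any(not tag.startswith(expected_tag) for tag in tags):
            outdated.append(pod["metadata"]["name"])
    return outdated


sample = {
    "items": [
        {"metadata": {"name": "vmstorage-0"},
         "spec": {"containers": [{"image": "example/vmstorage:v1.100.0-cluster"}]}},
        {"metadata": {"name": "vmselect-0"},
         "spec": {"containers": [{"image": "example/vmselect:v1.99.0-cluster"}]}},
    ]
}
print(pods_not_on_version(sample, "v1.100.0"))  # ['vmselect-0']
```

Pods with sidecar containers on unrelated tags would also be reported, so treat the result as a list to inspect rather than a definitive failure.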
   1952 +  
   1953 +  ### Task 5: Create On-call Log
   1954 +  
   1955 +  **Scenario**: It's Monday, time to start your on-call shift.
   1956 +  
   1957 +  **Steps**:
   1958 +  
   1959 +  ```bash
   1960 +  # 1. Navigate to docs directory
   1961 +  cd ~/repos/terraform-observability-team/docs
   1962 +  
   1963 +  # 2. Create on-call log
   1964 +  mise run oncall
   1965 +  # This creates: content/on-call/2025/2025-11-18.md
   1966 +  
   1967 +  # 3. Edit throughout the week
   1968 +  vim content/on-call/2025/2025-11-18.md
   1969 +  
   1970 +  # Add incidents, improvements, notes
   1971 +  
   1972 +  # 4. At end of week, fill out summary
   1973 +  vim content/on-call/2025/2025-11-18.md
   1974 +  
   1975 +  ## Summary
   1976 +  Quiet week. Responded to 2 minor alerts. Improved disk cleanup automation.
   1977 +  
   1978 +  # 5. Commit and push
   1979 +  git add content/on-call/2025/2025-11-18.md
   1980 +  git commit -m "On-call log: 2025-11-18"
   1981 +  git push origin main
   1982 +  
   1983 +  # 6. Post link in Slack
   1984 +  # In #o11y-core:
   1985 +  # "On-call handoff: https://bits.linode.com/pages/ops/terraform-observability-team/on-call/2025/2025-11-18/"
   1986 +  ```
   1987 +  
   1988 +  ### Task 6: Deploy New Kubernetes Cluster
   1989 +  
   1990 +  **This is a complex multi-day task involving multiple teams.**
   1991 +  
   1992 +  **Prerequisites**:
   1993 +  - Approval from management
   1994 +  - Infrastructure planned (node count, sizes, datacenter)
   1995 +  - Cluster name decided
   1996 +  
   1997 +  **Steps** (abbreviated - see full MOP in docs):
   1998 +  
   1999 +  ```bash
   2000 +  # Day 1: Infrastructure Provisioning
   2001 +  # 1. Create Terraform PR in Linode/terraform-module-infra
   2002 +  # 2. Add cluster definition
   2003 +  # 3. Apply Terraform
   2004 +  # 4. Linodes created
   2005 +  
   2006 +  # Day 2: Server Configuration
   2007 +  # 5. Ask #sre-salt to accept minion keys
   2008 +  # 6. Set Salt grains for cluster nodes
   2009 +  # 7. Run highstate
   2010 +  
   2011 +  # Day 3: Vault & Secrets
   2012 +  # 8. Store cluster approle secret in Vault
   2013 +  # 9. Store kubeconfig in Vault
   2014 +  # 10. Update Vault PKI to allow cluster domain
   2015 +  
   2016 +  # Day 4: Kubernetes Installation
   2017 +  # 11. Run Kubespray Ansible playbook
   2018 +  # 12. Wait 1-2 hours for completion
   2019 +  # 13. Verify cluster accessible
   2020 +  
   2021 +  # Day 5: ArgoCD Integration
   2022 +  # 14. Join cluster to ArgoCD (make join-cluster)
   2023 +  # 15. Create cluster labels
   2024 +  
   2025 +  # Day 6: Application Configuration
   2026 +  # 16. Create values file in o11y-helm-charts
   2027 +  # 17. Create overlays in palantir
   2028 +  # 18. Merge PRs
   2029 +  
   2030 +  # Day 7: Deploy Applications
   2031 +  # 19. Sync applications in ArgoCD
   2032 +  # 20. Verify all apps healthy
   2033 +  # 21. Update documentation
   2034 +  ```
   2035 +  
   2036 +  **This task requires coordination with**:
   2037 +  - SRE Infrastructure (Terraform)
   2038 +  - SRE Salt (minion keys)
   2039 +  - SRE Observability (that's you!)
   2040 +  
   2041 +  ---
   2042 +  
   2043 +  ## 11. Things to Be Careful About
   2044 +  
   2045 +  ### Critical Mistakes to Avoid
   2046 +  
   2047 +  #### 1. DON'T: Edit On-call Rotation Without Updating Start Time
   2048 +  
   2049 +  **Wrong**:
   2050 +  ```hcl
   2051 +  observability_oncall_primary = [
   2052 +    "person2",  # Reordered list
   2053 +    "person1",
   2054 +    "person3",
   2055 +  ]
   2056 +  pagerduty_schedule_starttime = "2025-11-04T11:00:00-05:00"  # OLD DATE
   2057 +  ```
   2058 +  
   2059 +  **Right**:
   2060 +  ```hcl
   2061 +  observability_oncall_primary = [
   2062 +    "person1",  # Current on-call FIRST
   2063 +    "person2",
   2064 +    "person3",
   2065 +  ]
   2066 +  pagerduty_schedule_starttime = "2025-11-18T11:00:00-05:00"  # Current Monday
   2067 +  ```
   2068 +  
   2069 +  **Why**: PagerDuty rotation starts from the date and follows the list. Wrong date = wrong person on-call.
   2070 +  
   2071 +  #### 2. DON'T: Commit Secrets to Git
   2072 +  
   2073 +  **Wrong**:
   2074 +  ```bash
   2075 +  # In a file:
   2076 +  VAULT_TOKEN=hvs.1234567890abcdef
   2077 +  DATABASE_PASSWORD=supersecret123
   2078 +  ```
   2079 +  
   2080 +  **Right**:
   2081 +  ```bash
   2082 +  # Store in Vault:
   2083 +  vault kv put infra/prod/myapp/secrets password=supersecret123
   2084 +  
   2085 +  # Reference in code:
   2086 +  password = data.vault_generic_secret.myapp.data["password"]
   2087 +  ```
   2088 +  
   2089 +  **Prevention**: Pre-commit hooks help catch this, but always review your diffs!
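
When reviewing a diff by hand, a crude scan of the added lines can help spot obvious leaks. The sketch below is an illustration only: the two patterns are assumptions based on the examples above, `find_secret_lines` is a hypothetical helper, and it will produce false positives (for example, a line that merely references a Vault data source). It does not replace the pre-commit hooks.

```python
import re

# Assumed patterns: Vault tokens with the hvs. prefix, and KEY=value style
# assignments whose key mentions token, password, or secret.
SECRET_PATTERNS = [
    re.compile(r"hvs\.[A-Za-z0-9_-]{8,}"),
    re.compile(r"(?i)\b\w*(token|password|secret)\w*\s*=\s*\S+"),
]


def find_secret_lines(diff_text: str) -> list[str]:
    """Return added lines in a unified diff that look like they contain secrets."""
    hits = []
    for line in diff_text.splitlines():
        # Only inspect added lines; skip the "+++ b/file" header.
        if not line.startswith("+") or line.startswith("+++"):
            continue
        if any(pattern.search(line) for pattern in SECRET_PATTERNS):
            hits.append(line[1:].strip())
    return hits


diff = """\
+++ b/config.env
+VAULT_TOKEN=hvs.1234567890abcdef
+LOG_LEVEL=info
-DATABASE_PASSWORD=old
"""
print(find_secret_lines(diff))  # ['VAULT_TOKEN=hvs.1234567890abcdef']
```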
   2090 +  
   2091 +  #### 3. DON'T: Force Push to Main/Master
   2092 +  
   2093 +  **Wrong**:
   2094 +  ```bash
   2095 +  git push --force origin main
   2096 +  ```
   2097 +  
   2098 +  **Why**: Overwrites history, breaks everyone's local copies, loses work.
   2099 +  
   2100 +  **If you need to fix a commit**: Create a new commit or revert.
   2101 +  
   2102 +  #### 4. DON'T: Skip Atlantis Plan Review
   2103 +  
   2104 +  **Wrong**:
   2105 +  ```
   2106 +  # PR is merged
   2107 +  # Comment: atlantis apply
   2108 +  # (without reading the plan)
   2109 +  ```
   2110 +  
   2111 +  **Right**:
   2112 +  ```
   2113 +  # PR is created
   2114 +  # Atlantis posts plan
   2115 +  # READ THE ENTIRE PLAN
   2116 +  # Verify:
   2117 +  #   - Resources to add/change/destroy
   2118 +  #   - No unexpected changes
   2119 +  #   - No secrets in output
   2120 +  # Then approve, merge, apply
   2121 +  ```
   2122 +  
   2123 +  **Why**: Terraform can destroy resources. Always verify plans.
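
One way to make the "resources to add/change/destroy" check harder to skip is to parse the summary line Terraform prints at the end of a plan. This is a sketch under the assumption that the plan text is available as a string (for example, copied from the Atlantis comment); `plan_counts` and `needs_extra_review` are hypothetical helpers.

```python
import re


def plan_counts(plan_output: str) -> dict[str, int]:
    """Extract add/change/destroy counts from a Terraform plan summary line."""
    counts = dict.fromkeys(("add", "change", "destroy"), 0)
    summary = re.search(r"^Plan: .*$", plan_output, flags=re.MULTILINE)
    if summary is None:
        # No "Plan:" line is printed when there is nothing to change.
        return counts
    for number, action in re.findall(r"(\d+) to (add|change|destroy)", summary.group()):
        counts[action] = int(number)
    return counts


def needs_extra_review(plan_output: str) -> bool:
    """Flag any plan that destroys resources."""
    return plan_counts(plan_output)["destroy"] > 0


example = "Terraform will perform the following actions:\n\nPlan: 1 to add, 0 to change, 2 to destroy.\n"
print(plan_counts(example))  # {'add': 1, 'change': 0, 'destroy': 2}
```

The counts are a prompt to read the plan carefully, not a substitute for reading it.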
   2124 +  
   2125 +  #### 5. DON'T: Amend Other People's Commits
   2126 +  
   2127 +  **Wrong**:
   2128 +  ```bash
   2129 +  git commit --amend
   2130 +  git push --force
   2131 +  # (on a commit authored by someone else)
   2132 +  ```
   2133 +  
   2134 +  **Why**: Rewrites history, attributes your changes to someone else.
   2135 +  
   2136 +  **Right**: Create a new commit.
   2137 +  
   2138 +  #### 6. DON'T: Forget to Remove Alert Suppressions
   2139 +  
   2140 +  **Problem**: DC maintenance done, but suppression still active = missed alerts
   2141 +  
   2142 +  **Prevention**:
   2143 +  - Set calendar reminder
   2144 +  - Add comment in PR: "Remove suppression after YYYY-MM-DD"
   2145 +  - Create follow-up ticket
   2146 +  
   2147 +  ### Important Gotchas
   2148 +  
   2149 +  #### Gotcha 1: Staging vs Production Branches
   2150 +  
   2151 +  - **Staging**: Uses `main` branch of ops/palantir
   2152 +  - **Production**: Uses release tags (e.g., `v2.1.0`)
   2153 +  
   2154 +  **Implication**: Changes appear in staging immediately, production only after release.
   2155 +  
   2156 +  **Process**:
   2157 +  1. Merge to main → staging deployed
   2158 +  2. Test in staging
   2159 +  3. Create release tag
   2160 +  4. Production deployed
   2161 +  
   2162 +  #### Gotcha 2: VictoriaMetrics Upgrade Order
   2163 +  
   2164 +  **Wrong Order**: Global Select first, then LTS clusters
   2165 +  
   2166 +  **Right Order**: LTS clusters first, Global Select last
   2167 +  
   2168 +  **Why**: Global Select queries LTS clusters. If versions are incompatible, queries break.
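
The ordering rule can be captured in a few lines so that an upgrade plan is always generated in the safe order. The role labels `lts` and `global-select`, the cluster names, and the helper `upgrade_order` are all hypothetical.

```python
def upgrade_order(clusters: dict[str, str]) -> list[str]:
    """Order clusters so LTS clusters come first and Global Select comes last.

    `clusters` maps a cluster name to its role: "lts" or "global-select".
    """
    priority = {"lts": 0, "global-select": 1}
    # sorted() is stable, so clusters with the same role keep their input order.
    return sorted(clusters, key=lambda name: priority[clusters[name]])


clusters = {
    "global-select-example": "global-select",  # hypothetical names
    "lts-example-a": "lts",
    "lts-example-b": "lts",
}
print(upgrade_order(clusters))
# ['lts-example-a', 'lts-example-b', 'global-select-example']
```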
   2169 +  
   2170 +  #### Gotcha 3: Prometheus Sharding is Time-Consuming
   2171 +  
   2172 +  **Time Required**: 2-8 hours
   2173 +  
   2174 +  **Why**:
   2175 +  - Salt configuration changes
   2176 +  - Restarting Prometheus (data replay)
   2177 +  - Validation
   2178 +  
   2179 +  **Plan Ahead**: Don't start Friday afternoon!
   2180 +  
   2181 +  #### Gotcha 4: Cilium Restarts Required for Network Policies
   2182 +  
   2183 +  **Problem**: New Cilium network policy not enforcing
   2184 +  
   2185 +  **Solution**: Restart Cilium pods
   2186 +  ```bash
   2187 +  kubectl -n kube-system delete pod -l k8s-app=cilium
   2188 +  ```
   2189 +  
   2190 +  **Why**: Some policy changes require pod restart to take effect.
   2191 +  
   2192 +  #### Gotcha 5: Management IPs Can Fail Silently
   2193 +  
   2194 +  **Problem**: Linode provisioned but eth1 (management IP) not configured
   2195 +  
   2196 +  **Check**:
   2197 +  ```bash
   2198 +  ssh server
   2199 +  ip addr show eth1
   2200 +  # Should show IP address
   2201 +  ```
   2202 +  
   2203 +  **Fix**: Re-run Ansible or manually configure.
   2204 +  
   2205 +  ### When to Ask for Help
   2206 +  
   2207 +  **Ask immediately if**:
   2208 +  - Critical alert and you don't know how to fix
   2209 +  - About to run a destructive command
   2210 +  - Unsure about Terraform plan
   2211 +  
   2212 +  **Ask within 30 minutes if**:
   2213 +  - Warning alert and runbook doesn't help
   2214 +  - Stuck on troubleshooting
   2215 +  
   2216 +  **Ask by the next business day if**:
   2217 +  - Documentation unclear
   2218 +  - Process question
   2219 +  
   2220 +  **Where to Ask**:
   2221 +  - **#o11y-core** (team private channel) - Team questions
   2222 +  - **#sre-observability** (team public channel) - General questions
   2223 +  - **PagerDuty** (escalate alert) - Can't resolve critical alert
   2224 +  - **Tag @team-sre-o11y** - Need team response
   2225 +  
   2226 +  **It's better to ask than guess!**
   2227 +  
   2228 +  ---
   2229 +  
   2230 +  ## 12. Getting Started Checklist
   2231 +  
   2232 +  ### Week 1: Setup & Access
   2233 +  
   2234 +  - [ ] **Access to Systems**
   2235 +    - [ ] Get added to `team.auto.tfvars` by manager/team lead
   2236 +    - [ ] Verify PagerDuty account created
   2237 +    - [ ] Verify GitHub org access (linode-obs)
   2238 +    - [ ] Verify Bits team access (ops/sre-o11y)
   2239 +    - [ ] Grafana access (https://grafana.linode.com)
   2240 +    - [ ] ArgoCD access (prod, staging, dev)
   2241 +  
   2242 +  - [ ] **Local Development Setup**
   2243 +    - [ ] Install Homebrew (macOS)
   2244 +    - [ ] Install asdf or mise: `brew install asdf` or `brew install mise`
   2245 +    - [ ] Install direnv: `brew install direnv`
   2246 +    - [ ] Add to shell config: `eval "$(direnv hook zsh)"`
   2247 +    - [ ] Install linode-cli: `brew install linode-cli`
   2248 +    - [ ] Install kubectl: `brew install kubectl`
   2249 +    - [ ] Install terraform: `brew install terraform`
   2250 +    - [ ] Install pre-commit: `brew install pre-commit`
   2251 +    - [ ] Install Hugo: `brew install hugo`
   2252 +  
   2253 +  - [ ] **Repository Setup**
   2254 +    ```bash
   2255 +    mkdir -p ~/repos
   2256 +    cd ~/repos
   2257 +  
   2258 +    # Clone main repos
   2259 +    git clone bits.linode.com:ops/terraform-observability-team
   2260 +    git clone bits.linode.com:ops/o11y-helm-charts
   2261 +    git clone bits.linode.com:ops/palantir
   2262 +    git clone bits.linode.com:ops/prometheus_rules
   2263 +  
   2264 +    # Setup terraform-observability-team
   2265 +    cd terraform-observability-team
   2266 +    asdf install  # or mise install
   2267 +    direnv allow
   2268 +    pre-commit install --install-hooks
   2269 +  
   2270 +    # Test Terraform
   2271 +    vault login -method=oidc username=$USER
   2272 +    terraform plan
   2273 +    # Should succeed without errors
   2274 +    ```
   2275 +  
   2276 +  - [ ] **Slack Channels**
   2277 +    - [ ] Join #sre-observability (public)
   2278 +    - [ ] Get added to #o11y-core (private)
   2279 +    - [ ] Join #notification-o11y
   2280 +    - [ ] Join #notification-o11y-prs
   2281 +    - [ ] Join #notification-prometheus
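
After working through the local development setup, a quick check that the expected commands are on `PATH` can save debugging time later. The sketch below assumes the command names match the packages in the install list (plus `git` and `vault`, which the repository setup uses); adjust the list if you use `asdf` or `mise` shims.

```python
import shutil

# Command names assumed from the install list above.
REQUIRED_TOOLS = ["git", "direnv", "linode-cli", "kubectl", "terraform",
                  "pre-commit", "hugo", "vault"]


def missing_tools(tools: list[str]) -> list[str]:
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    print("Missing:", ", ".join(missing) if missing else "none")
```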
   2282 +  
   2283 +  ### Week 2: Learning the Codebase
   2284 +  
   2285 +  - [ ] **Read Documentation**
   2286 +    - [ ] Repository README
   2287 +    - [ ] Handbook: On-call Guide
   2288 +    - [ ] Handbook: Tools
   2289 +    - [ ] Handbook: Git Conventions
   2290 +    - [ ] Browse Services documentation
   2291 +    - [ ] Read 2-3 recent proposals
   2292 +  
   2293 +  - [ ] **Explore Repositories**
   2294 +    - [ ] Browse terraform-observability-team structure
   2295 +    - [ ] Review team.auto.tfvars (understand team structure)
   2296 +    - [ ] Look at o11y-helm-charts (understand app configuration)
   2297 +    - [ ] Explore palantir (see Kubernetes manifests)
   2298 +  
   2299 +  - [ ] **Shadow Team Members**
   2300 +    - [ ] Shadow current on-call for a week
   2301 +    - [ ] Attend team meetings
   2302 +    - [ ] Watch someone do a PR review
   2303 +    - [ ] Watch someone deploy to staging
   2304 +  
   2305 +  ### Week 3: First Tasks
   2306 +  
   2307 +  - [ ] **Make First PR**
   2308 +    - [ ] Fix a typo in documentation
   2309 +    - [ ] Add yourself to a team meeting doc
   2310 +    - [ ] Practice PR → review → merge workflow
   2311 +  
   2312 +  - [ ] **Learn Key Services**
   2313 +    - [ ] Access Grafana, explore dashboards
   2314 +    - [ ] Access ArgoCD, browse applications
   2315 +    - [ ] Query Prometheus/VictoriaMetrics from Grafana
   2316 +    - [ ] Search logs in Loki
   2317 +  
   2318 +  - [ ] **Attend Post-Mortem** (if one occurs)
   2319 +    - [ ] Observe incident response
   2320 +    - [ ] Understand RCA process
   2321 +  
   2322 +  ### Week 4: Increasing Responsibility
   2323 +  
   2324 +  - [ ] **Handle First Alert**
   2325 +    - [ ] Acknowledge warning alert
   2326 +    - [ ] Follow runbook
   2327 +    - [ ] Document resolution
   2328 +  
   2329 +  - [ ] **Review First PR**
   2330 +    - [ ] Review PR in ops/prometheus_rules or similar
   2331 +    - [ ] Provide feedback
   2332 +    - [ ] Approve or request changes
   2333 +  
   2334 +  - [ ] **Complete First Intake Request**
   2335 +    - [ ] Grant Grafana access
   2336 +    - [ ] Or handle Nagios user request
   2337 +  
   2338 +  ### Month 2: On-call Training
   2339 +  
   2340 +  - [ ] **On-call Preparation**
   2341 +    - [ ] Read all alert runbooks
   2342 +    - [ ] Practice silencing alerts with amtool
   2343 +    - [ ] Review escalation procedures
   2344 +    - [ ] Shadow on-call for a second week
   2345 +  
   2346 +  - [ ] **Add to On-call Rotation**
   2347 +    - [ ] Team lead adds you to rotation
   2348 +    - [ ] Receive first on-call shift assignment
   2349 +  
   2350 +  - [ ] **First On-call Shift**
   2351 +    - [ ] Create on-call log
   2352 +    - [ ] Handle alerts (with backup support)
   2353 +    - [ ] Complete handoff
   2354 +  
   2355 +  ### Month 3: Full Team Member
   2356 +  
   2357 +  - [ ] **Lead First Project**
   2358 +    - [ ] Small improvement or automation
   2359 +    - [ ] Write proposal if needed
   2360 +    - [ ] Implement and deploy
   2361 +  
   2362 +  - [ ] **Contribute to Documentation**
   2363 +    - [ ] Update outdated docs
   2364 +    - [ ] Add new runbook
   2365 +    - [ ] Write MOP for procedure you learned
   2366 +  
   2367 +  - [ ] **Mentor Next New Hire**
   2368 +    - [ ] Share this guide
   2369 +    - [ ] Answer questions
   2370 +    - [ ] Pair on tasks
   2371 +  
   2372 +  ---
   2373 +  
   2374 +  ## 13. Glossary
   2375 +  
   2376 +  ### Terms & Acronyms
   2377 +  
   2378 +  **ArgoCD**: GitOps continuous deployment tool for Kubernetes
   2379 +  
   2380 +  **Atlantis**: Terraform automation tool that runs plans/applies on PRs
   2381 +  
   2382 +  **Bits**: Akamai's internal GitHub instance (bits.linode.com)
   2383 +  
   2384 +  **CCM**: Cloud Controller Manager - Kubernetes component managing cloud resources (NodeBalancers, firewalls)
   2385 +  
   2386 +  **Cilium**: Container Network Interface (CNI) plugin providing Kubernetes networking, network policies, and encryption
   2387 +  
   2388 +  **DC**: Datacenter (e.g., ewr1 = Newark, iad3 = Ashburn)
   2389 +  
   2390 +  **direnv**: Tool to load environment variables when entering a directory
   2391 +  
   2392 +  **GitOps**: Deployment methodology using Git as source of truth
   2393 +  
   2394 +  **Grafana**: Visualization platform for metrics and logs
   2395 +  
   2396 +  **Hugo**: Static site generator used for team documentation
   2397 +  
   2398 +  **Kustomize**: Tool for customizing Kubernetes YAML files
   2399 +  
   2400 +  **Linode**: Akamai's cloud computing platform (VMs, Kubernetes, networking)
   2401 +  
   2402 +  **Loki**: Log aggregation system (like Prometheus but for logs)
   2403 +  
   2404 +  **LTS**: Long-Term Storage (VictoriaMetrics clusters storing metrics for 13 months)
   2405 +  
   2406 +  **Mise**: Task runner and tool version manager (like Make + asdf)
   2407 +  
   2408 +  **MOP**: Method of Procedure - step-by-step guide for complex tasks
   2409 +  
   2410 +  **mTLS**: Mutual TLS - two-way certificate authentication
   2411 +  
   2412 +  **Nagios**: Legacy monitoring system (being replaced)
   2413 +  
   2414 +  **NIL**: Network Internet Listener - LoadBalancer service exposed externally
   2415 +  
   2416 +  **On-call**: Engineer responsible for responding to alerts during their shift
   2417 +  
   2418 +  **OpenTelemetry (OTel)**: Observability framework for metrics, logs, traces
   2419 +  
   2420 +  **otelgw**: OpenTelemetry Gateway - aggregates telemetry per datacenter
   2421 +  
   2422 +  **Palantir**: Repository containing Kubernetes manifests for ArgoCD
   2423 +  
   2424 +  **PagerDuty**: Incident management and on-call scheduling platform
   2425 +  
   2426 +  **Preflight**: Checks run before an operation (e.g., Cilium upgrade)
   2427 +  
   2428 +  **Prometheus**: Monitoring system and time-series database that scrapes and stores metrics
   2429 +  
   2430 +  **PromQL**: Prometheus Query Language
   2431 +  
   2432 +  **Runbook**: Documentation for responding to specific alerts
   2433 +  
   2434 +  **Salt**: Configuration management tool (like Ansible/Puppet)
   2435 +  
   2436 +  **Shard**: One instance in a group of divided workload (e.g., Prometheus shards)
   2437 +  
   2438 +  **SRE**: Site Reliability Engineering
   2439 +  
   2440 +  **Terraform**: Infrastructure-as-Code tool
   2441 +  
   2442 +  **Thanos**: Prometheus long-term storage and query federation
   2443 +  
   2444 +  **Vault**: Secret management platform
   2445 +  
   2446 +  **VictoriaMetrics**: Time-series database optimized for Prometheus metrics
   2447 +  
   2448 +  **vminsert**: VictoriaMetrics component for data ingestion
   2449 +  
   2450 +  **vmselect**: VictoriaMetrics component for queries
   2451 +  
   2452 +  **vmstorage**: VictoriaMetrics component for data storage
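
To make the **Shard** entry concrete, here is a minimal Python sketch of hash-based sharding, the general idea behind splitting scrape targets across several Prometheus instances. The target names, the shard count, and the `shard_for` and `assign` helpers are hypothetical and are not taken from this repository. The hashing is loosely modeled on Prometheus's `hashmod` relabel action, but it is not guaranteed to produce the same shard numbers as Prometheus does.

```python
import hashlib
from collections import defaultdict


def shard_for(target: str, num_shards: int) -> int:
    """Map a target name to a shard index in [0, num_shards)."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    digest = hashlib.md5(target.encode("utf-8")).digest()
    # Interpret the last 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[-8:], "big") % num_shards


def assign(targets: list[str], num_shards: int) -> dict[int, list[str]]:
    """Group targets by the shard responsible for them."""
    shards: dict[int, list[str]] = defaultdict(list)
    for target in targets:
        shards[shard_for(target, num_shards)].append(target)
    return dict(shards)


if __name__ == "__main__":
    # Hypothetical scrape targets, for illustration only.
    targets = [f"node-{i}.example.internal:9100" for i in range(10)]
    for shard, members in sorted(assign(targets, 3).items()):
        print(shard, members)
```

Each target lands on exactly one shard, and the mapping stays stable as long as the shard count does not change. Changing the shard count reassigns most targets, which is one reason resharding is usually planned rather than done casually.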
   2453 +  
   2454 +  ---
   2455 +  
   2456 +  ## Additional Resources
   2457 +  
   2458 +  ### Official Documentation
   2459 +  
   2460 +  - **Team Docs**: https://bits.linode.com/pages/ops/terraform-observability-team/
   2461 +  - **Confluence**: Search "SRE Observability" for cross-team docs
   2462 +  - **Jira**: https://track.akamai.com/jira/projects/OY
   2463 +  
   2464 +  ### External Tools Documentation
   2465 +  
   2466 +  - **Prometheus**: https://prometheus.io/docs/
   2467 +  - **VictoriaMetrics**: https://docs.victoriametrics.com/
   2468 +  - **Grafana**: https://grafana.com/docs/
   2469 +  - **ArgoCD**: https://argo-cd.readthedocs.io/
   2470 +  - **Kubernetes**: https://kubernetes.io/docs/
   2471 +  - **Terraform**: https://developer.hashicorp.com/terraform/docs
   2472 +  - **Loki**: https://grafana.com/docs/loki/
   2473 +  
   2474 +  ### Internal Systems
   2475 +  
   2476 +  - **Grafana Production**: https://grafana.linode.com
   2477 +  - **ArgoCD Production**: https://argocd.infra-o11y-apps.iad3.us.prod.linode.com
   2478 +  - **ArgoCD Staging**: https://argocd.infra-o11y-apps.rin1.us.staging.linode.com
   2479 +  - **PagerDuty Schedule**: https://akamai.pagerduty.com/schedules/PSFD91L
   2480 +  - **Vault**: https://vault.infra.linode.com
   2481 +  
   2482 +  ### Slack Channels
   2483 +  
   2484 +  - **#sre-observability**: Team public channel
   2485 +  - **#o11y-core**: Team private channel
   2486 +  - **#notification-o11y**: Alert notifications
   2487 +  - **#notification-o11y-prs**: PR notifications
   2488 +  - **#notification-prometheus**: Prometheus-specific alerts
   2489 +  - **#sre-salt**: Salt team (for minion keys, etc.)
   2490 +  - **#sre-infrastructure**: Infrastructure team
   2491 +  
   2492 +  ### People to Know
   2493 +  
   2494 +  Check `team.auto.tfvars` for current team roster.
   2495 +  
   2496 +  **Team Lead/Manager**: Ask in #o11y-core
   2497 +  
   2498 +  **On-call**: `@firechief-sre-o11y` in Slack
   2499 +  
   2500 +  ---
   2501 +  
   2502 +  ## Final Thoughts
   2503 +  
   2504 +  Welcome to the SRE Observability team! This guide covers a lot, and **you're not expected to memorize everything immediately**. Use it as a reference as you work through your first few weeks.
   2505 +  
   2506 +  **Key Takeaways**:
   2507 +  
   2508 +  1. **terraform-observability-team** manages team permissions and documentation
   2509 +  2. **VictoriaMetrics, Prometheus, Grafana, ArgoCD** are our core services
   2510 +  3. **GitOps workflow**: Code → PR → Review → Merge → Deploy
   2511 +  4. **On-call is a rotation**, and you'll be trained before your first shift
   2512 +  5. **Ask questions** - the team is here to help!
   2513 +  
   2514 +  **Learning Path**:
   2515 +  - Week 1: Setup and access
   2516 +  - Week 2: Read and explore
   2517 +  - Week 3: First small tasks
   2518 +  - Week 4: First alerts and reviews
   2519 +  - Month 2: On-call training
   2520 +  - Month 3: Full contributor
   2521 +  
   2522 +  **Remember**: Everyone on the team started where you are now. It takes time to learn all the systems, and that's okay. Focus on understanding the fundamentals first, and the details will come with experience.
   2523 +  
   2524 +  **Questions?** Ask in #o11y-core!
   2525 +  
   2526 +  **Good luck, and welcome to the team!**
   2527 +  
   2528 +  ---
   2529 +  
   2530 +  **Document Version**: 1.0
   2531 +  **Created**: 2025-11-20
   2532 +  **Last Updated**: 2025-11-20
   2533 +  **Maintained By**: SRE Observability Team
