GitHub - sgour-akamai/my-work

      1 +  # Complete Onboarding Guide: terraform-observability-team Repository
      2 +  
      3 +  ## Welcome to the SRE Observability Team!
      4 +  
      5 +  This guide will teach you everything you need to know about the `terraform-observability-team` 
        + repository as a new team member. No prior knowledge assumed - we'll start from the basics and go 
        + deep.
      6 +  
      7 +  **Team Motto**: *Mors melius quam Nagios* ("Death is better than Nagios")
      8 +  
      9 +  ---
     10 +  
     11 +  ## Table of Contents
     12 +  
     13 +  1. [What is This Repository?](#1-what-is-this-repository)
     14 +  2. [What Does the Observability Team Do?](#2-what-does-the-observability-team-do)
     15 +  3. [The Big Picture: Observability Architecture](#3-the-big-picture-observability-architecture)
     16 +  4. [Repository Structure Explained](#4-repository-structure-explained)
     17 +  5. [Key Services Deep Dive](#5-key-services-deep-dive)
     18 +  6. [How Deployments Work](#6-how-deployments-work)
     19 +  7. [Important Files You Need to Know](#7-important-files-you-need-to-know)
     20 +  8. [Documentation Structure](#8-documentation-structure)
     21 +  9. [Day-to-Day Operations](#9-day-to-day-operations)
     22 +  10. [Common Tasks with Examples](#10-common-tasks-with-examples)
     23 +  11. [Things to Be Careful About](#11-things-to-be-careful-about)
     24 +  12. [Getting Started Checklist](#12-getting-started-checklist)
     25 +  13. [Glossary](#13-glossary)
     26 +  
     27 +  ---
     28 +  
     29 +  ## 1. What is This Repository?
     30 +  
     31 +  The `terraform-observability-team` repository serves **two main purposes**:
     32 +  
     33 +  ### Purpose 1: Team Permissions Management
     34 +  
     35 +  This repository manages who has access to what across multiple platforms using 
        + **Infrastructure-as-Code (Terraform)**:
     36 +  
     37 +  - **GitHub** (public `linode-obs` organization)
     38 +  - **Bits** (internal Akamai/Linode GitHub - `sre-o11y` teams across ops, Linode, devcloud orgs)
     39 +  - **PagerDuty** (team memberships, on-call schedules, escalation policies)
     40 +  - **Vault** (secret storage for team applications)
     41 +  
     42 +  **Why Terraform?** Because managing permissions manually across 4+ platforms for 10+ people is 
        + error-prone and time-consuming. Terraform allows us to:
     43 +  - Define team membership once
     44 +  - Apply changes consistently everywhere
     45 +  - Track permission changes in git history
     46 +  - Easily onboard/offboard team members
     47 +  
     48 +  ### Purpose 2: Team Documentation Hub
     49 +  
     50 +  This repository contains all team documentation using **Hugo** (a static site generator):
     51 +  
     52 +  - **Handbooks**: How we work (on-call, tools, git conventions)
     53 +  - **Service Documentation**: VictoriaMetrics, Prometheus, Grafana, ArgoCD, etc.
     54 +  - **MOPs** (Manual Operations Procedures): Step-by-step guides for complex tasks
     55 +  - **Proposals**: Design documents for major changes (OP-## format)
     56 +  - **On-call Logs**: Weekly logs of incidents and work completed
     57 +  - **Runbooks**: How to respond to specific alerts
     58 +  
     59 +  The documentation is published to **Bits Pages** (internal documentation hosting) at:
     60 +  `https://bits.linode.com/pages/ops/terraform-observability-team/`
     61 +  
     62 +  ---
     63 +  
     64 +  ## 2. What Does the Observability Team Do?
     65 +  
     66 +  The **SRE Observability team** owns and operates the entire observability infrastructure for 
        + Linode/Akamai. Here's what that means:
     67 +  
     68 +  ### Metrics Collection & Storage
     69 +  
     70 +  **What**: Collecting and storing performance metrics from all infrastructure.
     71 +  
     72 +  **Services We Manage**:
     73 +  - **100+ Prometheus instances** globally (one or more per datacenter)
     74 +  - **VictoriaMetrics clusters** for long-term storage (LTS) of metrics
     75 +  - **Thanos** for query federation across Prometheus instances
     76 +  
     77 +  **Example Metrics**:
     78 +  - CPU/memory/disk usage on servers
     79 +  - Network traffic and errors
     80 +  - Application response times and error rates
     81 +  - Database query performance
     82 +  - Kubernetes pod health
     83 +  
     84 +  ### Logging Infrastructure
     85 +  
     86 +  **What**: Collecting, storing, and querying logs from all systems.
     87 +  
     88 +  **Services We Manage**:
     89 +  - **Loki** clusters for centralized log storage
     90 +  - **OpenTelemetry Collectors** (otelgw) for log aggregation
     91 +  - Integration with Grafana for log querying
     92 +  
     93 +  ### Visualization
     94 +  
     95 +  **What**: Providing dashboards and graphs for engineers to understand system behavior.
     96 +  
     97 +  **Services We Manage**:
     98 +  - **Grafana** instances
     99 +  - **Dashboard management** (via Terraform)
    100 +  - **User permissions** for Grafana
    101 +  
    102 +  ### Alerting
    103 +  
    104 +  **What**: Notifying engineers when things go wrong.
    105 +  
    106 +  **Services We Manage**:
    107 +  - **Alertmanager** (part of Prometheus)
    108 +  - **PagerDuty** integrations
    109 +  - **Alert rules** (prometheus_rules repository)
    110 +  - **Alert routing** (who gets paged for what)
    111 +  
    112 +  ### Deployment & Orchestration
    113 +  
    114 +  **What**: Managing how observability services are deployed and updated.
    115 +  
    116 +  **Services We Manage**:
    117 +  - **ArgoCD** for GitOps deployment to Kubernetes
    118 +  - **Kubernetes clusters** running observability workloads
    119 +  - **Helm charts** for application configuration
    120 +  
    121 +  ### Legacy Systems (Being Phased Out)
    122 +  
    123 +  - **Nagios** - Old monitoring system (being replaced by Prometheus)
    124 +  
    125 +  ### Network Observability
    126 +  
    127 +  **What**: Special monitoring for network devices and traffic.
    128 +  
    129 +  **Services We Manage**:
    130 +  - **Linmon** - Network monitoring platform
    131 +  - Specialized VictoriaMetrics clusters for network metrics
    132 +  
    133 +  ---
    134 +  
    135 +  ## 3. The Big Picture: Observability Architecture
    136 +  
    137 +  Let me explain how all these pieces fit together with a real-world example:
    138 +  
    139 +  ### Example: Monitoring a Linode Customer's VM
    140 +  
    141 +  ```
    142 +  1. DATA COLLECTION
    143 +     ┌────────────────────────────────────────────────────────────┐
    144 +     │  Customer's Linode VM (in Newark datacenter - ewr1)        │
    145 +     │  ├─ node_exporter (exposes system metrics)                 │
    146 +     │  └─ Metrics: CPU, memory, disk, network                    │
    147 +     └────────────────────────┬───────────────────────────────────┘
    148 +                              │ (HTTP scrape every 30s)
    149 +                              ▼
    150 +     ┌────────────────────────────────────────────────────────────┐
    151 +     │  Prometheus Shard (prometheus-1a in ewr1)                  │
    152 +     │  ├─ Scrapes 1000s of targets in this datacenter           │
    153 +     │  ├─ Stores metrics locally (15 days retention)            │
    154 +     │  └─ Evaluates alert rules                                 │
    155 +     └────────────────────────┬───────────────────────────────────┘
    156 +                              │ (remote_write)
    157 +                              ▼
    158 +  2. LONG-TERM STORAGE
    159 +     ┌────────────────────────────────────────────────────────────┐
    160 +     │  VictoriaMetrics LTS (North America - ord2/lax3)           │
    161 +     │  ├─ Receives metrics from all NA datacenters              │
    162 +     │  ├─ Compresses and stores metrics (13 months retention)   │
    163 +     │  └─ Fast queries for historical data                      │
    164 +     └────────────────────────┬───────────────────────────────────┘
    165 +                              │
    166 +                              ▼
    167 +  3. QUERYING & FEDERATION
    168 +     ┌────────────────────────────────────────────────────────────┐
    169 +     │  VictoriaMetrics Global Select                             │
    170 +     │  ├─ Federates queries across all regions (NA, EU, AP)     │
    171 +     │  └─ Single query point for worldwide data                 │
    172 +     └────────────────────────┬───────────────────────────────────┘
    173 +                              │ (queries)
    174 +                              ▼
    175 +  4. VISUALIZATION
    176 +     ┌────────────────────────────────────────────────────────────┐
    177 +     │  Grafana                                                    │
    178 +     │  ├─ Dashboards showing VM performance                      │
    179 +     │  └─ Engineers use this to troubleshoot issues             │
    180 +     └────────────────────────────────────────────────────────────┘
    181 +  
    182 +  5. ALERTING (if something goes wrong)
    183 +     ┌────────────────────────────────────────────────────────────┐
    184 +     │  Prometheus Alert Rules                                    │
    185 +     │  └─ "High CPU usage on VM for 5 minutes"                  │
    186 +     └────────────────────────┬───────────────────────────────────┘
    187 +                              │ (fires alert)
    188 +                              ▼
    189 +     ┌────────────────────────────────────────────────────────────┐
    190 +     │  Alertmanager                                               │
    191 +     │  └─ Routes alert based on severity and team               │
    192 +     └────────────────────────┬───────────────────────────────────┘
    193 +                              │
    194 +                              ▼
    195 +     ┌────────────────────────────────────────────────────────────┐
    196 +     │  PagerDuty                                                  │
    197 +     │  └─ Pages on-call engineer                                │
    198 +     └────────────────────────────────────────────────────────────┘
    199 +  ```
    200 +  
    201 +  ### Regional Architecture
    202 +  
    203 +  We split the world into **three regions** for scalability:
    204 +  
    205 +  ```
    206 +  ┌─────────────────────────────────────────────────────────────────┐
    207 +  │                   GLOBAL ARCHITECTURE                            │
    208 +  │                                                                  │
    209 +  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐          │
    210 +  │  │ North America│  │    Europe    │  │ Asia Pacific │          │
    211 +  │  ├──────────────┤  ├──────────────┤  ├──────────────┤          │
    212 +  │  │ VictoriaMetrics  │ VictoriaMetrics  │ VictoriaMetrics          │
    213 +  │  │ LTS Clusters │  │ LTS Clusters │  │ LTS Clusters │          │
    214 +  │  │              │  │              │  │              │          │
    215 +  │  │ ord2 (primary)│  │ mad2 (primary)│  │ osa1 (primary)│          │
    216 +  │  │ lax3 (backup)│  │ sto2 (backup)│  │ cgk1 (backup)│          │
    217 +  │  │              │  │              │  │ sea1 (backup)│          │
    218 +  │  └──────────────┘  └──────────────┘  └──────────────┘          │
    219 +  │         │                 │                  │                  │
    220 +  │         └─────────────────┴──────────────────┘                  │
    221 +  │                           │                                     │
    222 +  │                           ▼                                     │
    223 +  │              ┌────────────────────────┐                         │
    224 +  │              │ VictoriaMetrics        │                         │
    225 +  │              │ Global Select          │                         │
    226 +  │              │ (Query Federation)     │                         │
    227 +  │              └────────────────────────┘                         │
    228 +  └─────────────────────────────────────────────────────────────────┘
    229 +  ```
    230 +  
    231 +  **Why Regional?**
    232 +  - **Reduced latency**: Data stays close to where it's generated
    233 +  - **Compliance**: Some regions require data to stay in-country
    234 +  - **Scalability**: Spreading load across multiple clusters
    235 +  - **Resilience**: Regional failure doesn't impact other regions
    236 +  
    237 +  ---
    238 +  
    239 +  ## 4. Repository Structure Explained
    240 +  
    241 +  Let's walk through the repository structure directory by directory:
    242 +  
    243 +  ```
    244 +  /Users/sgour/terraform-observability-team/
    245 +  ├── .github/                    # GitHub-specific files
    246 +  │   ├── workflows/             # GitHub Actions CI/CD pipelines
    247 +  │   └── CODEOWNERS             # Who reviews PRs for which files
    248 +  ├── .vale/                      # Vale (prose linting) configuration
    249 +  │   └── styles/                # Custom style rules for docs
    250 +  ├── docs/                       # Hugo documentation site (explained later)
    251 +  ├── hack/                       # Utility scripts
    252 +  ├── img/                        # Images for README
    253 +  ├── modules/                    # Terraform modules (explained below)
    254 +  │   ├── pagerduty/             # PagerDuty configuration
    255 +  │   ├── github/                # GitHub (public) configuration
    256 +  │   └── bits/                  # Bits (internal GitHub) configuration
    257 +  ├── scripts/                    # Helper scripts
    258 +  ├── main.tf                     # Main Terraform configuration
    259 +  ├── provider.tf                 # Terraform provider setup
    260 +  ├── vars.tf                     # Variable definitions
    261 +  ├── team.auto.tfvars           # IMPORTANT: Team member definitions
    262 +  ├── pagerduty_suppression.auto.tfvars  # Alert suppression config
    263 +  ├── atlantis.yaml              # Atlantis (Terraform automation) config
    264 +  ├── .envrc                     # Environment variables (Vault auth)
    265 +  ├── .pre-commit-config.yaml    # Pre-commit hooks for code quality
    266 +  ├── .mise.toml                 # Mise task runner configuration
    267 +  └── README.md                  # Repository overview
    268 +  ```
    269 +  
    270 +  ### The `/modules/` Directory (Terraform Modules)
    271 +  
    272 +  Think of Terraform modules as reusable "functions" for infrastructure. Each module manages a 
        + specific platform:
    273 +  
    274 +  #### `/modules/pagerduty/`
    275 +  
    276 +  Manages all PagerDuty configuration:
    277 +  
    278 +  **Files**:
    279 +  - `team.tf` - Team membership (who's on the team)
    280 +  - `escalation_policy.tf` - Who gets alerted when (primary → secondary → round-robin)
    281 +  - `services.tf` - PagerDuty services (e.g., "Prometheus Alerts")
    282 +  - `schedules.tf` - On-call schedules (primary, secondary)
    283 +  - `orchestration.tf` - Event routing and processing
    284 +  - `slack_connections.tf` - Slack channel integrations
    285 +  - `rules.tf` - Alert suppression rules (for maintenance)
    286 +  
    287 +  **What It Does**:
    288 +  - Creates PagerDuty team
    289 +  - Sets up on-call rotation schedule
    290 +  - Creates escalation policies
    291 +  - Configures alert routing
    292 +  - Manages alert suppression during maintenance
    293 +  
    294 +  **Example**: When you're added to the team, Terraform creates your PagerDuty user and adds you to 
        + schedules.
    295 +  
    296 +  #### `/modules/github/`
    297 +  
    298 +  Manages the **public** `linode-obs` GitHub organization:
    299 +  
    300 +  **What It Does**:
    301 +  - Creates repositories
    302 +  - Manages team memberships
    303 +  - Sets repository permissions
    304 +  - Configures branch protection
    305 +  
    306 +  **Example Repos Managed**:
    307 +  - `linode-obs/victoriametrics-operator`
    308 +  - `linode-obs/prometheus-salt-formula`
    309 +  
    310 +  #### `/modules/bits/`
    311 +  
    312 +  Manages the **internal** Bits (Akamai GitHub) teams across multiple organizations:
    313 +  
    314 +  **Organizations Managed**:
    315 +  - `ops` - Main SRE organization
    316 +  - `Linode` - Linode-specific repos
    317 +  - `devcloud` - Devcloud infrastructure
    318 +  - `ansible-collections` - Ansible collections
    319 +  
    320 +  **What It Does**:
    321 +  - Creates `sre-o11y` team in each org
    322 +  - Manages team memberships (member vs maintainer roles)
    323 +  - Sets repository permissions
    324 +  - Configures webhooks (Atlantis, Slack notifications)
    325 +  - Sets up branch protection
    326 +  
    327 +  **Important Repos Managed**:
    328 +  - `ops/prometheus_rules` - Alert and recording rules
    329 +  - `ops/prometheus-formula` - Prometheus Salt configuration
    330 +  - `ops/o11y-helm-charts` - Helm charts for Kubernetes apps
    331 +  - `ops/palantir` - Kubernetes manifests
    332 +  - `ops/loki_rules` - Loki alert rules
    333 +  - `ops/terraform-grafana-config` - Grafana configuration
    334 +  
    335 +  ### The `/docs/` Directory (Documentation Site)
    336 +  
    337 +  This is a Hugo-based documentation site. Let's break it down:
    338 +  
    339 +  ```
    340 +  docs/
    341 +  ├── archetypes/                # Templates for new content
    342 +  │   ├── default.md
    343 +  │   ├── on-call.md            # Template for on-call logs
    344 +  │   └── proposals.md          # Template for proposals
    345 +  ├── content/                   # ACTUAL DOCUMENTATION (main content)
    346 +  │   ├── _index.md             # Homepage
    347 +  │   ├── Handbooks/            # How the team operates
    348 +  │   │   ├── on-call.md        # On-call guide
    349 +  │   │   ├── tools.md          # Tooling standards
    350 +  │   │   ├── git-conventions.md
    351 +  │   │   └── docs/             # How to write docs
    352 +  │   ├── Services/             # Service documentation
    353 +  │   │   ├── ArgoCD/
    354 +  │   │   ├── VictoriaMetrics/
    355 +  │   │   ├── Prometheus/
    356 +  │   │   ├── Grafana/
    357 +  │   │   ├── Kubernetes Clusters/
    358 +  │   │   ├── Centralized Logging/
    359 +  │   │   ├── Network Observability/
    360 +  │   │   ├── Nagios/
    361 +  │   │   └── ...
    362 +  │   ├── mops/                 # Manual Operations Procedures
    363 +  │   │   ├── prometheus-shard.md
    364 +  │   │   ├── victoriametrics-cluster-upgrade.md
    365 +  │   │   └── ...
    366 +  │   ├── on-call/              # On-call logs by year
    367 +  │   │   ├── 2024/
    368 +  │   │   └── 2025/
    369 +  │   ├── projects/             # Active projects
    370 +  │   ├── proposals/            # Design proposals (OP-##)
    371 +  │   │   ├── OP-01-prometheus-sharding.md
    372 +  │   │   ├── OP-03-o11y-helm-charts.md
    373 +  │   │   └── ...
    374 +  │   └── Runbooks/             # Alert response guides
    375 +  │       ├── HighMemoryUsage.md
    376 +  │       └── ...
    377 +  ├── layouts/                  # Custom Hugo templates
    378 +  ├── static/                   # Static files (images, CSS)
    379 +  ├── config.toml               # Hugo configuration
    380 +  └── go.mod                    # Hugo module dependencies
    381 +  ```
    382 +  
    383 +  **How It Works**:
    384 +  1. You write documentation in Markdown in `/docs/content/`
    385 +  2. Hugo builds it into HTML
    386 +  3. GitHub Actions deploys it to Bits Pages
    387 +  4. Team members access it at `https://bits.linode.com/pages/ops/terraform-observability-team/`
    388 +  
    389 +  ---
    390 +  
    391 +  ## 5. Key Services Deep Dive
    392 +  
    393 +  Now let's understand the major services the team manages:
    394 +  
    395 +  ### VictoriaMetrics (Long-Term Metrics Storage)
    396 +  
    397 +  **What is it?**
    398 +  VictoriaMetrics is a time-series database optimized for storing Prometheus metrics long-term. Think
        +  of it as "Prometheus but faster and with more storage capacity."
    399 +  
    400 +  **Why do we use it?**
    401 +  - **Compression**: Stores data 10x more efficiently than Prometheus
    402 +  - **Fast queries**: Queries historical data much faster
    403 +  - **Long retention**: We keep metrics for 13 months vs Prometheus's 15 days
    404 +  - **Compatible**: Works with Prometheus query language (PromQL)
    405 +  
    406 +  **Architecture**:
    407 +  
    408 +  VictoriaMetrics runs as a **cluster** with three components:
    409 +  
    410 +  ```
    411 +  ┌────────────────────────────────────────────────────────────────┐
    412 +  │              VictoriaMetrics Cluster Architecture               │
    413 +  │                                                                 │
    414 +  │  ┌─────────────┐                                               │
    415 +  │  │  vminsert   │ ← Receives metrics from Prometheus            │
    416 +  │  │  (3 replicas)│    (via remote_write)                        │
    417 +  │  └──────┬──────┘                                               │
    418 +  │         │ Distributes data across storage nodes                │
    419 +  │         ▼                                                       │
    420 +  │  ┌─────────────┐                                               │
    421 +  │  │  vmstorage  │ ← Stores compressed metrics on disk           │
    422 +  │  │  (3+ replicas)│   (each replica has full dataset)           │
    423 +  │  └──────┬──────┘                                               │
    424 +  │         │ Serves data to query nodes                           │
    425 +  │         ▼                                                       │
    426 +  │  ┌─────────────┐                                               │
    427 +  │  │  vmselect   │ ← Handles queries from Grafana                │
    428 +  │  │  (2 replicas)│   (PromQL compatible)                        │
    429 +  │  └─────────────┘                                               │
    430 +  └────────────────────────────────────────────────────────────────┘
    431 +  ```
    432 +  
    433 +  **Components Explained**:
    434 +  
    435 +  1. **vminsert** - Ingestion
    436 +     - Receives metrics from Prometheus instances
    437 +     - Validates and processes data
    438 +     - Distributes data across vmstorage nodes
    439 +     - **Replicas**: 3 (for redundancy)
    440 +  
    441 +  2. **vmstorage** - Storage
    442 +     - Stores compressed metrics on disk
    443 +     - Each replica has the full dataset
    444 +     - Handles compaction and retention
    445 +     - **Replicas**: 3-6 depending on cluster size
    446 +  
    447 +  3. **vmselect** - Queries
    448 +     - Receives queries from Grafana
    449 +     - Fetches data from vmstorage nodes
    450 +     - Aggregates results
    451 +     - **Replicas**: 2 (for load balancing)
    452 +  
    453 +  **Cluster Types**:
    454 +  
    455 +  We have two types of VictoriaMetrics clusters:
    456 +  
    457 +  1. **LTS (Long-Term Storage) Clusters** - Regional
    458 +     - One per continent (NA, EU, AP)
    459 +     - Stores metrics from Prometheus instances in that region
    460 +     - **Retention**: 13 months
    461 +     - **Examples**:
    462 +       - `victoriametrics-ord2-us-prod` (North America primary)
    463 +       - `victoriametrics-lax3-us-prod` (North America backup)
    464 +       - `victoriametrics-mad2-es-prod` (Europe primary)
    465 +  
    466 +  2. **Global Select Cluster** - Worldwide
    467 +     - Federates queries across all LTS clusters
    468 +     - Allows querying global data from one place
    469 +     - **Does NOT store data** - just proxies queries
    470 +     - **Example**: `victoriametrics-global-select-iad3-us-prod`
    471 +  
    472 +  **Data Flow Example**:
    473 +  
    474 +  ```
    475 +  Prometheus in Newark (ewr1)
    476 +      │ remote_write
    477 +      ▼
    478 +  VictoriaMetrics LTS (ord2) - North America
    479 +      │
    480 +      │ Query from Grafana
    481 +      ▼
    482 +  VictoriaMetrics Global Select
    483 +      │ Queries all regions
    484 +      ├─── ord2 (North America)
    485 +      ├─── mad2 (Europe)
    486 +      └─── osa1 (Asia Pacific)
    487 +      │
    488 +      ▼ Combined results
    489 +  Grafana Dashboard
    490 +  ```
    491 +  
    492 +  **How It's Deployed**:
    493 +  - Runs in **Kubernetes**
    494 +  - Managed by **ArgoCD** (GitOps)
    495 +  - Configuration in `ops/o11y-helm-charts`
    496 +  - Manifests in `ops/palantir`
    497 +  
    498 +  **Important Files**:
    499 +  - Configuration: `o11y-helm-charts/values-files/victoriametrics-{dc}-{country}-{env}.yaml`
    500 +  - Documentation: `terraform-observability-team/docs/content/Services/VictoriaMetrics/`
    501 +  
    502 +  ### Prometheus (Real-Time Metrics Collection)
    503 +  
    504 +  **What is it?**
    505 +  Prometheus is a time-series database that **scrapes** (pulls) metrics from servers and 
        + applications.
    506 +  
    507 +  **Why do we use it?**
    508 +  - **Industry standard** for metrics collection
    509 +  - **Pull-based model** - Prometheus scrapes targets, they don't push to it
    510 +  - **Service discovery** - Automatically finds what to monitor
    511 +  - **Alert evaluation** - Runs alert rules in real-time
    512 +  
    513 +  **Deployment Model - Sharding**:
    514 +  
    515 +  We run **multiple Prometheus instances per datacenter** to handle scale. This is called 
        + **sharding**.
    516 +  
    517 +  ```
    518 +  Datacenter: Newark (ewr1)
    519 +     │
    520 +     ├─ prometheus-1a  (shard 0) ─ Monitors targets with hash % 3 == 0
    521 +     ├─ prometheus-1b  (shard 1) ─ Monitors targets with hash % 3 == 1
    522 +     └─ prometheus-1c  (shard 2) ─ Monitors targets with hash % 3 == 2
    523 +  ```
    524 +  
    525 +  **How Sharding Works**:
    526 +  1. Each Prometheus instance is assigned a **shard number**
    527 +  2. Targets (servers, apps) are distributed across shards using a hash function
    528 +  3. Each shard monitors ~1/3 of the targets
    529 +  
    530 +  **Shard Naming Convention**:
    531 +  - `prometheus-1a` → Shard 0
    532 +  - `prometheus-1b` → Shard 1
    533 +  - `prometheus-1c` → Shard 2
    534 +  - Pattern: `a=0, b=1, c=2, d=3, ...`
    535 +  
    536 +  **High Availability**:
    537 +  - Each shard runs **2 replicas**
    538 +  - If one replica fails, the other keeps working
    539 +  - Both replicas scrape the same targets (duplicate data, but ensures availability)
    540 +  
    541 +  **Configuration**:
    542 +  - Managed by **Salt formula**: `ops/prometheus-formula`
    543 +  - Alert rules in: `ops/prometheus_rules`
    544 +  - Recording rules in: `ops/prometheus_rules`
    545 +  
    546 +  **Where Data Goes**:
    547 +  - **Local storage**: 15 days retention
    548 +  - **Remote write**: Sent to VictoriaMetrics for long-term storage
    549 +  
    550 +  **Important Concepts**:
    551 +  
    552 +  1. **Scraping**: Prometheus pulls metrics from targets every 15-30 seconds
    553 +     ```
    554 +     Prometheus → HTTP GET http://server:9100/metrics → node_exporter
    555 +     ```
    556 +  
    557 +  2. **Service Discovery**: Prometheus automatically finds what to monitor
    558 +     - **Salt grains** - Server metadata tells Prometheus what the server does
    559 +     - **Kubernetes SD** - Auto-discovers pods in Kubernetes
    560 +  
    561 +  3. **Relabeling**: Modifying metric labels before storage
    562 +     - Add datacenter label
    563 +     - Add team ownership label
    564 +     - Filter out unwanted metrics
    565 +  
    566 +  **Important Files**:
    567 +  - Documentation: `terraform-observability-team/docs/content/Services/Prometheus/`
    568 +  - Sharding guide: `terraform-observability-team/docs/content/Services/Prometheus/sharding.md`
    569 +  
    570 +  ### Grafana (Visualization)
    571 +  
    572 +  **What is it?**
    573 +  Grafana is a web application for creating dashboards and visualizations from metrics and logs.
    574 +  
    575 +  **What do we manage?**
    576 +  - **Grafana instances** (production, staging, dev)
    577 +  - **Datasources** (connections to Prometheus, VictoriaMetrics, Loki)
    578 +  - **Dashboards** (via Terraform)
    579 +  - **User permissions** (teams, folders, access control)
    580 +  
    581 +  **How Users Access It**:
    582 +  - **Production**: https://grafana.linode.com
    583 +  - **Staging**: https://grafana-staging.linode.com
    584 +  
    585 +  **Datasource Types**:
    586 +  
    587 +  1. **Prometheus** - Real-time data (last 15 days)
    588 +     - Example: `prometheus-ewr1-1a-us-prod`
    589 +  
    590 +  2. **VictoriaMetrics** - Historical data (13 months)
    591 +     - Example: `victoriametrics-ord2-us-prod`
    592 +  
    593 +  3. **Loki** - Logs
    594 +     - Example: `loki-na-prod`
    595 +  
    596 +  4. **Thanos Query** - Federated Prometheus queries
    597 +     - Queries across multiple Prometheus instances
    598 +  
    599 +  **Dashboard Management**:
    600 +  - Dashboards stored as JSON
    601 +  - Managed in: `ops/terraform-grafana-config`
    602 +  - Changes deployed via Terraform
    603 +  
    604 +  **Permissions**:
    605 +  - Users organized into **teams**
    606 +  - Teams granted access to **folders**
    607 +  - Folders contain dashboards
    608 +  - We manage permissions via Terraform
    609 +  
    610 +  **Important Files**:
    611 +  - Configuration: `ops/terraform-grafana-config`
    612 +  - Documentation: `terraform-observability-team/docs/content/Services/Grafana/`
    613 +  
    614 +  ### ArgoCD (GitOps Deployment)
    615 +  
    616 +  **What is it?**
    617 +  ArgoCD is a GitOps tool that deploys applications to Kubernetes by syncing with Git repositories.
    618 +  
    619 +  **GitOps Concept**:
    620 +  - **Desired state** is defined in Git
    621 +  - ArgoCD **continuously monitors** Git
    622 +  - When Git changes, ArgoCD **automatically syncs** to Kubernetes
    623 +  - Git is the **single source of truth**
    624 +  
    625 +  **How It Works**:
    626 +  
    627 +  ```
    628 +  1. Engineer makes change
    629 +     ↓
    630 +  2. Git repository updated (ops/palantir)
    631 +     ↓
    632 +  3. ArgoCD detects change
    633 +     ↓
    634 +  4. ArgoCD compares Git vs Kubernetes
    635 +     ↓
    636 +  5. ArgoCD applies changes to Kubernetes
    637 +     ↓
    638 +  6. Application updated
    639 +  ```
    640 +  
    641 +  **ArgoCD Instances**:
    642 +  
    643 +  We run one ArgoCD per environment:
    644 +  
    645 +  - **Production**: https://argocd.infra-o11y-apps.iad3.us.prod.linode.com
    646 +    - Manages ~20 production Kubernetes clusters
    647 +    - Uses Git **release tags** (stable)
    648 +  
    649 +  - **Staging**: https://argocd.infra-o11y-apps.rin1.us.staging.linode.com
    650 +    - Manages ~10 staging clusters
    651 +    - Uses Git **main branch** (latest)
    652 +  
    653 +  - **Dev**: https://argocd.infra-o11y-apps.rin1.us.dev.linode.com
    654 +    - Manages ~5 dev clusters
    655 +    - Uses Git **main branch**
    656 +  
    657 +  **Key Concepts**:
    658 +  
    659 +  1. **Application** - A deployed workload
    660 +     - Example: VictoriaMetrics cluster in ord2
    661 +     - Defined by: Name, Git repo, path, target cluster
    662 +  
    663 +  2. **ApplicationSet** - Template for creating multiple similar Applications
    664 +     - Example: Deploy VictoriaMetrics to all LTS clusters
    665 +     - Uses generators to create Applications dynamically
    666 +  
    667 +  3. **Sync Policy**
    668 +     - **Manual**: Changes require clicking "Sync" button
    669 +     - **Automatic**: Changes deploy automatically
    670 +  
    671 +  4. **Health**
    672 +     - **Healthy**: All resources running correctly
    673 +     - **Degraded**: Some resources have issues
    674 +     - **Progressing**: Deployment in progress
    675 +  
    676 +  **Repositories ArgoCD Watches**:
    677 +  
    678 +  1. **ops/palantir** - Primary repo for Kubernetes manifests
    679 +     - Kustomize-based overlays
    680 +     - Bootstrap configs
    681 +     - Core components
    682 +  
    683 +  2. **ops/o11y-helm-charts** - Generates Application manifests
    684 +     - Helm chart with values files per cluster
    685 +     - Generates ArgoCD Application YAML
    686 +  
    687 +  **Important Files**:
    688 +  - Documentation: `terraform-observability-team/docs/content/Services/ArgoCD/`
    689 +  
    690 +  ### Kubernetes Clusters
    691 +  
    692 +  **What Clusters Do We Manage?**
    693 +  
    694 +  The team manages **30+ Kubernetes clusters** across three types:
    695 +  
    696 +  #### 1. infra-o11y-apps Clusters
    697 +  
    698 +  **Purpose**: Run centralized observability services
    699 +  
    700 +  **What Runs Here**:
    701 +  - ArgoCD
    702 +  - Grafana
    703 +  - Some Prometheus instances
    704 +  - Support services (cert-manager, external-dns)
    705 +  
    706 +  **Naming**: `infra-o11y-apps-{dc}-{country}-{env}`
    707 +  - Example: `infra-o11y-apps-iad3-us-prod`
    708 +  
    709 +  **Count**: ~5 clusters (1 per environment + extras)
    710 +  
    711 +  #### 2. VictoriaMetrics Clusters
    712 +  
    713 +  **Purpose**: Run VictoriaMetrics LTS clusters
    714 +  
    715 +  **What Runs Here**:
    716 +  - vminsert pods
    717 +  - vmstorage pods
    718 +  - vmselect pods
    719 +  - Monitoring exporters
    720 +  
    721 +  **Naming**: `victoriametrics-{dc}-{country}-{env}`
    722 +  - Example: `victoriametrics-ord2-us-prod`
    723 +  
    724 +  **Count**: ~15 clusters (NA, EU, AP regions × staging/prod)
    725 +  
    726 +  #### 3. infra-logging Clusters
    727 +  
    728 +  **Purpose**: Run centralized logging infrastructure
    729 +  
    730 +  **What Runs Here**:
    731 +  - Loki distributors
    732 +  - Loki ingesters
    733 +  - Loki queriers
    734 +  - OpenTelemetry collectors (otelgw)
    735 +  
    736 +  **Naming**: `infra-logging-{dc}-{country}-{env}`
    737 +  - Example: `infra-logging-cjj1-us-staging`
    738 +  
    739 +  **Count**: ~10 clusters
    740 +  
    741 +  **Cluster Deployment Process**:
    742 +  
    743 +  Deploying a new cluster involves multiple teams and repositories:
    744 +  
    745 +  ```
    746 +  1. Terraform (Infrastructure)
    747 +     ├─ Repo: Linode/terraform-module-infra
    748 +     ├─ Action: Provision Linodes, IPs, DNS
    749 +     └─ Output: Server nodes created
    750 +  
    751 +  2. Salt (Server Configuration)
    752 +     ├─ Repo: ops/salt-pillar
    753 +     ├─ Action: Accept minion keys, set grains
    754 +     └─ Output: Servers configured
    755 +  
    756 +  3. Vault (Secrets Management)
    757 +     ├─ Action: Store approle secret, kubeconfig
    758 +     ├─ PKI: Add cluster domain to allowed list
    759 +     └─ Output: Secrets available for apps
    760 +  
    761 +  4. Ansible (Kubernetes Installation)
    762 +     ├─ Repo: ops/ansible-playbooks
    763 +     ├─ Playbook: Kubespray
    764 +     └─ Output: Kubernetes cluster running
    765 +  
    766 +  5. ArgoCD (Join Cluster)
    767 +     ├─ Repo: ops/palantir
    768 +     ├─ Command: make join-cluster
    769 +     └─ Output: ArgoCD can deploy to cluster
    770 +  
    771 +  6. Application Configuration
    772 +     ├─ Repo: ops/o11y-helm-charts
    773 +     ├─ Action: Create values file
    774 +     └─ Repo: ops/palantir
    775 +         ├─ Action: Create overlays
    776 +         └─ Output: Apps deployed via ArgoCD
    777 +  ```
    778 +  
    779 +  **Important Files**:
    780 +  - Documentation: `terraform-observability-team/docs/content/Services/Kubernetes 
        + Clusters/cluster-deployment.md`
    781 +  
    782 +  ### Centralized Logging
    783 +  
    784 +  **Purpose**: Collect logs from all infrastructure in one place for searching and alerting.
    785 +  
    786 +  **Architecture**:
    787 +  
    788 +  ```
    789 +  Servers (all DCs)
    790 +      │ Send logs via Fluent Bit / Promtail
    791 +      ▼
    792 +  OpenTelemetry Collector (otelgw) - Per DC
    793 +      │ Aggregates logs, adds labels
    794 +      ▼
    795 +  Loki Distributor - Regional
    796 +      │ Receives logs, hashes by labels
    797 +      ▼
    798 +  Loki Ingester - Regional
    799 +      │ Buffers logs, creates chunks
    800 +      ▼
    801 +  Object Storage (S3)
    802 +      │ Long-term log storage
    803 +      ▼
    804 +  Loki Querier - Regional
    805 +      │ Queries logs from ingesters + S3
    806 +      ▼
    807 +  Grafana - Global
    808 +      │ User searches logs
    809 +  ```
    810 +  
    811 +  **Components**:
    812 +  
    813 +  1. **otelgw** (OpenTelemetry Gateway)
    814 +     - One per datacenter
    815 +     - Aggregates logs from local servers
    816 +     - Adds metadata (datacenter, environment)
    817 +     - Forwards to Loki
    818 +  
    819 +  2. **Loki**
    820 +     - Distributed log aggregation system
    821 +     - Like "Prometheus but for logs"
    822 +     - **Doesn't index log content** (indexes labels only)
    823 +     - Very cost-effective
    824 +  
    825 +  **Loki Components**:
    826 +  - **Distributor**: Receives logs, validates, forwards
    827 +  - **Ingester**: Buffers logs, creates chunks, stores to S3
    828 +  - **Querier**: Queries logs from ingesters and S3
    829 +  - **Query Frontend**: Splits large queries, caching
    830 +  
    831 +  **Log Flow Example**:
    832 +  
    833 +  ```
    834 +  1. Web server generates log:
    835 +     "2025-11-20 10:00:00 GET /api/v4/linodes 200 45ms"
    836 +  
    837 +  2. Fluent Bit on server sends to otelgw:
    838 +     {
    839 +       "log": "GET /api/v4/linodes 200 45ms",
    840 +       "timestamp": "2025-11-20T10:00:00Z"
    841 +     }
    842 +  
    843 +  3. otelgw adds labels:
    844 +     {
    845 +       "log": "GET /api/v4/linodes 200 45ms",
    846 +       "datacenter": "ewr1",
    847 +       "service": "api",
    848 +       "environment": "prod"
    849 +     }
    850 +  
    851 +  4. Loki stores with labels:
    852 +     Labels: {datacenter="ewr1", service="api", environment="prod"}
    853 +     Log Line: "GET /api/v4/linodes 200 45ms"
    854 +  
    855 +  5. Engineer searches in Grafana:
    856 +     Query: {service="api", datacenter="ewr1"}
    857 +  ```
    858 +  
    859 +  **Important Files**:
    860 +  - Documentation: `terraform-observability-team/docs/content/Services/Centralized Logging/`
    861 +  
    862 +  ---
    863 +  
    864 +  ## 6. How Deployments Work
    865 +  
    866 +  This section explains how changes move from your laptop to production.
    867 +  
    868 +  ### Deployment Workflow: This Repository
    869 +  
    870 +  **For terraform-observability-team**:
    871 +  
    872 +  ```
    873 +  ┌─────────────────────────────────────────────────────────────────┐
    874 +  │ Step 1: Developer Creates PR                                    │
    875 +  ├─────────────────────────────────────────────────────────────────┤
    876 +  │  git checkout -b add-new-team-member                            │
    877 +  │  vim team.auto.tfvars  # Add new person                         │
    878 +  │  git commit -m "Add Jane Doe to team"                           │
    879 +  │  git push origin add-new-team-member                            │
    880 +  │  # Create PR on Bits                                            │
    881 +  └─────────────────────────────────────────────────────────────────┘
    882 +                              ▼
    883 +  ┌─────────────────────────────────────────────────────────────────┐
    884 +  │ Step 2: Atlantis Automatically Runs Plan                        │
    885 +  ├─────────────────────────────────────────────────────────────────┤
    886 +  │  Atlantis (bot) comments on PR:                                 │
    887 +  │                                                                  │
    888 +  │  Terraform Plan:                                                │
    889 +  │  + pagerduty_user.jane_doe                                      │
    890 +  │  + github_team_membership.jane_doe                              │
    891 +  │  + bits_team_membership.jane_doe                                │
    892 +  │                                                                  │
    893 +  │  Plan: 3 to add, 0 to change, 0 to destroy                      │
    894 +  └─────────────────────────────────────────────────────────────────┘
    895 +                              ▼
    896 +  ┌─────────────────────────────────────────────────────────────────┐
    897 +  │ Step 3: Team Reviews PR                                         │
    898 +  ├─────────────────────────────────────────────────────────────────┤
    899 +  │  Required Reviewers: ops/sre-o11y team                          │
    900 +  │  ✅ Check Terraform plan looks correct                          │
    901 +  │  ✅ Approve PR                                                  │
    902 +  └─────────────────────────────────────────────────────────────────┘
    903 +                              ▼
    904 +  ┌─────────────────────────────────────────────────────────────────┐
    905 +  │ Step 4: Merge PR                                                │
    906 +  ├─────────────────────────────────────────────────────────────────┤
    907 +  │  PR is merged to main branch                                    │
    908 +  │  (Atlantis does NOT apply automatically)                        │
    909 +  └─────────────────────────────────────────────────────────────────┘
    910 +                              ▼
    911 +  ┌─────────────────────────────────────────────────────────────────┐
    912 +  │ Step 5: Apply Changes                                           │
    913 +  ├─────────────────────────────────────────────────────────────────┤
    914 +  │  Comment on merged PR: "atlantis apply"                         │
    915 +  │                                                                  │
    916 +  │  Atlantis runs:                                                 │
    917 +  │  + Creates PagerDuty user                                       │
    918 +  │  + Adds to GitHub teams                                         │
    919 +  │  + Adds to Bits teams                                           │
    920 +  │                                                                  │
    921 +  │  Apply Complete! Resources: 3 added, 0 changed, 0 destroyed     │
    922 +  └─────────────────────────────────────────────────────────────────┘
    923 +                              ▼
    924 +  ┌─────────────────────────────────────────────────────────────────┐
    925 +  │ Step 6: Verify                                                  │
    926 +  ├─────────────────────────────────────────────────────────────────┤
    927 +  │  ✅ Jane receives PagerDuty invite                              │
    928 +  │  ✅ Jane appears in GitHub org                                  │
    929 +  │  ✅ Jane can access Bits repos                                  │
    930 +  └─────────────────────────────────────────────────────────────────┘
    931 +  ```
    932 +  
    933 +  **Key Points**:
    934 +  - **Atlantis automates Terraform**
    935 +  - **Plan is automatic**, apply is manual
    936 +  - **Always review the plan** before applying
    937 +  - **Vault secrets** are automatically loaded
    938 +  - **Apply happens via comment**: `atlantis apply`
    939 +  
    940 +  ### Deployment Workflow: Application Changes (Kubernetes)
    941 +  
    942 +  **For ops/o11y-helm-charts → ops/palantir → ArgoCD**:
    943 +  
    944 +  ```
    945 +  ┌─────────────────────────────────────────────────────────────────┐
    946 +  │ Step 1: Change Application Configuration                        │
    947 +  ├─────────────────────────────────────────────────────────────────┤
    948 +  │  Repo: ops/o11y-helm-charts                                     │
    949 +  │                                                                  │
    950 +  │  Example: Upgrade VictoriaMetrics version                       │
    951 +  │                                                                  │
    952 +  │  Edit: values-files/victoriametrics-ord2-us-staging.yaml        │
    953 +  │  Change:                                                         │
    954 +  │    victoriametrics:                                             │
    955 +  │      version: v1.99.0  →  v1.100.0                              │
    956 +  │                                                                  │
    957 +  │  git commit -m "Upgrade VictoriaMetrics to v1.100.0"            │
    958 +  │  git push, create PR, get review, merge                         │
    959 +  └─────────────────────────────────────────────────────────────────┘
    960 +                              ▼
    961 +  ┌─────────────────────────────────────────────────────────────────┐
    962 +  │ Step 2: GitHub Action Auto-Creates Palantir PR                  │
    963 +  ├─────────────────────────────────────────────────────────────────┤
    964 +  │  GitHub Action in o11y-helm-charts runs:                        │
    965 +  │    1. Renders Helm chart with new values                        │
    966 +  │    2. Generates updated Application manifests                   │
    967 +  │    3. Creates PR in ops/palantir with changes                   │
    968 +  │                                                                  │
    969 +  │  Palantir PR shows:                                             │
    970 +  │    - Updated VictoriaMetrics Application YAML                   │
    971 +  │    - New container image version                                │
    972 +  └─────────────────────────────────────────────────────────────────┘
    973 +                              ▼
    974 +  ┌─────────────────────────────────────────────────────────────────┐
    975 +  │ Step 3: Review and Merge Palantir PR                            │
    976 +  ├─────────────────────────────────────────────────────────────────┤
    977 +  │  Team reviews Palantir PR:                                      │
    978 +  │    ✅ Check image version is correct                            │
    979 +  │    ✅ Verify only staging cluster affected                      │
    980 +  │    ✅ Approve and merge                                         │
    981 +  └─────────────────────────────────────────────────────────────────┘
    982 +                              ▼
    983 +  ┌─────────────────────────────────────────────────────────────────┐
    984 +  │ Step 4: ArgoCD Detects Change                                   │
    985 +  ├─────────────────────────────────────────────────────────────────┤
    986 +  │  ArgoCD polls ops/palantir every 3 minutes                      │
    987 +  │  Detects: VictoriaMetrics Application changed                  │
    988 +  │  Status: "OutOfSync"                                            │
    989 +  └─────────────────────────────────────────────────────────────────┘
    990 +                              ▼
    991 +  ┌─────────────────────────────────────────────────────────────────┐
    992 +  │ Step 5: Manual Sync in ArgoCD                                   │
    993 +  ├─────────────────────────────────────────────────────────────────┤
    994 +  │  Engineer logs into ArgoCD UI                                   │
    995 +  │  Clicks "Sync" on VictoriaMetrics Application                   │
    996 +  │  ArgoCD applies changes to Kubernetes:                          │
    997 +  │    - Rolls out new vminsert pods (v1.100.0)                     │
    998 +  │    - Rolls out new vmselect pods (v1.100.0)                     │
    999 +  │    - Rolls out new vmstorage pods (v1.100.0)                    │
   1000 +  │  Status: "Synced" + "Healthy"                                   │
   1001 +  └─────────────────────────────────────────────────────────────────┘
   1002 +                              ▼
   1003 +  ┌─────────────────────────────────────────────────────────────────┐
   1004 +  │ Step 6: Verify Deployment                                       │
   1005 +  ├─────────────────────────────────────────────────────────────────┤
   1006 +  │  kubectl get pods -n victoriametrics                            │
   1007 +  │  → All pods running with new image                              │
   1008 +  │                                                                  │
   1009 +  │  Check Grafana dashboards                                       │
   1010 +  │  → Metrics still flowing                                        │
   1011 +  │  → No errors in logs                                            │
   1012 +  └─────────────────────────────────────────────────────────────────┘
   1013 +  ```
   1014 +  
   1015 +  **Key Points**:
   1016 +  - **Two PRs required**: One in o11y-helm-charts, one auto-generated in palantir
   1017 +  - **Staging uses main branch**, production uses release tags
   1018 +  - **Manual sync in ArgoCD** (staging/dev auto-sync, prod is manual)
   1019 +  - **Always verify** after deployment
   1020 +  
   1021 +  ### Repository Relationship Diagram
   1022 +  
   1023 +  ```
   1024 +  ┌────────────────────────────────────────────────────────────────┐
   1025 +  │                    SOURCE OF TRUTH                              │
   1026 +  │                                                                 │
   1027 +  │  ┌──────────────────────┐                                      │
   1028 +  │  │ ops/o11y-helm-charts │ ← Engineers make changes here        │
   1029 +  │  │ (Helm Chart + Values)│                                      │
   1030 +  │  └──────────┬───────────┘                                      │
   1031 +  │             │                                                   │
   1032 +  │             │ GitHub Action                                    │
   1033 +  │             │ (helm template + render)                         │
   1034 +  │             ▼                                                   │
   1035 +  │  ┌──────────────────────┐                                      │
   1036 +  │  │  ops/palantir        │ ← Auto-generated manifests           │
   1037 +  │  │  (Kubernetes YAML)   │                                      │
   1038 +  │  └──────────┬───────────┘                                      │
   1039 +  │             │                                                   │
   1040 +  │             │ ArgoCD polls every 3min                          │
   1041 +  │             ▼                                                   │
   1042 +  │  ┌──────────────────────┐                                      │
   1043 +  │  │  ArgoCD              │ ← Detects changes, syncs             │
   1044 +  │  └──────────┬───────────┘                                      │
   1045 +  │             │                                                   │
   1046 +  │             │ kubectl apply                                    │
   1047 +  │             ▼                                                   │
   1048 +  │  ┌──────────────────────┐                                      │
   1049 +  │  │  Kubernetes Cluster  │ ← Applications running               │
   1050 +  │  └──────────────────────┘                                      │
   1051 +  └────────────────────────────────────────────────────────────────┘
   1052 +  ```
   1053 +  
   1054 +  ---
   1055 +  
   1056 +  ## 7. Important Files You Need to Know
   1057 +  
   1058 +  ### Critical Files (Edit With Care!)
   1059 +  
   1060 +  #### `/team.auto.tfvars`
   1061 +  
   1062 +  **What**: Single source of truth for team membership
   1063 +  
   1064 +  **When to Edit**:
   1065 +  - Adding/removing team members
   1066 +  - Changing on-call rotation
   1067 +  - Modifying repository permissions
   1068 +  
   1069 +  **Structure**:
   1070 +  
   1071 +  ```hcl
   1072 +  # Team Members
   1073 +  observability_members = {
   1074 +    jdoe = {
   1075 +      name                 = "Jane Doe"
   1076 +      email                = "jane.doe@akamai.com"
   1077 +      job_title            = "Senior SRE"
   1078 +      github_username      = "jdoe"
   1079 +      github_admin         = false          # Admin on linode-obs GitHub org
   1080 +      bits_team_maintainer = true           # Maintainer role on Bits teams
   1081 +      bits_orgs            = ["ops", "Linode"]  # Which Bits orgs
   1082 +      pd_enabled           = true           # Add to PagerDuty
   1083 +    }
   1084 +    # ... more team members
   1085 +  }
   1086 +  
   1087 +  # On-call Rotation (Primary)
   1088 +  observability_oncall_primary = [
   1089 +    "current_oncall",   # MUST be first (currently on-call)
   1090 +    "person2",
   1091 +    "person3",
   1092 +    "new_person"        # Add new people at END
   1093 +  ]
   1094 +  
   1095 +  # On-call Schedule Start Time
   1096 +  # IMPORTANT: Set to Monday 11am ET of current on-call's shift
   1097 +  pagerduty_schedule_starttime = "2025-11-18T11:00:00-05:00"
   1098 +  
   1099 +  # Repository Configurations
   1100 +  observability_bits_repos = {
   1101 +    ops = [
   1102 +      {
   1103 +        name  = "prometheus_rules"
   1104 +        hooks = ["notification-o11y-prs", "atlantis"]
   1105 +        permissions = {
   1106 +          pull = false
   1107 +          push = true
   1108 +          admin = false
   1109 +        }
   1110 +        branch_protection = {
   1111 +          pattern = "main"
   1112 +          require_code_owners_review = true
   1113 +          required_approving_review_count = 1
   1114 +        }
   1115 +      }
   1116 +      # ... more repos
   1117 +    ]
   1118 +  }
   1119 +  ```
   1120 +  
   1121 +  **Important Notes**:
   1122 +  - **On-call order matters!** Current on-call must be first
   1123 +  - **Adding to on-call**: Put new person LAST
   1124 +  - **Update schedule_starttime** when changing on-call
   1125 +  - **Review carefully** - affects real permissions
   1126 +  
   1127 +  #### `/pagerduty_suppression.auto.tfvars`
   1128 +  
   1129 +  **What**: Alert suppression during maintenance
   1130 +  
   1131 +  **When to Edit**:
   1132 +  - DC scaling events (temporarily suppress alerts)
   1133 +  - Maintenance windows
   1134 +  
   1135 +  **Structure**:
   1136 +  
   1137 +  ```hcl
   1138 +  # DCs with alerts suppressed
   1139 +  observability_suppressed_dcs = [
   1140 +    "iad5",    # Include both formats
   1141 +    "iad05",   # (some alerts use zero-padded)
   1142 +  ]
   1143 +  
   1144 +  # Services with suppression rules
   1145 +  observability_suppressed_services = [
   1146 +    "PJ95HJI",  # Prometheus Alerts service ID
   1147 +    # ... more service IDs
   1148 +  ]
   1149 +  ```
   1150 +  
   1151 +  **Process**:
   1152 +  1. Edit file to add DC
   1153 +  2. `atlantis plan` to verify
   1154 +  3. Merge and `atlantis apply`
   1155 +  4. Suppression active
   1156 +  5. **IMPORTANT**: Remove DC after maintenance!
   1157 +  
   1158 +  #### `/atlantis.yaml`
   1159 +  
   1160 +  **What**: Atlantis automation configuration
   1161 +  
   1162 +  **When to Edit**: Rarely (only if changing Terraform workflow)
   1163 +  
   1164 +  **Key Settings**:
   1165 +  
   1166 +  ```yaml
   1167 +  version: 3
   1168 +  automerge: false
   1169 +  delete_source_branch_on_merge: false
   1170 +  
   1171 +  projects:
   1172 +  - name: terraform-observability-team
   1173 +    dir: .
   1174 +    workspace: default
   1175 +    terraform_version: v1.5.7
   1176 +  
   1177 +    # Automatic plan on PR
   1178 +    autoplan:
   1179 +      when_modified: ["*.tf", "*.tfvars"]
   1180 +      enabled: true
   1181 +  
   1182 +    # Requirements before apply
   1183 +    apply_requirements:
   1184 +      - approved      # PR must be approved
   1185 +      - mergeable     # PR must be mergeable
   1186 +  
   1187 +    workflow: observability-team
   1188 +  
   1189 +  workflows:
   1190 +    observability-team:
   1191 +      plan:
   1192 +        steps:
   1193 +        - run: vault login -method=oidc  # Authenticate to Vault
   1194 +        - init
   1195 +        - plan
   1196 +      apply:
   1197 +        steps:
   1198 +        - run: vault login -method=oidc
   1199 +        - apply
   1200 +  ```
   1201 +  
   1202 +  **Don't Touch Unless**: You know what you're doing with Atlantis
   1203 +  
   1204 +  ### Configuration Files
   1205 +  
   1206 +  #### `/.envrc`
   1207 +  
   1208 +  **What**: Environment variables for local development (loaded by direnv)
   1209 +  
   1210 +  **Contents**:
   1211 +  
   1212 +  ```bash
   1213 +  export VAULT_ADDR="https://vault.infra.linode.com"
   1214 +  export VAULT_NAMESPACE="infra"
   1215 +  
   1216 +  # Auto-login to Vault when entering directory
   1217 +  vault login -method=oidc username=$USER
   1218 +  ```
   1219 +  
   1220 +  **Usage**:
   1221 +  1. Install direnv: `brew install direnv`
   1222 +  2. Add to shell: `eval "$(direnv hook zsh)"`
   1223 +  3. Allow in this directory: `direnv allow`
   1224 +  4. Now auto-authenticated to Vault when you `cd` here
   1225 +  
   1226 +  #### `/.pre-commit-config.yaml`
   1227 +  
   1228 +  **What**: Automated code quality checks before commits
   1229 +  
   1230 +  **Hooks**:
   1231 +  - `terraform fmt` - Format Terraform code
   1232 +  - `terraform validate` - Validate Terraform syntax
   1233 +  - `markdownlint` - Lint Markdown docs
   1234 +  - `vale` - Prose style checking
   1235 +  - `prettier` - Format YAML/JSON/Markdown
   1236 +  - `trailing-whitespace` - Remove trailing whitespace
   1237 +  - `end-of-file-fixer` - Ensure files end with newline
   1238 +  
   1239 +  **Setup**:
   1240 +  ```bash
   1241 +  pre-commit install --install-hooks
   1242 +  ```
   1243 +  
   1244 +  **Usage**: Runs automatically on `git commit`
   1245 +  
   1246 +  **Skip if needed** (not recommended):
   1247 +  ```bash
   1248 +  git commit --no-verify
   1249 +  ```
   1250 +  
   1251 +  #### `/.mise.toml`
   1252 +  
   1253 +  **What**: Task runner configuration (like Makefile but better)
   1254 +  
   1255 +  **Common Tasks**:
   1256 +  
   1257 +  ```bash
   1258 +  # Create new on-call log for this week
   1259 +  mise run oncall
   1260 +  
   1261 +  # Create new proposal
   1262 +  mise run proposal
   1263 +  
   1264 +  # Run pre-commit on all files
   1265 +  mise run pre-commit-all
   1266 +  
   1267 +  # Serve documentation locally
   1268 +  mise run docs-serve
   1269 +  ```
   1270 +  
   1271 +  **View all tasks**:
   1272 +  ```bash
   1273 +  mise tasks
   1274 +  ```
   1275 +  
   1276 +  ### Documentation Files
   1277 +  
   1278 +  #### `/docs/config.toml`
   1279 +  
   1280 +  **What**: Hugo site configuration
   1281 +  
   1282 +  **Key Settings**:
   1283 +  
   1284 +  ```toml
   1285 +  baseURL = "https://bits.linode.com/pages/ops/terraform-observability-team/"
   1286 +  title = "SRE Observability Team"
   1287 +  theme = "docsy"
   1288 +  
   1289 +  [params]
   1290 +    description = "SRE Observability Team Documentation"
   1291 +    github_repo = "https://bits.linode.com/ops/terraform-observability-team"
   1292 +    github_branch = "main"
   1293 +  ```
   1294 +  
   1295 +  **When to Edit**: Changing site metadata, theme settings
   1296 +  
   1297 +  ---
   1298 +  
   1299 +  ## 8. Documentation Structure
   1300 +  
   1301 +  ### How Documentation is Organized
   1302 +  
   1303 +  The team uses **Hugo** with the **Docsy** theme for documentation.
   1304 +  
   1305 +  **Why Hugo?**
   1306 +  - Static site generator (fast, secure)
   1307 +  - Markdown-based (easy to write)
   1308 +  - Version controlled (in git)
   1309 +  - Searchable
   1310 +  - Navigation sidebar auto-generated
   1311 +  
   1312 +  ### Content Types
   1313 +  
   1314 +  #### Handbooks (`/docs/content/Handbooks/`)
   1315 +  
   1316 +  **Purpose**: How the team operates
   1317 +  
   1318 +  **Files**:
   1319 +  - `on-call.md` - On-call guide (schedule, PagerDuty setup, responsibilities)
   1320 +  - `tools.md` - Standardized tooling (Go, Jsonnet, Kubernetes, etc.)
   1321 +  - `git-conventions.md` - Commit message format, branching strategy
   1322 +  - `docs/` - How to write documentation
   1323 +  
   1324 +  **When to Update**: Process changes, new tools adopted
   1325 +  
   1326 +  #### Services (`/docs/content/Services/`)
   1327 +  
   1328 +  **Purpose**: Documentation for each service we manage
   1329 +  
   1330 +  **Structure**:
   1331 +  ```
   1332 +  Services/
   1333 +  ├── ArgoCD/
   1334 +  │   ├── _index.md              # Overview
   1335 +  │   ├── repositories.md        # Repo management
   1336 +  │   └── troubleshooting.md
   1337 +  ├── VictoriaMetrics/
   1338 +  │   ├── _index.md
   1339 +  │   ├── architecture.md
   1340 +  │   ├── upgrade.md
   1341 +  │   └── troubleshooting.md
   1342 +  ├── Prometheus/
   1343 +  │   ├── _index.md
   1344 +  │   ├── sharding.md
   1345 +  │   ├── Updating/
   1346 +  │   │   └── updating.md
   1347 +  │   └── troubleshooting.md
   1348 +  └── ...
   1349 +  ```
   1350 +  
   1351 +  **When to Update**: Service upgrades, architecture changes, new troubleshooting steps
   1352 +  
   1353 +  #### MOPs (`/docs/content/mops/`)
   1354 +  
   1355 +  **Purpose**: Manual Operations Procedures - step-by-step guides for complex tasks
   1356 +  
   1357 +  **MOP Structure**:
   1358 +  
   1359 +  ```markdown
   1360 +  # MOP: Prometheus Shard
   1361 +  
   1362 +  ## Overview
   1363 +  High-level description of the procedure.
   1364 +  
   1365 +  ## Prerequisites
   1366 +  - [ ] Access to Salt master
   1367 +  - [ ] 2-4 hours of focused time
   1368 +  - [ ] Approval from team lead
   1369 +  
   1370 +  ## Procedure
   1371 +  
   1372 +  ### Step 1: Prepare
   1373 +  Detailed instructions...
   1374 +  
   1375 +  ### Step 2: Execute
   1376 +  More instructions...
   1377 +  
   1378 +  ## Verification
   1379 +  How to verify the procedure succeeded.
   1380 +  
   1381 +  ## Rollback Plan
   1382 +  How to undo changes if something goes wrong.
   1383 +  
   1384 +  ## References
   1385 +  - [Related Documentation](link)
   1386 +  ```
   1387 +  
   1388 +  **Examples**:
   1389 +  - `prometheus-shard.md` - Adding a new Prometheus shard
   1390 +  - `victoriametrics-cluster-upgrade.md` - Upgrading VictoriaMetrics
   1391 +  
   1392 +  **When to Create**: Complex multi-step procedures that are done infrequently
   1393 +  
   1394 +  #### Proposals (`/docs/content/proposals/`)
   1395 +  
   1396 +  **Purpose**: Design documents for major changes
   1397 +  
   1398 +  **Proposal Naming**: `OP-##-descriptive-name.md`
   1399 +  - OP = Observability Proposal
   1400 +  - ## = Sequential number (01, 02, 03...)
   1401 +  
   1402 +  **Proposal Template**:
   1403 +  
   1404 +  ```markdown
   1405 +  ---
   1406 +  title: "OP-05: My Proposal Title"
   1407 +  status: "accepted"  # draft | accepted | rejected | done
   1408 +  date: 2025-11-20
   1409 +  ---
   1410 +  
   1411 +  ## Summary
   1412 +  One paragraph summary.
   1413 +  
   1414 +  ## Motivation
   1415 +  Why are we doing this?
   1416 +  
   1417 +  ## Proposal
   1418 +  Detailed design.
   1419 +  
   1420 +  ## Alternatives Considered
   1421 +  What else did we think about?
   1422 +  
   1423 +  ## Implementation Plan
   1424 +  How will we do this?
   1425 +  
   1426 +  ## Success Metrics
   1427 +  How do we know it worked?
   1428 +  ```
   1429 +  
   1430 +  **Statuses**:
   1431 +  - `draft` - Being written
   1432 +  - `accepted` - Approved, not implemented
   1433 +  - `done` - Implemented
   1434 +  - `rejected` - Not approved
   1435 +  
   1436 +  **Creating a Proposal**:
   1437 +  ```bash
   1438 +  cd docs/
   1439 +  mise run proposal
   1440 +  # OR
   1441 +  hugo new proposals/OP-##-my-proposal.md
   1442 +  ```
   1443 +  
   1444 +  **Approval Process**:
   1445 +  1. Create proposal as draft
   1446 +  2. Share with team for feedback
   1447 +  3. Present in team meeting
   1448 +  4. Minimum 2 approvals required
   1449 +  5. Update status to "accepted"
   1450 +  
   1451 +  #### On-call Logs (`/docs/content/on-call/YYYY/`)
   1452 +  
   1453 +  **Purpose**: Weekly logs of on-call work
   1454 +  
   1455 +  **Structure**:
   1456 +  
   1457 +  ```markdown
   1458 +  ---
   1459 +  title: "On-call: 2025-11-18"
   1460 +  date: 2025-11-18
   1461 +  author: "Jane Doe"
   1462 +  ---
   1463 +  
   1464 +  ## Summary
   1465 +  Brief summary of the week.
   1466 +  
   1467 +  ## Incidents
   1468 +  ### [INC-1234] Production Prometheus Down
   1469 +  - **When**: 2025-11-18 14:00 UTC
   1470 +  - **Impact**: 5 minutes of data loss
   1471 +  - **Root Cause**: Out of disk space
   1472 +  - **Resolution**: Cleaned up old WAL files
   1473 +  - **Follow-up**: Created ticket to increase disk size
   1474 +  
   1475 +  ## Reliability Improvements
   1476 +  - Automated disk cleanup script
   1477 +  - Added disk space alerting
   1478 +  
   1479 +  ## Intake & Requests
   1480 +  - Granted Grafana access to 3 new users
   1481 +  - Helped Product team with dashboard creation
   1482 +  
   1483 +  ## Notes
   1484 +  - VictoriaMetrics upgrade planned for next week
   1485 +  - Need to review Prometheus sharding in iad5
   1486 +  ```
   1487 +  
   1488 +  **Creating On-call Log**:
   1489 +  ```bash
   1490 +  cd docs/
   1491 +  mise run oncall
   1492 +  # Creates: on-call/2025/2025-11-18.md (for current Monday)
   1493 +  ```
   1494 +  
   1495 +  **Handoff Process**:
   1496 +  1. Monday 11am ET - on-call shift changes
   1497 +  2. Outgoing on-call fills out summary
   1498 +  3. Posts link in #o11y-core Slack thread
   1499 +  4. Incoming on-call reads to catch up
   1500 +  
   1501 +  #### Runbooks (`/docs/content/Runbooks/`)
   1502 +  
   1503 +  **Purpose**: How to respond to specific alerts
   1504 +  
   1505 +  **Runbook Structure**:
   1506 +  
   1507 +  ```markdown
   1508 +  # Alert: HighMemoryUsage
   1509 +  
   1510 +  ## Summary
   1511 +  This alert fires when a server's memory usage exceeds 90% for 5 minutes.
   1512 +  
   1513 +  ## Impact
   1514 +  - Potential performance degradation
   1515 +  - Risk of OOM killer terminating processes
   1516 +  
   1517 +  ## Investigation Steps
   1518 +  1. Check which process is using memory:
   1519 +     ```bash
   1520 +     ssh server
   1521 +     top -o %MEM
   1522 +     ```
   1523 +  
   1524 +  2. Check for memory leaks:
   1525 +     ```bash
   1526 +     ps aux --sort=-%mem | head -n 10
   1527 +     ```
   1528 +  
   1529 +  ## Resolution
   1530 +  - Restart offending process
   1531 +  - Increase memory if consistently high
   1532 +  - Check for memory leak in application
   1533 +  
   1534 +  ## Escalation
   1535 +  If you can't resolve in 30 minutes, escalate to:
   1536 +  - #team-infrastructure
   1537 +  ```
   1538 +  
   1539 +  **When to Create**: For any alert that pages
   1540 +  
   1541 +  ### Writing Documentation
   1542 +  
   1543 +  **Creating New Pages**:
   1544 +  
   1545 +  ```bash
   1546 +  cd docs/
   1547 +  
   1548 +  # New service documentation
   1549 +  hugo new Services/MyService/_index.md
   1550 +  
   1551 +  # New MOP
   1552 +  hugo new mops/my-procedure.md
   1553 +  
   1554 +  # New proposal
   1555 +  mise run proposal
   1556 +  
   1557 +  # New on-call log
   1558 +  mise run oncall
   1559 +  ```
   1560 +  
   1561 +  **Markdown Tips**:
   1562 +  
   1563 +  ```markdown
   1564 +  # Headers
   1565 +  Use # for title, ## for sections, ### for subsections
   1566 +  
   1567 +  # Links
   1568 +  [Link Text](https://example.com)
   1569 +  [Internal Link]({{< ref "path/to/page.md" >}})
   1570 +  
   1571 +  # Code Blocks
   1572 +  ```bash
   1573 +  command here
   1574 +  ```
   1575 +  
   1576 +  # Images
   1577 +  ![Alt Text](/static/images/diagram.png)
   1578 +  
   1579 +  # Admonitions (special boxes)
   1580 +  {{< alert title="Warning" >}}
   1581 +  This is important!
   1582 +  {{< /alert >}}
   1583 +  ```
   1584 +  
   1585 +  **Documentation Standards**:
   1586 +  1. **Clarity**: Write for someone unfamiliar with the topic
   1587 +  2. **Examples**: Include real examples, not just theory
   1588 +  3. **Up-to-date**: Update docs when processes change
   1589 +  4. **Searchable**: Use descriptive headers and titles
   1590 +  
   1591 +  **Building Locally**:
   1592 +  
   1593 +  ```bash
   1594 +  cd docs/
   1595 +  mise run docs-serve
   1596 +  # Open http://localhost:1313
   1597 +  ```
   1598 +  
   1599 +  **Publishing**:
   1600 +  - Merged to `main` branch → GitHub Action builds → Published to Bits Pages
   1601 +  
   1602 +  ---
   1603 +  
   1604 +  ## 9. Day-to-Day Operations
   1605 +  
   1606 +  ### On-call Responsibilities
   1607 +  
   1608 +  **On-call Shift**: Monday 11am ET → Next Monday 11am ET
   1609 +  
   1610 +  **Primary On-call Duties**:
   1611 +  
   1612 +  1. **Respond to Pages** (Critical alerts)
   1613 +     - Acknowledgment: ≤ 5 minutes
   1614 +     - Begin investigation immediately
   1615 +     - Update incident ticket
   1616 +     - Escalate if needed
   1617 +  
   1618 +  2. **Monitor Warnings** (Non-critical alerts)
   1619 +     - Acknowledgment: ≤ 12 hours
   1620 +     - Investigate during business hours
   1621 +     - Create tickets for follow-up
   1622 +  
   1623 +  3. **Handle Intake Requests**
   1624 +     - Grafana permission requests
   1625 +     - Nagios user management
   1626 +     - Quick questions in Slack
   1627 +  
   1628 +  4. **PR Reviews**
   1629 +     - Review PRs for ops/sre-o11y repos
   1630 +     - Priority: Blocking changes first
   1631 +  
   1632 +  5. **Reliability Improvement**
   1633 +     - Choose ONE reliability task per week
   1634 +     - Examples: Automate toil, improve documentation, fix flaky alerts
   1635 +  
   1636 +  6. **Attend Post-Mortems**
   1637 +     - For incidents you responded to
   1638 +     - Share learnings with team
   1639 +  
   1640 +  **On-call Handoff**:
   1641 +  1. Fill out on-call log summary
   1642 +  2. Post in #o11y-core Slack thread
   1643 +  3. Highlight any ongoing issues
   1644 +  4. Transfer any active incidents
   1645 +  
   1646 +  ### Common Daily Tasks
   1647 +  
   1648 +  #### Reviewing PRs
   1649 +  
   1650 +  **Repositories We Review**:
   1651 +  - `ops/prometheus_rules`
   1652 +  - `ops/prometheus-formula`
   1653 +  - `ops/o11y-helm-charts`
   1654 +  - `ops/palantir`
   1655 +  - `ops/loki_rules`
   1656 +  - `ops/terraform-grafana-config`
   1657 +  - `terraform-observability-team`
   1658 +  
   1659 +  **Review Checklist**:
   1660 +  - [ ] Read PR description
   1661 +  - [ ] Check CI passes
   1662 +  - [ ] Review code changes
   1663 +  - [ ] Verify Terraform plan (if applicable)
   1664 +  - [ ] Check for secrets in diff
   1665 +  - [ ] Ensure tests added (if applicable)
   1666 +  - [ ] Approve or request changes
   1667 +  
   1668 +  **Slack Notifications**: #notification-o11y-prs
   1669 +  
   1670 +  #### Granting Grafana Access
   1671 +  
   1672 +  **Request Format**: Usually in #sre-observability or Jira (OY project)
   1673 +  
   1674 +  **Process**:
   1675 +  1. Determine access level needed
   1676 +     - Viewer: Read dashboards
   1677 +     - Editor: Create/edit dashboards
   1678 +     - Admin: Manage users (rare)
   1679 +  
   1680 +  2. Add user in Terraform:
   1681 +     ```bash
   1682 +     cd ~/path/to/terraform-grafana-config
   1683 +     vim users.tf
   1684 +     # Add user definition
   1685 +     git commit -m "Add Jane Doe to Grafana"
   1686 +     git push, create PR
   1687 +     ```
   1688 +  
   1689 +  3. After PR merged:
   1690 +     - Atlantis applies
   1691 +     - User receives invite email
   1692 +  
   1693 +  4. Notify requester
   1694 +  
   1695 +  **Time-sensitive**: Try to complete within 1 business day
   1696 +  
   1697 +  #### Monitoring Alerts
   1698 +  
   1699 +  **Alert Channels**:
   1700 +  - **#notification-o11y** - All observability alerts
   1701 +  - **PagerDuty** - Critical alerts (pages)
   1702 +  - **Email** - Low-priority warnings
   1703 +  
   1704 +  **Triage Process**:
   1705 +  
   1706 +  1. **Check Alert Severity**
   1707 +     - **Critical**: Page immediately
   1708 +     - **Warning**: Investigate during business hours
   1709 +     - **Info**: Log for awareness
   1710 +  
   1711 +  2. **Check Runbook**
   1712 +     - Most alerts link to runbook
   1713 +     - Follow investigation steps
   1714 +  
   1715 +  3. **Create Incident Ticket** (if needed)
   1716 +     - Jira project: OY
   1717 +     - Include alert details
   1718 +     - Track resolution
   1719 +  
   1720 +  4. **Silence if Necessary**
   1721 +     ```bash
   1722 +     # Silence alert for maintenance
   1723 +     amtool silence add \
   1724 +       --alertmanager.url=https://alertmanager.linode.com \
   1725 +       --comment="Datacenter maintenance" \
   1726 +       --duration=2h \
   1727 +       alertname="HighCPU" \
   1728 +       datacenter="iad5"
   1729 +     ```
   1730 +  
   1731 +  5. **Post-Incident**
   1732 +     - Document in on-call log
   1733 +     - Create follow-up tickets
   1734 +     - Update runbook if needed
   1735 +  
   1736 +  ### Weekly Team Rituals
   1737 +  
   1738 +  **Monday 11am ET: On-call Handoff**
   1739 +  - Outgoing on-call posts summary
   1740 +  - Incoming on-call reviews
   1741 +  
   1742 +  **Wednesday 10am ET: Team Sync** (if scheduled)
   1743 +  - Current work updates
   1744 +  - Blocker discussion
   1745 +  - Knowledge sharing
   1746 +  
   1747 +  **Fridays: Reliability Improvement Time**
   1748 +  - Work on tech debt
   1749 +  - Improve automation
   1750 +  - Update documentation
   1751 +  
   1752 +  ---
   1753 +  
   1754 +  ## 10. Common Tasks with Examples
   1755 +  
   1756 +  Let me show you how to do common tasks step-by-step.
   1757 +  
   1758 +  ### Task 1: Add a New Team Member
   1759 +  
   1760 +  **Scenario**: Jane Doe is joining the team.
   1761 +  
   1762 +  **Steps**:
   1763 +  
   1764 +  ```bash
   1765 +  # 1. Clone repo (if not already)
   1766 +  cd ~/repos
   1767 +  git clone bits.linode.com:ops/terraform-observability-team
   1768 +  cd terraform-observability-team
   1769 +  
   1770 +  # 2. Create branch
   1771 +  git checkout -b add-jane-doe
   1772 +  
   1773 +  # 3. Edit team.auto.tfvars
   1774 +  vim team.auto.tfvars
   1775 +  
   1776 +  # Add to observability_members:
   1777 +  observability_members = {
   1778 +    # ... existing members ...
   1779 +  
   1780 +    jdoe = {
   1781 +      name                 = "Jane Doe"
   1782 +      email                = "jane.doe@akamai.com"
   1783 +      job_title            = "SRE II"
   1784 +      github_username      = "jdoe-akamai"
   1785 +      github_admin         = false
   1786 +      bits_team_maintainer = true
   1787 +      bits_orgs            = ["ops", "Linode"]
   1788 +      pd_enabled           = true
   1789 +    }
   1790 +  }
   1791 +  
   1792 +  # 4. Commit and push
   1793 +  git add team.auto.tfvars
   1794 +  git commit -m "Add Jane Doe to SRE O11y team"
   1795 +  git push origin add-jane-doe
   1796 +  
   1797 +  # 5. Create PR on Bits
   1798 +  # Visit: bits.linode.com/ops/terraform-observability-team
   1799 +  # Click "Create Pull Request"
   1800 +  
   1801 +  # 6. Wait for Atlantis to run plan
   1802 +  # Review plan in PR comments
   1803 +  
   1804 +  # 7. Get PR approved by team member
   1805 +  
   1806 +  # 8. Merge PR
   1807 +  
   1808 +  # 9. Apply changes
   1809 +  # Comment on merged PR: "atlantis apply"
   1810 +  
   1811 +  # 10. Verify
   1812 +  # - Check PagerDuty: Jane should appear in team
   1813 +  # - Check GitHub: Jane should be in linode-obs org
   1814 +  # - Check Bits: Jane should be in ops/sre-o11y team
   1815 +  ```
   1816 +  
   1817 +  ### Task 2: Add Someone to On-call Rotation
   1818 +  
   1819 +  **Scenario**: Jane is trained and ready for on-call.
   1820 +  
   1821 +  **Important**: On-call rotation order matters!
   1822 +  
   1823 +  **Steps**:
   1824 +  
   1825 +  ```bash
   1826 +  # 1. Determine current on-call
   1827 +  # Check PagerDuty schedule or ask in Slack
   1828 +  
   1829 +  # 2. Create branch
   1830 +  git checkout -b add-jane-oncall
   1831 +  
   1832 +  # 3. Edit team.auto.tfvars
   1833 +  vim team.auto.tfvars
   1834 +  
   1835 +  # BEFORE:
   1836 +  observability_oncall_primary = [
   1837 +    "current_oncall",
   1838 +    "person2",
   1839 +    "person3",
   1840 +  ]
   1841 +  pagerduty_schedule_starttime = "2025-11-18T11:00:00-05:00"
   1842 +  
   1843 +  # AFTER:
   1844 +  observability_oncall_primary = [
   1845 +    "current_oncall",   # Must be first!
   1846 +    "person2",
   1847 +    "person3",
   1848 +    "jdoe"              # Add new person LAST
   1849 +  ]
   1850 +  # Update to Monday of current on-call's shift:
   1851 +  pagerduty_schedule_starttime = "2025-11-18T11:00:00-05:00"
   1852 +  
   1853 +  # 4. Commit, push, PR, merge, apply (same as Task 1)
   1854 +  ```
   1855 +  
   1856 +  **Why Order Matters**:
   1857 +  - PagerDuty schedule starts at `pagerduty_schedule_starttime`
   1858 +  - Rotates through the list in order
   1859 +  - If order changes, rotation gets messed up
   1860 +  - **Always**: Current on-call first, new person last
   1861 +  
   1862 +  ### Task 3: Suppress Alerts for Datacenter Maintenance
   1863 +  
   1864 +  **Scenario**: Datacenter iad5 is being scaled, expect alerts.
   1865 +  
   1866 +  **Steps**:
   1867 +  
   1868 +  ```bash
   1869 +  # 1. Create branch
   1870 +  git checkout -b suppress-iad5
   1871 +  
   1872 +  # 2. Edit pagerduty_suppression.auto.tfvars
   1873 +  vim pagerduty_suppression.auto.tfvars
   1874 +  
   1875 +  # Add DCs (include both formats!):
   1876 +  observability_suppressed_dcs = [
   1877 +    "iad5",
   1878 +    "iad05",   # Some alerts use zero-padded
   1879 +  ]
   1880 +  
   1881 +  # 3. Commit and push
   1882 +  git add pagerduty_suppression.auto.tfvars
   1883 +  git commit -m "Suppress alerts for iad5 during scaling"
   1884 +  git push origin suppress-iad5
   1885 +  
   1886 +  # 4. Create PR, get approved, merge
   1887 +  
   1888 +  # 5. Apply immediately
   1889 +  # Comment: "atlantis apply"
   1890 +  
   1891 +  # 6. After maintenance completes, REMOVE suppression
   1892 +  git checkout -b unsuppress-iad5
   1893 +  vim pagerduty_suppression.auto.tfvars
   1894 +  # Remove iad5, iad05 from list
   1895 +  git commit -m "Remove iad5 alert suppression"
   1896 +  # Push, PR, merge, apply
   1897 +  ```
   1898 +  
   1899 +  **Important**: Don't forget to remove suppression after!
   1900 +  
   1901 +  ### Task 4: Upgrade VictoriaMetrics
   1902 +  
   1903 +  **Scenario**: New VictoriaMetrics version released, need to upgrade staging.
   1904 +  
   1905 +  **Steps**:
   1906 +  
   1907 +  ```bash
   1908 +  # 1. Review changelog
   1909 +  # Check: github.com/VictoriaMetrics/VictoriaMetrics/releases
   1910 +  
   1911 +  # 2. Clone o11y-helm-charts
   1912 +  cd ~/repos
   1913 +  git clone bits.linode.com:ops/o11y-helm-charts
   1914 +  cd o11y-helm-charts
   1915 +  
   1916 +  # 3. Create branch
   1917 +  git checkout -b victoriametrics-v1.100.0
   1918 +  
   1919 +  # 4. Update staging cluster values
   1920 +  vim values-files/victoriametrics-ord2-us-staging.yaml
   1921 +  
   1922 +  # Change version:
   1923 +  victoriametrics:
   1924 +    version: v1.99.0  →  v1.100.0
   1925 +  
   1926 +  # 5. Commit and push
   1927 +  git add values-files/victoriametrics-ord2-us-staging.yaml
   1928 +  git commit -m "Upgrade VictoriaMetrics staging to v1.100.0"
   1929 +  git push origin victoriametrics-v1.100.0
   1930 +  
   1931 +  # 6. Create PR, get reviewed, merge
   1932 +  
   1933 +  # 7. Wait for GitHub Action to create Palantir PR
   1934 +  # Check: bits.linode.com/ops/palantir/pulls
   1935 +  
   1936 +  # 8. Review and merge Palantir PR
   1937 +  
   1938 +  # 9. Sync in ArgoCD
   1939 +  # - Visit ArgoCD staging: argocd.infra-o11y-apps.rin1.us.staging.linode.com
   1940 +  # - Find victoriametrics-ord2 application
   1941 +  # - Click "Sync"
   1942 +  # - Wait for deployment to complete
   1943 +  
   1944 +  # 10. Verify
   1945 +  kubectl --context victoriametrics-ord2-us-staging get pods -n victoriametrics
   1946 +  # All pods should show new image version
   1947 +  
   1948 +  # Check Grafana dashboards for errors
   1949 +  
   1950 +  # 11. If successful, repeat for production clusters
   1951 +  ```
   1952 +  
   1953 +  ### Task 5: Create On-call Log
   1954 +  
   1955 +  **Scenario**: It's Monday, time to start your on-call shift.
   1956 +  
   1957 +  **Steps**:
   1958 +  
   1959 +  ```bash
   1960 +  # 1. Navigate to docs directory
   1961 +  cd ~/repos/terraform-observability-team/docs
   1962 +  
   1963 +  # 2. Create on-call log
   1964 +  mise run oncall
   1965 +  # This creates: content/on-call/2025/2025-11-18.md
   1966 +  
   1967 +  # 3. Edit throughout the week
   1968 +  vim content/on-call/2025/2025-11-18.md
   1969 +  
   1970 +  # Add incidents, improvements, notes
   1971 +  
   1972 +  # 4. At end of week, fill out summary
   1973 +  vim content/on-call/2025/2025-11-18.md
   1974 +  
   1975 +  ## Summary
   1976 +  Quiet week. Responded to 2 minor alerts. Improved disk cleanup automation.
   1977 +  
   1978 +  # 5. Commit and push
   1979 +  git add content/on-call/2025/2025-11-18.md
   1980 +  git commit -m "On-call log: 2025-11-18"
   1981 +  git push origin main
   1982 +  
   1983 +  # 6. Post link in Slack
   1984 +  # In #o11y-core:
   1985 +  # "On-call handoff: 
        + https://bits.linode.com/pages/ops/terraform-observability-team/on-call/2025/2025-11-18/"
   1986 +  ```
   1987 +  
   1988 +  ### Task 6: Deploy New Kubernetes Cluster
   1989 +  
   1990 +  **This is a complex multi-day task involving multiple teams.**
   1991 +  
   1992 +  **Prerequisites**:
   1993 +  - Approval from management
   1994 +  - Infrastructure planned (node count, sizes, datacenter)
   1995 +  - Cluster name decided
   1996 +  
   1997 +  **Steps** (abbreviated - see full MOP in docs):
   1998 +  
   1999 +  ```bash
   2000 +  # Day 1: Infrastructure Provisioning
   2001 +  # 1. Create Terraform PR in Linode/terraform-module-infra
   2002 +  # 2. Add cluster definition
   2003 +  # 3. Apply Terraform
   2004 +  # 4. Linodes created
   2005 +  
   2006 +  # Day 2: Server Configuration
   2007 +  # 5. Ask #sre-salt to accept minion keys
   2008 +  # 6. Set Salt grains for cluster nodes
   2009 +  # 7. Run highstate
   2010 +  
   2011 +  # Day 3: Vault & Secrets
   2012 +  # 8. Store cluster approle secret in Vault
   2013 +  # 9. Store kubeconfig in Vault
   2014 +  # 10. Update Vault PKI to allow cluster domain
   2015 +  
   2016 +  # Day 4: Kubernetes Installation
   2017 +  # 11. Run Kubespray Ansible playbook
   2018 +  # 12. Wait 1-2 hours for completion
   2019 +  # 13. Verify cluster accessible
   2020 +  
   2021 +  # Day 5: ArgoCD Integration
   2022 +  # 14. Join cluster to ArgoCD (make join-cluster)
   2023 +  # 15. Create cluster labels
   2024 +  
   2025 +  # Day 6: Application Configuration
   2026 +  # 16. Create values file in o11y-helm-charts
   2027 +  # 17. Create overlays in palantir
   2028 +  # 18. Merge PRs
   2029 +  
   2030 +  # Day 7: Deploy Applications
   2031 +  # 19. Sync applications in ArgoCD
   2032 +  # 20. Verify all apps healthy
   2033 +  # 21. Update documentation
   2034 +  ```
   2035 +  
   2036 +  **This task requires coordination with**:
   2037 +  - SRE Infrastructure (Terraform)
   2038 +  - SRE Salt (minion keys)
   2039 +  - SRE Observability (that's you!)
   2040 +  
   2041 +  ---
   2042 +  
   2043 +  ## 11. Things to Be Careful About
   2044 +  
   2045 +  ### Critical Mistakes to Avoid
   2046 +  
   2047 +  #### 1. DON'T: Edit On-call Rotation Without Updating Start Time
   2048 +  
   2049 +  **Wrong**:
   2050 +  ```hcl
   2051 +  observability_oncall_primary = [
   2052 +    "person2",  # Reordered list
   2053 +    "person1",
   2054 +    "person3",
   2055 +  ]
   2056 +  pagerduty_schedule_starttime = "2025-11-04T11:00:00-05:00"  # OLD DATE
   2057 +  ```
   2058 +  
   2059 +  **Right**:
   2060 +  ```hcl
   2061 +  observability_oncall_primary = [
   2062 +    "person1",  # Current on-call FIRST
   2063 +    "person2",
   2064 +    "person3",
   2065 +  ]
   2066 +  pagerduty_schedule_starttime = "2025-11-18T11:00:00-05:00"  # Current Monday
   2067 +  ```
   2068 +  
   2069 +  **Why**: PagerDuty rotation starts from the date and follows the list. Wrong date = wrong person 
        + on-call.
   2070 +  
   2071 +  #### 2. DON'T: Commit Secrets to Git
   2072 +  
   2073 +  **Wrong**:
   2074 +  ```bash
   2075 +  # In a file:
   2076 +  VAULT_TOKEN=hvs.1234567890abcdef
   2077 +  DATABASE_PASSWORD=supersecret123
   2078 +  ```
   2079 +  
   2080 +  **Right**:
   2081 +  ```bash
   2082 +  # Store in Vault:
   2083 +  vault kv put infra/prod/myapp/secrets password=supersecret123
   2084 +  
   2085 +  # Reference in code:
   2086 +  password = data.vault_generic_secret.myapp.data["password"]
   2087 +  ```
   2088 +  
   2089 +  **Prevention**: Pre-commit hooks help catch this, but always review your diffs!
   2090 +  
   2091 +  #### 3. DON'T: Force Push to Main/Master
   2092 +  
   2093 +  **Wrong**:
   2094 +  ```bash
   2095 +  git push --force origin main
   2096 +  ```
   2097 +  
   2098 +  **Why**: Overwrites history, breaks everyone's local copies, loses work.
   2099 +  
   2100 +  **If you need to fix a commit**: Create a new commit or revert.
   2101 +  
   2102 +  #### 4. DON'T: Skip Atlantis Plan Review
   2103 +  
   2104 +  **Wrong**:
   2105 +  ```
   2106 +  # PR is merged
   2107 +  # Comment: atlantis apply
   2108 +  # (without reading the plan)
   2109 +  ```
   2110 +  
   2111 +  **Right**:
   2112 +  ```
   2113 +  # PR is created
   2114 +  # Atlantis posts plan
   2115 +  # READ THE ENTIRE PLAN
   2116 +  # Verify:
   2117 +  #   - Resources to add/change/destroy
   2118 +  #   - No unexpected changes
   2119 +  #   - No secrets in output
   2120 +  # Then approve, merge, apply
   2121 +  ```
   2122 +  
   2123 +  **Why**: Terraform can destroy resources. Always verify plans.
   2124 +  
   2125 +  #### 5. DON'T: Amend Other People's Commits
   2126 +  
   2127 +  **Wrong**:
   2128 +  ```bash
   2129 +  git commit --amend
   2130 +  git push --force
   2131 +  # (on a commit authored by someone else)
   2132 +  ```
   2133 +  
   2134 +  **Why**: Rewrites history, attributes your changes to someone else.
   2135 +  
   2136 +  **Right**: Create a new commit.
   2137 +  
   2138 +  #### 6. DON'T: Forget to Remove Alert Suppressions
   2139 +  
   2140 +  **Problem**: DC maintenance done, but suppression still active = missed alerts
   2141 +  
   2142 +  **Prevention**:
   2143 +  - Set calendar reminder
   2144 +  - Add comment in PR: "Remove suppression after YYYY-MM-DD"
   2145 +  - Create follow-up ticket
   2146 +  
   2147 +  ### Important Gotchas
   2148 +  
   2149 +  #### Gotcha 1: Staging vs Production Branches
   2150 +  
   2151 +  **Staging**: Uses `main` branch of ops/palantir
   2152 +  **Production**: Uses release tags (e.g., `v2.1.0`)
   2153 +  
   2154 +  **Implication**: Changes appear in staging immediately, production only after release.
   2155 +  
   2156 +  **Process**:
   2157 +  1. Merge to main → staging deployed
   2158 +  2. Test in staging
   2159 +  3. Create release tag
   2160 +  4. Production deployed
   2161 +  
   2162 +  #### Gotcha 2: VictoriaMetrics Upgrade Order
   2163 +  
   2164 +  **Wrong Order**: Global Select first, then LTS clusters
   2165 +  
   2166 +  **Right Order**: LTS clusters first, Global Select last
   2167 +  
   2168 +  **Why**: Global Select queries LTS clusters. If versions are incompatible, queries break.
   2169 +  
   2170 +  #### Gotcha 3: Prometheus Sharding is Time-Consuming
   2171 +  
   2172 +  **Time Required**: 2-8 hours
   2173 +  
   2174 +  **Why**:
   2175 +  - Salt configuration changes
   2176 +  - Restarting Prometheus (data replay)
   2177 +  - Validation
   2178 +  
   2179 +  **Plan Ahead**: Don't start Friday afternoon!
   2180 +  
   2181 +  #### Gotcha 4: Cilium Restarts Required for Network Policies
   2182 +  
   2183 +  **Problem**: New Cilium network policy not enforcing
   2184 +  
   2185 +  **Solution**: Restart Cilium pods
   2186 +  ```bash
   2187 +  kubectl -n kube-system delete pod -l k8s-app=cilium
   2188 +  ```
   2189 +  
   2190 +  **Why**: Some policy changes require pod restart to take effect.
   2191 +  
   2192 +  #### Gotcha 5: Management IPs Can Fail Silently
   2193 +  
   2194 +  **Problem**: Linode provisioned but eth1 (management IP) not configured
   2195 +  
   2196 +  **Check**:
   2197 +  ```bash
   2198 +  ssh server
   2199 +  ip addr show eth1
   2200 +  # Should show IP address
   2201 +  ```
   2202 +  
   2203 +  **Fix**: Re-run Ansible or manually configure.
   2204 +  
   2205 +  ### When to Ask for Help
   2206 +  
   2207 +  **Ask immediately if**:
   2208 +  - Critical alert and you don't know how to fix
   2209 +  - About to run a destructive command
   2210 +  - Unsure about Terraform plan
   2211 +  
   2212 +  **Ask within 30 minutes if**:
   2213 +  - Warning alert and runbook doesn't help
   2214 +  - Stuck on troubleshooting
   2215 +  
   2216 +  **Ask in next business day if**:
   2217 +  - Documentation unclear
   2218 +  - Process question
   2219 +  
   2220 +  **Where to Ask**:
   2221 +  - **#o11y-core** (team private channel) - Team questions
   2222 +  - **#sre-observability** (team public channel) - General questions
   2223 +  - **PagerDuty** (escalate alert) - Can't resolve critical alert
   2224 +  - **Tag @team-sre-o11y** - Need team response
   2225 +  
   2226 +  **It's better to ask than guess!**
   2227 +  
   2228 +  ---
   2229 +  
   2230 +  ## 12. Getting Started Checklist
   2231 +  
   2232 +  ### Week 1: Setup & Access
   2233 +  
   2234 +  - [ ] **Access to Systems**
   2235 +    - [ ] Get added to `team.auto.tfvars` by manager/team lead
   2236 +    - [ ] Verify PagerDuty account created
   2237 +    - [ ] Verify GitHub org access (linode-obs)
   2238 +    - [ ] Verify Bits team access (ops/sre-o11y)
   2239 +    - [ ] Grafana access (https://grafana.linode.com)
   2240 +    - [ ] ArgoCD access (prod, staging, dev)
   2241 +  
   2242 +  - [ ] **Local Development Setup**
   2243 +    - [ ] Install Homebrew (macOS)
   2244 +    - [ ] Install asdf or mise: `brew install asdf` or `brew install mise`
   2245 +    - [ ] Install direnv: `brew install direnv`
   2246 +    - [ ] Add to shell config: `eval "$(direnv hook zsh)"`
   2247 +    - [ ] Install linode-cli: `brew install linode-cli`
   2248 +    - [ ] Install kubectl: `brew install kubectl`
   2249 +    - [ ] Install terraform: `brew install terraform`
   2250 +    - [ ] Install pre-commit: `brew install pre-commit`
   2251 +    - [ ] Install Hugo: `brew install hugo`
   2252 +  
   2253 +  - [ ] **Repository Setup**
   2254 +    ```bash
   2255 +    mkdir ~/repos
   2256 +    cd ~/repos
   2257 +  
   2258 +    # Clone main repos
   2259 +    git clone bits.linode.com:ops/terraform-observability-team
   2260 +    git clone bits.linode.com:ops/o11y-helm-charts
   2261 +    git clone bits.linode.com:ops/palantir
   2262 +    git clone bits.linode.com:ops/prometheus_rules
   2263 +  
   2264 +    # Setup terraform-observability-team
   2265 +    cd terraform-observability-team
   2266 +    asdf install  # or mise install
   2267 +    direnv allow
   2268 +    pre-commit install --install-hooks
   2269 +  
   2270 +    # Test Terraform
   2271 +    vault login -method=oidc username=$USER
   2272 +    terraform plan
   2273 +    # Should succeed without errors
   2274 +    ```
   2275 +  
   2276 +  - [ ] **Slack Channels**
   2277 +    - [ ] Join #sre-observability (public)
   2278 +    - [ ] Get added to #o11y-core (private)
   2279 +    - [ ] Join #notification-o11y
   2280 +    - [ ] Join #notification-o11y-prs
   2281 +    - [ ] Join #notification-prometheus
   2282 +  
   2283 +  ### Week 2: Learning the Codebase
   2284 +  
   2285 +  - [ ] **Read Documentation**
   2286 +    - [ ] Repository README
   2287 +    - [ ] Handbook: On-call Guide
   2288 +    - [ ] Handbook: Tools
   2289 +    - [ ] Handbook: Git Conventions
   2290 +    - [ ] Browse Services documentation
   2291 +    - [ ] Read 2-3 recent proposals
   2292 +  
   2293 +  - [ ] **Explore Repositories**
   2294 +    - [ ] Browse terraform-observability-team structure
   2295 +    - [ ] Review team.auto.tfvars (understand team structure)
   2296 +    - [ ] Look at o11y-helm-charts (understand app configuration)
   2297 +    - [ ] Explore palantir (see Kubernetes manifests)
   2298 +  
   2299 +  - [ ] **Shadow Team Members**
   2300 +    - [ ] Shadow current on-call for a week
   2301 +    - [ ] Attend team meetings
   2302 +    - [ ] Watch someone do a PR review
   2303 +    - [ ] Watch someone deploy to staging
   2304 +  
   2305 +  ### Week 3: First Tasks
   2306 +  
   2307 +  - [ ] **Make First PR**
   2308 +    - [ ] Fix a typo in documentation
   2309 +    - [ ] Add yourself to a team meeting doc
   2310 +    - [ ] Practice PR → review → merge workflow
   2311 +  
   2312 +  - [ ] **Learn Key Services**
   2313 +    - [ ] Access Grafana, explore dashboards
   2314 +    - [ ] Access ArgoCD, browse applications
   2315 +    - [ ] Query Prometheus/VictoriaMetrics from Grafana
   2316 +    - [ ] Search logs in Loki
   2317 +  
   2318 +  - [ ] **Attend Post-Mortem** (if one occurs)
   2319 +    - [ ] Observe incident response
   2320 +    - [ ] Understand RCA process
   2321 +  
   2322 +  ### Week 4: Increasing Responsibility
   2323 +  
   2324 +  - [ ] **Handle First Alert**
   2325 +    - [ ] Acknowledge warning alert
   2326 +    - [ ] Follow runbook
   2327 +    - [ ] Document resolution
   2328 +  
   2329 +  - [ ] **Review First PR**
   2330 +    - [ ] Review PR in ops/prometheus_rules or similar
   2331 +    - [ ] Provide feedback
   2332 +    - [ ] Approve or request changes
   2333 +  
   2334 +  - [ ] **Complete First Intake Request**
   2335 +    - [ ] Grant Grafana access
   2336 +    - [ ] Or handle Nagios user request
   2337 +  
   2338 +  ### Month 2: On-call Training
   2339 +  
   2340 +  - [ ] **On-call Preparation**
   2341 +    - [ ] Read all alert runbooks
   2342 +    - [ ] Practice silencing alerts with amtool
   2343 +    - [ ] Review escalation procedures
   2344 +    - [ ] Shadow on-call for 2nd week
   2345 +  
   2346 +  - [ ] **Add to On-call Rotation**
   2347 +    - [ ] Team lead adds you to rotation
   2348 +    - [ ] Receive first on-call shift assignment
   2349 +  
   2350 +  - [ ] **First On-call Shift**
   2351 +    - [ ] Create on-call log
   2352 +    - [ ] Handle alerts (with backup support)
   2353 +    - [ ] Complete handoff
   2354 +  
   2355 +  ### Month 3: Full Team Member
   2356 +  
   2357 +  - [ ] **Lead First Project**
   2358 +    - [ ] Small improvement or automation
   2359 +    - [ ] Write proposal if needed
   2360 +    - [ ] Implement and deploy
   2361 +  
   2362 +  - [ ] **Contribute to Documentation**
   2363 +    - [ ] Update outdated docs
   2364 +    - [ ] Add new runbook
   2365 +    - [ ] Write MOP for procedure you learned
   2366 +  
   2367 +  - [ ] **Mentor Next New Hire**
   2368 +    - [ ] Share this guide
   2369 +    - [ ] Answer questions
   2370 +    - [ ] Pair on tasks
   2371 +  
   2372 +  ---
   2373 +  
   2374 +  ## 13. Glossary
   2375 +  
   2376 +  ### Terms & Acronyms
   2377 +  
   2378 +  **ArgoCD**: GitOps continuous deployment tool for Kubernetes
   2379 +  
   2380 +  **Atlantis**: Terraform automation tool that runs plans/applies on PRs
   2381 +  
   2382 +  **Bits**: Akamai's internal GitHub instance (bits.linode.com)
   2383 +  
   2384 +  **CCM**: Cloud Controller Manager - Kubernetes component managing cloud resources (NodeBalancers, 
        + firewalls)
   2385 +  
   2386 +  **Cilium**: Container Network Interface (CNI) providing network policies and encryption
   2387 +  
   2388 +  **DC**: Datacenter (e.g., ewr1 = Newark, iad3 = Ashburn)
   2389 +  
   2390 +  **direnv**: Tool to load environment variables when entering a directory
   2391 +  
   2392 +  **GitOps**: Deployment methodology using Git as source of truth
   2393 +  
   2394 +  **Grafana**: Visualization platform for metrics and logs
   2395 +  
   2396 +  **Hugo**: Static site generator used for team documentation
   2397 +  
   2398 +  **Kustomize**: Tool for customizing Kubernetes YAML files
   2399 +  
   2400 +  **Linode**: Akamai's cloud computing platform (VMs, Kubernetes, networking)
   2401 +  
   2402 +  **Loki**: Log aggregation system (like Prometheus but for logs)
   2403 +  
   2404 +  **LTS**: Long-Term Storage (VictoriaMetrics clusters storing metrics for 13 months)
   2405 +  
   2406 +  **Mise**: Task runner and tool version manager (like Make + asdf)
   2407 +  
   2408 +  **MOP**: Manual Operations Procedure - step-by-step guide for complex tasks
   2409 +  
   2410 +  **mTLS**: Mutual TLS - two-way certificate authentication
   2411 +  
   2412 +  **Nagios**: Legacy monitoring system (being replaced)
   2413 +  
   2414 +  **NIL**: Network Internet Listener - LoadBalancer service exposed externally
   2415 +  
   2416 +  **On-call**: Engineer responsible for responding to alerts during their shift
   2417 +  
   2418 +  **OpenTelemetry (OTel)**: Observability framework for metrics, logs, traces
   2419 +  
   2420 +  **otelgw**: OpenTelemetry Gateway - aggregates telemetry per datacenter
   2421 +  
   2422 +  **Palantir**: Repository containing Kubernetes manifests for ArgoCD
   2423 +  
   2424 +  **PagerDuty**: Incident management and on-call scheduling platform
   2425 +  
   2426 +  **Preflight**: Checks run before an operation (e.g., Cilium upgrade)
   2427 +  
   2428 +  **Prometheus**: Time-series database for metrics collection
   2429 +  
   2430 +  **PromQL**: Prometheus Query Language
   2431 +  
   2432 +  **Runbook**: Documentation for responding to specific alerts
   2433 +  
   2434 +  **Salt**: Configuration management tool (like Ansible/Puppet)
   2435 +  
   2436 +  **Shard**: One instance in a group of divided workload (e.g., Prometheus shards)
   2437 +  
   2438 +  **SRE**: Site Reliability Engineering
   2439 +  
   2440 +  **Terraform**: Infrastructure-as-Code tool
   2441 +  
   2442 +  **Thanos**: Prometheus long-term storage and query federation
   2443 +  
   2444 +  **Vault**: Secret management platform
   2445 +  
   2446 +  **VictoriaMetrics**: Time-series database optimized for Prometheus metrics
   2447 +  
   2448 +  **vminsert**: VictoriaMetrics component for data ingestion
   2449 +  
   2450 +  **vmselect**: VictoriaMetrics component for queries
   2451 +  
   2452 +  **vmstorage**: VictoriaMetrics component for data storage
   2453 +  
   2454 +  ---
   2455 +  
   2456 +  ## Additional Resources
   2457 +  
   2458 +  ### Official Documentation
   2459 +  
   2460 +  - **Team Docs**: https://bits.linode.com/pages/ops/terraform-observability-team/
   2461 +  - **Confluence**: Search "SRE Observability" for cross-team docs
   2462 +  - **Jira**: https://track.akamai.com/jira/projects/OY
   2463 +  
   2464 +  ### External Tools Documentation
   2465 +  
   2466 +  - **Prometheus**: https://prometheus.io/docs/
   2467 +  - **VictoriaMetrics**: https://docs.victoriametrics.com/
   2468 +  - **Grafana**: https://grafana.com/docs/
   2469 +  - **ArgoCD**: https://argo-cd.readthedocs.io/
   2470 +  - **Kubernetes**: https://kubernetes.io/docs/
   2471 +  - **Terraform**: https://www.terraform.io/docs/
   2472 +  - **Loki**: https://grafana.com/docs/loki/
   2473 +  
   2474 +  ### Internal Systems
   2475 +  
   2476 +  - **Grafana Production**: https://grafana.linode.com
   2477 +  - **ArgoCD Production**: https://argocd.infra-o11y-apps.iad3.us.prod.linode.com
   2478 +  - **ArgoCD Staging**: https://argocd.infra-o11y-apps.rin1.us.staging.linode.com
   2479 +  - **PagerDuty Schedule**: https://akamai.pagerduty.com/schedules/PSFD91L
   2480 +  - **Vault**: https://vault.infra.linode.com
   2481 +  
   2482 +  ### Slack Channels
   2483 +  
   2484 +  - **#sre-observability**: Team public channel
   2485 +  - **#o11y-core**: Team private channel
   2486 +  - **#notification-o11y**: Alert notifications
   2487 +  - **#notification-o11y-prs**: PR notifications
   2488 +  - **#notification-prometheus**: Prometheus-specific alerts
   2489 +  - **#sre-salt**: Salt team (for minion keys, etc.)
   2490 +  - **#sre-infrastructure**: Infrastructure team
   2491 +  
   2492 +  ### People to Know
   2493 +  
   2494 +  Check `team.auto.tfvars` for current team roster.
   2495 +  
   2496 +  **Team Lead/Manager**: Ask in #o11y-core
   2497 +  
   2498 +  **On-call**: `@firechief-sre-o11y` in Slack
   2499 +  
   2500 +  ---
   2501 +  
   2502 +  ## Final Thoughts
   2503 +  
   2504 +  Welcome to the SRE Observability team! This guide covers a lot, and **you're not expected to 
        + memorize everything immediately**. Use it as a reference as you work through your first few weeks.
   2505 +  
   2506 +  **Key Takeaways**:
   2507 +  
   2508 +  1. **terraform-observability-team** manages team permissions and documentation
   2509 +  2. **VictoriaMetrics, Prometheus, Grafana, ArgoCD** are our core services
   2510 +  3. **GitOps workflow**: Code → PR → Review → Merge → Deploy
   2511 +  4. **On-call is a rotation**, and you'll be trained before your first shift
   2512 +  5. **Ask questions** - the team is here to help!
   2513 +  
   2514 +  **Learning Path**:
   2515 +  - Week 1: Setup and access
   2516 +  - Week 2: Read and explore
   2517 +  - Week 3: First small tasks
   2518 +  - Week 4: First alerts and reviews
   2519 +  - Month 2: On-call training
   2520 +  - Month 3: Full contributor
   2521 +  
   2522 +  **Remember**: Everyone on the team started where you are now. It takes time to learn all the 
        + systems, and that's okay. Focus on understanding the fundamentals first, and the details will come 
        + with experience.
   2523 +  
   2524 +  **Questions?** Ask in #o11y-core!
   2525 +  
   2526 +  **Good luck, and welcome to the team!**
   2527 +  
   2528 +  ---
   2529 +  
   2530 +  **Document Version**: 1.0
   2531 +  **Created**: 2025-11-20
   2532 +  **Last Updated**: 2025-11-20
   2533 +  **Maintained By**: SRE Observability Team

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
DRIFT_DETECTION_COMPLETE_GUIDE.md		DRIFT_DETECTION_COMPLETE_GUIDE.md
DRIFT_DETECTION_PRODUCTION_CHANGES.md		DRIFT_DETECTION_PRODUCTION_CHANGES.md
OY-2079-vmstorage-disk-scaling-strategy.md		OY-2079-vmstorage-disk-scaling-strategy.md
OpenTelemetry-Implementation-Guide.md		OpenTelemetry-Implementation-Guide.md
README.md		README.md
Teleport-Security-Access-Control-Guide.md		Teleport-Security-Access-Control-Guide.md
VictoriaMetrics-Complete-Guide.md		VictoriaMetrics-Complete-Guide.md
o11y teleport		o11y teleport
o11y-k8s-security-controls.md		o11y-k8s-security-controls.md
o11y-k8s-security-controls_updated.md		o11y-k8s-security-controls_updated.md
observability-infrastructure.md		observability-infrastructure.md
terraform-observability-team-guide.md		terraform-observability-team-guide.md
victoriametricsguide		victoriametricsguide

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Packages