Monitoring and Health

Eric Fitzgerald edited this page Apr 8, 2026 · 3 revisions

This guide covers health checks, metrics collection, log aggregation, alerting, and performance and security monitoring for TMI deployments.

Overview

Effective monitoring is critical for maintaining TMI's availability, performance, and security. This guide provides practical procedures for:

  • Health checks and availability monitoring
  • Metrics collection and visualization
  • Log aggregation and analysis
  • Alerting configuration
  • Performance monitoring
  • Security event monitoring

Quick Health Checks

Server Health

# Basic health check (the root endpoint returns API info with health status)
curl https://tmi.example.com/

# Expected response structure:
{
  "status": {
    "code": "ok",              # "ok", "degraded", or "error"
    "time": "2025-01-24T..."
  },
  "service": {
    "name": "TMI",
    "build": "1.3.2-abc1234"   # format: version[-prerelease][+commit]
  },
  "api": {
    "version": "1.4.0",        # from OpenAPI spec, follows semver
    "specification": "https://github.com/ericfitz/tmi/blob/main/api-schema/tmi-openapi.json"
  },
  "operator": {                # optional, present only if configured
    "name": "Acme Corp",
    "contact": "ops@acme.com"
  }
}

# When status is "degraded", the response includes health details:
# "health": {
#   "database": { "status": "healthy"|"unhealthy"|"unknown", "latency_ms": 3, "message": "..." },
#   "redis":    { "status": "healthy"|"unhealthy"|"unknown", "latency_ms": 1, "message": "..." }
# }

# Check OAuth providers
curl https://tmi.example.com/oauth2/providers

Database Health

# PostgreSQL connection test
psql -h postgres-host -U tmi_user -d tmi -c "SELECT 1"

# Check database size
psql -h postgres-host -U tmi_user -d tmi -c "
  SELECT pg_size_pretty(pg_database_size('tmi'))"

# Check table row counts
psql -h postgres-host -U tmi_user -d tmi -c "
  SELECT schemaname, tablename, n_live_tup
  FROM pg_stat_user_tables
  ORDER BY n_live_tup DESC"

Redis Health

# Connection test
redis-cli -h redis-host -p 6379 -a password ping
# Expected: PONG

# Check memory usage
redis-cli -h redis-host -a password info memory | grep used_memory_human

# Check key count
redis-cli -h redis-host -a password DBSIZE

# Check cache hit rate
redis-cli -h redis-host -a password info stats | grep keyspace_hits

Monitoring Architecture

Observability Stack

[TMI Application] --> [Metrics Collection] --> [Time Series DB]
                 --> [Log Aggregation]     --> [Log Storage]
                 --> [Health Checks]       --> [Alerting System]

Key Components

| Component | Purpose | Recommended Tool |
|-----------|---------|------------------|
| Metrics Collection | Application and system metrics | Prometheus |
| Log Aggregation | Centralized logging | Grafana Alloy/Loki (recommended), ELK Stack (alternative) |
| Health Monitoring | Service availability and performance | Built-in health endpoint |
| Alerting | Proactive issue notification | Prometheus AlertManager |
| Dashboards | Visualization | Grafana |

Note: Promtail reached End-of-Life on March 2, 2026. Use Grafana Alloy for new deployments. See TMI-Promtail-Logger for the legacy Promtail setup reference.

Metrics Collection

Application Metrics

TMI tracks performance metrics internally through api/performance_monitor.go.

HTTP Metrics

Key metrics to track:

  • Request rates (requests per second)
  • Response times (P50, P95, P99 percentiles)
  • Error rates (4xx and 5xx responses)
  • Request and response sizes
  • Concurrent request count
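
The request-count and error-rate metrics above can be approximated directly from access logs. A minimal sketch follows; the field layout (method, path, status, duration_ms) is a made-up example, so adjust the awk field index to wherever the status code sits in TMI's actual log format.

```shell
# Sample access-log lines; field 3 is the HTTP status in this made-up layout
sample='GET /threat_models 200 45
GET /threat_models 500 120
POST /diagrams 201 80
GET / 200 5'

summary=$(printf '%s\n' "$sample" | awk '
  { total++; if ($3 >= 500) errors++ }   # count all requests and 5xx responses
  END { printf "requests=%d errors=%d error_rate=%.1f%%", total, errors, errors * 100 / total }')
echo "$summary"    # requests=4 errors=1 error_rate=25.0%
```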

WebSocket Metrics

Track real-time collaboration health:

  • Active WebSocket connections
  • Connection establishment rate
  • Message throughput (messages per second)
  • Connection duration
  • WebSocket errors and disconnections

For details on the WebSocket protocol, see WebSocket-API-Reference.
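
Connection-duration figures from the list above can be sanity-checked with a small awk pass over logged durations. The values below are invented for illustration; in practice they would come from your connection logs or metrics store.

```shell
# Sample WebSocket connection durations in seconds (made-up values)
durations='34 120 5 900 61'

ws_summary=$(printf '%s\n' $durations | awk '
  { sum += $1; if ($1 > max) max = $1 }   # accumulate total and track maximum
  END { printf "connections=%d avg=%.0fs max=%ds", NR, sum / NR, max }')
echo "$ws_summary"    # connections=5 avg=224s max=900s
```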

Business Metrics

Monitor feature usage:

  • User activity (daily and monthly active users)
  • Threat model creation rate
  • Diagram creation and editing activity
  • Collaboration session counts
  • API client integration health
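
Daily active users, for example, can be derived by deduplicating (date, user) pairs and counting per day. This sketch inlines sample data; real input would be extracted from TMI's access logs or the database.

```shell
# Sample "date user" pairs (made-up); duplicates within a day are collapsed
sample='2025-01-24 alice@example.com
2025-01-24 alice@example.com
2025-01-24 bob@example.com
2025-01-25 alice@example.com'

dau=$(printf '%s\n' "$sample" | sort -u \
  | awk '{ c[$1]++ } END { for (d in c) print d, c[d] }' | sort)
echo "$dau"
# 2025-01-24 2
# 2025-01-25 1
```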

Database Metrics

PostgreSQL Monitoring

-- Active connections
SELECT count(*) FROM pg_stat_activity;

-- Long-running queries (over 5 minutes)
SELECT
  pid,
  now() - query_start AS duration,
  query,
  state
FROM pg_stat_activity
WHERE (now() - query_start) > interval '5 minutes';

-- Table sizes
SELECT
  schemaname,
  tablename,
  pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;

-- Index usage statistics
SELECT
  schemaname,
  tablename,
  indexname,
  idx_scan,
  idx_tup_read
FROM pg_stat_user_indexes
ORDER BY idx_scan DESC;

-- Database size and growth
SELECT
  pg_database.datname,
  pg_size_pretty(pg_database_size(pg_database.datname)) AS size
FROM pg_database
ORDER BY pg_database_size(pg_database.datname) DESC;

For more on database administration, see Database-Operations.

Redis Monitoring

# Memory usage and stats
redis-cli -h redis-host -a password info memory

# Key distribution by pattern
redis-cli -h redis-host -a password --scan --pattern "cache:*" | wc -l
redis-cli -h redis-host -a password --scan --pattern "session:*" | wc -l

# Cache hit rate
redis-cli -h redis-host -a password info stats | grep -E "keyspace_hits|keyspace_misses"

# Slow queries
redis-cli -h redis-host -a password slowlog get 10

# Client connections
redis-cli -h redis-host -a password client list

System Metrics

Monitor the following infrastructure resources:

| Metric | What to Watch |
|--------|---------------|
| CPU Utilization | Overall and per-core usage |
| Memory Usage | Application memory, available memory, swap usage |
| Disk I/O | Read/write operations, disk latency |
| Network | Bandwidth utilization, connection counts |
| File Descriptors | Open file descriptor count |

Log Aggregation

Structured Logging

TMI uses structured JSON logging through the internal/slogging package.

Log Configuration

# Configuration options from internal/config/config.go (LoggingConfig struct)
logging:
  level: "info"                          # debug, info, warn, error
  is_dev: true                           # Development mode flag (default: true)
  is_test: false                         # Test mode flag (default: false)
  log_dir: "logs"                        # Default: "logs"
  max_age_days: 7                        # Log retention (default: 7)
  max_size_mb: 100                       # Max file size (default: 100)
  max_backups: 10                        # Number of rotated files (default: 10)
  also_log_to_console: true              # Dual logging (default: true)
  log_api_requests: false                # Request logging
  log_api_responses: false               # Response logging
  log_websocket_messages: false          # WebSocket message logging
  redact_auth_tokens: false              # Security redaction of auth tokens
  suppress_unauthenticated_logs: true    # Suppress logs for unauthenticated requests (default: true)

For the complete set of configuration options, see Configuration-Reference.

Log Categories

| Category | Description |
|----------|-------------|
| Application Logs | Business logic events |
| Access Logs | HTTP request and response records |
| Security Logs | Authentication and authorization events |
| Error Logs | Exceptions and error conditions |
| Performance Logs | Request timing and resource usage |
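
Because TMI logs are structured JSON, a category or level filter can be as simple as a grep on the relevant field (jq gives more robust JSON filtering when available). The sample lines below are illustrative, not actual TMI log output.

```shell
# Two sample structured log lines (invented fields)
sample='{"level":"info","msg":"request completed"}
{"level":"error","msg":"database timeout"}'

# Keep only error-level entries
errors=$(printf '%s\n' "$sample" | grep '"level":"error"')
echo "$errors"    # {"level":"error","msg":"database timeout"}
```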

Promtail Log Collection (Legacy)

TMI documentation includes a containerized Promtail setup for shipping logs to Grafana Cloud or Loki.

Important: Promtail reached End-of-Life on March 2, 2026. Grafana Alloy is the recommended replacement for new deployments. The Promtail container setup documentation is retained for reference at docs/migrated/developer/setup/promtail-container.md. See also TMI-Promtail-Logger.

Starting Promtail

The Promtail Make targets (build-promtail, start-promtail) are no longer included in the project Makefile. To run Promtail manually with Docker:

# Build and run the Promtail container manually
# (see docs/migrated/developer/setup/promtail-container.md)
# Or run with explicit credentials:
LOKI_URL="https://user:pass@logs.grafana.net/api/prom/push" docker run ...

# Check Promtail status
docker logs promtail

Promtail Configuration

Promtail monitors these log locations:

  • Development: ./logs/tmi.log, ./logs/server.log
  • Production: /var/log/tmi/tmi.log

Configuration details are documented in docs/migrated/developer/setup/promtail-container.md.

Verifying Log Collection

# Confirm Promtail is collecting logs
docker logs promtail 2>&1 | grep "Adding target"

# Expected output:
# level=info msg="Adding target" key="/logs/tmi.log:..."
# level=info msg="Adding target" key="/var/log/tmi/tmi.log:..."

# Check for errors
docker logs promtail 2>&1 | grep -i error

ELK Stack Integration

You can use Elasticsearch, Logstash, and Kibana as an alternative log aggregation stack.

Logstash Configuration

# logstash.conf
input {
  file {
    path => "/var/log/tmi/*.log"
    start_position => "beginning"
    codec => json
  }
}

filter {
  if [level] == "error" {
    mutate {
      add_tag => ["error"]
    }
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "tmi-logs-%{+YYYY.MM.dd}"
  }
}

Querying Logs in Kibana

Common queries:

# All errors
level: "error"

# Authentication failures
level: "error" AND message: "authentication"

# Slow requests (over 2 seconds)
duration_ms: >2000

# Specific user activity
user_email: "user@example.com"

# WebSocket events
message: "websocket"

Health Checks

Service Health Endpoints

API Health Check

The root endpoint (/) provides comprehensive health information:

# The root endpoint returns API info with health status
curl https://tmi.example.com/

# Response structure: identical to the example shown under Quick Health
# Checks above. status.code is "ok", "degraded", or "error"; when degraded,
# a "health" object is included with per-dependency status, latency_ms,
# and message for the database and redis checks.

# OAuth provider check
curl https://tmi.example.com/oauth2/providers
# Response lists the enabled providers

Automated Health Monitoring

Create a health check script:

#!/bin/bash
# health-check.sh

HEALTH_URL="https://tmi.example.com/"
LOG_FILE="/var/log/tmi/health-check.log"

RESPONSE=$(curl -s "$HEALTH_URL")
STATUS=$(echo "$RESPONSE" | jq -r '.status.code')

if [ "$STATUS" = "ok" ]; then
    echo "$(date): TMI server is healthy" >> "$LOG_FILE"
    exit 0
else
    echo "$(date): TMI server status: $STATUS" >> "$LOG_FILE"
    # Send alert
    curl -X POST https://alerts.example.com/webhook \
      -d '{"service": "tmi", "status": "'"$STATUS"'"}'
    exit 1
fi

Schedule the script with cron:

# Check every 5 minutes
*/5 * * * * /usr/local/bin/health-check.sh

Kubernetes Probes

If you are running TMI on Kubernetes, configure liveness and readiness probes:

livenessProbe:
  httpGet:
    path: /
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 2

For container deployment details, see OCI-Container-Deployment.

Alerting Configuration

Alert Categories

Critical Alerts (Immediate Response)

| Alert | Trigger Condition |
|-------|-------------------|
| Service Down | TMI server unavailable |
| Database Failure | PostgreSQL connection failures |
| Authentication Outage | OAuth provider failures |
| High Error Rate | >5% error rate sustained for 5+ minutes |
| Resource Exhaustion | >90% CPU or memory usage |

Warning Alerts (Monitored Response)

| Alert | Trigger Condition |
|-------|-------------------|
| Performance Degradation | Response times >2x baseline |
| Cache Issues | Redis connection problems or high miss rate |
| Storage Issues | Disk usage >80% |
| Backup Failures | Database backup failures |
| Integration Issues | Client integration problems |

Info Alerts (Awareness Only)

  • Capacity planning: resource usage trends
  • Performance trends: gradual performance changes
  • Usage patterns: user activity changes
  • Security events: unusual authentication patterns
  • Maintenance reminders: certificate renewal, updates

Alert Examples

Service Unavailable

# Prometheus alerting rule (notifications routed through AlertManager)
- alert: TMIServiceDown
  expr: up{job="tmi-server"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "TMI service is down"
    description: "TMI server has been down for more than 2 minutes"

High Error Rate

- alert: HighErrorRate
  expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High error rate detected"
    description: "Error rate is {{ $value | humanizePercentage }}"

Database Connection Failure

- alert: DatabaseConnectionFailure
  expr: postgresql_up{job="postgres"} == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Database connection failure"
    description: "Cannot connect to PostgreSQL database"

Notification Channels

Configure multiple notification channels for comprehensive coverage:

| Channel | Use Case |
|---------|----------|
| Email | Critical alerts |
| Slack / Teams | Team notifications |
| PagerDuty / OpsGenie | On-call escalation |
| Webhooks | Custom integrations (see Webhook-Integration) |

Performance Monitoring

Application Performance

Response Time Monitoring

Track response time percentiles against the following targets:

| Percentile | Target |
|------------|--------|
| P50 (Median) | <100ms |
| P95 | <500ms |
| P99 | <1000ms |
| P99.9 | Track for outlier detection |
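
If all you have is a raw list of response times, nearest-rank percentiles can be computed with sort and awk. The latencies below are invented; a metrics system like Prometheus would normally compute these for you.

```shell
# Sample response times in milliseconds (made-up values)
latencies='12 45 33 120 87 19 250 64 41 980'

pct=$(printf '%s\n' $latencies | sort -n | awk '
  { v[NR] = $1 }                          # collect values in sorted order
  END {
    p50 = v[int(NR * 0.50 + 0.5)]         # nearest-rank percentile index
    p95 = v[int(NR * 0.95 + 0.5)]
    printf "P50=%sms P95=%sms", p50, p95
  }')
echo "$pct"    # P50=45ms P95=980ms
```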

Throughput Monitoring

Monitor requests per second:

  • Baseline throughput under normal load
  • Peak throughput capacity
  • Sustained throughput over time

For additional performance tuning guidance, see Performance-and-Scaling.

Resource Usage

Track application resource consumption:

# For a systemd service
systemctl status tmi

# For a containerized deployment
docker stats tmi-server

# For Kubernetes
kubectl top pod -n tmi

Database Performance

Query Performance Analysis

-- Enable the pg_stat_statements extension
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Find slow queries (over 100ms average)
-- Note: on PostgreSQL 13+ the columns are mean_exec_time and total_exec_time;
-- mean_time/total_time apply to PostgreSQL 12 and earlier. Adjust accordingly.
SELECT
  query,
  mean_time,
  calls,
  total_time
FROM pg_stat_statements
WHERE mean_time > 100
ORDER BY mean_time DESC
LIMIT 20;

-- Query performance by table
SELECT
  schemaname,
  tablename,
  seq_scan,
  idx_scan,
  n_tup_ins,
  n_tup_upd,
  n_tup_del
FROM pg_stat_user_tables
ORDER BY seq_scan DESC;

Connection Pool Monitoring

Monitor database connection usage:

-- Active connections by state
SELECT
  state,
  count(*)
FROM pg_stat_activity
GROUP BY state;

-- Long-running transactions (over 1 minute)
SELECT
  pid,
  now() - xact_start AS duration,
  state,
  query
FROM pg_stat_activity
WHERE xact_start < now() - interval '1 minute'
ORDER BY duration DESC;

Cache Performance

Redis Performance Metrics

# Cache hit rate calculation
redis-cli -h redis-host -a password info stats | \
  awk '/keyspace_hits|keyspace_misses/ {
    split($0,a,":");
    if ($1 ~ /hits/) hits=a[2];
    if ($1 ~ /misses/) misses=a[2]
  }
  END {
    total = hits + misses;
    if (total > 0)
      printf "Hit Rate: %.2f%%\n", (hits / total) * 100;
    else
      print "Hit Rate: n/a (no samples yet)"
  }'

# Monitor cache latency
redis-cli -h redis-host -a password --latency-history

# Check slow commands
redis-cli -h redis-host -a password slowlog get 10

Security Monitoring

Authentication Events

Monitor authentication and authorization activity:

# View authentication logs
tail -f /var/log/tmi/tmi.log | grep -E "authentication|authorization"

# Count failed login attempts
grep "authentication failed" /var/log/tmi/tmi.log | wc -l

# Identify suspicious activity (group failures by source)
grep "authentication failed" /var/log/tmi/tmi.log | \
  awk '{print $NF}' | sort | uniq -c | sort -rn

For complete security monitoring procedures, see Security-Operations.

Security Alerts

Set up alerts for the following security events:

  • Failed authentication attempts (more than 5 in 5 minutes)
  • Unauthorized access attempts
  • Suspicious API usage patterns
  • Certificate expiration warnings
  • Unusual data access patterns

See also Security-Best-Practices for hardening recommendations.
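
The first alert in the list (more than 5 failures in a window) reduces to counting failures per source. A minimal sketch follows, using an invented log format with an ip= field; adapt the field separator to your actual security logs.

```shell
# Sample failed-auth log lines (made-up format); 10.0.0.5 fails 6 times
sample='auth failed ip=10.0.0.5
auth failed ip=10.0.0.5
auth failed ip=10.0.0.5
auth failed ip=10.0.0.5
auth failed ip=10.0.0.5
auth failed ip=10.0.0.5
auth failed ip=192.168.1.9'

alerts=$(printf '%s\n' "$sample" | awk -F'ip=' '
  { count[$2]++ }                                  # tally failures per source IP
  END { for (ip in count) if (count[ip] > 5) print ip }')
echo "suspicious sources: $alerts"    # suspicious sources: 10.0.0.5
```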

Dashboards

Grafana Dashboard Examples

System Overview Dashboard

Include the following panels:

  • Service uptime percentage
  • Request rate (requests per second)
  • Response time percentiles
  • Error rate percentage
  • Active users
  • Database connection count
  • Redis memory usage
  • CPU and memory utilization

Database Dashboard

Include the following panels:

  • Connection count over time
  • Query performance metrics
  • Table sizes
  • Index usage
  • Replication lag (if applicable)
  • Database size growth

Application Dashboard

Include the following panels:

  • HTTP request rate by endpoint
  • WebSocket connection count
  • User activity (threat models and diagrams created)
  • API error rates by endpoint
  • OAuth authentication success rate

Troubleshooting Monitoring Issues

Metrics Not Appearing

Symptom: Metrics are not showing in your monitoring system.

Note: TMI does not expose a /metrics endpoint natively. You need to configure external metrics collection.
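
One common pattern, given that there is no native /metrics endpoint, is to probe the root endpoint through blackbox_exporter and scrape separate exporters for PostgreSQL and Redis. A sketch of a Prometheus scrape job under that assumption (the blackbox-exporter:9115 address is a placeholder):

```yaml
# Hypothetical Prometheus job: probe TMI's root endpoint via blackbox_exporter
scrape_configs:
  - job_name: "tmi-health"
    metrics_path: /probe
    params:
      module: [http_2xx]            # expect a 2xx response from the probed URL
    static_configs:
      - targets: ["https://tmi.example.com/"]
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115   # placeholder exporter address
```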

Steps to check:

# Verify that the TMI root endpoint is responding
curl http://localhost:8080/

# Check Prometheus scrape configuration and targets
curl http://prometheus:9090/api/v1/targets

Log Collection Failing

Symptom: Logs are not appearing in your log aggregation system.

For Promtail:

# Check the Promtail container status
docker logs promtail

# Verify that log files exist and are readable
ls -la /var/log/tmi/

# Check the Promtail configuration
docker exec promtail cat /tmp/promtail-config.yaml

For ELK:

# Check Logstash status
systemctl status logstash

# Test Elasticsearch connectivity
curl http://elasticsearch:9200/_cluster/health

# Check the Logstash pipeline
curl -XGET 'localhost:9600/_node/stats/pipelines?pretty'

Alerts Not Firing

Steps to check:

# Verify AlertManager configuration
curl http://alertmanager:9093/api/v2/status

# Check alert rules
curl http://prometheus:9090/api/v1/rules

# Test notification channels by sending a test alert through your webhook or email

Best Practices

Monitoring Checklist

  • Health checks configured and running
  • Metrics collection enabled
  • Log aggregation configured
  • Critical alerts defined and tested
  • Dashboards created and shared with the team
  • Alert notification channels tested
  • Runbooks created for common issues
  • On-call rotation established
  • Regular review of monitoring data scheduled
  • Capacity planning based on trends

Retention Policies

Configure appropriate retention periods:

| Data Type | Retention |
|-----------|-----------|
| Metrics | 30-90 days (high-resolution), 1 year (aggregated) |
| Logs | 30-90 days (compliance dependent) |
| Alerts | 90 days of alert history |
| Dashboards | Version-controlled in Git |
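
For sizing log storage against a retention window, simple shell arithmetic is enough. The 500 MB/day figure below is an assumption; measure your actual daily log volume before sizing storage.

```shell
# Rough storage estimate: daily volume (MB) x retention (days), in GB
daily_mb=500         # assumed daily log volume -- measure yours
retention_days=90

estimate="$(( daily_mb * retention_days / 1024 )) GB"
echo "$estimate"    # 43 GB
```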

Security Considerations

  • Protect monitoring endpoints with authentication
  • Encrypt metrics and log data in transit
  • Sanitize logs to remove sensitive data (see the redact_auth_tokens configuration option)
  • Restrict access to monitoring dashboards
  • Audit monitoring system access
