Monitoring and Health
This guide covers health checks, metrics collection, log aggregation, alerting, and performance and security monitoring for TMI deployments.
Effective monitoring is critical for maintaining TMI's availability, performance, and security. This guide provides practical procedures for:
- Health checks and availability monitoring
- Metrics collection and visualization
- Log aggregation and analysis
- Alerting configuration
- Performance monitoring
- Security event monitoring
```bash
# Basic health check (the root endpoint returns API info with health status)
curl https://tmi.example.com/

# Expected response structure:
{
  "status": {
    "code": "ok",                      # "ok", "degraded", or "error"
    "time": "2025-01-24T..."
  },
  "service": {
    "name": "TMI",
    "build": "1.3.2-abc1234"           # format: version[-prerelease][+commit]
  },
  "api": {
    "version": "1.4.0",                # from OpenAPI spec, follows semver
    "specification": "https://github.com/ericfitz/tmi/blob/main/api-schema/tmi-openapi.json"
  },
  "operator": {                        # optional, present only if configured
    "name": "Acme Corp",
    "contact": "ops@acme.com"
  }
}

# When status is "degraded", the response includes health details:
# "health": {
#   "database": { "status": "healthy"|"unhealthy"|"unknown", "latency_ms": 3, "message": "..." },
#   "redis": { "status": "healthy"|"unhealthy"|"unknown", "latency_ms": 1, "message": "..." }
# }
```
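The status payload can drive an automated probe. A minimal sketch, assuming the response shape above; the 0/1/2 severity scale is an illustrative convention, not part of TMI:

```python
# Sketch: mapping the root endpoint's status.code to a probe severity.
# The 0/1/2 severity scale is an assumption for illustration.
def probe_severity(payload):
    code = payload.get("status", {}).get("code", "error")
    return {"ok": 0, "degraded": 1}.get(code, 2)  # 2 = critical/unknown

response = {
    "status": {"code": "degraded", "time": "2025-01-24T12:00:00Z"},
    "health": {"database": {"status": "unhealthy", "latency_ms": 3}},
}
print(probe_severity(response))  # -> 1
```

Treating a missing or unrecognized status as critical keeps the probe fail-safe when the server returns garbage or nothing at all.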
```bash
# Check OAuth providers
curl https://tmi.example.com/oauth2/providers
```

```bash
# PostgreSQL connection test
psql -h postgres-host -U tmi_user -d tmi -c "SELECT 1"

# Check database size
psql -h postgres-host -U tmi_user -d tmi -c "
SELECT pg_size_pretty(pg_database_size('tmi'))"

# Check table row counts
psql -h postgres-host -U tmi_user -d tmi -c "
SELECT schemaname, relname, n_live_tup
FROM pg_stat_user_tables
ORDER BY n_live_tup DESC"
```

```bash
# Connection test
redis-cli -h redis-host -p 6379 -a password ping
# Expected: PONG

# Check memory usage
redis-cli -h redis-host -a password info memory | grep used_memory_human

# Check key count
redis-cli -h redis-host -a password DBSIZE

# Check cache hit rate
redis-cli -h redis-host -a password info stats | grep keyspace_hits
```

```
[TMI Application] --> [Metrics Collection] --> [Time Series DB]
                  --> [Log Aggregation]   --> [Log Storage]
                  --> [Health Checks]     --> [Alerting System]
```
| Component | Purpose | Recommended Tool |
|---|---|---|
| Metrics Collection | Application and system metrics | Prometheus |
| Log Aggregation | Centralized logging | Grafana Alloy/Loki (recommended), ELK Stack (alternative) |
| Health Monitoring | Service availability and performance | Built-in health endpoint |
| Alerting | Proactive issue notification | Prometheus AlertManager |
| Dashboards | Visualization | Grafana |
Note: Promtail reached End-of-Life on March 2, 2026. Use Grafana Alloy for new deployments. See TMI-Promtail-Logger for the legacy Promtail setup reference.
TMI tracks performance metrics internally through api/performance_monitor.go.
Key metrics to track:
- Request rates (requests per second)
- Response times (P50, P95, P99 percentiles)
- Error rates (4xx and 5xx responses)
- Request and response sizes
- Concurrent request count
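As a sketch of how the P50/P95/P99 figures are derived from raw duration samples, using the nearest-rank method; the sample data below is illustrative, not TMI output:

```python
# Sketch: nearest-rank percentiles over request-duration samples (ms).
# Sample data is illustrative.
def percentile(samples, p):
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, -(-len(ordered) * p // 100))  # ceil(len * p / 100)
    return ordered[int(rank) - 1]

durations_ms = [12, 18, 25, 31, 44, 52, 70, 95, 140, 480]
for p in (50, 95, 99):
    print(f"P{p}: {percentile(durations_ms, p)}ms")
```

Note how a single slow outlier (480ms here) dominates the high percentiles while leaving the median untouched, which is why P95/P99 are tracked separately from P50.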
Track real-time collaboration health:
- Active WebSocket connections
- Connection establishment rate
- Message throughput (messages per second)
- Connection duration
- WebSocket errors and disconnections
For details on the WebSocket protocol, see WebSocket-API-Reference.
Monitor feature usage:
- User activity (daily and monthly active users)
- Threat model creation rate
- Diagram creation and editing activity
- Collaboration session counts
- API client integration health
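Daily active users can be derived from activity events. A minimal sketch with illustrative data; TMI does not ship this aggregation built in:

```python
# Sketch: deriving daily active users from (date, user) activity events.
# The event data is illustrative.
from collections import defaultdict

events = [
    ("2025-01-24", "alice@example.com"),
    ("2025-01-24", "bob@example.com"),
    ("2025-01-24", "alice@example.com"),  # repeat visits count once
    ("2025-01-25", "alice@example.com"),
]

daily_active = defaultdict(set)
for day, user in events:
    daily_active[day].add(user)  # a set deduplicates per-day activity

for day in sorted(daily_active):
    print(day, len(daily_active[day]))
```

The same grouping over month keys yields monthly active users.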
```sql
-- Active connections
SELECT count(*) FROM pg_stat_activity;

-- Long-running queries (over 5 minutes)
SELECT
  pid,
  now() - query_start AS duration,
  query,
  state
FROM pg_stat_activity
WHERE (now() - query_start) > interval '5 minutes';

-- Table sizes
SELECT
  schemaname,
  tablename,
  pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS size
FROM pg_tables
WHERE schemaname = 'public'
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;

-- Index usage statistics
SELECT
  schemaname,
  relname,
  indexrelname,
  idx_scan,
  idx_tup_read
FROM pg_stat_user_indexes
ORDER BY idx_scan DESC;

-- Database size and growth
SELECT
  pg_database.datname,
  pg_size_pretty(pg_database_size(pg_database.datname)) AS size
FROM pg_database
ORDER BY pg_database_size(pg_database.datname) DESC;
```

For more on database administration, see Database-Operations.
```bash
# Memory usage and stats
redis-cli -h redis-host -a password info memory

# Key distribution by pattern
redis-cli -h redis-host -a password --scan --pattern "cache:*" | wc -l
redis-cli -h redis-host -a password --scan --pattern "session:*" | wc -l

# Cache hit rate
redis-cli -h redis-host -a password info stats | grep -E "keyspace_hits|keyspace_misses"

# Slow queries
redis-cli -h redis-host -a password slowlog get 10

# Client connections
redis-cli -h redis-host -a password client list
```

Monitor the following infrastructure resources:
| Metric | What to Watch |
|---|---|
| CPU Utilization | Overall and per-core usage |
| Memory Usage | Application memory, available memory, swap usage |
| Disk I/O | Read/write operations, disk latency |
| Network | Bandwidth utilization, connection counts |
| File Descriptors | Open file descriptor count |
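File descriptor usage can be checked directly from `/proc` on a Linux host. A minimal sketch (Linux-only; the 80% warning threshold is an illustrative choice):

```python
# Sketch: checking a process's open file descriptor count on Linux
# by listing /proc/<pid>/fd. The 80% threshold is an assumption.
import os

def open_fd_count(pid):
    return len(os.listdir(f"/proc/{pid}/fd"))

# Example: check our own process against the soft limit
count = open_fd_count(os.getpid())
limit = os.sysconf("SC_OPEN_MAX")
print(f"open fds: {count}/{limit}")
if count > 0.8 * limit:
    print("WARNING: file descriptor usage above 80%")
```

Run the same check against the TMI server's PID to catch descriptor leaks before the process hits its limit.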
TMI uses structured JSON logging through the internal/slogging package.
```yaml
# Configuration options from internal/config/config.go (LoggingConfig struct)
logging:
  level: "info"                        # debug, info, warn, error
  is_dev: true                         # Development mode flag (default: true)
  is_test: false                       # Test mode flag (default: false)
  log_dir: "logs"                      # Default: "logs"
  max_age_days: 7                      # Log retention (default: 7)
  max_size_mb: 100                     # Max file size (default: 100)
  max_backups: 10                      # Number of rotated files (default: 10)
  also_log_to_console: true            # Dual logging (default: true)
  log_api_requests: false              # Request logging
  log_api_responses: false             # Response logging
  log_websocket_messages: false        # WebSocket message logging
  redact_auth_tokens: false            # Security redaction of auth tokens
  suppress_unauthenticated_logs: true  # Suppress logs for unauthenticated requests (default: true)
```

For the complete set of configuration options, see Configuration-Reference.
| Category | Description |
|---|---|
| Application Logs | Business logic events |
| Access Logs | HTTP request and response records |
| Security Logs | Authentication and authorization events |
| Error Logs | Exceptions and error conditions |
| Performance Logs | Request timing and resource usage |
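Since TMI emits one JSON object per line, the categories above are easy to tally programmatically. A minimal sketch; the `level`/`msg` field names are assumed from typical slog output:

```python
# Sketch: tallying structured JSON log lines by level.
# Field names ("level", "msg") are assumed, and the lines are illustrative.
import json
from collections import Counter

sample_lines = [
    '{"level":"info","msg":"request completed","duration_ms":42}',
    '{"level":"error","msg":"authentication failed"}',
    '{"level":"info","msg":"request completed","duration_ms":18}',
]

counts = Counter(json.loads(line).get("level", "unknown")
                 for line in sample_lines)
print(dict(counts))  # -> {'info': 2, 'error': 1}
```

The same loop over a real log file gives a quick error-to-info ratio without standing up a full aggregation stack.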
TMI documentation includes a containerized Promtail setup for shipping logs to Grafana Cloud or Loki.
Important: Promtail reached End-of-Life on March 2, 2026. Grafana Alloy is the recommended replacement for new deployments. The Promtail container setup documentation is retained for reference at
docs/migrated/developer/setup/promtail-container.md. See also TMI-Promtail-Logger.
The Promtail Make targets (build-promtail, start-promtail) are no longer included in the project Makefile. To run Promtail manually with Docker:
```bash
# Build and run the Promtail container manually
# (see docs/migrated/developer/setup/promtail-container.md)

# Or run with explicit credentials:
LOKI_URL="https://user:pass@logs.grafana.net/api/prom/push" docker run ...

# Check Promtail status
docker logs promtail
```

Promtail monitors these log locations:

- Development: `./logs/tmi.log`, `./logs/server.log`
- Production: `/var/log/tmi/tmi.log`

Configuration details are documented in docs/migrated/developer/setup/promtail-container.md.

```bash
# Confirm Promtail is collecting logs
docker logs promtail 2>&1 | grep "Adding target"

# Expected output:
# level=info msg="Adding target" key="/logs/tmi.log:..."
# level=info msg="Adding target" key="/var/log/tmi/tmi.log:..."

# Check for errors
docker logs promtail 2>&1 | grep -i error
```

You can use Elasticsearch, Logstash, and Kibana as an alternative log aggregation stack.
```
# logstash.conf
input {
  file {
    path => "/var/log/tmi/*.log"
    start_position => "beginning"
    codec => json
  }
}

filter {
  if [level] == "error" {
    mutate {
      add_tag => ["error"]
    }
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "tmi-logs-%{+YYYY.MM.dd}"
  }
}
```

Common queries:
```
# All errors
level: "error"

# Authentication failures
level: "error" AND message: "authentication"

# Slow requests (over 2 seconds)
duration_ms: >2000

# Specific user activity
user_email: "user@example.com"

# WebSocket events
message: "websocket"
```
The root endpoint (/) provides comprehensive health information (the full response structure is shown in the health check section at the top of this guide):

```bash
# The root endpoint returns API info with health status
curl https://tmi.example.com/
# Returns the status/service/api/operator structure described earlier;
# when status is "degraded", per-dependency health details are included

# OAuth provider check
curl https://tmi.example.com/oauth2/providers
# Response lists the enabled providers
```

Create a health check script:
```bash
#!/bin/bash
# health-check.sh
HEALTH_URL="https://tmi.example.com/"
LOG_FILE="/var/log/tmi/health-check.log"

# --max-time prevents the probe from hanging on a stalled server
RESPONSE=$(curl -s --max-time 10 "$HEALTH_URL")
STATUS=$(echo "$RESPONSE" | jq -r '.status.code')

if [ "$STATUS" = "ok" ]; then
  echo "$(date): TMI server is healthy" >> "$LOG_FILE"
  exit 0
else
  echo "$(date): TMI server status: $STATUS" >> "$LOG_FILE"
  # Send alert
  curl -X POST https://alerts.example.com/webhook \
    -d '{"service": "tmi", "status": "'"$STATUS"'"}'
  exit 1
fi
```

Schedule the script with cron:

```bash
# Check every 5 minutes
*/5 * * * * /usr/local/bin/health-check.sh
```

If you are running TMI on Kubernetes, configure liveness and readiness probes:
```yaml
livenessProbe:
  httpGet:
    path: /
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 2
```

For container deployment details, see OCI-Container-Deployment.
| Alert | Trigger Condition |
|---|---|
| Service Down | TMI server unavailable |
| Database Failure | PostgreSQL connection failures |
| Authentication Outage | OAuth provider failures |
| High Error Rate | >5% error rate sustained for 5+ minutes |
| Resource Exhaustion | >90% CPU or memory usage |
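The "High Error Rate" trigger above (>5% sustained for 5+ minutes) can be sketched as a check over per-minute counters. The data and field layout are illustrative; in practice a PromQL rule (shown later in this guide) does this evaluation:

```python
# Sketch: evaluating ">5% error rate sustained for 5+ minutes"
# over per-minute (total_requests, errors) counters. Data is illustrative.
def sustained_high_error_rate(samples, threshold=0.05, minutes=5):
    """samples: list of (total_requests, errors) per minute, newest last."""
    if len(samples) < minutes:
        return False
    recent = samples[-minutes:]
    # every minute in the window must exceed the threshold ("sustained")
    return all(total > 0 and errors / total > threshold
               for total, errors in recent)

per_minute = [(200, 4), (210, 30), (190, 25), (205, 28), (195, 22), (200, 26)]
print(sustained_high_error_rate(per_minute))  # -> True
```

Requiring every minute in the window to breach the threshold avoids paging on a single bad minute.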
| Alert | Trigger Condition |
|---|---|
| Performance Degradation | Response times >2x baseline |
| Cache Issues | Redis connection problems or high miss rate |
| Storage Issues | Disk usage >80% |
| Backup Failures | Database backup failures |
| Integration Issues | Client integration problems |
- Capacity planning: resource usage trends
- Performance trends: gradual performance changes
- Usage patterns: user activity changes
- Security events: unusual authentication patterns
- Maintenance reminders: certificate renewal, updates
```yaml
# Prometheus alerting rules (routed through AlertManager)
- alert: TMIServiceDown
  expr: up{job="tmi-server"} == 0
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "TMI service is down"
    description: "TMI server has been down for more than 2 minutes"

- alert: HighErrorRate
  expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "High error rate detected"
    description: "Error rate is {{ $value | humanizePercentage }}"

- alert: DatabaseConnectionFailure
  expr: postgresql_up{job="postgres"} == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Database connection failure"
    description: "Cannot connect to PostgreSQL database"
```

Configure multiple notification channels for comprehensive coverage:
| Channel | Use Case |
|---|---|
| Email | Critical alerts |
| Slack / Teams | Team notifications |
| PagerDuty / OpsGenie | On-call escalation |
| Webhooks | Custom integrations (see Webhook-Integration) |
Track response time percentiles against the following targets:
| Percentile | Target |
|---|---|
| P50 (Median) | <100ms |
| P95 | <500ms |
| P99 | <1000ms |
| P99.9 | Track for outlier detection |
Monitor requests per second:
- Baseline throughput under normal load
- Peak throughput capacity
- Sustained throughput over time
For additional performance tuning guidance, see Performance-and-Scaling.
Track application resource consumption:
```bash
# For a systemd service
systemctl status tmi

# For a containerized deployment
docker stats tmi-server

# For Kubernetes
kubectl top pod -n tmi
```

```sql
-- Enable the pg_stat_statements extension
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;

-- Find slow queries (over 100ms average)
-- Note: on PostgreSQL 13+, use mean_exec_time and total_exec_time
SELECT
  query,
  mean_time,
  calls,
  total_time
FROM pg_stat_statements
WHERE mean_time > 100
ORDER BY mean_time DESC
LIMIT 20;

-- Query performance by table
SELECT
  schemaname,
  relname,
  seq_scan,
  idx_scan,
  n_tup_ins,
  n_tup_upd,
  n_tup_del
FROM pg_stat_user_tables
ORDER BY seq_scan DESC;
```

Monitor database connection usage:
```sql
-- Active connections by state
SELECT
  state,
  count(*)
FROM pg_stat_activity
GROUP BY state;

-- Long-running transactions (over 1 minute)
SELECT
  pid,
  now() - xact_start AS duration,
  state,
  query
FROM pg_stat_activity
WHERE xact_start < now() - interval '1 minute'
ORDER BY duration DESC;
```

```bash
# Cache hit rate calculation (+0 coerces away the trailing \r that
# redis-cli INFO output carries on each line)
redis-cli -h redis-host -a password info stats | \
  awk -F: '/keyspace_hits|keyspace_misses/ {
    if ($1 ~ /hits/) hits=$2+0;
    if ($1 ~ /misses/) misses=$2+0
  }
  END {
    total=hits+misses;
    if (total > 0) printf "Hit Rate: %.2f%%\n", (hits/total)*100
  }'

# Monitor cache latency
redis-cli -h redis-host -a password --latency-history

# Check slow commands
redis-cli -h redis-host -a password slowlog get 10
```

Monitor authentication and authorization activity:
```bash
# View authentication logs
tail -f /var/log/tmi/tmi.log | grep -E "authentication|authorization"

# Count failed login attempts
grep -c "authentication failed" /var/log/tmi/tmi.log

# Identify suspicious activity (group failures by source)
grep "authentication failed" /var/log/tmi/tmi.log | \
  awk '{print $NF}' | sort | uniq -c | sort -rn
```

For complete security monitoring procedures, see Security-Operations.
Set up alerts for the following security events:
- Failed authentication attempts (more than 5 in 5 minutes)
- Unauthorized access attempts
- Suspicious API usage patterns
- Certificate expiration warnings
- Unusual data access patterns
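The first trigger in the list above (more than 5 failures in 5 minutes) can be sketched as a sliding-window check over failed-login events. The event shape and data are illustrative:

```python
# Sketch: flagging sources with more than 5 failed logins within a
# 5-minute window. Thresholds match the list above; event shape assumed.
from collections import defaultdict

WINDOW_SECONDS = 300
THRESHOLD = 5

def brute_force_sources(events):
    """events: iterable of (epoch_seconds, source_ip) failed-login records."""
    flagged = set()
    by_source = defaultdict(list)
    for ts, ip in sorted(events):
        times = by_source[ip]
        times.append(ts)
        # slide the window: drop attempts older than 5 minutes
        while times and ts - times[0] > WINDOW_SECONDS:
            times.pop(0)
        if len(times) > THRESHOLD:
            flagged.add(ip)
    return flagged

events = [(i * 30, "203.0.113.9") for i in range(8)] + [(100, "198.51.100.7")]
print(brute_force_sources(events))  # -> {'203.0.113.9'}
```

A production version would feed from the authentication log stream and emit an alert rather than returning a set.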
See also Security-Best-Practices for hardening recommendations.
Include the following panels:
- Service uptime percentage
- Request rate (requests per second)
- Response time percentiles
- Error rate percentage
- Active users
- Database connection count
- Redis memory usage
- CPU and memory utilization
Include the following panels:
- Connection count over time
- Query performance metrics
- Table sizes
- Index usage
- Replication lag (if applicable)
- Database size growth
Include the following panels:
- HTTP request rate by endpoint
- WebSocket connection count
- User activity (threat models and diagrams created)
- API error rates by endpoint
- OAuth authentication success rate
Symptom: Metrics are not showing in your monitoring system.
Note: TMI does not expose a /metrics endpoint natively. You need to configure external metrics collection.
Steps to check:
```bash
# Verify that the TMI root endpoint is responding
curl http://localhost:8080/

# Check Prometheus scrape configuration and targets
curl http://prometheus:9090/api/v1/targets
```

Symptom: Logs are not appearing in your log aggregation system.
For Promtail:
```bash
# Check the Promtail container status
docker logs promtail

# Verify that log files exist and are readable
ls -la /var/log/tmi/

# Check the Promtail configuration
docker exec promtail cat /tmp/promtail-config.yaml
```

For ELK:
```bash
# Check Logstash status
systemctl status logstash

# Test Elasticsearch connectivity
curl http://elasticsearch:9200/_cluster/health

# Check the Logstash pipeline
curl -XGET 'localhost:9600/_node/stats/pipelines?pretty'
```

Steps to check:
```bash
# Verify AlertManager configuration
curl http://alertmanager:9093/api/v2/status

# Check alert rules
curl http://prometheus:9090/api/v1/rules

# Test notification channels by sending a test alert
# through your webhook or email
```

- Health checks configured and running
- Metrics collection enabled
- Log aggregation configured
- Critical alerts defined and tested
- Dashboards created and shared with the team
- Alert notification channels tested
- Runbooks created for common issues
- On-call rotation established
- Regular review of monitoring data scheduled
- Capacity planning based on trends
Configure appropriate retention periods:
| Data Type | Retention |
|---|---|
| Metrics | 30-90 days (high-resolution), 1 year (aggregated) |
| Logs | 30-90 days (compliance dependent) |
| Alerts | 90 days of alert history |
| Dashboards | Version-controlled in Git |
- Protect monitoring endpoints with authentication
- Encrypt metrics and log data in transit
- Sanitize logs to remove sensitive data (see the `redact_auth_tokens` configuration option)
- Restrict access to monitoring dashboards
- Audit monitoring system access
- Database-Operations -- Database management and monitoring
- Security-Operations -- Security monitoring and auditing
- Performance-and-Scaling -- Performance tuning guidance
- Post-Deployment -- Initial deployment verification
- Common-Issues -- Troubleshooting common problems
- Debugging-Guide -- Diagnostic procedures
- Promtail Container Setup -- Detailed Promtail configuration (note: Promtail is EOL; use Grafana Alloy for new deployments)
- Prometheus Documentation
- Grafana Documentation
- PostgreSQL Monitoring