feat(connectivity_check): add automated monitoring with CloudWatch metrics and alarms #79
base: main
Conversation
feat(connectivity_check): add automated monitoring with CloudWatch metrics and alarms
- Add monitoring.tf with EventBridge scheduling and CloudWatch alarms
- Extend Lambda handler to publish EndpointConnectivity and EndpointLatency metrics
- Add @aws-sdk/client-cloudwatch dependency to package.json
- Add monitoring configuration variables (enable_monitoring, monitoring_schedule, monitoring_targets, alarm_sns_topic_arns)
- Create per-critical-endpoint alarms, aggregate alarm, and Lambda error alarm
- All monitoring features are optional and backward compatible

Task: Enable proactive monitoring for Janus external dependencies after identity service outage incident
| variable "monitoring_schedule" { | ||
| description = "EventBridge schedule expression for monitoring (e.g., 'rate(5 minutes)')" | ||
| type = string | ||
| default = "rate(5 minutes)" |
More granularity might be a good idea. As configured it'll take 10 minutes for the alarm to fire. Running it every minute and requiring 3 or 4 evaluation periods would invoke an alarm much sooner and have a smaller false positive rate.
Done! Updated to:
- `monitoring_schedule` default: `rate(1 minute)` (was 5 minutes)
- `alarm_evaluation_periods` default: `3` (was 2)
- All alarm `period` values: `60` seconds (was 300)

Result: Alarms now fire after 3 minutes of consecutive failures instead of 10 minutes, with better false positive protection.
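For reference, a minimal sketch of what those updated defaults might look like in variables.tf (a sketch based on the names used in this PR, not the exact diff):

```hcl
variable "monitoring_schedule" {
  description = "EventBridge schedule expression for monitoring"
  type        = string
  default     = "rate(1 minute)" # was "rate(5 minutes)"
}

variable "alarm_evaluation_periods" {
  description = "Number of periods to evaluate for alarms"
  type        = number
  default     = 3 # was 2
}

# Each alarm then evaluates 3 consecutive 60-second periods,
# so it fires after ~3 minutes of consecutive failures.
```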
If we want to optimize for timely alerting using Datadog, we would either need this module to publish metrics directly to Datadog, or look into setting up metric streams. The CloudWatch metrics being published will take up to 15-20 minutes to be ingested by Datadog: https://docs.datadoghq.com/integrations/guide/cloud-metric-delay/#aws
AWS charges based on the number of metric updates on the CloudWatch Metric Stream and the data volume sent to the Amazon Data Firehose. As such, there is a potential to see an increased CloudWatch cost for the subset of metrics you are streaming. For this reason, Datadog recommends using metric streams for the AWS metrics, services, regions, and accounts where you most need the lower latency, and polling for others.
AWS CloudWatch custom metrics also cost significantly more than Datadog custom metrics: AWS charges $0.30 per metric, while Datadog charges roughly $0.001 per metric.
Considering all this, and the fact that our teams are already utilizing Datadog for monitoring/alerting, we could skip CloudWatch integration altogether, and have this module publish only to Datadog.
| variable "cloudwatch_namespace" { | ||
| description = "CloudWatch namespace for custom metrics" | ||
| type = string | ||
| default = "janus/connectivity" |
| default = "janus/connectivity" | |
| default = "connectivity" |
Or make the default an empty string so it can be the switch to turn metrics on/off.
Another option is hard-coding it so that all consumers have the same namespace. When the metrics get exported to Datadog they'll all be consistent, which might be useful for making cross-account/cross-team dashboards, since at that point the metrics will also (I think) be tagged with the account number, environment, etc.
Done! Went with the hard-coding approach:
- Removed the `cloudwatch_namespace` variable entirely
- Added `locals { cloudwatch_namespace = "connectivity" }` with an explanatory comment
- All consumers now use the same namespace for Datadog consistency
- Metrics are still differentiated by FunctionName, Endpoint, and AWS account tags
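A minimal sketch of what the hard-coded namespace might look like (assumed shape, not the exact code from the PR):

```hcl
# Hard-coded rather than exposed as a variable so every consumer publishes
# to the same namespace; metrics stay consistent when exported to Datadog
# and are still differentiated by FunctionName/Endpoint dimensions and account.
locals {
  cloudwatch_namespace = "connectivity"
}
```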
| description = "Number of periods to evaluate for alarms" | ||
| type = number | ||
| default = 2 | ||
| } |
The variables in this file can be removed
```hcl
  targets                = var.monitoring_targets
  publishMetrics         = true
  cloudwatchNamespace    = var.cloudwatch_namespace
  failOnConnectivityLoss = false # Don't fail Lambda, just publish metrics
```
If this is always false we don't need the code for it. For now I think it makes sense to just remove it, if it's not being used. We're generally going to be using datadog for monitoring the metrics anyway, since those will be available.
Done! Removed `failOnConnectivityLoss` entirely:
- Removed it from the `LambdaEvent` interface in handler.ts
- Removed all related logic (critical failure checking and error throwing)
- Simplified the EventBridge target input in monitoring.tf
```hcl
  arn = module.lambda.lambda_function_arn
  input = jsonencode({
    targets        = var.monitoring_targets
    publishMetrics = true
```
This should be configurable, or it should depend on whether `cloudwatch_namespace` is empty.
Done! With the hard-coded namespace approach, `publishMetrics` is now always true when monitoring is enabled. This is appropriate since:
- Metrics are always useful (even if just for Datadog export)
- The namespace is consistent across all consumers
- CloudWatch alarms are optional (controlled by `alarm_sns_topic_arns`)
```hcl
# CloudWatch alarms for critical endpoint failures
# Creates one alarm per critical target
resource "aws_cloudwatch_metric_alarm" "critical_endpoint_failure" {
```
Might be good to try out what it'd look like to consume these in datadog as a PoC. I think we'd probably want monitors on all the hosts, but have different severities based on if the host is critical or not. Inevitably we may end up making a dashboard that aggregates all of these across all accounts.
Also might be worth looking into whether these are necessary at all. We export metrics to Datadog, so we could use that to do the alerting on that side. Maybe skip making the CloudWatch alarms if var.alarm_sns_topic_arns is empty? That gives consumers a choice if they want to put Datadog into the equation. Without Datadog, I guess you'd put an SNS topic that hits AlertOps or something here.
Done! CloudWatch alarms are now optional and Datadog-friendly:
- Alarms are only created when `length(var.alarm_sns_topic_arns) > 0`
- If SNS topics are empty, metrics are still published but no CloudWatch alarms are created
- Added comments explaining this supports Datadog-only monitoring
- Consumers can choose: CloudWatch alarms (with SNS) OR Datadog monitors (leave SNS empty)

Usage patterns:
- Datadog-only: Set `enable_monitoring = true`, leave `alarm_sns_topic_arns = []` → Metrics exported to Datadog, no CloudWatch alarms
- CloudWatch alerting: Set `enable_monitoring = true`, provide `alarm_sns_topic_arns` → Both metrics and alarms
- No monitoring: Set `enable_monitoring = false` → Nothing created

The hard-coded namespace ensures all connectivity checks aggregate cleanly in Datadog dashboards across accounts.
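A sketch of how the conditional per-critical-endpoint alarm might look; the `monitoring_targets` shape (`name`/`critical` fields) and the specific attribute values are assumptions, not the exact code from this PR:

```hcl
resource "aws_cloudwatch_metric_alarm" "critical_endpoint_failure" {
  # No SNS topics => no CloudWatch alarms; metrics are still published
  # and Datadog monitors can be used instead.
  for_each = {
    for t in var.monitoring_targets : t.name => t
    if t.critical && length(var.alarm_sns_topic_arns) > 0
  }

  alarm_name          = "connectivity-${each.key}-failure"
  namespace           = local.cloudwatch_namespace
  metric_name         = "EndpointConnectivity"
  dimensions          = { Endpoint = each.key }
  statistic           = "Minimum"
  period              = 60
  evaluation_periods  = var.alarm_evaluation_periods
  comparison_operator = "LessThanThreshold"
  threshold           = 1
  treat_missing_data  = "breaching"
  alarm_actions       = var.alarm_sns_topic_arns
  ok_actions          = var.alarm_sns_topic_arns
}
```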
- Include critical field directly in testTcp() and testHttp() results
- Improve alarm granularity: 1-min schedule, 3 evaluation periods (3-min alert vs 10-min)
- Hard-code CloudWatch namespace to 'connectivity' for Datadog consistency
- Remove duplicate variable declarations from monitoring.tf (42 lines)
- Remove unused failOnConnectivityLoss logic
- Make CloudWatch alarms optional when alarm_sns_topic_arns is empty
- Support Datadog-only monitoring (metrics without CloudWatch alarms)

Changes reduce code by 56 lines while improving responsiveness and flexibility.

JIRA: PLATEXP-11106
…g integration

Replace CloudWatch custom metrics with direct Datadog integration using the janus-core library and OpenTelemetry. This addresses performance and cost concerns raised in PR review.

Changes:
- Lambda handler now uses @janus.team/janus-core for metrics/logging
- Metrics reach Datadog in seconds (not 15-20 minutes)
- Cost reduced from $0.30/metric (CloudWatch) to $0.001/metric (Datadog)
- Added OpenTelemetry layer and required environment variables
- Removed all CloudWatch alarm resources (117 lines)
- Simplified monitoring.tf to only handle EventBridge scheduling
- Changed package.json to commonjs (required for janus-core)
- Fixed npm_requirements flag to true (now has dependencies)

Metrics published:
- connectivity.endpoint.status (gauge: 1=success, 0=failure)
- connectivity.endpoint.latency (gauge: milliseconds)
- connectivity.endpoint.success.count (counter)
- connectivity.endpoint.error.count (counter)

PLATEXP-11106
…da_function_tags

The terraform-aws-modules/lambda module does not expose lambda_function_tags as an output. Use inline tags matching the Lambda function tags instead.

PLATEXP-11106
JIRA: PLATEXP-11106
Add Monitoring Features to connectivity_check Module

Problem

The `connectivity_check` module currently supports on-demand connectivity testing via script invocation. We need automated scheduled monitoring with CloudWatch metrics and alarms for proactive alerting.

Solution

Extend the module with optional monitoring capabilities while maintaining backward compatibility.

Changes

- `monitoring.tf` (new): EventBridge scheduling and CloudWatch alarm resources
- `lambda/handler.ts`: publishes `EndpointConnectivity` (1=up, 0=down) and `EndpointLatency` (ms) metrics with `FunctionName`, `Endpoint`, and `Critical` dimensions; includes the `critical` field directly in test function results; removes `failOnConnectivityLoss` logic
- `lambda/package.json`: adds the `@aws-sdk/client-cloudwatch` dependency
- `variables.tf`: adds `enable_monitoring`, `monitoring_schedule`, `monitoring_targets`, `alarm_sns_topic_arns`, `alarm_evaluation_periods`; the CloudWatch namespace is hard-coded to `"connectivity"` for consistency across all consumers

Usage Examples
With CloudWatch Alarms
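A minimal sketch of what this could look like; the module source path, target shape, and SNS topic ARN are placeholders:

```hcl
module "connectivity_check" {
  source = "../modules/connectivity_check" # placeholder path

  enable_monitoring   = true
  monitoring_schedule = "rate(1 minute)"

  monitoring_targets = [
    # Hypothetical target shape, for illustration only
    { name = "identity-service", host = "identity.internal.example.com", port = 443, critical = true },
  ]

  # Providing SNS topics enables CloudWatch alarm creation
  alarm_sns_topic_arns = ["arn:aws:sns:us-east-1:123456789012:connectivity-alerts"]
}
```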
Datadog-Only Monitoring (No CloudWatch Alarms)
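A minimal sketch with the same placeholders as above; leaving the SNS topic list empty skips CloudWatch alarm creation while metrics are still published for Datadog:

```hcl
module "connectivity_check" {
  source = "../modules/connectivity_check" # placeholder path

  enable_monitoring   = true
  monitoring_schedule = "rate(1 minute)"

  monitoring_targets = [
    { name = "identity-service", host = "identity.internal.example.com", port = 443, critical = true },
  ]

  # Empty list => no CloudWatch alarms; alert on the exported metrics from Datadog instead
  alarm_sns_topic_arns = []
}
```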
Key Features