Add post-deployment health checks for critical infrastructure #17270

@CamSoper

Priority

High - Prevents production incidents from going undetected

Problem Statement

The site lacks automated post-deployment validation and runtime monitoring for critical infrastructure components, allowing a Lambda@Edge failure to break all SDK documentation in production without immediate detection.

What went wrong: PR #17190 deployed successfully to production, but Lambda@Edge functions began failing at runtime with "ReferenceError: URLPattern is not defined". This broke all SDK documentation (Node.js, Python, .NET) with 503 errors. The deployment appeared successful because:

  1. Pulumi deployment succeeded (infrastructure created)
  2. Cypress tests passed (they only check CSS/JS loading on the homepage and the getting-started page)
  3. No post-deployment health checks validated Lambda@Edge execution
  4. No CloudWatch alarms detected Lambda@Edge errors

Current Testing:

  • Cypress tests (cypress/e2e/site.cy.js) only validate:
    • Homepage CSS loading
    • JavaScript execution/carousel behavior
    • Getting started page OS chooser
  • No tests for SDK documentation endpoints
  • No tests for redirect functionality (Lambda@Edge logic)
  • No validation that Lambda@Edge functions execute successfully

Current Monitoring:

  • No CloudWatch alarms for Lambda@Edge errors
  • Search URL validation is not running (.github/workflows/check-search-urls.yml was disabled because of rate limiting)

Proposed Solution

Part A: Post-Deployment Health Checks (Required)

Add a new workflow .github/workflows/post-deployment-health-check.yml that runs after successful deployment and validates:

  • SDK documentation endpoints return 200 (not 503)
  • Redirect functionality works (Lambda@Edge executes)
  • Critical pages load successfully

Example workflow structure:

  • Trigger on workflow_run completion
  • Determine target environment (production vs testing)
  • Test SDK docs for Node.js, Python, .NET, Java
  • Test redirect logic
  • Send Slack notification on failure
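
The checks in that list could be driven by a small script that the workflow runs once the deployment finishes. The following is a minimal sketch, not an existing file in the repo: the function names, the `FetchLike` type, and the idea of passing in the fetch implementation are all assumptions made so the logic can be exercised without network access. It assumes a runtime whose fetch returns the raw 3xx response when `redirect: "manual"` is set, as Node 18+ is expected to.

```typescript
// Hypothetical health-check helpers; every name here is illustrative.
type FetchLike = (
    url: string,
    init?: { redirect?: "manual" | "follow" },
) => Promise<{ status: number; headers: { get(name: string): string | null } }>;

interface CheckResult {
    url: string;
    ok: boolean;
    detail: string;
}

// Join base URL and path without depending on the URL global.
function joinUrl(baseUrl: string, path: string): string {
    return baseUrl.replace(/\/+$/, "") + path;
}

// A page passes only if it answers 200 (the incident produced 503s).
async function checkPage(fetchFn: FetchLike, baseUrl: string, path: string): Promise<CheckResult> {
    const url = joinUrl(baseUrl, path);
    try {
        const res = await fetchFn(url, { redirect: "follow" });
        return { url, ok: res.status === 200, detail: `status ${res.status}` };
    } catch (err) {
        return { url, ok: false, detail: `request failed: ${String(err)}` };
    }
}

// A redirect passes if the response is a 3xx whose Location ends with the
// expected target, which indicates the redirect logic actually executed.
async function checkRedirect(
    fetchFn: FetchLike,
    baseUrl: string,
    from: string,
    expectedTo: string,
): Promise<CheckResult> {
    const url = joinUrl(baseUrl, from);
    try {
        const res = await fetchFn(url, { redirect: "manual" });
        const location = res.headers.get("location") ?? "";
        const ok = res.status >= 300 && res.status < 400 && location.endsWith(expectedTo);
        return { url, ok, detail: `status ${res.status}, location "${location}"` };
    } catch (err) {
        return { url, ok: false, detail: `request failed: ${String(err)}` };
    }
}

// Run all page checks, then all redirect checks, preserving input order.
async function runHealthChecks(
    fetchFn: FetchLike,
    baseUrl: string,
    pages: string[],
    redirects: Array<[string, string]>,
): Promise<CheckResult[]> {
    return Promise.all([
        ...pages.map((p) => checkPage(fetchFn, baseUrl, p)),
        ...redirects.map(([from, to]) => checkRedirect(fetchFn, baseUrl, from, to)),
    ]);
}

// Build the text for a failure notification; returns null when all checks pass.
function summarizeFailures(results: CheckResult[]): string | null {
    const failed = results.filter((r) => !r.ok);
    if (failed.length === 0) {
        return null;
    }
    const header = `Post-deployment health check failed (${failed.length}/${results.length}):`;
    return [header, ...failed.map((f) => `- ${f.url}: ${f.detail}`)].join("\n");
}
```

In a workflow step, `runHealthChecks` would be called with the runtime's global `fetch`, the base URL for the target environment, the SDK documentation paths, and at least one known redirect pair. A non-null result from `summarizeFailures` would then fail the job and become the body of the Slack message.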

Part B: CloudWatch Alarms for Lambda@Edge (Required)

Add CloudWatch alarms to infrastructure/index.ts to detect Lambda@Edge failures:

// After Lambda@Edge function creation.
// Fires when the function reports more than 10 errors within a single 5-minute period.
const lambdaEdgeErrorAlarm = new aws.cloudwatch.MetricAlarm("lambda-edge-error-alarm", {
    name: "pulumi-docs-lambda-edge-errors",
    comparisonOperator: "GreaterThanThreshold",
    evaluationPeriods: 1, // alarm after one breaching period
    metricName: "Errors",
    namespace: "AWS/Lambda",
    period: 300, // seconds (5 minutes)
    statistic: "Sum",
    threshold: 10,
    alarmDescription: "Lambda@Edge function errors detected",
    dimensions: {
        // Assumes redirectsLambda is the Lambda@Edge function defined earlier in this file.
        FunctionName: redirectsLambda.name,
    },
});
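
Two details may need attention before that alarm meets the acceptance criteria. The criteria call for SNS integration, which would mean setting `alarmActions`. Also, as far as I understand it, Lambda@Edge replicas report metrics in the Region where they execute, with the `FunctionName` dimension prefixed by `us-east-1.`. That should be verified against the AWS docs before relying on it. The sketch below builds plain argument objects, one per Region, that could be passed to `aws.cloudwatch.MetricAlarm`. The helper names, the Region list, and the topic ARNs are all hypothetical, and each topic is assumed to live in the same Region as its alarm.

```typescript
// Hypothetical helper: plain objects shaped like MetricAlarm arguments.
interface EdgeAlarmArgs {
    name: string;
    comparisonOperator: string;
    evaluationPeriods: number;
    metricName: string;
    namespace: string;
    period: number;
    statistic: string;
    threshold: number;
    treatMissingData: string;
    alarmDescription: string;
    dimensions: { FunctionName: string };
    alarmActions: string[];
}

// Build the arguments for one Region's alarm.
function buildEdgeAlarmArgs(
    functionName: string,
    region: string,
    snsTopicArn: string,
    threshold: number = 10,
): EdgeAlarmArgs {
    return {
        name: `pulumi-docs-lambda-edge-errors-${region}`,
        comparisonOperator: "GreaterThanThreshold",
        evaluationPeriods: 1,
        metricName: "Errors",
        namespace: "AWS/Lambda",
        period: 300, // seconds (5 minutes)
        statistic: "Sum",
        threshold,
        // No invocations in a Region means no data; treat that as healthy.
        treatMissingData: "notBreaching",
        alarmDescription: `Lambda@Edge function errors detected in ${region}`,
        // Assumption to verify: replica metrics use the "us-east-1." prefix.
        dimensions: { FunctionName: `us-east-1.${functionName}` },
        // Assumption: the SNS topic lives in the same Region as the alarm.
        alarmActions: [snsTopicArn],
    };
}

// Build one argument object per Region from a Region -> topic ARN map.
function buildAlarmsForRegions(
    functionName: string,
    topicArnByRegion: { [region: string]: string },
): EdgeAlarmArgs[] {
    return Object.keys(topicArnByRegion).map((region) =>
        buildEdgeAlarmArgs(functionName, region, topicArnByRegion[region]),
    );
}
```

In infrastructure/index.ts, each object could then be handed to `new aws.cloudwatch.MetricAlarm(...)` along with a Region-specific provider. Because the function name is a Pulumi output rather than a plain string, the call would need to happen inside an `apply`, or the helper would need to be adapted to accept outputs.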

Part C: Enhanced Cypress Tests (Optional but Recommended)

Add SDK documentation endpoint tests to cypress/e2e/site.cy.js:

describe("SDK documentation", () => {
    const sdkUrls = [
        "/docs/reference/pkg/nodejs/pulumi/pulumi/",
        "/docs/reference/pkg/python/pulumi/",
        "/docs/reference/pkg/dotnet/Pulumi/Pulumi.html",
        "/docs/reference/pkg/java/",
    ];

    sdkUrls.forEach(url => {
        it(`loads ${url}`, () => {
            cy.request(url).its("status").should("eq", 200);
        });
    });
});

Acceptance Criteria

Required (Must Have):

  • Post-deployment health check workflow created and runs after production/testing deployments
  • Health check validates SDK documentation endpoints return 200 (not 503/500)
  • Health check validates redirect functionality (Lambda@Edge execution)
  • Health check sends Slack notification on failure
  • CloudWatch alarms created for Lambda@Edge errors
  • CloudWatch alarms integrated with SNS for notifications
  • Documentation updated in BUILD-AND-DEPLOY.md describing health checks and alarms

Optional (Nice to Have):

  • Cypress tests expanded to cover SDK documentation endpoints
  • Cypress tests validate redirect functionality

Technical Details

Files to create:

  • .github/workflows/post-deployment-health-check.yml

Files to modify:

  • infrastructure/index.ts (CloudWatch alarms)
  • infrastructure/lambdaEdge.ts (export metrics/names for alarms)
  • cypress/e2e/site.cy.js (optional: SDK tests)
  • BUILD-AND-DEPLOY.md (document health checks)

Related Issues/PRs

  • PR #17190 (the deployment after which the Lambda@Edge functions began failing)

Estimated Effort

Medium (5 hours for the required work; 6-7 hours including the optional Cypress tests)

  • Post-deployment health check: 2 hours
  • CloudWatch alarms: 2 hours
  • Cypress test expansion: 1-2 hours (optional)
  • Documentation: 1 hour
