Description
Priority
High - Prevents production incidents from going undetected
Problem Statement
The site lacks automated post-deployment validation and runtime monitoring for critical infrastructure components, allowing a Lambda@Edge failure to break all SDK documentation in production without immediate detection.
What went wrong: PR #17190 deployed successfully to production, but Lambda@Edge functions began failing at runtime with "ReferenceError: URLPattern is not defined". This broke all SDK documentation (Node.js, Python, .NET) with 503 errors. The deployment appeared successful because:
- Pulumi deployment succeeded (infrastructure created)
- Cypress tests passed (they only check CSS/JS loading on the homepage and the getting-started page)
- No post-deployment health checks validated Lambda@Edge execution
- No CloudWatch alarms detected Lambda@Edge errors
Current Testing:
- Cypress tests (cypress/e2e/site.cy.js) only validate:
  - Homepage CSS loading
  - JavaScript execution/carousel behavior
  - Getting started page OS chooser
- No tests for SDK documentation endpoints
- No tests for redirect functionality (Lambda@Edge logic)
- No validation that Lambda@Edge functions execute successfully
Current Monitoring:
- No CloudWatch alarms for Lambda@Edge errors
- Search URL validation is disabled (.github/workflows/check-search-urls.yml, turned off due to rate limiting issues)
Proposed Solution
Part A: Post-Deployment Health Checks (Required)
Add a new workflow .github/workflows/post-deployment-health-check.yml that runs after successful deployment and validates:
- SDK documentation endpoints return 200 (not 503)
- Redirect functionality works (Lambda@Edge executes)
- Critical pages load successfully
Example workflow structure:
- Trigger on workflow_run completion
- Determine target environment (production vs testing)
- Test SDK docs for Node.js, Python, .NET, Java
- Test redirect logic
- Send Slack notification on failure
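The endpoint and redirect checks in this workflow could be driven by a small script along the following lines. This is a minimal sketch, not the project's implementation: the names `Check`, `HttpResult`, `HttpGet`, `evaluate`, and `runChecks` are hypothetical, the SDK paths are borrowed from the Cypress example in Part C, and the HTTP call is injected so the logic can be exercised without network access.

```typescript
// Hypothetical names throughout; a sketch of the health-check logic only.

/** One expectation about a URL path on the deployed site. */
interface Check {
  path: string;
  expectStatus: number[];   // acceptable status codes
  expectLocation?: string;  // for redirects: required Location header value
}

/** The minimal response data the checks need. */
interface HttpResult {
  status: number;
  location?: string;
}

/** Performs a GET without following redirects (supplied by the caller). */
type HttpGet = (url: string) => Promise<HttpResult>;

/** Returns null when the result satisfies the check, otherwise a failure message. */
function evaluate(check: Check, result: HttpResult): string | null {
  if (!check.expectStatus.includes(result.status)) {
    return `${check.path}: expected ${check.expectStatus.join(" or ")}, got ${result.status}`;
  }
  if (check.expectLocation !== undefined && result.location !== check.expectLocation) {
    const got = result.location === undefined ? "no Location header" : result.location;
    return `${check.path}: expected redirect to ${check.expectLocation}, got ${got}`;
  }
  return null;
}

/** Runs every check against baseUrl and collects failure messages. */
async function runChecks(baseUrl: string, checks: Check[], get: HttpGet): Promise<string[]> {
  const failures: string[] = [];
  for (const check of checks) {
    try {
      const failure = evaluate(check, await get(baseUrl + check.path));
      if (failure !== null) failures.push(failure);
    } catch (err) {
      // Network errors and timeouts count as failures too.
      failures.push(`${check.path}: request failed (${String(err)})`);
    }
  }
  return failures;
}

// SDK documentation paths, taken from the Cypress example in Part C.
const sdkChecks: Check[] = [
  "/docs/reference/pkg/nodejs/pulumi/pulumi/",
  "/docs/reference/pkg/python/pulumi/",
  "/docs/reference/pkg/dotnet/Pulumi/Pulumi.html",
  "/docs/reference/pkg/java/",
].map((path) => ({ path, expectStatus: [200] }));

// A redirect check would add expectLocation; these paths are placeholders:
// { path: "/hypothetical-old-path/", expectStatus: [301, 302], expectLocation: "/hypothetical-new-path/" }
```

In a workflow step, `get` could wrap the global `fetch` in Node 18+ with `redirect: "manual"`, and the step would exit non-zero when `runChecks` returns any failures so that the Slack notification step fires.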
Part B: CloudWatch Alarms for Lambda@Edge (Required)
Add CloudWatch alarms to infrastructure/index.ts to detect Lambda@Edge failures:
// After Lambda@Edge function creation
const lambdaEdgeErrorAlarm = new aws.cloudwatch.MetricAlarm("lambda-edge-error-alarm", {
    name: "pulumi-docs-lambda-edge-errors",
    comparisonOperator: "GreaterThanThreshold",
    evaluationPeriods: 1,
    metricName: "Errors",
    namespace: "AWS/Lambda",
    period: 300, // 5 minutes
    statistic: "Sum",
    threshold: 10, // alarm when more than 10 errors occur in one period
    alarmDescription: "Lambda@Edge function errors detected",
    dimensions: {
        // Note: Lambda@Edge reports metrics in the region nearest each execution,
        // so a single-region alarm may miss errors from other regions.
        FunctionName: redirectsLambda.name,
    },
});
Part C: Enhanced Cypress Tests (Optional but Recommended)
Add SDK documentation endpoint tests to cypress/e2e/site.cy.js:
describe("SDK documentation", () => {
    const sdkUrls = [
        "/docs/reference/pkg/nodejs/pulumi/pulumi/",
        "/docs/reference/pkg/python/pulumi/",
        "/docs/reference/pkg/dotnet/Pulumi/Pulumi.html",
        "/docs/reference/pkg/java/",
    ];
    sdkUrls.forEach(url => {
        it(`loads ${url}`, () => {
            cy.request(url).its("status").should("eq", 200);
        });
    });
});
Acceptance Criteria
Required (Must Have):
- Post-deployment health check workflow created and runs after production/testing deployments
- Health check validates SDK documentation endpoints return 200 (not 503/500)
- Health check validates redirect functionality (Lambda@Edge execution)
- Health check sends Slack notification on failure
- CloudWatch alarms created for Lambda@Edge errors
- CloudWatch alarms integrated with SNS for notifications
- Documentation updated in BUILD-AND-DEPLOY.md describing health checks and alarms
Optional (Nice to Have):
- Cypress tests expanded to cover SDK documentation endpoints
- Cypress tests validate redirect functionality
Technical Details
Files to create:
- .github/workflows/post-deployment-health-check.yml
Files to modify:
- infrastructure/index.ts (CloudWatch alarms)
- infrastructure/lambdaEdge.ts (export metrics/names for alarms)
- cypress/e2e/site.cy.js (optional: SDK tests)
- BUILD-AND-DEPLOY.md (document health checks)
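The Part B alarm does not yet set a notification target, while the acceptance criteria call for SNS integration. One possible shape for the alarm arguments in infrastructure/index.ts is sketched below as a plain function so it can be checked without Pulumi installed. The helper name `edgeErrorAlarmArgs` and the interface `EdgeAlarmArgs` are hypothetical; `alarmActions` and `treatMissingData` are additions to the Part B example; and in real Pulumi code the function name and topic ARN would be Pulumi outputs rather than plain strings.

```typescript
// Hypothetical helper; property names mirror the Part B alarm arguments,
// with alarmActions and treatMissingData added for SNS delivery.
interface EdgeAlarmArgs {
  name: string;
  comparisonOperator: string;
  evaluationPeriods: number;
  metricName: string;
  namespace: string;
  period: number;
  statistic: string;
  threshold: number;
  alarmDescription: string;
  treatMissingData: string;
  alarmActions: string[];
  dimensions: { FunctionName: string };
}

function edgeErrorAlarmArgs(
  functionName: string,
  topicArn: string,
  threshold: number = 10,
): EdgeAlarmArgs {
  return {
    name: "pulumi-docs-lambda-edge-errors",
    comparisonOperator: "GreaterThanThreshold",
    evaluationPeriods: 1,
    metricName: "Errors",
    namespace: "AWS/Lambda",
    period: 300, // 5 minutes
    statistic: "Sum",
    threshold,
    alarmDescription: "Lambda@Edge function errors detected",
    // Periods with no datapoints are treated as healthy instead of
    // leaving the alarm in INSUFFICIENT_DATA.
    treatMissingData: "notBreaching",
    // Notify the SNS topic when the alarm enters the ALARM state.
    alarmActions: [topicArn],
    dimensions: { FunctionName: functionName },
  };
}

// Example with placeholder values (not real resources).
const exampleArgs = edgeErrorAlarmArgs(
  "example-redirects-fn",
  "arn:aws:sns:us-east-1:123456789012:example-alerts",
);
console.log(exampleArgs.alarmActions);
```

The returned object is intended to be passed as the second argument to `aws.cloudwatch.MetricAlarm`, with the topic ARN coming from an SNS topic defined in the same stack.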
Related Issues/PRs
- PR #17190 (Merge 43 Dependabot dependency updates): Webpack upgrade that broke Lambda@Edge (went undetected initially)
- .github/workflows/check-search-urls.yml: Disabled due to rate limiting
Estimated Effort
Medium (5 hours for the required work; 6-7 hours including the optional Cypress tests)
- Post-deployment health check: 2 hours
- CloudWatch alarms: 2 hours
- Cypress test expansion: 1-2 hours (optional)
- Documentation: 1 hour