URL Checking with Sampling - Implementation Plan
Problem Statement
The check-urls script consistently fails after ~11,000-12,000 requests due to AWS cumulative rate limiting from a single IP address. Multiple attempts with various delay strategies (2s, 10s) and concurrency limiting all fail at approximately the same threshold. This is not a request rate limit but a cumulative request limit - AWS identifies ~11k requests from one IP as suspicious activity regardless of pacing.
Current behavior:
- Total URLs in search index: ~66,000
- Failure point: Group 12 (~12,000 URLs)
- No amount of delay prevents this cumulative limit
Solution: Sampling + Critical URL Monitoring
Implement a three-tier checking approach that stays well under the ~11k threshold:
- Critical URLs (~200 URLs) - Always checked, high-priority pages
- Random Sample (5,000 URLs) - Stratified sampling for broad coverage
- Full Check Mode (optional) - Manual trigger for complete validation
Total URLs per run: ~5,200 (well under 11k limit)
Architecture Overview
Three Operating Modes
| Mode | URLs Checked | Use Case | Time |
|---|---|---|---|
| `sample` (default) | ~5,200 | Daily CI checks | 25-30 min |
| `critical-only` | ~200 | Fast validation | 1-2 min |
| `full` | ~66,000 | Manual testing only | N/A (will fail) |
Sampling Strategy: Stratified Random
Why stratified: Pure random could miss entire sections. Stratification ensures coverage across all site areas.
Stratification by section:
- Docs: 40% of sample (2,000 URLs)
- Registry: 35% of sample (1,750 URLs)
- Blog: 10% of sample (500 URLs)
- Tutorials: 10% of sample (500 URLs)
- Other: 5% of sample (250 URLs)
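As a quick arithmetic check, the per-section targets above follow directly from the fractions; a standalone sketch (the real script derives these from its own `sampleSize` argument):

```javascript
// Per-section sample targets for the fractions listed above (illustrative only).
const sampleSize = 5000;
const fractions = { docs: 0.40, registry: 0.35, blog: 0.10, tutorials: 0.10, other: 0.05 };

const targets = Object.fromEntries(
    Object.entries(fractions).map(([section, f]) => [section, Math.floor(sampleSize * f)])
);

console.log(targets);
// { docs: 2000, registry: 1750, blog: 500, tutorials: 500, other: 250 }
```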
Deterministic seeding:
- Use date-based seed (YYYY-MM-DD)
- Reproducible for debugging
- Different sample each day for broad coverage over time
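A minimal sketch of how the date-based seed produces a reproducible draw sequence, using the same `hashString`/LCG helpers this plan adds to check-urls.js:

```javascript
// Hash the seed string into an integer, then drive a small LCG from it.
function hashString(str) {
    let hash = 0;
    for (let i = 0; i < str.length; i++) {
        hash = ((hash << 5) - hash) + str.charCodeAt(i);
        hash = hash & hash; // force 32-bit integer
    }
    return Math.abs(hash);
}

function makeSeededRandom(seed) {
    let state = hashString(seed);
    return () => {
        state = (state * 9301 + 49297) % 233280;
        return state / 233280;
    };
}

// Same seed (same calendar day) → identical sequence; a new day → a new sample.
const a = makeSeededRandom("2026-01-26");
const b = makeSeededRandom("2026-01-26");
console.log(a() === b() && a() === b()); // true
```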
Implementation Details
File 1: Create /workspaces/src/pulumi/docs/scripts/search/critical-urls.js
Purpose: Define critical URL lists (always checked)
Structure: JavaScript module following existing patterns from rank.js
Content:
```javascript
module.exports = {
    // Static list of critical URLs
    getStaticCriticalURLs() {
        return [
            // Homepage & Navigation
            "/",
            "/docs/",
            "/registry/",
            "/blog/",
            "/templates/",
            "/product/",
            "/pricing/",

            // Core Get-Started Guides (Tier 1 from rank.js)
            "/docs/get-started/",
            "/docs/clouds/aws/get-started/",
            "/docs/clouds/azure/get-started/",
            "/docs/clouds/gcp/get-started/",
            "/docs/clouds/kubernetes/get-started/",

            // Cloud Landing Pages
            "/docs/clouds/aws/",
            "/docs/clouds/azure/",
            "/docs/clouds/gcp/",
            "/docs/clouds/kubernetes/",

            // Core Documentation Sections
            "/docs/concepts/",
            "/docs/concepts/projects/",
            "/docs/concepts/stacks/",
            "/docs/concepts/resources/",
            "/docs/using-pulumi/",
            "/docs/pulumi-cloud/",

            // Tier-1 Provider Pages
            "/registry/packages/aws/",
            "/registry/packages/azure-native/",
            "/registry/packages/gcp/",
            "/registry/packages/kubernetes/",
            "/registry/packages/aws/installation-configuration/",
            "/registry/packages/azure-native/installation-configuration/",
            "/registry/packages/gcp/installation-configuration/",
            "/registry/packages/kubernetes/installation-configuration/",

            // Component Packages
            "/registry/packages/awsx/",
            "/registry/packages/eks/",
        ];
    },

    // Pattern-based critical URLs (regex)
    getCriticalURLPatterns() {
        return [
            /^\/docs\/[^/]+\/$/, // All top-level doc sections
            /^\/docs\/clouds\/(aws|azure|gcp|kubernetes)\/get-started\/$/,
        ];
    },

    // Check if a URL is critical
    isCriticalURL(href) {
        if (this.getStaticCriticalURLs().includes(href)) {
            return true;
        }
        return this.getCriticalURLPatterns().some(pattern => pattern.test(href));
    },

    // Filter critical URLs from the full list
    filterCriticalURLs(allObjects) {
        return allObjects.filter(obj => this.isCriticalURL(obj.href));
    }
};
```

Critical URL Selection Criteria:
- Homepage and main navigation
- All "get-started" guides
- Core documentation sections (concepts, using-pulumi, pulumi-cloud)
- Tier-1 cloud provider landing and installation pages
- Top component packages
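A quick standalone sanity check of the pattern rules (regexes copied from `getCriticalURLPatterns()`; `isCritical` here is an illustrative stand-in for `isCriticalURL` without the static list):

```javascript
// Pattern-based rules from getCriticalURLPatterns(), tested in isolation.
const patterns = [
    /^\/docs\/[^/]+\/$/,
    /^\/docs\/clouds\/(aws|azure|gcp|kubernetes)\/get-started\/$/,
];
const isCritical = (href) => patterns.some((p) => p.test(href));

console.log(isCritical("/docs/concepts/"));               // true: top-level doc section
console.log(isCritical("/docs/clouds/gcp/get-started/")); // true: get-started guide
console.log(isCritical("/docs/concepts/projects/"));      // false: nested page (covered by the static list instead)
```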
File 2: Modify /workspaces/src/pulumi/docs/scripts/search/check-urls.js
Change 1: Add URL Selection Function (after line 10)
Purpose: Select which URLs to check based on mode
```javascript
function selectURLsToCheck(allObjects, options = {}) {
    const {
        mode = 'sample',
        sampleSize = 5000,
        seed = new Date().toISOString().split('T')[0],
    } = options;

    const criticalURLs = require('./critical-urls');

    // Extract critical URLs
    const critical = criticalURLs.filterCriticalURLs(allObjects);
    const criticalHrefs = new Set(critical.map(obj => obj.href));
    console.log(`Found ${critical.length} critical URLs`);

    // Handle different modes
    if (mode === 'full') {
        console.log('Running in FULL mode - checking all URLs');
        return { selected: allObjects, metadata: { mode: 'full', total: allObjects.length, critical: critical.length, sampled: 0 } };
    }

    if (mode === 'critical-only') {
        console.log('Running in CRITICAL-ONLY mode');
        return { selected: critical, metadata: { mode: 'critical-only', total: allObjects.length, critical: critical.length, sampled: 0 } };
    }

    // SAMPLE mode: stratified sampling
    const remaining = allObjects.filter(obj => !criticalHrefs.has(obj.href));

    // Stratify by section
    const sections = {
        docs: remaining.filter(obj => obj.href.startsWith('/docs')),
        registry: remaining.filter(obj => obj.href.startsWith('/registry')),
        blog: remaining.filter(obj => obj.href.startsWith('/blog')),
        tutorials: remaining.filter(obj => obj.href.startsWith('/tutorials')),
        other: remaining.filter(obj =>
            !obj.href.startsWith('/docs') &&
            !obj.href.startsWith('/registry') &&
            !obj.href.startsWith('/blog') &&
            !obj.href.startsWith('/tutorials')
        ),
    };

    // Calculate target sample sizes per section
    const targetSampleSize = sampleSize - critical.length;
    const sampleSizes = {
        docs: Math.floor(targetSampleSize * 0.40),
        registry: Math.floor(targetSampleSize * 0.35),
        blog: Math.floor(targetSampleSize * 0.10),
        tutorials: Math.floor(targetSampleSize * 0.10),
        other: Math.floor(targetSampleSize * 0.05),
    };

    // Seeded random number generator
    let seedValue = hashString(seed);
    function seededRandom() {
        seedValue = (seedValue * 9301 + 49297) % 233280;
        return seedValue / 233280;
    }

    // Sample from each section
    const sampled = [];
    for (const [section, urls] of Object.entries(sections)) {
        const target = Math.min(sampleSizes[section], urls.length);
        const sectionSample = shuffleArray([...urls], seededRandom).slice(0, target);
        sampled.push(...sectionSample);
        console.log(`  ↳ ${section}: sampled ${target} of ${urls.length} URLs`);
    }

    // Combine critical + sampled
    const selected = [...critical, ...sampled];
    console.log(`\nSelected ${selected.length} total URLs to check:`);
    console.log(`  - Critical: ${critical.length}`);
    console.log(`  - Sampled: ${sampled.length}`);
    console.log(`  - Skipped: ${allObjects.length - selected.length}`);
    console.log(`  - Seed: ${seed}\n`);

    return {
        selected,
        metadata: {
            mode: 'sample',
            total: allObjects.length,
            critical: critical.length,
            sampled: sampled.length,
            seed: seed,
        }
    };
}

// Helper: Simple string hash
function hashString(str) {
    let hash = 0;
    for (let i = 0; i < str.length; i++) {
        const char = str.charCodeAt(i);
        hash = ((hash << 5) - hash) + char;
        hash = hash & hash;
    }
    return Math.abs(hash);
}

// Helper: Fisher-Yates shuffle with seeded random
function shuffleArray(array, randomFn) {
    for (let i = array.length - 1; i > 0; i--) {
        const j = Math.floor(randomFn() * (i + 1));
        [array[i], array[j]] = [array[j], array[i]];
    }
    return array;
}
```

Change 2: Update Main Function Signature and Logic (line 5)
Before:

```javascript
async function checkSearchURLs(baseURL) {
```

After:

```javascript
async function checkSearchURLs(baseURL, mode = 'sample', sampleSize = 5000, seed = null) {
```

Then update the function body (after line 10):

Before:

```javascript
const objects = fs.readFileSync("./public/search-index.json", "utf-8")...
console.log(`Checking ${objects.length} search URLs...`);
```

After:

```javascript
const objects = fs.readFileSync("./public/search-index.json", "utf-8")...
console.log(`Loaded ${objects.length} URLs from search index.`);

// Select URLs based on mode
const selection = selectURLsToCheck(objects, { mode, sampleSize, seed });
const urlsToCheck = selection.selected;

console.log(`Checking ${urlsToCheck.length} URLs...\n`);
```

Change 3: Update Chunking to Use Selected URLs (lines 15-21)
Change all references from `objects` to `urlsToCheck`:
```javascript
for (let i = 0; i < urlsToCheck.length; i += chunkSize) {
    chunks.push({ chunk: i / chunkSize, objects: urlsToCheck.slice(i, i + chunkSize) });
}
```

Change 4: Update Function Call and Argument Parsing (line 84)
Before:

```javascript
checkSearchURLs(process.argv[2] || "https://www.pulumi.com")
```

After:

```javascript
const baseURL = process.argv[2] || "https://www.pulumi.com";
const mode = process.argv[3] || "sample";
const sampleSize = parseInt(process.argv[4], 10) || 5000;
const seed = process.argv[5] || new Date().toISOString().split('T')[0];

checkSearchURLs(baseURL, mode, sampleSize, seed)
```

Change 5: Enhance JSON Output (lines 87-92)
Add metadata to summary:
```javascript
const summary = {
    checked: results.length,
    fulfilled: results.filter(r => r.status === "fulfilled").map(r => r.value) || [],
    rejected: results.filter(r => r.status === "rejected").map(r => r.reason) || [],
    metadata: selection.metadata,          // ADD THIS LINE
    timestamp: new Date().toISOString(),   // ADD THIS LINE
};
```

Note: `selection` must be made available in this scope (return it from checkSearchURLs or make it module-level).
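One way to handle that note is to have checkSearchURLs return the selection alongside the results; a hedged sketch with the HTTP check stubbed out (names mirror the plan, not the current file):

```javascript
// Sketch: checkSearchURLs returns a summary that already embeds
// selection.metadata. The per-URL check is stubbed with a resolved
// promise; the real script issues HTTP requests instead.
async function checkSearchURLs(urls, mode = "sample") {
    const selection = { selected: urls, metadata: { mode, total: urls.length } };
    const results = await Promise.allSettled(
        selection.selected.map(async (href) => href) // stub: real code fetches baseURL + href
    );
    return {
        checked: results.length,
        fulfilled: results.filter((r) => r.status === "fulfilled").map((r) => r.value),
        rejected: results.filter((r) => r.status === "rejected").map((r) => r.reason),
        metadata: selection.metadata,
        timestamp: new Date().toISOString(),
    };
}

checkSearchURLs(["/docs/", "/blog/"]).then((summary) => {
    console.log(summary.checked, summary.metadata.mode); // → 2 sample
});
```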
File 3: Modify /workspaces/src/pulumi/docs/scripts/search/check-urls.sh
Change line 16:
Before:

```sh
node "./scripts/search/check-urls.js" "$base_url"
```

After:

```sh
mode="${3:-sample}"
node "./scripts/search/check-urls.js" "$base_url" "$mode"
```

Update the usage comment at the top of the file:

```sh
# Usage: ./check-urls.sh <index-name> <base-url> [mode]
# Modes: sample (default), critical-only, full
```

File 4: Update /workspaces/src/pulumi/docs/Makefile
Change existing target (around line 68):
Before:

```makefile
check_search_urls:
	$(MAKE) banner
	$(MAKE) ensure
	./scripts/search/check-urls.sh production "https://www.pulumi.com"
```

After:

```makefile
check_search_urls:
	$(MAKE) banner
	$(MAKE) ensure
	./scripts/search/check-urls.sh production "https://www.pulumi.com" "sample"
```

Add new targets:

```makefile
check_search_urls_critical:
	$(MAKE) banner
	$(MAKE) ensure
	./scripts/search/check-urls.sh production "https://www.pulumi.com" "critical-only"

check_search_urls_full:
	$(MAKE) banner
	$(MAKE) ensure
	./scripts/search/check-urls.sh production "https://www.pulumi.com" "full"
```

Testing Strategy
Phase 1: Local Testing (Critical-Only Mode)
Fast validation without full index:
```sh
# Build site and generate search index
make build

# Test critical-only mode (should take 1-2 minutes)
make check_search_urls_critical
```

Expected output:
- "Found X critical URLs"
- "Running in CRITICAL-ONLY mode"
- All critical URLs should pass (no broken links)
Phase 2: Local Testing (Sample Mode)
Test with small sample first:
```sh
# Test with a 1,000-URL sample (fast)
node ./scripts/search/check-urls.js "https://www.pulumi.com" "sample" 1000

# Test with the full 5,000-URL sample
make check_search_urls
```

Verify:
- Logs show stratification (docs: X, registry: Y, etc.)
- Total URLs checked ~= 5,200
- All critical URLs included
- No rate limiting errors
Phase 3: Seed Reproducibility Test
Verify same seed produces same sample:
```sh
# Run twice with the same seed
node ./scripts/search/check-urls.js "https://www.pulumi.com" "sample" 5000 "2026-01-26" > run1.log
node ./scripts/search/check-urls.js "https://www.pulumi.com" "sample" 5000 "2026-01-26" > run2.log

# Compare - the selected URL lists should be identical
diff run1.log run2.log
```

Phase 4: CI Integration Test
Push changes and monitor GitHub Actions:
- Commit and push changes
- Wait for scheduled job (or trigger manually)
- Verify in logs:
- Job completes without timeout
- No ECONNREFUSED or ETIMEDOUT errors after retries
- Shows "Selected X total URLs to check"
- Completes all groups successfully
Success criteria:
- ✅ Script completes without rate limiting
- ✅ Runtime < 30 minutes
- ✅ All critical URLs checked
- ✅ Sample includes all sections
Expected Outcomes
Immediate Benefits
- Rate limiting resolved: 5,200 URLs well under 11k limit
- Faster feedback: 25-30 min vs timeout/failure
- Critical coverage guaranteed: High-priority pages always checked
- Broad coverage over time: 5,000 different URLs daily (7.5% of site)
- Reproducible: Seed-based sampling for debugging
Trade-offs
Advantages:
- Simple implementation (no infrastructure changes)
- Stays within AWS limits
- Fast CI feedback
- Easy to maintain and understand
Disadvantages:
- Not all URLs checked each run (~92% of the index skipped)
- Broken links in non-critical pages may take days to detect
- Probabilistic coverage (not deterministic for all URLs)
Mitigation:
- Run full checks monthly (manual trigger during low-traffic times)
- Monitor 404 rates in CloudFront logs for production issues
- Critical URLs provide confidence in most important paths
- Stratification ensures all sections represented
Files to Create/Modify
New Files
- /workspaces/src/pulumi/docs/scripts/search/critical-urls.js - Critical URL configuration
Modified Files
- /workspaces/src/pulumi/docs/scripts/search/check-urls.js - Add sampling logic
- /workspaces/src/pulumi/docs/scripts/search/check-urls.sh - Accept mode parameter
- /workspaces/src/pulumi/docs/Makefile - Update targets
No Changes Needed
- .github/workflows/check-search-urls.yml - Uses the default Makefile target (now sample mode)
Implementation Sequence
- Create critical-urls.js with initial list (~200 URLs)
- Add helper functions (hashString, shuffleArray) to check-urls.js
- Add selectURLsToCheck function to check-urls.js
- Update checkSearchURLs function signature and logic
- Update argument parsing at bottom of check-urls.js
- Enhance JSON output with metadata
- Update check-urls.sh to accept mode parameter
- Update Makefile targets
- Test locally with critical-only mode
- Test locally with sample mode (small sample first)
- Verify reproducibility with fixed seed
- Commit and push to test in CI
- Monitor GitHub Actions for successful completion
Success Criteria
Must Have (MVP):
- Daily scheduled checks complete without rate limiting
- Critical URLs always checked (verified in logs)
- Sample mode checks ~5,200 URLs
- Results JSON includes selection metadata
- No ECONNREFUSED errors in CI logs
Should Have:
- Stratified sampling across all sections
- Deterministic seeded sampling
- Logging shows sample distribution
- All three modes working (sample, critical-only, full)
Nice to Have (Future):
- Documentation in README
- Dashboard for URL health trends
- Historical tracking of broken links
- Slack notifications on failures
Risk Mitigation
If Rate Limiting Still Occurs
Unlikely but possible scenarios:
- Sample size too large:
  - Reduce from 5,000 to 3,000
  - Adjust in the Makefile: `"sample" 3000`
- Critical URLs too many:
  - Review and reduce the critical list to the top 100
  - Prioritize get-started guides only
- Concurrent requests issue:
  - Already addressed (kept existing 10s delays)
  - Could reduce batch size from 1,000 to 500
Rollback Strategy
If issues arise:
- Immediate: Switch to critical-only mode via workflow dispatch
- Short-term: Reduce sample size to 2,500
- Full rollback: `git revert` to the previous version
Future Enhancements (Post-MVP)
Phase 2 Improvements
- Smart Sampling:
  - Weight by page views (popular pages checked more often)
  - Check recently modified pages more frequently
  - Use rank.js scores to prioritize
- Historical Tracking:
  - Store results over time in the repo
  - Identify chronically failing URLs
  - Trend analysis dashboard
- Integration:
  - Slack notifications for failures
  - Automated GitHub issue creation
  - CloudFront log analysis for 404 trends
- Incremental Checking:
  - Check only URLs modified since the last deploy
  - Use git diff to identify changed pages
  - Full sample weekly, incremental daily