
Implement URL sampling strategy to avoid rate limiting in check-search-urls #17245

@CamSoper

Description


URL Checking with Sampling - Implementation Plan

Problem Statement

The check-urls script consistently fails after ~11,000-12,000 requests because of cumulative AWS rate limiting against a single IP address. Multiple attempts with various delay strategies (2s, 10s) and concurrency limits all failed at approximately the same threshold. This is not a per-request rate limit but a cumulative one: AWS flags ~11k requests from one IP as suspicious activity regardless of pacing.

Current behavior:

  • Total URLs in search index: ~66,000
  • Failure point: Group 12 (~12,000 URLs)
  • No amount of delay prevents this cumulative limit

Solution: Sampling + Critical URL Monitoring

Implement a three-tier checking approach that stays well under the ~11k threshold:

  1. Critical URLs (~200 URLs) - Always checked, high-priority pages
  2. Random Sample (5,000 URLs) - Stratified sampling for broad coverage
  3. Full Check Mode (optional) - Manual trigger for complete validation

Total URLs per run: ~5,200 (well under 11k limit)


Architecture Overview

Three Operating Modes

| Mode | URLs Checked | Use Case | Time |
|------|--------------|----------|------|
| sample (default) | ~5,200 | Daily CI checks | 25-30 min |
| critical-only | ~200 | Fast validation | 1-2 min |
| full | ~66,000 | Manual testing only | N/A (will fail) |

Sampling Strategy: Stratified Random

Why stratified: pure random sampling could miss entire sections of the site. Stratification guarantees coverage across all site areas on every run.

Stratification by section:

  • Docs: 40% of sample (2,000 URLs)
  • Registry: 35% of sample (1,750 URLs)
  • Blog: 10% of sample (500 URLs)
  • Tutorials: 10% of sample (500 URLs)
  • Other: 5% of sample (250 URLs)

Deterministic seeding:

  • Use date-based seed (YYYY-MM-DD)
  • Reproducible for debugging
  • Different sample each day for broad coverage over time
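
A minimal sketch of the seed derivation (the same expression the argument parsing below uses): the seed is just the UTC date string, so any run can be replayed by passing that date explicitly.

// Default seed: today's UTC date, e.g. "2026-01-26". Re-running with that
// exact string as an explicit seed replays the same sample.
const seed = new Date().toISOString().split("T")[0];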

Implementation Details

File 1: Create /workspaces/src/pulumi/docs/scripts/search/critical-urls.js

Purpose: Define critical URL lists (always checked)

Structure: JavaScript module following existing patterns from rank.js

Content:

module.exports = {
    // Static list of critical URLs
    getStaticCriticalURLs() {
        return [
            // Homepage & Navigation
            "/",
            "/docs/",
            "/registry/",
            "/blog/",
            "/templates/",
            "/product/",
            "/pricing/",

            // Core Get-Started Guides (Tier 1 from rank.js)
            "/docs/get-started/",
            "/docs/clouds/aws/get-started/",
            "/docs/clouds/azure/get-started/",
            "/docs/clouds/gcp/get-started/",
            "/docs/clouds/kubernetes/get-started/",

            // Cloud Landing Pages
            "/docs/clouds/aws/",
            "/docs/clouds/azure/",
            "/docs/clouds/gcp/",
            "/docs/clouds/kubernetes/",

            // Core Documentation Sections
            "/docs/concepts/",
            "/docs/concepts/projects/",
            "/docs/concepts/stacks/",
            "/docs/concepts/resources/",
            "/docs/using-pulumi/",
            "/docs/pulumi-cloud/",

            // Tier-1 Provider Pages
            "/registry/packages/aws/",
            "/registry/packages/azure-native/",
            "/registry/packages/gcp/",
            "/registry/packages/kubernetes/",
            "/registry/packages/aws/installation-configuration/",
            "/registry/packages/azure-native/installation-configuration/",
            "/registry/packages/gcp/installation-configuration/",
            "/registry/packages/kubernetes/installation-configuration/",

            // Component Packages
            "/registry/packages/awsx/",
            "/registry/packages/eks/",
        ];
    },

    // Pattern-based critical URLs (regex)
    getCriticalURLPatterns() {
        return [
            /^\/docs\/[^/]+\/$/,  // All top-level doc sections
            /^\/docs\/clouds\/(aws|azure|gcp|kubernetes)\/get-started\/$/,
        ];
    },

    // Check if URL is critical
    isCriticalURL(href) {
        if (this.getStaticCriticalURLs().includes(href)) {
            return true;
        }
        return this.getCriticalURLPatterns().some(pattern => pattern.test(href));
    },

    // Filter critical URLs from full list
    filterCriticalURLs(allObjects) {
        return allObjects.filter(obj => this.isCriticalURL(obj.href));
    }
};
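
For reference, a quick usage sketch of the module (the example hrefs beyond the static list are illustrative):

const criticalURLs = require("./critical-urls");

criticalURLs.isCriticalURL("/docs/concepts/");  // true: in the static list
criticalURLs.isCriticalURL("/docs/esc/");       // true: matches /^\/docs\/[^/]+\/$/
criticalURLs.isCriticalURL("/blog/some-post/"); // false: not critical, eligible for sampling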

Critical URL Selection Criteria:

  • Homepage and main navigation
  • All "get-started" guides
  • Core documentation sections (concepts, using-pulumi, pulumi-cloud)
  • Tier-1 cloud provider landing and installation pages
  • Top component packages

File 2: Modify /workspaces/src/pulumi/docs/scripts/search/check-urls.js

Change 1: Add URL Selection Function (after line 10)

Purpose: Select which URLs to check based on mode

function selectURLsToCheck(allObjects, options = {}) {
    const {
        mode = 'sample',
        sampleSize = 5000,
    } = options;
    // Fall back to a date-based seed when none is given. Reading options.seed
    // directly (rather than via a destructuring default) also covers the
    // explicit null that checkSearchURLs passes when no seed is supplied.
    const seed = options.seed || new Date().toISOString().split('T')[0];

    const criticalURLs = require('./critical-urls');

    // Extract critical URLs
    const critical = criticalURLs.filterCriticalURLs(allObjects);
    const criticalHrefs = new Set(critical.map(obj => obj.href));

    console.log(`Found ${critical.length} critical URLs`);

    // Handle different modes
    if (mode === 'full') {
        console.log('Running in FULL mode - checking all URLs');
        return { selected: allObjects, metadata: { mode: 'full', total: allObjects.length, critical: critical.length, sampled: 0 } };
    }

    if (mode === 'critical-only') {
        console.log('Running in CRITICAL-ONLY mode');
        return { selected: critical, metadata: { mode: 'critical-only', total: allObjects.length, critical: critical.length, sampled: 0 } };
    }

    // SAMPLE mode: stratified sampling
    const remaining = allObjects.filter(obj => !criticalHrefs.has(obj.href));

    // Stratify by section
    const sections = {
        docs: remaining.filter(obj => obj.href.startsWith('/docs')),
        registry: remaining.filter(obj => obj.href.startsWith('/registry')),
        blog: remaining.filter(obj => obj.href.startsWith('/blog')),
        tutorials: remaining.filter(obj => obj.href.startsWith('/tutorials')),
        other: remaining.filter(obj =>
            !obj.href.startsWith('/docs') &&
            !obj.href.startsWith('/registry') &&
            !obj.href.startsWith('/blog') &&
            !obj.href.startsWith('/tutorials')
        ),
    };

    // Calculate target sample sizes per section
    const targetSampleSize = sampleSize - critical.length;
    const sampleSizes = {
        docs: Math.floor(targetSampleSize * 0.40),
        registry: Math.floor(targetSampleSize * 0.35),
        blog: Math.floor(targetSampleSize * 0.10),
        tutorials: Math.floor(targetSampleSize * 0.10),
        other: Math.floor(targetSampleSize * 0.05),
    };

    // Seeded random number generator
    let seedValue = hashString(seed);
    function seededRandom() {
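        // Simple linear congruential generator: deterministic for a given seed string.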
        seedValue = (seedValue * 9301 + 49297) % 233280;
        return seedValue / 233280;
    }

    // Sample from each section
    const sampled = [];
    for (const [section, urls] of Object.entries(sections)) {
        const target = Math.min(sampleSizes[section], urls.length);
        const sectionSample = shuffleArray([...urls], seededRandom).slice(0, target);
        sampled.push(...sectionSample);
        console.log(`  ↳ ${section}: sampled ${target} of ${urls.length} URLs`);
    }

    // Combine critical + sampled
    const selected = [...critical, ...sampled];

    console.log(`\nSelected ${selected.length} total URLs to check:`);
    console.log(`  - Critical: ${critical.length}`);
    console.log(`  - Sampled: ${sampled.length}`);
    console.log(`  - Skipped: ${allObjects.length - selected.length}`);
    console.log(`  - Seed: ${seed}\n`);

    return {
        selected,
        metadata: {
            mode: 'sample',
            total: allObjects.length,
            critical: critical.length,
            sampled: sampled.length,
            seed: seed,
        }
    };
}

// Helper: Simple string hash
function hashString(str) {
    let hash = 0;
    for (let i = 0; i < str.length; i++) {
        const char = str.charCodeAt(i);
        hash = ((hash << 5) - hash) + char;
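        // Bitwise AND with itself coerces the value back to a 32-bit signed integer.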
        hash = hash & hash;
    }
    return Math.abs(hash);
}

// Helper: Fisher-Yates shuffle with seeded random
function shuffleArray(array, randomFn) {
    for (let i = array.length - 1; i > 0; i--) {
        const j = Math.floor(randomFn() * (i + 1));
        [array[i], array[j]] = [array[j], array[i]];
    }
    return array;
}
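
The LCG above is deliberately simple; it only needs to be deterministic, not cryptographic. A quick sanity check (hypothetical, not part of the plan) confirms that a fixed seed reproduces the same selection over the same index:

// Hypothetical check: identical seeds over the same index yield identical samples.
const runA = selectURLsToCheck(allObjects, { seed: '2026-01-26' });
const runB = selectURLsToCheck(allObjects, { seed: '2026-01-26' });
console.assert(
    runA.selected.length === runB.selected.length &&
    runA.selected.every((obj, i) => obj.href === runB.selected[i].href),
    'seeded sampling should be deterministic'
);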

Change 2: Update Main Function Signature and Logic (line 5)

Before:

async function checkSearchURLs(baseURL) {

After:

async function checkSearchURLs(baseURL, mode = 'sample', sampleSize = 5000, seed = null) {

Then update function body (after line 10):

Before:

const objects = fs.readFileSync("./public/search-index.json", "utf-8")...
console.log(`Checking ${objects.length} search URLs...`);

After:

const objects = fs.readFileSync("./public/search-index.json", "utf-8")...
console.log(`Loaded ${objects.length} URLs from search index.`);

// Select URLs based on mode
const selection = selectURLsToCheck(objects, { mode, sampleSize, seed });
const urlsToCheck = selection.selected;

console.log(`Checking ${urlsToCheck.length} URLs...\n`);

Change 3: Update Chunking to Use Selected URLs (lines 15-21)

Change all references from objects to urlsToCheck:

for (let i = 0; i < urlsToCheck.length; i += chunkSize) {
    chunks.push({ chunk: i / chunkSize, objects: urlsToCheck.slice(i, i + chunkSize) });
}

Change 4: Update Function Call and Argument Parsing (line 84)

Before:

checkSearchURLs(process.argv[2] || "https://www.pulumi.com")

After:

const baseURL = process.argv[2] || "https://www.pulumi.com";
const mode = process.argv[3] || "sample";
const sampleSize = parseInt(process.argv[4], 10) || 5000;
const seed = process.argv[5] || new Date().toISOString().split('T')[0];

checkSearchURLs(baseURL, mode, sampleSize, seed)
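
Optional hardening (not part of the original plan): validate the mode before invoking the checker, since an unrecognized value would otherwise silently fall through to sample mode in selectURLsToCheck.

// Hypothetical guard: fail fast on an unknown mode.
const validModes = ["sample", "critical-only", "full"];
if (!validModes.includes(mode)) {
    console.error(`Unknown mode "${mode}". Expected one of: ${validModes.join(", ")}`);
    process.exit(1);
}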

Change 5: Enhance JSON Output (lines 87-92)

Add metadata to summary:

const summary = {
    checked: results.length,
    fulfilled: results.filter(r => r.status === "fulfilled").map(r => r.value),
    rejected: results.filter(r => r.status === "rejected").map(r => r.reason),
    metadata: selection.metadata,  // ADD THIS LINE
    timestamp: new Date().toISOString(),  // ADD THIS LINE
};

Note: selection must be in scope where the summary is built (return it from checkSearchURLs or make it module-level).
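
One way to satisfy that note (a sketch with hypothetical structure; runChecks stands in for the script's existing fetch-and-verify loop):

// Sketch: keep `selection` in scope by returning it from checkSearchURLs,
// then merge its metadata into the summary. `runChecks` is a hypothetical
// placeholder for the existing checking logic.
async function checkSearchURLs(baseURL, mode = "sample", sampleSize = 5000, seed = null) {
    const objects = JSON.parse(fs.readFileSync("./public/search-index.json", "utf-8"));
    const selection = selectURLsToCheck(objects, { mode, sampleSize, seed });
    const results = await runChecks(selection.selected, baseURL);
    return { results, metadata: selection.metadata };
}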


File 3: Modify /workspaces/src/pulumi/docs/scripts/search/check-urls.sh

Change line 16:

Before:

node "./scripts/search/check-urls.js" "$base_url"

After:

mode="${3:-sample}"
node "./scripts/search/check-urls.js" "$base_url" "$mode"

Update usage comment at top of file:

# Usage: ./check-urls.sh <index-name> <base-url> [mode] [sample-size]
# Modes: sample (default), critical-only, full

File 4: Update /workspaces/src/pulumi/docs/Makefile

Change existing target (around line 68):

Before:

check_search_urls:
	$(MAKE) banner
	$(MAKE) ensure
	./scripts/search/check-urls.sh production "https://www.pulumi.com"

After:

check_search_urls:
	$(MAKE) banner
	$(MAKE) ensure
	./scripts/search/check-urls.sh production "https://www.pulumi.com" "sample"

Add new targets:

check_search_urls_critical:
	$(MAKE) banner
	$(MAKE) ensure
	./scripts/search/check-urls.sh production "https://www.pulumi.com" "critical-only"

check_search_urls_full:
	$(MAKE) banner
	$(MAKE) ensure
	./scripts/search/check-urls.sh production "https://www.pulumi.com" "full"

Testing Strategy

Phase 1: Local Testing (Critical-Only Mode)

Fast validation without full index:

# Build site and generate search index
make build

# Test critical-only mode (should take 1-2 minutes)
make check_search_urls_critical

Expected output:

  • "Found X critical URLs"
  • "Running in CRITICAL-ONLY mode"
  • All critical URLs should pass (no broken links)

Phase 2: Local Testing (Sample Mode)

Test with small sample first:

# Test with 1000 URL sample (fast)
node ./scripts/search/check-urls.js "https://www.pulumi.com" "sample" 1000

# Test with full 5000 sample
make check_search_urls

Verify:

  • Logs show stratification (docs: X, registry: Y, etc.)
  • Total URLs checked ~= 5,200
  • All critical URLs included
  • No rate limiting errors

Phase 3: Seed Reproducibility Test

Verify same seed produces same sample:

# Run twice with same seed
node ./scripts/search/check-urls.js "https://www.pulumi.com" "sample" 5000 "2026-01-26" > run1.log
node ./scripts/search/check-urls.js "https://www.pulumi.com" "sample" 5000 "2026-01-26" > run2.log

# Compare - should be identical
diff run1.log run2.log

Phase 4: CI Integration Test

Push changes and monitor GitHub Actions:

  1. Commit and push changes
  2. Wait for scheduled job (or trigger manually)
  3. Verify in logs:
    • Job completes without timeout
    • No ECONNREFUSED or ETIMEDOUT errors after retries
    • Shows "Selected X total URLs to check"
    • Completes all groups successfully

Success criteria:

  • ✅ Script completes without rate limiting
  • ✅ Runtime < 30 minutes
  • ✅ All critical URLs checked
  • ✅ Sample includes all sections

Expected Outcomes

Immediate Benefits

  1. Rate limiting resolved: 5,200 URLs well under 11k limit
  2. Faster feedback: 25-30 min vs timeout/failure
  3. Critical coverage guaranteed: High-priority pages always checked
  4. Broad coverage over time: 5,000 different URLs daily (7.5% of site)
  5. Reproducible: Seed-based sampling for debugging

Trade-offs

Advantages:

  • Simple implementation (no infrastructure changes)
  • Stays within AWS limits
  • Fast CI feedback
  • Easy to maintain and understand

Disadvantages:

  • Not all URLs checked each run (~92.5% skipped)
  • Broken links in non-critical pages may take days to detect
  • Probabilistic coverage (not deterministic for all URLs)

Mitigation:

  • Run full checks monthly (manual trigger during low-traffic times)
  • Monitor 404 rates in CloudFront logs for production issues
  • Critical URLs provide confidence in most important paths
  • Stratification ensures all sections represented

Files to Create/Modify

New Files

  1. /workspaces/src/pulumi/docs/scripts/search/critical-urls.js - Critical URL configuration

Modified Files

  1. /workspaces/src/pulumi/docs/scripts/search/check-urls.js - Add sampling logic
  2. /workspaces/src/pulumi/docs/scripts/search/check-urls.sh - Accept mode parameter
  3. /workspaces/src/pulumi/docs/Makefile - Update targets

No Changes Needed

  • .github/workflows/check-search-urls.yml - Uses default Makefile target (now sample mode)

Implementation Sequence

  1. Create critical-urls.js with initial list (~200 URLs)
  2. Add helper functions (hashString, shuffleArray) to check-urls.js
  3. Add selectURLsToCheck function to check-urls.js
  4. Update checkSearchURLs function signature and logic
  5. Update argument parsing at bottom of check-urls.js
  6. Enhance JSON output with metadata
  7. Update check-urls.sh to accept mode parameter
  8. Update Makefile targets
  9. Test locally with critical-only mode
  10. Test locally with sample mode (small sample first)
  11. Verify reproducibility with fixed seed
  12. Commit and push to test in CI
  13. Monitor GitHub Actions for successful completion

Success Criteria

Must Have (MVP):

  • Daily scheduled checks complete without rate limiting
  • Critical URLs always checked (verified in logs)
  • Sample mode checks ~5,200 URLs
  • Results JSON includes selection metadata
  • No ECONNREFUSED errors in CI logs

Should Have:

  • Stratified sampling across all sections
  • Deterministic seeded sampling
  • Logging shows sample distribution
  • All three modes working (sample, critical-only, full)

Nice to Have (Future):

  • Documentation in README
  • Dashboard for URL health trends
  • Historical tracking of broken links
  • Slack notifications on failures

Risk Mitigation

If Rate Limiting Still Occurs

Unlikely but possible scenarios:

  1. Sample size too large:

    • Reduce from 5,000 to 3,000
    • Adjust in Makefile: "sample" 3000
  2. Critical URLs too many:

    • Review and reduce critical list to top 100
    • Prioritize get-started guides only
  3. Concurrent requests issue:

    • Already addressed (kept existing 10s delays)
    • Could reduce batch size from 1000 to 500

Rollback Strategy

If issues arise:

  1. Immediate: Switch to critical-only mode via workflow dispatch
  2. Short-term: Reduce sample size to 2,500
  3. Full rollback: git revert to previous version

Future Enhancements (Post-MVP)

Phase 2 Improvements

  1. Smart Sampling:

    • Weight by page views (popular pages checked more often)
    • Check recently modified pages more frequently
    • Use rank.js scores to prioritize
  2. Historical Tracking:

    • Store results over time in repo
    • Identify chronically failing URLs
    • Trend analysis dashboard
  3. Integration:

    • Slack notifications for failures
    • Automated GitHub issue creation
    • CloudFront log analysis for 404 trends
  4. Incremental Checking:

    • Check only URLs modified since last deploy
    • Use git diff to identify changed pages
    • Full sample weekly, incremental daily
