
Implement URL sampling strategy to avoid rate limiting in check-search-urls #17245

@CamSoper

Description


URL Checking with Sampling - Implementation Plan

Problem Statement

The check-urls script consistently fails after ~11,000-12,000 requests because of cumulative AWS rate limiting against a single IP address. Multiple attempts with various delay strategies (2s, 10s) and concurrency limits all failed at approximately the same threshold. This is not a per-request rate limit but a cumulative one: AWS flags ~11k requests from one IP as suspicious activity regardless of pacing.

Current behavior:

  • Total URLs in search index: ~66,000
  • Failure point: Group 12 (~12,000 URLs)
  • No amount of delay prevents this cumulative limit

Solution: Sampling + Critical URL Monitoring

Implement a three-tier checking approach that stays well under the ~11k threshold:

  1. Critical URLs (~200 URLs) - Always checked, high-priority pages
  2. Random Sample (5,000 URLs) - Stratified sampling for broad coverage
  3. Full Check Mode (optional) - Manual trigger for complete validation

Total URLs per run: ~5,200 (well under 11k limit)


Architecture Overview

Three Operating Modes

| Mode | URLs Checked | Use Case | Time |
|------|--------------|----------|------|
| sample (default) | ~5,200 | Daily CI checks | 25-30 min |
| critical-only | ~200 | Fast validation | 1-2 min |
| full | ~66,000 | Manual testing only | N/A (will fail) |

Sampling Strategy: Stratified Random

Why stratified: pure random sampling could miss entire sections of the site. Stratification guarantees coverage across all site areas on every run.

Stratification by section:

  • Docs: 40% of sample (2,000 URLs)
  • Registry: 35% of sample (1,750 URLs)
  • Blog: 10% of sample (500 URLs)
  • Tutorials: 10% of sample (500 URLs)
  • Other: 5% of sample (250 URLs)

Deterministic seeding:

  • Use date-based seed (YYYY-MM-DD)
  • Reproducible for debugging
  • Different sample each day for broad coverage over time
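
A minimal sketch of the seed derivation (the same expression the argument parsing below uses): the seed is just the UTC date string, so any run can be replayed by passing that date explicitly.

// Default seed: today's UTC date, e.g. "2026-01-26". Re-running with that
// exact string as an explicit seed replays the same sample.
const seed = new Date().toISOString().split("T")[0];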

Implementation Details

File 1: Create /workspaces/src/pulumi/docs/scripts/search/critical-urls.js

Purpose: Define critical URL lists (always checked)

Structure: JavaScript module following existing patterns from rank.js

Content:

module.exports = {
    // Static list of critical URLs
    getStaticCriticalURLs() {
        return [
            // Homepage & Navigation
            "/",
            "/docs/",
            "/registry/",
            "/blog/",
            "/templates/",
            "/product/",
            "/pricing/",

            // Core Get-Started Guides (Tier 1 from rank.js)
            "/docs/get-started/",
            "/docs/clouds/aws/get-started/",
            "/docs/clouds/azure/get-started/",
            "/docs/clouds/gcp/get-started/",
            "/docs/clouds/kubernetes/get-started/",

            // Cloud Landing Pages
            "/docs/clouds/aws/",
            "/docs/clouds/azure/",
            "/docs/clouds/gcp/",
            "/docs/clouds/kubernetes/",

            // Core Documentation Sections
            "/docs/concepts/",
            "/docs/concepts/projects/",
            "/docs/concepts/stacks/",
            "/docs/concepts/resources/",
            "/docs/using-pulumi/",
            "/docs/pulumi-cloud/",

            // Tier-1 Provider Pages
            "/registry/packages/aws/",
            "/registry/packages/azure-native/",
            "/registry/packages/gcp/",
            "/registry/packages/kubernetes/",
            "/registry/packages/aws/installation-configuration/",
            "/registry/packages/azure-native/installation-configuration/",
            "/registry/packages/gcp/installation-configuration/",
            "/registry/packages/kubernetes/installation-configuration/",

            // Component Packages
            "/registry/packages/awsx/",
            "/registry/packages/eks/",
        ];
    },

    // Pattern-based critical URLs (regex)
    getCriticalURLPatterns() {
        return [
            /^\/docs\/[^/]+\/$/,  // All top-level doc sections
            /^\/docs\/clouds\/(aws|azure|gcp|kubernetes)\/get-started\/$/,
        ];
    },

    // Check if URL is critical
    isCriticalURL(href) {
        if (this.getStaticCriticalURLs().includes(href)) {
            return true;
        }
        return this.getCriticalURLPatterns().some(pattern => pattern.test(href));
    },

    // Filter critical URLs from full list
    filterCriticalURLs(allObjects) {
        return allObjects.filter(obj => this.isCriticalURL(obj.href));
    }
};
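
For reference, a quick usage sketch of the module (the example hrefs beyond the static list are illustrative):

const criticalURLs = require("./critical-urls");

criticalURLs.isCriticalURL("/docs/concepts/");  // true: in the static list
criticalURLs.isCriticalURL("/docs/esc/");       // true: matches /^\/docs\/[^/]+\/$/
criticalURLs.isCriticalURL("/blog/some-post/"); // false: not critical, eligible for sampling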

Critical URL Selection Criteria:

  • Homepage and main navigation
  • All "get-started" guides
  • Core documentation sections (concepts, using-pulumi, pulumi-cloud)
  • Tier-1 cloud provider landing and installation pages
  • Top component packages

File 2: Modify /workspaces/src/pulumi/docs/scripts/search/check-urls.js

Change 1: Add URL Selection Function (after line 10)

Purpose: Select which URLs to check based on mode

function selectURLsToCheck(allObjects, options = {}) {
    const {
        mode = 'sample',
        sampleSize = 5000,
    } = options;
    // Fall back to a date-based seed when none is given. Reading options.seed
    // directly (rather than via a destructuring default) also covers the
    // explicit null that checkSearchURLs passes when no seed is supplied.
    const seed = options.seed || new Date().toISOString().split('T')[0];

    const criticalURLs = require('./critical-urls');

    // Extract critical URLs
    const critical = criticalURLs.filterCriticalURLs(allObjects);
    const criticalHrefs = new Set(critical.map(obj => obj.href));

    console.log(`Found ${critical.length} critical URLs`);

    // Handle different modes
    if (mode === 'full') {
        console.log('Running in FULL mode - checking all URLs');
        return { selected: allObjects, metadata: { mode: 'full', total: allObjects.length, critical: critical.length, sampled: 0 } };
    }

    if (mode === 'critical-only') {
        console.log('Running in CRITICAL-ONLY mode');
        return { selected: critical, metadata: { mode: 'critical-only', total: allObjects.length, critical: critical.length, sampled: 0 } };
    }

    // SAMPLE mode: stratified sampling
    const remaining = allObjects.filter(obj => !criticalHrefs.has(obj.href));

    // Stratify by section
    const sections = {
        docs: remaining.filter(obj => obj.href.startsWith('/docs')),
        registry: remaining.filter(obj => obj.href.startsWith('/registry')),
        blog: remaining.filter(obj => obj.href.startsWith('/blog')),
        tutorials: remaining.filter(obj => obj.href.startsWith('/tutorials')),
        other: remaining.filter(obj =>
            !obj.href.startsWith('/docs') &&
            !obj.href.startsWith('/registry') &&
            !obj.href.startsWith('/blog') &&
            !obj.href.startsWith('/tutorials')
        ),
    };

    // Calculate target sample sizes per section
    const targetSampleSize = sampleSize - critical.length;
    const sampleSizes = {
        docs: Math.floor(targetSampleSize * 0.40),
        registry: Math.floor(targetSampleSize * 0.35),
        blog: Math.floor(targetSampleSize * 0.10),
        tutorials: Math.floor(targetSampleSize * 0.10),
        other: Math.floor(targetSampleSize * 0.05),
    };

    // Seeded random number generator
    let seedValue = hashString(seed);
    function seededRandom() {
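        // Simple linear congruential generator: deterministic for a given seed string.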
        seedValue = (seedValue * 9301 + 49297) % 233280;
        return seedValue / 233280;
    }

    // Sample from each section
    const sampled = [];
    for (const [section, urls] of Object.entries(sections)) {
        const target = Math.min(sampleSizes[section], urls.length);
        const sectionSample = shuffleArray([...urls], seededRandom).slice(0, target);
        sampled.push(...sectionSample);
        console.log(`  ↳ ${section}: sampled ${target} of ${urls.length} URLs`);
    }

    // Combine critical + sampled
    const selected = [...critical, ...sampled];

    console.log(`\nSelected ${selected.length} total URLs to check:`);
    console.log(`  - Critical: ${critical.length}`);
    console.log(`  - Sampled: ${sampled.length}`);
    console.log(`  - Skipped: ${allObjects.length - selected.length}`);
    console.log(`  - Seed: ${seed}\n`);

    return {
        selected,
        metadata: {
            mode: 'sample',
            total: allObjects.length,
            critical: critical.length,
            sampled: sampled.length,
            seed: seed,
        }
    };
}

// Helper: Simple string hash
function hashString(str) {
    let hash = 0;
    for (let i = 0; i < str.length; i++) {
        const char = str.charCodeAt(i);
        hash = ((hash << 5) - hash) + char;
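        // Bitwise AND with itself coerces the value back to a 32-bit signed integer.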
        hash = hash & hash;
    }
    return Math.abs(hash);
}

// Helper: Fisher-Yates shuffle with seeded random
function shuffleArray(array, randomFn) {
    for (let i = array.length - 1; i > 0; i--) {
        const j = Math.floor(randomFn() * (i + 1));
        [array[i], array[j]] = [array[j], array[i]];
    }
    return array;
}
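
The LCG above is deliberately simple; it only needs to be deterministic, not cryptographic. A quick sanity check (hypothetical, not part of the plan) confirms that a fixed seed reproduces the same selection over the same index:

// Hypothetical check: identical seeds over the same index yield identical samples.
const runA = selectURLsToCheck(allObjects, { seed: '2026-01-26' });
const runB = selectURLsToCheck(allObjects, { seed: '2026-01-26' });
console.assert(
    runA.selected.length === runB.selected.length &&
    runA.selected.every((obj, i) => obj.href === runB.selected[i].href),
    'seeded sampling should be deterministic'
);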

Change 2: Update Main Function Signature and Logic (line 5)

Before:

async function checkSearchURLs(baseURL) {

After:

async function checkSearchURLs(baseURL, mode = 'sample', sampleSize = 5000, seed = null) {

Then update function body (after line 10):

Before:

const objects = fs.readFileSync("./public/search-index.json", "utf-8")...
console.log(`Checking ${objects.length} search URLs...`);

After:

const objects = fs.readFileSync("./public/search-index.json", "utf-8")...
console.log(`Loaded ${objects.length} URLs from search index.`);

// Select URLs based on mode
const selection = selectURLsToCheck(objects, { mode, sampleSize, seed });
const urlsToCheck = selection.selected;

console.log(`Checking ${urlsToCheck.length} URLs...\n`);

Change 3: Update Chunking to Use Selected URLs (lines 15-21)

Change all references from objects to urlsToCheck:

for (let i = 0; i < urlsToCheck.length; i += chunkSize) {
    chunks.push({ chunk: i / chunkSize, objects: urlsToCheck.slice(i, i + chunkSize) });
}

Change 4: Update Function Call and Argument Parsing (line 84)

Before:

checkSearchURLs(process.argv[2] || "https://www.pulumi.com")

After:

const baseURL = process.argv[2] || "https://www.pulumi.com";
const mode = process.argv[3] || "sample";
const sampleSize = parseInt(process.argv[4], 10) || 5000;
const seed = process.argv[5] || new Date().toISOString().split('T')[0];

checkSearchURLs(baseURL, mode, sampleSize, seed)
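
Optional hardening (not part of the original plan): validate the mode before invoking the checker, since an unrecognized value would otherwise silently fall through to sample mode in selectURLsToCheck.

// Hypothetical guard: fail fast on an unknown mode.
const validModes = ["sample", "critical-only", "full"];
if (!validModes.includes(mode)) {
    console.error(`Unknown mode "${mode}". Expected one of: ${validModes.join(", ")}`);
    process.exit(1);
}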

Change 5: Enhance JSON Output (lines 87-92)

Add metadata to summary:

const summary = {
    checked: results.length,
    fulfilled: results.filter(r => r.status === "fulfilled").map(r => r.value),
    rejected: results.filter(r => r.status === "rejected").map(r => r.reason),
    metadata: selection.metadata,  // ADD THIS LINE
    timestamp: new Date().toISOString(),  // ADD THIS LINE
};

Note: selection must be in scope where the summary is built (return it from checkSearchURLs or make it module-level).
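
One way to satisfy that note (a sketch with hypothetical structure; runChecks stands in for the script's existing fetch-and-verify loop):

// Sketch: keep `selection` in scope by returning it from checkSearchURLs,
// then merge its metadata into the summary. `runChecks` is a hypothetical
// placeholder for the existing checking logic.
async function checkSearchURLs(baseURL, mode = "sample", sampleSize = 5000, seed = null) {
    const objects = JSON.parse(fs.readFileSync("./public/search-index.json", "utf-8"));
    const selection = selectURLsToCheck(objects, { mode, sampleSize, seed });
    const results = await runChecks(selection.selected, baseURL);
    return { results, metadata: selection.metadata };
}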


File 3: Modify /workspaces/src/pulumi/docs/scripts/search/check-urls.sh

Change line 16:

Before:

node "./scripts/search/check-urls.js" "$base_url"

After:

mode="${3:-sample}"
node "./scripts/search/check-urls.js" "$base_url" "$mode"

Update usage comment at top of file:

# Usage: ./check-urls.sh <index-name> <base-url> [mode] [sample-size]
# Modes: sample (default), critical-only, full

File 4: Update /workspaces/src/pulumi/docs/Makefile

Change existing target (around line 68):

Before:

check_search_urls:
	$(MAKE) banner
	$(MAKE) ensure
	./scripts/search/check-urls.sh production "https://www.pulumi.com"

After:

check_search_urls:
	$(MAKE) banner
	$(MAKE) ensure
	./scripts/search/check-urls.sh production "https://www.pulumi.com" "sample"

Add new targets:

check_search_urls_critical:
	$(MAKE) banner
	$(MAKE) ensure
	./scripts/search/check-urls.sh production "https://www.pulumi.com" "critical-only"

check_search_urls_full:
	$(MAKE) banner
	$(MAKE) ensure
	./scripts/search/check-urls.sh production "https://www.pulumi.com" "full"

Testing Strategy

Phase 1: Local Testing (Critical-Only Mode)

Fast validation without full index:

# Build site and generate search index
make build

# Test critical-only mode (should take 1-2 minutes)
make check_search_urls_critical

Expected output:

  • "Found X critical URLs"
  • "Running in CRITICAL-ONLY mode"
  • All critical URLs should pass (no broken links)

Phase 2: Local Testing (Sample Mode)

Test with small sample first:

# Test with 1000 URL sample (fast)
node ./scripts/search/check-urls.js "https://www.pulumi.com" "sample" 1000

# Test with full 5000 sample
make check_search_urls

Verify:

  • Logs show stratification (docs: X, registry: Y, etc.)
  • Total URLs checked ~= 5,200
  • All critical URLs included
  • No rate limiting errors

Phase 3: Seed Reproducibility Test

Verify same seed produces same sample:

# Run twice with same seed
node ./scripts/search/check-urls.js "https://www.pulumi.com" "sample" 5000 "2026-01-26" > run1.log
node ./scripts/search/check-urls.js "https://www.pulumi.com" "sample" 5000 "2026-01-26" > run2.log

# Compare - should be identical
diff run1.log run2.log

Phase 4: CI Integration Test

Push changes and monitor GitHub Actions:

  1. Commit and push changes
  2. Wait for scheduled job (or trigger manually)
  3. Verify in logs:
    • Job completes without timeout
    • No ECONNREFUSED or ETIMEDOUT errors after retries
    • Shows "Selected X total URLs to check"
    • Completes all groups successfully

Success criteria:

  • ✅ Script completes without rate limiting
  • ✅ Runtime < 30 minutes
  • ✅ All critical URLs checked
  • ✅ Sample includes all sections

Expected Outcomes

Immediate Benefits

  1. Rate limiting resolved: 5,200 URLs well under 11k limit
  2. Faster feedback: 25-30 min vs timeout/failure
  3. Critical coverage guaranteed: High-priority pages always checked
  4. Broad coverage over time: 5,000 different URLs daily (7.5% of site)
  5. Reproducible: Seed-based sampling for debugging

Trade-offs

Advantages:

  • Simple implementation (no infrastructure changes)
  • Stays within AWS limits
  • Fast CI feedback
  • Easy to maintain and understand

Disadvantages:

  • Not all URLs checked each run (~92.5% skipped)
  • Broken links in non-critical pages may take days to detect
  • Probabilistic coverage (not deterministic for all URLs)

Mitigation:

  • Run full checks monthly (manual trigger during low-traffic times)
  • Monitor 404 rates in CloudFront logs for production issues
  • Critical URLs provide confidence in most important paths
  • Stratification ensures all sections represented

Files to Create/Modify

New Files

  1. /workspaces/src/pulumi/docs/scripts/search/critical-urls.js - Critical URL configuration

Modified Files

  1. /workspaces/src/pulumi/docs/scripts/search/check-urls.js - Add sampling logic
  2. /workspaces/src/pulumi/docs/scripts/search/check-urls.sh - Accept mode parameter
  3. /workspaces/src/pulumi/docs/Makefile - Update targets

No Changes Needed

  • .github/workflows/check-search-urls.yml - Uses default Makefile target (now sample mode)

Implementation Sequence

  1. Create critical-urls.js with initial list (~200 URLs)
  2. Add helper functions (hashString, shuffleArray) to check-urls.js
  3. Add selectURLsToCheck function to check-urls.js
  4. Update checkSearchURLs function signature and logic
  5. Update argument parsing at bottom of check-urls.js
  6. Enhance JSON output with metadata
  7. Update check-urls.sh to accept mode parameter
  8. Update Makefile targets
  9. Test locally with critical-only mode
  10. Test locally with sample mode (small sample first)
  11. Verify reproducibility with fixed seed
  12. Commit and push to test in CI
  13. Monitor GitHub Actions for successful completion

Success Criteria

Must Have (MVP):

  • Daily scheduled checks complete without rate limiting
  • Critical URLs always checked (verified in logs)
  • Sample mode checks ~5,200 URLs
  • Results JSON includes selection metadata
  • No ECONNREFUSED errors in CI logs

Should Have:

  • Stratified sampling across all sections
  • Deterministic seeded sampling
  • Logging shows sample distribution
  • All three modes working (sample, critical-only, full)

Nice to Have (Future):

  • Documentation in README
  • Dashboard for URL health trends
  • Historical tracking of broken links
  • Slack notifications on failures

Risk Mitigation

If Rate Limiting Still Occurs

Unlikely but possible scenarios:

  1. Sample size too large:

    • Reduce from 5,000 to 3,000
    • Adjust in Makefile: "sample" 3000
  2. Critical URLs too many:

    • Review and reduce critical list to top 100
    • Prioritize get-started guides only
  3. Concurrent requests issue:

    • Already addressed (kept existing 10s delays)
    • Could reduce batch size from 1000 to 500

Rollback Strategy

If issues arise:

  1. Immediate: Switch to critical-only mode via workflow dispatch
  2. Short-term: Reduce sample size to 2,500
  3. Full rollback: git revert to previous version

Future Enhancements (Post-MVP)

Phase 2 Improvements

  1. Smart Sampling:

    • Weight by page views (popular pages checked more often)
    • Check recently modified pages more frequently
    • Use rank.js scores to prioritize
  2. Historical Tracking:

    • Store results over time in repo
    • Identify chronically failing URLs
    • Trend analysis dashboard
  3. Integration:

    • Slack notifications for failures
    • Automated GitHub issue creation
    • CloudFront log analysis for 404 trends
  4. Incremental Checking:

    • Check only URLs modified since last deploy
    • Use git diff to identify changed pages
    • Full sample weekly, incremental daily
