diff --git a/.github/actions/link-checker/README.md b/.github/actions/link-checker/README.md new file mode 100644 index 0000000..d07b727 --- /dev/null +++ b/.github/actions/link-checker/README.md @@ -0,0 +1,345 @@ +# AI-Powered Link Checker Action + +This GitHub Action scans HTML files for web links and validates them, providing AI-powered suggestions for improvements. It's designed to replace traditional link checkers like `lychee` with enhanced functionality that not only detects broken links but also suggests better alternatives using AI-driven analysis. + +## Features + +- **Smart Link Validation**: Checks external web links in HTML files with configurable timeout and redirect handling +- **Enhanced Robustness**: Intelligent detection of bot-blocked sites to reduce false positives +- **AI-Powered Suggestions**: Provides intelligent recommendations for broken or redirected links +- **Two Scanning Modes**: Full project scan or PR-specific changed files only +- **Configurable Status Codes**: Define which HTTP status codes to silently report (e.g., 403, 503) +- **Redirect Detection**: Identifies and suggests updates for redirected links +- **GitHub Integration**: Creates issues, PR comments, and workflow artifacts +- **MyST Markdown Support**: Works with Jupyter Book projects by scanning HTML output +- **Performance Optimized**: Respectful rate limiting, improved timeouts, and efficient scanning + +## Usage + +### Basic Usage + +```yaml +- name: Check links in documentation + uses: QuantEcon/meta/.github/actions/link-checker@main +``` + +### Weekly Full Project Scan + +```yaml +name: Weekly Link Check +on: + schedule: + - cron: '0 9 * * 1' # Monday at 9 AM UTC + workflow_dispatch: + +jobs: + link-check: + runs-on: ubuntu-latest + permissions: + contents: read + issues: write + steps: + - uses: actions/checkout@v4 + with: + ref: gh-pages # Check the published site + + - name: AI-powered link check + uses: QuantEcon/meta/.github/actions/link-checker@main + with: + 
html-path: '.' + mode: 'full' + fail-on-broken: 'false' + create-issue: 'true' + ai-suggestions: 'true' + silent-codes: '403,503' + issue-title: 'Weekly Link Check Report' + notify: 'maintainer1,maintainer2' +``` + +### PR-Triggered Changed Files Only + +```yaml +name: PR Link Check +on: + pull_request: + branches: [ main ] + +jobs: + link-check: + runs-on: ubuntu-latest + permissions: + contents: read + pull-requests: write + steps: + - uses: actions/checkout@v4 + + - name: Build documentation + run: jupyter-book build . + + - name: Check links in changed files + uses: QuantEcon/meta/.github/actions/link-checker@main + with: + html-path: './_build/html' + mode: 'changed' + fail-on-broken: 'true' + ai-suggestions: 'true' + silent-codes: '403,503' +``` + +### Complete Advanced Usage + +```yaml +- name: Comprehensive link checking + uses: QuantEcon/meta/.github/actions/link-checker@main + with: + html-path: './_build/html' + mode: 'full' + silent-codes: '403,503,429' + fail-on-broken: 'false' + ai-suggestions: 'true' + create-issue: 'true' + issue-title: 'Link Check Report - Broken Links Found' + create-artifact: 'true' + artifact-name: 'detailed-link-report' + notify: 'team-lead,docs-maintainer' + timeout: '30' + max-redirects: '5' +``` + +## False Positive Reduction + +The action includes intelligent logic to reduce false positives for legitimate sites: + +### Bot Blocking Detection +- **Major Sites**: Automatically detects common sites that block automated requests (Netflix, Amazon, Facebook, etc.) 
+- **Encoding Issues**: Identifies encoding errors that often indicate bot protection +- **Status Code Analysis**: Recognizes rate limiting (429) and bot blocking patterns +- **Silent Reporting**: Marks likely bot-blocked sites as silent instead of broken + +### Improved Robustness +- **Browser-like Headers**: Uses realistic browser headers to reduce blocking +- **Increased Timeout**: Default 45-second timeout for slow-loading legitimate sites +- **Smart Error Handling**: Distinguishes between genuine broken links and temporary blocks + +### AI Suggestion Filtering +- **Constructive Suggestions**: Only suggests fixes, not removals, for legitimate domains +- **Manual Review**: Suggests manual verification for unknown domains instead of automatic removal +- **Domain Whitelist**: Recognizes trusted domains (GitHub, Python.org, etc.) and handles them appropriately + +## AI-Powered Suggestions + +The action includes intelligent analysis that can suggest: + +### Automatic Fixes +- **HTTPS Upgrades**: Detects `http://` links that should be `https://` +- **GitHub Branch Updates**: Finds `/master/` links that should be `/main/` +- **Documentation Migrations**: Suggests updated URLs for moved documentation sites +- **Version Updates**: Recommends newer versions of deprecated documentation + +### Redirect Optimization +- **Final Destination**: Suggests updating redirected links to their final destination +- **Performance**: Eliminates unnecessary redirect chains +- **Reliability**: Reduces dependency on redirect services + +### Example AI Suggestions Output: +``` +🤖 http://docs.python.org/2.7/library/urllib.html + Issue: Broken link (Status: 404) + 💡 version_update: https://docs.python.org/3/library/urllib.html + Reason: Python 2.7 is deprecated, consider Python 3 documentation + +🤖 http://github.com/user/repo/blob/master/README.md + Issue: Redirected 1 times + 💡 redirect_update: https://github.com/user/repo/blob/main/README.md + Reason: GitHub default branch changed from 
master to main +``` + +## How It Works + +1. **File Discovery**: Scans HTML files in the specified directory +2. **Link Extraction**: Uses BeautifulSoup to extract all external links +3. **Link Validation**: Checks each link with configurable timeout and redirect handling +4. **AI Analysis**: Applies rule-based AI to suggest improvements +5. **Reporting**: Creates detailed reports with actionable suggestions + +### Scanning Modes + +#### Full Mode (`mode: 'full'`) +- Scans all HTML files in the target directory +- Ideal for scheduled weekly scans +- Comprehensive coverage of entire project + +#### Changed Mode (`mode: 'changed'`) +- Only scans HTML files that changed in the current PR +- Efficient for PR-triggered workflows +- Falls back to full scan if no changes detected + +## Configuration + +### Silent Status Codes + +Configure which HTTP status codes should be reported without failing: + +```yaml +silent-codes: '403,503,429,502' +``` + +Common codes to consider: +- `403`: Forbidden (often due to bot detection) +- `503`: Service Unavailable (temporary outages) +- `429`: Too Many Requests (rate limiting) +- `502`: Bad Gateway (temporary server issues) + +### Performance Tuning + +```yaml +timeout: '30' # Timeout per link in seconds +max-redirects: '5' # Maximum redirects to follow +``` + +## Integration Examples + +### Replacing Lychee + +**Before (using lychee):** +```yaml +- name: Link Checker + uses: lycheeverse/lychee-action@v2 + with: + fail: false + args: --accept 403,503 *.html +``` + +**After (using AI-powered link checker):** +```yaml +- name: AI-Powered Link Checker + uses: QuantEcon/meta/.github/actions/link-checker@main + with: + html-path: '.' 
+ fail-on-broken: 'false' + silent-codes: '403,503' + ai-suggestions: 'true' + create-issue: 'true' +``` + +### MyST Markdown Projects + +For Jupyter Book projects: + +```yaml +- name: Build Jupyter Book + run: jupyter-book build lectures/ + +- name: Check links in built documentation + uses: QuantEcon/meta/.github/actions/link-checker@main + with: + html-path: './lectures/_build/html' + mode: 'full' + ai-suggestions: 'true' +``` + +## Outputs + +Use action outputs in subsequent workflow steps: + +```yaml +- name: Check links + id: link-check + uses: QuantEcon/meta/.github/actions/link-checker@main + with: + fail-on-broken: 'false' + +- name: Report results + run: | + echo "Broken links: ${{ steps.link-check.outputs.broken-link-count }}" + echo "Redirects: ${{ steps.link-check.outputs.redirect-count }}" + echo "AI suggestions available: ${{ steps.link-check.outputs.ai-suggestions != '' }}" +``` + +## Permissions + +Required workflow permissions depend on features used: + +```yaml +permissions: + contents: read # Always required + issues: write # For create-issue: 'true' + pull-requests: write # For PR comments + actions: read # For create-artifact: 'true' +``` + +## Inputs + +| Input | Description | Required | Default | +|-------|-------------|----------|---------| +| `html-path` | Path to HTML files directory | No | `./_build/html` | +| `mode` | Scan mode: `full` or `changed` | No | `full` | +| `silent-codes` | HTTP codes to silently report | No | `403,503` | +| `fail-on-broken` | Fail workflow on broken links | No | `true` | +| `ai-suggestions` | Enable AI-powered suggestions | No | `true` | +| `create-issue` | Create GitHub issue for broken links | No | `false` | +| `issue-title` | Title for created issues | No | `Broken Links Found in Documentation` | +| `create-artifact` | Create workflow artifact | No | `false` | +| `artifact-name` | Name for workflow artifact | No | `link-check-report` | +| `notify` | Users to assign to created issue | No | `` | +| `timeout` 
| Timeout per link (seconds) | No | `45` | +| `max-redirects` | Maximum redirects to follow | No | `5` | + +## Outputs + +| Output | Description | +|--------|-------------| +| `broken-links-found` | Whether broken links were found | +| `broken-link-count` | Number of broken links | +| `redirect-count` | Number of redirects found | +| `link-details` | Detailed broken link information | +| `ai-suggestions` | AI-powered improvement suggestions | +| `issue-url` | URL of created GitHub issue | +| `artifact-path` | Path to created artifact file | + +## Best Practices + +1. **Weekly Scans**: Use scheduled workflows for comprehensive link checking +2. **PR Validation**: Use changed-file mode for efficient PR validation +3. **Status Code Configuration**: Adjust silent codes based on your links' typical behavior +4. **AI Suggestions**: Review and apply AI suggestions to improve link quality +5. **Issue Management**: Use automatic issue creation for tracking broken links +6. **Performance**: Set appropriate timeouts based on your link destinations + +## Troubleshooting + +### Common Issues + +1. **Timeout Errors**: Increase `timeout` value for slow-responding sites (default is now 45s) +2. **False Positives**: The action automatically detects major sites that block bots (Netflix, Amazon, etc.) +3. **Rate Limiting**: Add `429` to `silent-codes` for rate-limited sites +4. **Bot Blocking**: Legitimate sites blocking automated requests are automatically handled gracefully +5. **Large Repositories**: Use `changed` mode for PR workflows + +### False Positive Mitigation + +If legitimate links are being flagged as broken: + +1. **Check if it's a major site**: Netflix, Amazon, Facebook, etc. are automatically detected as likely bot-blocked +2. **Increase timeout**: Use `timeout: '60'` for slower sites like tutorials or educational content +3. **Add to silent codes**: If a site consistently returns specific error codes, add them to `silent-codes` +4. 
**Review AI suggestions**: The action provides constructive fix suggestions rather than suggesting removal + +### Debug Output + +The action provides detailed logging including: +- Number of files scanned +- Links found per file +- Status codes and errors +- AI suggestion reasoning + +## Migration from Lychee + +This action can directly replace `lychee` workflows with enhanced functionality: + +1. Replace `lycheeverse/lychee-action` with this action +2. Update input parameters (see comparison above) +3. Add AI suggestions and issue creation features +4. Configure silent status codes as needed + +The enhanced AI capabilities provide value beyond basic link checking by suggesting improvements and maintaining link quality over time. \ No newline at end of file diff --git a/.github/actions/link-checker/__pycache__/link_checker.cpython-312.pyc b/.github/actions/link-checker/__pycache__/link_checker.cpython-312.pyc new file mode 100644 index 0000000..e6e46cb Binary files /dev/null and b/.github/actions/link-checker/__pycache__/link_checker.cpython-312.pyc differ diff --git a/.github/actions/link-checker/action.yml b/.github/actions/link-checker/action.yml new file mode 100644 index 0000000..425a348 --- /dev/null +++ b/.github/actions/link-checker/action.yml @@ -0,0 +1,448 @@ +name: 'AI-Powered Link Checker' +description: 'Check and validate web links in HTML files with AI-powered suggestions for improvements' +author: 'QuantEcon' + +inputs: + html-path: + description: 'Path to directory containing HTML files to scan' + required: false + default: './_build/html' + mode: + description: 'Scanning mode: "full" for all files, "changed" for PR-changed files only' + required: false + default: 'full' + silent-codes: + description: 'HTTP status codes to silently report without failing (comma-separated)' + required: false + default: '403,503' + fail-on-broken: + description: 'Whether to fail the workflow if broken links are found' + required: false + default: 'true' + 
ai-suggestions: + description: 'Whether to enable AI-powered link improvement suggestions' + required: false + default: 'true' + create-issue: + description: 'Whether to create a GitHub issue when broken links are found' + required: false + default: 'false' + issue-title: + description: 'Title for the GitHub issue when broken links are found' + required: false + default: 'Broken Links Found in Documentation' + create-artifact: + description: 'Whether to create a workflow artifact with the link report' + required: false + default: 'false' + artifact-name: + description: 'Name for the workflow artifact containing the link report' + required: false + default: 'link-check-report' + notify: + description: 'GitHub username(s) to assign to the created issue (comma-separated for multiple users)' + required: false + default: '' + timeout: + description: 'Timeout in seconds for each link check (increased default for better robustness)' + required: false + default: '45' + max-redirects: + description: 'Maximum number of redirects to follow' + required: false + default: '5' + +outputs: + broken-links-found: + description: 'Whether broken links were found (true/false)' + value: ${{ steps.check.outputs.broken-links-found }} + broken-link-count: + description: 'Number of broken links found' + value: ${{ steps.check.outputs.broken-link-count }} + redirect-count: + description: 'Number of redirects found' + value: ${{ steps.check.outputs.redirect-count }} + link-details: + description: 'Details of broken links and suggestions' + value: ${{ steps.check.outputs.link-details }} + ai-suggestions: + description: 'AI-powered suggestions for link improvements' + value: ${{ steps.check.outputs.ai-suggestions }} + issue-url: + description: 'URL of the created GitHub issue (if create-issue is enabled)' + value: ${{ steps.create-issue.outputs.issue-url }} + artifact-path: + description: 'Path to the created artifact file (if create-artifact is enabled)' + value: ${{ 
steps.create-artifact.outputs.artifact-path }}
+
+runs:
+  using: 'composite'
+  steps:
+    - name: Install dependencies
+      shell: bash
+      run: |
+        python3 -m pip install requests beautifulsoup4 --quiet
+
+    - name: Check links and generate AI suggestions
+      id: check
+      shell: bash
+      run: |
+        # Get the action directory
+        ACTION_DIR="${{ github.action_path }}"
+
+        # Parse inputs
+        HTML_PATH="${{ inputs.html-path }}"
+        MODE="${{ inputs.mode }}"
+        SILENT_CODES="${{ inputs.silent-codes }}"
+        FAIL_ON_BROKEN="${{ inputs.fail-on-broken }}"
+        AI_SUGGESTIONS="${{ inputs.ai-suggestions }}"
+        TIMEOUT="${{ inputs.timeout }}"
+        MAX_REDIRECTS="${{ inputs.max-redirects }}"
+
+        echo "Scanning HTML files in: $HTML_PATH"
+        echo "Mode: $MODE"
+        echo "Silent codes: $SILENT_CODES"
+        echo "AI suggestions enabled: $AI_SUGGESTIONS"
+
+        # Initialize counters
+        TOTAL_BROKEN=0
+        TOTAL_REDIRECTS=0
+        BROKEN_LINKS_FOUND="false"
+        LINK_DETAILS=""
+        AI_SUGGESTIONS_OUTPUT=""
+        DETAILED_REPORT=""
+
+        # Check if HTML path exists
+        if [ ! -e "$HTML_PATH" ]; then
+          echo "Error: HTML path '$HTML_PATH' does not exist"
+          exit 1
+        fi
+
+        # Determine files to check based on mode
+        if [ "$MODE" = "changed" ] && [ "${{ github.event_name }}" = "pull_request" ]; then
+          echo "PR mode: checking only changed files"
+          # Get changed HTML files in the target directory; git diff paths are
+          # repo-relative, so strip any leading "./" from the configured path
+          PATH_PREFIX="${HTML_PATH#./}/"
+          [ "$PATH_PREFIX" = "./" ] && PATH_PREFIX=""
+          FILES_CHANGED=$(git diff --name-only origin/${{ github.base_ref }}...HEAD | grep -E "\.html$" | grep "^$PATH_PREFIX" || true)
+          if [ -z "$FILES_CHANGED" ]; then
+            echo "No HTML files changed in PR, checking all files in HTML path"
+            mapfile -d '' FILES < <(find "$HTML_PATH" -name "*.html" -type f -print0)
+          else
+            mapfile -t FILES <<< "$FILES_CHANGED"
+          fi
+        else
+          echo "Full mode: checking all HTML files"
+          mapfile -d '' FILES < <(find "$HTML_PATH" -name "*.html" -type f -print0)
+        fi
+
+        echo "Found ${#FILES[@]} HTML files to check"
+
+        # Process each HTML file
+        for file in "${FILES[@]}"; do
+          if [ ! -f "$file" ]; then
+            continue
+          fi
+
+          echo "Checking links in: $file"
+
+          # Build AI suggestions flag
+          AI_FLAG=""
+          if [ "$AI_SUGGESTIONS" = "true" ]; then
+            AI_FLAG="--ai-suggestions"
+          fi
+
+          # Run link checker and capture JSON output ("shell: bash" runs with
+          # -e, so test the exit status inline rather than via $? afterwards)
+          if ! result_json=$(python3 "$ACTION_DIR/link_checker.py" "$file" \
+              --timeout "$TIMEOUT" \
+              --max-redirects "$MAX_REDIRECTS" \
+              --silent-codes "$SILENT_CODES" \
+              $AI_FLAG 2>/tmp/stderr.log) || [ -z "$result_json" ]; then
+            echo "Warning: Failed to process $file"
+            cat /tmp/stderr.log >&2
+            continue
+          fi
+
+          # Parse results and update counters
+          broken_count=$(echo "$result_json" | python3 -c "import json, sys; data=json.load(sys.stdin); print(len(data['broken_results']))")
+          redirect_count=$(echo "$result_json" | python3 -c "import json, sys; data=json.load(sys.stdin); print(len(data['redirect_results']))")
+          total_links=$(echo "$result_json" | python3 -c "import json, sys; data=json.load(sys.stdin); print(data['total_links'])")
+
+          TOTAL_BROKEN=$((TOTAL_BROKEN + broken_count))
+          TOTAL_REDIRECTS=$((TOTAL_REDIRECTS + redirect_count))
+
+          if [ "$broken_count" -gt 0 ] || [ "$redirect_count" -gt 0 ]; then
+            BROKEN_LINKS_FOUND="true"
+
+            # Extract detailed results for reporting
+            if [ "$broken_count" -gt 0 ]; then
+              broken_details=$(echo "$result_json" | python3 "$ACTION_DIR/format_results.py" broken)
+              LINK_DETAILS="$LINK_DETAILS\n\n**$file** - $broken_count broken link(s):\n$broken_details"
+            fi
+
+            if [ "$redirect_count" -gt 0 ]; then
+              redirect_details=$(echo "$result_json" | python3 "$ACTION_DIR/format_results.py" redirect)
+              LINK_DETAILS="$LINK_DETAILS\n\n**$file** - $redirect_count redirect(s):\n$redirect_details"
+            fi
+
+            # Extract AI suggestions
+            if [ "$AI_SUGGESTIONS" = "true" ]; then
+              ai_details=$(echo "$result_json" | python3 "$ACTION_DIR/format_results.py" ai)
+              if [ -n "$ai_details" ]; then
+                AI_SUGGESTIONS_OUTPUT="$AI_SUGGESTIONS_OUTPUT\n\n**$file** - AI Suggestions:\n$ai_details"
+              fi
+            fi
+          fi
+
+          echo "  Found $total_links total links, $broken_count broken, $redirect_count redirected"
+        done
+
+        # Set outputs (multiline values use the <<EOF heredoc syntax for $GITHUB_OUTPUT)
+        echo "broken-links-found=$BROKEN_LINKS_FOUND" >> $GITHUB_OUTPUT
+        echo "broken-link-count=$TOTAL_BROKEN" >> $GITHUB_OUTPUT
+        echo "redirect-count=$TOTAL_REDIRECTS" >> $GITHUB_OUTPUT
+        echo "link-details<<EOF" >> $GITHUB_OUTPUT
+        echo -e "$LINK_DETAILS" >> $GITHUB_OUTPUT
+        echo "EOF" >> $GITHUB_OUTPUT
+        echo "ai-suggestions<<EOF" >> $GITHUB_OUTPUT
+        echo -e "$AI_SUGGESTIONS_OUTPUT" >> $GITHUB_OUTPUT
+        echo "EOF" >> $GITHUB_OUTPUT
+
+        # Create detailed report for artifacts/issues
+        DETAILED_REPORT="## Link Check Summary\n\n"
+        DETAILED_REPORT="$DETAILED_REPORT- **Total broken links**: $TOTAL_BROKEN\n"
+        DETAILED_REPORT="$DETAILED_REPORT- **Total redirects found**: $TOTAL_REDIRECTS\n\n"
+
+        if [ "$TOTAL_BROKEN" -gt 0 ]; then
+          DETAILED_REPORT="$DETAILED_REPORT## Broken Links\n$LINK_DETAILS\n\n"
+        fi
+
+        if [ "$AI_SUGGESTIONS" = "true" ] && [ -n "$AI_SUGGESTIONS_OUTPUT" ]; then
+          DETAILED_REPORT="$DETAILED_REPORT## AI-Powered Suggestions\n$AI_SUGGESTIONS_OUTPUT\n\n"
+        fi
+
+        echo "detailed-report<<EOF" >> $GITHUB_OUTPUT
+        echo -e "$DETAILED_REPORT" >> $GITHUB_OUTPUT
+        echo "EOF" >> $GITHUB_OUTPUT
+
+        # Summary
+        if [ "$BROKEN_LINKS_FOUND" = "true" ]; then
+          echo "❌ Found $TOTAL_BROKEN broken link(s) and $TOTAL_REDIRECTS redirect(s)"
+          if [ "$FAIL_ON_BROKEN" = "true" ]; then
+            echo "::error::Found $TOTAL_BROKEN broken link(s) in HTML files"
+          fi
+        else
+          echo "✅ No broken links found"
+        fi
+
+    - name: Post PR comment with link report
+      if: inputs.fail-on-broken == 'true' && steps.check.outputs.broken-links-found == 'true' && github.event_name == 'pull_request'
+      uses: actions/github-script@v7
+      with:
+        script: |
+          const brokenCount = '${{ steps.check.outputs.broken-link-count }}';
+          const redirectCount = '${{ steps.check.outputs.redirect-count }}';
+          const detailedReport = ${{ toJSON(steps.check.outputs.detailed-report) }};
+
+          const body = [
+            '## 🔗 Link Check Results',
+
'', + '🚨 **' + brokenCount + ' broken link(s)** and **' + redirectCount + ' redirect(s)** were found.', + '', + '**Build Details:**', + '- **Workflow Run:** [${{ github.run_id }}](${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }})', + '- **Commit:** ${{ github.sha }}', + '- **Date:** ' + new Date().toISOString(), + '', + '---', + '', + detailedReport, + '', + '---', + '', + '**Next Steps:**', + '1. Review the broken links listed above', + '2. Update or remove broken links', + '3. Consider applying AI suggestions for better alternatives', + '4. Push the changes to update this PR', + '', + '📝 *This comment was automatically generated by the [AI-Powered Link Checker Action](https://github.com/QuantEcon/meta/.github/actions/link-checker).*' + ].join('\n'); + + try { + await github.rest.issues.createComment({ + owner: context.repo.owner, + repo: context.repo.repo, + issue_number: context.issue.number, + body: body + }); + console.log('Posted PR comment with link check results'); + } catch (error) { + console.error('Failed to create PR comment:', error); + core.setFailed('Failed to create PR comment: ' + error.message); + } + + - name: Fail workflow on broken links + if: inputs.fail-on-broken == 'true' && steps.check.outputs.broken-links-found == 'true' + shell: bash + run: | + echo "Failing workflow due to broken links found" + exit 1 + + - name: Create artifact with link report + id: create-artifact + if: inputs.create-artifact == 'true' && steps.check.outputs.broken-links-found == 'true' + shell: bash + run: | + ARTIFACT_NAME="${{ inputs.artifact-name }}" + ARTIFACT_FILE="$ARTIFACT_NAME.md" + CURRENT_DATE=$(date -u '+%Y-%m-%d %H:%M:%S UTC') + + { + echo "# Link Check Report" + echo "" + echo "**Date:** $CURRENT_DATE" + echo "**Repository:** ${{ github.repository }}" + echo "**Workflow:** ${{ github.workflow }}" + echo "**Run ID:** ${{ github.run_id }}" + echo "**Broken Links Found:** ${{ steps.check.outputs.broken-link-count }}" + 
echo "**Redirects Found:** ${{ steps.check.outputs.redirect-count }}" + echo "" + echo "---" + echo "" + echo "${{ steps.check.outputs.detailed-report }}" + echo "" + echo "---" + echo "" + echo "Generated by [AI-Powered Link Checker Action](https://github.com/QuantEcon/meta/.github/actions/link-checker)" + } > "$ARTIFACT_FILE" + + echo "artifact-path=$ARTIFACT_FILE" >> $GITHUB_OUTPUT + echo "Created link check report artifact: $ARTIFACT_FILE" + + - name: Upload link report artifact + if: inputs.create-artifact == 'true' && steps.check.outputs.broken-links-found == 'true' + uses: actions/upload-artifact@v4 + with: + name: ${{ inputs.artifact-name }} + path: ${{ steps.create-artifact.outputs.artifact-path }} + retention-days: 30 + + - name: Create GitHub issue + id: create-issue + if: inputs.create-issue == 'true' && steps.check.outputs.broken-links-found == 'true' + uses: actions/github-script@v7 + with: + script: | + const brokenCount = '${{ steps.check.outputs.broken-link-count }}'; + const redirectCount = '${{ steps.check.outputs.redirect-count }}'; + const detailedReport = ${{ toJSON(steps.check.outputs.detailed-report) }}; + const title = '${{ inputs.issue-title }}'; + const notify = '${{ inputs.notify }}'; + + const body = [ + '# Link Check Report', + '', + '🚨 **' + brokenCount + ' broken link(s)** and **' + redirectCount + ' redirect(s)** were found in the documentation.', + '', + '**Details:**', + '- **Repository:** ${{ github.repository }}', + '- **Workflow:** ${{ github.workflow }}', + '- **Run ID:** [${{ github.run_id }}](${{ github.server_url }}/${{ github.repository }}/actions/runs/${{ github.run_id }})', + '- **Commit:** ${{ github.sha }}', + '- **Branch:** ${{ github.ref_name }}', + '- **Date:** ' + new Date().toISOString(), + '', + '---', + '', + detailedReport, + '', + '---', + '', + '**Next Steps:**', + '1. Review the broken links listed above', + '2. Update or remove broken links from the source files', + '3. 
Consider applying AI suggestions for better alternatives', + '4. Re-run the link check to verify fixes', + '', + '**Note:** This issue was automatically created by the [AI-Powered Link Checker Action](https://github.com/QuantEcon/meta/.github/actions/link-checker).', + '', + 'Please close this issue once all broken links have been addressed.' + ].join('\n'); + + try { + const response = await github.rest.issues.create({ + owner: context.repo.owner, + repo: context.repo.repo, + title: title, + body: body, + labels: ['bug', 'documentation', 'broken-links'] + }); + + const issueUrl = response.data.html_url; + const issueNumber = response.data.number; + console.log('Created issue: ' + issueUrl); + core.setOutput('issue-url', issueUrl); + + // Assign users to the issue if notify parameter is provided + if (notify && notify.trim()) { + try { + const assignees = notify.split(',') + .map(username => username.trim()) + .filter(username => username.length > 0); + + if (assignees.length > 0) { + console.log('Assigning issue to users: ' + assignees.join(', ')); + + await github.rest.issues.addAssignees({ + owner: context.repo.owner, + repo: context.repo.repo, + issue_number: issueNumber, + assignees: assignees + }); + + console.log('Successfully assigned issue to: ' + assignees.join(', ')); + } + } catch (assignError) { + console.error('Failed to assign users to issue:', assignError); + console.log('Issue was created successfully, but assignment failed.'); + } + } + + return issueUrl; + } catch (error) { + console.error('Failed to create issue:', error); + core.setFailed('Failed to create issue: ' + error.message); + } + + - name: Post simple PR comment linking to issue + if: inputs.create-issue == 'true' && steps.check.outputs.broken-links-found == 'true' && github.event_name == 'pull_request' + uses: actions/github-script@v7 + with: + script: | + const issueUrl = '${{ steps.create-issue.outputs.issue-url }}'; + + const body = [ + '🔗 Link check found broken links in this 
PR.', + '', + `For detailed analysis and AI-powered suggestions, please check ${issueUrl}`, + '', + 'Note: This issue was automatically created by the [AI-Powered Link Checker Action](https://github.com/QuantEcon/meta/.github/actions/link-checker).' + ].join('\n'); + + try { + await github.rest.issues.createComment({ + owner: context.repo.owner, + repo: context.repo.repo, + issue_number: context.issue.number, + body: body + }); + console.log('Posted simple PR comment linking to issue'); + } catch (error) { + console.error('Failed to create PR comment:', error); + core.setFailed('Failed to create PR comment: ' + error.message); + } + +branding: + icon: 'link' + color: 'blue' \ No newline at end of file diff --git a/.github/actions/link-checker/examples.md b/.github/actions/link-checker/examples.md new file mode 100644 index 0000000..7a56d33 --- /dev/null +++ b/.github/actions/link-checker/examples.md @@ -0,0 +1,330 @@ +# Usage Examples for AI-Powered Link Checker + +This document provides practical examples for different use cases of the AI-Powered Link Checker action. + +## Example 1: Weekly Scheduled Link Check + +Replace the existing lychee-based link checker with AI-powered functionality: + +```yaml +name: Weekly Link Check +on: + schedule: + # Run every Monday at 9 AM UTC (early morning in Australia) + - cron: '0 9 * * 1' + workflow_dispatch: + +permissions: + contents: read + issues: write + +jobs: + link-check: + name: AI-Powered Link Checking + runs-on: ubuntu-latest + steps: + # Checkout the published site (HTML) + - name: Checkout + uses: actions/checkout@v4 + with: + ref: gh-pages + + - name: AI-Powered Link Check + uses: QuantEcon/meta/.github/actions/link-checker@main + with: + html-path: '.' 
+ mode: 'full' + fail-on-broken: 'false' # Don't fail on schedule, just report + ai-suggestions: 'true' + silent-codes: '403,503,429' + create-issue: 'true' + issue-title: 'Weekly Link Check Report' + notify: 'maintainer1,maintainer2' + create-artifact: 'true' + artifact-name: 'weekly-link-report' +``` + +## Example 2: Pull Request Link Validation + +Check links in documentation changes during PR review: + +```yaml +name: PR Documentation Check +on: + pull_request: + branches: [ main ] + paths: + - 'lectures/**' + - '_build/**' + - '**.md' + +permissions: + contents: read + pull-requests: write + +jobs: + docs-and-links: + runs-on: ubuntu-latest + steps: + - name: Checkout + uses: actions/checkout@v4 + + - name: Set up Python + uses: actions/setup-python@v4 + with: + python-version: '3.11' + + - name: Install dependencies + run: | + pip install jupyter-book myst-parser + + - name: Build Jupyter Book + run: | + jupyter-book build lectures/ + + - name: Check links in changed files + uses: QuantEcon/meta/.github/actions/link-checker@main + with: + html-path: './lectures/_build/html' + mode: 'changed' # Only check files changed in this PR + fail-on-broken: 'true' # Fail PR if broken links + ai-suggestions: 'true' + silent-codes: '403,503' + timeout: '20' +``` + +## Example 3: Comprehensive Documentation Build + +Full documentation build with link checking and AI suggestions: + +```yaml +name: Build and Validate Documentation +on: + push: + branches: [ main ] + pull_request: + branches: [ main ] + +permissions: + contents: read + issues: write + pull-requests: write + actions: read + +jobs: + build-and-check: + runs-on: ubuntu-latest + steps: + - name: Checkout + uses: actions/checkout@v4 + + - name: Set up Python + uses: actions/setup-python@v4 + with: + python-version: '3.11' + + - name: Install dependencies + run: | + pip install -r requirements.txt + + - name: Build documentation + run: | + jupyter-book build . 
+ + - name: AI-Powered Link Check + uses: QuantEcon/meta/.github/actions/link-checker@main + with: + html-path: './_build/html' + mode: ${{ github.event_name == 'pull_request' && 'changed' || 'full' }} + fail-on-broken: ${{ github.event_name == 'push' }} + ai-suggestions: 'true' + create-issue: ${{ github.event_name == 'push' }} + create-artifact: 'true' + silent-codes: '403,503,429,502' + issue-title: 'Documentation Link Issues - ${{ github.ref_name }}' + notify: 'docs-team,maintainers' + artifact-name: 'link-check-report-${{ github.run_number }}' +``` + +## Example 4: Multi-Project Link Checking + +Check links across multiple related documentation projects: + +```yaml +name: Cross-Project Link Check +on: + schedule: + - cron: '0 2 * * 0' # Sunday at 2 AM UTC + workflow_dispatch: + +jobs: + check-projects: + strategy: + matrix: + project: + - { name: 'python-programming', ref: 'gh-pages' } + - { name: 'datascience', ref: 'gh-pages' } + - { name: 'game-theory', ref: 'gh-pages' } + runs-on: ubuntu-latest + steps: + - name: Checkout ${{ matrix.project.name }} + uses: actions/checkout@v4 + with: + repository: 'QuantEcon/${{ matrix.project.name }}.myst' + ref: ${{ matrix.project.ref }} + + - name: Link Check - ${{ matrix.project.name }} + uses: QuantEcon/meta/.github/actions/link-checker@main + with: + html-path: '.' 
+ fail-on-broken: 'false' + ai-suggestions: 'true' + create-issue: 'true' + issue-title: 'Link Check Report - ${{ matrix.project.name }}' + notify: 'quantecon-team' +``` + +## Example 5: Advanced Configuration with Custom Timeouts + +For projects with many external links or slow-responding sites: + +```yaml +- name: Patient Link Checker + uses: QuantEcon/meta/.github/actions/link-checker@main + with: + html-path: './_build/html' + timeout: '60' # 60 seconds per link + max-redirects: '10' # Follow up to 10 redirects + silent-codes: '403,503,429,502,520,521,522,523,524' + fail-on-broken: 'false' + ai-suggestions: 'true' + create-issue: 'true' + issue-title: 'Comprehensive Link Analysis' +``` + +## Example 6: Development Mode with Artifacts + +For debugging and development of documentation: + +```yaml +- name: Development Link Check + uses: QuantEcon/meta/.github/actions/link-checker@main + with: + html-path: './_build/html' + fail-on-broken: 'false' # Don't fail during development + ai-suggestions: 'true' + create-artifact: 'true' # Always create artifacts for review + artifact-name: 'dev-link-report' + timeout: '15' +``` + +## Example 7: Integration with Existing Warning Check + +Combine with the existing warning checker for comprehensive quality control: + +```yaml +name: Documentation Quality Check +on: + pull_request: + branches: [ main ] + +jobs: + quality-check: + runs-on: ubuntu-latest + permissions: + contents: read + pull-requests: write + steps: + - name: Checkout + uses: actions/checkout@v4 + + - name: Build documentation + run: jupyter-book build . 
+ + - name: Check for Python warnings + uses: QuantEcon/meta/.github/actions/check-warnings@main + with: + html-path: './_build/html' + fail-on-warning: 'true' + + - name: Check for broken links + uses: QuantEcon/meta/.github/actions/link-checker@main + with: + html-path: './_build/html' + mode: 'changed' + fail-on-broken: 'true' + ai-suggestions: 'true' +``` + +## Example 8: Silent Monitoring + +For continuous monitoring without disrupting development: + +```yaml +name: Silent Link Monitoring +on: + schedule: + - cron: '0 12 * * *' # Daily at noon + +jobs: + monitor: + runs-on: ubuntu-latest + steps: + - name: Checkout + uses: actions/checkout@v4 + with: + ref: gh-pages + + - name: Silent Link Check + uses: QuantEcon/meta/.github/actions/link-checker@main + with: + html-path: '.' + fail-on-broken: 'false' # Never fail + ai-suggestions: 'true' + create-artifact: 'true' # Just create reports + artifact-name: 'daily-link-monitoring' + silent-codes: '403,503,429,502,520,521,522,523,524' +``` + +## Migration Guide from Lychee + +### Before (using lychee): +```yaml +- name: Link Checker + id: lychee + uses: lycheeverse/lychee-action@v2 + with: + fail: false + args: --accept 403,503 *.html + +- name: Create Issue From File + if: steps.lychee.outputs.exit_code != 0 + uses: peter-evans/create-issue-from-file@v5 + with: + title: Link Checker Report + content-filepath: ./lychee/out.md + labels: report, automated issue, linkchecker +``` + +### After (using AI-powered link checker): +```yaml +- name: AI-Powered Link Checker + uses: QuantEcon/meta/.github/actions/link-checker@main + with: + html-path: '.' + fail-on-broken: 'false' + silent-codes: '403,503' + ai-suggestions: 'true' + create-issue: 'true' + issue-title: 'AI-Enhanced Link Check Report' + notify: 'maintainer-team' +``` + +## Benefits Over Lychee + +1. **AI Suggestions**: Automatically suggests fixes for broken links +2. **Redirect Optimization**: Recommends updating redirected links +3. 
**Better Integration**: Native GitHub Actions integration +4. **Flexible Reporting**: Multiple output formats (issues, artifacts, PR comments) +5. **Smart Filtering**: Context-aware link analysis +6. **Performance**: Configurable timeouts and rate limiting \ No newline at end of file diff --git a/.github/actions/link-checker/format_results.py b/.github/actions/link-checker/format_results.py new file mode 100644 index 0000000..8193bf3 --- /dev/null +++ b/.github/actions/link-checker/format_results.py @@ -0,0 +1,61 @@ +#!/usr/bin/env python3 +""" +Result formatter for link check results +""" +import json +import sys + +def format_broken_results(data): + """Format broken link results for display""" + results = [] + for result in data['broken_results']: + error_info = f" ({result['error']})" if result['error'] else "" + results.append(f"❌ {result['url']} - Status: {result['status_code']}{error_info}") + if result['text']: + results.append(f" Link text: {result['text']}") + return '\n'.join(results) + +def format_redirect_results(data): + """Format redirect results for display""" + results = [] + for result in data['redirect_results']: + results.append(f"🔄 {result['url']} -> {result['final_url']} ({result['redirect_count']} redirects)") + return '\n'.join(results) + +def format_ai_suggestions(data): + """Format AI suggestions for display""" + results = [] + for suggestion in data['ai_suggestions']: + results.append(f"🤖 {suggestion['original_url']}") + results.append(f" Issue: {suggestion['issue']}") + for s in suggestion['suggestions']: + results.append(f" 💡 {s['type']}: {s['url']}") + results.append(f" Reason: {s['reason']}") + return '\n'.join(results) + +def main(): + if len(sys.argv) != 2: + print("Usage: python3 format_results.py <mode>", file=sys.stderr) + print("Modes: broken, redirect, ai", file=sys.stderr) + sys.exit(1) + + mode = sys.argv[1] + + try: + data = json.load(sys.stdin) + except json.JSONDecodeError as e: + print(f"Error parsing JSON: {e}",
file=sys.stderr) + sys.exit(1) + + if mode == "broken": + print(format_broken_results(data)) + elif mode == "redirect": + print(format_redirect_results(data)) + elif mode == "ai": + print(format_ai_suggestions(data)) + else: + print(f"Unknown mode: {mode}", file=sys.stderr) + sys.exit(1) + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/.github/actions/link-checker/link_checker.py b/.github/actions/link-checker/link_checker.py new file mode 100644 index 0000000..b8b7b3b --- /dev/null +++ b/.github/actions/link-checker/link_checker.py @@ -0,0 +1,317 @@ +#!/usr/bin/env python3 +""" +AI-Powered Link Checker Script +Checks external links in HTML files and provides AI suggestions for improvements. +""" +import re +import sys +import requests +import urllib.parse +from bs4 import BeautifulSoup +import json +import time +import os +import argparse + +def is_external_link(url): + """Check if URL is external (starts with http/https)""" + return url.startswith(('http://', 'https://')) + +def is_likely_bot_blocked(url, response_content=None, status_code=None, error=None): + """Detect if a site is likely blocking automated requests rather than being truly broken""" + domain_indicators = [ + 'netflix.com', 'amazon.com', 'facebook.com', 'twitter.com', 'instagram.com', + 'youtube.com', 'linkedin.com', 'pinterest.com', 'reddit.com', 'wikipedia.org' + ] + + # Check for well-known legitimate domains that should be treated carefully + legitimate_domains = [ + 'python.org', 'jupyter.org', 'github.com', 'docs.python.org', + 'stackoverflow.com', 'readthedocs.org', 'arxiv.org', 'doi.org', + 'numpy.org', 'scipy.org', 'matplotlib.org', 'pandas.pydata.org' + ] + + # Check if it's a major site that commonly blocks bots + for indicator in domain_indicators: + if indicator in url.lower(): + return True + + # Check if it's a legitimate domain that might be blocked by network restrictions + for domain in legitimate_domains: + if domain in url.lower() and error and 
'Connection Error' in str(error): + return True + + # Check for encoding issues which often indicate bot blocking + if error and 'encoding' in str(error).lower(): + return True + + # Check for specific status codes that often indicate bot blocking rather than broken links + bot_blocking_codes = [429, 451, 503] # Rate limited, unavailable for legal reasons, service unavailable + if status_code in bot_blocking_codes: + return True + + return False + +def check_link(url, timeout, max_redirects, silent_codes): + """Check a single link and return status info""" + # Use a more browser-like user agent to reduce blocking + headers = { + 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36', + 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8', + 'Accept-Language': 'en-US,en;q=0.5', + 'Accept-Encoding': 'gzip, deflate', + 'Connection': 'keep-alive', + 'Upgrade-Insecure-Requests': '1' + } + + try: + # Set up session with redirects tracking + session = requests.Session() + session.max_redirects = max_redirects + + response = session.get( + url, + timeout=timeout, + allow_redirects=True, + headers=headers + ) + + result = { + 'url': url, + 'status_code': response.status_code, + 'final_url': response.url, + 'redirect_count': len(response.history), + 'redirected': len(response.history) > 0, + 'broken': False, + 'silent': False, + 'error': None, + 'likely_bot_blocked': False + } + + # Check if this looks like bot blocking + result['likely_bot_blocked'] = is_likely_bot_blocked(url, response.text, response.status_code) + + # Check if status code should be silently reported + if response.status_code in silent_codes: + result['silent'] = True + elif result['likely_bot_blocked']: + # Don't mark as broken if likely bot blocked, mark as silent instead + result['silent'] = True + elif not response.ok: + result['broken'] = True + + return result + + except 
requests.exceptions.Timeout: + # Check if timeout on a likely legitimate site + likely_blocked = is_likely_bot_blocked(url, error='timeout') + return { + 'url': url, 'status_code': 0, 'final_url': url, + 'redirect_count': 0, 'redirected': False, 'broken': not likely_blocked, + 'silent': likely_blocked, 'error': 'Timeout', 'likely_bot_blocked': likely_blocked + } + except requests.exceptions.ConnectionError as e: + # Check if connection error on a likely legitimate site + likely_blocked = is_likely_bot_blocked(url, error='Connection Error') + return { + 'url': url, 'status_code': 0, 'final_url': url, + 'redirect_count': 0, 'redirected': False, 'broken': not likely_blocked, + 'silent': likely_blocked, 'error': 'Connection Error', 'likely_bot_blocked': likely_blocked + } + except UnicodeDecodeError as e: + # Encoding issues often indicate bot blocking + return { + 'url': url, 'status_code': 0, 'final_url': url, + 'redirect_count': 0, 'redirected': False, 'broken': False, + 'silent': True, 'error': f'Encoding issue: {str(e)}', 'likely_bot_blocked': True + } + except Exception as e: + # Check if the error suggests bot blocking + likely_blocked = is_likely_bot_blocked(url, error=str(e)) + return { + 'url': url, 'status_code': 0, 'final_url': url, + 'redirect_count': 0, 'redirected': False, 'broken': not likely_blocked, + 'silent': likely_blocked, 'error': str(e), 'likely_bot_blocked': likely_blocked + } + +def extract_links_from_html(file_path): + """Extract all external links from HTML file""" + try: + with open(file_path, 'r', encoding='utf-8') as f: + content = f.read() + + soup = BeautifulSoup(content, 'html.parser') + links = [] + + # Find all anchor tags with href + for tag in soup.find_all('a', href=True): + href = tag['href'] + if is_external_link(href): + # Store link with context + links.append({ + 'url': href, + 'text': tag.get_text(strip=True)[:100], # First 100 chars + 'line': None # We could calculate line numbers if needed + }) + + return links + + except 
Exception as e: + print(f"Error parsing {file_path}: {e}", file=sys.stderr) + return [] + +def generate_ai_suggestions(broken_results, redirect_results): + """Generate AI-powered suggestions for broken and redirected links""" + suggestions = [] + + # Simple rule-based AI suggestions (can be enhanced with actual AI services) + for result in broken_results: + url = result['url'] + + # Skip suggestions for likely bot-blocked sites + if result.get('likely_bot_blocked', False): + continue + + suggestion = { + 'original_url': url, + 'issue': f"Broken link (Status: {result['status_code']})", + 'suggestions': [] + } + + # Only suggest fixes, not removals, for legitimate domains + is_legitimate_domain = any(domain in url.lower() for domain in [ + 'github.com', 'python.org', 'jupyter.org', 'readthedocs.org', + 'stackoverflow.com', 'wikipedia.org', 'arxiv.org', 'doi.org' + ]) + + # Common URL fixes + if 'github.com' in url: + # GitHub-specific suggestions + if '/blob/master/' in url: + new_url = url.replace('/blob/master/', '/blob/main/') + suggestion['suggestions'].append({ + 'type': 'branch_update', + 'url': new_url, + 'reason': 'GitHub default branch changed from master to main' + }) + if 'github.io' in url and 'http://' in url: + new_url = url.replace('http://', 'https://') + suggestion['suggestions'].append({ + 'type': 'https_upgrade', + 'url': new_url, + 'reason': 'GitHub Pages now requires HTTPS' + }) + + # Documentation site migrations + elif 'readthedocs.org' in url and 'http://' in url: + new_url = url.replace('http://', 'https://') + suggestion['suggestions'].append({ + 'type': 'https_upgrade', + 'url': new_url, + 'reason': 'Read the Docs now requires HTTPS' + }) + + # Python.org domain changes + elif 'docs.python.org' in url: + if '/2.7/' in url: + new_url = url.replace('/2.7/', '/3/') + suggestion['suggestions'].append({ + 'type': 'version_update', + 'url': new_url, + 'reason': 'Python 2.7 is deprecated, consider Python 3 documentation' + }) + + # General HTTPS 
upgrade (but be cautious with legitimate domains) + elif url.startswith('http://') and 'localhost' not in url: + new_url = url.replace('http://', 'https://') + if is_legitimate_domain: + suggestion['suggestions'].append({ + 'type': 'https_upgrade', + 'url': new_url, + 'reason': 'HTTPS is more secure and widely supported' + }) + else: + # For unknown domains, suggest checking manually + suggestion['suggestions'].append({ + 'type': 'manual_check', + 'url': new_url, + 'reason': 'Try HTTPS version or verify the link manually' + }) + + # Only add suggestions if we have constructive fixes + if suggestion['suggestions']: + suggestions.append(suggestion) + + # Handle redirects + for result in redirect_results: + if result['redirect_count'] > 0: + suggestion = { + 'original_url': result['url'], + 'issue': f"Redirected {result['redirect_count']} times", + 'suggestions': [{ + 'type': 'redirect_update', + 'url': result['final_url'], + 'reason': f'Update to final destination to avoid {result["redirect_count"]} redirect(s)' + }] + } + suggestions.append(suggestion) + + return suggestions + +def main(): + parser = argparse.ArgumentParser(description='Check links in HTML files') + parser.add_argument('file_path', help='Path to HTML file') + parser.add_argument('--timeout', type=int, default=45, help='Timeout in seconds (increased default for robustness)') + parser.add_argument('--max-redirects', type=int, default=5, help='Maximum redirects') + parser.add_argument('--silent-codes', default='403,503', help='Silent status codes') + parser.add_argument('--ai-suggestions', action='store_true', help='Enable AI suggestions') + + args = parser.parse_args() + + silent_codes = [int(x.strip()) for x in args.silent_codes.split(',') if x.strip()] + + # Extract links + links = extract_links_from_html(args.file_path) + if not links: + print(json.dumps({ + 'broken_results': [], 'redirect_results': [], + 'ai_suggestions': [], 'total_links': 0 + })) + return + + broken_results = [] + 
redirect_results = [] + + print(f"Checking {len(links)} links in {args.file_path} (timeout: {args.timeout}s)...", file=sys.stderr) + + # Check each link + for i, link_info in enumerate(links): + url = link_info['url'] + result = check_link(url, args.timeout, args.max_redirects, silent_codes) + result['file'] = args.file_path + result['text'] = link_info['text'] + + if result['broken'] and not result['silent']: + broken_results.append(result) + elif result['redirected']: + redirect_results.append(result) + + # Add small delay to be respectful to servers + if i < len(links) - 1: + time.sleep(0.2) # Slightly increased delay to be more respectful + + # Generate AI suggestions + ai_suggestions = [] + if args.ai_suggestions: + ai_suggestions = generate_ai_suggestions(broken_results, redirect_results) + + # Output results + print(json.dumps({ + 'broken_results': broken_results, + 'redirect_results': redirect_results, + 'ai_suggestions': ai_suggestions, + 'total_links': len(links) + })) + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/.github/workflows/test-link-checker.yml b/.github/workflows/test-link-checker.yml new file mode 100644 index 0000000..9ff4041 --- /dev/null +++ b/.github/workflows/test-link-checker.yml @@ -0,0 +1,169 @@ +name: Test Link Checker Action + +on: + push: + branches: [ main ] + pull_request: + branches: [ main ] + workflow_dispatch: + +jobs: + test-good-links: + runs-on: ubuntu-latest + name: Test with good links only + steps: + - name: Checkout + uses: actions/checkout@v4 + + - name: Test action with good links + id: good-test + uses: ./.github/actions/link-checker + with: + html-path: './test/link-checker/good-links.html' + fail-on-broken: 'false' + ai-suggestions: 'true' + timeout: '10' + + - name: Verify good results + run: | + echo "Broken links found: ${{ steps.good-test.outputs.broken-links-found }}" + echo "Broken link count: ${{ steps.good-test.outputs.broken-link-count }}" + if [ "${{
steps.good-test.outputs.broken-links-found }}" != "false" ]; then + echo "❌ Expected no broken links but found some" + exit 1 + fi + echo "✅ Good links test passed" + + test-broken-links: + runs-on: ubuntu-latest + name: Test with broken links + steps: + - name: Checkout + uses: actions/checkout@v4 + + - name: Test action with broken links + id: broken-test + uses: ./.github/actions/link-checker + with: + html-path: './test/link-checker/broken-links.html' + fail-on-broken: 'false' + ai-suggestions: 'true' + silent-codes: '403,503' + timeout: '10' + + - name: Verify broken results + run: | + echo "Broken links found: ${{ steps.broken-test.outputs.broken-links-found }}" + echo "Broken link count: ${{ steps.broken-test.outputs.broken-link-count }}" + echo "Redirect count: ${{ steps.broken-test.outputs.redirect-count }}" + echo "AI suggestions: ${{ steps.broken-test.outputs.ai-suggestions }}" + + if [ "${{ steps.broken-test.outputs.broken-links-found }}" != "true" ]; then + echo "❌ Expected broken links but found none" + exit 1 + fi + + if [ "${{ steps.broken-test.outputs.broken-link-count }}" -lt "2" ]; then + echo "❌ Expected at least 2 broken links but found ${{ steps.broken-test.outputs.broken-link-count }}" + exit 1 + fi + + echo "✅ Broken links test passed" + + test-redirect-links: + runs-on: ubuntu-latest + name: Test with redirect links and AI suggestions + steps: + - name: Checkout + uses: actions/checkout@v4 + + - name: Test action with redirect links + id: redirect-test + uses: ./.github/actions/link-checker + with: + html-path: './test/link-checker/redirect-links.html' + fail-on-broken: 'false' + ai-suggestions: 'true' + timeout: '10' + + - name: Verify redirect results + run: | + echo "Broken links found: ${{ steps.redirect-test.outputs.broken-links-found }}" + echo "Redirect count: ${{ steps.redirect-test.outputs.redirect-count }}" + echo "AI suggestions: ${{ steps.redirect-test.outputs.ai-suggestions }}" + + # Should find redirects + if [ "${{
steps.redirect-test.outputs.redirect-count }}" -lt "1" ]; then + echo "❌ Expected at least 1 redirect but found ${{ steps.redirect-test.outputs.redirect-count }}" + exit 1 + fi + + # Should have AI suggestions + if [ -z "${{ steps.redirect-test.outputs.ai-suggestions }}" ]; then + echo "❌ Expected AI suggestions but found none" + exit 1 + fi + + echo "✅ Redirect and AI suggestions test passed" + + test-full-directory: + runs-on: ubuntu-latest + name: Test with full directory scan + steps: + - name: Checkout + uses: actions/checkout@v4 + + - name: Test action with full directory + id: directory-test + uses: ./.github/actions/link-checker + with: + html-path: './test/link-checker' + mode: 'full' + fail-on-broken: 'false' + ai-suggestions: 'true' + create-artifact: 'true' + artifact-name: 'test-link-report' + timeout: '10' + + - name: Verify directory results + run: | + echo "Broken links found: ${{ steps.directory-test.outputs.broken-links-found }}" + echo "Broken link count: ${{ steps.directory-test.outputs.broken-link-count }}" + echo "Redirect count: ${{ steps.directory-test.outputs.redirect-count }}" + + # Should find some broken links across all test files + if [ "${{ steps.directory-test.outputs.broken-links-found }}" != "true" ]; then + echo "❌ Expected broken links in directory scan but found none" + exit 1 + fi + + if [ "${{ steps.directory-test.outputs.broken-link-count }}" -lt "2" ]; then + echo "❌ Expected at least 2 broken links in directory but found ${{ steps.directory-test.outputs.broken-link-count }}" + exit 1 + fi + + echo "✅ Directory scan test passed" + + test-fail-on-broken: + runs-on: ubuntu-latest + name: Test fail-on-broken functionality + steps: + - name: Checkout + uses: actions/checkout@v4 + + - name: Test that action fails when broken links found and fail-on-broken is true + id: fail-test + continue-on-error: true + uses: ./.github/actions/link-checker + with: + html-path: './test/link-checker/broken-links.html' + fail-on-broken: 'true'
+ timeout: '10' + + - name: Verify action failed + run: | + if [ "${{ steps.fail-test.outcome }}" != "failure" ]; then + echo "❌ Expected action to fail but it succeeded" + exit 1 + fi + echo "✅ Fail-on-broken test passed" \ No newline at end of file diff --git a/README.md b/README.md index 8cef683..8180c6c 100644 --- a/README.md +++ b/README.md @@ -25,6 +25,27 @@ A GitHub Action that scans HTML files for Python warnings and optionally fails t See the [action documentation](./.github/actions/check-warnings/README.md) for detailed usage instructions and examples. +### AI-Powered Link Checker Action + +A GitHub Action that validates web links in HTML files with AI-powered suggestions for improvements. Designed to replace traditional link checkers like `lychee` with enhanced functionality. + +**Location**: `.github/actions/link-checker` + +**Usage**: +```yaml +- name: AI-powered link check + uses: QuantEcon/meta/.github/actions/link-checker@main + with: + html-path: './_build/html' + mode: 'full' + ai-suggestions: 'true' + silent-codes: '403,503' +``` + +**Use case**: Perfect for MyST Markdown/Jupyter Book projects. Provides weekly scheduled scans and PR-specific validation with AI suggestions for broken or outdated links. + +See the [action documentation](./.github/actions/link-checker/README.md) for detailed usage instructions and examples. + ### Weekly Report Action A GitHub Action that generates a weekly report summarizing issues and PR activity across all QuantEcon repositories. 
diff --git a/test/README.md b/test/README.md index bbeab05..e2d2597 100644 --- a/test/README.md +++ b/test/README.md @@ -10,6 +10,10 @@ Each GitHub Action has its own test subdirectory: - `clean.html` - HTML file without warnings (negative test case) - `with-warnings.html` - HTML file with warnings (positive test case) +- `link-checker/` - Tests for the `.github/actions/link-checker` action + - `good-links.html` - HTML file with working external links (negative test case) + - `broken-links.html` - HTML file with broken and problematic links (positive test case) + - `redirect-links.html` - HTML file with redirected links for AI suggestion testing - `weekly-report/` - Tests for the `.github/actions/weekly-report` action - `test-basic.sh` - Basic functionality test for the weekly report action @@ -18,4 +22,5 @@ Each GitHub Action has its own test subdirectory: Tests are automatically run by the GitHub Actions workflows in `.github/workflows/`. - For the `check-warnings` action, tests are run by the `test-warning-check.yml` workflow. -- For the `weekly-report` action, tests are run by the `test-weekly-report.yml` workflow. \ No newline at end of file +- For the `link-checker` action, tests are run by the `test-link-checker.yml` workflow. +- For the `weekly-report` action, tests are run by the `test-weekly-report.yml` workflow. diff --git a/test/link-checker/broken-links.html b/test/link-checker/broken-links.html new file mode 100644 index 0000000..dff9000 --- /dev/null +++ b/test/link-checker/broken-links.html @@ -0,0 +1,37 @@ + + + + Test Page with Broken Links + + +

Test Page - With Broken Links

+ +

This page contains broken and problematic links for testing:

+ + + +

Links that should be silently reported:

+ + +

Redirected links that could be improved:

+ + +

Good links for comparison:

+ + + \ No newline at end of file diff --git a/test/link-checker/good-links.html b/test/link-checker/good-links.html new file mode 100644 index 0000000..f913dd9 --- /dev/null +++ b/test/link-checker/good-links.html @@ -0,0 +1,25 @@ + + + + Test Page with Working Links + + +

Test Page - No Broken Links

+ +

This page contains only working external links:

+ + + +

Some internal links that should be ignored:

+ + + \ No newline at end of file diff --git a/test/link-checker/legitimate-slow-links.html b/test/link-checker/legitimate-slow-links.html new file mode 100644 index 0000000..9b9000a --- /dev/null +++ b/test/link-checker/legitimate-slow-links.html @@ -0,0 +1,25 @@ + + + + Test Page with Legitimate but Potentially Slow Links + + +

Test Page - Legitimate but Potentially Problematic Links

+ +

This page contains legitimate links that might be flagged as false positives:

+ + + +

These should be handled gracefully without suggesting removal:

+ + + \ No newline at end of file diff --git a/test/link-checker/redirect-links.html b/test/link-checker/redirect-links.html new file mode 100644 index 0000000..f00d863 --- /dev/null +++ b/test/link-checker/redirect-links.html @@ -0,0 +1,24 @@ + + + + Test Page with Redirects + + +

Test Page - With Redirects

+ +

This page contains links that redirect to test AI suggestions:

+ + + +

These should generate AI suggestions:

+ + + \ No newline at end of file diff --git a/test/link-checker/test_bot_blocking.py b/test/link-checker/test_bot_blocking.py new file mode 100644 index 0000000..77ea968 --- /dev/null +++ b/test/link-checker/test_bot_blocking.py @@ -0,0 +1,75 @@ +#!/usr/bin/env python3 +""" +Test script to simulate bot blocking scenarios +""" +import sys +import os +sys.path.insert(0, os.path.join(os.path.dirname(__file__), '../../.github/actions/link-checker')) + +from link_checker import is_likely_bot_blocked + +def test_bot_blocking_detection(): + """Test the bot blocking detection logic""" + + # Test major site domains that commonly block bots + test_cases = [ + ("https://www.netflix.com/", True, "Netflix should be detected as likely bot-blocked"), + ("https://code.tutsplus.com/tutorial/something", False, "Tutsplus should not be automatically flagged"), + ("https://www.amazon.com/", True, "Amazon should be detected as likely bot-blocked"), + ("https://example.com/", False, "Example.com should not be flagged"), + ("https://github.com/user/repo", False, "GitHub should not be flagged as bot-blocked"), + ("https://www.wikipedia.org/wiki/Test", True, "Wikipedia should be detected as likely bot-blocked"), + ] + + print("Testing bot blocking detection logic:") + print("-" * 50) + + for url, expected, description in test_cases: + result = is_likely_bot_blocked(url) + status = "✅ PASS" if result == expected else "❌ FAIL" + print(f"{status}: {description}") + print(f" URL: {url}") + print(f" Expected: {expected}, Got: {result}") + print() + + # Test encoding error detection + print("Testing encoding error detection:") + print("-" * 50) + + encoding_cases = [ + ("https://www.netflix.com/", None, None, "encoding issue", True, "Encoding error should be detected"), + ("https://example.com/", None, None, "timeout", False, "Regular timeout should not be flagged"), + ("https://example.com/", None, 429, None, True, "Rate limiting should be detected"), + ("https://example.com/", None, 503, None, 
True, "Service unavailable should be detected"), + ] + + for url, content, status_code, error, expected, description in encoding_cases: + result = is_likely_bot_blocked(url, content, status_code, error) + status = "✅ PASS" if result == expected else "❌ FAIL" + print(f"{status}: {description}") + print(f" URL: {url}, Status: {status_code}, Error: {error}") + print(f" Expected: {expected}, Got: {result}") + print() + + # Test legitimate domains with connection errors (simulating network restrictions) + print("Testing legitimate domain protection:") + print("-" * 50) + + legitimate_cases = [ + ("https://www.python.org/", None, None, "Connection Error", True, "Python.org with connection error should be protected"), + ("https://jupyter.org/", None, None, "Connection Error", True, "Jupyter.org with connection error should be protected"), + ("https://docs.python.org/3/", None, None, "Connection Error", True, "Python docs with connection error should be protected"), + ("https://github.com/user/repo", None, None, "Connection Error", True, "GitHub with connection error should be protected"), + ("https://unknown-domain.com/", None, None, "Connection Error", False, "Unknown domain with connection error should not be protected"), + ] + + for url, content, status_code, error, expected, description in legitimate_cases: + result = is_likely_bot_blocked(url, content, status_code, error) + status = "✅ PASS" if result == expected else "❌ FAIL" + print(f"{status}: {description}") + print(f" URL: {url}, Error: {error}") + print(f" Expected: {expected}, Got: {result}") + print() + +if __name__ == "__main__": + test_bot_blocking_detection() \ No newline at end of file
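
The JSON printed by `link_checker.py` is the only contract between it and `format_results.py`, so the suggestion logic can also be exercised offline in the same spirit as `test_bot_blocking.py`, without any network access. This sketch mirrors the redirect branch of `generate_ai_suggestions()` on a hand-built result; the URLs are invented purely for illustration:

```python
import json

# A synthetic entry in the shape link_checker.py emits for redirected links
# (URLs are invented for illustration).
redirect_results = [
    {"url": "http://example.com/old", "final_url": "https://example.com/new",
     "redirect_count": 2},
]

# Mirrors the redirect branch of generate_ai_suggestions().
ai_suggestions = []
for result in redirect_results:
    if result["redirect_count"] > 0:
        ai_suggestions.append({
            "original_url": result["url"],
            "issue": f"Redirected {result['redirect_count']} times",
            "suggestions": [{
                "type": "redirect_update",
                "url": result["final_url"],
                "reason": f"Update to final destination to avoid "
                          f"{result['redirect_count']} redirect(s)",
            }],
        })

# Emit the same top-level keys the checker prints for downstream formatting.
print(json.dumps({"redirect_results": redirect_results,
                  "ai_suggestions": ai_suggestions}, indent=2))
```

Piping this output into `format_results.py` with the `ai` mode should render the same 🤖/💡 lines the action posts in issues and PR comments.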