cdcore09 commented Dec 8, 2025

Add llms.txt Generation for LLM-Friendly Site Documentation

Summary

This PR adds automatic generation of an llms.txt file from the US-RSE Jekyll site, following the llmstxt.org specification, so that the website content is more accessible to Large Language Models (LLMs).

Changes

This PR adds three new files:

1. .github/workflows/llms-txt.yml - GitHub Actions Workflow

  • Triggers on pushes to main, pull requests, and manual workflow dispatch
  • Builds the Jekyll site in production mode
  • Runs the Python script to generate llms.txt
  • Uploads the generated file as a workflow artifact
  • Validates successful generation with file size and line count reporting
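
The validation step in the last bullet could look roughly like the sketch below. This is an illustration only, not the check the workflow actually runs: the function name `report_llms_txt` and the choice to fail on an empty file are assumptions.

```python
from pathlib import Path


def report_llms_txt(path: str) -> tuple[int, int]:
    """Return (size_in_bytes, line_count) for the generated file.

    Hypothetical helper: raises if the file is missing or empty so that a
    CI step calling it would fail visibly.
    """
    file = Path(path)
    if not file.is_file():
        raise FileNotFoundError(f"{path} was not generated")
    size = file.stat().st_size
    if size == 0:
        raise ValueError(f"{path} is empty")
    # Count lines lazily so large files are not loaded into memory at once.
    with file.open(encoding="utf-8") as handle:
        lines = sum(1 for _ in handle)
    return size, lines
```

A workflow step could call `report_llms_txt` on the generated file and print both numbers to the job log.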

2. scripts/generate_llms_txt.py - Generation Script

  • Walks through the _site directory after Jekyll build
  • Extracts text content from HTML pages using a custom HTML parser
  • Filters out navigation, footer, scripts, and other non-content elements
  • Organizes pages into sections based on URL structure
  • Generates canonical URLs for each page
  • Creates concise summaries (truncated at ~250 characters)
  • Outputs a markdown-formatted llms.txt with organized sections
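
As a rough illustration of the first three bullets, the sketch below walks a built site and extracts visible text with the standard-library `html.parser.HTMLParser`. It is not the code from `scripts/generate_llms_txt.py`: the class name `TextExtractor`, the helper names, and the exact set of skipped tags are assumptions.

```python
from html.parser import HTMLParser
from pathlib import Path

# Assumed list of non-content elements; the real script's list may differ.
SKIP_TAGS = {"nav", "footer", "script", "style"}


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything nested inside SKIP_TAGS."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return " ".join(self._chunks)


def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


def iter_pages(site_dir: str):
    """Yield (relative_path, extracted_text) for every HTML file under site_dir."""
    root = Path(site_dir)
    for page in sorted(root.rglob("*.html")):
        yield page.relative_to(root).as_posix(), extract_text(
            page.read_text(encoding="utf-8")
        )
```

Tracking a skip depth rather than a boolean keeps the filter correct when skipped elements are nested, for example a `script` tag inside a `footer`.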

3. scripts/requirements.txt - Python Dependencies

  • pyyaml>=6.0 (for parsing _config.yml)
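
For context on why the YAML parser is needed, here is a minimal sketch of reading `_config.yml` and building a canonical URL from it. It assumes the script relies on Jekyll's standard `url` and `baseurl` keys and collapses `index.html` into a trailing slash; the function names and the `example.org` address in the usage note are placeholders, not taken from the PR.

```python
def load_site_config(path: str = "_config.yml") -> dict:
    """Parse the Jekyll config file (requires the pyyaml dependency)."""
    import yaml  # imported lazily so the URL helper below works without it

    with open(path, encoding="utf-8") as handle:
        return yaml.safe_load(handle) or {}


def canonical_url(config: dict, rel_path: str) -> str:
    """Join the site's url, baseurl, and a page path relative to _site."""
    base = str(config.get("url") or "").rstrip("/")
    baseurl = str(config.get("baseurl") or "").strip("/")
    path = rel_path.replace("\\", "/").lstrip("/")
    # Assumption: directory index pages are published with a trailing slash.
    if path == "index.html":
        path = ""
    elif path.endswith("/index.html"):
        path = path[: -len("index.html")]
    prefix = "/".join(part for part in (base, baseurl) if part)
    return prefix + "/" + path
```

With a config of `{"url": "https://example.org", "baseurl": ""}`, the path `about/index.html` maps to `https://example.org/about/`.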

Features

  • Smart Content Extraction: Ignores utility pages (404, feeds, sitemaps) and asset files
  • Organized Output: Groups pages by section with "Root" pages listed first
  • Clean Summaries: Intelligently truncates text at sentence boundaries
  • Canonical URLs: Uses site configuration to build proper URLs
  • Validation: Workflow includes checks to ensure successful generation
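
The "Clean Summaries" behavior could be approximated as follows. This is a sketch under assumptions, not the script's actual logic: the name `summarize`, the 250-character default, and the word-boundary fallback with a trailing ellipsis are all illustrative choices.

```python
import re


def summarize(text: str, limit: int = 250) -> str:
    """Shorten text to at most `limit` characters, preferring sentence ends."""
    text = " ".join(text.split())  # collapse runs of whitespace
    if len(text) <= limit:
        return text
    # Sentence-ending punctuation followed by whitespace, within the limit.
    # Slicing one character past the limit lets a sentence end exactly there.
    ends = [m.end() for m in re.finditer(r"[.!?](?=\s)", text[: limit + 1])]
    if ends:
        return text[: ends[-1]]
    # Fallback (assumed): cut at the last word boundary and mark the cut.
    cut = text[: limit - 3].rsplit(" ", 1)[0]
    return cut.rstrip(",;:") + "..."
```

Requiring whitespace after the punctuation avoids cutting inside tokens such as version numbers or decimals.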

Use Case

The llms.txt file provides LLMs with a structured, text-based overview of the entire US-RSE website, making it easier for AI assistants to understand and reference site content when answering questions about US-RSE.

Testing

The workflow can be manually triggered via workflow dispatch to test the generation process. Generated files are uploaded as artifacts for review.

Technical Details

  • Python Version: 3.11
  • Ruby Version: 3.1
  • Artifact Retention: 30 days
  • Summary Length: ~250 characters per page
  • Output Format: Markdown with section headers and bulleted lists
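
To make the output format concrete, the sketch below groups pages by their first URL segment and renders section headers with bulleted links, listing "Root" first. The page dictionary keys, the rule that single-segment paths belong to "Root", and the function names are assumptions made for illustration; the real script may organize pages differently.

```python
from collections import defaultdict


def section_for(url_path: str) -> str:
    """Derive a section name from the first path segment (assumed rule)."""
    parts = [p for p in url_path.strip("/").split("/") if p]
    # Pages directly under the site root are grouped as "Root".
    if len(parts) <= 1:
        return "Root"
    return parts[0].replace("-", " ").title()


def render_llms_txt(site_title: str, pages: list[dict]) -> str:
    """Render pages (dicts with path, url, title, summary) as markdown."""
    sections: dict[str, list[dict]] = defaultdict(list)
    for page in pages:
        sections[section_for(page["path"])].append(page)
    # "Root" sorts first; the remaining sections are alphabetical.
    ordered = sorted(sections, key=lambda name: (name != "Root", name))
    lines = [f"# {site_title}", ""]
    for name in ordered:
        lines += [f"## {name}", ""]
        for page in sorted(sections[name], key=lambda p: p["url"]):
            lines.append(f"- [{page['title']}]({page['url']}): {page['summary']}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"
```

Each bullet pairs a linked page title with its summary, which keeps the file readable both as plain text and as rendered markdown.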

cdcore09 added the enhancement (New feature or request) and development (Things related to site deployment / development processes) labels Dec 8, 2025
cdcore09 self-assigned this Dec 8, 2025

mrmundt commented Dec 8, 2025

Question: why do we want this feature?

cdcore09 commented Dec 8, 2025

Question: why do we want this feature?

The purpose of llms.txt is to give Large Language Models (LLMs) a structured, machine-readable guide to our website's most important content, with summaries and an outline of the site's structure. That helps AI tools understand, access, and accurately represent the site's information, which should improve discoverability and lead to more relevant AI-generated answers and citations. It's like a curated highlight reel for AI, directing it to valuable resources. This should increase our discoverability through GEO (generative engine optimization) and would let us, if we choose to in the future, build a Slackbot that members can ask questions about US-RSE. Plus, it costs us nothing to add it.

  pull_request:
    branches:
      - main
  workflow_dispatch:

Do you actually want this line? If so, I don't think it's complete. This is what I normally do:

  workflow_dispatch:
    inputs:
      git-ref:
        description: Git Hash (Optional)
        required: false

Gotcha! I made some updates to the GH Actions to make the workflow more robust.

cdcore09 requested a review from mrmundt December 9, 2025 20:49
cdcore09 merged commit 26cf054 into main Dec 19, 2025
2 of 3 checks passed
cdcore09 deleted the cdcore09/feat/llmstxt branch December 19, 2025 16:20