# LLMsGen: AI-Powered Website Content Extraction for LLMs

LLMsGen is a production-ready Python SDK for generating llms.txt files from websites using advanced web crawling and AI-powered content analysis. Extract clean, structured content from any website for training or fine-tuning large language models.
- 🕷️ Advanced Web Crawling - Multi-strategy crawling (systematic, comprehensive, sitemap-based)
- 🤖 AI-Powered Analysis - Support for Ollama, Gemini, OpenAI, and Anthropic models
- 🗺️ Sitemap Integration - Automatic sitemap discovery and processing
- 📊 Multiple Output Formats - Text, JSON, YAML with full-text options
- 🔒 Production Security - Rate limiting, memory management, error recovery
- ⚡ High Performance - Parallel processing with adaptive batch sizing
- 🎯 Flexible API - Use as CLI tool or integrate into your applications
Install from PyPI (when published):

```bash
pip install llmsgen
```

Or install from source:

```bash
git clone https://github.com/yourusername/llmsgen.git
cd llmsgen
pip install -e .
```

Quick start with the Python API:

```python
import asyncio

from llmsgen import LLMsGenerator

async def main():
    # Initialize the generator
    generator = LLMsGenerator()

    # Generate llms.txt from a website
    await generator.generate_llmstxt(
        base_url="https://docs.python.org",
        max_pages=50,
        export_format='text'
    )

asyncio.run(main())
```

Or use the command-line interface:

```bash
# Interactive mode
llmsgen

# Direct URL
llmsgen https://docs.python.org

# Advanced options
llmsgen https://docs.python.org --max-pages 100 --format json --workers 10
```

`LLMsGenerator` is the main SDK class for generating llms.txt files:
```python
from llmsgen import LLMsGenerator

generator = LLMsGenerator()

# Generate with custom options
await generator.generate_llmstxt(
    base_url="https://example.com",
    max_pages=100,                   # Number of pages to crawl
    export_format='text',            # 'text', 'json', 'yaml'
    crawl_strategy='comprehensive',  # 'systematic', 'comprehensive', 'sitemap'
    include_full_text=True,          # Include full content
    parallel_workers=8,              # Concurrent workers
    batch_size=15,                   # Batch processing size
    sitemap_url='auto'               # Sitemap URL or 'auto'
)
```

`WebCrawler` is an advanced web crawler with multiple strategies:
```python
from llmsgen import WebCrawler

crawler = WebCrawler()

# Systematic crawling
pages = await crawler.crawl_website(
    base_url="https://example.com",
    max_pages=50,
    comprehensive=False
)

# Sitemap-based crawling
pages = await crawler.crawl_from_sitemap(
    base_url="https://example.com",
    sitemap_url="auto",
    max_pages=1000
)
```

`ModelManager` and `AIClient` handle AI model management and integration:
```python
from llmsgen import ModelManager, AIClient

# Initialize model manager
model_manager = ModelManager()

# List available models
models = model_manager.list_models()

# Setup AI client
ai_client = AIClient(model_manager)
ai_client.set_model(selected_model)  # selected_model: an entry chosen from list_models()
```

Crawl an entire domain comprehensively:

```python
from llmsgen import LLMsGenerator

async def comprehensive_crawl():
    generator = LLMsGenerator()

    # Crawl entire domain with unlimited pages
    await generator.generate_llmstxt(
        base_url="https://fastapi.tiangolo.com",
        max_pages=999999,  # Effectively unlimited
        crawl_strategy='comprehensive',
        export_format='json',
        include_full_text=True
    )
```

Extract all pages listed in a sitemap:

```python
async def sitemap_extraction():
    generator = LLMsGenerator()

    # Extract all pages from the sitemap
    await generator.generate_llmstxt(
        base_url="https://www.alternates.ai",
        crawl_strategy='sitemap',
        sitemap_url='https://www.alternates.ai/sitemap.xml',
        export_format='text'
    )
```

Build a custom processing pipeline:

```python
from llmsgen import WebCrawler

async def custom_pipeline():
    crawler = WebCrawler()

    # Step 1: Crawl website
    pages = await crawler.crawl_website("https://example.com")

    # Step 2: Custom filtering
    quality_pages = [
        page for page in pages
        if page['word_count'] > 200
    ]

    # Step 3: Custom processing
    # Your custom logic here...
```

Process multiple websites in a batch:

```python
async def batch_process():
    generator = LLMsGenerator()

    websites = [
        "https://docs.python.org",
        "https://fastapi.tiangolo.com",
        "https://www.djangoproject.com"
    ]

    for website in websites:
        await generator.generate_llmstxt(
            base_url=website,
            max_pages=50,
            export_format='json'
        )
```

Create a `.env` file for configuration:
```bash
# AI Model API Keys
GEMINI_API_KEY=your_gemini_api_key_here
OPENAI_API_KEY=your_openai_api_key_here
ANTHROPIC_API_KEY=your_anthropic_api_key_here

# Output Configuration
OUTPUT_DIR=./output
USE_LOCAL_PLAYWRIGHT=true
LOCAL_PLAYWRIGHT_BROWSERS=auto

# Performance Settings
DEFAULT_WORKERS=5
DEFAULT_BATCH_SIZE=10
MAX_PAGES_DEFAULT=50
```

The SDK supports multiple AI providers:
- Ollama (local models) - llama3.2, codellama, etc.
- Google Gemini - gemini-1.5-flash, gemini-1.5-pro
- OpenAI - gpt-4, gpt-3.5-turbo
- Anthropic - claude-3-sonnet, claude-3-haiku
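How the SDK decides which of these providers is usable isn't detailed here; as a rough sketch, availability could key off which of the `.env` API-key variables above are set. The `available_providers` helper below is hypothetical, not part of the LLMsGen API:

```python
import os

# Provider API-key environment variables, as named in the .env example above.
PROVIDER_KEYS = {
    "gemini": "GEMINI_API_KEY",
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
}

def available_providers(env=None):
    """Return providers whose API key is set. Ollama runs locally and
    needs no key, so it is always considered available."""
    if env is None:
        env = os.environ
    providers = ["ollama"]
    providers += [name for name, var in PROVIDER_KEYS.items() if env.get(var)]
    return providers
```

For example, with only `OPENAI_API_KEY` set, this returns `["ollama", "openai"]`.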
Generative SEO, also known as Generative Engine Optimization (GEO), represents a fundamental shift in SEO strategy. While traditional SEO focuses on optimizing for search engines like Google and Bing, generative SEO targets AI-driven search engines that generate comprehensive, context-rich answers rather than just listing websites.
- Targeted Search Engines: Traditional SEO optimizes for conventional search engines, while generative SEO targets AI-driven platforms like ChatGPT, Perplexity, Google AI Overviews, Gemini, and Copilot.
- Content Approach: Instead of just tweaking existing content with keywords, generative SEO focuses on creating conversations and content that naturally aligns with evolving search patterns. The emphasis shifts from keyword density to contextual richness, incorporating citations, statistics, and AI-driven insights.
- Success Metrics: Traditional SEO measures success through click-through rates and time on page, while generative SEO focuses on impression metrics - how often your content appears in AI-generated responses.
Generative SEO uses machine learning, user intent analysis, and AI-focused content optimization to improve visibility in AI-generated results. The process involves:
- Research and Analysis: GEO strategy requires understanding how AI interprets content and chooses what to feature in answers. This involves analyzing how users interact with AI platforms and studying natural language patterns across queries.
- Content Quality and Relevance: In generative AI SEO, relevance carries more weight than keyword repetition because AI models look for value, not volume.
Content that gets featured typically includes:
- Real problems and solutions to actual user queries
- Brand mentions and topical depth to signal authority
- Personal experiences that add credibility
- Original insights and straightforward explanations
Key benefits of generative SEO include:

- Scalability: AI tools speed up content creation and idea generation, making SEO efforts more efficient while still requiring human strategy and creativity for meaningful connections.
- Personalization: AI enables creating content tailored to user needs, preferences, and search intent, increasing engagement and conversions.
- Competitive Edge: Early adopters of generative SEO strategies gain advantages in achieving higher rankings and visibility.
- Efficiency: AI automates time-consuming tasks like keyword research, content optimization, and link building, freeing up time for strategic planning.
Best practices for optimizing content for AI-driven search:

- Create High-Quality, Authoritative Content: Focus on providing genuine value with in-depth articles, tutorials, and guides.
- Structure for Readability: Use clear headings, bullet points, and short paragraphs to make your content easy for both humans and AI to parse.
- Incorporate Natural Language: Write in a conversational tone that directly answers user questions.
- Build Topical Authority: Cover a subject area comprehensively to establish your site as an expert resource.
- Optimize for Featured Snippets: Target question-based queries and provide concise, direct answers.
- Use Schema Markup: Implement structured data to help search engines understand the context of your content.
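As one way to act on the schema-markup advice above, structured data is commonly embedded as schema.org JSON-LD in a `<script type="application/ld+json">` tag. The snippet below is a generic illustration of building such a payload, not part of LLMsGen:

```python
import json

def article_jsonld(headline, author, url, date_published):
    """Build a minimal schema.org Article object as a JSON-LD string."""
    data = {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "author": {"@type": "Person", "name": author},
        "url": url,
        "datePublished": date_published,
    }
    return json.dumps(data, indent=2)
```

Real pages usually add more properties (image, publisher, dates modified), but even this minimal object gives AI-driven engines explicit context about the content.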
Choose a crawl strategy:

```python
# Systematic: main page + direct links (default)
crawl_strategy='systematic'

# Comprehensive: deep recursive crawling
crawl_strategy='comprehensive'

# Sitemap: use the website's sitemap for discovery
crawl_strategy='sitemap'
```

Pick an export format:

```python
# Plain text format (default)
export_format='text'

# Structured JSON with metadata
export_format='json'

# YAML format for configuration
export_format='yaml'
```

Tune performance for large sites:

```python
await generator.generate_llmstxt(
    base_url="https://example.com",
    parallel_workers=10,      # More workers = faster crawling
    batch_size=20,            # Larger batches = better throughput
    max_pages=1000,           # Adjust based on site size
    crawl_strategy='sitemap'  # Fastest for large sites
)
```

Example text output:

```text
# Website: https://example.com
# Generated: 2024-01-20T10:30:00Z
# Pages: 156
# Strategy: comprehensive

## Page 1: Homepage | https://example.com
This is the main content of the homepage...

## Page 2: About Us | https://example.com/about
Information about the company...
```
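The text format above is easy to post-process. A minimal illustrative parser (not part of the SDK, and assuming page bodies don't themselves contain `## ` headers) that splits it back into pages might look like:

```python
def parse_llmstxt(text):
    """Split an llms.txt text export into (title, url, content) tuples,
    based on the '## Page N: Title | URL' section headers."""
    pages = []
    for block in text.split("\n## ")[1:]:  # [0] is the comment header
        header, _, body = block.partition("\n")
        meta = header.split(": ", 1)[1]    # drop the "Page N" prefix
        title, _, url = meta.partition(" | ")
        pages.append((title, url.strip(), body.strip()))
    return pages
```

This can feed the extracted pages into a downstream filtering or training pipeline without re-crawling.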
Example JSON output:

```json
{
  "metadata": {
    "base_url": "https://example.com",
    "generation_timestamp": "2024-01-20T10:30:00Z",
    "total_pages": 156,
    "crawl_strategy": "comprehensive",
    "model_used": "gemini-1.5-flash"
  },
  "pages": [
    {
      "url": "https://example.com",
      "title": "Homepage",
      "description": "AI-generated description...",
      "word_count": 1250,
      "content": "Full page content...",
      "crawl_timestamp": "2024-01-20T10:30:00Z"
    }
  ]
}
```

Development setup:

```bash
git clone https://github.com/yourusername/llmsgen.git
cd llmsgen

# Install in development mode
pip install -e ".[dev]"

# Run tests
pytest

# Format code
black .
isort .
```

To contribute:

- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- 🐛 Bug Reports: GitHub Issues
- 💬 Discussions: GitHub Discussions
- 📧 Email: hrishikeshgupta007@gmail.com
Planned features:

- GraphQL API support
- Real-time streaming crawling
- Advanced content filtering
- Multi-language support
- Docker containers
- Cloud deployment options
For questions, suggestions, or collaboration opportunities:
- Email: hrishikeshgupta007@gmail.com
- GitHub: Create an Issue
- Built with Crawl4AI for advanced web crawling
- Supports multiple AI providers: Ollama, Google Gemini, OpenAI, Anthropic
- Inspired by the llms.txt specification for LLM training data
Made with ❤️ by the LLMsGen Team