AIContextScraper

AIContextScraper is a reusable Python scraping framework that crawls documentation websites and prepares their content for AI training. It provides asynchronous crawling, structured output (JSON, TXT, and optional PDF), and content processing with token counting and chunking.

🔑 Key Features

🔁 Recursive crawling from a single starting URL
🧠 Intelligent content extraction and structuring
📄 Multiple export formats (JSON, TXT, PDF)
⚡ Async operations for improved performance
🔄 Automatic retry mechanism
📊 Token counting and chunking
📁 Organized output structure
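
To illustrate how async operations and automatic retries can fit together, here is a minimal sketch using only the standard library. The names `fetch_with_retry` and `make_flaky_fetch`, the backoff parameters, and the choice of exponential backoff are assumptions made for illustration, not the project's actual API. The stand-in fetcher simulates failures instead of making real HTTP requests.

```python
import asyncio


async def fetch_with_retry(fetch, url, max_retries=3, base_delay=0.01):
    """Call an async fetch function, retrying with exponential backoff.

    Hypothetical helper: the real project may retry differently.
    """
    last_error = None
    for attempt in range(max_retries + 1):
        try:
            return await fetch(url)
        except ConnectionError as exc:
            last_error = exc
            if attempt < max_retries:
                # Wait base_delay, 2 * base_delay, 4 * base_delay, ...
                await asyncio.sleep(base_delay * (2 ** attempt))
    raise last_error


def make_flaky_fetch(failures):
    """Build a stand-in fetcher that fails `failures` times, then succeeds."""
    state = {"calls": 0}

    async def flaky_fetch(url):
        state["calls"] += 1
        if state["calls"] <= failures:
            raise ConnectionError(f"simulated failure for {url}")
        return f"<html>content of {url}</html>"

    return flaky_fetch, state


async def crawl(urls, failures=2):
    # Each URL gets its own flaky fetcher; all URLs are fetched concurrently.
    tasks = [fetch_with_retry(make_flaky_fetch(failures)[0], u) for u in urls]
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    pages = asyncio.run(
        crawl(["https://example.com/docs/a", "https://example.com/docs/b"])
    )
    print(len(pages))
```

In a real crawler the `fetch` argument would wrap an HTTP client call. Passing it in as a parameter keeps the retry logic testable without network access.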

📋 Requirements

- Python 3.8+
- Required packages listed in requirements.txt

🛠️ Installation

Clone the repository:

```
git clone [repository-url]
cd AIContextScraper
```

Create a virtual environment:

```
python -m venv venv
source venv/bin/activate  # On Windows: .\venv\Scripts\activate
```

Install dependencies:
```
pip install -r requirements.txt
```

💻 Usage

Run the scraper:
```
python main.py
```
Follow the prompts:
- Enter the documentation website URL
- Specify a project name (or use default)
- Choose whether to export PDFs

📁 Output Structure

```
D:/AI_Training_Corpora/PROJECT_NAME/
├── raw_html/           # Original HTML content
├── json/               # Structured content with metadata
├── txt/                # Chunked text content
├── pdf/                # PDF exports (optional)
├── logs/               # Execution logs
└── metadata.json       # Run statistics and summary
```
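
The layout above could be created with a few lines of `pathlib`. This is a sketch under assumptions: `create_output_tree`, the `export_pdf` flag, and the placeholder contents of `metadata.json` are hypothetical and are not taken from the project's code.

```python
import json
import tempfile
from pathlib import Path

# Folder names taken from the output structure shown above.
SUBDIRS = ["raw_html", "json", "txt", "pdf", "logs"]


def create_output_tree(base_dir, project_name, export_pdf=False):
    """Create the per-project folders and a placeholder metadata.json."""
    root = Path(base_dir) / project_name
    for name in SUBDIRS:
        if name == "pdf" and not export_pdf:
            continue  # PDF export is optional
        (root / name).mkdir(parents=True, exist_ok=True)
    metadata = root / "metadata.json"
    if not metadata.exists():
        # Placeholder content; the real run statistics are not documented here.
        metadata.write_text(json.dumps({"pages": 0}, indent=2), encoding="utf-8")
    return root


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = create_output_tree(tmp, "PROJECT_NAME", export_pdf=True)
        print(sorted(p.name for p in root.iterdir()))
```

Using `exist_ok=True` lets the function be called again on the same project without failing on existing folders.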

🧩 JSON Structure

```
{
  "title": "Page Title",
  "url": "https://example.com/docs/page",
  "content": "Extracted and cleaned content...",
  "tokens": 184,
  "timestamp": "2023-12-25T20:15:23Z"
}
```
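
A record with these fields could be assembled roughly as follows. The helper names (`count_tokens`, `chunk_text`, `build_record`) are hypothetical, and counting whitespace-separated words is only a stand-in. The project may use a real tokenizer, which this README does not specify.

```python
import json
from datetime import datetime, timezone


def count_tokens(text):
    """Rough token count: whitespace-separated words (an assumption)."""
    return len(text.split())


def chunk_text(text, max_tokens):
    """Split text into chunks of at most max_tokens whitespace tokens."""
    words = text.split()
    return [
        " ".join(words[i:i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]


def build_record(title, url, content, now=None):
    """Assemble a page record with the five fields of the JSON structure."""
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    return {
        "title": title,
        "url": url,
        "content": content,
        "tokens": count_tokens(content),
        "timestamp": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }


if __name__ == "__main__":
    record = build_record(
        "Page Title",
        "https://example.com/docs/page",
        "Extracted and cleaned content...",
    )
    print(json.dumps(record, indent=2))
```

The same `chunk_text` helper could produce the pieces written to the txt/ folder, assuming chunks are bounded by token count.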

⚙️ Configuration

Adjust settings in config.py:

- HTTP request parameters
- Crawling limits
- Content processing options
- Output formatting
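
As an illustration of what these four groups of settings might look like, here is a hypothetical sketch. Every field name and default value below is an assumption for illustration only. Check config.py for the settings the project actually defines.

```python
from dataclasses import dataclass, field


@dataclass
class ScraperConfig:
    """Hypothetical settings container; not the project's real config.py."""

    # HTTP request parameters
    request_timeout: float = 30.0
    max_retries: int = 3
    user_agent: str = "AIContextScraper"
    # Crawling limits
    max_pages: int = 500
    max_depth: int = 5
    # Content processing options
    chunk_size_tokens: int = 512
    # Output formatting
    export_formats: list = field(default_factory=lambda: ["json", "txt"])


if __name__ == "__main__":
    # Override a single setting while keeping the other defaults.
    config = ScraperConfig(max_pages=50)
    print(config)
```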

🔄 Future Enhancements

- Direct embedding export for vector databases
- Automatic content classification
- Markdown export support
- Browser-based crawling for JS-heavy sites
- Scheduled updates for documentation sites

📝 License

MIT License - feel free to use and modify for your needs.

