# Tech News Aggregator

A Python-based news aggregation tool that automatically collects, categorizes, and summarizes technology articles from leading industry sources using AI-powered analysis.
This application streamlines the process of staying current with technology news by:
- Automated Collection: Scrapes articles from multiple curated tech news sources
- AI Categorization: Automatically classifies articles into relevant technical domains
- Intelligent Summarization: Generates concise summaries using AI models (Claude or Gemini)
- Structured Export: Outputs organized data in CSV format for analysis or archival
- Batch Processing: Optimized API calls to work within free-tier rate limits
## Table of Contents

- [Features](#features)
- [News Sources](#news-sources)
- [Article Categories](#article-categories)
- [Installation](#installation)
- [Configuration](#configuration)
- [Usage](#usage)
- [Command-Line Options](#command-line-options)
- [Makefile Commands](#makefile-commands)
- [Output Format](#output-format)
- [Architecture](#architecture)
- [Performance](#performance)
- [Troubleshooting](#troubleshooting)
- [Contributing](#contributing)
- [License](#license)
## Features

- Multi-Source Scraping: Aggregates articles from 5+ premium tech news sources
- Dual AI Provider Support: Compatible with both Anthropic Claude and Google Gemini APIs
- Smart Batch Processing: Processes 3 articles per API call to optimize rate limits
- Flexible Output Options: Terminal display and/or CSV export
- Deduplication: Optional URL-based duplicate removal
- Rate Limiting: Built-in delays to respect source website policies
- Error Handling: Graceful failure handling with detailed error messages
- Functional Programming: Clean, testable code with pure functions
- Type Hints: Full type annotations for better code clarity
- Dataclasses: Immutable data structures for articles and configuration
- Configurable: Customizable scraping and AI settings
- No Database Required: Stateless operation with CSV output
## News Sources

The aggregator collects articles from the following sources:
| Source | Focus Area | Article Limit |
|---|---|---|
| ByteByteGo | System design, engineering fundamentals | 10 |
| InfoQ | Software development, enterprise tech | 15 |
| Hacker News | Technology, startups, programming | 15 |
| Last Week in AWS | AWS services, cloud infrastructure | 10 |
| The Pragmatic Engineer | Software engineering, career insights | 10 |
Total: Approximately 60 articles per run
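
The table above maps naturally onto a small registry in code. A minimal sketch, assuming a hypothetical `FEED_SOURCES` mapping; the feed URLs are placeholders, and the real source definitions live in `scrapers.py`:

```python
# Hypothetical source registry; names and limits mirror the table above.
# Feed URLs are placeholders, not the project's actual values.
FEED_SOURCES: dict[str, tuple[str, int]] = {
    "ByteByteGo": ("https://example.com/bytebytego/feed", 10),
    "InfoQ": ("https://example.com/infoq/feed", 15),
    "Last Week in AWS": ("https://example.com/lwia/feed", 10),
    "The Pragmatic Engineer": ("https://example.com/pragmatic/feed", 10),
}

# Hacker News (limit 15) is scraped from HTML rather than RSS,
# so it would be registered separately.
```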
## Article Categories

Articles are automatically classified into one of the following categories:
- AI/Machine Learning - LLMs, neural networks, ML infrastructure
- Algorithms & Data Structures - Computational theory, optimization
- Cloud/DevOps/Infrastructure - AWS, Kubernetes, containerization
- Software Architecture - System design, architectural patterns
- Programming Languages - Language features, new releases
- Databases - SQL, NoSQL, data storage systems
- Security - Application security, vulnerabilities, best practices
- Web Development - Frontend, backend, frameworks
- Mobile Development - iOS, Android, cross-platform
- Career/Leadership - Engineering management, career growth
- General Tech News - Industry news, company updates
- Other - Miscellaneous or uncategorizable content
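
For reference, here is a minimal sketch of how these categories might appear as the `categories` tuple consumed by `AIConfig` (the exact strings in `models.py` may differ):

```python
# Assumed category tuple, mirroring the list above; the actual values
# live in the categories field of AIConfig in models.py.
CATEGORIES: tuple[str, ...] = (
    "AI/Machine Learning",
    "Algorithms & Data Structures",
    "Cloud/DevOps/Infrastructure",
    "Software Architecture",
    "Programming Languages",
    "Databases",
    "Security",
    "Web Development",
    "Mobile Development",
    "Career/Leadership",
    "General Tech News",
    "Other",
)
```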
## Installation

Prerequisites:

- Python 3.10 or higher
- pip (Python package manager)
- Internet connection for scraping and API calls
- Clone the repository:

```bash
git clone https://github.com/yourusername/tech-news-aggregator.git
cd tech-news-aggregator
```

- Create and activate a virtual environment:

```bash
# Linux/macOS
python3 -m venv venv
source venv/bin/activate

# Windows
python -m venv venv
venv\Scripts\activate
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

Or using make:

```bash
make install
```

## Configuration

Create a `.env` file in the project root:
```
# Choose your AI provider (gemini or claude)
AI_PROVIDER=gemini

# Google Gemini API Key (free tier: 20 requests/minute)
GEMINI_API_KEY=your_gemini_api_key_here

# Anthropic Claude API Key (optional, only if using Claude)
ANTHROPIC_API_KEY=your_claude_api_key_here
```

Google Gemini (recommended for free tier):
- Visit https://aistudio.google.com/app/apikey
- Sign in with your Google account
- Click "Create API Key"
- Copy the key to your `.env` file
Anthropic Claude:
- Visit https://console.anthropic.com/
- Sign up or log in
- Navigate to API Keys section
- Generate a new key
- Copy the key to your `.env` file
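
As a quick sanity check, the keys can be read back with `python-dotenv` (a project dependency). A minimal sketch, assuming the variable names above; the selection logic is illustrative, not the application's actual code:

```python
import os

from dotenv import load_dotenv

# Read variables from the .env file in the project root.
load_dotenv()

provider = os.getenv("AI_PROVIDER", "gemini")
key_name = "GEMINI_API_KEY" if provider == "gemini" else "ANTHROPIC_API_KEY"
api_key = os.getenv(key_name)
if not api_key:
    raise SystemExit(f"{key_name} not set in .env")
```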
`models.py` - Customize AI behavior:

```python
@dataclass(frozen=True)
class AIConfig:
    categories: tuple[str, ...] = (...)  # Article categories
    default_category: str = "Other"
    claude_model: str = "claude-3-5-haiku-20241022"
    gemini_model: str = "gemini-2.5-flash"
    max_tokens: int = 150
    rate_limit_delay: float = 3.5  # Seconds between API calls
```

`models.py` - Customize scraping behavior:
```python
@dataclass(frozen=True)
class ScraperConfig:
    default_article_limit: int = 10
    high_volume_limit: int = 15
    description_max_length: int = 200
    inter_source_delay_seconds: float = 1.0
    http_timeout_seconds: int = 10
```
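
Both configs are frozen dataclasses, so they cannot be mutated in place; the idiomatic way to tweak a setting is to build a modified copy with `dataclasses.replace`. A minimal sketch, assuming the defaults shown above:

```python
from dataclasses import replace

from models import AIConfig, ScraperConfig

# The configs are frozen, so create modified copies rather than mutating.
slow_ai = replace(AIConfig(), rate_limit_delay=5.0)  # extra rate-limit headroom
patient_scraper = replace(ScraperConfig(), http_timeout_seconds=30)
```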
## Usage

Run with default settings (uses the AI provider from `.env`):

```bash
python main.py
```

Using make:

```bash
# Run with AI categorization (default provider from .env)
make run

# Quick scrape without AI processing (fast)
make scrape

# Run with specific AI provider
make run-gemini
make run-claude

# See all available commands
make help
```

## Command-Line Options

```bash
python main.py [OPTIONS]
```

| Option | Description |
|---|---|
| `--provider {claude,gemini}` | Override AI provider from `.env` file |
| `--output FILENAME` | Specify custom CSV output filename |
| `--no-display` | Skip terminal output, only generate CSV |
| `--scrape-only` | Scrape articles without AI processing (fast mode) |
| `--deduplicate` | Remove duplicate articles by URL |
| `--help` | Show help message and exit |
Use a specific AI provider:

```bash
python main.py --provider gemini
```

Custom output filename:

```bash
python main.py --output weekly_digest.csv
```

Scrape only (no AI, fastest):

```bash
python main.py --scrape-only
```

Silent mode with custom output:

```bash
python main.py --no-display --output silent_run.csv
```

Full-featured run:

```bash
python main.py --provider gemini --deduplicate --output curated_news.csv
```

## Makefile Commands

```bash
make install   # Install Python dependencies
make format    # Format code with Black
make lint      # Check code with Ruff linter
make lint-fix  # Auto-fix linting issues
make check     # Run format + lint
make clean     # Remove __pycache__ and cache files
```

```bash
make run        # Run with AI (default provider)
make scrape     # Scrape without AI processing
make run-gemini # Force Gemini AI provider
make run-claude # Force Claude AI provider
```

## Output Format

Articles are organized by category and displayed with:
```
================================================================================
CATEGORIZED TECH NEWS
================================================================================

--------------------------------------------------------------------------------
AI/Machine Learning (5 articles)
--------------------------------------------------------------------------------

How LLMs Learn from the Internet: The Training Process
Source: ByteByteGo
URL: https://blog.bytebytego.com/p/how-llms-learn-from-the-internet
Summary: This article explores the complete LLM training pipeline, from
raw data collection to fine-tuning conversational models.

[Additional articles...]

--------------------------------------------------------------------------------
Cloud/DevOps/Infrastructure (8 articles)
--------------------------------------------------------------------------------

[Articles...]
```
The generated CSV file includes the following columns:
| Column | Description |
|---|---|
| `Category` | AI-assigned category |
| `Title` | Article title |
| `Source` | News source name |
| `URL` | Full article URL |
| `Summary` | AI-generated 1-2 sentence summary |
| `Original Description` | Raw description from source |
Filename format: `tech_news_YYYYMMDD_HHMMSS.csv`

Example: `tech_news_20251208_143022.csv`
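
A minimal sketch of how rows with these columns and the timestamped filename might be produced (the project's actual export logic lives in `main.py`; the sample article dict here is illustrative):

```python
import csv
from datetime import datetime

# Hypothetical processed article; the real pipeline builds these from scraping + AI.
articles = [
    {
        "Category": "AI/Machine Learning",
        "Title": "How LLMs Learn from the Internet: The Training Process",
        "Source": "ByteByteGo",
        "URL": "https://blog.bytebytego.com/p/how-llms-learn-from-the-internet",
        "Summary": "Explores the LLM training pipeline end to end.",
        "Original Description": "Raw description from the source feed...",
    }
]

filename = f"tech_news_{datetime.now():%Y%m%d_%H%M%S}.csv"
with open(filename, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=list(articles[0].keys()))
    writer.writeheader()
    writer.writerows(articles)
```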
## Architecture

```
tech-news-aggregator/
├── main.py            # Application entry point and CLI
├── scrapers.py        # Web scraping implementations
├── ai_categorizer.py  # AI categorization and summarization
├── models.py          # Data models and configurations
├── requirements.txt   # Python dependencies
├── pyproject.toml     # Black and Ruff configuration
├── Makefile           # Development and run commands
├── .env.example       # Environment variable template (.env itself is not in git)
├── .gitignore         # Git ignore patterns
└── README.md          # Project documentation
```
### main.py
- Argument parsing and validation
- Pipeline orchestration (scraping, categorization, display, export)
- Error handling and user feedback
### scrapers.py

- RSS feed parsing for ByteByteGo, InfoQ, Last Week in AWS, and The Pragmatic Engineer (see the sketch below)
- HTML scraping for Hacker News
- Source-specific article extraction logic
- Deduplication functionality
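
A minimal sketch of the RSS path using `feedparser` (a project dependency). The helper name and feed URL are illustrative; the entry fields (`title`, `link`, `summary`) are feedparser's standard attributes:

```python
import feedparser

def fetch_rss_articles(feed_url: str, source: str, limit: int = 10) -> list[dict]:
    """Parse an RSS feed into simple article dicts (sketch, not the project code)."""
    feed = feedparser.parse(feed_url)
    articles = []
    for entry in feed.entries[:limit]:
        articles.append({
            "title": entry.get("title", ""),
            "url": entry.get("link", ""),
            "description": entry.get("summary", "")[:200],  # cf. description_max_length
            "source": source,
        })
    return articles

# Example (placeholder URL):
# fetch_rss_articles("https://example.com/feed.xml", "ByteByteGo", limit=10)
```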
### ai_categorizer.py

- Batch prompt generation (3 articles per call; see the sketch below)
- API integration for Claude and Gemini
- Response parsing and validation
- Category mapping and fallback handling
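
A minimal sketch of batch prompt assembly; the prompt wording and response format here are assumptions, not the project's actual prompt:

```python
def build_batch_prompt(articles: list[dict], categories: tuple[str, ...]) -> str:
    """Combine up to 3 articles into one categorize-and-summarize prompt (assumed format)."""
    lines = [
        "For each numbered article, choose one category from: " + ", ".join(categories),
        "and write a 1-2 sentence summary. Answer as 'N. category | summary'.",
        "",
    ]
    for i, article in enumerate(articles, start=1):
        lines.append(f"{i}. Title: {article['title']}")
        lines.append(f"   Description: {article['description']}")
    return "\n".join(lines)

# Batching 60 articles 3 at a time yields ~20 prompts, hence ~20 API calls.
all_articles = [{"title": f"Article {n}", "description": "..."} for n in range(60)]
batches = [all_articles[i:i + 3] for i in range(0, len(all_articles), 3)]
prompts = [build_batch_prompt(batch, ("AI/Machine Learning", "Other")) for batch in batches]
```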
### models.py

- Immutable `Article` dataclass
- `AIConfig` for AI behavior
- `ScraperConfig` for scraping behavior
- Default configuration constants
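
The `Article` fields are not shown in this README; a plausible sketch, inferred from the CSV columns above:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Article:
    """Assumed shape of the immutable Article model, inferred from the CSV columns."""
    title: str
    source: str
    url: str
    description: str
    category: str = "Other"
    summary: str = ""
```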
### Data Flow

1. Scrape Sources → Raw Articles
2. Optional: Deduplicate by URL (see the sketch after this list)
3. Batch Articles (3 per batch)
4. AI Processing → Category + Summary
5. Display in Terminal
6. Export to CSV
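
Step 2's URL-based deduplication can be as simple as keeping the first article seen for each URL. A minimal sketch, assuming the `Article` shape sketched earlier:

```python
def deduplicate_by_url(articles: list) -> list:
    """Keep the first occurrence of each URL, preserving scrape order."""
    seen: set[str] = set()
    unique = []
    for article in articles:
        if article.url not in seen:
            seen.add(article.url)
            unique.append(article)
    return unique
```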
## Performance

| Mode | Articles | API Calls | Duration |
|---|---|---|---|
| Scrape only | ~60 | 0 | ~10 seconds |
| With AI (batch) | ~60 | ~20 | ~70-90 seconds |
| With AI (unbatched) | ~60 | 60 | ~4-5 minutes |
Gemini Free Tier:
- 20 requests per minute, i.e. at least 3 seconds per request (60 s ÷ 20)
- The 3.5-second delay between batches keeps usage at roughly 17 requests/minute, comfortably within the limit
Claude API:
- Higher rate limits (tier-dependent)
- Faster response times
- Requires paid account
- Batch Processing: Reduced API calls by 67% (60 → 20 calls)
- Smart Delays: Calculated to stay within rate limits (see the sketch after this list)
- Parallel Scraping: Could be added for faster collection
- Caching: Could cache results for repeated runs
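
A minimal sketch of the delay arithmetic and throttling loop; the helper is illustrative, assuming a fixed sleep like the `rate_limit_delay` setting in `AIConfig`:

```python
import time

REQUESTS_PER_MINUTE = 20                 # Gemini free tier
MIN_INTERVAL = 60 / REQUESTS_PER_MINUTE  # 3.0 seconds minimum between calls
RATE_LIMIT_DELAY = 3.5                   # configured value; adds headroom

def call_with_throttle(batches, call_api):
    """Call the API once per batch, sleeping between calls (illustrative helper)."""
    results = []
    for i, batch in enumerate(batches):
        results.append(call_api(batch))
        if i < len(batches) - 1:          # no need to sleep after the last call
            time.sleep(RATE_LIMIT_DELAY)
    return results
```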
"No articles found" error:
- Check internet connection
- Verify news source is accessible
- Website structure may have changed (requires code update)
"API key not found" error:
- Ensure the `.env` file exists in the project root
- Verify the API key is set correctly: `GEMINI_API_KEY=...`
- Check that `AI_PROVIDER` matches your configured key
"ResourceExhausted" / Rate limit errors:
- Gemini free tier: 20 requests/minute
- Wait 60 seconds and retry
- Batch processing should prevent this (already implemented)
"ModuleNotFoundError":
- Activate the virtual environment: `source venv/bin/activate`
- Install dependencies: `pip install -r requirements.txt`
Empty summaries or "Other" category:
- The configured AI model may be outdated (check `models.py`)
- API response parsing might be failing
- Check the console for error messages
Add verbose output by temporarily modifying the code:

```python
# In ai_categorizer.py, uncomment debug lines
print(f"API Response: {response_text}")
```

## Contributing

Contributions are welcome! Here's how to contribute:
- Fork the repository
- Create a feature branch: `git checkout -b feature/your-feature-name`
- Make your changes
- Run code quality checks: `make check`
- Commit your changes: `git commit -m "Add your feature"`
- Push to your fork: `git push origin feature/your-feature-name`
- Open a Pull Request
Code standards:

- Formatting: Black (line length: 100)
- Linting: Ruff with strict settings
- Type Hints: Required for all functions
- Docstrings: Use for complex functions
- Testing: Add tests for new features (future enhancement)
```bash
make check  # Runs Black formatting and Ruff linting
```

Ideas for contributions:

- Add new news sources
- Improve AI prompt engineering
- Add unit tests
- Create web interface
- Implement database storage
- Add email digest functionality
- Enhance error handling
- Optimize scraping speed
Dependencies:

- `requests` - HTTP requests for scraping
- `beautifulsoup4` - HTML parsing
- `feedparser` - RSS feed parsing
- `anthropic` - Claude AI API
- `google-generativeai` - Gemini AI API
- `python-dotenv` - Environment variable management

Development dependencies:

- `black` - Code formatting
- `ruff` - Fast Python linter

See `requirements.txt` for the complete list with versions.
## License

This project is provided as-is for educational and personal use.
Important: When using this tool:
- Respect the terms of service of scraped websites
- Do not overload servers with requests
- Comply with API usage policies
- Attribute sources appropriately
Disclaimer: This tool is for personal news aggregation. Users are responsible for ensuring their use complies with applicable laws and website terms of service.
Acknowledgments:

- News sources for providing valuable content
- Anthropic and Google for AI API access
- Open source community for excellent Python libraries
Star this repository if you find it useful!