Tech News Aggregator

A Python-based news aggregation tool that automatically collects, categorizes, and summarizes technology articles from leading industry sources using AI-powered analysis.

Overview

This application streamlines the process of staying current with technology news by:

  • Automated Collection: Scrapes articles from multiple curated tech news sources
  • AI Categorization: Automatically classifies articles into relevant technical domains
  • Intelligent Summarization: Generates concise summaries using AI models (Claude or Gemini)
  • Structured Export: Outputs organized data in CSV format for analysis or archival
  • Batch Processing: Optimized API calls to work within free-tier rate limits

Features

Core Functionality

  • Multi-Source Scraping: Aggregates articles from five curated tech news sources
  • Dual AI Provider Support: Compatible with both Anthropic Claude and Google Gemini APIs
  • Smart Batch Processing: Processes 3 articles per API call to optimize rate limits
  • Flexible Output Options: Terminal display and/or CSV export
  • Deduplication: Optional URL-based duplicate removal
  • Rate Limiting: Built-in delays to respect source website policies
  • Error Handling: Graceful failure handling with detailed error messages

Technical Features

  • Functional Programming: Clean, testable code with pure functions
  • Type Hints: Full type annotations for better code clarity
  • Dataclasses: Immutable data structures for articles and configuration
  • Configurable: Customizable scraping and AI settings
  • No Database Required: Stateless operation with CSV output

News Sources

The aggregator collects articles from the following sources:

Source                 | Focus Area                               | Article Limit
ByteByteGo             | System design, engineering fundamentals  | 10
InfoQ                  | Software development, enterprise tech    | 15
Hacker News            | Technology, startups, programming        | 15
Last Week in AWS       | AWS services, cloud infrastructure       | 10
The Pragmatic Engineer | Software engineering, career insights    | 10

Total: Approximately 60 articles per run

Article Categories

Articles are automatically classified into one of the following categories:

  • AI/Machine Learning - LLMs, neural networks, ML infrastructure
  • Algorithms & Data Structures - Computational theory, optimization
  • Cloud/DevOps/Infrastructure - AWS, Kubernetes, containerization
  • Software Architecture - System design, architectural patterns
  • Programming Languages - Language features, new releases
  • Databases - SQL, NoSQL, data storage systems
  • Security - Application security, vulnerabilities, best practices
  • Web Development - Frontend, backend, frameworks
  • Mobile Development - iOS, Android, cross-platform
  • Career/Leadership - Engineering management, career growth
  • General Tech News - Industry news, company updates
  • Other - Miscellaneous or uncategorizable content
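
These labels correspond to the categories tuple in AIConfig (see models.py). A rough sketch of how that tuple might look, assuming the strings match the list above; the repository's exact values may differ:

# Hypothetical illustration of AIConfig.categories; the exact strings in models.py may differ.
CATEGORIES: tuple[str, ...] = (
    "AI/Machine Learning",
    "Algorithms & Data Structures",
    "Cloud/DevOps/Infrastructure",
    "Software Architecture",
    "Programming Languages",
    "Databases",
    "Security",
    "Web Development",
    "Mobile Development",
    "Career/Leadership",
    "General Tech News",
    "Other",
)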

Installation

Prerequisites

  • Python 3.10 or higher
  • pip (Python package manager)
  • Internet connection for scraping and API calls

Setup Steps

  1. Clone the repository:

git clone https://github.com/yourusername/tech-news-aggregator.git
cd tech-news-aggregator

  2. Create and activate a virtual environment:

# Linux/macOS
python3 -m venv venv
source venv/bin/activate

# Windows
python -m venv venv
venv\Scripts\activate

  3. Install dependencies:

pip install -r requirements.txt

Or using make:

make install

Configuration

Environment Variables

Create a .env file in the project root:

# Choose your AI provider (gemini or claude)
AI_PROVIDER=gemini

# Google Gemini API Key (free tier: 20 requests/minute)
GEMINI_API_KEY=your_gemini_api_key_here

# Anthropic Claude API Key (optional, only if using Claude)
ANTHROPIC_API_KEY=your_claude_api_key_here
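
A minimal sketch of how these variables are typically loaded with python-dotenv (a listed dependency); the variable names match the example above, but the project's actual loading code may differ:

# Hypothetical config-loading sketch; the project's actual code may differ.
import os

from dotenv import load_dotenv

load_dotenv()  # read key/value pairs from .env into the process environment

AI_PROVIDER = os.getenv("AI_PROVIDER", "gemini")
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")

if AI_PROVIDER == "gemini" and not GEMINI_API_KEY:
    raise SystemExit("GEMINI_API_KEY is not set; add it to your .env file")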

Obtaining API Keys

Google Gemini (Recommended for free tier):

  1. Visit https://aistudio.google.com/app/apikey
  2. Sign in with your Google account
  3. Click "Create API Key"
  4. Copy the key to your .env file

Anthropic Claude:

  1. Visit https://console.anthropic.com/
  2. Sign up or log in
  3. Navigate to API Keys section
  4. Generate a new key
  5. Copy the key to your .env file

Configuration Files

models.py - Customize AI behavior:

@dataclass(frozen=True)
class AIConfig:
    categories: tuple[str, ...] = (...)  # Article categories
    default_category: str = "Other"
    claude_model: str = "claude-3-5-haiku-20241022"
    gemini_model: str = "gemini-2.5-flash"
    max_tokens: int = 150
    rate_limit_delay: float = 3.5  # Seconds between API calls

models.py - Customize scraping behavior:

@dataclass(frozen=True)
class ScraperConfig:
    default_article_limit: int = 10
    high_volume_limit: int = 15
    description_max_length: int = 200
    inter_source_delay_seconds: float = 1.0
    http_timeout_seconds: int = 10
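
Because both configs are frozen dataclasses, they are not modified in place; an adjusted copy can be created with dataclasses.replace. A minimal illustration, not project-specific code:

# Illustrative only: frozen dataclasses are tuned by creating a modified copy.
from dataclasses import replace

default_config = ScraperConfig()
slow_config = replace(default_config, inter_source_delay_seconds=2.0)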

Usage

Basic Usage

Run with default settings (uses AI provider from .env):

python main.py

Using Makefile

# Run with AI categorization (default provider from .env)
make run

# Quick scrape without AI processing (fast)
make scrape

# Run with specific AI provider
make run-gemini
make run-claude

# See all available commands
make help

Command-Line Options

python main.py [OPTIONS]

Available Options

Option                     | Description
--provider {claude,gemini} | Override AI provider from .env file
--output FILENAME          | Specify custom CSV output filename
--no-display               | Skip terminal output, only generate CSV
--scrape-only              | Scrape articles without AI processing (fast mode)
--deduplicate              | Remove duplicate articles by URL
--help                     | Show help message and exit

Usage Examples

Use specific AI provider:

python main.py --provider gemini

Custom output filename:

python main.py --output weekly_digest.csv

Scrape only (no AI, fastest):

python main.py --scrape-only

Silent mode with custom output:

python main.py --no-display --output silent_run.csv

Full featured run:

python main.py --provider gemini --deduplicate --output curated_news.csv

Makefile Commands

Development Commands

make install     # Install Python dependencies
make format      # Format code with Black
make lint        # Check code with Ruff linter
make lint-fix    # Auto-fix linting issues
make check       # Run format + lint
make clean       # Remove __pycache__ and cache files

Run Commands

make run              # Run with AI (default provider)
make scrape           # Scrape without AI processing
make run-gemini       # Force Gemini AI provider
make run-claude       # Force Claude AI provider

Output Format

Terminal Display

Articles are organized by category and displayed with:

================================================================================
CATEGORIZED TECH NEWS
================================================================================

--------------------------------------------------------------------------------
AI/Machine Learning (5 articles)
--------------------------------------------------------------------------------

   How LLMs Learn from the Internet: The Training Process
     Source: ByteByteGo
     URL: https://blog.bytebytego.com/p/how-llms-learn-from-the-internet
     Summary: This article explores the complete LLM training pipeline, from
     raw data collection to fine-tuning conversational models.

   [Additional articles...]

--------------------------------------------------------------------------------
Cloud/DevOps/Infrastructure (8 articles)
--------------------------------------------------------------------------------

   [Articles...]

CSV Export

Generated CSV file includes the following columns:

Column               | Description
Category             | AI-assigned category
Title                | Article title
Source               | News source name
URL                  | Full article URL
Summary              | AI-generated 1-2 sentence summary
Original Description | Raw description from source

Filename format: tech_news_YYYYMMDD_HHMMSS.csv

Example: tech_news_20251208_143022.csv
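
A minimal sketch of how such a file could be written with Python's standard csv module; the column names follow the table above, while the function and variable names are hypothetical:

# Hypothetical CSV export sketch; columns follow the table above,
# names are illustrative rather than the project's exact API.
import csv
from datetime import datetime

FIELDNAMES = ["Category", "Title", "Source", "URL", "Summary", "Original Description"]

def export_to_csv(rows: list[dict[str, str]], filename: str | None = None) -> str:
    if filename is None:
        filename = f"tech_news_{datetime.now():%Y%m%d_%H%M%S}.csv"
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)
    return filename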

Architecture

Project Structure

tech-news-aggregator/
├── main.py                 # Application entry point and CLI
├── scrapers.py             # Web scraping implementations
├── ai_categorizer.py       # AI categorization and summarization
├── models.py               # Data models and configurations
├── requirements.txt        # Python dependencies
├── pyproject.toml          # Black and Ruff configuration
├── Makefile                # Development and run commands
├── .env.example            # Environment variable template (copy to .env)
├── .gitignore             # Git ignore patterns
└── README.md              # Project documentation

Core Components

main.py

  • Argument parsing and validation
  • Pipeline orchestration (scraping, categorization, display, export)
  • Error handling and user feedback
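
A rough sketch of the argument parsing described above, using argparse; the flag names mirror the Command-Line Options table, but the real main.py may be structured differently:

# Hypothetical argparse sketch; flags mirror the Command-Line Options table.
import argparse

def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Tech News Aggregator")
    parser.add_argument("--provider", choices=["claude", "gemini"],
                        help="Override AI provider from .env file")
    parser.add_argument("--output", metavar="FILENAME",
                        help="Custom CSV output filename")
    parser.add_argument("--no-display", action="store_true",
                        help="Skip terminal output, only generate CSV")
    parser.add_argument("--scrape-only", action="store_true",
                        help="Scrape articles without AI processing")
    parser.add_argument("--deduplicate", action="store_true",
                        help="Remove duplicate articles by URL")
    return parser.parse_args()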

scrapers.py

  • RSS feed parsing for ByteByteGo, InfoQ, AWS, Pragmatic Engineer
  • HTML scraping for Hacker News
  • Source-specific article extraction logic
  • Deduplication functionality
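
As a rough illustration of the RSS path, a feed can be read with feedparser; the limit and description trimming below are assumptions rather than the project's exact logic:

# Hypothetical RSS scraping sketch using feedparser (a listed dependency).
# The limit and trimming shown here are assumptions.
import feedparser

def scrape_rss(feed_url: str, source_name: str, limit: int = 10,
               description_max_length: int = 200) -> list[dict[str, str]]:
    feed = feedparser.parse(feed_url)
    articles = []
    for entry in feed.entries[:limit]:
        articles.append({
            "title": entry.get("title", ""),
            "url": entry.get("link", ""),
            "source": source_name,
            "description": entry.get("summary", "")[:description_max_length],
        })
    return articles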

ai_categorizer.py

  • Batch prompt generation (3 articles per call)
  • API integration for Claude and Gemini
  • Response parsing and validation
  • Category mapping and fallback handling
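
A minimal sketch of the batching idea (three articles per prompt); the prompt wording and helper names are illustrative, not the project's exact implementation:

# Hypothetical batching sketch: group articles in threes, build one prompt per group.
def batch(items: list, size: int = 3) -> list[list]:
    return [items[i:i + size] for i in range(0, len(items), size)]

def build_prompt(articles: list[dict[str, str]], categories: tuple[str, ...]) -> str:
    lines = [f"Classify each article into one of: {', '.join(categories)}.",
             "Return a category and a 1-2 sentence summary per article."]
    for i, article in enumerate(articles, start=1):
        lines.append(f"{i}. {article['title']} - {article['description']}")
    return "\n".join(lines)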

models.py

  • Immutable Article dataclass
  • AIConfig for AI behavior
  • ScraperConfig for scraping behavior
  • Default configuration constants
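
An illustrative version of the immutable Article dataclass; the field names are inferred from the CSV columns and may not match the repository exactly:

# Illustrative Article dataclass; fields inferred from the CSV columns,
# may differ from the repository's actual model.
from dataclasses import dataclass

@dataclass(frozen=True)
class Article:
    title: str
    url: str
    source: str
    description: str
    category: str = "Other"
    summary: str = ""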

Data Flow

1. Scrape Sources → Raw Articles
2. Optional: Deduplicate by URL
3. Batch Articles (3 per batch)
4. AI Processing → Category + Summary
5. Display in Terminal
6. Export to CSV
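
Step 2, the optional URL deduplication, can be as simple as keeping the first article seen per URL; a minimal sketch under that assumption:

# Minimal URL-based deduplication sketch: keep the first article seen per URL.
def deduplicate(articles: list[dict[str, str]]) -> list[dict[str, str]]:
    seen: set[str] = set()
    unique = []
    for article in articles:
        if article["url"] not in seen:
            seen.add(article["url"])
            unique.append(article)
    return unique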

Performance

Execution Times

Mode            | Articles | API Calls | Duration
Scrape only     | ~60      | 0         | ~10 seconds
With AI (batch) | ~60      | ~20       | ~70-90 seconds
With AI (old)   | ~60      | 60        | ~4-5 minutes

Rate Limits

Gemini Free Tier:

  • 20 requests per minute
  • Batch processing fits perfectly within limits
  • 3.5 second delay between batches
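
The arithmetic behind those numbers: roughly 60 articles in batches of 3 means about 20 API calls, and a 3.5-second delay keeps the call rate near 17 per minute, under the 20-per-minute cap. A small sanity check using the documented defaults:

# Sanity check of the documented numbers: batches of 3, 3.5 s delay, 20 req/min cap.
import math

articles = 60
batch_size = 3
rate_limit_delay = 3.5  # seconds between API calls (AIConfig default)

api_calls = math.ceil(articles / batch_size)        # 20 calls
calls_per_minute = 60 / rate_limit_delay            # ~17.1, below the 20/min cap
total_delay_seconds = api_calls * rate_limit_delay  # ~70 s, matching the table above
print(api_calls, round(calls_per_minute, 1), total_delay_seconds)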

Claude API:

  • Higher rate limits (tier-dependent)
  • Faster response times
  • Requires paid account

Optimization Strategies

  1. Batch Processing: Reduced API calls by 67% (60 → 20 calls)
  2. Smart Delays: Calculated to stay within rate limits
  3. Parallel Scraping: Could be added for faster collection
  4. Caching: Could cache results for repeated runs

Troubleshooting

Common Issues

"No articles found" error:

  • Check internet connection
  • Verify news source is accessible
  • Website structure may have changed (requires code update)

"API key not found" error:

  • Ensure .env file exists in project root
  • Verify API key is set correctly: GEMINI_API_KEY=...
  • Check AI_PROVIDER matches your configured key

"ResourceExhausted" / Rate limit errors:

  • Gemini free tier: 20 requests/minute
  • Wait 60 seconds and retry
  • Batch processing should prevent this (already implemented)

"ModuleNotFoundError":

  • Activate virtual environment: source venv/bin/activate
  • Install dependencies: pip install -r requirements.txt

Empty summaries or "Other" category:

  • AI model may be outdated (check models.py)
  • API response parsing might be failing
  • Check console for error messages

Debug Mode

Add verbose output by modifying the code temporarily:

# In ai_categorizer.py, uncomment debug lines
print(f"API Response: {response_text}")

Contributing

Contributions are welcome! Here's how to contribute:

Development Setup

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/your-feature-name
  3. Make your changes
  4. Run code quality checks: make check
  5. Commit your changes: git commit -m "Add your feature"
  6. Push to your fork: git push origin feature/your-feature-name
  7. Open a Pull Request

Code Standards

  • Formatting: Black (line length: 100)
  • Linting: Ruff with strict settings
  • Type Hints: Required for all functions
  • Docstrings: Use for complex functions
  • Testing: Add tests for new features (future enhancement)

Before Committing

make check  # Runs Black formatting and Ruff linting

Areas for Contribution

  • Add new news sources
  • Improve AI prompt engineering
  • Add unit tests
  • Create web interface
  • Implement database storage
  • Add email digest functionality
  • Enhance error handling
  • Optimize scraping speed

Dependencies

Core Dependencies

  • requests - HTTP requests for scraping
  • beautifulsoup4 - HTML parsing
  • feedparser - RSS feed parsing
  • anthropic - Claude AI API
  • google-generativeai - Gemini AI API
  • python-dotenv - Environment variable management

Development Dependencies

  • black - Code formatting
  • ruff - Fast Python linter

See requirements.txt for complete list with versions.

License

This project is provided as-is for educational and personal use.

Important: When using this tool:

  • Respect the terms of service of scraped websites
  • Do not overload servers with requests
  • Comply with API usage policies
  • Attribute sources appropriately

Disclaimer: This tool is for personal news aggregation. Users are responsible for ensuring their use complies with applicable laws and website terms of service.

Acknowledgments

  • News sources for providing valuable content
  • Anthropic and Google for AI API access
  • Open source community for excellent Python libraries

Star this repository if you find it useful!
