Tech News Aggregator

A Python-based news aggregation tool that automatically collects, categorizes, and summarizes technology articles from leading industry sources using AI-powered analysis.

Overview

This application streamlines the process of staying current with technology news by:

  • Automated Collection: Scrapes articles from multiple curated tech news sources
  • AI Categorization: Automatically classifies articles into relevant technical domains
  • Intelligent Summarization: Generates concise summaries using AI models (Claude or Gemini)
  • Structured Export: Outputs organized data in CSV format for analysis or archival
  • Batch Processing: Optimized API calls to work within free-tier rate limits

Features

Core Functionality

  • Multi-Source Scraping: Aggregates articles from five curated tech news sources
  • Dual AI Provider Support: Compatible with both Anthropic Claude and Google Gemini APIs
  • Smart Batch Processing: Processes 3 articles per API call to optimize rate limits
  • Flexible Output Options: Terminal display and/or CSV export
  • Deduplication: Optional URL-based duplicate removal
  • Rate Limiting: Built-in delays to respect source website policies
  • Error Handling: Graceful failure handling with detailed error messages

Technical Features

  • Functional Programming: Clean, testable code with pure functions
  • Type Hints: Full type annotations for better code clarity
  • Dataclasses: Immutable data structures for articles and configuration
  • Configurable: Customizable scraping and AI settings
  • No Database Required: Stateless operation with CSV output

News Sources

The aggregator collects articles from the following sources:

Source                 | Focus Area                               | Article Limit
ByteByteGo             | System design, engineering fundamentals  | 10
InfoQ                  | Software development, enterprise tech    | 15
Hacker News            | Technology, startups, programming        | 15
Last Week in AWS       | AWS services, cloud infrastructure       | 10
The Pragmatic Engineer | Software engineering, career insights    | 10

Total: Approximately 60 articles per run

Article Categories

Articles are automatically classified into one of the following categories:

  • AI/Machine Learning - LLMs, neural networks, ML infrastructure
  • Algorithms & Data Structures - Computational theory, optimization
  • Cloud/DevOps/Infrastructure - AWS, Kubernetes, containerization
  • Software Architecture - System design, architectural patterns
  • Programming Languages - Language features, new releases
  • Databases - SQL, NoSQL, data storage systems
  • Security - Application security, vulnerabilities, best practices
  • Web Development - Frontend, backend, frameworks
  • Mobile Development - iOS, Android, cross-platform
  • Career/Leadership - Engineering management, career growth
  • General Tech News - Industry news, company updates
  • Other - Miscellaneous or uncategorizable content
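
These labels correspond to the categories tuple in AIConfig (see models.py). A rough sketch of how that tuple might look, assuming the strings match the list above; the repository's exact values may differ:

# Hypothetical illustration of AIConfig.categories; the exact strings in models.py may differ.
CATEGORIES: tuple[str, ...] = (
    "AI/Machine Learning",
    "Algorithms & Data Structures",
    "Cloud/DevOps/Infrastructure",
    "Software Architecture",
    "Programming Languages",
    "Databases",
    "Security",
    "Web Development",
    "Mobile Development",
    "Career/Leadership",
    "General Tech News",
    "Other",
)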

Installation

Prerequisites

  • Python 3.10 or higher
  • pip (Python package manager)
  • Internet connection for scraping and API calls

Setup Steps

  1. Clone the repository:

git clone https://github.com/yourusername/tech-news-aggregator.git
cd tech-news-aggregator

  2. Create and activate a virtual environment:

# Linux/macOS
python3 -m venv venv
source venv/bin/activate

# Windows
python -m venv venv
venv\Scripts\activate

  3. Install dependencies:

pip install -r requirements.txt

Or using make:

make install

Configuration

Environment Variables

Create a .env file in the project root:

# Choose your AI provider (gemini or claude)
AI_PROVIDER=gemini

# Google Gemini API Key (free tier: 20 requests/minute)
GEMINI_API_KEY=your_gemini_api_key_here

# Anthropic Claude API Key (optional, only if using Claude)
ANTHROPIC_API_KEY=your_claude_api_key_here
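
A minimal sketch of how these variables are typically loaded with python-dotenv (a listed dependency); the variable names match the example above, but the project's actual loading code may differ:

# Hypothetical config-loading sketch; the project's actual code may differ.
import os

from dotenv import load_dotenv

load_dotenv()  # read key/value pairs from .env into the process environment

AI_PROVIDER = os.getenv("AI_PROVIDER", "gemini")
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")
ANTHROPIC_API_KEY = os.getenv("ANTHROPIC_API_KEY")

if AI_PROVIDER == "gemini" and not GEMINI_API_KEY:
    raise SystemExit("GEMINI_API_KEY is not set; add it to your .env file")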

Obtaining API Keys

Google Gemini (Recommended for free tier):

  1. Visit https://aistudio.google.com/app/apikey
  2. Sign in with your Google account
  3. Click "Create API Key"
  4. Copy the key to your .env file

Anthropic Claude:

  1. Visit https://console.anthropic.com/
  2. Sign up or log in
  3. Navigate to API Keys section
  4. Generate a new key
  5. Copy the key to your .env file

Configuration Files

models.py - Customize AI behavior:

@dataclass(frozen=True)
class AIConfig:
    categories: tuple[str, ...] = (...)  # Article categories
    default_category: str = "Other"
    claude_model: str = "claude-3-5-haiku-20241022"
    gemini_model: str = "gemini-2.5-flash"
    max_tokens: int = 150
    rate_limit_delay: float = 3.5  # Seconds between API calls

models.py - Customize scraping behavior:

@dataclass(frozen=True)
class ScraperConfig:
    default_article_limit: int = 10
    high_volume_limit: int = 15
    description_max_length: int = 200
    inter_source_delay_seconds: float = 1.0
    http_timeout_seconds: int = 10
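
Because both configs are frozen dataclasses, they are not modified in place; an adjusted copy can be created with dataclasses.replace. A minimal illustration, not project-specific code:

# Illustrative only: frozen dataclasses are tuned by creating a modified copy.
from dataclasses import replace

default_config = ScraperConfig()
slow_config = replace(default_config, inter_source_delay_seconds=2.0)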

Usage

Basic Usage

Run with default settings (uses AI provider from .env):

python main.py

Using Makefile

# Run with AI categorization (default provider from .env)
make run

# Quick scrape without AI processing (fast)
make scrape

# Run with specific AI provider
make run-gemini
make run-claude

# See all available commands
make help

Command-Line Options

python main.py [OPTIONS]

Available Options

Option                     | Description
--provider {claude,gemini} | Override AI provider from .env file
--output FILENAME          | Specify custom CSV output filename
--no-display               | Skip terminal output, only generate CSV
--scrape-only              | Scrape articles without AI processing (fast mode)
--deduplicate              | Remove duplicate articles by URL
--help                     | Show help message and exit

Usage Examples

Use specific AI provider:

python main.py --provider gemini

Custom output filename:

python main.py --output weekly_digest.csv

Scrape only (no AI, fastest):

python main.py --scrape-only

Silent mode with custom output:

python main.py --no-display --output silent_run.csv

Full featured run:

python main.py --provider gemini --deduplicate --output curated_news.csv

Makefile Commands

Development Commands

make install     # Install Python dependencies
make format      # Format code with Black
make lint        # Check code with Ruff linter
make lint-fix    # Auto-fix linting issues
make check       # Run format + lint
make clean       # Remove __pycache__ and cache files

Run Commands

make run              # Run with AI (default provider)
make scrape           # Scrape without AI processing
make run-gemini       # Force Gemini AI provider
make run-claude       # Force Claude AI provider

Output Format

Terminal Display

Articles are organized by category and displayed with:

================================================================================
CATEGORIZED TECH NEWS
================================================================================

--------------------------------------------------------------------------------
AI/Machine Learning (5 articles)
--------------------------------------------------------------------------------

   How LLMs Learn from the Internet: The Training Process
     Source: ByteByteGo
     URL: https://blog.bytebytego.com/p/how-llms-learn-from-the-internet
     Summary: This article explores the complete LLM training pipeline, from
     raw data collection to fine-tuning conversational models.

   [Additional articles...]

--------------------------------------------------------------------------------
Cloud/DevOps/Infrastructure (8 articles)
--------------------------------------------------------------------------------

   [Articles...]

CSV Export

Generated CSV file includes the following columns:

Column               | Description
Category             | AI-assigned category
Title                | Article title
Source               | News source name
URL                  | Full article URL
Summary              | AI-generated 1-2 sentence summary
Original Description | Raw description from source

Filename format: tech_news_YYYYMMDD_HHMMSS.csv

Example: tech_news_20251208_143022.csv
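
A minimal sketch of how such a file could be written with Python's standard csv module; the column names follow the table above, while the function and variable names are hypothetical:

# Hypothetical CSV export sketch; columns follow the table above,
# names are illustrative rather than the project's exact API.
import csv
from datetime import datetime

FIELDNAMES = ["Category", "Title", "Source", "URL", "Summary", "Original Description"]

def export_to_csv(rows: list[dict[str, str]], filename: str | None = None) -> str:
    if filename is None:
        filename = f"tech_news_{datetime.now():%Y%m%d_%H%M%S}.csv"
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        writer.writeheader()
        writer.writerows(rows)
    return filename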

Architecture

Project Structure

tech-news-aggregator/
├── main.py                 # Application entry point and CLI
├── scrapers.py             # Web scraping implementations
├── ai_categorizer.py       # AI categorization and summarization
├── models.py               # Data models and configurations
├── requirements.txt        # Python dependencies
├── pyproject.toml          # Black and Ruff configuration
├── Makefile                # Development and run commands
├── .env.example            # Environment variable template (copy to .env)
├── .gitignore             # Git ignore patterns
└── README.md              # Project documentation

Core Components

main.py

  • Argument parsing and validation
  • Pipeline orchestration (scraping, categorization, display, export)
  • Error handling and user feedback
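
A rough sketch of the argument parsing described above, using argparse; the flag names mirror the Command-Line Options table, but the real main.py may be structured differently:

# Hypothetical argparse sketch; flags mirror the Command-Line Options table.
import argparse

def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Tech News Aggregator")
    parser.add_argument("--provider", choices=["claude", "gemini"],
                        help="Override AI provider from .env file")
    parser.add_argument("--output", metavar="FILENAME",
                        help="Custom CSV output filename")
    parser.add_argument("--no-display", action="store_true",
                        help="Skip terminal output, only generate CSV")
    parser.add_argument("--scrape-only", action="store_true",
                        help="Scrape articles without AI processing")
    parser.add_argument("--deduplicate", action="store_true",
                        help="Remove duplicate articles by URL")
    return parser.parse_args()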

scrapers.py

  • RSS feed parsing for ByteByteGo, InfoQ, AWS, Pragmatic Engineer
  • HTML scraping for Hacker News
  • Source-specific article extraction logic
  • Deduplication functionality
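
As a rough illustration of the RSS path, a feed can be read with feedparser; the limit and description trimming below are assumptions rather than the project's exact logic:

# Hypothetical RSS scraping sketch using feedparser (a listed dependency).
# The limit and trimming shown here are assumptions.
import feedparser

def scrape_rss(feed_url: str, source_name: str, limit: int = 10,
               description_max_length: int = 200) -> list[dict[str, str]]:
    feed = feedparser.parse(feed_url)
    articles = []
    for entry in feed.entries[:limit]:
        articles.append({
            "title": entry.get("title", ""),
            "url": entry.get("link", ""),
            "source": source_name,
            "description": entry.get("summary", "")[:description_max_length],
        })
    return articles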

ai_categorizer.py

  • Batch prompt generation (3 articles per call)
  • API integration for Claude and Gemini
  • Response parsing and validation
  • Category mapping and fallback handling
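
A minimal sketch of the batching idea (three articles per prompt); the prompt wording and helper names are illustrative, not the project's exact implementation:

# Hypothetical batching sketch: group articles in threes, build one prompt per group.
def batch(items: list, size: int = 3) -> list[list]:
    return [items[i:i + size] for i in range(0, len(items), size)]

def build_prompt(articles: list[dict[str, str]], categories: tuple[str, ...]) -> str:
    lines = [f"Classify each article into one of: {', '.join(categories)}.",
             "Return a category and a 1-2 sentence summary per article."]
    for i, article in enumerate(articles, start=1):
        lines.append(f"{i}. {article['title']} - {article['description']}")
    return "\n".join(lines)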

models.py

  • Immutable Article dataclass
  • AIConfig for AI behavior
  • ScraperConfig for scraping behavior
  • Default configuration constants
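
An illustrative version of the immutable Article dataclass; the field names are inferred from the CSV columns and may not match the repository exactly:

# Illustrative Article dataclass; fields inferred from the CSV columns,
# may differ from the repository's actual model.
from dataclasses import dataclass

@dataclass(frozen=True)
class Article:
    title: str
    url: str
    source: str
    description: str
    category: str = "Other"
    summary: str = ""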

Data Flow

1. Scrape Sources → Raw Articles
2. Optional: Deduplicate by URL
3. Batch Articles (3 per batch)
4. AI Processing → Category + Summary
5. Display in Terminal
6. Export to CSV
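
Step 2, the optional URL deduplication, can be as simple as keeping the first article seen per URL; a minimal sketch under that assumption:

# Minimal URL-based deduplication sketch: keep the first article seen per URL.
def deduplicate(articles: list[dict[str, str]]) -> list[dict[str, str]]:
    seen: set[str] = set()
    unique = []
    for article in articles:
        if article["url"] not in seen:
            seen.add(article["url"])
            unique.append(article)
    return unique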

Performance

Execution Times

Mode            | Articles | API Calls | Duration
Scrape only     | ~60      | 0         | ~10 seconds
With AI (batch) | ~60      | ~20       | ~70-90 seconds
With AI (old)   | ~60      | 60        | ~4-5 minutes

Rate Limits

Gemini Free Tier:

  • 20 requests per minute
  • Batch processing fits perfectly within limits
  • 3.5 second delay between batches
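
The arithmetic behind those numbers: roughly 60 articles in batches of 3 means about 20 API calls, and a 3.5-second delay keeps the call rate near 17 per minute, under the 20-per-minute cap. A small sanity check using the documented defaults:

# Sanity check of the documented numbers: batches of 3, 3.5 s delay, 20 req/min cap.
import math

articles = 60
batch_size = 3
rate_limit_delay = 3.5  # seconds between API calls (AIConfig default)

api_calls = math.ceil(articles / batch_size)        # 20 calls
calls_per_minute = 60 / rate_limit_delay            # ~17.1, below the 20/min cap
total_delay_seconds = api_calls * rate_limit_delay  # ~70 s, matching the table above
print(api_calls, round(calls_per_minute, 1), total_delay_seconds)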

Claude API:

  • Higher rate limits (tier-dependent)
  • Faster response times
  • Requires paid account

Optimization Strategies

  1. Batch Processing: Reduced API calls by 67% (60 → 20 calls)
  2. Smart Delays: Calculated to stay within rate limits
  3. Parallel Scraping: Could be added for faster collection
  4. Caching: Could cache results for repeated runs

Troubleshooting

Common Issues

"No articles found" error:

  • Check internet connection
  • Verify news source is accessible
  • Website structure may have changed (requires code update)

"API key not found" error:

  • Ensure .env file exists in project root
  • Verify API key is set correctly: GEMINI_API_KEY=...
  • Check AI_PROVIDER matches your configured key

"ResourceExhausted" / Rate limit errors:

  • Gemini free tier: 20 requests/minute
  • Wait 60 seconds and retry
  • Batch processing should prevent this (already implemented)

"ModuleNotFoundError":

  • Activate virtual environment: source venv/bin/activate
  • Install dependencies: pip install -r requirements.txt

Empty summaries or "Other" category:

  • AI model may be outdated (check models.py)
  • API response parsing might be failing
  • Check console for error messages

Debug Mode

Add verbose output by modifying the code temporarily:

# In ai_categorizer.py, uncomment debug lines
print(f"API Response: {response_text}")

Contributing

Contributions are welcome! Here's how to contribute:

Development Setup

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/your-feature-name
  3. Make your changes
  4. Run code quality checks: make check
  5. Commit your changes: git commit -m "Add your feature"
  6. Push to your fork: git push origin feature/your-feature-name
  7. Open a Pull Request

Code Standards

  • Formatting: Black (line length: 100)
  • Linting: Ruff with strict settings
  • Type Hints: Required for all functions
  • Docstrings: Use for complex functions
  • Testing: Add tests for new features (future enhancement)

Before Committing

make check  # Runs Black formatting and Ruff linting

Areas for Contribution

  • Add new news sources
  • Improve AI prompt engineering
  • Add unit tests
  • Create web interface
  • Implement database storage
  • Add email digest functionality
  • Enhance error handling
  • Optimize scraping speed

Dependencies

Core Dependencies

  • requests - HTTP requests for scraping
  • beautifulsoup4 - HTML parsing
  • feedparser - RSS feed parsing
  • anthropic - Claude AI API
  • google-generativeai - Gemini AI API
  • python-dotenv - Environment variable management

Development Dependencies

  • black - Code formatting
  • ruff - Fast Python linter

See requirements.txt for complete list with versions.

License

This project is provided as-is for educational and personal use.

Important: When using this tool:

  • Respect the terms of service of scraped websites
  • Do not overload servers with requests
  • Comply with API usage policies
  • Attribute sources appropriately

Disclaimer: This tool is for personal news aggregation. Users are responsible for ensuring their use complies with applicable laws and website terms of service.

Acknowledgments

  • News sources for providing valuable content
  • Anthropic and Google for AI API access
  • Open source community for excellent Python libraries

Star this repository if you find it useful!
