An automated news aggregation and posting system that collects relevant news stories from Google News, filters and summarizes them using OpenAI, and posts high-quality updates to the BlueSky social network at a controlled rate.
- YAML Configuration Management - Flexible config loading with secrets support
- Google News Fetching - Automated story collection with customizable search parameters
- Redis-based Storage - Persistent storage with Pydantic models and PostStatus tracking
- Advanced Story Filtering - Source validation, headline/URL keyword filtering with detailed status tracking
- Smart Deduplication - URL and headline-based duplicate detection with semantic framework
- Google News URL Decoding - Resolves Google News redirect URLs to original sources
- Full Article Scraping - Multi-strategy content extraction with fallback handling
- OpenAI-Powered Summarization - Intelligent story summarization with retry logic and character limits
- Relevance Checking - Content relevance validation for accepted sources
- BlueSky Integration - Automated posting with rate limiting and post history tracking
- Continuous Operation - Background runner with configurable intervals and graceful shutdown
- Comprehensive Logging - Multi-level logging with file and console output
- Utility Scripts - Redis management, story inspection, and startup scripts
- Robust Testing Suite - Complete unit test coverage for all components
- FastAPI Observability Layer - Web dashboard for monitoring and story inspection
- Enhanced NLP Features - Semantic deduplication and advanced content analysis
See specs/prd.md for complete requirements and specs/todo.md for development progress.
- Python 3.10+
- Redis (for persistent storage)
- OpenAI API key (for summarization)
- BlueSky account (for posting)
This project uses Poetry for dependency management:
-
Install Poetry (if not already installed):
curl -sSL https://install.python-poetry.org | python3 - -
Install dependencies:
poetry install
-
Configure the application:
- Edit
configs/config.yamlfor main settings (search terms, sources, etc.) - Create
configs/secrets.yamlfrom the example template for API keys
# Copy the example secrets file and customize it cp configs/secrets.yaml.example configs/secrets.yaml # Edit with your actual API keys vim configs/secrets.yaml
- Edit
-
Start Redis (required for storage):
# Windows ./scripts/start-redis.ps1 # Linux/macOS ./scripts/check_and_start_redis.sh
# Run once with default settings
poetry run python src/main.py
# Run with debug logging
poetry run python src/main.py --log-level DEBUG
# See all options
poetry run python src/main.py --help# Windows
./scripts/start_newsbot.bat
# Linux/macOS
./scripts/start_newsbot.sh
# Custom continuous run
poetry run python continuous_runner.py --interval 30m --log-level INFOThe application uses two YAML configuration files:
configs/config.yaml- Main application settings (search terms, filtering rules, processing options)configs/secrets.yaml- API keys and credentials (excluded from version control)
-
Copy and customize the main config:
# The config.yaml file contains all application settings # Edit search terms, sources, and processing options vim configs/config.yaml
-
Create your secrets file:
# Create secrets.yaml with your API keys cp configs/secrets.yaml.example configs/secrets.yaml vim configs/secrets.yaml -
Add your credentials to secrets.yaml:
# BlueSky credentials bluesky: handle: "your-handle.bsky.social" app_password: "your-app-password" # Generate in BlueSky Settings # OpenAI API key (model settings are in config.yaml) openai: api_key: "sk-your-openai-api-key"
For comprehensive configuration documentation including all options, examples, and security best practices, see:
The configuration guide covers:
- All configuration options with detailed explanations
- Security best practices for managing secrets
- Example configurations for different use cases
- Troubleshooting common configuration issues
- Environment variable overrides for deployment
newsbot/
โโโ src/ # Main source code
โ โโโ main.py # Application entry point with CLI
โ โโโ config_loader.py # YAML configuration management
โ โโโ fetcher_rss.py # Google News integration
โ โโโ filter.py # Story filtering system
โ โโโ deduplicator.py # Advanced deduplication logic
โ โโโ url_decoder.py # Google News URL resolution
โ โโโ article_scraper.py # Multi-strategy content extraction
โ โโโ summarizer.py # OpenAI-powered summarization
โ โโโ relevance_checker.py # Content relevance validation
โ โโโ bluesky_poster.py # BlueSky social posting
โ โโโ storage.py # Redis-based persistence
โ โโโ typechecking.py # Type definitions and enums
โโโ tests/ # Comprehensive test suite
โ โโโ test_*.py # Unit tests for all components
โ โโโ conftest.py # Test configuration and fixtures
โโโ scripts/ # Utility and startup scripts
โ โโโ start_newsbot.bat # Windows startup script
โ โโโ start_newsbot.sh # Linux/macOS startup script
โ โโโ start-redis.ps1 # Windows Redis startup
โ โโโ check_and_start_redis.sh # Linux Redis management
โโโ configs/ # Configuration files
โ โโโ config.yaml # Main configuration
โ โโโ secrets.yaml # API keys and secrets
โโโ docs/ # Documentation
โ โโโ configuration.md # Comprehensive configuration guide
โ โโโ continuous_runner.md # Continuous operation guide
โ โโโ semantic_deduplication.md# Future NLP features
โโโ specs/ # Project specifications
โ โโโ prd.md # Product requirements document
โ โโโ todo.md # Development progress tracking
โโโ stubs/ # Type stubs for external libraries
โโโ continuous_runner.py # Continuous operation script
โโโ *.py # Utility scripts (Redis, story management)
โโโ pyproject.toml # Poetry configuration and dependencies
The NewsBot follows a structured pipeline for processing news stories:
- Connects to Google News with configurable search parameters
- Fetches stories from the last N days (configurable)
- Processes raw Google News data into structured Story objects
- Generates stable story IDs using content hashing
- Story ID matching: Prevents exact duplicate processing
- URL normalization: Removes tracking parameters and handles redirects
- Headline similarity: Detects stories with similar titles
- Google News handling: Smart handling of redirect URLs
- Batch processing: Prevents duplicates within same fetch batch
- Semantic framework: Ready for future NLP-based deduplication
- Source validation: Confirms stories from trusted sources
- Headline filtering: Removes stories with banned keywords
- URL filtering: Blocks content from unwanted domains
- Status tracking: Records filter decisions and reasons
- Comprehensive logging: Detailed filter statistics and decisions
- Google News resolution: Converts redirect URLs to original sources
- Batch processing: Efficient handling of multiple URLs
- Error handling: Graceful fallback for failed resolutions
- Statistics tracking: Success/failure metrics
- Multi-strategy extraction: newspaper3k, trafilatura, BeautifulSoup
- Fallback handling: Tries multiple methods for maximum success
- Content validation: Ensures meaningful article content
- Rate limiting: Configurable delays between requests
- Error tracking: Detailed failure analysis per scraping method
- OpenAI integration: GPT-powered intelligent summarization
- Multiple strategies: OpenAI, condensation, headline fallback
- Character limits: Respects BlueSky post length requirements
- Retry logic: Automatic retry for over-length summaries
- Quality validation: Ensures summary relevance and readability
- Content analysis: Validates story relevance for accepted sources
- Keyword matching: Configurable relevance criteria
- Source-based logic: Different rules for confirmed vs accepted sources
- Decision tracking: Logs relevance decisions and criteria
- Rate limiting: Configurable intervals between posts
- Post history: Tracks successful and failed posts
- Authentication: Secure BlueSky API integration
- Status tracking: Updates story records with posting results
- Error handling: Graceful handling of API failures
- Redis persistence: All stories stored with full metadata
- Filter status preservation: Tracks passed/rejected with reasons
- Advanced field preservation: Protects summary, post_status during updates
- Story lifecycle management: Handles updates without data loss
- Statistics tracking: Real-time counts of passed/rejected stories
list_stories.py- List and filter stored stories with detailed informationshow_story.py- Display complete details for specific storiesflush_story.py- Remove specific stories for testingflush_redis.py- Clear all Redis datacheck_redis.py- Verify Redis connectivity and data
continuous_runner.py- Run the pipeline continuously with configurable intervalsscripts/start_newsbot.bat- Windows startup script (1-minute intervals)scripts/start_newsbot.sh- Linux/macOS startup script (1-minute intervals)
scripts/start-redis.ps1- Start Redis on Windowsscripts/check_and_start_redis.sh- Check and start Redis on Linux/macOS
The application provides comprehensive logging with configurable levels:
- DEBUG - Detailed processing information for development
- INFO - General application flow and status updates (default)
- WARNING - Important notices and potential issues
- ERROR - Error conditions that don't stop execution
- CRITICAL - Severe errors that may cause application termination
- File logging:
newsbot.log(persistent, rotated) - Console output: Real-time monitoring with colors
- Configurable: Can disable file logging for console-only operation
Real-time statistics for each pipeline stage:
- Fetcher: Stories retrieved, processing success rate
- Deduplication: Duplicates by type (story_id, URL, headline, semantic)
- Filtering: Pass/fail rates, source breakdown, filter reasons
- URL Decoding: Success/failure rates, processing times
- Article Scraping: Success by method, content quality metrics
- Summarization: Strategy usage, character counts, retry rates
- Relevance Checking: Relevance decisions, keyword matches
- BlueSky Posting: Post success/failure, rate limiting effectiveness
- Storage: Total stories, new vs updated, error counts
- Overall: Pipeline throughput, processing times, error rates
# Run all tests
poetry run pytest
# Run with coverage
poetry run pytest --cov=src
# Run specific test file
poetry run pytest tests/test_storage.py -v# Type checking
poetry run mypy src/
# Linting (if configured)
poetry run ruff check src/- Use
--log-level DEBUGfor detailed debugging information - Test individual components with their respective utility scripts
- Monitor
newsbot.logfor persistent logging during development - Use
flush_redis.pyto reset state between testing sessions
- FastAPI Observability Layer: Web dashboard for monitoring and story inspection
- Enhanced NLP: Semantic deduplication using sentence transformers
- Multi-source Support: Extend beyond Google News to RSS feeds and other sources
- Advanced Filtering: ML-based content quality scoring
- Performance Optimization: Async processing and batch operations
- Multiple Topics: Support for different search terms and posting accounts
- Scheduling: Time-based posting schedules and content calendars
- Content Templates: Customizable post formatting and templates
- Integration Webhooks: Notifications and external system integration
This project is licensed under the MIT License - see the LICENSE file for details.