A Python-based web scraper designed to power a News Reporter Chatbot. The scraper crawls verified news websites, extracts articles, and stores them in a Redis database. This enables the chatbot to answer user queries about recent events and verify rumors or social media claims with reliable sources.
- Focused URL Crawling: Crawls only trusted news domains starting from predefined base URLs.
- Verified Content Extraction: Extracts clean article text, headlines, and publication details.
- Redis Integration: Stores articles with metadata for fast retrieval by the chatbot.
- Duplicate Prevention: Skips already scraped URLs to avoid redundant storage.
- Rate Limiting: Configurable delay between requests to ensure respectful crawling.
- Comprehensive Logging: Tracks crawling, scraping, and storage activities.
```text
.scrapper/
├── data/
│   └── base_urls.txt       # List of news source URLs
├── dump.rdb                # Redis database file
├── README.md               # This file
├── redis/
│   └── dump.rdb            # Redis persistence file
├── requirements.txt        # Python dependencies
├── run_scrapper.log        # Log file for run.sh
├── run.sh                  # Bash script to run the scraper
├── scraper.log             # Scraper activity log
├── src/
│   ├── crawler.py          # URL discovery and crawling
│   ├── db.py               # Redis database operations
│   ├── __init__.py
│   ├── main.py             # Main entry point
│   ├── scrapper.py         # Article extraction and cleaning
│   └── utils.py            # Utility functions
└── tests/
    └── test_scrapper.py    # Unit tests
```
Create a Conda environment with Python 3.10.13:
```bash
conda create -n scrapper-env python=3.10.13 -y
conda activate scrapper-env
pip install -r requirements.txt
```
- Install Redis locally or use a cloud Redis service.
- Ensure Redis is running and accessible.
- The default Redis persistence file is stored at `redis/dump.rdb`.
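To confirm Redis is reachable before scraping, a quick check like the following can help (a minimal sketch using the `redis` Python package; the host, port, and database number are assumed defaults and may differ from your setup):

```python
import redis

# Assumed defaults; adjust host/port/db to match your Redis setup.
client = redis.Redis(host="localhost", port=6379, db=0)

try:
    client.ping()  # Raises ConnectionError if Redis is not reachable
    print("Redis is up and reachable.")
except redis.exceptions.ConnectionError as exc:
    print(f"Cannot reach Redis: {exc}")
```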
Edit `data/base_urls.txt` and list the base URLs of news websites to scrape, one URL per line.
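For illustration, the base URLs might be loaded along these lines (a sketch only; the actual loading logic lives in `src/crawler.py` or `src/utils.py` and may differ, and the blank-line/comment handling here is an assumption):

```python
from pathlib import Path

def load_base_urls(path: str = "data/base_urls.txt") -> list[str]:
    """Read base URLs, one per line, skipping blank lines and '#' comments."""
    urls = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls

# Example: base_urls = load_base_urls()
```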
You can adjust scraping parameters in the `.env` file, if present:
- `MAX_PAGES_PER_DOMAIN` → Maximum articles per domain
- `REQUEST_DELAY` → Delay between requests (seconds)
- `USER_AGENT` → HTTP User-Agent header
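A sketch of how these settings might be read (this assumes `python-dotenv` is used for loading `.env`, which may differ from the project's actual mechanism; the default values shown are illustrative only):

```python
import os
from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # reads .env from the working directory, if present

MAX_PAGES_PER_DOMAIN = int(os.getenv("MAX_PAGES_PER_DOMAIN", "50"))  # illustrative default
REQUEST_DELAY = float(os.getenv("REQUEST_DELAY", "1.0"))             # seconds between requests
USER_AGENT = os.getenv("USER_AGENT", "NewsScraperBot/1.0")           # illustrative default
```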
The `run.sh` script will activate the environment, run the scraper, and log output:
```bash
bash run.sh
```
This will:
- Load base URLs from `data/base_urls.txt`
- Crawl each domain to discover new URLs
- Extract and clean article text and metadata
- Store articles in Redis
- Log scraping activity to `scraper.log` and `run_scrapper.log`
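Conceptually, the per-domain loop behind these steps might resemble the sketch below. The helper names `discover_urls`, `scrape_article`, and `store_article` are hypothetical illustrations, not the project's actual functions:

```python
import time

def run_pipeline(base_urls, db, max_pages=50, delay=1.0):
    for base_url in base_urls:
        for url in discover_urls(base_url, limit=max_pages):  # hypothetical crawler helper
            if db.sismember("scraped_urls", url):             # skip already-scraped URLs
                continue
            try:
                article = scrape_article(url)                 # hypothetical extraction helper
                store_article(db, url, article)               # hypothetical storage helper
            except Exception as exc:
                print(f"Skipping {url}: {exc}")               # errors do not stop the run
            time.sleep(delay)                                 # respectful rate limiting
```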
To run the scraper manually:
```bash
cd src
python main.py
```
Data is stored in Redis as follows:
- Articles: Stored as hashes with keys like `news:<url_hash>`
- Scraped URL Index: The set `scraped_urls` stores all processed URLs
- Metadata: Includes headline, content, source, timestamp, and length
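As an illustration of this layout, storing a single article might look roughly like this (a sketch assuming the `redis` Python package and SHA-256 for the URL hash; the exact field names and hashing scheme in `src/db.py` may differ):

```python
import hashlib
import time

import redis

client = redis.Redis(host="localhost", port=6379, db=0)  # assumed defaults

def store_article(url: str, headline: str, content: str, source: str) -> None:
    url_hash = hashlib.sha256(url.encode("utf-8")).hexdigest()
    key = f"news:{url_hash}"
    client.hset(key, mapping={
        "url": url,
        "headline": headline,
        "content": content,
        "source": source,
        "timestamp": int(time.time()),
        "length": len(content),
    })
    client.sadd("scraped_urls", url)  # duplicate-prevention index

# Duplicate check before scraping a URL:
# if client.sismember("scraped_urls", url): skip it
```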
Run unit tests:
```bash
pytest tests/
```
Notes:
- Only crawls trusted news domains listed in `base_urls.txt`.
- Automatically skips duplicate URLs.
- Handles errors gracefully without stopping the scraping process.
- Respects rate limits to avoid overwhelming servers.
- Provides structured and reliable data for the News Reporter Chatbot.