This project is a web scraper for news articles from various media sources. For each source, the scraper downloads an XML feed of news articles from a given URL, extracts specific fields of data from each article, and stores the data in a SQLite database. The program is written in Python and uses the BeautifulSoup and dataset libraries for web scraping and database management, respectively.
## Files

The project consists of two Python files: `__init__.py` and `main.py`.

### `__init__.py`

The `__init__.py` file contains two classes: FeedParser and Scraper.
FeedParser is a class for parsing XML feeds and extracting the relevant fields of data from the articles. Each instance of FeedParser is associated with a specific URL and SQLite table. The class downloads the XML feed from the URL, iterates over each article in the feed, and extracts the article's title, link, description, category, and publication date. The class then stores this information in the SQLite table, along with a unique identifier generated by the gen_uniqueId function. The parse_feed method takes optional keyword arguments that can be used to extract additional fields of data from the articles.
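The real implementation lives in `__init__.py`; the sketch below only illustrates the general shape of such a parser. The column names, the RSS tag names, the database path, and the idea of deriving the unique identifier from the article link are assumptions made for illustration, not the project's actual code.

```python
import hashlib
import dataset
import requests
import xml.etree.ElementTree as ET

def gen_uniqueId(link):
    # Hypothetical implementation: derive a stable identifier from the article link.
    return hashlib.sha256(link.encode("utf-8")).hexdigest()

class FeedParser:
    def __init__(self, url, table_name, db_url="sqlite:///news.db"):
        self.url = url
        self.table = dataset.connect(db_url)[table_name]

    def parse_feed(self, **extra_fields):
        # Download the XML feed and walk over every <item> element.
        root = ET.fromstring(requests.get(self.url, timeout=10).content)
        for item in root.iter("item"):
            row = {
                "uid": gen_uniqueId(item.findtext("link", default="")),
                "title": item.findtext("title"),
                "link": item.findtext("link"),
                "description": item.findtext("description"),
                "category": item.findtext("category"),
                "published": item.findtext("pubDate"),
            }
            # Optional keyword arguments map extra column names to feed tags.
            for column, tag in extra_fields.items():
                row[column] = item.findtext(tag)
            # upsert avoids duplicate rows when the same feed is fetched again.
            self.table.upsert(row, ["uid"])
```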
Scraper is a class for scraping additional fields of data that are not included in the XML feed. Currently, it is configured to extract the author and keywords from each article's HTML: it parses the page with the BeautifulSoup library, extracts the desired fields, and stores them in the SQLite table.
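A comparable sketch of the second class, reusing the schema from the FeedParser sketch above and assuming the author and keywords are exposed as standard `<meta name="author">` / `<meta name="keywords">` tags; the method name, column names, and database path are again placeholders.

```python
import dataset
import requests
from bs4 import BeautifulSoup

class Scraper:
    def __init__(self, table_name, db_url="sqlite:///news.db"):
        self.table = dataset.connect(db_url)[table_name]

    def scrape_missing_fields(self):
        # Fetch the full article page for every stored link and pull the
        # author and keywords out of the <meta> tags in the HTML head.
        for row in self.table.all():
            html = requests.get(row["link"], timeout=10).text
            soup = BeautifulSoup(html, "html.parser")
            author = soup.find("meta", attrs={"name": "author"})
            keywords = soup.find("meta", attrs={"name": "keywords"})
            self.table.update(
                {
                    "uid": row["uid"],
                    "author": author["content"] if author else None,
                    "keywords": keywords["content"] if keywords else None,
                },
                ["uid"],
            )
```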
### `main.py`

The `main.py` file is the main entry point for the program. It creates FeedParser and Scraper instances for each media source, extracts the relevant data from each article, and stores it in the SQLite database. The file is currently configured to scrape news articles from these sources: Frankfurter Allgemeine Zeitung, Die Tageszeitung, JungeWelt, Zeit-Online and Der Spiegel.
The get_feed function creates a FeedParser instance for each media source and extracts the relevant data from the articles. The do_html function creates a Scraper instance for each media source and extracts additional fields of data from the articles. The do_faz function extracts the author name from Frankfurter Allgemeine Zeitung articles, because, unlike the other sources, FAZ does not expose the author as a meta tag in its HTML.
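The wiring in `main.py` is then roughly the following, reusing the class sketches above; the CSS selector used for the FAZ byline and the database path are placeholders, not the project's real values.

```python
import dataset
import requests
from bs4 import BeautifulSoup

def get_feed(url, table_name):
    # One FeedParser per media source; ingest its XML feed.
    FeedParser(url, table_name).parse_feed()

def do_html(table_name):
    # One Scraper per media source; pull author and keywords from the HTML.
    Scraper(table_name).scrape_missing_fields()

def do_faz(table_name):
    # FAZ does not expose the author as a meta tag, so read it from a
    # byline element in the article body instead (placeholder selector).
    table = dataset.connect("sqlite:///news.db")[table_name]
    for row in table.all():
        soup = BeautifulSoup(requests.get(row["link"], timeout=10).text, "html.parser")
        byline = soup.select_one("span.author")  # placeholder selector
        if byline is not None:
            table.update({"uid": row["uid"], "author": byline.get_text(strip=True)}, ["uid"])
```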
The program is designed to run continuously, and the main loop in `main.py` is currently set to scrape news articles from each media source every 5 minutes.
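Building on the helper sketch above, a 5-minute polling loop of that kind might look like this; the feed URLs and table names are placeholders, and the real values are defined in `main.py`.

```python
import time

SOURCES = {
    # table name -> feed URL (placeholders; see main.py for the real values)
    "faz": "https://example.org/faz.rss",
    "taz": "https://example.org/taz.rss",
}

while True:
    for table_name, feed_url in SOURCES.items():
        get_feed(feed_url, table_name)
        do_html(table_name)
    do_faz("faz")       # FAZ needs the special author handling
    time.sleep(5 * 60)  # wait five minutes before the next pass
```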
## Dependencies

This project requires the following Python libraries:
- bs4 (BeautifulSoup)
- requests
- dataset

It also uses the xml.etree.ElementTree and json modules from the Python standard library, which do not need to be installed separately.
The project is designed to use a SQLite database to store the scraped data. The database connection string is currently hardcoded in `__init__.py` and `main.py`.
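The connection string follows the usual dataset/SQLAlchemy scheme for SQLite files; the file and table names below are only examples, and the actual value is the one hardcoded in the project.

```python
import dataset

# "sqlite:///<file>" is the SQLAlchemy-style connection string used by dataset;
# the real file name is hardcoded in __init__.py and main.py.
db = dataset.connect("sqlite:///news.db")
articles = db["faz"]      # one table per media source (name is an example)
print(articles.count())   # number of articles scraped so far
```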
## Usage

To run the program, simply execute `main.py`.