Public Data Scraper for Parliament Data for the EU and other Parliaments
- Install git (if not present already)
- Clone project using `git clone https://github.com/sampritipanda/simple_app.git`
- Install Ruby (version >= 2.1) and Bundler
- Run `bundle install` to install the required gems
- Run the script using `ruby eu_scraper.rb` or `./eu_scraper.rb`
- Find the scraped questions in the docs/ folder
- Ruby - The Language
- Nokogiri - For HTML Parsing
##Scala-based Asynchronous crawler Setup
- Install sbt, git and the latest version of Scala (sbt will do the update for you)
- `git clone https://github.com/DengYiping/parliament-scaper.git`
- `sbt run`
- sbt will first automatically download the necessary dependencies, and then it will run the script.
###Technologies Used in Scala crawler:
- Scala: a functional programming language on JVM
- Akka: an effective framework for asynchronous, non-blocking and event-driven programming in Scala
- Spray-client: a lightweight HTTP client based on the Akka Actor model.
##Python Based Crawler Setup
- Install the requirements for this crawler `pip install -r requirements.txt`
- Run `$ python eu_scraper.py`
###Technologies Used in Python Crawler:
- Requests library
- lxml library for DOM traversal
##Python-async parser setup
- Create a virtual environment inside the `python-async` folder with `virtualenv --python=python3.4 venv`
- Activate your virtual environment with `source venv/bin/activate`
- Install all appropriate requirements with `pip install -r requirements.txt`
- Run the parser with `$ python parser.py`
Changing the parser behavior
- Change `YEARS_TO_PARSE` in order to parse data from different years
- Change `FOLDER_TO_DOWNLOAD` in order to change the name of the folder to download the data into.
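
As a rough illustration of how these two settings might drive the output layout, here is a minimal sketch. Only the names `YEARS_TO_PARSE` and `FOLDER_TO_DOWNLOAD` come from the parser; their values, the list type, the one-subfolder-per-year layout and the helper functions `output_dir_for` and `planned_dirs` are assumptions made for the example and may not match `parser.py`.

```python
import os

# Assumed example values; the real defaults in parser.py may differ.
YEARS_TO_PARSE = [2014, 2015]
FOLDER_TO_DOWNLOAD = "downloads"


def output_dir_for(year, base=FOLDER_TO_DOWNLOAD):
    """Return the directory for one year (hypothetical layout: base/year)."""
    return os.path.join(base, str(year))


def planned_dirs(years=YEARS_TO_PARSE, base=FOLDER_TO_DOWNLOAD):
    """List the output directories for every configured year, in order."""
    return [output_dir_for(year, base) for year in years]


if __name__ == "__main__":
    for path in planned_dirs():
        print(path)
```

Under these assumptions, editing the two constants is enough to change which years are covered and where the files would end up.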
###Technologies Used in Python-async parser:
- Requests + requests-futures for async requests
- threading for async downloading
- beautifulsoup4 for DOM parsing
- tqdm for progress bar
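
The thread-based download pattern can be sketched with the standard library alone. requests-futures builds on `concurrent.futures`, so the example below uses `ThreadPoolExecutor` directly and swaps the real HTTP call for a stand-in so that it runs offline. The names `fake_fetch` and `fetch_all` are hypothetical and are not taken from `parser.py`.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_fetch(url):
    """Stand-in for an HTTP GET so the sketch runs without network access."""
    return "<html>{}</html>".format(url)


def fetch_all(urls, fetch=fake_fetch, max_workers=4):
    """Fetch every URL on a thread pool and return a {url: body} dict."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input URLs.
        bodies = list(pool.map(fetch, urls))
    return dict(zip(urls, bodies))


if __name__ == "__main__":
    pages = fetch_all(["page-1", "page-2", "page-3"])
    for url in sorted(pages):
        print(url, len(pages[url]))
```

In a real run, `fetch` would be replaced by a function that performs the request and returns the response text; wrapping the iteration in tqdm would add the progress bar.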