Ragnar is an ETL tool for Big Data built entirely on open-source tools, such as:
- Spark
- Docker
- Elasticsearch
- Kibana
- Numpy
- Pandas
- Jupyter notebook
https://www.kaggle.com/zynicide/wine-reviews/version/4#
ftp://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html
We follow a code style based on PEP 8.
We use flake8 for linting.
It is also recommended to work inside a virtualenv so that project dependencies stay isolated from the system Python installation.
First, enter the source folder.
cd src
On Windows 10, virtualenv is only safe to use from the Command Prompt (cmd).
virtualenv .venv
.venv\Scripts\activate.bat
After activating the virtualenv, install the pip requirements needed to run the linter and the tests locally.
pip install -r requirements.txt
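To confirm that the virtualenv is actually active before installing anything, a small check like the one below can help. It is a minimal sketch and not part of Ragnar; the function name `in_virtualenv` is an assumption made for illustration.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a virtual environment."""
    # Older virtualenv releases set sys.real_prefix; venv and newer
    # virtualenv releases make sys.base_prefix differ from sys.prefix.
    if hasattr(sys, "real_prefix"):
        return True
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print("virtualenv active:", in_virtualenv())
```

If this prints `False` after running the activation script, the packages would be installed into the global interpreter instead of `.venv`.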
To leave the virtualenv:
deactivate
flake8 app
It is recommended to set up your IDE to use the flake8 linter as well.
In the app folder, run the commands below:
set "PYTHONPATH=%cd%"
python -m pytest tests
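As an illustration of what a test collected by the command above might look like, here is a minimal sketch. The function `normalize_country` and the file name `tests/test_normalize.py` are hypothetical and are not taken from the Ragnar code base. pytest collects functions whose names start with `test_` and reports a failure when an `assert` does not hold.

```python
# tests/test_normalize.py (hypothetical file name)


def normalize_country(value):
    """Hypothetical transform step: trim whitespace and title-case a country name."""
    if value is None:
        return None
    cleaned = value.strip()
    return cleaned.title() if cleaned else None


def test_normalize_country_trims_and_title_cases():
    assert normalize_country("  portugal ") == "Portugal"


def test_normalize_country_handles_missing_values():
    assert normalize_country(None) is None
    assert normalize_country("   ") is None
```

In a real project the function under test would normally live in the application package and be imported by the test module, which is why PYTHONPATH is set before running pytest.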
Build the following three images before running docker-compose.
The order is important.
Build the ragnar/spark image.
.\bin\docker_build_spark.ps1
Build the ragnar/pyspark image.
.\bin\docker_build_pyspark.ps1
Build the application base image: ragnar/app image.
The base image is required for the ragnar/app/notebook image.
.\bin\docker_build_app.ps1
Bring up a local environment for development and data exploration.
docker-compose up
- Kibana: http://localhost:5601
- Elasticsearch: http://localhost:9200
- Jupyter Notebook: http://localhost:8888
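Once docker-compose is up, a quick way to see whether the three services are answering is to probe each port. The script below is a minimal sketch using only the Python standard library; the names `SERVICES`, `service_url` and `is_reachable` are assumptions made for illustration and are not part of Ragnar.

```python
import http.client
import urllib.error
import urllib.request

# Ports taken from the list above; the host is assumed to be localhost.
SERVICES = {
    "Kibana": 5601,
    "Elasticsearch": 9200,
    "Jupyter Notebook": 8888,
}


def service_url(port, host="localhost"):
    """Build the base URL of a service exposed by docker-compose."""
    return f"http://{host}:{port}"


def is_reachable(url, timeout=2.0):
    """Return True if an HTTP server answers at the given URL."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, even if with an error status (e.g. 401).
        return True
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, DNS failure, malformed response.
        return False


if __name__ == "__main__":
    for name, port in SERVICES.items():
        url = service_url(port)
        status = "up" if is_reachable(url) else "down"
        print(f"{name}: {url} is {status}")
```

Services can take a while to start, so a "down" result right after `docker-compose up` does not necessarily indicate a problem.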
Jupyter: Since the volume is bound by docker-compose, everything is saved automatically in the notebooks folder, so no manual work is required here.
Kibana: Saved objects must be exported and imported manually. Every Kibana object should be saved in the kibana folder.
Elasticsearch: All data saved into Elasticsearch is lost when the container dies.
Running Spark (Scala) in interactive mode.
docker run --rm -it ragnar/spark
Running Spark (Python) in interactive mode.
docker run --rm -it --entrypoint "/bin/bash" ragnar/pyspark
Share the whole project path with VirtualBox.
This gives us access to the folders inside the ragnar project (e.g. /ragnar/notebooks).
docker-machine stop
vboxmanage sharedfolder add default --name "ragnar" --hostpath "D:\Projetos\ragnar" --automount
docker-machine start
If you are using docker-machine, replace localhost with the virtual machine IP (e.g. 192.168.99.100).
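The host substitution described above can be sketched as a small helper that rewrites a URL while keeping its scheme, port and path. This is only an illustration; the function name `with_host` is an assumption and not part of Ragnar.

```python
from urllib.parse import urlsplit, urlunsplit


def with_host(url, host):
    """Return url with its host replaced, keeping scheme, port and path."""
    parts = urlsplit(url)
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    return urlunsplit(parts._replace(netloc=netloc))


if __name__ == "__main__":
    # 192.168.99.100 is the example address from the text; the real one
    # is printed by `docker-machine ip`.
    urls = (
        "http://localhost:5601",
        "http://localhost:9200",
        "http://localhost:8888",
    )
    for url in urls:
        print(with_host(url, "192.168.99.100"))
```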
./bin/pyspark --packages org.apache.spark:spark-avro_2.11:2.4.4
SPARK
TESTS