sql-query-engine

Thesis: Query optimisation in distributed databases

How to run on a Google Cloud Dataproc cluster (Compute Engine)

Create a Google Cloud Storage bucket called storage_intermediate:
- upload the data to it as test_data.zip
- create a sql-query-engine directory in it and upload the files from this repository there

Create a Dataproc cluster on Compute Engine:
- choose the Standard cluster type (1 master, N workers)
- enable the component gateway to get access to the MapReduce and YARN web UIs
- set the machine type of the master node and of the three worker nodes to n2-standard-2 (2 vCPU, 8 GB)
- add script_for_cloud.sh as an initialisation action

SSH into each worker node and run the following:
- sudo chown root /var/lib/hadoop-hdfs
- sudo hdfs --daemon start datanode

SSH into the master node from the Cloud Console and run the following:
- cd /home/sql-query-engine/
- source .venv/bin/activate
- export CLUSTER_NAME=<name_of_the_cluster_here>
- export PYSPARK_DRIVER_PYTHON=/home/sql-query-engine/.venv/bin/python
- export PYSPARK_PYTHON=/home/sql-query-engine/.venv/bin/python
- export SPARK_HOME=/usr/lib/spark
- hadoop fs -put /data/* /data/
- sudo yarn --daemon start resourcemanager
- hadoop classpath
- in /etc/hadoop/conf/yarn-site.xml, set the value of the property below to the output of the command above

```
<property>
  <name>yarn.application.classpath</name>
  <value>output from hadoop classpath</value>
</property>
```
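The yarn-site.xml edit above can also be scripted instead of done by hand. The following is a minimal sketch, not part of this repository: the function names `set_yarn_classpath` and `patch_yarn_site` are hypothetical, and it assumes the file has the usual `<configuration>` root with `<property>` children. Note that the standard-library parser drops XML comments when the file is rewritten, so keep a backup of the original.

```python
import subprocess
import xml.etree.ElementTree as ET
from pathlib import Path

PROPERTY_NAME = "yarn.application.classpath"


def set_yarn_classpath(xml_text: str, classpath: str) -> str:
    """Return xml_text with yarn.application.classpath set to classpath.

    Hypothetical helper: updates the property if it exists,
    otherwise appends a new <property> element to the root.
    """
    root = ET.fromstring(xml_text)
    for prop in root.findall("property"):
        if (prop.findtext("name") or "").strip() == PROPERTY_NAME:
            value = prop.find("value")
            if value is None:
                value = ET.SubElement(prop, "value")
            value.text = classpath
            break
    else:
        # Property not present yet: add it.
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = PROPERTY_NAME
        ET.SubElement(prop, "value").text = classpath
    return ET.tostring(root, encoding="unicode")


def patch_yarn_site(path: str = "/etc/hadoop/conf/yarn-site.xml") -> None:
    """Write the output of `hadoop classpath` into yarn-site.xml.

    Assumes `hadoop` is on PATH and the process may write to `path`
    (on the master node this typically means running as root).
    """
    classpath = subprocess.run(
        ["hadoop", "classpath"], check=True, capture_output=True, text=True
    ).stdout.strip()
    site = Path(path)
    site.write_text(set_yarn_classpath(site.read_text(), classpath))
```

Calling `patch_yarn_site()` on the master node would then replace the two manual steps. `set_yarn_classpath` is kept separate so that it can be tried on a copy of the file first.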

python main.py --env <LOCAL | HDFS> --mode <hadoop | spark> --dd_path <path> <"sql query">
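The invocation above can be assembled programmatically, for example when running several queries in a row. This is a rough sketch under stated assumptions: `build_command` is a hypothetical helper that is not part of the repository, the accepted values for `--env` and `--mode` are taken from the usage line and assumed to be case-sensitive, and the `--dd_path` value and the query in the usage note are placeholders.

```python
import shlex
import sys

# Values copied from the usage line; exact spelling is an assumption.
ENVS = ("LOCAL", "HDFS")
MODES = ("hadoop", "spark")


def build_command(env, mode, dd_path, query, python=sys.executable):
    """Build the argv list for main.py (hypothetical helper)."""
    if env not in ENVS:
        raise ValueError(f"--env must be one of {ENVS}, got {env!r}")
    if mode not in MODES:
        raise ValueError(f"--mode must be one of {MODES}, got {mode!r}")
    # The SQL query is passed as a single positional argument.
    return [python, "main.py", "--env", env, "--mode", mode,
            "--dd_path", dd_path, query]


def as_shell_line(command):
    """Quote the argv list so it can be pasted into a shell."""
    return shlex.join(command)
```

For instance, `as_shell_line(build_command("HDFS", "spark", "path/to/dd", "select * from some_table", python="python"))` yields a properly quoted command line, and the list returned by `build_command` can be passed to `subprocess.run` directly, without going through a shell.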
