Thesis: Query optimisation in distributed databases
-
Create a Google Cloud Storage called storage_intermediate
- upload the data as a test_data.zip here
- create a sql-query-engine directory and upload files from this repository there
-
Create a Compute Engine Cluster
- Choose the standard cluster type - (1 master, N workers)
- Enable component gateway to have access to MapReduce and Yarn web UIs
- Set up the master node and three worker nodes to n2-standard-2 (2 vCPU, 8 GB)
- Set up an initialisation action to the script_for_cloud.sh script
-
Ssh to all worker nodes and do the following:
sudo chown root /var/lib/hadoop-hdfssudo hdfs --daemon start datanode
-
Ssh to master node from cloud console
cd /home/sql-query-engine/source .venv/bin/activateexport CLUSTER_NAME=<name_of_the_cluster_here>export PYSPARK_DRIVER_PYTHON=/home/sql-query-engine/.venv/bin/pythonexport PYSPARK_PYTHON=/home/sql-query-engine/.venv/bin/pythonexport SPARK_HOME=/usr/lib/sparkhadoop fs -put /data/* /data/sudo yarn --daemon start resourcemanagerhadoop classpath- edit the property below from
/etc/hadoop/conf/yarn-site.xmlto the values from the command above -
<property> <name>yarn.application.classpath</name> <value>output from hadoop classpath </value> </property>
python main.py --env <LOCAL | HDFS> --mode <hadoop | spark> --dd_path <path> <"sql query">