GitHub - paulr291/shark-benchmark

This project provides a framework for automating benchmarking for Shark.

The ec2 directory provides scripts and configuration files for setting up an EC2 cluster and running the tests on it.
The README in that directory explains its use.

To run on a given cluster:
  1. Clone this repository onto the master node.
  2. Execute "./install.rb [Git Spark Commit Hash] [Git Shark Commit Hash]", which installs the given versions of Spark and Shark and copies the TPC-H data onto HDFS.
  3. Modify the necessary configuration parameters.
  4. Execute "./executeQueries.sh".

The following variables need to be set in config.sh:
  RESULTS - the file the CSV results will be saved to. Each line has the format "query name,iteration number,seconds".
  BENCHMARK_LOG - the file the output of Shark will be saved to
  QUERIES_DIR - the directory of queries to run (more info below)
  ALL_QUERY - the file that will hold the queries, concatenated together, to be run by Shark
  ITERATIONS - the number of times to execute each timed query
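
Because RESULTS holds one "query name,iteration number,seconds" line per timed run, the per-query average can be computed with a few lines of awk. The following is only a sketch and is not part of this repository: the function name summarize_results, the sample query names, and the timings are made up for illustration, and it assumes the file has no header line.

```shell
#!/bin/sh
# Sketch: average the "seconds" column of a RESULTS file per query name.
# Assumption: every line is "query name,iteration number,seconds" with no
# header line. summarize_results is a hypothetical helper, not a script
# shipped with this project.
summarize_results() {
    awk -F',' '
        { sum[$1] += $3; n[$1]++ }
        END { for (q in sum) printf "%s,%d,%.2f\n", q, n[q], sum[q] / n[q] }
    ' "$1" | sort
}

# Made-up sample data standing in for a real RESULTS file.
RESULTS=$(mktemp)
cat > "$RESULTS" <<'EOF'
q1,1,12.0
q1,2,10.0
q3,1,4.5
q3,2,5.5
EOF

# Prints "query name,number of runs,mean seconds":
#   q1,2,11.00
#   q3,2,5.00
summarize_results "$RESULTS"
rm -f "$RESULTS"
```

With a real run, point summarize_results at the path assigned to RESULTS in config.sh.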

There are three kinds of files that should be in QUERIES_DIR:
  setup.hive - There should be exactly one of these. It is executed once when Shark starts up.
  *.hive - These contain the queries that will be timed. Each is executed ITERATIONS times.
  *.hive_setup - Each corresponds to a *.hive file. It is run once before the corresponding file is executed ITERATIONS times.
For examples, see the tpch_q1 and queries directories.
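
The layout above can be checked before a run with a small script. This is a sketch under stated assumptions, not code from this repository: it assumes a *.hive_setup file is matched to its *.hive file by base name (q1.hive_setup goes with q1.hive), that a setup file is optional, and that setup.hive is excluded from the timed set. The function name list_timed_queries and the file names q1 and q2 are hypothetical.

```shell
#!/bin/sh
# Sketch: list the timed queries in a QUERIES_DIR and report whether each
# one has a per-query setup file.
# Assumptions: pairing is by base name, a setup file is optional, and
# setup.hive is the global setup file rather than a timed query.
list_timed_queries() {
    dir=$1
    for f in "$dir"/*.hive; do
        [ -e "$f" ] || continue             # the glob matched nothing
        name=$(basename "$f" .hive)
        [ "$name" = "setup" ] && continue   # global setup, not timed
        if [ -e "$dir/$name.hive_setup" ]; then
            echo "$name: timed, with per-query setup"
        else
            echo "$name: timed, no per-query setup"
        fi
    done
}

# Hypothetical layout built in a temporary directory for illustration.
QUERIES_DIR=$(mktemp -d)
touch "$QUERIES_DIR/setup.hive" "$QUERIES_DIR/q1.hive" \
      "$QUERIES_DIR/q1.hive_setup" "$QUERIES_DIR/q2.hive"

# Prints:
#   q1: timed, with per-query setup
#   q2: timed, no per-query setup
list_timed_queries "$QUERIES_DIR"
rm -rf "$QUERIES_DIR"
```

How executeQueries.sh itself pairs the files may differ; the script in this repository is the reference.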