
Query running too slow? Rewrite it with Quorion!

Part1: Reproducibility of the Experiments

Step0: Environment Requirements

  • Java JDK 1.8
  • Scala 2.12.10
  • Maven 3.8.6
  • Python version >= 3.9
  • Python package requirements: docopt, requests, flask, openpyxl, pandas, matplotlib, numpy
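
The Python packages can be installed in one shot; a minimal sketch using pip (assuming python3 is your Python >= 3.9 interpreter):

# Install all required Python packages
$ python3 -m pip install docopt requests flask openpyxl pandas matplotlib numpy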

Step1: DBMS Requirement Preparation

DuckDB 1.0:

  1. Move into the install directory.
  2. Download the *.zip file matching your platform from https://github.com/duckdb/duckdb/releases/tag/v1.0.0
  3. Extract the archive to obtain the duckdb executable.
# Step 1:
$ cd Quorion/query

# Step 2: download one of the following, depending on your platform
# duckdb_cli-linux-aarch64.zip
$ wget https://github.com/duckdb/duckdb/releases/download/v1.0.0/duckdb_cli-linux-aarch64.zip
# or duckdb_cli-linux-amd64.zip
$ wget https://github.com/duckdb/duckdb/releases/download/v1.0.0/duckdb_cli-linux-amd64.zip
# or duckdb_cli-osx-universal.zip
$ wget https://github.com/duckdb/duckdb/releases/download/v1.0.0/duckdb_cli-osx-universal.zip
# or duckdb_cli-windows-amd64.zip
$ wget https://github.com/duckdb/duckdb/releases/download/v1.0.0/duckdb_cli-windows-amd64.zip

# Step 3:
$ unzip duckdb_cli-*.zip
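
As a quick sanity check that the extraction worked:

# Should print v1.0.0 followed by a commit hash
$ ./duckdb --version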

PostgreSQL 16.2

  1. Change directory to wherever you want to install PostgreSQL.
  2. Install PostgreSQL 16.2.
# 1. Download 
$ wget https://ftp.postgresql.org/pub/source/v16.2/postgresql-16.2.tar.gz
$ tar -xvzf postgresql-16.2.tar.gz 
$ cd postgresql-16.2
# 2. Build
$ ./configure --prefix=/path/to/postgresql-16.2
$ make -j
$ make install
$ mkdir data
# 3. Set environment
$ export PGDATA=/path/to/postgresql-16.2/data
$ export PATH=/path/to/postgresql-16.2/bin:$PATH
# 4. Initialization
$ initdb -D $PGDATA -E UTF8 --locale=C -U postgres
# 5. Start pg server
$ pg_ctl -D $PGDATA -l logfile start
  3. Create a database named test. You may use another name for the database.
# 6. Create test database
$ createdb -U postgres test
# 7. Check
$ psql -U postgres test
  # Shows
  test=#
  4. Make sure you can access the database via /path/to/postgresql-16.2/bin/psql -U postgres -d test (without a password).
  5. After connecting to the database, install the extension with the command CREATE EXTENSION file_fdw;. If the command fails, execute the following commands.
$ cd /path/to/postgresql-16.2/contrib/file_fdw
$ make
$ make install
$ /path/to/postgresql-16.2/bin/pg_ctl -D /path/to/data stop
$ /path/to/postgresql-16.2/bin/pg_ctl -D /path/to/data start
$ /path/to/postgresql-16.2/bin/psql -U postgres -d test
test=# CREATE EXTENSION file_fdw;
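
As a hedged sanity check, you can confirm the extension is registered from the shell (psql's \dx meta-command lists installed extensions):

# List the file_fdw extension; an empty result means the install failed
$ /path/to/postgresql-16.2/bin/psql -U postgres -d test -c "\dx file_fdw"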

Spark 3.5.1

  1. Change directory to wherever you want to install Spark.
  2. Download Spark 3.5.1 from https://archive.apache.org/dist/spark/spark-3.5.1/
  3. Extract the downloaded package.
  4. Set the environment variables below, adjusting them to match your installation path.
export SPARK_HOME="/path/to/spark-3.5.1"
export PATH="${SPARK_HOME}/bin":"${PATH}"
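
A quick check that the environment variables took effect (spark-submit ships with the Spark distribution):

# Should report version 3.5.1
$ spark-submit --version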

Step2: Dataset Download

0. [Important] Download path

  1. Create the data directories under Quorion (a one-line equivalent follows this list).
$ cd Quorion/
$ mkdir -p Data
$ cd Data/
$ mkdir -p graph
$ mkdir -p lsqb
$ mkdir -p tpch
$ mkdir -p job
  2. Move all downloaded data to Quorion/Data/[graph|lsqb|tpch|job].
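
Equivalently, the directories above can be created in one line with bash brace expansion (a sketch, assuming you start from the directory containing Quorion/):

# Create all four dataset directories at once
$ mkdir -p Quorion/Data/{graph,lsqb,tpch,job}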

1. Graph data

  1. Run bash scripts/download_graph.sh to download the graph dataset from SNAP.
  2. Move the graph data to Quorion/Data/graph.

2. LSQB data

Choice 1: generate it yourself from the official site
  1. Clone the LSQB dataset generation tool from https://github.com/ldbc/lsqb and generate the scale factor = 30 data.
  2. Move the data to Quorion/Data/lsqb.
Choice 2: download directly from the cloud storage (~13G)
  1. Please download from lsqb_30.
  2. Move the data to Quorion/Data/lsqb.

3. TPC-H data

Choice 1: generate it yourself from the official site
  1. Download the TPC-H dataset generation tool from https://www.tpc.org/tpc_documents_current_versions/current_specifications5.asp and generate the scale factor = 100 data.
  2. Move the data to Quorion/Data/tpch.
Choice 2: download directly from the cloud storage (~108G)
  1. Please download from tpch_100.
  2. Move the data to Quorion/Data/tpch.

4. JOB data

  1. Please download from job_100.
  2. Move the data to Quorion/Data/job.

Step3: Database Initialization

  1. Make sure you have already moved the data to Quorion/Data/[graph|lsqb|tpch|job].
  2. Replace the default path in load_[graph|lsqb|tpch|job]_[duckdb|pg].sql by running the command below.
$ bash scripts/update_paths.sh
  3. Copy the file query/config.properties.template to query/config.properties. Change the settings in query/config.properties to set the corresponding PostgreSQL and DuckDB configs.
  4. Then load the data into DuckDB and PostgreSQL with the following commands.
$ bash scripts/load_data_duckdb.sh
$ bash scripts/load_data_pg.sh
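
To spot-check the PostgreSQL load, listing the tables is a simple (assumed) verification; the DuckDB database file location comes from query/config.properties:

# \dt lists all tables in the test database; it should show the loaded benchmark tables
$ /path/to/postgresql-16.2/bin/psql -U postgres -d test -c "\dt"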

Step4: Generate rewritten queries

Option1: Use the generated rewritten queries

  • Go to Step5 directly.

Option2: Generate rewritten queries by yourself

  1. Build the jar file.
$ git submodule init
$ git submodule update
$ cd SparkSQLPlus
$ mvn clean package
$ cp sqlplus-web/target/sparksql-plus-web-jar-with-dependencies.jar ../
  2. Change the Parser config in query/config.properties.
  3. Start the parser with the command
$ bash ./scripts/start_parser.sh
  4. Execute main.py to launch the Python backend rewriter component.
$ python main.py
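
Since the rewriter must stay alive while auto_rewrite.sh runs, one hedged option is to start it in the background (the log file name is arbitrary):

# Keep the Python backend running after the shell closes
$ nohup python main.py > rewriter.log 2>&1 &
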
  5. Generate the rewritten queries for each benchmark (the third argument selects the target syntax; see the OPTIONS in Part2).
./auto_rewrite.sh graph graph_duckdb D N
./auto_rewrite.sh graph graph_pg M N
./auto_rewrite.sh lsqb lsqb D N
./auto_rewrite.sh tpch tpch D N
./auto_rewrite.sh job job D N

Step5: Run experiments

Use prepared rewritten queries directly

  1. Change the specifications in query/config.properties. For the Experiment config, the default repeat count is 5 and the timeout is 7200 seconds.
  2. Execute ./auto_run_duckdb_batch.sh to run all DuckDB experiments and ./auto_run_pg_batch.sh to run all PostgreSQL experiments, or run each benchmark separately.
$ ./auto_run_duckdb_batch.sh
$ ./auto_run_pg_batch.sh
    or
# Run DuckDB
$ ./auto_run_duckdb.sh graph graph_duckdb
$ ./auto_run_duckdb.sh lsqb lsqb
$ ./auto_run_duckdb.sh tpch tpch
$ ./auto_run_duckdb.sh job job
# Run PG
$ ./auto_run_pg.sh graph_pg
$ ./auto_run_pg.sh lsqb
$ ./auto_run_pg.sh tpch
$ ./auto_run_pg.sh job
  3. The queries for the parallelism, scale, and selectivity tests are under the query directory.
  • For parallelism testing, the queries are under query/parallelism_[lsqb|sgpb]; set the parallelism via the command below (a sweep sketch follows this list).
$ ./auto_run_duckdb.sh parallelism_[lsqb|sgpb] [1|2|4|8|16|32|48]
  • For scale testing, the queries are under query/scale_[job|lsqb].
  • For selectivity testing, the queries are under query/selectivity_[lsqb|tpch].
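
A sketch for sweeping every thread count in one go (shown here for the LSQB parallelism queries):

# Run the parallelism benchmark at each supported thread count
$ for t in 1 2 4 8 16 32 48; do ./auto_run_duckdb.sh parallelism_lsqb "$t"; done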

SparkSQL

For details, please refer to the SparkSQLRunner README.

Step6: plot

  1. Execute the following commands to gather statistics. The generated statistics are in summary_*_statistics[_default].csv.
# Gather results for query under directory graph & lsqb & tpch & job
./auto_summary.sh graph
./auto_summary.sh lsqb
./auto_summary.sh tpch
./auto_summary_job.sh job
  2. Execute the scripts under draw/ to do the plotting; the generated figures are saved as draw/*.pdf.
# Generate figures (graph.pdf, lsqb.pdf, tpch.pdf) of running times for SGPB, LSQB, and TPC-H. Corresponds to Figure 9.
python3 draw_graph.py

# Generate figures (job_duckdb.pdf, job_postgresql.pdf) of running times for JOB. Corresponds to Figure 10.
python3 draw_job.py

# Generate figure (selectivity_scale.pdf) for selectivity & scale. Corresponds to Figure 11.
python3 draw_selectivity.py

# Generate figures (thread1.pdf, thread2.pdf) for parallelism. Corresponds to Figure 12.
python3 draw_thread.py
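
To regenerate every figure in one pass, a small sketch (assuming it is run from the draw/ directory):

# Run all four plotting scripts back to back
$ for s in draw_graph draw_job draw_selectivity draw_thread; do python3 "$s.py"; done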

Step7: File Structure

Quorion/
├── README.md
├── *.py                              # Python backend rewriter component
├── sparksql-plus-web-jar-with-dependencies.jar  # Parser jar file
├── SparkSQLRunner/
│   └── README.md
├── SparkSQLPlus/                     # Git submodule for Java parser
├── Data/                             # Dataset directory (created by user)
│   ├── graph/                        # Graph dataset
│   ├── lsqb/                         # LSQB dataset (scale=30)
│   ├── tpch/                         # TPC-H dataset (scale=100)
│   └── job/                          # JOB dataset (scale=100)
├── query/                            # Query and execution scripts
│   ├── config.properties.template    # Configuration template
│   ├── config.properties             # User configuration (created from template)
│   ├── auto_run_duckdb.sh            # DuckDB execution script
│   ├── auto_run_pg.sh                # PostgreSQL execution script
│   ├── auto_run_duckdb_batch.sh      # Batch DuckDB execution script
│   ├── auto_run_pg_batch.sh          # Batch PostgreSQL execution script
│   ├── auto_rewrite.sh               # Query rewriting script
│   ├── auto_summary.sh               # Results summary script
│   ├── auto_summary_job.sh           # JOB results summary script
│   ├── load_graph_duckdb.sql         # Graph data loading for DuckDB
│   ├── load_graph_pg.sql             # Graph data loading for PostgreSQL
│   ├── load_lsqb_duckdb.sql          # LSQB data loading for DuckDB
│   ├── load_lsqb_pg.sql              # LSQB data loading for PostgreSQL
│   ├── load_tpch_duckdb.sql          # TPC-H data loading for DuckDB
│   ├── load_tpch_pg.sql              # TPC-H data loading for PostgreSQL
│   ├── load_job_duckdb.sql           # JOB data loading for DuckDB
│   ├── load_job_pg.sql               # JOB data loading for PostgreSQL
│   ├── summary_*_statistics.csv      # Generated statistics files
│   ├── summary_*_statistics_default.csv  # Default/fallback statistics
│   ├── graph/                        # Graph queries directory
│   ├── lsqb/                         # LSQB queries directory
│   ├── tpch/                         # TPC-H queries directory
│   ├── job/                          # JOB queries directory
│   ├── parallelism_lsqb/             # Parallelism test queries (LSQB)
│   ├── parallelism_sgpb/             # Parallelism test queries (SGPB)
│   ├── scale_job/                    # Scale test queries (JOB)
│   ├── scale_lsqb/                   # Scale test queries (LSQB)
│   ├── selectivity_lsqb/             # Selectivity test queries (LSQB)
│   ├── selectivity_tpch/             # Selectivity test queries (TPC-H)
│   ├── src/                          # SparkSQL source files
│   ├── Schema/                       # Schema files for SparkSQL
│   ├── preprocess.sh                 # Cost generation script
│   ├── gen_cost.sh                   # Cost statistics generation
│   ├── gen_plan.sh                   # Plan generation script
│   └── start_parser.sh               # Parser startup script
├── draw/                             # Visualization scripts
│   ├── draw_graph.py                 # Generate Figure 9 (SGPB, LSQB, TPCH)
│   ├── draw_job.py                   # Generate Figure 10 (JOB performance)
│   ├── draw_selectivity.py           # Generate Figure 11 (selectivity & scale)
│   ├── draw_thread.py                # Generate Figure 12 (parallelism)
│   ├── graph.pdf                     # Generated visualization output
│   ├── lsqb.pdf                      # Generated visualization output
│   ├── tpch.pdf                      # Generated visualization output
│   ├── job_duckdb.pdf                # Generated visualization output
│   ├── job_postgresql.pdf            # Generated visualization output
│   ├── selectivity_scale.pdf         # Generated visualization output
│   ├── thread1.pdf                   # Generated visualization output
│   └── thread2.pdf                   # Generated visualization output
├── scripts/                          # Utility scripts
│   ├── update_paths.sh               # Update data paths in SQL files
│   ├── load_data_duckdb.sh           # Load all data into DuckDB
│   ├── load_data_pg.sh               # Load all data into PostgreSQL
│   └── download_graph.sh             # Download graph dataset
└── figure/                           # Documentation figures
    ├── 1.png
    ├── 2.png
    ├── 3.png
    └── 4.png

Part2: Extra Information [Optional]

Structure Overview

  • Web-based Interface
  • Java Parser Backend
  • Python Optimizer & Rewriter Backend
  1. Preprocessing [optional].
  • Statistics: To generate new statistics (cost.csv), we offer the DuckDB-version scripts query/preprocess.sh and query/gen_cost.sh. Modify the configurations in them, then run the scripts. For the web UI, move the generated statistics files to the folders graph/q1/, tpch/q2/, lsqb/q1/, job/1a/, and custom/q1/, respectively; for command-line operation, move them to the corresponding query folders.
  • Plan: We also provide conversion of DuckDB plans. Modify the DuckDB and Python paths in gen_plan.sh, then execute the following command. Afterwards, the original DuckDB plan is written to db_plan.json and the newly generated plan to plan.json, which is suitable for our parser. Here ${DB_FILE_PATH} is a persistent DuckDB database. To use a self-defined plan, change the parameter to timeout=0 in the requests.post call at main.py:223.
$ ./gen_plan.sh ${DB_FILE_PATH} ${QUERY_DIRECTORY}
e.g.
./gen_plan.sh ~/test_db job
  2. We provide two execution modes. The default mode is web-ui execution. To switch modes, modify the value of EXEC_MODE at Line 767 in main.py.

Web-UI

  1. Execute main.py to launch the Python backend rewriter component.
$ python main.py
  2. Launch the Java backend parser component with the command java -jar sparksql-plus-web-jar-with-dependencies.jar, built from SparkSQLPlus, which is included as a submodule. [Optional] You can also build the jar file yourself.
  3. Use the following commands to initialize and update the submodule.
$ git submodule init
$ git submodule update [--remote]
    or
$ git submodule update --init --recursive
  4. Open the webpage at http://localhost:8848.
  5. Begin submitting queries for execution on the webpage.

Command Line [Default]

  1. Modify the Python path (PYTHON_ENV) in auto_rewrite.sh.
  2. Execute the following command to get the rewritten queries. The rewrite time is recorded in rewrite_time.txt.
  3. OPTIONS
  • Mode: set the code-generation mode, D (DuckDB) / M (MySQL) [default: D]
  • Yannakakis/Yannakakis-Plus: set Y for Yannakakis, N for Yannakakis-Plus [default: N]
$ bash start_parser.sh
Parser started.
$ ./auto_rewrite.sh ${DDL_NAME} ${QUERY_DIR} [OPTIONS]
e.g. ./auto_rewrite.sh lsqb lsqb M N
  4. To run a single query, edit the code marked # NOTE: single query keeps here in function init_global_vars (Line 587 - Line 589 in main.py), and comment out the block labeled # NOTE: auto-rewrite keeps here (the code between the two blank lines, Line 610 - Line 629 in main.py).

Demonstration

Step 1

[screenshot: figure/1.png]

Step 2

[screenshot: figure/2.png]

Step 3

[screenshot: figure/3.png]

Step 4

[screenshot: figure/4.png]

NOTE

  • For queries like SELECT DISTINCT ..., please remove the DISTINCT keyword before parsing.
  • Use the jps command to find the parser PID (the process name is jar), then kill it, e.g. with the one-liner below.
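
A hedged one-liner combining both steps (assumes exactly one JVM process named jar is running):

# jps prints "<pid> <name>"; select the pid whose name is jar and kill it
$ kill "$(jps | awk '$2 == "jar" {print $1}')"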
