- XDBC is a holistic, high-performance framework for fast and scalable data transfer across heterogeneous data systems (e.g., DBMS to dataframes), aiming to combine the generality of generic solutions with the performance of specialized connectors.
- It decomposes data transfer into a configurable pipeline (read -> deserialize -> compress -> send/receive -> decompress -> serialize -> write) with pipeline-parallel execution and a ring-buffer memory manager for low resource overhead.
- The core of the framework (xdbc-client and xdbc-server) is written in C++, with bindings available for Python and Spark. It includes built-in adapters to connect to PostgreSQL, CSV, Parquet, and Pandas.
- The project includes a lightweight heuristic optimizer, implemented in Python, that automatically tunes parallelism, buffer sizes, intermediate formats, and compression algorithms to the current environment.
XDBC consists of multiple repositories covering the cross-system functionality. For the reproducibility experiments, the following repositories will be cloned and used:
- `xdbc-client`: client-side module, for loading data into the target system.
- `xdbc-server`: server-side module, for extracting the data from the source system.
- `xdbc-python`: Python bindings, for loading data into Pandas (through pybind).
- `xdbc-spark`: Spark bindings, for loading data into a Spark RDD (through a custom DataSource with JNI).
- `pg_xdbc_fdw`: PostgreSQL Foreign Data Wrapper, for loading data into a table.
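These repositories are cloned automatically by the reproducibility script (run_experiments.sh, described below). Done by hand, the clone step would look roughly like the following sketch; the GitHub organization `<org>` is a placeholder assumption, not taken from the artifact:

```bash
# Manual equivalent of the script's clone step (sketch only);
# <org> is a placeholder for the actual GitHub organization.
for repo in xdbc-client xdbc-server xdbc-python xdbc-spark pg_xdbc_fdw; do
    git clone "https://github.com/<org>/${repo}.git"
done
```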
The experiments were validated on Ubuntu 22.04 LTS and should work on Ubuntu 22.04 or later (and generally on recent Linux distributions). On Ubuntu, the following packages are required:
- `bash`
- `docker-ce` (tested with version 28.3.3)
- `docker-compose-plugin` (tested with version v2.39.1)
- `git`
- `make`
You can install them with:
```bash
sudo apt-get update
sudo apt-get install -y bash docker-ce docker-compose-plugin git make
```

For better interoperability and a streamlined setup, the experiments load all datasets into memory; therefore, the /dev/shm region should have at least 40 GB of available space.
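Before starting, it is worth verifying that /dev/shm is large enough. Since /dev/shm is a tmpfs mount, it can also be enlarged for the current session; the 40G value below mirrors the requirement above:

```bash
# Check the size and free space of the in-memory filesystem
df -h /dev/shm

# Optionally enlarge it for the current session
# (tmpfs remount; does not persist across reboots)
sudo mount -o remount,size=40G /dev/shm
```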
To reproduce our experiments, one needs to execute the run_experiments.sh script located in the root project directory. The script consists of seven steps, each of which can be disabled manually between lines 20-26 (a minimal invocation is sketched after this list):
- The script clones all necessary repositories.
- The XDBC infrastructure is set up via Docker.
- The datasets are downloaded, extracted and copied into memory.
- The container for handling the experiments is set up.
- The data for the transfer experiments is prepared.
- The actual experiments are executed. To re-run only specific experiments, comment out the corresponding commands between lines 135-154.
- The plots are created and copied as PDFs to xdbc-sigmod-ari/experiment_files/res/pdf_plots/.
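Putting the steps together, a minimal end-to-end invocation might look as follows; the clone URL is a placeholder (substitute the actual artifact location):

```bash
# Fetch the artifact repository (placeholder URL) and run all steps
git clone "https://github.com/<org>/xdbc-sigmod-ari.git"
cd xdbc-sigmod-ari

# Steps can be disabled by editing lines 20-26 of the script first
./run_experiments.sh
```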
- The specific commands executed to run the experiments are located in the Makefile.
- The code that executes the experiments is located in ./experiment_files.
- The code to create the plots is located in ./experiment_files/e2e_plotter.py .
- The code for the heuristic optimizer used in the experiments is located under ./optimizer.
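For example, to regenerate only the plots from existing results, the plotter can be invoked directly. This is a sketch under the assumption that e2e_plotter.py runs standalone without arguments; check the Makefile for the exact invocation used by the pipeline:

```bash
# Re-create the plots from existing experiment results
# (assumed standalone invocation; the Makefile holds the
#  authoritative command line)
python3 experiment_files/e2e_plotter.py
```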
We executed our data transfer experiments on compute nodes with the following specifications:
| Component | Details |
|---|---|
| CPU | Dual-socket Intel Xeon Silver 4216 @ 2.10 GHz with 16 cores and 32 threads each |
| Memory | 512 GB main memory |
| Storage | 2 TB SSD |
| OS | Ubuntu 20.04.2 LTS |