This small project demonstrates how to stream data from Pub/Sub into DuckDB using DuckLake.
From the root directory of the project:
```bash
docker compose up -d
```

To stop the containers:
```bash
docker compose down
```

The data is persisted in the `data` directory. After shutdown, the subscriber flushes the data to Parquet files. To read the data using DuckDB:
```sql
INSTALL ducklake;
ATTACH 'ducklake:data/duckdb_streaming.ducklake' AS db;
SELECT * FROM db.transactions;
```

Note that you can only query the data after the subscriber container has been stopped: the subscriber holds a concurrency lock on the DuckDB database while it is running.
This project uses the Google Cloud Pub/Sub emulator as a message broker. It has not been tested with the real Google Cloud Pub/Sub service.
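Clients discover the emulator through the `PUBSUB_EMULATOR_HOST` environment variable. A minimal sketch, assuming the emulator is exposed on `localhost:8085` (its default port):

```python
import os

# When PUBSUB_EMULATOR_HOST is set, the google-cloud-pubsub client
# library routes all traffic to the emulator and skips credentials.
os.environ["PUBSUB_EMULATOR_HOST"] = "localhost:8085"

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()  # connects to the emulator, not GCP
```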
The publisher first creates a topic, then continually receives messages from the Blockchain websocket (unconfirmed transactions) and publishes them to the topic.
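A minimal sketch of the publisher loop, assuming Python with the `google-cloud-pubsub` and `websocket-client` packages; the Blockchain.com endpoint `wss://ws.blockchain.info/inv`, the project ID, and the topic ID below are all illustrative assumptions, not taken from the project:

```python
import json
import os

import websocket  # pip install websocket-client
from google.api_core.exceptions import AlreadyExists
from google.cloud import pubsub_v1

os.environ.setdefault("PUBSUB_EMULATOR_HOST", "localhost:8085")

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("local-project", "transactions")
try:
    publisher.create_topic(request={"name": topic_path})
except AlreadyExists:
    pass  # topic may already exist from a previous run

# Subscribe to the unconfirmed-transactions feed (assumed endpoint).
ws = websocket.create_connection("wss://ws.blockchain.info/inv")
ws.send(json.dumps({"op": "unconfirmed_sub"}))

while True:
    raw = ws.recv()  # one JSON message per unconfirmed transaction
    publisher.publish(topic_path, data=raw.encode("utf-8"))
```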
The subscriber first creates a subscription to the topic, then continually receives messages from the topic and writes them to the DuckLake table. Data inlining is enabled so that DuckDB doesn't write a separate Parquet file for every record. When the process is stopped (it receives a SIGTERM signal), the subscriber flushes the inlined data to Parquet files.
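A corresponding sketch of the subscriber, under the same assumptions as above; the table schema, the `DATA_INLINING_ROW_LIMIT` value, and the use of DuckLake's `ducklake_flush_inlined_data` function are illustrative guesses at how the project wires these pieces together, not its actual code:

```python
import os
import signal

import duckdb
from google.api_core.exceptions import AlreadyExists
from google.cloud import pubsub_v1

os.environ.setdefault("PUBSUB_EMULATOR_HOST", "localhost:8085")

con = duckdb.connect()
con.execute("INSTALL ducklake; LOAD ducklake;")
# DATA_INLINING_ROW_LIMIT keeps small inserts inlined in the metadata
# catalog instead of writing one Parquet file per record.
con.execute(
    "ATTACH 'ducklake:data/duckdb_streaming.ducklake' AS db "
    "(DATA_INLINING_ROW_LIMIT 1000)"
)
con.execute("CREATE TABLE IF NOT EXISTS db.transactions (payload JSON)")

subscriber = pubsub_v1.SubscriberClient()
topic_path = subscriber.topic_path("local-project", "transactions")
subscription_path = subscriber.subscription_path("local-project", "transactions-sub")
try:
    subscriber.create_subscription(
        request={"name": subscription_path, "topic": topic_path}
    )
except AlreadyExists:
    pass

def callback(message):
    # Simplified: a production subscriber should serialize writes,
    # since Pub/Sub callbacks can run on multiple threads.
    con.execute("INSERT INTO db.transactions VALUES (?)", [message.data.decode()])
    message.ack()

streaming_pull = subscriber.subscribe(subscription_path, callback=callback)
signal.signal(signal.SIGTERM, lambda signum, frame: streaming_pull.cancel())

try:
    streaming_pull.result()
except Exception:
    pass  # streaming pull cancelled on SIGTERM

# On shutdown, flush inlined rows out to Parquet files.
con.execute("CALL ducklake_flush_inlined_data('db')")
```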