This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
pg_lake integrates Apache Iceberg and data lake files (Parquet, CSV, JSON) into PostgreSQL, enabling PostgreSQL to function as a lakehouse system. The architecture consists of two main components:
- PostgreSQL with pg_lake extensions: Handles query planning, transaction boundaries, and orchestration
- pgduck_server: A separate multi-threaded process that implements the PostgreSQL wire protocol and delegates computation to DuckDB's columnar execution engine
Users connect only to PostgreSQL. The pg_lake extensions transparently delegate data scanning and computation to pgduck_server (running DuckDB) when appropriate, while maintaining full transactional guarantees.
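A minimal sketch of that flow (table, columns, and values are illustrative; see docs/iceberg-tables.md for the exact syntax):

```sql
-- All statements run over a normal PostgreSQL connection; pg_lake
-- transparently pushes the scan and aggregation to pgduck_server/DuckDB.
CREATE TABLE measurements (station text, reading double precision) USING iceberg;
INSERT INTO measurements VALUES ('st-1', 21.5), ('st-2', 19.0);
SELECT station, avg(reading) FROM measurements GROUP BY station;
```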
# Install vcpkg dependencies (required once)
export VCPKG_VERSION=2025.10.17
git clone --recurse-submodules --branch $VCPKG_VERSION https://github.com/Microsoft/vcpkg.git
./vcpkg/bootstrap-vcpkg.sh
./vcpkg/vcpkg install azure-identity-cpp azure-storage-blobs-cpp azure-storage-files-datalake-cpp openssl
export VCPKG_TOOLCHAIN_PATH="$(pwd)/vcpkg/scripts/buildsystems/vcpkg.cmake"
# Build and install all extensions and pgduck_server
make install

# Fast build that skips rebuilding DuckDB if already built
make install-fast

# Build/install individual extensions
make install-pg_lake_iceberg
make install-pg_lake_table
make install-pgduck_server
# Build all extensions (top-level meta extension)
make install-pg_lake

-- In postgresql.conf:
shared_preload_libraries = 'pg_extension_base'
-- Connect to PostgreSQL and create extensions:
CREATE EXTENSION pg_lake CASCADE;
-- This installs: pg_extension_base, pg_map, pg_extension_updater,
-- pg_lake_engine, pg_lake_iceberg, pg_lake_table, pg_lake_copy, pg_lake
-- Set default location for Iceberg tables:
SET pg_lake_iceberg.default_location_prefix TO 's3://your-bucket/pglake';
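To confirm what CASCADE installed, a standard catalog query works:

```sql
-- Lists installed extensions and versions (built-in pg_extension catalog)
SELECT extname, extversion FROM pg_extension ORDER BY extname;
```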
# Start pgduck_server (must be running for pg_lake to work)
pgduck_server
# With options:
pgduck_server --memory_limit '8GB' --cache_dir /tmp/cache --init_file_path /path/to/init.sql
# pgduck_server listens on port 5332 (unix domain socket in /tmp by default)
# You can connect directly to pgduck_server for debugging:
psql -p 5332 -h /tmp

The project uses pytest for all regression testing (not traditional PostgreSQL SQL regression tests).
# Install test dependencies (first time only)
pipenv install --dev
# Run all local tests
make check
# Run end-to-end tests (requires S3/cloud access)
make check-e2e
# Run upgrade tests
make check-upgrade
# Run all tests (local + e2e)
make check # includes check-local and check-e2e

# Test specific extension
make check-pg_lake_table
make check-pg_lake_iceberg
make check-pgduck_server
# Run isolation tests
make check-isolation_pg_lake_table

# Start pgduck_server with test configuration
pgduck_server --init_file_path pgduck_server/tests/test_secrets.sql --cache_dir /tmp/cache &
# Run installcheck (tests against installed extensions)
make installcheck
# Run installcheck for specific component
make installcheck-pg_lake_table

cd pg_lake_table
PYTHONPATH=../test_common pipenv run pytest -v tests/pytests/test_specific.py
PYTHONPATH=../test_common pipenv run pytest -v tests/pytests/test_specific.py::test_function_name

# Run PostgreSQL's own tests with pg_lake extensions loaded
export PG_REGRESS_DIR=/path/to/postgres/src/test/regress
# Run tests
make installcheck-postgres PG_REGRESS_DIR=$PG_REGRESS_DIR
make installcheck-postgres-with_extensions_created PG_REGRESS_DIR=$PG_REGRESS_DIR

pg_lake follows a modular design with interoperating components. Each extension has a specific responsibility:
pg_lake (meta-extension)
├── pg_lake_table (FDW for querying data lake files)
│   └── pg_lake_iceberg (Iceberg specification implementation)
│       └── pg_lake_engine (common module for pg_lake extensions)
│           ├── pg_extension_base (foundation for all extensions)
│           ├── pg_map (generic map type)
│           └── pg_extension_updater (automatic extension updates)
└── pg_lake_copy (COPY to/from data lake)
    └── pg_lake_engine
pg_lake_spatial (optional, depends on PostGIS)
pg_lake_benchmark (optional, for benchmarking)
- pg_extension_base: Foundation for all extensions, provides common utilities
- pg_extension_updater: Automatically updates extensions on startup
- pg_map: Generic map/key-value type generator for semi-structured data
- pg_lake_engine: Common module shared by data lake extensions (depends on Apache Avro)
- pg_lake_iceberg: Full Iceberg v2 protocol implementation with transactional support
- pg_lake_table: Foreign data wrapper to query Parquet/CSV/JSON/Iceberg files (see the usage sketch after this list)
- pg_lake_copy: COPY command extensions for importing/exporting to data lakes (also shown in the sketch below)
- pg_lake: Meta-extension that installs all required extensions via CASCADE
- pgduck_server: Standalone server implementing PostgreSQL wire protocol, executes queries via DuckDB
- duckdb_pglake: DuckDB extension adding PostgreSQL-compatible functions to DuckDB
- avro: Apache Avro library (patched) for Iceberg metadata handling
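A minimal usage sketch for pg_lake_table and pg_lake_copy. The `pg_lake` server name, the schema-inferring empty column list, the `path` option, and extension-based format inference are assumptions here; docs/query-data-lake-files.md and docs/data-lake-import-export.md are authoritative:

```sql
-- Query Parquet files in place through the pg_lake_table FDW
CREATE FOREIGN TABLE trips ()
    SERVER pg_lake
    OPTIONS (path 's3://your-bucket/trips/*.parquet');
SELECT count(*) FROM trips;

-- Export query results back to object storage via pg_lake_copy
COPY (SELECT * FROM trips LIMIT 100) TO 's3://your-bucket/sample.parquet';
```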
- Makefile: Top-level build orchestration for all components
- shared.mk: Shared Makefile rules for extensions
- Each extension has its own Makefile following PGXS conventions
- <extension>/tests/pytests/: Main pytest test suites
- <extension>/tests/e2e/: End-to-end tests requiring cloud storage
- <extension>/tests/isolation/: Isolation tester tests for concurrency
- test_common/: Shared test utilities and fixtures
- pytest.ini: Root pytest configuration
- docs/building-from-source.md: Detailed build instructions
- docs/iceberg-tables.md: Iceberg table usage
- docs/query-data-lake-files.md: Foreign table usage
- docs/data-lake-import-export.md: COPY command usage
The repository provides both automated installation (install.sh) and comprehensive manual documentation (docs/building-from-source.md). These must be kept in sync when dependencies or build steps change.
- Location: install.sh at repository root
- Purpose: Automated installation script for developers to quickly set up pg_lake environments
- Capabilities:
- Installs to existing PostgreSQL (default) or builds PostgreSQL from source (--build-postgres)
- Installs vcpkg and Azure SDK dependencies
- Builds and installs pg_lake extensions
- Optionally installs test dependencies: PostGIS, pgAudit, pg_cron, azurite, pipenv, Java 21+, JDBC driver
- Initializes PostgreSQL database cluster
- Platform support: RHEL/AlmaLinux, Debian/Ubuntu, macOS
- Location: docs/building-from-source.md
- Purpose: Comprehensive manual installation guide with both automated and manual approaches
- Structure:
- Quick Start - points to install.sh for common cases
- Manual installation to existing PostgreSQL - minimal steps for users with PostgreSQL already installed
- Full development environment setup - detailed manual steps for building everything from source
- Test dependencies - optional components for running test suite
- Running tests - how to execute pytest suites
System dependencies change:
- Update package lists in both install.sh (install_system_deps function) and docs/building-from-source.md (System Dependencies section)
- Ensure all three platforms (Debian, RHEL, macOS) are updated consistently
PostgreSQL build process changes:
- Update install.sh (install_postgres function) and docs/building-from-source.md (Build PostgreSQL from Source section)
- Keep configure flags, contrib modules, and test modules in sync
vcpkg or Azure SDK versions change:
- Update the VCPKG_VERSION variable in install.sh, docs/building-from-source.md, and this CLAUDE.md file
- Update vcpkg package names if they change
New test dependencies:
- Add to install.sh (install_test_deps function)
- Add to docs/building-from-source.md (Test Dependencies section)
- Update the installation checks to be idempotent (skip if already installed)
pg_lake build process changes:
- Usually handled by the Makefile, but if manual steps are needed, document them in docs/building-from-source.md
Before committing changes to install.sh:
- Test on a clean environment if possible
- Test with both --build-postgres (full dev setup) and without (existing PostgreSQL)
- Test the --with-test-deps flag to ensure all test dependencies install correctly
- Verify idempotency: running the script twice should not fail or duplicate work
- Check that the summary output shows correct environment variables and next steps
- Keep install.sh focused on automation - don't add extensive comments explaining "why", put that in the docs
- Keep docs/building-from-source.md comprehensive - explain the "why" behind each step
- When adding new flags to install.sh, document them in the "Development Environment Options" section of the docs
- System dependency lists should match exactly across install.sh and docs for each platform
- Test the manual instructions in docs/building-from-source.md periodically to ensure they still work
- Indentation: Use pgindent before commits (see make reindent)
- Naming:
  - Variables/functions: lower_case_with_underscores
  - Macros/constants: UPPER_CASE_WITH_UNDERSCORES
  - Structs/enums: CamelCase
  - Global variables: Prefix with module identifier (e.g., IcebergTableCache)
- Comments: Focus on "why" not "what"; use block comments for complex algorithms
- Typedefs: Download from buildfarm via make typedefs for pgindent
- Formatting: Use black (see make reindent or pipenv run black)
- Style: Follow pytest conventions for test naming and fixtures
# Format all code
make reindent
# Check formatting
make check-indent

For testing without cloud S3 latency, use MinIO locally:
# Install and start MinIO
brew install minio # or download from min.io
minio server /tmp/data
# Access UI at http://localhost:9000
# Create access key: testkey / testpassword
# Create bucket: localbucket
# Add to ~/.aws/config:
[services testing-minio]
s3 =
  endpoint_url = http://localhost:9000
[profile minio]
region = us-east-1
services = testing-minio
aws_access_key_id = testkey
aws_secret_access_key = testpassword
# Configure pgduck_server (connect to port 5332)
psql -p 5332 -h /tmp -c "
CREATE SECRET s3testMinio (
TYPE S3,
KEY_ID 'testkey',
SECRET 'testpassword',
ENDPOINT 'localhost:9000',
SCOPE 's3://localbucket',
URL_STYLE 'path',
USE_SSL false
);"
# Use in PostgreSQL
psql -c "SET pg_lake_iceberg.default_location_prefix TO 's3://localbucket'"- Read existing code in
- Read existing code in <extension>/src/ to understand patterns
- Modify C source files in src/
- If adding SQL functions, update <extension>--<version>.sql (see the sketch below)
- Add pytest tests in tests/pytests/
- Build and test:
  make install-<extension>
  make check-<extension>
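When the change adds a SQL-visible C function, the entry in <extension>--<version>.sql typically follows the standard PGXS pattern (function and symbol names below are placeholders):

```sql
-- MODULE_PATHNAME resolves to the extension's shared library
CREATE FUNCTION lake_example_row_count(path text)
RETURNS bigint
LANGUAGE C STRICT
AS 'MODULE_PATHNAME', 'lake_example_row_count';
```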
# Bump all extension versions at once
python tools/bump_extension_versions.py 3.1
# This creates upgrade stubs like: pg_lake_engine--3.0--3.1.sql
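Migration DDL for the new version goes into the generated stub; it runs when the extension is updated. A hypothetical example:

```sql
-- pg_lake_engine--3.0--3.1.sql (hypothetical contents)
-- Executed by: ALTER EXTENSION pg_lake_engine UPDATE TO '3.1';
CREATE FUNCTION lake_engine_new_helper()
RETURNS void
LANGUAGE C
AS 'MODULE_PATHNAME', 'lake_engine_new_helper';
```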
psql -c "EXPLAIN (VERBOSE) SELECT * FROM iceberg_table"
# Connect directly to pgduck_server to see DuckDB execution
psql -p 5332 -h /tmp
postgres=> SELECT * FROM duckdb_settings();
postgres=> EXPLAIN SELECT ...;

-- Show all Iceberg tables and their metadata locations
SELECT table_name, metadata_location FROM iceberg_tables;
-- View Iceberg snapshots
SELECT * FROM iceberg_snapshots('<table_name>');

-- Required: preload base extension in postgresql.conf
shared_preload_libraries = 'pg_extension_base'
-- Iceberg settings
pg_lake_iceberg.default_location_prefix = 's3://bucket/prefix'

- JDBC driver path must be set for Spark verification tests: JDBC_DRIVER_PATH=~/pg_lake-deps/jdbc/postgresql-42.7.10.jar (automatically installed by ./install.sh --with-test-deps)
- Java 21+ required for Polaris REST catalog tests (automatically installed by ./install.sh --with-test-deps)
- Azure tests require azurite (install via npm: npm install -g azurite, or use ./install.sh --with-test-deps)