CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

Project Overview

pg_lake integrates Apache Iceberg and data lake files (Parquet, CSV, JSON) into PostgreSQL, enabling PostgreSQL to function as a lakehouse system. The architecture consists of two main components:

PostgreSQL with pg_lake extensions: Handles query planning, transaction boundaries, and orchestration
pgduck_server: A separate multi-threaded process that implements the PostgreSQL wire protocol and delegates computation to DuckDB's columnar execution engine

Users connect only to PostgreSQL. The pg_lake extensions transparently delegate data scanning and computation to pgduck_server (running DuckDB) when appropriate, while maintaining full transactional guarantees.

Build Commands

First-time build

# Install vcpkg dependencies (required once)
export VCPKG_VERSION=2025.10.17
git clone --recurse-submodules --branch $VCPKG_VERSION https://github.com/Microsoft/vcpkg.git
./vcpkg/bootstrap-vcpkg.sh
./vcpkg/vcpkg install azure-identity-cpp azure-storage-blobs-cpp azure-storage-files-datalake-cpp openssl
export VCPKG_TOOLCHAIN_PATH="$(pwd)/vcpkg/scripts/buildsystems/vcpkg.cmake"

# Build and install all extensions and pgduck_server
make install

Subsequent builds

# Fast build that skips rebuilding DuckDB if already built
make install-fast

Component-specific builds

# Build/install individual extensions
make install-pg_lake_iceberg
make install-pg_lake_table
make install-pgduck_server

# Build all extensions (top-level meta extension)
make install-pg_lake

Running pg_lake

Required setup

-- In postgresql.conf:
shared_preload_libraries = 'pg_extension_base'

-- Connect to PostgreSQL and create extensions:
CREATE EXTENSION pg_lake CASCADE;
-- This installs: pg_extension_base, pg_map, pg_extension_updater,
-- pg_lake_engine, pg_lake_iceberg, pg_lake_table, pg_lake_copy, pg_lake

-- Set default location for Iceberg tables:
SET pg_lake_iceberg.default_location_prefix TO 's3://your-bucket/pglake';

Starting pgduck_server

# Start pgduck_server (must be running for pg_lake to work)
pgduck_server

# With options:
pgduck_server --memory_limit '8GB' --cache_dir /tmp/cache --init_file_path /path/to/init.sql

# pgduck_server listens on port 5332 (unix socket /tmp by default)
# You can connect directly to pgduck_server for debugging:
psql -p 5332 -h /tmp

Testing

The project uses pytest for all regression testing (not traditional PostgreSQL SQL regression tests).

Running all tests

# Install test dependencies (first time only)
pipenv install --dev

# Run all local tests
make check

# Run end-to-end tests (requires S3/cloud access)
make check-e2e

# Run upgrade tests
make check-upgrade

# Run all tests (local + e2e)
make check  # includes check-local and check-e2e

Running tests for specific components

# Test specific extension
make check-pg_lake_table
make check-pg_lake_iceberg
make check-pgduck_server

# Run isolation tests
make check-isolation_pg_lake_table

Running installcheck

# Start pgduck_server with test configuration
pgduck_server --init_file_path pgduck_server/tests/test_secrets.sql --cache_dir /tmp/cache &

# Run installcheck (tests against installed extensions)
make installcheck

# Run installcheck for specific component
make installcheck-pg_lake_table

Running individual pytest tests

cd pg_lake_table
PYTHONPATH=../test_common pipenv run pytest -v tests/pytests/test_specific.py
PYTHONPATH=../test_common pipenv run pytest -v tests/pytests/test_specific.py::test_function_name

Testing with PostgreSQL regression suite

# Run PostgreSQL's own tests with pg_lake extensions loaded
export PG_REGRESS_DIR=/path/to/postgres/src/test/regress

# Run tests
make installcheck-postgres PG_REGRESS_DIR=$PG_REGRESS_DIR
make installcheck-postgres-with_extensions_created PG_REGRESS_DIR=$PG_REGRESS_DIR

Extension Architecture

pg_lake follows a modular design with interoperating components. Each extension has a specific responsibility:

Extension dependency chain

pg_lake (meta-extension)
├── pg_lake_table (FDW for querying data lake files)
│   └── pg_lake_iceberg (Iceberg specification implementation)
│       └── pg_lake_engine (common module for pg_lake extensions)
│           ├── pg_extension_base (foundation for all extensions)
│           ├── pg_map (generic map type)
│           └── pg_extension_updater (automatic extension updates)
└── pg_lake_copy (COPY to/from data lake)
    └── pg_lake_engine

pg_lake_spatial (optional, depends on PostGIS)
pg_lake_benchmark (optional, for benchmarking)

Core extensions

pg_extension_base: Foundation for all extensions, provides common utilities
pg_extension_updater: Automatically updates extensions on startup
pg_map: Generic map/key-value type generator for semi-structured data
pg_lake_engine: Common module shared by data lake extensions (depends on Apache Avro)
pg_lake_iceberg: Full Iceberg v2 protocol implementation with transactional support
pg_lake_table: Foreign data wrapper to query Parquet/CSV/JSON/Iceberg files
pg_lake_copy: COPY command extensions for importing/exporting to data lakes
pg_lake: Meta-extension that installs all required extensions via CASCADE

External components

pgduck_server: Standalone server implementing PostgreSQL wire protocol, executes queries via DuckDB
duckdb_pglake: DuckDB extension adding PostgreSQL-compatible functions to DuckDB
avro: Apache Avro library (patched) for Iceberg metadata handling

Important File Locations

Build system

Makefile: Top-level build orchestration for all components
shared.mk: Shared Makefile rules for extensions
Each extension has its own Makefile following PGXS conventions

Tests

<extension>/tests/pytests/: Main pytest test suites
<extension>/tests/e2e/: End-to-end tests requiring cloud storage
<extension>/tests/isolation/: Isolation tester tests for concurrency
test_common/: Shared test utilities and fixtures
pytest.ini: Root pytest configuration

Documentation

docs/building-from-source.md: Detailed build instructions
docs/iceberg-tables.md: Iceberg table usage
docs/query-data-lake-files.md: Foreign table usage
docs/data-lake-import-export.md: COPY command usage

Maintaining Installation Documentation

Overview

The repository provides both automated installation (install.sh) and comprehensive manual documentation (docs/building-from-source.md). These must be kept in sync when dependencies or build steps change.

install.sh

Location: install.sh at repository root
Purpose: Automated installation script for developers to quickly set up pg_lake environments
Capabilities:
- Installs to existing PostgreSQL (default) or builds PostgreSQL from source (--build-postgres)
- Installs vcpkg and Azure SDK dependencies
- Builds and installs pg_lake extensions
- Optionally installs test dependencies: PostGIS, pgAudit, pg_cron, azurite, pipenv, Java 21+, JDBC driver
- Initializes PostgreSQL database cluster
Platform support: RHEL/AlmaLinux, Debian/Ubuntu, macOS

docs/building-from-source.md

Location: docs/building-from-source.md
Purpose: Comprehensive manual installation guide with both automated and manual approaches
Structure:
1. Quick Start - points to install.sh for common cases
2. Manual installation to existing PostgreSQL - minimal steps for users with PostgreSQL already installed
3. Full development environment setup - detailed manual steps for building everything from source
4. Test dependencies - optional components for running test suite
5. Running tests - how to execute pytest suites

When to Update Both Files

System dependencies change:

Update package lists in both install.sh (install_system_deps function) and docs/building-from-source.md (System Dependencies section)
Ensure all three platforms (Debian, RHEL, macOS) are updated consistently

PostgreSQL build process changes:

Update install.sh (install_postgres function) and docs/building-from-source.md (Build PostgreSQL from Source section)
Keep configure flags, contrib modules, and test modules in sync

vcpkg or Azure SDK versions change:

Update VCPKG_VERSION variable in install.sh, docs/building-from-source.md, and this CLAUDE.md file
Update vcpkg package names if they change

New test dependencies:

Add to install.sh (install_test_deps function)
Add to docs/building-from-source.md (Test Dependencies section)
Update the installation checks to be idempotent (skip if already installed)

pg_lake build process changes:

Usually handled by Makefile, but if manual steps needed, document in docs/building-from-source.md

Testing install.sh Changes

Before committing changes to install.sh:

Test on a clean environment if possible
Test with both --build-postgres (full dev setup) and without (existing PostgreSQL)
Test --with-test-deps flag to ensure all test dependencies install correctly
Verify idempotency - running the script twice should not fail or duplicate work
Check that the summary output shows correct environment variables and next steps

Documentation Best Practices

Keep install.sh focused on automation - don't add extensive comments explaining "why", put that in the docs
Keep docs/building-from-source.md comprehensive - explain the "why" behind each step
When adding new flags to install.sh, document them in the "Development Environment Options" section of the docs
System dependency lists should match exactly across install.sh and docs for each platform
Test the manual instructions in docs/building-from-source.md periodically to ensure they still work

Code Conventions

C code (PostgreSQL extensions and pgduck_server)

Indentation: Use pgindent before commits (see make reindent)
Naming:
- Variables/functions: lower_case_with_underscores
- Macros/constants: UPPER_CASE_WITH_UNDERSCORES
- Structs/enums: CamelCase
- Global variables: Prefix with module identifier (e.g., IcebergTableCache)
Comments: Focus on "why" not "what"; use block comments for complex algorithms
Typedefs: Download from buildfarm via make typedefs for pgindent

Python code

Formatting: Use black (see make reindent or pipenv run black)
Style: Follow pytest conventions for test naming and fixtures

Before committing

# Format all code
make reindent

# Check formatting
make check-indent

Local Development with MinIO

For testing without cloud S3 latency, use MinIO locally:

# Install and start MinIO
brew install minio  # or download from min.io
minio server /tmp/data

# Access UI at http://localhost:9000
# Create access key: testkey / testpassword
# Create bucket: localbucket

# Add to ~/.aws/config:
[services testing-minio]
s3 =
   endpoint_url = http://localhost:9000

[profile minio]
region = us-east-1
services = testing-minio
aws_access_key_id = testkey
aws_secret_access_key = testpassword

# Configure pgduck_server (connect to port 5332)
psql -p 5332 -h /tmp -c "
CREATE SECRET s3testMinio (
    TYPE S3,
    KEY_ID 'testkey',
    SECRET 'testpassword',
    ENDPOINT 'localhost:9000',
    SCOPE 's3://localbucket',
    URL_STYLE 'path',
    USE_SSL false
);"

# Use in PostgreSQL
psql -c "SET pg_lake_iceberg.default_location_prefix TO 's3://localbucket'"

Common Development Workflows

Adding new functionality to an extension

Read existing code in <extension>/src/ to understand patterns
Modify C source files in src/
If adding SQL functions, update <extension>--<version>.sql
Add pytest tests in tests/pytests/

Build and test:

make install-<extension>
make check-<extension>

Creating a version upgrade script

# Bump all extension versions at once
python tools/bump_extension_versions.py 3.1

# This creates upgrade stubs like: pg_lake_engine--3.0--3.1.sql

Debugging query execution

# Connect to PostgreSQL and check pg_lake query plans
psql -c "EXPLAIN (VERBOSE) SELECT * FROM iceberg_table"

# Connect directly to pgduck_server to see DuckDB execution
psql -p 5332 -h /tmp
postgres=> SELECT * FROM duckdb_settings();
postgres=> EXPLAIN SELECT ...;

Viewing Iceberg metadata

-- Show all Iceberg tables and their metadata locations
SELECT table_name, metadata_location FROM iceberg_tables;

-- View Iceberg snapshots
SELECT * FROM iceberg_snapshots('<table_name>');

Key PostgreSQL GUCs (Settings)

-- Required: preload base extension in postgresql.conf
shared_preload_libraries = 'pg_extension_base'

-- Iceberg settings
pg_lake_iceberg.default_location_prefix = 's3://bucket/prefix'

CI and Testing Notes

JDBC driver path must be set for Spark verification tests: JDBC_DRIVER_PATH=~/pg_lake-deps/jdbc/postgresql-42.7.10.jar (automatically installed by ./install.sh --with-test-deps)
Java 21+ required for Polaris REST catalog tests (automatically installed by ./install.sh --with-test-deps)
Azure tests require azurite (install via npm: npm install -g azurite, or use ./install.sh --with-test-deps)

FilesExpand file tree

CLAUDE.md

Latest commit

History

CLAUDE.md

File metadata and controls

CLAUDE.md

Project Overview

Build Commands

First-time build

Subsequent builds

Component-specific builds

Running pg_lake

Required setup

Starting pgduck_server

Testing

Running all tests

Running tests for specific components

Running installcheck

Running individual pytest tests

Testing with PostgreSQL regression suite

Extension Architecture

Extension dependency chain

Core extensions

External components

Important File Locations

Build system

Tests

Documentation

Maintaining Installation Documentation

Overview

install.sh

docs/building-from-source.md

When to Update Both Files

Testing install.sh Changes

Documentation Best Practices

Code Conventions

C code (PostgreSQL extensions and pgduck_server)

Python code

Before committing

Local Development with MinIO

Common Development Workflows

Adding new functionality to an extension

Creating a version upgrade script

Debugging query execution

Viewing Iceberg metadata

Key PostgreSQL GUCs (Settings)

CI and Testing Notes