@Tmonster commented Jan 7, 2026

Fixes https://github.com/duckdblabs/duckdb-internal/issues/6589
Fixes https://github.com/duckdblabs/duckdb-internal/issues/6590

Feature

This PR introduces lakehouse syntax in the DuckDB CREATE TABLE statement. Specifically for Iceberg, it adds the ability to set table properties, the data location, PARTITIONED BY, and SORTED BY in the CREATE TABLE statement.

Engines do this in different ways, but I find the Spark/Athena approach the cleanest, for the following reasons.

  • LOCATION, PARTITIONED BY, and SORTED BY affect how and where data is stored. These properties are also used when other engines optimize reads of the table data. PARTITIONED BY and SORTED BY create partition/sort specs in the metadata.json, which are preserved as the table updates and changes. The location is not preserved, but it is an integral part of defining an Iceberg table for certain catalogs.
  • Table properties are a key-value list of any properties a user wants to add; they are not always respected by every engine.

Another option was to support a WITH () clause and shove all of these new properties into it. I feel this would lead to confusion as Iceberg and DuckLake grow and change, and we would eventually move certain key/value pairs out to be more prominent in the CREATE TABLE statement anyway.

If any of these options are declared when creating a regular DuckDB table, an error is thrown.
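
As a rough sketch, the proposed syntax could look like the following for an Iceberg table. The catalog name `my_datalake`, the bucket path, and the property key are illustrative placeholders, not the final grammar:

```sql
-- sketch of the proposed clauses against an attached Iceberg catalog
CREATE TABLE my_datalake.default.tbl1 (
    id BIGINT,
    part VARCHAR
)
PARTITIONED BY (part)
SORTED BY (id)
LOCATION 's3://my-bucket/tbl1/'
TBLPROPERTIES ('key' = 'value');

-- the same clauses on a regular DuckDB table would throw an error
CREATE TABLE memory.main.tbl1 (id BIGINT)
PARTITIONED BY (id); -- error
```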

Examples in other engines
Athena

CREATE TABLE
  [db_name.]table_name (col_name data_type [COMMENT col_comment] [, ...] )
  [PARTITIONED BY (col_name | transform, ... )]
  [SORTED BY (col_name | transform, ... )] -- not in the example but is technically possible
  LOCATION 's3://amzn-s3-demo-bucket/your-folder/'
  TBLPROPERTIES ( 'table_type' ='ICEBERG' [, property_name=property_value] )

Spark

CREATE TABLE prod.db.sample (
    id bigint,
    data string,
    category string,
    ts timestamp)
USING iceberg
PARTITIONED BY (bucket(16, id), days(ts), category);

Flink

CREATE TABLE `hive_catalog`.`default`.`sample` (
    id BIGINT COMMENT 'unique id',
    data STRING NOT NULL
) 
PARTITIONED BY (data) 
WITH ('format-version'='2'); -- table properties

Trino

CREATE TABLE example_table (
    c1 INTEGER,
    c2 DATE,
    c3 DOUBLE
)
WITH ( 
    format = 'PARQUET', -- this is a table property
    partitioning = ARRAY['c1', 'c2'], -- partitioning
    sorted_by = ARRAY['c3'], -- sorting
    location = 's3://my-bucket/a/path/' -- location
);

Snowflake

CREATE [ OR REPLACE ] ICEBERG TABLE [ IF NOT EXISTS ] <table_name>
  [ EXTERNAL_VOLUME = '<external_volume_name>' ]
  [ CATALOG = '<catalog_integration_name>' ]
  CATALOG_TABLE_NAME = '<rest_catalog_table_name>'
  [ CATALOG_NAMESPACE = '<catalog_namespace>' ]
  [ PARTITION BY ( partitionExpression [ , partitionExpression , ... ] ) ]
...
  [ STORAGE_SERIALIZATION_POLICY = { COMPATIBLE | OPTIMIZED } ]

TODO:

I have a PR for ALTER TABLE as well; I did not want to add too much all at once.

Eventually I want to add this to CREATE TABLE AS, but I haven't quite found a SQL statement that flows well when doing this. Some examples:

Option 1: the lakehouse clauses come between the table name and AS SELECT (this means you need to know your columns before writing the AS SELECT):

CREATE TABLE my_datalake.default.tbl1
PARTITIONED BY (part)
SORTED BY (id)
TBLPROPERTIES ('key'='value')
AS SELECT ...

Option 2: PARTITIONED BY comes after the SELECT clause, which will most likely lead to confusion about where the SELECT statement ends and where the Iceberg definitions start:

CREATE TABLE my_datalake.default.tbl1
AS SELECT ...
PARTITIONED BY (part)
SORTED BY (id)
TBLPROPERTIES ('key'='value')

Option 3: add a SET clause to the CREATE TABLE and CREATE TABLE AS statements.
For CREATE TABLE:

CREATE TABLE my_datalake.default.tbl1
SET 
PARTITIONED BY (part)
SORTED BY (id)
LOCATION 's3://path1/path2'
TBLPROPERTIES ('key'='value')

For CREATE TABLE AS:

CREATE TABLE my_datalake.default.tbl1
AS SELECT ...
SET 
PARTITIONED BY (part)
SORTED BY (id)
LOCATION 's3://path1/path2'
TBLPROPERTIES ('key'='value')

This may be the nicest option, as it follows cleanly from the ALTER TABLE statement and keeps the lakehouse options more recognizable. However, it deviates from most other engines' syntaxes.
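
For comparison, a hypothetical sketch of the ALTER TABLE counterpart that option 3 is meant to echo (the companion ALTER TABLE PR may use a different grammar):

```sql
-- grammar assumed for illustration only; see the companion ALTER TABLE PR
ALTER TABLE my_datalake.default.tbl1
SET PARTITIONED BY (part);

ALTER TABLE my_datalake.default.tbl1
SET TBLPROPERTIES ('key' = 'value');
```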

taniabogatsch and others added 30 commits December 11, 2025 17:55
- Run all tests (including `test_slow`) but skip those that take too long.
- Renamed tests from `test` to `test_slow` since they were taking too long for `test`.
- Removed some leftover code to enable verification for all tests - we have a config for that now.
* Update ICU to the 2025c time zone data
Fixes: duckdblabs/duckdb-internal#6851

Previously we looked through every character of the query (after tokenizing) to find the semicolon. This caused us to incorrectly split if a semicolon was inside a comment, e.g. `SELECT 1; -- ; `, resulting in 2 queries rather than 1 in this case.

For this query: 
```sql
create or replace result my_result as from (
    select 1
    where starts_with('name', 'test_simple_share') -- ;
    );
```    

Regarding the tokenization, I don't think the problem lies in the fact that `) -- ;` is returned as a single token (though comments should maybe become something separate), but rather that the splitting logic was not looking for the `;` token properly, instead going character by character.


Also, this logic is already fixed on main, but slightly differently; perhaps I should reuse that.
hawkfish and others added 19 commits January 8, 2026 08:51
* Add the RHS bindings when we are doing SEMI or ANTI ASOF joins with a predicate.
* Disallow using arbitrary predicates in AsOf with RIGHT/FULL/SEMI joins.
* Convert the semi-join to an inner join and import the count directly.
* Disallow using arbitrary predicates in AsOf with ANTI joins.
* Remove the predicate test and relocation (join predicate push-down will take care of it).
* Update test plans and add correctness tests for new cases.
Bumped while building duckdb-wasm. I would expect other clients or packagers of duckdb might also hit this, and the fix is simple.
@Tmonster force-pushed the add_lakehouse_options_for_create_table_statements branch from e046f3e to 4d2da25 on January 12, 2026 14:20
@Tmonster force-pushed the add_lakehouse_options_for_create_table_statements branch from 4d2da25 to 5b3a60c on January 12, 2026 14:30