# Struct extract cast pushdown #149
base: struct_projection_pushdown_v2
## Conversation
Doing a `try_cast` on a VARIANT value containing a struct that has no members in common with the cast's target struct type was failing with `Binder Error: STRUCT to STRUCT cast must have at least one matching member`. This adds a test for that case and handles it correctly by setting the result to NULL.
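A minimal sketch of the fixed behavior (the VARIANT literal construction here is an illustration, not the PR's actual test):

```sql
-- Cast a VARIANT holding {'a': 1} to a struct type that shares no member names.
-- Before this change: Binder Error: STRUCT to STRUCT cast must have at least one matching member
-- After this change: the result is NULL
SELECT try_cast({'a': 1}::VARIANT AS STRUCT(b INTEGER));
```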
Hi DuckDB Team,

When using the profiler on queries that reference tables with the same name in different schemas, it was impossible to determine which node was scanning which table. This PR adds the catalog and schema name to the `TableScanToString` method so that the particular table in a catalog and schema can be identified.

Thanks,
Rusty
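For illustration, a sketch of the situation the change addresses (schema and table names are invented):

```sql
-- Two tables with the same name in different schemas:
CREATE SCHEMA s1; CREATE SCHEMA s2;
CREATE TABLE s1.t (id INTEGER);
CREATE TABLE s2.t (id INTEGER);
-- Previously both scans showed up identically in the profiler output;
-- with the catalog and schema included, the two nodes are distinguishable.
EXPLAIN ANALYZE SELECT * FROM s1.t JOIN s2.t USING (id);
```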
Hi DuckDB Team,
When reviewing `EXPLAIN ANALYZE` output, operations backed by
`MultiFileReader` are often reported like this:
```json
{
  "system_peak_temp_dir_size": 0,
  "system_peak_buffer_memory": 0,
  "result_set_size": 480000,
  "operator_cardinality": 60000,
  "operator_rows_scanned": 469392,
  "cpu_time": 0.135243666,
  "operator_name": "READ_CSV_AUTO ",
  "extra_info": {
    "Function": "READ_CSV_AUTO",
    "Projections": "range",
    "Estimated Cardinality": "58674",
    "Total Files Read": "6"
  },
  "operator_type": "TABLE_SCAN",
  "cumulative_rows_scanned": 469392,
  "operator_timing": 0.135243666,
  "cumulative_cardinality": 60000,
  "children": []
}
```
This PR extends `extra_info` with a new key, `"Filename(s)"`, which
reports the first five filenames provided by the user. This makes it
much easier to distinguish between multiple calls to `read_csv_auto` (or
other functions built on the `MultiFileReader` infrastructure) within
the same query, since the specific files being read are now visible.
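As a sketch of where this helps (file names invented), consider a query with two otherwise-identical scans:

```sql
-- Both operators report operator_name READ_CSV_AUTO; the new "Filename(s)"
-- entry in extra_info shows which files each scan is reading.
SELECT * FROM read_csv_auto('sales_*.csv')
UNION ALL
SELECT * FROM read_csv_auto('returns_*.csv');
```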
A few notes:
1. I intentionally avoided logging the fully expanded filenames; this
might be worth reconsidering.
2. The limit of five filenames is arbitrary and mainly intended to keep
the output concise.
3. It would be nice if `extra_info` could accept a full `duckdb::Value`
instead of only strings—this would allow representing filenames as a
`LIST()` of strings, which would be more natural in the JSON output.
Thank you!
Rusty
After this, we only need support for compressed execution (e.g. constant/flat/dictionary vectors) in scalar functions to start moving some more complex extensions (e.g. half of spatial) over to the C API.
* Add missing dictionary indirection
… (performance) improvements to `.mode duckbox` (duckdb#20223)

This PR adds the `.last` command to the CLI that can be used to re-render the last query result without any limits. This is useful in particular when using the default (duckbox) rendering, since that truncates rows and values in order to quickly show a summary of the result. If there is information missing that you want to see, you can now quickly type `.last` to re-render the result in a pager without these limits. For example:

```sql
memory D from lineitem.parquet;
-- (duckbox rendering, truncated here for brevity:
--  6001215 rows (6.00 million rows, 10 shown) 16 columns (8 shown))
memory D .last
-- (the full result opens in a pager: all 16 columns are visible, including the
--  previously hidden l_quantity, l_extendedprice, l_discount, l_tax, ...)
```

Note that the `.last` view is opened in a pager, and the user can use e.g. the arrow keys to scroll to see more columns, or various pager commands to scroll through the result set.

#### Performance

In order to make this command work quickly, we make several improvements to the `BoxRenderer`. In particular, we want the rendering to be more "stream-y", because we want the first bytes to hit the screen ASAP, as that matters for the pager. The following improvements are made:

* Split up the `Analyze` step from the rendering step. This determines the column widths based on the result before actually constructing any render values.
* Introduce `max_analyze_rows` - the maximum number of rows to analyze for column widths. This is by default set to 100K, and can be specified as the second parameter to `.maxrows` (i.e. `.maxrows [MAX_ROWS] [MAX_ANALYZE_ROWS]`). Note that this relates to *rendered rows*, not to rows in the result set, so it only matters when large result sets are rendered.
* Render in a streaming manner. Avoid first constructing all rows to render and only then rendering them; instead, do the `prepare -> render` cycle in a loop to allow faster rendering.
* Remove several performance bottlenecks from rendering, e.g. avoid copying types for every value, and reduce unnecessary copies/scans of the result.

Because of these changes, `.last` is pretty fast. For example, running `FROM lineitem` with TPC-H SF10 and then running `.last` starts rendering the result in ~100ms. There are still a bunch of relatively easy performance wins here, e.g. switching from `std::string` to `string_t`, but for now this seems sufficient.
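A hedged usage sketch of the knobs described above (the specific values are invented):

```sql
-- Show at most 40 rows in duckbox mode, and analyze at most 50000 rendered
-- rows when computing column widths (.maxrows [MAX_ROWS] [MAX_ANALYZE_ROWS]):
memory D .maxrows 40 50000
memory D from lineitem.parquet;
-- Re-render the last result in a pager, without row/column truncation:
memory D .last
```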
For duckdblabs/duckdb-internal#6841 -- not an important bug fix, just improving error output to make it throw instead of crash, so this is going on main.

Throw an exception if there is an ARTKey mismatch. Before, an INVALID_INDEX constant was returned that was not handled anywhere, which led to a SIGSEGV. Technically, this could be handled higher up, but this function is only invoked from one place, and the likely scenario is attempting to insert a duplicate rowid on a corrupted ART (since, to get to the end of the loop, all the bytes have to match). Throwing here avoids having to either 1) check the return value of the function in the majority of cases where the ART is not corrupted, or 2) have the check only work in debug mode.
I get the following compilation warnings under a debug build.
```sh
/home/vscode/duckdb/tools/shell/shell.cpp: In member function 'int duckdb_shell::ShellState::DoMetaCommand(const std::string&)':
/home/vscode/duckdb/tools/shell/shell.cpp:2462:95: warning: comparison of integer expressions of different signedness: 'int' and 'std::vector<std::__cxx11::basic_string<char>, std::allocator<std::__cxx11::basic_string<char> > >::size_type' {aka 'long unsigned int'} [-Wsign-compare]
2462 | } else if (command.argument_count == 0 || int(command.argument_count) == args.size()) {
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~
/home/vscode/duckdb/tools/shell/shell_renderer.cpp: In member function 'void duckdb_shell::ModeBoxRenderer::PrintBoxRowSeparator(duckdb_shell::PrintStream&, duckdb_shell::idx_t, const char*, const char*, const char*, const duckdb::vector<long unsigned int>&)':
/home/vscode/duckdb/tools/shell/shell_renderer.cpp:662:39: warning: comparison of integer expressions of different signedness: 'int' and 'duckdb_shell::idx_t' {aka 'long unsigned int'} [-Wsign-compare]
662 | for (i = 1; i < nArg; i++) {
| ~~^~~~~~
/home/vscode/duckdb/extension/parquet/column_reader.cpp:393:42: warning: comparison of integer expressions of different signedness: 'uint32_t' {aka 'unsigned int'} and 'int32_t' {aka 'int'} [-Wsign-compare]
393 | if (compressed_page_size != page_hdr.uncompressed_page_size) {
| ~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
```
…uckdb#20158)

Follow-up from duckdb#19461, with _most_ comments addressed as discussed with @Maxxen. Fixes https://github.com/orgs/duckdblabs/projects/43/views/3?pane=issue&itemId=144025940&issue=duckdblabs%7Cduckdb-internal%7C6825.
* Implement streaming with IGNORE NULLS for first/last_value
* Fix overwrite issue found by VS=2.
* Add missing dictionary indirection
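For context, a sketch of the kind of window query the first item above covers (table and column names are invented):

```sql
-- With IGNORE NULLS, first_value skips over leading NULLs in the frame
-- instead of returning them.
SELECT first_value(x IGNORE NULLS)
       OVER (ORDER BY ts ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
FROM tbl;
```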
…ckdb#20063)

This PR changes the struct projection pushdown to push the extract down to the storage. I'll explain the differences at the different levels:

## Current Behavior

#### Expression:
The `struct_extract(s, 'a')` expression remains unchanged.

#### Storage:
We push down `scan_child_column` booleans to indicate which fields are left unread from the STRUCT. The fields that aren't touched are left empty (null), so we skip reading them. What is emitted from the storage is a column of type STRUCT.

#### Execution:
The execution of the `struct_extract` function takes out the field it needs.

## New Behavior

#### Expression:
The `struct_extract(s, 'a')` is replaced with a new `BoundColumnRefExpression`, referencing the field we emit directly from the `LogicalGet`.

#### Storage:
We don't emit a `STRUCT` column if only a subset of the fields is needed by the query; instead we emit the individually referenced field(s). What is emitted from the storage is now a column of the type of the field.

#### Execution:
The `BoundReferenceExpression` just references the field that was emitted directly by the storage.

## Technical details

We communicate this new "pushdown extract" behavior through the `ColumnIndex` down to the `StorageIndex`, as well as the `LogicalType` of the field we're scanning.

## Limitations (Future Work)

We disable the pushdown of the extract if fields of the struct are referenced by a filter. To make that work correctly, we would need to rewrite the TableFilterSet, which I suspect has far-reaching consequences I'm not ready to deal with yet.
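To make the before/after concrete, here is a small sketch (schema invented; plan shapes paraphrased from the description above):

```sql
CREATE TABLE tbl (s STRUCT(a INTEGER, b VARCHAR));

-- Only field 'a' of the struct is referenced:
SELECT struct_extract(s, 'a') FROM tbl;

-- Old behavior: the scan emits a STRUCT column with field 'b' left unread (null),
-- and struct_extract pulls out field 'a' at execution time.
-- New behavior: the scan emits an INTEGER column for field 'a' directly,
-- and the expression becomes a plain column reference.
```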
```cpp
Vector intermediate(child_type, target_count);
field.Scan(transaction, vector_index, field_state, intermediate, target_count);
//! FIXME: this is an optional ptr.. when is it not set??
VectorOperations::Cast(*state.context.GetClientContext(), intermediate, target_vector, target_count);
```
Maybe this should be a generic expression instead of a cast - where we keep the cast state in the ColumnScanState. This could prevent re-initializing and such, and would be more flexible in the future.
For example, one optimization this could enable in the future is pushing expressions into the projection, which is useful if they exist in both the filter and the projection, e.g.:

```sql
SELECT regexp_extract('...') AS extracted
FROM tbl
WHERE extracted = ...
```

Currently we need to execute regexp_extract twice (once in the filter, once in the projection) - but allowing expressions to be pushed would avoid this.
```cpp
Vector intermediate(child_type, target_count);
field.Scan(transaction, vector_index, field_state, intermediate, target_count);
//! FIXME: this is an optional ptr.. when is it not set??
VectorOperations::Cast(*state.context.GetClientContext(), intermediate, target_vector, target_count);
```
Maybe the ClientContext can be made mandatory for scans.
```cpp
if (child_type != result.GetType()) {
	Vector intermediate(child_type, 1);
	sub_column.FetchRow(transaction, state, child_storage_index, row_id, intermediate, 0);
	auto context = transaction.transaction->context.lock();
```
Maybe this should also be mandatory.
Fixes: duckdblabs/duckdb-internal#5904

This PR introduces the `SWITCH` expression as syntactic sugar for the standard `CASE` `WHEN/THEN` expression. This is a first version of the `SWITCH` statement, so I am also looking for feedback to see what others think of this syntax :)

~~### General Syntax~~

~~```sql~~
SWITCH(expression, value THEN result [, value THEN result ...] [, ELSE default])
~~```~~

~~### Example~~

~~```sql~~
-- This syntax:
SELECT SWITCH(x, 1 THEN 'a', 2 THEN 'b' ELSE 'c') FROM tbl;
-- Is equivalent to:
SELECT CASE x WHEN 1 THEN 'a' WHEN 2 THEN 'b' ELSE 'c' END FROM tbl;
~~```~~

~~### Implementation Notes~~

~~`SWITCH` has been added as a **reserved** keyword.~~

~~The logic is handled entirely within the parser. The `SWITCH` expression is immediately rewritten into a `CaseExpression` node. This means no changes were required for the transformer or binder.~~

~~I utilized the `THEN` keyword to separate the comparison value from the result (e.g., `1 THEN 'a'`).~~

~~I attempted to use a comma (`,`) or a colon-style syntax (`CASE val : result`), but these approaches caused shift/reduce conflicts in the grammar. Reusing `THEN` resolved these ambiguities cleanly.~~

See comment below for reworked syntax.
…e the ColumnScanState
…ve, to make sure we only perform the cast at the deepest level