Skip to content

21577: feat: Extend sqllogictest framework to uptake custom datafusion.format.* settings#308

Open
martin-augment wants to merge 2 commits intomainfrom
pr-21577-2026-04-13-09-40-21
Open

21577: feat: Extend sqllogictest framework to uptake custom datafusion.format.* settings#308
martin-augment wants to merge 2 commits intomainfrom
pr-21577-2026-04-13-09-40-21

Conversation

@martin-augment
Copy link
Copy Markdown
Owner

21577: To review by AI

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Apr 13, 2026

Walkthrough

The pull request adds repository-wide rules for AI agent behavior in code reviews, constraining them from linking to GitHub issues/pull requests and from reviewing AI configuration files. It also refactors the DataFusion sqllogictest formatting pipeline by introducing a new get_format_options helper function that derives format configuration from the session context. This function is then propagated through the normalization and query execution paths (convert_batches and cell_to_string) across multiple runner implementations. The test file expectations are updated to reflect the new session-driven formatting behavior for date, time, timestamp, and duration types.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch pr-21577-2026-04-13-09-40-21

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for session-level formatting options in sqllogictest, allowing query results to reflect settings like datafusion.format.date_format and datafusion.format.timestamp_format. While the infrastructure for passing FormatOptions is implemented, several issues were identified: the cell_to_string function still uses a hardcoded NULL string instead of the provided format options, and get_format_options explicitly overrides user-defined null labels. Additionally, the current implementation creates an ArrayFormatter for every cell, which is inefficient and should be optimized to occur once per column. New tests were added to set_variable.slt, but they currently only verify that configuration variables can be set, not that they actually affect the output of null values.

Comment on lines +188 to +193
pub fn cell_to_string(
col: &ArrayRef,
row: usize,
is_spark_path: bool,
format_options: &FormatOptions<'_>,
) -> Result<String> {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

While format_options is now passed to cell_to_string, the function's implementation (specifically for null values) still uses the hardcoded NULL_STR ("NULL"). To fully support the datafusion.format.null setting as intended by this PR, the implementation should be updated to use format_options.null_label for both top-level and nested null values.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:useful; category:bug; feedback: The Gemini AI reviewer is correct! There are several usages of the NULL_STR constant still in use. They should be replaced with format_options.null to make it following the configured formatting.

(&datafusion_format_options).try_into().unwrap();

let f = ArrayFormatter::try_new(col.as_ref(), &arrow_format_options)?;
let f = ArrayFormatter::try_new(col.as_ref(), format_options)?;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Creating a new ArrayFormatter for every single cell is inefficient, especially for large result sets. Since the format_options and the column's data type are constant for all rows in a batch, consider creating the formatter once per column outside the row loop to improve performance.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! The same formatter could be used for all values in a column, so it could be reused per column instead of re-created for each value. This will reduce the memory pressure for tests with many results.

// Get passed custom FormatOptions by SessionContext to be used for sqllogictest
pub fn get_format_options(ctx: &SessionContext) -> Result<FormatOptions> {
let mut df_format = ctx.state().config().options().format.clone();
df_format.set("null", "NULL")?;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This line explicitly overrides the user's datafusion.format.null setting with "NULL". While this ensures compatibility with standard .slt files, it prevents users from testing custom null representations, which seems to be part of the goal of this PR (as evidenced by the new tests in set_variable.slt). Consider only setting this to "NULL" if the user hasn't provided a custom value, or clarify if null is intentionally excluded from the custom formatting support.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:useful; category:bug; feedback: The Gemini AI reviewer is correct! This will set the format.null option to NULL, to be backward compatible with the NULL_STR constant. But to make this option customizable like the date/time formatting this should be done only if the current value is "", i.e. the default value from FormatOptions::default(). A custom value set with SET datafusion.format.null=... should be not be overwritten.


# null display string (default: (empty))
statement ok
SET datafusion.format.null = 'NuLL'
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This test verifies that the datafusion.format.null setting can be changed and shown, but it doesn't verify that the setting actually affects query output. Given the current implementation in util.rs and normalize.rs, query results will still use the hardcoded "NULL" string regardless of this setting. Consider adding a query test case here once the implementation is fixed.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:useful; category:bug; feedback: The Gemini AI reviewer is correct! The test just verifies that the formatting option is properly set but it does not verify that it is being used when a SELECT query returns a null value. An addition test should be added.

@claude
Copy link
Copy Markdown

claude bot commented Apr 13, 2026

Code Review

The overall approach is sound — passing FormatOptions from the session context down through the call chain is cleaner than constructing it inside cell_to_string on every call, and the new SLT test cases give good coverage of the feature.

A few issues worth addressing:


Bug: Unused import will fail cargo clippy -- -D warnings

In datafusion/sqllogictest/src/engines/datafusion_substrait_roundtrip_engine/runner.rs (line 29), ConfigField is imported but never used directly in that file — the actual usage is in util.rs::get_format_options:

// runner.rs — ConfigField is not used here
use datafusion::config::ConfigField;

This will fail the mandatory clippy check (unused_imports → error with -D warnings). The import should be removed from this file.


Misleading test: timestamp_tz_format doesn't affect the query output

The new test block sets timestamp_tz_format and then queries SELECT TIMESTAMPTZ '2026-04-07 13:31:00', but the expected output (2026-04-07T13:31:00) matches the default timestamp_format, not the set timestamp_tz_format (%Y-%m-%d %H:%M:%S %z):

# timestamp_tz_format (default: NULL)
statement ok
SET datafusion.format.timestamp_tz_format = '%Y-%m-%d %H:%M:%S %z'

query P
SELECT TIMESTAMPTZ '2026-04-07 13:31:00';
----
2026-04-07T13:31:00    -- uses timestamp_format default, not timestamp_tz_format

This suggests TIMESTAMPTZ '...' literals produce a Timestamp(Nanosecond, None) type in DataFusion (no attached timezone), so timestamp_tz_format is never applied. The test as written verifies the format setting has no effect rather than demonstrating that it works. Consider either:

  1. Using SELECT arrow_cast(..., 'Timestamp(Nanosecond, UTC)') to produce a timezone-aware timestamp that actually exercises timestamp_tz_format, or
  2. Adding a comment clarifying that TIMESTAMPTZ literals are timezone-stripped and that this behavior is expected.

Design: null format setting is silently overridden

get_format_options always forces null to "NULL" regardless of what the user set for datafusion.format.null:

pub fn get_format_options(ctx: &SessionContext) -> Result<FormatOptions> {
    let mut df_format = ctx.state().config().options().format.clone();
    df_format.set("null", "NULL")?;  // overrides user setting unconditionally
    Ok(df_format)
}

This means SET datafusion.format.null = 'custom_null' won't change how nulls appear in query results from sqllogictest (at least for nested nulls processed by ArrayFormatter; top-level nulls already use NULL_STR via an early return in cell_to_string). This is consistent with the existing behavior, but it makes the null setting a partial exception among all the other format settings. A comment explaining the rationale (e.g., "SLT tests depend on 'NULL' for null matching") would help future readers.


Minor: Doc comment style

get_format_options uses a // line comment instead of the /// doc-comment style used by other public functions in util.rs:

// Get passed custom FormatOptions by SessionContext to be used for sqllogictest
pub fn get_format_options(ctx: &SessionContext) -> Result<FormatOptions> {

Should be ///.


Positive notes

  • Removing the per-call FormatOptions construction inside the _ arm of cell_to_string is a clean improvement.
  • The case-insensitive SET datafusion.FORMAT.DURATION_FORMAT test is a nice addition.
  • The duration_format and time_format tests effectively demonstrate the feature.

@augmentcode
Copy link
Copy Markdown

augmentcode bot commented Apr 13, 2026

🤖 Augment PR Summary

Summary: This PR updates the DataFusion sqllogictest runners to respect session-level datafusion.format.* settings when formatting query results.

Changes:

  • Thread Arrow FormatOptions into convert_batches/cell_to_string rather than constructing default formatting options per value
  • Add get_format_options helper to pull the current session’s formatting settings and enforce a stable NULL representation for SLT output
  • Update both the standard DataFusion runner and the Substrait round-trip runner to convert DataFusion format options into Arrow display format options
  • Expand set_variable.slt to assert that date/time/timestamp/duration display changes according to session formatting variables
  • Add additional coverage for case-insensitive format variable names

Technical Notes: Uses TryFrom<&datafusion::config::FormatOptions> to build Arrow FormatOptions and applies the resulting options consistently during result normalization.

🤖 Was this summary useful? React with 👍 or 👎

Copy link
Copy Markdown

@augmentcode augmentcode bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review completed. 2 suggestions posted.

Fix All in Augment

Comment augment review to trigger a new review at any time.

query P
SELECT TIMESTAMPTZ '2026-04-07 13:31:00';
----
2026-04-07T13:31:00
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

timestamp_tz_format is set to a pattern that includes %z, but the expected output for the TIMESTAMPTZ query doesn’t include any timezone/offset. This may not actually validate that datafusion.format.timestamp_tz_format is being applied (or could become flaky if TIMESTAMPTZ formatting starts including the offset).

Severity: medium

Fix This in Augment

🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:useful; category:bug; feedback: The Augment AI reviewer is correct! The test validates that the formatting setting is not used. It should be fixed to verify that the custom date formatting is used!

// Get passed custom FormatOptions by SessionContext to be used for sqllogictest
pub fn get_format_options(ctx: &SessionContext) -> Result<FormatOptions> {
let mut df_format = ctx.state().config().options().format.clone();
df_format.set("null", "NULL")?;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_format_options overwrites the session’s datafusion.format.null to "NULL", so SET datafusion.format.null = ... won’t affect how NULLs are displayed in query output even though other datafusion.format.* settings now do. Is that intentional for sqllogictest semantics, and should it be documented/tests reflect that behavior?

Severity: low

Fix This in Augment

🤖 Was this useful? React with 👍 or 👎, or 🚀 if it prevented an incident/outage.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:useful; category:bug; feedback: The Augment AI reviewer is correct! This will set the format.null option to NULL, to be backward compatible with the NULL_STR constant. But to make this option customizable like the date/time formatting this should be done only if the current value is "", i.e. the default value from FormatOptions::default(). A custom value set with SET datafusion.format.null=... should be not be overwritten.

Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 6623349. Configure here.

query P
SELECT TIMESTAMPTZ '2026-04-07 13:31:00';
----
2026-04-07T13:31:00
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test for timestamp_tz_format doesn't verify format is applied

Medium Severity

The timestamp_tz_format test sets the format to '%Y-%m-%d %H:%M:%S %z' but expects 2026-04-07T13:31:00, which matches the default timestamp_format, not the custom tz format. This happens because datafusion.execution.time_zone is NULL at this point, causing TIMESTAMPTZ '...' to produce Timestamp(Nanosecond, None) — a non-tz type — so timestamp_tz_format is never applied. The test passes but doesn't actually validate that the timestamp_tz_format setting affects output, giving false confidence that the feature works through the sqllogictest framework.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 6623349. Configure here.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:useful; category:bug; feedback: The Bugbot AI reviewer is correct! The test validates that the formatting setting is not used. It should be fixed to verify that the custom date formatting is used!

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
datafusion/sqllogictest/test_files/set_variable.slt (1)

504-520: Inconsistent query type annotation for TIME.

The TIME query uses query D (DateTime type) on lines 504 and 517, but TIME values might be better represented with a different type annotation. This appears to be pre-existing behavior and not directly related to this PR's changes, so it's a minor observation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@datafusion/sqllogictest/test_files/set_variable.slt` around lines 504 - 520,
The TIME tests are annotated with "query D" (DateTime) but should use a
time-specific annotation; update the two TIME test blocks that start with "query
D" for "SELECT TIME '01:02:12.123' AS time;" to use "query T" (or whatever test
label your test harness expects for TIME) so the expected type matches the
value, and keep the RESET/SHOW lines as-is (verify both occurrences around the
RESET datafusion.format.time_format and after the SHOW remain consistent).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@datafusion/sqllogictest/src/engines/datafusion_substrait_roundtrip_engine/runner.rs`:
- Line 29: The import "ConfigField" is unused in runner.rs; remove the line "use
datafusion::config::ConfigField;" from that file (search for the symbol
ConfigField to confirm there are no references), then run a build to ensure no
other code depends on it and tidy imports (cargo fmt/cargo check) to finalize
the cleanup.

---

Nitpick comments:
In `@datafusion/sqllogictest/test_files/set_variable.slt`:
- Around line 504-520: The TIME tests are annotated with "query D" (DateTime)
but should use a time-specific annotation; update the two TIME test blocks that
start with "query D" for "SELECT TIME '01:02:12.123' AS time;" to use "query T"
(or whatever test label your test harness expects for TIME) so the expected type
matches the value, and keep the RESET/SHOW lines as-is (verify both occurrences
around the RESET datafusion.format.time_format and after the SHOW remain
consistent).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 38522c38-d63b-4bb7-8e6f-788f7b581e64

📥 Commits

Reviewing files that changed from the base of the PR and between 644be38 and 6623349.

📒 Files selected for processing (7)
  • .cursor/rules.md
  • AGENTS.md
  • datafusion/sqllogictest/src/engines/datafusion_engine/normalize.rs
  • datafusion/sqllogictest/src/engines/datafusion_engine/runner.rs
  • datafusion/sqllogictest/src/engines/datafusion_substrait_roundtrip_engine/runner.rs
  • datafusion/sqllogictest/src/util.rs
  • datafusion/sqllogictest/test_files/set_variable.slt

};
use arrow::record_batch::RecordBatch;
use async_trait::async_trait;
use datafusion::config::ConfigField;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Check if ConfigField is used in this file beyond the import

rg -n 'ConfigField' datafusion/sqllogictest/src/engines/datafusion_substrait_roundtrip_engine/runner.rs

Repository: martin-augment/datafusion

Length of output: 108


🏁 Script executed:

cd datafusion/sqllogictest && cargo clippy --all-targets --all-features -- -D warnings 2>&1 | grep -A 5 -B 5 "ConfigField\|unused" | head -100

Repository: martin-augment/datafusion

Length of output: 155


🏁 Script executed:

# Read the entire file to check for ConfigField usage in context
cat -n datafusion/sqllogictest/src/engines/datafusion_substrait_roundtrip_engine/runner.rs | head -60

Repository: martin-augment/datafusion

Length of output: 2777


🏁 Script executed:

# Also search more broadly for ConfigField in the entire directory
rg 'ConfigField' datafusion/sqllogictest/src/engines/datafusion_substrait_roundtrip_engine/

Repository: martin-augment/datafusion

Length of output: 189


Remove unused import.

The ConfigField import on line 29 is not used in this file. It appears this import was added but is not referenced anywhere in the runner.rs file.

Proposed fix
-use datafusion::config::ConfigField;
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
use datafusion::config::ConfigField;
(This line should be removed from the imports section)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@datafusion/sqllogictest/src/engines/datafusion_substrait_roundtrip_engine/runner.rs`
at line 29, The import "ConfigField" is unused in runner.rs; remove the line
"use datafusion::config::ConfigField;" from that file (search for the symbol
ConfigField to confirm there are no references), then run a build to ensure no
other code depends on it and tidy imports (cargo fmt/cargo check) to finalize
the cleanup.

@martin-augment
Copy link
Copy Markdown
Owner Author

Misleading test: timestamp_tz_format doesn't affect the query output

The new test block sets timestamp_tz_format and then queries SELECT TIMESTAMPTZ '2026-04-07 13:31:00', but the expected output (2026-04-07T13:31:00) matches the default timestamp_format, not the set timestamp_tz_format (%Y-%m-%d %H:%M:%S %z):

# timestamp_tz_format (default: NULL)
statement ok
SET datafusion.format.timestamp_tz_format = '%Y-%m-%d %H:%M:%S %z'

query P
SELECT TIMESTAMPTZ '2026-04-07 13:31:00';
----
2026-04-07T13:31:00    -- uses timestamp_format default, not timestamp_tz_format

This suggests TIMESTAMPTZ '...' literals produce a Timestamp(Nanosecond, None) type in DataFusion (no attached timezone), so timestamp_tz_format is never applied. The test as written verifies the format setting has no effect rather than demonstrating that it works. Consider either:

  1. Using SELECT arrow_cast(..., 'Timestamp(Nanosecond, UTC)') to produce a timezone-aware timestamp that actually exercises timestamp_tz_format, or
  2. Adding a comment clarifying that TIMESTAMPTZ literals are timezone-stripped and that this behavior is expected.

value:useful; category:bug; feedback: The Claude AI reviewer is correct! The test validates that the formatting setting is not used. It should be fixed to verify that the custom date formatting is used!

@martin-augment
Copy link
Copy Markdown
Owner Author

Design: null format setting is silently overridden

get_format_options always forces null to "NULL" regardless of what the user set for datafusion.format.null:

pub fn get_format_options(ctx: &SessionContext) -> Result<FormatOptions> {
    let mut df_format = ctx.state().config().options().format.clone();
    df_format.set("null", "NULL")?;  // overrides user setting unconditionally
    Ok(df_format)
}

This means SET datafusion.format.null = 'custom_null' won't change how nulls appear in query results from sqllogictest (at least for nested nulls processed by ArrayFormatter; top-level nulls already use NULL_STR via an early return in cell_to_string). This is consistent with the existing behavior, but it makes the null setting a partial exception among all the other format settings. A comment explaining the rationale (e.g., "SLT tests depend on 'NULL' for null matching") would help future readers.

value:useful; category:bug; feedback: The Claude AI reviewer is correct! This will set the format.null option to NULL, to be backward compatible with the NULL_STR constant. But to make this option customizable like the date/time formatting this should be done only if the current value is "", i.e. the default value from FormatOptions::default(). A custom value set with SET datafusion.format.null=... should be not be overwritten.

@martin-augment
Copy link
Copy Markdown
Owner Author

Minor: Doc comment style

get_format_options uses a // line comment instead of the /// doc-comment style used by other public functions in util.rs:

// Get passed custom FormatOptions by SessionContext to be used for sqllogictest
pub fn get_format_options(ctx: &SessionContext) -> Result<FormatOptions> {

Should be ///.

value:good-to-have; category:documentation; feedback: The Claude AI reviewer is correct! The new public function has a doc comment instead of a proper docstring that will be rendered in the produced documentation. The doc comment will be visible only in the source code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants