5 changes: 5 additions & 0 deletions .cursor/rules.md
@@ -0,0 +1,5 @@
The AI agents should never link to any issue or a pull request
in any GitHub repository in the code reviews!

The AI agents should not review AI agents' config files like CLAUDE.md or AGENTS.md!

5 changes: 5 additions & 0 deletions AGENTS.md
@@ -39,3 +39,8 @@ When creating a PR, you MUST follow the [PR template](.github/pull_request_templ

See the [Testing Quick Start](docs/source/contributor-guide/testing.md#testing-quick-start)
for the recommended pre-PR test commands.
The AI agents should never link to any issue or a pull request
in any GitHub repository in the code reviews!

The AI agents should not review AI agents' config files like CLAUDE.md or AGENTS.md!

30 changes: 16 additions & 14 deletions datafusion/sqllogictest/src/engines/datafusion_engine/normalize.rs
@@ -20,10 +20,9 @@ use super::error::{DFSqlLogicTestError, Result};
use crate::engines::output::DFColumnType;
use arrow::array::{Array, AsArray};
use arrow::datatypes::{Fields, Schema};
use arrow::util::display::ArrayFormatter;
use arrow::util::display::{ArrayFormatter, FormatOptions};
use arrow::{array, array::ArrayRef, datatypes::DataType, record_batch::RecordBatch};
use datafusion::common::internal_datafusion_err;
use datafusion::config::ConfigField;
use std::path::PathBuf;
use std::sync::LazyLock;

@@ -32,6 +31,7 @@ pub fn convert_batches(
schema: &Schema,
batches: Vec<RecordBatch>,
is_spark_path: bool,
format_options: &FormatOptions<'_>,
) -> Result<Vec<Vec<String>>> {
let mut rows = vec![];
for batch in batches {
@@ -50,7 +50,7 @@
batch
.columns()
.iter()
.map(|col| cell_to_string(col, row, is_spark_path))
.map(|col| cell_to_string(col, row, is_spark_path, format_options))
.collect::<Result<Vec<String>>>()
})
.collect::<Result<Vec<Vec<String>>>>()?
@@ -185,7 +185,12 @@ macro_rules! get_row_value {
/// [NULL Values and empty strings]: https://duckdb.org/dev/sqllogictest/result_verification#null-values-and-empty-strings
///
/// Floating numbers are rounded to have a consistent representation with the Postgres runner.
pub fn cell_to_string(col: &ArrayRef, row: usize, is_spark_path: bool) -> Result<String> {
pub fn cell_to_string(
col: &ArrayRef,
row: usize,
is_spark_path: bool,
format_options: &FormatOptions<'_>,
) -> Result<String> {
Comment on lines +188 to +193
Gemini review comment (severity: medium):

While format_options is now passed to cell_to_string, the function's implementation (specifically for null values) still uses the hardcoded NULL_STR ("NULL"). To fully support the datafusion.format.null setting as intended by this PR, the implementation should be updated to use format_options.null_label for both top-level and nested null values.

Owner author reply:

value:useful; category:bug; feedback: The Gemini AI reviewer is correct! There are several remaining usages of the NULL_STR constant. They should be replaced with format_options.null so that the configured formatting is followed.

if col.is_null(row) {
// represent any null value with the string "NULL"
Ok(NULL_STR.to_string())
@@ -233,18 +238,15 @@ pub fn cell_to_string(col: &ArrayRef, row: usize, is_spark_path: bool) -> Result
DataType::Dictionary(_, _) => {
let dict = col.as_any_dictionary();
let key = dict.normalized_keys()[row];
Ok(cell_to_string(dict.values(), key, is_spark_path)?)
Ok(cell_to_string(
dict.values(),
key,
is_spark_path,
format_options,
)?)
}
_ => {
let mut datafusion_format_options =
datafusion::config::FormatOptions::default();

datafusion_format_options.set("null", "NULL").unwrap();

let arrow_format_options: arrow::util::display::FormatOptions =
(&datafusion_format_options).try_into().unwrap();

let f = ArrayFormatter::try_new(col.as_ref(), &arrow_format_options)?;
let f = ArrayFormatter::try_new(col.as_ref(), format_options)?;
Gemini review comment (severity: medium):

Creating a new ArrayFormatter for every single cell is inefficient, especially for large result sets. Since the format_options and the column's data type are constant for all rows in a batch, consider creating the formatter once per column outside the row loop to improve performance.

Owner author reply:

value:good-to-have; category:bug; feedback: The Gemini AI reviewer is correct! The same formatter could be used for all values in a column, so it could be reused per column instead of re-created for each value. This will reduce the memory pressure for tests with many results.


Ok(f.value(row).to_string())
}
@@ -22,7 +22,7 @@ use std::{path::PathBuf, time::Duration};
use super::{DFSqlLogicTestError, error::Result, normalize};
use crate::engines::currently_executed_sql::CurrentlyExecutingSqlTracker;
use crate::engines::output::{DFColumnType, DFOutput};
use crate::is_spark_path;
use crate::{get_format_options, is_spark_path};
use arrow::record_batch::RecordBatch;
use async_trait::async_trait;
use datafusion::physical_plan::common::collect;
@@ -212,7 +212,12 @@ async fn run_query(
let stream = execute_stream(plan, task_ctx)?;
let types = normalize::convert_schema_to_types(stream.schema().fields());
let results: Vec<RecordBatch> = collect(stream).await?;
let rows = normalize::convert_batches(&schema, results, is_spark_path)?;

let df_format = get_format_options(ctx)?;
let arrow_format: arrow::util::display::FormatOptions<'_> =
(&df_format).try_into()?;
let rows =
normalize::convert_batches(&schema, results, is_spark_path, &arrow_format)?;

if rows.is_empty() && types.is_empty() {
Ok(DBOutput::StatementComplete(0))
@@ -21,9 +21,12 @@ use std::{path::PathBuf, time::Duration};
use crate::engines::currently_executed_sql::CurrentlyExecutingSqlTracker;
use crate::engines::datafusion_engine::Result;
use crate::engines::output::{DFColumnType, DFOutput};
use crate::{DFSqlLogicTestError, convert_batches, convert_schema_to_types};
use crate::{
DFSqlLogicTestError, convert_batches, convert_schema_to_types, get_format_options,
};
use arrow::record_batch::RecordBatch;
use async_trait::async_trait;
use datafusion::config::ConfigField;
AI review comment:

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain (condensed)

🏁 Scripts executed: `rg -n 'ConfigField'` over runner.rs and the datafusion_substrait_roundtrip_engine/ directory, plus `cargo clippy --all-targets --all-features -- -D warnings`, found no reference to `ConfigField` beyond the import line itself.
Remove unused import.

The ConfigField import on line 29 is not used in this file. It appears this import was added but is not referenced anywhere in the runner.rs file.

Proposed fix
-use datafusion::config::ConfigField;

use datafusion::logical_expr::LogicalPlan;
use datafusion::physical_plan::common::collect;
use datafusion::physical_plan::execute_stream;
@@ -166,7 +169,11 @@ async fn run_query_substrait_round_trip(
let stream = execute_stream(physical_plan, task_ctx)?;
let types = convert_schema_to_types(stream.schema().fields());
let results: Vec<RecordBatch> = collect(stream).await?;
let rows = convert_batches(&schema, results, false)?;

let df_format = get_format_options(ctx)?;
let arrow_format: arrow::util::display::FormatOptions<'_> =
(&df_format).try_into()?;
let rows = convert_batches(&schema, results, false, &arrow_format)?;

if rows.is_empty() && types.is_empty() {
Ok(DBOutput::StatementComplete(0))
9 changes: 9 additions & 0 deletions datafusion/sqllogictest/src/util.rs
@@ -16,6 +16,8 @@
// under the License.

use datafusion::common::{Result, exec_datafusion_err};
use datafusion::config::{ConfigField, FormatOptions};
use datafusion::prelude::SessionContext;
use itertools::Itertools;
use log::Level::Warn;
use log::{info, log_enabled, warn};
@@ -141,6 +143,13 @@ pub fn is_spark_path(relative_path: &Path) -> bool {
relative_path.starts_with("spark/")
}

// Get the custom FormatOptions from the SessionContext to be used for sqllogictest
pub fn get_format_options(ctx: &SessionContext) -> Result<FormatOptions> {
let mut df_format = ctx.state().config().options().format.clone();
df_format.set("null", "NULL")?;
Gemini review comment (severity: medium):

This line explicitly overrides the user's datafusion.format.null setting with "NULL". While this ensures compatibility with standard .slt files, it prevents users from testing custom null representations, which seems to be part of the goal of this PR (as evidenced by the new tests in set_variable.slt). Consider only setting this to "NULL" if the user hasn't provided a custom value, or clarify if null is intentionally excluded from the custom formatting support.

Owner author reply:

value:useful; category:bug; feedback: The Gemini AI reviewer is correct! This sets the format.null option to NULL for backward compatibility with the NULL_STR constant. But to make this option customizable like the date/time formatting, the override should happen only if the current value is "", i.e. the default from FormatOptions::default(). A custom value set with SET datafusion.format.null=... should not be overwritten.

Augment review comment:

get_format_options overwrites the session’s datafusion.format.null to "NULL", so SET datafusion.format.null = ... won’t affect how NULLs are displayed in query output even though other datafusion.format.* settings now do. Is that intentional for sqllogictest semantics, and should it be documented and reflected in the tests?

Severity: low


Owner author reply:

value:useful; category:bug; feedback: The Augment AI reviewer is correct! This sets the format.null option to NULL for backward compatibility with the NULL_STR constant. But to make this option customizable like the date/time formatting, the override should happen only if the current value is "", i.e. the default from FormatOptions::default(). A custom value set with SET datafusion.format.null=... should not be overwritten.

Ok(df_format)
}

#[cfg(test)]
mod tests {
use super::*;
112 changes: 97 additions & 15 deletions datafusion/sqllogictest/test_files/set_variable.slt
@@ -395,7 +395,7 @@ datafusion.format.timestamp_format
datafusion.format.timestamp_tz_format
datafusion.format.types_info

# date_format: SET / SHOW / RESET / SHOW
# date_format: query result display uses session format (default: %Y-%m-%d)
statement ok
SET datafusion.format.date_format = '%d-%m-%Y'

Expand All @@ -404,6 +404,11 @@ SHOW datafusion.format.date_format
----
datafusion.format.date_format %d-%m-%Y

query D
SELECT DATE '2026-04-07'
----
07-04-2026

statement ok
RESET datafusion.format.date_format

Expand All @@ -412,14 +417,23 @@ SHOW datafusion.format.date_format
----
datafusion.format.date_format %Y-%m-%d

# datetime_format
query D
SELECT DATE '2026-04-07'
----
2026-04-07

# datetime_format (default: %Y-%m-%dT%H:%M:%S%.f)
statement ok
SET datafusion.format.datetime_format = '%Y/%m/%d %H:%M:%S'
SET datafusion.format.datetime_format = '%d-%m-%YT%H:%M:%S'

query TT
SHOW datafusion.format.datetime_format
----
datafusion.format.datetime_format %Y/%m/%d %H:%M:%S
datafusion.format.datetime_format %d-%m-%YT%H:%M:%S

# DATETIME literals are not implemented in the SQL parser yet.
query error DataFusion error: This feature is not implemented: Unsupported SQL type DATETIME
SELECT DATETIME '2026-04-07 00:10:00';

statement ok
RESET datafusion.format.datetime_format
Expand All @@ -429,14 +443,19 @@ SHOW datafusion.format.datetime_format
----
datafusion.format.datetime_format %Y-%m-%dT%H:%M:%S%.f

# timestamp_format
# timestamp_format (default: %Y-%m-%dT%H:%M:%S%.f)
statement ok
SET datafusion.format.timestamp_format = '%FT%H:%M:%S'
SET datafusion.format.timestamp_format = '%d-%m-%YT%H:%M:%S'

query TT
SHOW datafusion.format.timestamp_format
----
datafusion.format.timestamp_format %FT%H:%M:%S
datafusion.format.timestamp_format %d-%m-%YT%H:%M:%S

query P
SELECT TIMESTAMP '2026-04-07 13:31:00';
----
07-04-2026T13:31:00

statement ok
RESET datafusion.format.timestamp_format
Expand All @@ -446,7 +465,12 @@ SHOW datafusion.format.timestamp_format
----
datafusion.format.timestamp_format %Y-%m-%dT%H:%M:%S%.f

# timestamp_tz_format (default NULL)
query P
SELECT TIMESTAMP '2026-04-07 13:31:00';
----
2026-04-07T13:31:00

# timestamp_tz_format (default: NULL)
statement ok
SET datafusion.format.timestamp_tz_format = '%Y-%m-%d %H:%M:%S %z'

Expand All @@ -455,6 +479,11 @@ SHOW datafusion.format.timestamp_tz_format
----
datafusion.format.timestamp_tz_format %Y-%m-%d %H:%M:%S %z

query P
SELECT TIMESTAMPTZ '2026-04-07 13:31:00';
----
2026-04-07T13:31:00
Augment review comment:

timestamp_tz_format is set to a pattern that includes %z, but the expected output for the TIMESTAMPTZ query doesn’t include any timezone/offset. This may not actually validate that datafusion.format.timestamp_tz_format is being applied (or could become flaky if TIMESTAMPTZ formatting starts including the offset).

Severity: medium


Owner author reply:

value:useful; category:bug; feedback: The Augment AI reviewer is correct! As written, the test only shows that the formatting setting is not applied. It should be fixed to verify that the custom timestamp format is actually used!

Cursor Bugbot review comment:

Test for timestamp_tz_format doesn't verify format is applied

Medium Severity

The timestamp_tz_format test sets the format to '%Y-%m-%d %H:%M:%S %z' but expects 2026-04-07T13:31:00, which matches the default timestamp_format, not the custom tz format. This happens because datafusion.execution.time_zone is NULL at this point, causing TIMESTAMPTZ '...' to produce Timestamp(Nanosecond, None) — a non-tz type — so timestamp_tz_format is never applied. The test passes but doesn't actually validate that the timestamp_tz_format setting affects output, giving false confidence that the feature works through the sqllogictest framework.

Reviewed by Cursor Bugbot for commit 6623349.

Owner author reply:

value:useful; category:bug; feedback: The Bugbot AI reviewer is correct! As written, the test only shows that the formatting setting is not applied. It should be fixed to verify that the custom timestamp format is actually used!


statement ok
RESET datafusion.format.timestamp_tz_format

Expand All @@ -463,14 +492,19 @@ SHOW datafusion.format.timestamp_tz_format
----
datafusion.format.timestamp_tz_format NULL

# time_format
# time_format (default: %H:%M:%S%.f)
statement ok
SET datafusion.format.time_format = '%H-%M-%S'
SET datafusion.format.time_format = '%S-%M-%H'

query TT
SHOW datafusion.format.time_format
----
datafusion.format.time_format %H-%M-%S
datafusion.format.time_format %S-%M-%H

query D
SELECT TIME '01:02:12.123' AS time;
----
12-02-01

statement ok
RESET datafusion.format.time_format
Expand All @@ -480,7 +514,12 @@ SHOW datafusion.format.time_format
----
datafusion.format.time_format %H:%M:%S%.f

# duration_format: values are normalized to lowercase; ISO8601 and pretty are valid
query D
SELECT TIME '01:02:12.123' AS time;
----
01:02:12.123

# duration_format: (default: pretty) values are normalized to lowercase; ISO8601 and pretty are valid
statement ok
SET datafusion.format.duration_format = ISO8601

Expand All @@ -489,6 +528,12 @@ SHOW datafusion.format.duration_format
----
datafusion.format.duration_format iso8601

# Session duration_format controls display of Duration columns (not SQL INTERVAL)
query ?
SELECT arrow_cast(3661, 'Duration(Second)');
----
PT3661S

statement ok
SET datafusion.format.duration_format to 'PRETTY'

Expand All @@ -497,6 +542,11 @@ SHOW datafusion.format.duration_format
----
datafusion.format.duration_format pretty

query ?
SELECT arrow_cast(3661, 'Duration(Second)');
----
0 days 1 hours 1 mins 1 secs

statement ok
RESET datafusion.format.duration_format

Expand All @@ -505,7 +555,29 @@ SHOW datafusion.format.duration_format
----
datafusion.format.duration_format pretty

# null display string
query ?
SELECT arrow_cast(3661, 'Duration(Second)');
----
0 days 1 hours 1 mins 1 secs

# Case-insensitive duration_format variable name
statement ok
SET datafusion.FORMAT.DURATION_FORMAT = 'ISO8601'

query TT
SHOW datafusion.format.duration_format
----
datafusion.format.duration_format iso8601

query ?
SELECT arrow_cast(61, 'Duration(Second)');
----
PT61S

statement ok
RESET datafusion.format.duration_format

# null display string (default: (empty))
statement ok
SET datafusion.format.null = 'NuLL'
Gemini review comment (severity: medium):

This test verifies that the datafusion.format.null setting can be changed and shown, but it doesn't verify that the setting actually affects query output. Given the current implementation in util.rs and normalize.rs, query results will still use the hardcoded "NULL" string regardless of this setting. Consider adding a query test case here once the implementation is fixed.

Owner author reply:

value:useful; category:bug; feedback: The Gemini AI reviewer is correct! The test just verifies that the formatting option is properly set, but it does not verify that the option is used when a SELECT query returns a null value. An additional test should be added.


Expand All @@ -522,7 +594,7 @@ SHOW datafusion.format.null
----
datafusion.format.null (empty)

# safe
# safe (default: true)
statement ok
SET datafusion.format.safe = false

Expand All @@ -539,7 +611,7 @@ SHOW datafusion.format.safe
----
datafusion.format.safe true

# types_info
# types_info (default: false)
statement ok
SET datafusion.format.types_info to true

Expand All @@ -565,6 +637,11 @@ SHOW datafusion.format.date_format
----
datafusion.format.date_format %m/%d/%Y

query D
SELECT DATE '2026-04-07';
----
04/07/2026

statement ok
RESET datafusion.format.date_format

Expand All @@ -573,6 +650,11 @@ SHOW datafusion.format.date_format
----
datafusion.format.date_format %Y-%m-%d

query D
SELECT DATE '2026-04-07';
----
2026-04-07

# Invalid format option name
statement error DataFusion error: Invalid or Unsupported Configuration: Config value "unknown_option" not found on FormatOptions
SET datafusion.format.unknown_option = true
Expand Down