18615: feat: Make extract SQL expression timezone aware.#16
18615: feat: Make extract SQL expression timezone aware.#16martin-augment wants to merge 4 commits intomainfrom
Conversation
WalkthroughThis pull request adds timezone-aware functionality to the ✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
| }; | ||
|
|
||
| // For fixed offsets (like +04:00, -05:30), apply the offset to extract values | ||
| if is_timezone_aware { |
There was a problem hiding this comment.
For timezone-aware timestamps, the array is already adjusted to local time via adjust_timestamp_array; applying a fixed offset again to extracted Hour/Minute/Day values will double-apply the offset and produce incorrect results. Consider removing this post-extraction adjustment for tz-aware inputs (also applies to other locations in this block).
🤖 Was this useful? React with 👍 or 👎
| .iter() | ||
| .map(|day| { | ||
| day.map(|d| { | ||
| if offset_hours >= 24 || offset_hours <= -24 { |
There was a problem hiding this comment.
apply_day_offset adjusts the day solely based on offset hours, but day changes depend on whether the time crosses midnight; this can yield wrong day values and seems redundant if timestamps were already adjusted before extraction. What do you think about avoiding this adjustment or basing it on actual timestamp boundaries?
🤖 Was this useful? React with 👍 or 👎
Pull Request Review: Make EXTRACT SQL Expression Timezone AwareThank you for this contribution! This is a substantial improvement to DataFusion's datetime handling. Below is my detailed review. ✅ Strengths1. Good Architecture
2. Comprehensive Test Coverage
3. Documentation
|
| extracted | ||
| } | ||
| } else { | ||
| extracted |
There was a problem hiding this comment.
Bug: Double Timezone Offset Corrupts Data
The timezone offset is being applied twice for fixed offset timezones. The timestamp is first adjusted to local time via adjust_timestamp_array (lines 215-242), then the offset is incorrectly applied again via apply_hour_offset, apply_minute_offset, or apply_day_offset (lines 294-327). This double-adjustment corrupts the extracted values for timestamps with fixed offset timezones like +04:00 or -05:30.
There was a problem hiding this comment.
Actionable comments posted: 4
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
datafusion/functions/src/datetime/date_part.rs(8 hunks)datafusion/functions/src/datetime/mod.rs(2 hunks)datafusion/functions/src/datetime/to_local_time.rs(1 hunks)datafusion/sqllogictest/test_files/extract_tz.slt(1 hunks)docs/source/user-guide/sql/scalar_functions.md(3 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
datafusion/functions/src/datetime/date_part.rs (1)
datafusion/functions/src/datetime/mod.rs (1)
adjust_to_local_time(48-94)
datafusion/functions/src/datetime/to_local_time.rs (2)
datafusion/functions/src/datetime/date_part.rs (2)
adjust_to_local_time(366-366)as_primitive_array(361-361)datafusion/functions/src/datetime/mod.rs (1)
adjust_to_local_time(48-94)
datafusion/functions/src/datetime/mod.rs (2)
datafusion/functions/src/datetime/to_local_time.rs (7)
arrow(578-580)adjust_to_local_time(140-140)adjust_to_local_time(151-151)adjust_to_local_time(162-162)adjust_to_local_time(173-173)adjust_to_local_time(192-192)adjust_to_local_time(391-391)datafusion/functions/src/datetime/date_part.rs (1)
adjust_to_local_time(366-366)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: Cursor Bugbot
- GitHub Check: Analyze (rust)
- GitHub Check: claude-review
🔇 Additional comments (8)
datafusion/functions/src/datetime/mod.rs (2)
22-27: LGTM: Imports are appropriate for timezone handling.The new imports support the timezone-aware timestamp adjustment functionality and are all necessary for the
adjust_to_local_timefunction.
47-94: LGTM: Well-structured timezone adjustment utility.The
adjust_to_local_timefunction properly centralizes timezone adjustment logic with appropriate error handling for ambiguous and non-existent local times. The per-unit timestamp conversion is comprehensive and the offset calculation follows the expected pattern.datafusion/functions/src/datetime/date_part.rs (3)
134-134: LGTM: Extract alias added.Adding "extract" as an alias aligns with standard SQL syntax for EXTRACT operations.
356-372: LGTM: Array adjustment helper correctly delegates to shared utility.The
adjust_timestamp_arrayfunction properly applies timezone adjustment to arrays while handling nulls.
411-431: Potential logic error: Offset direction may be incorrect.The
apply_hour_offsetfunction addsoffset_hoursto the extracted hour value, but this may produce incorrect results.Consider this scenario:
- Timestamp:
'2025-10-30 10:45:30+02:00'(10:45 in +02:00 timezone)- Session timezone:
+00:00(UTC)- Expected result: 8 (since 10:45+02:00 = 08:45 UTC)
Current logic flow:
adjust_timestamp_arraywith tz=+02:00 produces adjusted valuedate_partextracts hourapply_hour_offsetadds +2 hoursThis would give 10 or 12 depending on the first adjustment, but test at line 71-73 of
extract_tz.sltexpects 8.Please verify:
- The intended semantics of EXTRACT with timezone-aware timestamps
- Whether the offset should be added or subtracted
- Whether the double-adjustment approach (adjust array, then adjust extracted value) is correct
docs/source/user-guide/sql/scalar_functions.md (1)
2390-2390: LGTM: Documentation correctly reflects extract alias.The documentation updates properly document
extractas an alias ofdate_part, consistent with the code changes.Also applies to: 2533-2537, 2575-2577
datafusion/functions/src/datetime/to_local_time.rs (1)
30-30: LGTM: Successful refactoring to use centralized timezone adjustment.The refactoring to delegate to
crate::datetime::adjust_to_local_timeeliminates code duplication and improves maintainability while preserving the original functionality.Also applies to: 139-199
datafusion/sqllogictest/test_files/extract_tz.slt (1)
1-106: Timezone-aware EXTRACT tests require verification for potential double-offset bug.The test expectations appear correct for extracting in the session timezone (UTC), but code analysis reveals the implementation applies timezone offsets twice:
adjust_timestamp_array()callsadjust_to_local_time()which adds the offset to the timestamp valueapply_hour_offset()then adds the same offset again to the extracted hour valueThis double-application would produce incorrect results. For example,
EXTRACT(HOUR FROM TIMESTAMP '2025-10-30 10:45:30+02:00')with session timezone '+00:00' would extract 12 instead of the expected 8.Additionally, the implementation comment states "extract in their own timezone," contradicting the test's expectation to extract in the session timezone. The tests should be run to confirm they fail with the current implementation, and the offset application logic in
date_part.rs(lines 307-324) should be reviewed for correctness.
| let (is_timezone_aware, tz_str_opt) = match array.data_type() { | ||
| Timestamp(_, Some(tz_str)) => (true, Some(Arc::clone(tz_str))), | ||
| _ => (false, None), | ||
| }; | ||
|
|
||
| let part_trim = part_normalization(&part); | ||
| let is_epoch = is_epoch(&part); | ||
|
|
||
| // Epoch is timezone-independent - it always returns seconds since 1970-01-01 UTC | ||
| let array = if is_epoch { | ||
| array | ||
| } else if is_timezone_aware { | ||
| // For timezone-aware timestamps, extract in their own timezone | ||
| match tz_str_opt.as_ref() { | ||
| Some(tz_str) => { | ||
| let tz = match tz_str.parse::<Tz>() { | ||
| Ok(tz) => tz, | ||
| Err(_) => return exec_err!("Invalid timezone"), | ||
| }; | ||
| match array.data_type() { | ||
| Timestamp(time_unit, _) => match time_unit { | ||
| Nanosecond => adjust_timestamp_array::< | ||
| TimestampNanosecondType, | ||
| >(&array, tz)?, | ||
| Microsecond => adjust_timestamp_array::< | ||
| TimestampMicrosecondType, | ||
| >(&array, tz)?, | ||
| Millisecond => adjust_timestamp_array::< | ||
| TimestampMillisecondType, | ||
| >(&array, tz)?, | ||
| Second => { | ||
| adjust_timestamp_array::<TimestampSecondType>(&array, tz)? | ||
| } | ||
| }, | ||
| _ => array, | ||
| } | ||
| } | ||
| None => array, | ||
| } | ||
| } else if let Timestamp(time_unit, None) = array.data_type() { | ||
| // For naive timestamps, interpret in session timezone if available | ||
| match config.execution.time_zone.as_ref() { | ||
| Some(tz_str) => { | ||
| let tz = match tz_str.parse::<Tz>() { | ||
| Ok(tz) => tz, | ||
| Err(_) => return exec_err!("Invalid timezone"), | ||
| }; | ||
|
|
||
| match time_unit { | ||
| Nanosecond => { | ||
| adjust_timestamp_array::<TimestampNanosecondType>(&array, tz)? | ||
| } | ||
| Microsecond => { | ||
| adjust_timestamp_array::<TimestampMicrosecondType>( | ||
| &array, tz, | ||
| )? | ||
| } | ||
| Millisecond => { | ||
| adjust_timestamp_array::<TimestampMillisecondType>( | ||
| &array, tz, | ||
| )? | ||
| } | ||
| Second => { | ||
| adjust_timestamp_array::<TimestampSecondType>(&array, tz)? | ||
| } | ||
| } | ||
| } | ||
| None => array, | ||
| } | ||
| } else { | ||
| array | ||
| }; |
There was a problem hiding this comment.
🧩 Analysis chain
Verify EXTRACT timezone behavior aligns with SQL standards.
The implementation extracts timezone-aware timestamps in their own timezone (line 216) and naive timestamps in the session timezone (line 244). Please confirm this matches the intended SQL semantics for EXTRACT.
In standard SQL and most databases (PostgreSQL, etc.), EXTRACT(HOUR FROM timestamp_with_tz) typically extracts the value in the session timezone, not the timestamp's embedded timezone. The current logic may produce unexpected results.
For example, with session timezone +00:00:
EXTRACT(HOUR FROM TIMESTAMP '2025-10-30 10:45:30+02:00')- Should extract in UTC (+00:00): result = 8 (since 10:45:30+02:00 = 08:45:30 UTC)
- Current logic extracts in +02:00: result = 10
🏁 Script executed:
#!/bin/bash
# Verify how PostgreSQL handles EXTRACT with timezone-aware timestamps
# This will help confirm the expected behavior
echo "Checking PostgreSQL EXTRACT behavior with timezones..."
echo ""
echo "Expected behavior for EXTRACT(HOUR FROM TIMESTAMP '2025-10-30 10:45:30+02:00'):"
echo "- When session timezone is UTC: should return 8"
echo "- When session timezone is +02:00: should return 10"
echo ""
echo "Please verify test expectations match this behavior"Length of output: 288
Now I need to examine the actual code to verify what adjust_timestamp_array does and whether the logic in the review matches what PostgreSQL actually does:
Based on my research, when the input value is of type timestamp with time zone, operations are performed with respect to a particular time zone; by default, operations are done with respect to the current TimeZone setting. Additionally, the originally stated or assumed time zone is not retained in PostgreSQL's timestamptz type—timestamps with time zones are converted to UTC, and the original timezone is not stored.
The review comment's concern is valid. The code's logic of extracting "in the timestamp's own timezone" deviates from standard SQL semantics and PostgreSQL's behavior. When PostgreSQL extracts a field like HOUR from a timezone-aware timestamp, it does so using the session's current timezone setting, not the timestamp's (non-existent) original timezone.
Your EXTRACT timezone behavior is incorrect; extract in session timezone, not timestamp's timezone.
The code extracts timezone-aware timestamps in their embedded timezone offset (lines 218-228) rather than the session timezone as PostgreSQL does by default. PostgreSQL performs such operations with respect to the current TimeZone setting by default, not the timestamp's original timezone. The adjust_timestamp_array call converts the timestamp to a specific timezone, but this differs from PostgreSQL semantics where the session timezone is used unless explicitly overridden with AT TIME ZONE.
For example, with session timezone UTC:
EXTRACT(HOUR FROM '2025-10-30 10:45:30+02:00')should extract in UTC → result: 8 (since 10:45:30+02:00 = 08:45:30 UTC)- Your code extracts in +02:00 → result: 10
Align extraction with session timezone semantics for PostgreSQL compatibility.
| // Check if a timezone string is a fixed offset | ||
| fn is_fixed_offset(tz_str: &str) -> bool { | ||
| tz_str.starts_with('+') || tz_str.starts_with('-') | ||
| } | ||
|
|
||
| // Holds the components of a timezone offset (hours and minutes). | ||
| struct OffsetInfo { | ||
| hours: i32, | ||
| minutes: i32, | ||
| } | ||
|
|
||
| // Extracts the offset components from a timezone string like "+04:00" or "-05:30". | ||
| fn extract_offset_components(tz_str: &str) -> Option<OffsetInfo> { | ||
| if tz_str.len() < 6 { | ||
| return None; | ||
| } | ||
|
|
||
| let sign = match &tz_str[0..1] { | ||
| "+" => 1, | ||
| "-" => -1, | ||
| _ => return None, | ||
| }; | ||
|
|
||
| let hours: i32 = tz_str[1..3].parse().ok()?; | ||
| let minutes: i32 = tz_str[4..6].parse().ok()?; | ||
|
|
||
| Some(OffsetInfo { | ||
| hours: sign * hours, | ||
| minutes: sign * minutes, | ||
| }) | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
Verify fixed-offset timezone format handling.
The extract_offset_components function assumes a specific format (+HH:MM with 6 characters). Please verify this handles all valid fixed-offset timezone representations that Arrow might use.
Potential concerns:
- Does Arrow always use the
+HH:MMformat with leading zeros? - What about formats like
+4:00or+04(without minutes)? - Consider adding format validation and clearer error messages
🏁 Script executed:
#!/bin/bash
# Check what timezone string formats Arrow's Tz type accepts and produces
echo "Verifying Arrow timezone string formats..."
rg -n "parse::<Tz>" --type rust -A 2 -B 2
echo ""
echo "Check Arrow timezone documentation for valid formats"Length of output: 1959
[/run_scripts]
<web_search>
</web_search>
🏁 Script executed:
#!/bin/bash
# Find where Tz is imported from and check its definition
echo "=== Finding Tz import in date_part.rs ==="
head -50 datafusion/functions/src/datetime/date_part.rs | grep -E "^use|^extern"
echo ""
echo "=== Searching for Tz type definition ==="
rg -n "^use.*\bTz\b" --type rust datafusion/functions/src/datetime/
echo ""
echo "=== Check if extract_offset_components is actually used ==="
rg -n "extract_offset_components" --type rust
echo ""
echo "=== Look for timezone format patterns in tests/examples ==="
rg -n "\\+[0-9]|\\-[0-9]" --type rust -B2 -A2 datafusion/functions/src/datetime/ | head -30Length of output: 3866
🌐 Web query:
Apache Arrow timezone type Tz valid formats fixed offset
💡 Result:
Apache Arrow accepts two kinds of timezone strings for the Timestamp (Tz) type:
- IANA/Olson zone names (e.g. "America/New_York", "UTC") — full tz database names. [1][3]
- Fixed (absolute) offsets, in hour/minute form (e.g. "+07:30", "-05:00"). Shorter variants like "-09" or "+0930" are also accepted in some Arrow builds. When chrono-tz/zone DB support is disabled, only fixed-offset forms are allowed. [2][3]
References: [1] Arrow FlatBuffers Timestamp docs (timezone accepted forms). [2] arrow-rs datatype docs (fixed-offset parsing examples). [3] Arrow Rust/schema & Python docs (timestamp timezone semantics).
[/run_scripts]
🏁 Script executed:
#!/bin/bash
# Check line 299 context where extract_offset_components is called
echo "=== Context around line 299 where extract_offset_components is called ==="
sed -n '290,310p' datafusion/functions/src/datetime/date_part.rs
echo ""
echo "=== Look for any timezone format tests ==="
rg -n "\\+[0-9][0-9]" --type rust datafusion/functions/src/datetime/ -B1 -A1 | head -40Length of output: 4975
Fix extract_offset_components to handle all valid Arrow fixed-offset formats.
The function rejects valid timezone formats that Arrow accepts. Test cases in the codebase (date_trunc.rs) use formats like +05, +08, +00, but the current implementation requires at least 6 characters and assumes the format +HH:MM. This causes silent failures when offset components cannot be extracted.
Update the parsing logic to handle:
- Short formats:
+05,-09(sign + hours only) - Formats without colon:
+0530,-0545(sign + hours + minutes) - Standard formats:
+05:30,-09:45(sign + hours + colon + minutes)
Consider using a regex or more flexible parsing approach, and add explicit error handling instead of silently returning None.
| // Applies the timezone offset to minute values in an array. | ||
| fn apply_minute_offset(array: &dyn Array, offset_minutes: i32) -> Result<ArrayRef> { | ||
| let minute_array = as_int32_array(array)?; | ||
| let result: Int32Array = minute_array | ||
| .iter() | ||
| .map(|minute| { | ||
| minute.map(|m| { | ||
| let adjusted = m + offset_minutes; | ||
| ((adjusted % 60) + 60) % 60 | ||
| }) | ||
| }) | ||
| .collect(); | ||
| Ok(Arc::new(result)) | ||
| } |
There was a problem hiding this comment.
Similar offset direction concern for minutes.
This function has the same potential logic issue as apply_hour_offset - it adds the offset rather than subtracting it. See the comment on lines 411-431 for details.
🤖 Prompt for AI Agents
In datafusion/functions/src/datetime/date_part.rs around lines 433 to 446, the
minute offset function adds offset_minutes to each minute which reverses
timezone adjustment; change the logic to subtract the offset (use m -
offset_minutes) and then normalize into 0..59 using the same ((value % 60) + 60)
% 60 pattern while preserving nulls; update any variable names/comments if
needed to reflect subtraction so behavior matches apply_hour_offset.
| // Applies the timezone offset to day values in an array. | ||
| fn apply_day_offset(array: &dyn Array, offset_hours: i32) -> Result<ArrayRef> { | ||
| let day_array = as_int32_array(array)?; | ||
| let result: Int32Array = day_array | ||
| .iter() | ||
| .map(|day| { | ||
| day.map(|d| { | ||
| if offset_hours >= 24 || offset_hours <= -24 { | ||
| d + (offset_hours / 24) | ||
| } else if offset_hours > 0 { | ||
| d + 1 | ||
| } else if offset_hours < 0 { | ||
| d - 1 | ||
| } else { | ||
| d | ||
| } | ||
| }) | ||
| }) | ||
| .collect(); | ||
| Ok(Arc::new(result)) | ||
| } |
There was a problem hiding this comment.
Day offset logic appears oversimplified.
The apply_day_offset function has several concerns:
-
Offset direction: Same concern as hour/minute offset functions - should this add or subtract?
-
Partial-day offset handling: For offsets between 1-23 hours, the function simply adds/subtracts 1 day. This ignores the actual time-of-day which determines whether a day boundary is crossed.
For example:
- Timestamp at 01:00 with -5 hour offset: same day
- Timestamp at 23:00 with -5 hour offset: previous day
The current logic would treat both identically, which is incorrect.
Consider calculating the day adjustment more precisely based on the actual hour value combined with the offset.
18615: To review by AI