Skip to content

Nullice#19

Draft
Joe-Abraham wants to merge 5 commits intomainfrom
nullice
Draft

Nullice#19
Joe-Abraham wants to merge 5 commits intomainfrom
nullice

Conversation

@Joe-Abraham
Copy link
Owner

No description provided.

@Joe-Abraham Joe-Abraham requested a review from Copilot July 29, 2025 05:26

This comment was marked as outdated.

@Joe-Abraham Joe-Abraham force-pushed the nullice branch 3 times, most recently from e737df5 to 41e21f8 Compare August 1, 2025 07:51
@Joe-Abraham Joe-Abraham force-pushed the nullice branch 4 times, most recently from d518ed8 to c04faef Compare August 12, 2025 06:37
@Joe-Abraham Joe-Abraham requested a review from Copilot August 18, 2025 05:11
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces several major enhancements and fixes to the Velox codebase, including new Spark SQL functions, type system improvements, and various bug fixes.

Key changes:

  • Adds new Spark SQL functions: to_json, timestampadd, and enhanced split function with fast path optimization
  • Improves type system with new valueToString method and utility functions for better string representation
  • Fixes map filter function to support FlatMapVector encoding and resolves element_at caching issues

Reviewed Changes

Copilot reviewed 214 out of 215 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
velox/type/Type.h Adds valueToString template method and new enum parameters for type system
velox/functions/sparksql/ToJson.h New comprehensive ToJson function implementation with timezone support
velox/functions/sparksql/Split.h Enhanced split function with fast path optimization for simple delimiters
velox/functions/prestosql/FilterFunctions.cpp Major enhancement to map_filter supporting FlatMapVector encoding
velox/functions/lib/SubscriptUtil.cpp Fixes element_at caching bug with non-zero offset handling

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

std::string& result,
const JsonOptions& /*options*/,
bool /*isMapKey*/) {
VELOX_DCHECK(input.type()->isDecimal(), "HUGEINT must be a decimal type.");
Copy link

Copilot AI Aug 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message should be more descriptive. Consider changing to "Input type HUGEINT must be a decimal type" to provide clearer context about what input is being validated.

Suggested change
VELOX_DCHECK(input.type()->isDecimal(), "HUGEINT must be a decimal type.");
VELOX_DCHECK(input.type()->isDecimal(), "Input type HUGEINT must be a decimal type.");

Copilot uses AI. Check for mistakes.
if (std::stoi(str + 1) > 177) {
return false;
}
decimal = std::stoi(str + 1, nullptr, 8);
Copy link

Copilot AI Aug 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using std::stoi without error handling can throw exceptions for invalid input. Since this is checking octal validity, consider using a safer parsing approach or add exception handling.

Suggested change
decimal = std::stoi(str + 1, nullptr, 8);
int value = 0;
for (int i = 1; i < size; ++i) {
value = value * 8 + (str[i] - '0');
}
if (value > 0x7F) { // 177 octal == 127 decimal
return false;
}
decimal = static_cast<char>(value);

Copilot uses AI. Check for mistakes.
if (std::stoi(str + 1) > 177) {
return false;
}
decimal = std::stoi(str + 1, nullptr, 8);
Copy link

Copilot AI Aug 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue as above - std::stoi can throw exceptions. Add proper error handling or use a safer parsing method.

Suggested change
decimal = std::stoi(str + 1, nullptr, 8);
try {
if (std::stoi(str + 1) > 177) {
return false;
}
decimal = std::stoi(str + 1, nullptr, 8);
} catch (const std::invalid_argument&) {
return false;
} catch (const std::out_of_range&) {
return false;
}

Copilot uses AI. Check for mistakes.
BufferPtr decodedIndices =
AlignedBuffer::allocate<vector_size_t>(numRows, flatMap.pool());
auto mutableIndices = decodedIndices->asMutable<vector_size_t>();
for (int i = 0; i < decodedMap.size(); i++) {
Copy link

Copilot AI Aug 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nitpick] Consider using vector_size_t instead of int for consistency with Velox codebase conventions for vector indexing.

Suggested change
for (int i = 0; i < decodedMap.size(); i++) {
for (vector_size_t i = 0; i < decodedMap.size(); i++) {

Copilot uses AI. Check for mistakes.
// Whether to ignore null fields during json generating.
const bool ignoreNullFields{true};
};

Copy link

Copilot AI Aug 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The detail namespace and its classes/functions lack documentation. Consider adding brief comments explaining the purpose of the JsonOptions struct and helper functions.

Suggested change
// Internal namespace containing helper structs and functions for JSON serialization.
namespace detail {
// Options for controlling JSON serialization behavior.
struct JsonOptions {
// Time zone to use for formatting date/time values.
const tz::TimeZone* timeZone;
// Whether to ignore null fields during JSON generation.
const bool ignoreNullFields{true};
};
// Appends a floating-point value to the result string, handling special cases for map keys and non-finite values.

Copilot uses AI. Check for mistakes.
@Joe-Abraham Joe-Abraham force-pushed the nullice branch 2 times, most recently from 9f98023 to cc1b865 Compare August 20, 2025 07:38
@Joe-Abraham Joe-Abraham changed the base branch from iceberg9.3 to main August 28, 2025 05:14
@Joe-Abraham Joe-Abraham force-pushed the nullice branch 5 times, most recently from 6f693f4 to 8cd2fc6 Compare September 1, 2025 08:49
@Joe-Abraham Joe-Abraham force-pushed the nullice branch 3 times, most recently from 62c2f7a to 4cee25b Compare September 9, 2025 07:07
@Joe-Abraham Joe-Abraham force-pushed the nullice branch 4 times, most recently from 80ff909 to 729107f Compare September 22, 2025 06:01
yingsu00 and others added 2 commits October 1, 2025 16:17
The function toValues removes duplicated values from the vector and
return them in a std::vector. It was used to build an InPredicate. It
will be needed for building NOT IN filters for Iceberg equality delete
read as well, therefore moving it from velox/functions/prestosql/InPred
icate.cpp to velox/type/Filter.h. This commit also renames it to
deDuplicateValues to make it easier to understand.
This commit introduces EqualityDeleteFileReader, which is used to read
Iceberg splits with equality delete files. The equality delete files
are read to construct domain filters or filter functions, which then
would be evaluated in the base file readers.

When there is only one equality delete field, and when that field is
an Iceberg identifier field, i.e. non-floating point primitive types,
the values would be converted to a list as a NOT IN domain filter,
with the NULL treated separately. This domain filter would then be
pushed to the ColumnReaders to filter our unwanted rows before they
are read into Velox vectors. When the equality delete column is a
nested column, e.g. a sub-column in a struct, or the key in a map,
such column may not be in the base file ScanSpec. We need to add/remove
these subfields to/from the SchemaWithId and ScanSpec recursively if
they were not in the ScanSpec already. A test is also added for such
case.

If there are more than one equality delete field, or the field is not
an Iceberg identifier field, the values would be converted to a typed
expression in the conjunct of disconjunts form. This expression would
be evaluated as the remaining filter function after the rows are read
into the Velox vectors. Note that this only works for Presto now as
the "neq" function is not registered by Spark. See https://github.com/
facebookincubator/issues/12667

Note that this commit only supports integral types. VARCHAR and
VARBINARY need to be supported in future commits (see
facebookincubator#12664).

Co-authored-by: Naveen Kumar Mahadevuni <Naveen.Mahadevuni@ibm.com>
New Features
- Added full VARCHAR and VARBINARY support for equality deletes
- Introduced test cases for FLOAT/DOUBLE with proper Iceberg error handling
- Added support for short decimal (DECIMAL(6,2)) and long decimal (DECIMAL(25,5)) types
- Expanded null value testing across kNoNulls, kPartialNulls, and kAllNulls configurations
- Extended multi-column and mixed-type equality delete test coverage

Test Infrastructure
- Split IcebergReadTest into specialized: IcebergReadEqualityDeleteTest and IcebergReadPositionalDeleteTest
- Extracted shared functionality into IcebergTestBase
- Templatized helper functions for stronger type safety and reusability
- Refactored from template-based tests to parameterized tests for broader coverage
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants