Pull Request Overview
This PR introduces several major enhancements and fixes to the Velox codebase, including new Spark SQL functions, type system improvements, and various bug fixes.
Key changes:
- Adds new Spark SQL functions: `to_json`, `timestampadd`, and an enhanced `split` function with fast path optimization
- Improves the type system with a new `valueToString` method and utility functions for better string representation
- Fixes the map filter function to support FlatMapVector encoding and resolves element_at caching issues
Reviewed Changes
Copilot reviewed 214 out of 215 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| velox/type/Type.h | Adds valueToString template method and new enum parameters for type system |
| velox/functions/sparksql/ToJson.h | New comprehensive ToJson function implementation with timezone support |
| velox/functions/sparksql/Split.h | Enhanced split function with fast path optimization for simple delimiters |
| velox/functions/prestosql/FilterFunctions.cpp | Major enhancement to map_filter supporting FlatMapVector encoding |
| velox/functions/lib/SubscriptUtil.cpp | Fixes element_at caching bug with non-zero offset handling |
```cpp
    std::string& result,
    const JsonOptions& /*options*/,
    bool /*isMapKey*/) {
  VELOX_DCHECK(input.type()->isDecimal(), "HUGEINT must be a decimal type.");
```
The error message should be more descriptive. Consider changing to "Input type HUGEINT must be a decimal type" to provide clearer context about what input is being validated.
```diff
- VELOX_DCHECK(input.type()->isDecimal(), "HUGEINT must be a decimal type.");
+ VELOX_DCHECK(input.type()->isDecimal(), "Input type HUGEINT must be a decimal type.");
```
```cpp
if (std::stoi(str + 1) > 177) {
  return false;
}
decimal = std::stoi(str + 1, nullptr, 8);
```
Using std::stoi without error handling can throw exceptions for invalid input. Since this is checking octal validity, consider using a safer parsing approach or add exception handling.
```diff
- decimal = std::stoi(str + 1, nullptr, 8);
+ int value = 0;
+ for (int i = 1; i < size; ++i) {
+   value = value * 8 + (str[i] - '0');
+ }
+ if (value > 0x7F) { // 177 octal == 127 decimal
+   return false;
+ }
+ decimal = static_cast<char>(value);
```
```cpp
if (std::stoi(str + 1) > 177) {
  return false;
}
decimal = std::stoi(str + 1, nullptr, 8);
```
Same issue as above - std::stoi can throw exceptions. Add proper error handling or use a safer parsing method.
```diff
- decimal = std::stoi(str + 1, nullptr, 8);
+ try {
+   if (std::stoi(str + 1) > 177) {
+     return false;
+   }
+   decimal = std::stoi(str + 1, nullptr, 8);
+ } catch (const std::invalid_argument&) {
+   return false;
+ } catch (const std::out_of_range&) {
+   return false;
+ }
```
```cpp
BufferPtr decodedIndices =
    AlignedBuffer::allocate<vector_size_t>(numRows, flatMap.pool());
auto mutableIndices = decodedIndices->asMutable<vector_size_t>();
for (int i = 0; i < decodedMap.size(); i++) {
```
[nitpick] Consider using vector_size_t instead of int for consistency with Velox codebase conventions for vector indexing.
```diff
- for (int i = 0; i < decodedMap.size(); i++) {
+ for (vector_size_t i = 0; i < decodedMap.size(); i++) {
```
```cpp
  // Whether to ignore null fields during json generating.
  const bool ignoreNullFields{true};
};
```
The detail namespace and its classes/functions lack documentation. Consider adding brief comments explaining the purpose of the JsonOptions struct and helper functions.
```cpp
// Internal namespace containing helper structs and functions for JSON serialization.
namespace detail {
// Options for controlling JSON serialization behavior.
struct JsonOptions {
  // Time zone to use for formatting date/time values.
  const tz::TimeZone* timeZone;
  // Whether to ignore null fields during JSON generation.
  const bool ignoreNullFields{true};
};
// Appends a floating-point value to the result string, handling special cases for map keys and non-finite values.
```
The function `toValues` removes duplicated values from a vector and returns them in a `std::vector`. It was used to build an InPredicate. It will also be needed for building NOT IN filters for Iceberg equality delete reads, so this commit moves it from velox/functions/prestosql/InPredicate.cpp to velox/type/Filter.h. The commit also renames it to `deDuplicateValues` to make its purpose easier to understand.
This commit introduces EqualityDeleteFileReader, which is used to read Iceberg splits with equality delete files. The equality delete files are read to construct domain filters or filter functions, which are then evaluated in the base file readers.

When there is only one equality delete field, and that field is an Iceberg identifier field (i.e. a non-floating-point primitive type), the values are converted to a list used as a NOT IN domain filter, with NULL treated separately. This domain filter is then pushed down to the ColumnReaders to filter out unwanted rows before they are read into Velox vectors. When the equality delete column is a nested column, e.g. a sub-column in a struct or the key in a map, that column may not be in the base file ScanSpec. We need to add/remove these subfields to/from the SchemaWithId and ScanSpec recursively if they are not already in the ScanSpec. A test is added for this case.

If there is more than one equality delete field, or the field is not an Iceberg identifier field, the values are converted to a typed expression in conjunction-of-disjunctions form. This expression is evaluated as the remaining filter function after the rows are read into Velox vectors. Note that this only works for Presto for now, as the "neq" function is not registered by Spark. See https://github.com/facebookincubator/issues/12667

Note that this commit only supports integral types. VARCHAR and VARBINARY need to be supported in future commits (see facebookincubator#12664).

Co-authored-by: Naveen Kumar Mahadevuni <Naveen.Mahadevuni@ibm.com>
New Features
- Added full VARCHAR and VARBINARY support for equality deletes
- Introduced test cases for FLOAT/DOUBLE with proper Iceberg error handling
- Added support for short decimal (DECIMAL(6,2)) and long decimal (DECIMAL(25,5)) types
- Expanded null value testing across kNoNulls, kPartialNulls, and kAllNulls configurations
- Extended multi-column and mixed-type equality delete test coverage

Test Infrastructure
- Split IcebergReadTest into specialized IcebergReadEqualityDeleteTest and IcebergReadPositionalDeleteTest
- Extracted shared functionality into IcebergTestBase
- Templatized helper functions for stronger type safety and reusability
- Refactored from template-based tests to parameterized tests for broader coverage