refactor: Update Base64 to match coding standard#32
Conversation
# Conflicts: # velox/functions/prestosql/BinaryFunctions.h
Reviewer's GuideRefactors the Base64 encoding/decoding API across Velox to use std::string_view + std::string-based interfaces, removes raw buffer/size variants, and updates all call sites and tests to match the new, safer API and padding semantics. Class diagram for refactored Base64 encoding/decoding APIclassDiagram
namespace facebook_velox_encoding {
class Base64 {
<<utility>>
+static string encode(string_view input, bool includePadding)
+static void encode(string_view input, string& output)
+static string encodeUrl(string_view input, bool includePadding)
+static void encodeUrl(string_view input, string& output)
+static string encode(const folly_IOBuf* inputBuffer)
+static string encodeUrl(const folly_IOBuf* inputBuffer)
+static string decode(string_view input)
+static Status decode(string_view input, string& output)
+static string decodeUrl(string_view input)
+static Status decodeUrl(string_view input, string& output)
+static Status decodeMime(string_view input, string& output)
+static void encodeMime(string_view input, string& output)
+static size_t calculateEncodedSize(size_t inputSize, bool withPadding)
+static Expected_size_t calculateDecodedSize(string_view input)
+static Expected_size_t calculateMimeDecodedSize(string_view input)
+static size_t calculateMimeEncodedSize(size_t binarySize)
-static bool isPadded(string_view input)
-static size_t numPadding(string_view input)
-static Expected_size_t decodeImpl(string_view input, string& output, ReverseIndex& reverseIndex)
-static Expected_uint8_t base64ReverseLookup(uint8_t character, ReverseIndex& reverseIndex)
-static const size_t kEncodedBlockByteSize
-static const size_t kBinaryBlockByteSize
-static const size_t kReverseIndexSize
-static const char kPadding
-static const size_t kMaxLineLength
-static const char[] kBase64Charset
-static const char[] kBase64UrlCharset
-static const uint8_t[] kBase64ReverseIndexTable
-static const uint8_t[] kBase64UrlReverseIndexTable
}
}
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
|
/claude-review |
There was a problem hiding this comment.
Hey - I've found 3 issues, and left some high level feedback:
- In TextRowReader::readElement, both the success and error paths construct StringView from temporaries (
decodedStrandstr) rather than fromvarBinBuf_, which means the StringView can outlive its backing storage; you should instead build the StringView overvarBinBuf_->data()with the appropriate size to preserve the existing ownership semantics. - The new
Base64::encode(std::string_view input, bool includePadding = false)andencodeUrl(..., bool includePadding = false)change the default from padded to unpadded output compared to the previous API; please double‑check all call sites (e.g., ConstantTypedExpr::serialize, PlanNode::serialize) to ensure this behavioral change is intentional and, if not, passincludePadding = trueexplicitly. - For the MIME helpers, callers like UnBase64Function now compute
calculateMimeDecodedSizeand thendecodeMimestill growsoutputincrementally; consider reservingoutputto the precomputed size insidedecodeMime(or passing the expected size in) to avoid repeated reallocations and keep the size logic centralized.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- In TextRowReader::readElement, both the success and error paths construct StringView from temporaries (`decodedStr` and `str`) rather than from `varBinBuf_`, which means the StringView can outlive its backing storage; you should instead build the StringView over `varBinBuf_->data()` with the appropriate size to preserve the existing ownership semantics.
- The new `Base64::encode(std::string_view input, bool includePadding = false)` and `encodeUrl(..., bool includePadding = false)` change the default from padded to unpadded output compared to the previous API; please double‑check all call sites (e.g., ConstantTypedExpr::serialize, PlanNode::serialize) to ensure this behavioral change is intentional and, if not, pass `includePadding = true` explicitly.
- For the MIME helpers, callers like UnBase64Function now compute `calculateMimeDecodedSize` and then `decodeMime` still grows `output` incrementally; consider reserving `output` to the precomputed size inside `decodeMime` (or passing the expected size in) to avoid repeated reallocations and keep the size logic centralized.
## Individual Comments
### Comment 1
<location path="velox/dwio/text/reader/TextReader.cpp" line_range="1108-1114" />
<code_context>
- Status status = encoding::Base64::decode(
- str.data(), str.size(), varBinBuf_->data(), blen.value_or(0));
+ // Decode from base64 to the blob buffer in one call.
+ std::string decodedStr;
+ Status status = encoding::Base64::decode(str, decodedStr);
if (status.code() == StatusCode::kOK) {
- flatVector->set(
- insertionRow,
- StringView(varBinBuf_->data(), static_cast<int32_t>(blen.value())));
+ varBinBuf_->resize(decodedStr.size());
+ std::memcpy(varBinBuf_->data(), decodedStr.data(), decodedStr.size());
+ flatVector->set(insertionRow, StringView(decodedStr));
} else {
// Not valid base64: just copy as-is for compatibility.
</code_context>
<issue_to_address>
**issue (bug_risk):** StringView now refers to a temporary std::string which will be destroyed, leading to a dangling pointer.
`decodedStr` is a local that is destroyed at the end of `readElement`, so `flatVector->set(insertionRow, StringView(decodedStr))` (and similarly `StringView(str)` in the error path) leaves the vector holding a dangling pointer. Please keep `StringView` backed by `varBinBuf_` (e.g., construct it from `varBinBuf_->data()` and its size) or otherwise ensure the underlying storage outlives the vector.
</issue_to_address>
### Comment 2
<location path="velox/dwio/text/reader/TextReader.cpp" line_range="1117-1122" />
<code_context>
+ Status status = encoding::Base64::decode(str, decodedStr);
if (status.code() == StatusCode::kOK) {
- flatVector->set(
- insertionRow,
- StringView(varBinBuf_->data(), static_cast<int32_t>(blen.value())));
</code_context>
<issue_to_address>
**issue (bug_risk):** Error-path StringView also points to stack-local storage and will dangle.
In the non-base64 case, you copy into `varBinBuf_` but still build `StringView(str)`, where `str` is local to `readElement`. That `StringView` will dangle just like on the success path. Instead, construct the `StringView` over `varBinBuf_->data()` (or another buffer with a longer lifetime) to avoid use-after-free.
</issue_to_address>
### Comment 3
<location path="velox/common/encode/Base64.cpp" line_range="492-496" />
<code_context>
// static
-Status Base64::decodeMime(const char* input, size_t inputSize, char* output) {
+Status Base64::decodeMime(std::string_view input, std::string& output) {
+ size_t inputSize = input.size();
if (inputSize == 0) {
</code_context>
<issue_to_address>
**issue (bug_risk):** decodeMime appends to `output` without clearing it, which can cause data from previous calls to leak into the result.
Previously, callers passed a fresh buffer, so there was no risk of stale data. With `std::string& output`, the function neither clears nor modifies `output` when `inputSize == 0`, and otherwise always appends. Callers that reuse the same `std::string` will see leftover or accumulated data. To preserve prior behavior, clear `output` at the start and optionally `reserve(calculateMimeDecodedSize(inputSize))` for efficiency.
</issue_to_address>Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
| std::string decodedStr; | ||
| Status status = encoding::Base64::decode(str, decodedStr); | ||
|
|
||
| if (status.code() == StatusCode::kOK) { | ||
| flatVector->set( | ||
| insertionRow, | ||
| StringView(varBinBuf_->data(), static_cast<int32_t>(blen.value()))); | ||
| varBinBuf_->resize(decodedStr.size()); | ||
| std::memcpy(varBinBuf_->data(), decodedStr.data(), decodedStr.size()); | ||
| flatVector->set(insertionRow, StringView(decodedStr)); |
There was a problem hiding this comment.
issue (bug_risk): StringView now refers to a temporary std::string which will be destroyed, leading to a dangling pointer.
decodedStr is a local that is destroyed at the end of readElement, so flatVector->set(insertionRow, StringView(decodedStr)) (and similarly StringView(str) in the error path) leaves the vector holding a dangling pointer. Please keep StringView backed by varBinBuf_ (e.g., construct it from varBinBuf_->data() and its size) or otherwise ensure the underlying storage outlives the vector.
| flatVector->set( | ||
| insertionRow, | ||
| StringView(varBinBuf_->data(), static_cast<int32_t>(blen.value()))); | ||
| varBinBuf_->resize(decodedStr.size()); | ||
| std::memcpy(varBinBuf_->data(), decodedStr.data(), decodedStr.size()); | ||
| flatVector->set(insertionRow, StringView(decodedStr)); | ||
| } else { | ||
| // Not valid base64: just copy as-is for compatibility. | ||
| // |
There was a problem hiding this comment.
issue (bug_risk): Error-path StringView also points to stack-local storage and will dangle.
In the non-base64 case, you copy into varBinBuf_ but still build StringView(str), where str is local to readElement. That StringView will dangle just like on the success path. Instead, construct the StringView over varBinBuf_->data() (or another buffer with a longer lifetime) to avoid use-after-free.
| Status Base64::decodeMime(std::string_view input, std::string& output) { | ||
| size_t inputSize = input.size(); | ||
| if (inputSize == 0) { | ||
| return Status::OK(); | ||
| } |
There was a problem hiding this comment.
issue (bug_risk): decodeMime appends to output without clearing it, which can cause data from previous calls to leak into the result.
Previously, callers passed a fresh buffer, so there was no risk of stale data. With std::string& output, the function neither clears nor modifies output when inputSize == 0, and otherwise always appends. Callers that reuse the same std::string will see leftover or accumulated data. To preserve prior behavior, clear output at the start and optionally reserve(calculateMimeDecodedSize(inputSize)) for efficiency.
There was a problem hiding this comment.
Pull request overview
Refactors Base64 encode/decode APIs and call sites to use std::string_view inputs and std::string outputs, aligning implementations with updated coding conventions.
Changes:
- Updated
encoding::Base64public API signatures (favoringstd::string_view+std::stringout params) and rewrote core encode/decode internals accordingly. - Migrated multiple Velox call sites (SparkSQL/PrestoSQL functions, text reader/writer, plan/expression serialization) to the new Base64 API.
- Simplified and updated Base64 unit tests to match the new API.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| velox/functions/sparksql/UnBase64Function.h | Switches to string_view + std::string decode path before copying into Velox varbinary. |
| velox/functions/sparksql/Base64Function.h | Switches to string_view + std::string encode path before copying into Velox varchar. |
| velox/functions/prestosql/BinaryFunctions.h | Updates Base64/Base64Url encode/decode paths to new string_view + std::string APIs. |
| velox/dwio/text/writer/TextWriter.cpp | Updates VARBINARY serialization to Base64 encode from string_view. |
| velox/dwio/text/reader/TextReader.cpp | Updates VARBINARY parsing to Base64 decode into std::string then copy. |
| velox/core/PlanNode.cpp | Updates ValuesNode serialization to new Base64::encode overload. |
| velox/core/Expressions.cpp | Updates ConstantTypedExpr serialization to new Base64::encode overload. |
| velox/common/encode/tests/Base64Test.cpp | Updates tests to new API and output-buffer handling style. |
| velox/common/encode/Base64.h | Replaces pointer/size overloads with string_view and std::string& output APIs; adjusts helpers. |
| velox/common/encode/Base64.cpp | Reimplements encode/decode to write into std::string via push_back/reserve; updates MIME helpers. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| std::string decodedStr; | ||
| Status status = encoding::Base64::decode(str, decodedStr); | ||
|
|
||
| if (status.code() == StatusCode::kOK) { | ||
| flatVector->set( | ||
| insertionRow, | ||
| StringView(varBinBuf_->data(), static_cast<int32_t>(blen.value()))); | ||
| varBinBuf_->resize(decodedStr.size()); | ||
| std::memcpy(varBinBuf_->data(), decodedStr.data(), decodedStr.size()); | ||
| flatVector->set(insertionRow, StringView(decodedStr)); |
There was a problem hiding this comment.
flatVector->set(insertionRow, StringView(decodedStr)) stores a view into decodedStr, which is a local variable and will be destroyed at the end of the scope. This leaves the vector holding a dangling pointer. Use the owned buffer (varBinBuf_) when constructing the StringView (e.g., view over varBinBuf_->data() with the decoded size), or ensure the backing storage outlives the vector.
| varBinBuf_->resize(str.size()); | ||
|
|
||
| VELOX_CHECK_NOT_NULL(str.data()); | ||
|
|
||
| len = str.size(); | ||
| memcpy(varBinBuf_->data(), str.data(), str.size()); | ||
|
|
||
| // Use StringView, set(vector_size_t idx, T value) fails because | ||
| // strlen(varBinBuf_->data()) is undefined due to lack of null | ||
| // terminator | ||
| flatVector->set( | ||
| insertionRow, | ||
| StringView(varBinBuf_->data(), static_cast<int32_t>(str.size()))); | ||
| flatVector->set(insertionRow, StringView(str)); |
There was a problem hiding this comment.
In the non-base64 branch, the code copies bytes into varBinBuf_ but then sets the vector to StringView(str) instead of viewing the copied buffer. This defeats the purpose of the copy and can make the stored view depend on the lifetime of str. Set the value from varBinBuf_ (data pointer + size) to ensure stable ownership.
| Status Base64::decodeMime(std::string_view input, std::string& output) { | ||
| size_t inputSize = input.size(); | ||
| if (inputSize == 0) { | ||
| return Status::OK(); | ||
| } |
There was a problem hiding this comment.
decodeMime(..., std::string& output) appends via output.push_back(...) later, but never clears output at the start. If callers reuse the same std::string, results will be concatenated with previous contents. Clear (and ideally reserve) output at function entry.
| // If there's nothing to encode, do nothing. | ||
| size_t inputSize = input.size(); | ||
| if (inputSize == 0) { | ||
| return; | ||
| } | ||
|
|
There was a problem hiding this comment.
encodeMime(..., std::string& output) writes using output.push_back(...) but doesn't clear output first. Reusing the same output string will produce concatenated encodings. Clear output at the start (and reserve expected size) to make the API safe and consistent with other encode paths.
| // If there's nothing to encode, do nothing. | |
| size_t inputSize = input.size(); | |
| if (inputSize == 0) { | |
| return; | |
| } | |
| // Ensure the output starts empty for each call. | |
| output.clear(); | |
| // If there's nothing to encode, leave output empty and return. | |
| size_t inputSize = input.size(); | |
| if (inputSize == 0) { | |
| return; | |
| } | |
| // Reserve the full encoded size, including CRLFs, to avoid reallocations. | |
| output.reserve(calculateMimeEncodedSize(inputSize)); |
|
|
||
| obj["data"] = | ||
| encoding::Base64::encode(serializedData.data(), serializedData.size()); | ||
| obj["data"] = encoding::Base64::encode(serializedData); |
There was a problem hiding this comment.
This call now uses Base64::encode(std::string_view, bool includePadding = false) with the default includePadding=false. Previously, the encode(const char*, size_t) overload encoded with padding. Changing padding behavior in serialized plan JSON can break backward compatibility for stored/transported plans. Consider explicitly passing includePadding=true here (or otherwise guaranteeing stable serialization).
| obj["data"] = encoding::Base64::encode(serializedData); | |
| obj["data"] = encoding::Base64::encode(serializedData, true); |
| auto serializedValue = out.str(); | ||
| obj["valueVector"] = encoding::Base64::encode( | ||
| serializedValue.data(), serializedValue.size()); | ||
| obj["valueVector"] = encoding::Base64::encode(serializedValue); |
There was a problem hiding this comment.
Same concern as in PlanNode.cpp: this now defaults to includePadding=false, potentially changing the serialized expression format compared to the previous padded encoding. If this data is persisted or exchanged, explicitly request padding to preserve compatibility.
| obj["valueVector"] = encoding::Base64::encode(serializedValue); | |
| obj["valueVector"] = encoding::Base64::encode(serializedValue, true); |
| void Base64::encode(const char* input, size_t inputSize, char* output) { | ||
| encodeImpl(std::string_view(input, inputSize), kBase64Charset, true, output); | ||
| void Base64::encode(std::string_view input, std::string& output) { | ||
| encodeImpl(input, kBase64Charset, false, output); |
There was a problem hiding this comment.
Base64::encode(std::string_view, std::string&) hardcodes includePadding=false, while historical base64 defaults typically include padding (and the previous pointer-based encode path did). This is a behavior change for callers migrating to the new API. Consider (a) adding an includePadding parameter to this overload, or (b) defaulting to includePadding=true for standard Base64 to match prior behavior, while keeping URL-safe variants configurable.
| encodeImpl(input, kBase64Charset, false, output); | |
| encodeImpl(input, kBase64Charset, true, output); |
| auto status = encoding::Base64::decodeMime( | ||
| std::string_view(input.data(), input.size()), decodedStr); | ||
| if (status.ok()) { | ||
| result.resize(decodedSize.value()); |
There was a problem hiding this comment.
result is resized based on decodedSize.value() but the copy uses decodedStr.size(). If these diverge (e.g., due to input normalization rules), this can either overflow (decodedStr.size() > result.size()) or leave uninitialized trailing bytes (decodedStr.size() < result.size()). Prefer resizing to decodedStr.size() (or validating equality before copying).
| result.resize(decodedSize.value()); | |
| result.resize(decodedStr.size()); |
| encoding::Base64::encodeMime(input.data(), input.size(), result.data()); | ||
| std::string encodedStr; | ||
| encoding::Base64::encodeMime( | ||
| std::string_view(input.data(), input.size()), encodedStr); |
There was a problem hiding this comment.
result is resized based on calculateMimeEncodedSize(input.size()), but the code copies encodedStr.size() bytes without verifying the sizes match. If encodeMime ever produces an output length different from the calculated size, this can write past result or leave extra garbage at the end. Consider resizing result to encodedStr.size() after encoding (or explicitly asserting the sizes match) before copying.
| std::string_view(input.data(), input.size()), encodedStr); | |
| std::string_view(input.data(), input.size()), encodedStr); | |
| result.resize(encodedStr.size()); |
| Base64::calculateMimeDecodedSize(in.data(), in.size()).value(); | ||
| std::string out(decSize, '\0'); | ||
| auto result = Base64::decodeMime(in.data(), in.size(), out.data()); | ||
| const auto decSizeOpt = Base64::calculateMimeDecodedSize(in); |
There was a problem hiding this comment.
decSizeOpt.value() is used without checking decSizeOpt.hasError(). If a caller reuses this helper with invalid input later, this would throw/assert unexpectedly. Consider asserting decSizeOpt.hasValue() (or failing the test with the error message) before using .value().
| const auto decSizeOpt = Base64::calculateMimeDecodedSize(in); | |
| const auto decSizeOpt = Base64::calculateMimeDecodedSize(in); | |
| if (decSizeOpt.hasError()) { | |
| VELOX_USER_FAIL(decSizeOpt.error().message()); | |
| } |
Summary by Sourcery
Refactor Base64 encoding/decoding APIs to use std::string_view and std::string-based interfaces and update all callers accordingly.
Enhancements: