3610: AVRO-4222: [C++] Support writer to specify compression level#23
3610: AVRO-4222: [C++] Support writer to specify compression level#23martin-augment wants to merge 1 commit intomainfrom
Conversation
WalkthroughThis pull request adds support for optional compression levels in the Avro C++ DataFile implementation. The changes extend the DataFileWriter API to accept an optional compression level parameter, update DEFLATE and ZSTD compression routines to utilize this parameter, and introduce codec validation and availability utilities. The Codec enum is modified to unconditionally define SNAPPY_CODEC and ZSTD_CODEC. ZstdCompressWrapper's compress method signature is updated to accept an optional compression level. Corresponding public API declarations and constructor signatures are updated across DataFile.hh, and comprehensive test functions are added to validate codec behavior and compression round-trips. ✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
PR Review: AVRO-4222 - Support writer to specify compression levelSummaryThis PR adds support for specifying compression levels when writing Avro data files in C++. The implementation looks solid overall with good test coverage and proper validation. ✅ Strengths
🔍 Issues & SuggestionsCritical Issues
Code Quality
Test Improvements
Documentation
🔒 Security ConsiderationsNo security concerns identified. The validation properly prevents:
⚡ Performance Considerations
📋 RecommendationsMust fix before merge:
Should fix: Nice to have: 📊 Overall AssessmentQuality: 8/10 - Well-implemented feature with excellent test coverage, but needs the critical duplicate metadata bug fixed. The PR demonstrates good understanding of:
Great work overall! Please address the critical metadata issue and consider the other suggestions. |
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (1)
lang/c++/impl/DataFile.cc (1)
176-188: Redundant metadata assignment.Line 177 unconditionally sets
AVRO_CODEC_KEYtoAVRO_NULL_CODEC, but lines 178-188 immediately overwrite it based on the actual codec. The initial assignment on line 177 is unnecessary.🔎 Suggested fix
validateCodec(codec_, compressionLevel_); - setMetadata(AVRO_CODEC_KEY, AVRO_NULL_CODEC); if (codec_ == NULL_CODEC) { setMetadata(AVRO_CODEC_KEY, AVRO_NULL_CODEC);
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
lang/c++/impl/DataFile.cclang/c++/impl/ZstdCompressWrapper.cclang/c++/impl/ZstdCompressWrapper.hhlang/c++/include/avro/DataFile.hhlang/c++/test/DataFileTests.cc
🧰 Additional context used
🧬 Code graph analysis (3)
lang/c++/impl/ZstdCompressWrapper.cc (1)
lang/c++/impl/ZstdCompressWrapper.hh (1)
uncompressed(36-36)
lang/c++/test/DataFileTests.cc (2)
lang/c++/include/avro/DataFile.hh (1)
isCodecAvailable(48-48)lang/c++/impl/DataFile.cc (2)
isCodecAvailable(113-133)isCodecAvailable(113-113)
lang/c++/include/avro/DataFile.hh (1)
lang/c++/test/DataFileTests.cc (2)
codec(225-236)codec(225-225)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Cursor Bugbot
- GitHub Check: claude-review
🔇 Additional comments (19)
lang/c++/impl/ZstdCompressWrapper.cc (1)
28-40: LGTM!The updated
compressmethod correctly usesstd::optional<int>for the compression level parameter and appliesvalue_or(ZSTD_CLEVEL_DEFAULT)to fall back to ZSTD's default when no level is specified. This aligns with the header declaration and the broader PR goal.lang/c++/impl/ZstdCompressWrapper.hh (1)
24-36: LGTM!The header correctly adds
<optional>include and updates thecompressmethod signature with a default value ofstd::nullopt, ensuring backward compatibility for existing callers.lang/c++/test/DataFileTests.cc (7)
1287-1321: LGTM! Good property-based testing approach.The validation test comprehensively covers the deflate compression level range (0-9), invalid levels, and the
nulloptcase. The random sampling provides good coverage.
1323-1359: LGTM!The ZSTD validation test correctly tests the 1-22 valid range and is properly guarded by
ZSTD_CODEC_AVAILABLE.
1361-1413: LGTM!The round-trip test properly exercises various compression levels including the
nulloptcase and verifies data integrity. The random data generation and verification logic is sound.
1415-1470: LGTM!The ZSTD round-trip test correctly handles the valid level range (1-22) and maps 0 to
nullopt. Well-structured parallel to the deflate test.
1472-1479: LGTM!Testing explicit enum values helps ensure ABI stability and prevents accidental changes to the codec ordering.
1481-1498: LGTM!The test correctly validates runtime codec availability, matching the compile-time configuration via
SNAPPY_CODEC_AVAILABLEandZSTD_CODEC_AVAILABLEguards.
1703-1717: LGTM!Test registration is complete and properly guards ZSTD-specific tests with
ZSTD_CODEC_AVAILABLE.lang/c++/include/avro/DataFile.hh (4)
38-48: LGTM! Good API design.Making
SNAPPY_CODECandZSTD_CODECunconditionally visible in the enum while providingisCodecAvailable()for runtime checks is a clean design. This allows callers to reference codec values without compile-time guards while still getting proper runtime errors for unavailable codecs.
83-83: LGTM!The
compressionLevel_member correctly usesstd::optional<int>to store the optional compression level.
135-140: LGTM!The
DataFileWriterBaseconstructor declarations correctly add the optional compression level parameter with a sensible default, maintaining backward compatibility.
174-182: LGTM!The
DataFileWritertemplate constructors correctly forward thecompressionLevelparameter to the base class, maintaining API consistency.lang/c++/impl/DataFile.cc (6)
67-75: LGTM!The
codecToStringhelper provides clear string representations for all codec types, useful for error messages.
77-109: LGTM!The
validateCodecfunction correctly validates codec availability and compression level ranges. The early return fornulloptlevels and the per-codec validation logic is well-structured.
113-133: LGTM!The
isCodecAvailablefunction correctly reports codec availability based on compile-time configuration, enabling runtime checks before codec usage.
229-232: LGTM!The deflate path correctly uses
value_or(Z_DEFAULT_COMPRESSION)to apply the specified level or fall back to zlib's default compression level.
317-318: LGTM!The ZSTD path correctly passes the optional
compressionLevel_to the wrapper, which handles the default internally.
135-167: LGTM!Both
DataFileWriterBaseconstructors correctly initialize the newcompressionLevel_member and maintain consistency between the file-based and stream-based variants.
🤖 Augment PR SummarySummary: Adds optional compression-level support for the C++ Avro data-file writer, primarily for deflate and zstandard. Changes:
Notes: When no level is provided, the library defaults ( 🤖 Was this summary useful? React with 👍 or 👎 |
| setMetadata(AVRO_CODEC_KEY, AVRO_DEFLATE_CODEC); | ||
| #ifdef SNAPPY_CODEC_AVAILABLE | ||
| } else if (codec_ == SNAPPY_CODEC) { | ||
| setMetadata(AVRO_CODEC_KEY, AVRO_SNAPPY_CODEC); |
There was a problem hiding this comment.
This references AVRO_SNAPPY_CODEC (and similarly AVRO_ZSTD_CODEC below) unconditionally, but those metadata strings are still only defined under #ifdef SNAPPY_CODEC_AVAILABLE / #ifdef ZSTD_CODEC_AVAILABLE earlier in this file. As a result, builds without those optional codecs enabled will fail to compile here.
🤖 Was this useful? React with 👍 or 👎
There was a problem hiding this comment.
value:useful; category:bug; feedback:The Augment AI reviewer is correct! The Pull Request author tried to simplify the code by removing the usage of the build arguments but introduced build errors in case some of the arguments are not defined.
| syncInterval_(syncInterval), | ||
| codec_(codec), | ||
| compressionLevel_(compressionLevel), | ||
| stream_(fileOutputStream(filename)), |
There was a problem hiding this comment.
Because validateCodec() runs inside init() after stream_(fileOutputStream(filename)) has already executed, an unavailable codec or invalid compression level can still create/truncate the target file before throwing. This is a new side-effect path introduced by the compression-level validation.
🤖 Was this useful? React with 👍 or 👎
There was a problem hiding this comment.
value:useful; category:bug; feedback:The Augment AI reviewer is correct! The output file is created/truncated and then the validation checks are executed. This is an old issue - it was the same until now if syncInterval was invalid. Prevents creating/truncating the output file if it won't be used due to validation failure.
value:good-to-have; category:bug; feedback:The CodeRabbit AI reviewer is correct! There is no need to check whether the codec is AVRO_NULL_CODEC because it is set as a default anyway. Prevents re-setting the AVRO_NULL_CODEC a second time. |
value:good-to-have; category:bug; feedback:The Claude AI reviewer is correct! There is no need to check whether the codec is AVRO_NULL_CODEC because it is set as a default anyway. Prevents re-setting the AVRO_NULL_CODEC a second time. |
value:useful; category:bug; feedback:The Claude AI reviewer is correct! This |
3610: To review by AI