Member @wgtmac commented Dec 22, 2025

Refactors the ManifestReader implementation to support advanced data skipping
and projection pushdown, and adds full test coverage ported from the Java implementation.

Implementation Changes:

  • Consolidate ManifestReader logic: Move implementation from manifest_reader_internal.cc
    to manifest_reader.cc and remove the obsolete internal class.
  • Fluent API: Add Select(), FilterPartitions(), and FilterRows() to support
    column projection and expression-based filtering (partition & metrics).
  • Optimization: Support lazy initialization of the underlying Avro reader.
  • ManifestEntry: Update schema handling to support reading specific columns and stats.
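
To illustrate the fluent API and lazy initialization described in the list above, here is a minimal sketch. It is not the iceberg-cpp implementation: `ReaderSketch`, the string-based filters, and the `open_count()` helper are all hypothetical stand-ins, and the "reader" being opened is only a counter.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch: method names follow the PR description, but the
// bodies and the string-based filters are placeholders, not iceberg-cpp code.
class ReaderSketch {
 public:
  // Each setter stores its argument and returns *this so calls can be chained.
  ReaderSketch& Select(std::vector<std::string> columns) {
    columns_ = std::move(columns);
    return *this;
  }
  ReaderSketch& FilterPartitions(std::string expr) {
    part_filter_ = std::move(expr);
    return *this;
  }
  ReaderSketch& FilterRows(std::string expr) {
    row_filter_ = std::move(expr);
    return *this;
  }

  // The underlying reader is "opened" on first use only, once all
  // projection and filter settings are known.
  const std::vector<std::string>& Entries() {
    if (!opened_) {
      opened_ = true;
      ++open_count_;
    }
    return columns_;
  }

  int open_count() const { return open_count_; }

 private:
  std::vector<std::string> columns_;
  std::string part_filter_;
  std::string row_filter_;
  bool opened_ = false;
  int open_count_ = 0;
};
```

Deferring the open until `Entries()` means the projection can take every `Select` and filter call into account before any file is touched.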

Test Coverage:

  • Add manifest_reader_test.cc: Ports TestManifestReader from Java, covering:
    • Basic reading and partition filtering.
    • V2/V3 specific features: Delete Files and Deletion Vectors (DVs).
    • Invalid usage validation (missing snapshot ID).
  • Add manifest_reader_stats_test.cc: Ports TestManifestReaderStats, verifying:
    • Metrics projection logic (stats columns included/dropped based on filter/select).
    • Exact match of stats parsing against expected values.
  • Parameterized Testing: All tests run against Format Versions 1, 2, and 3 to ensure cross-version compatibility.

@wgtmac force-pushed the enhance_manifest_reader branch 3 times, most recently from 537a136 to d4f7c24 (December 23, 2025 04:27)
@wgtmac marked this pull request as ready for review (December 23, 2025 04:30)
@wgtmac force-pushed the enhance_manifest_reader branch from d4f7c24 to 81b1732 (December 23, 2025 04:35)

namespace {

#define PARSE_PRIMITIVE_FIELD(item, array_view, type) \
Member Author

GitHub cannot work out the changes in this file. Most of it comes from combining manifest_reader.cc and manifest_reader_internal.cc.

Contributor

Can we change these macros to template functions?

Member Author

I agree, but I would prefer to do it in a separate PR.
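
For readers unfamiliar with the macro-to-template idea raised in this thread, here is a rough sketch of what such a conversion could look like. The body of PARSE_PRIMITIVE_FIELD is not shown in this conversation, so nothing here mirrors it: `FakeInt64Column` and `ParsePrimitiveField` are hypothetical names, and the column type is a stand-in rather than the nanoarrow API.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical stand-in for a nullable Arrow column; not the nanoarrow API.
struct FakeInt64Column {
  std::vector<std::optional<std::int64_t>> rows;
};

// Template replacement for a field-parsing macro: the compiler checks the
// argument types, and the function can be stepped through in a debugger.
// Returns false and leaves `out` untouched when the row is out of range
// or null; otherwise converts the stored value to T and returns true.
template <typename T>
bool ParsePrimitiveField(const FakeInt64Column& column, std::size_t row, T& out) {
  if (row >= column.rows.size() || !column.rows[row].has_value()) {
    return false;
  }
  out = static_cast<T>(*column.rows[row]);
  return true;
}
```

A template keeps the call sites as short as the macro while making the null and bounds handling explicit in one place.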

- Consolidate `ManifestReader` implementation into `manifest_reader.cc`
  and remove `manifest_reader_internal.cc`.
- Implement fluent API for column selection, partition filtering, and
  row filtering.
- Support lazy initialization of the underlying Avro reader.
- Add various filtering support for entries.
@wgtmac force-pushed the enhance_manifest_reader branch from 81b1732 to 5b11c46 (December 23, 2025 04:39)
@wgtmac changed the title from "feat: add projection and filtering to manifest reader" to "feat: enhance ManifestReader with projection, filtering and comprehensive tests" (Dec 23, 2025)
@wgtmac changed the title from "feat: enhance ManifestReader with projection, filtering and comprehensive tests" to "feat: enhance ManifestReader with projection and filtering support" (Dec 23, 2025)

/// \brief Add stats columns to the column list if needed.
std::vector<std::string> WithStatsColumns(
const std::vector<std::string>& columns) const;
Contributor

It's a static function in ManifestReader

Contributor @HuaHuaY (Dec 23, 2025)

This looks like a function declaration without an implementation. I guess it was left in by mistake.

Member Author

It is left over from a half-finished refactoring: I forgot to delete it from its original location.

class ICEBERG_EXPORT ManifestReader {
public:
/// \brief Special value to select all columns from manifest files.
inline static const std::vector<std::string> kAllColumns{"*"};
Contributor @WZhuo (Dec 23, 2025)

This does not need to be a vector, and it is also defined in Schema::Select.

Comment on lines 589 to 592
if (std::ranges::all_of(ManifestReader::kAllColumns,
[&columns](const std::string& col) {
return std::ranges::find(columns, col) != columns.cend();
})) {
Contributor

Suggested change
- if (std::ranges::all_of(ManifestReader::kAllColumns,
-                         [&columns](const std::string& col) {
-                           return std::ranges::find(columns, col) != columns.cend();
-                         })) {
+ if (std::ranges::find(columns, ManifestReader::kAllColumns) != columns.cend()) {

Contributor

Use std::ranges::contains instead.

Comment on lines 563 to 567
if (std::ranges::all_of(
ManifestReader::kAllColumns,
[&selected](const std::string& col) { return selected.contains(col); })) {
return false;
}
Contributor @WZhuo (Dec 23, 2025)

Suggested change
- if (std::ranges::all_of(
-         ManifestReader::kAllColumns,
-         [&selected](const std::string& col) { return selected.contains(col); })) {
-   return false;
- }
+ if (selected.contains(ManifestReader::kAllColumns)) {
+   return false;
+ }

This works if kAllColumns is a single string.
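
As a sketch of the simplification discussed in these threads, the wildcard check collapses to a single membership test once the sentinel is one string. The names below (`kAllColumns` as a free variable, `SelectsAllColumns`) are hypothetical stand-ins; `std::ranges::contains` is a C++23 addition, so the sketch falls back to `std::find` when the library does not provide it.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical stand-in for ManifestReader::kAllColumns as a single string.
inline const std::string kAllColumns = "*";

// Returns true when the selection contains the "*" wildcard.
bool SelectsAllColumns(const std::vector<std::string>& columns) {
#if defined(__cpp_lib_ranges_contains)
  return std::ranges::contains(columns, kAllColumns);  // C++23
#else
  return std::find(columns.begin(), columns.end(), kAllColumns) != columns.end();
#endif
}
```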

/// present
std::optional<int64_t> content_size_in_bytes;

inline static constexpr int32_t kContentFieldId = 134;
Contributor

We can drop the inline here: a static constexpr data member is already implicitly inline.


const StructType& PartitionFieldSummary::Type() {
static const StructType kInstance{{
const std::shared_ptr<StructType>& PartitionFieldSummary::Type() {
Contributor

Why do we need to change the return type from const T& to const std::shared_ptr<T>&?

Member Author

Now it is a singleton that can be reused.
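
A minimal sketch of the pattern being described, assuming the goal is to let callers share ownership of one schema object instead of copying it. `StructTypeSketch` and `SummaryType` are hypothetical stand-ins for the real `StructType` and `PartitionFieldSummary::Type()`; only the return-type shape is taken from the diff above.

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for StructType; only holds field names.
struct StructTypeSketch {
  std::vector<std::string> field_names;
};

// Function-local static: built once on first call (thread-safe since C++11)
// and handed out by reference, so callers can share ownership without copying.
const std::shared_ptr<StructTypeSketch>& SummaryType() {
  static const std::shared_ptr<StructTypeSketch> kInstance = [] {
    auto type = std::make_shared<StructTypeSketch>();
    type->field_names = {"contains_null", "contains_nan", "lower_bound",
                         "upper_bound"};
    return type;
  }();
  return kInstance;
}
```

Returning `const std::shared_ptr<T>&` rather than `const T&` lets a caller store the pointer in another object that expects shared ownership, at the cost of one reference-count increment.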



Status ParseLiteral(ArrowArrayView* view_of_partition, int64_t row_idx,
std::vector<ManifestEntry>& manifest_entries) {
if (view_of_partition->storage_type == ArrowType::NANOARROW_TYPE_BOOL) {
Contributor

Could we use a switch instead of a chain of if statements?
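
A small sketch of the switch-based dispatch suggested here. The enum and function are hypothetical: `StorageTypeSketch` stands in for the nanoarrow `ArrowType` values, and `LiteralKind` only returns a label instead of building a literal.

```cpp
#include <string>

// Hypothetical stand-in for the storage type enum used in ParseLiteral.
enum class StorageTypeSketch { kBool, kInt32, kInt64, kString, kBinary };

// One switch replaces a chain of `if (type == ...)` checks; with an enum
// class, most compilers can also warn when a case is missing.
std::string LiteralKind(StorageTypeSketch type) {
  switch (type) {
    case StorageTypeSketch::kBool:
      return "boolean";
    case StorageTypeSketch::kInt32:
    case StorageTypeSketch::kInt64:
      return "integer";
    case StorageTypeSketch::kString:
      return "string";
    case StorageTypeSketch::kBinary:
      return "binary";
  }
  return "unsupported";
}
```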

return manifest_entries;
}

const std::unordered_set<std::string> kStatsColumns = {
Contributor

I think std::vector<std::string> is enough.

Comment on lines 821 to 831
if (result.has_value()) {
internal::ArrowArrayGuard array_guard(&result.value());
ICEBERG_ASSIGN_OR_RAISE(
auto parse_result, ParseManifestList(&arrow_schema, &result.value(), *schema_));
manifest_files.insert(manifest_files.end(),
std::make_move_iterator(parse_result.begin()),
std::make_move_iterator(parse_result.end()));
} else {
// eof
break;
}
Contributor

Suggested change
- if (result.has_value()) {
-   internal::ArrowArrayGuard array_guard(&result.value());
-   ICEBERG_ASSIGN_OR_RAISE(
-       auto parse_result, ParseManifestList(&arrow_schema, &result.value(), *schema_));
-   manifest_files.insert(manifest_files.end(),
-                         std::make_move_iterator(parse_result.begin()),
-                         std::make_move_iterator(parse_result.end()));
- } else {
-   // eof
-   break;
- }
+ if (!result.has_value()) {
+   // eof
+   break;
+ }
+ internal::ArrowArrayGuard array_guard(&result.value());
+ ICEBERG_ASSIGN_OR_RAISE(
+     auto parse_result, ParseManifestList(&arrow_schema, &result.value(), *schema_));
+ manifest_files.insert(manifest_files.end(),
+                       std::make_move_iterator(parse_result.begin()),
+                       std::make_move_iterator(parse_result.end()));

I prefer writing it this way, with one less level of indentation.
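
The same guard-clause shape can be tried in isolation with the sketch below. `BatchSource` and `ReadAll` are hypothetical: the source hands out integer batches and signals end of input with `std::nullopt`, standing in for the Arrow batch reader in the reviewed code.

```cpp
#include <cstddef>
#include <iterator>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical batch source: yields each batch once, then std::nullopt at EOF.
class BatchSource {
 public:
  explicit BatchSource(std::vector<std::vector<int>> batches)
      : batches_(std::move(batches)) {}

  std::optional<std::vector<int>> Next() {
    if (pos_ >= batches_.size()) {
      return std::nullopt;
    }
    return std::move(batches_[pos_++]);
  }

 private:
  std::vector<std::vector<int>> batches_;
  std::size_t pos_ = 0;
};

// Early exit on EOF keeps the main path at a single indentation level.
std::vector<int> ReadAll(BatchSource& source) {
  std::vector<int> out;
  while (true) {
    auto batch = source.Next();
    if (!batch.has_value()) {
      break;  // EOF
    }
    out.insert(out.end(), std::make_move_iterator(batch->begin()),
               std::make_move_iterator(batch->end()));
  }
  return out;
}
```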

}

bool ManifestReaderImpl::HasPartitionFilter() const {
return part_filter_ && part_filter_->op() != Expression::Operation::kTrue;
Contributor

part_filter_ is never null here, so the first operand is always true.

Suggested change
- return part_filter_ && part_filter_->op() != Expression::Operation::kTrue;
+ return part_filter_->op() != Expression::Operation::kTrue;

}

bool ManifestReaderImpl::HasRowFilter() const {
return row_filter_ && row_filter_->op() != Expression::Operation::kTrue;
Contributor

Suggested change
- return row_filter_ && row_filter_->op() != Expression::Operation::kTrue;
+ return row_filter_->op() != Expression::Operation::kTrue;

/// present
std::optional<int64_t> content_size_in_bytes;

inline static constexpr int32_t kContentFieldId = 134;
Collaborator

nit: constexpr already implies inline for a static data member, so the inline could be removed. Not a strong opinion.

if (view_of_column->storage_type != ArrowType::NANOARROW_TYPE_LIST) {
return InvalidManifestList("partitions field should be a list.");
}
auto view_of_list_iterm = view_of_column->children[0];
Collaborator

Suggested change
- auto view_of_list_iterm = view_of_column->children[0];
+ auto view_of_list_item = view_of_column->children[0];

typo?

auto view_of_list_iterm = view_of_column->children[0];
// view_of_list_iterm is struct<PartitionFieldSummary>
if (view_of_list_iterm->storage_type != ArrowType::NANOARROW_TYPE_STRUCT) {
return InvalidManifestList("partitions list field should be a list.");
Collaborator

Suggested change
- return InvalidManifestList("partitions list field should be a list.");
+ return InvalidManifestList("partitions list item should be a struct.");

return manifest_files;
}

Status ParseLiteral(ArrowArrayView* view_of_partition, int64_t row_idx,
Collaborator

Should we rename this to ParsePartitionValues? The current name is a little confusing to me.

class ICEBERG_EXPORT ManifestReader {
public:
/// \brief Special value to select all columns from manifest files.
inline static const std::vector<std::string> kAllColumns{"*"};
Collaborator

Why is this a vector? A single string should be enough.

first_row_id_(first_row_id) {}

ManifestReader& ManifestReaderImpl::Select(const std::vector<std::string>& columns) {
columns_ = columns;
Contributor

I think Java maps user-requested names to immutable Field IDs from the manifest's actual schema.

Contributor

Can we do something like this?

ManifestReader& ManifestReaderImpl::Select(const std::vector<std::string>& columns) {
  // Validate columns exist in data file schema before storing
  for (const auto& col_name : columns) {
    if (col_name != kAllColumns) {
      ICEBERG_CHECK(DataFile::GetSchemaByName(col_name, case_sensitive_).has_value(),
                    "Column '{}' not found in data file schema", col_name);
    }
  }
  columns_ = columns;
  return *this;
}

}

ManifestReader& ManifestReaderImpl::FilterPartitions(std::shared_ptr<Expression> expr) {
part_filter_ = expr ? std::move(expr) : True::Instance();
Contributor

}

ManifestReader& ManifestReaderImpl::FilterRows(std::shared_ptr<Expression> expr) {
row_filter_ = expr ? std::move(expr) : True::Instance();
Contributor

Compared with the Java implementation, this is missing the AND combination with the existing filter.
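
A rough sketch of what AND-combining could look like, assuming repeated FilterRows calls should narrow the filter rather than replace it. Everything here is hypothetical: `ExprSketch`, `AlwaysTrue`, `AndSketch`, and `FilterSketch` are simplified stand-ins, not the iceberg-cpp `Expression`, `True`, or `ManifestReaderImpl` classes.

```cpp
#include <memory>
#include <utility>

// Hypothetical, heavily simplified expression model.
struct ExprSketch {
  enum class Op { kTrue, kPredicate, kAnd };
  Op op;
  std::shared_ptr<ExprSketch> left;
  std::shared_ptr<ExprSketch> right;
};
using ExprPtr = std::shared_ptr<ExprSketch>;

// Shared always-true expression, used as the default filter.
ExprPtr AlwaysTrue() {
  static const ExprPtr kTrue = std::make_shared<ExprSketch>(
      ExprSketch{ExprSketch::Op::kTrue, nullptr, nullptr});
  return kTrue;
}

// AND that drops always-true operands, so true AND x simplifies to x.
ExprPtr AndSketch(ExprPtr lhs, ExprPtr rhs) {
  if (lhs->op == ExprSketch::Op::kTrue) return rhs;
  if (rhs->op == ExprSketch::Op::kTrue) return lhs;
  return std::make_shared<ExprSketch>(
      ExprSketch{ExprSketch::Op::kAnd, std::move(lhs), std::move(rhs)});
}

class FilterSketch {
 public:
  // A null argument is treated as always-true; each call narrows the filter.
  FilterSketch& FilterRows(ExprPtr expr) {
    row_filter_ = AndSketch(row_filter_, expr ? std::move(expr) : AlwaysTrue());
    return *this;
  }

  // row_filter_ is never null, so no null check is needed here.
  bool HasRowFilter() const { return row_filter_->op != ExprSketch::Op::kTrue; }
  const ExprPtr& row_filter() const { return row_filter_; }

 private:
  ExprPtr row_filter_ = AlwaysTrue();
};
```

Because the member starts as the always-true expression and is only ever reassigned to a non-null value, the same invariant also justifies removing the null checks suggested for HasPartitionFilter and HasRowFilter.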

if (!row_filter || row_filter->op() == Expression::Operation::kTrue) {
return false;
}
if (columns.empty()) {
Contributor

Should this function return true when stats columns are needed for filtering but are NOT all selected? Currently it returns true when stats are NOT all selected, which is backwards. This will cause unnecessary stats projection when stats columns ARE already selected.

Something like below:

  bool RequireStatsProjection(const std::shared_ptr<Expression>& row_filter,
                              const std::vector<std::string>& columns) {
    if (!row_filter || row_filter->op() == Expression::Operation::kTrue) {
      return false;  // No row filter, no stats needed
    }
    if (columns.empty() || std::ranges::contains(columns, ManifestReader::kAllColumns)) {
      return false;  // All columns selected, stats already included
    }
    // Return true if ANY stats column is missing from selection
    const std::unordered_set<std::string> selected(columns.cbegin(), columns.cend());
    return !std::ranges::all_of(kStatsColumns, [&selected](const std::string& col) {
      return selected.contains(col);
    });
  }
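
To make the intended truth table easy to check, the proposal above can be reduced to a self-contained sketch. The assumptions are significant: the row filter is collapsed to a boolean flag, and `kStatsColumnsSketch` is an illustrative three-column list, since the actual contents of kStatsColumns are not shown in this conversation.

```cpp
#include <algorithm>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-ins; the real kStatsColumns contents are not shown here.
const std::string kAllColumnsSketch = "*";
const std::vector<std::string> kStatsColumnsSketch = {
    "value_counts", "lower_bounds", "upper_bounds"};

// True only when a row filter exists and at least one stats column
// is missing from an explicit (non-wildcard) selection.
bool RequireStatsProjectionSketch(bool has_row_filter,
                                  const std::vector<std::string>& columns) {
  if (!has_row_filter) {
    return false;  // No row filter, so stats are not needed.
  }
  if (columns.empty() ||
      std::find(columns.begin(), columns.end(), kAllColumnsSketch) !=
          columns.end()) {
    return false;  // Everything is selected, so stats are already included.
  }
  const std::unordered_set<std::string> selected(columns.begin(), columns.end());
  return !std::all_of(
      kStatsColumnsSketch.begin(), kStatsColumnsSketch.end(),
      [&selected](const std::string& col) { return selected.count(col) > 0; });
}
```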

}

Result<Evaluator*> ManifestReaderImpl::GetEvaluator() {
if (!evaluator_) {
Contributor

Add null checks for row_filter_ and part_filter_ before use?

class ICEBERG_EXPORT ManifestReader {
public:
/// \brief Special value to select all columns from manifest files.
inline static const std::string kAllColumns = "*";
Contributor

Consider using constexpr std::string_view for better performance?
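
A minimal sketch of the `constexpr std::string_view` variant, assuming C++17 or later. `ManifestReaderSketch` and `IsAllColumns` are hypothetical names. The likely benefit is that the constant needs no dynamic initialization or heap allocation and can be used in constant expressions; whether that is measurable in practice is not established in this thread.

```cpp
#include <string>
#include <string_view>

class ManifestReaderSketch {
 public:
  // A static constexpr data member is implicitly inline since C++17,
  // so no separate `inline` keyword or out-of-class definition is needed.
  static constexpr std::string_view kAllColumns = "*";
};

// The constant is usable at compile time.
static_assert(ManifestReaderSketch::kAllColumns == "*",
              "kAllColumns is a compile-time constant");

// std::string compares directly against std::string_view.
bool IsAllColumns(const std::string& column) {
  return column == ManifestReaderSketch::kAllColumns;
}
```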

5 participants