-
Notifications
You must be signed in to change notification settings - Fork 0
8817: Fix bug in handling of empty Parquet page index structures #18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -113,6 +113,10 @@ impl<'a, W: Write> ThriftMetadataWriter<'a, W> { | |
| for (row_group_idx, row_group) in self.row_groups.iter_mut().enumerate() { | ||
| for (column_idx, column_metadata) in row_group.columns.iter_mut().enumerate() { | ||
| if let Some(column_index) = &column_indexes[row_group_idx][column_idx] { | ||
| // Missing indexes may also have the placeholder ColumnIndexMetaData::NONE | ||
| if matches!(column_index, ColumnIndexMetaData::NONE) { | ||
| continue; | ||
| } | ||
| let start_offset = self.buf.bytes_written(); | ||
| self.object_writer.write_column_index( | ||
| column_index, | ||
|
|
@@ -227,22 +231,38 @@ impl<'a, W: Write> ThriftMetadataWriter<'a, W> { | |
| None => builder.set_row_groups(row_groups), | ||
| }; | ||
|
|
||
| let column_indexes: Option<ParquetColumnIndex> = column_indexes.map(|ovvi| { | ||
| ovvi.into_iter() | ||
| .map(|vi| { | ||
| vi.into_iter() | ||
| .map(|oi| oi.unwrap_or(ColumnIndexMetaData::NONE)) | ||
| .collect() | ||
| }) | ||
| .collect() | ||
| }); | ||
|
|
||
| // FIXME(ets): this will panic if there's a missing index. | ||
| let offset_indexes: Option<ParquetOffsetIndex> = offset_indexes.map(|ovvi| { | ||
| ovvi.into_iter() | ||
| .map(|vi| vi.into_iter().map(|oi| oi.unwrap()).collect()) | ||
| .collect() | ||
| }); | ||
| // test to see if all indexes for this file are empty | ||
| let all_none = column_indexes | ||
| .as_ref() | ||
| .is_some_and(|ci| ci.iter().all(|cii| cii.iter().all(|idx| idx.is_none()))); | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This 🤖 Was this useful? React with 👍 or 👎
Owner
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. value:incorrect-but-reasonable; category:bug; feedback:The Augment AI reviewer is not correct. Here the author wants to check whether all the items are just Option::None. |
||
| let column_indexes: Option<ParquetColumnIndex> = if all_none { | ||
| None | ||
| } else { | ||
| column_indexes.map(|ovvi| { | ||
| ovvi.into_iter() | ||
| .map(|vi| { | ||
| vi.into_iter() | ||
| .map(|oi| oi.unwrap_or(ColumnIndexMetaData::NONE)) | ||
| .collect() | ||
| }) | ||
| .collect() | ||
| }) | ||
| }; | ||
|
|
||
| // test to see if all indexes for this file are empty | ||
| let all_none = offset_indexes | ||
| .as_ref() | ||
| .is_some_and(|oi| oi.iter().all(|oii| oii.iter().all(|idx| idx.is_none()))); | ||
| let offset_indexes: Option<ParquetOffsetIndex> = if all_none { | ||
| None | ||
| } else { | ||
| // FIXME(ets): this will panic if there's a missing index. | ||
| offset_indexes.map(|ovvi| { | ||
| ovvi.into_iter() | ||
| .map(|vi| vi.into_iter().map(|oi| oi.unwrap()).collect()) | ||
| .collect() | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Bug: Inconsistent Index Handling Leads to PanicsThe offset index conversion at line 262 calls
Owner
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. value:useful; category:bug; feedback:The Bugbot Ai reviewer is correct that the call to |
||
| }) | ||
| }; | ||
|
Comment on lines
+234
to
+265
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Critical: Incomplete fix for offset indexes - can still panic. The column indexes are handled safely (lines 234-250) using Issue: The Impact: This contradicts the PR objective to "safely skip empty column and offset indexes, preventing panics when indexes are missing." The second test ( Consider one of these approaches: Option 1: Add a let offset_indexes: Option<ParquetOffsetIndex> = if all_none {
None
} else {
- // FIXME(ets): this will panic if there's a missing index.
offset_indexes.map(|ovvi| {
ovvi.into_iter()
- .map(|vi| vi.into_iter().map(|oi| oi.unwrap()).collect())
+ .map(|vi| vi.into_iter().map(|oi| oi.unwrap_or(OffsetIndexMetaData::NONE)).collect())
.collect()
})
};Option 2: Filter out None entries and check if the result is empty: let offset_indexes: Option<ParquetOffsetIndex> = offset_indexes.and_then(|ovvi| {
- let all_none = ovvi.iter().all(|oii| oii.iter().all(|idx| idx.is_none()));
- if all_none {
- None
- } else {
- Some(ovvi.into_iter()
- .map(|vi| vi.into_iter().filter_map(|oi| oi).collect())
+ let result: Vec<Vec<OffsetIndexMetaData>> = ovvi.into_iter()
+ .map(|vi| vi.into_iter().filter_map(|oi| oi).collect())
+ .collect();
+ if result.iter().all(|vi| vi.is_empty()) {
+ None
+ } else {
+ Some(result)
- .collect())
}
-};
+});
🤖 Prompt for AI Agents
Owner
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. value:useful; category:bug; feedback:The Bugbot Ai reviewer is correct that the call to Option::unwrap() may lead to a panic. This is acknowledged by the PR author in a comment at line 259. The PR should not be merged until this issue is addressed. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 🧩 Analysis chain🏁 Script executed: Length of output: 9824 @martin-augment, thank you for confirming the issue. I've investigated the code and found that The safest fix is to check if any offset index entry is // test to see if all indexes for this file are empty OR if any are missing
let has_none = offset_indexes
.as_ref()
.is_some_and(|oi| oi.iter().any(|oii| oii.iter().any(|idx| idx.is_none())));
let offset_indexes: Option<ParquetOffsetIndex> = if has_none {
None
} else {
offset_indexes.map(|ovvi| {
ovvi.into_iter()
.map(|vi| vi.into_iter().map(|oi| oi.unwrap()).collect())
.collect()
})
};Key changes:
This approach ensures that:
Would you like me to help implement this fix or explore alternative approaches? 🧬 Code Graph Analysis Results[relevant_snippet] Filename: parquet/src/file/metadata/writer.rs Snippet: |
||
|
|
||
| builder = builder.set_column_index(column_indexes); | ||
| builder = builder.set_offset_index(offset_indexes); | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -1068,6 +1068,7 @@ mod tests { | |
| use crate::schema::parser::parse_message_type; | ||
| use crate::schema::types; | ||
| use crate::schema::types::{ColumnDescriptor, ColumnPath}; | ||
| use crate::util::test_common::file_util::get_test_file; | ||
| use crate::util::test_common::rand_gen::RandGen; | ||
|
|
||
| #[test] | ||
|
|
@@ -2442,4 +2443,74 @@ mod tests { | |
| start += 1; | ||
| } | ||
| } | ||
|
|
||
| #[test] | ||
| fn test_rewrite_no_page_indexes() { | ||
| let file = get_test_file("alltypes_tiny_pages.parquet"); | ||
| let metadata = ParquetMetaDataReader::new() | ||
| .with_page_index_policy(PageIndexPolicy::Optional) | ||
| .parse_and_finish(&file) | ||
| .unwrap(); | ||
|
|
||
| let props = Arc::new(WriterProperties::builder().build()); | ||
| let schema = metadata.file_metadata().schema_descr().root_schema_ptr(); | ||
| let output = Vec::<u8>::new(); | ||
| let mut writer = SerializedFileWriter::new(output, schema, props).unwrap(); | ||
|
|
||
| for rg in metadata.row_groups() { | ||
| let mut rg_out = writer.next_row_group().unwrap(); | ||
| for column in rg.columns() { | ||
| let result = ColumnCloseResult { | ||
| bytes_written: column.compressed_size() as _, | ||
| rows_written: rg.num_rows() as _, | ||
| metadata: column.clone(), | ||
| bloom_filter: None, | ||
| column_index: None, | ||
| offset_index: None, | ||
| }; | ||
| rg_out.append_column(&file, result).unwrap(); | ||
| } | ||
| rg_out.close().unwrap(); | ||
| } | ||
| writer.close().unwrap(); | ||
| } | ||
|
|
||
| #[test] | ||
| fn test_rewrite_missing_column_index() { | ||
| // this file has an INT96 column that lacks a column index entry | ||
| let file = get_test_file("alltypes_tiny_pages.parquet"); | ||
| let metadata = ParquetMetaDataReader::new() | ||
| .with_page_index_policy(PageIndexPolicy::Optional) | ||
| .parse_and_finish(&file) | ||
| .unwrap(); | ||
|
|
||
| let props = Arc::new(WriterProperties::builder().build()); | ||
| let schema = metadata.file_metadata().schema_descr().root_schema_ptr(); | ||
| let output = Vec::<u8>::new(); | ||
| let mut writer = SerializedFileWriter::new(output, schema, props).unwrap(); | ||
|
|
||
| let column_indexes = metadata.column_index(); | ||
| let offset_indexes = metadata.offset_index(); | ||
|
|
||
| for (rg_idx, rg) in metadata.row_groups().iter().enumerate() { | ||
| let rg_column_indexes = column_indexes.and_then(|ci| ci.get(rg_idx)); | ||
| let rg_offset_indexes = offset_indexes.and_then(|oi| oi.get(rg_idx)); | ||
| let mut rg_out = writer.next_row_group().unwrap(); | ||
| for (col_idx, column) in rg.columns().iter().enumerate() { | ||
| let column_index = rg_column_indexes.and_then(|row| row.get(col_idx)).cloned(); | ||
| let offset_index = rg_offset_indexes.and_then(|row| row.get(col_idx)).cloned(); | ||
| let result = ColumnCloseResult { | ||
| bytes_written: column.compressed_size() as _, | ||
| rows_written: rg.num_rows() as _, | ||
| metadata: column.clone(), | ||
| bloom_filter: None, | ||
| column_index, | ||
| offset_index, | ||
| }; | ||
| rg_out.append_column(&file, result).unwrap(); | ||
| } | ||
| rg_out.close().unwrap(); | ||
| } | ||
| writer.close().unwrap(); | ||
| } | ||
|
Comment on lines
+2478
to
+2515
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 🧩 Analysis chainVerify: Test may expose offset index panic. This test exercises the case where individual columns may have missing indexes (per the comment on line 2480: "this file has an INT96 column that lacks a column index entry"). Lines 2500-2501 retrieve Concern: If the test file also has missing Run this test in isolation to confirm it doesn't panic: If the test passes, it suggests offset indexes are always present in the test file (even when column indexes are missing). If it panics, it confirms the critical issue in 🏁 Script executed: #!/bin/bash
# Verify the test doesn't hit the offset_index unwrap panic
cd parquet
cargo test test_rewrite_missing_column_index -- --exact --nocaptureLength of output: 109 I'll use static analysis to verify the offset_index issue since the test cannot be run in this environment. 🏁 Script executed: #!/bin/bash
# Search for offset_index unwrap patterns and finish() method
rg -n "finish|offset_index.*unwrap" parquet/src/file/writer.rs | head -20Length of output: 978 🏁 Script executed: #!/bin/bash
# Look at the finish() method implementation
rg -n "pub fn finish" parquet/src/file/writer.rs -A 20Length of output: 795 🏁 Script executed: #!/bin/bash
# Search for OffsetIndexBuilder in metadata/writer.rs - the file mentioned in original review
rg -n "unwrap\|finish" parquet/src/file/metadata/writer.rs | head -30Length of output: 0 🏁 Script executed: #!/bin/bash
# Check ColumnCloseResult definition and append_column usage
rg -n "struct ColumnCloseResult|fn append_column" parquet/src/file/writer.rs -A 15 | head -50Length of output: 623 🏁 Script executed: #!/bin/bash
# Find ColumnCloseResult struct definition
rg -n "struct ColumnCloseResult" parquet/src/file/writer.rs -A 10Length of output: 0 🏁 Script executed: #!/bin/bash
# Search for write_metadata implementation
rg -n "fn write_metadata" parquet/src/file/writer.rs -A 30 | head -50Length of output: 1341 🏁 Script executed: #!/bin/bash
# List files in parquet/src/file/metadata/
fd . parquet/src/file/metadata/Length of output: 380 🏁 Script executed: #!/bin/bash
# Broadly search for offset_index().unwrap() anywhere in parquet source
rg "offset_index.*\.unwrap" parquet/src/ -nLength of output: 2363 🏁 Script executed: #!/bin/bash
# Look at parquet/src/file/metadata/writer.rs around line 262
sed -n '250,280p' parquet/src/file/metadata/writer.rsLength of output: 997 🏁 Script executed: #!/bin/bash
# Search for ColumnCloseResult definition across all parquet source
rg -n "ColumnCloseResult" parquet/src/ -B 2 -A 8 | head -60Length of output: 3938 🏁 Script executed: #!/bin/bash
# Look at the offset_index unwrap in column/writer/mod.rs
sed -n '2995,3010p' parquet/src/column/writer/mod.rsLength of output: 538 🏁 Script executed: #!/bin/bash
# Search for OffsetIndexBuilder to understand how offset_index is built
rg -n "OffsetIndexBuilder|offset_indexes.push" parquet/src/file/writer.rs -B 2 -A 5 | head -50Length of output: 894 🏁 Script executed: #!/bin/bash
# Check if there's been any fix to the FIXME about unwrap panicking
rg -n "FIXME.*missing.*index|panic.*missing" parquet/src/file/metadata/writer.rsLength of output: 74 🏁 Script executed: #!/bin/bash
# Understand the test better - see what metadata the test loads
sed -n '2478,2495p' parquet/src/file/writer.rsLength of output: 821 Clarify: This test likely doesn't expose the offset_index panic, but the underlying critical bug exists. The test correctly handles potentially-missing column and offset indexes. However, the concern is probably not triggered by The real critical issue flagged by this review exists elsewhere: 🤖 Prompt for AI Agents
Owner
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. value:useful; category:bug; feedback:The CodeRabbit Ai reviewer is correct that the call to Option::unwrap() may lead to a panic. This is acknowledged by the PR author in a comment at line 259. The PR should not be merged until this issue is addressed. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 🧩 Analysis chain🏁 Script executed: Length of output: 997 🏁 Script executed: Length of output: 3424 🏁 Script executed: Length of output: 3119 @martin-augment, thank you for confirming this critical issue. You're absolutely right that the PR should not be merged with this panic risk. Looking at the code in The Issue: let offset_indexes: Option<ParquetOffsetIndex> = if all_none {
None
} else {
// FIXME(ets): this will panic if there's a missing index.
offset_indexes.map(|ovvi| {
ovvi.into_iter()
.map(|vi| vi.into_iter().map(|oi| oi.unwrap()).collect()) // ⚠️ PANICS on None
.collect()
})
};The Suggested Fix: let offset_indexes: Option<ParquetOffsetIndex> = if all_none {
None
} else {
offset_indexes.map(|ovvi| {
ovvi.into_iter()
.map(|vi| {
vi.into_iter()
.map(|oi| oi.unwrap_or(OffsetIndexMetaData::NONE)) // Safe handling
.collect()
})
.collect()
})
};You should also update the Recommendation: Create a GitHub issue to track this fix explicitly, or update the PR description to note this is a work-in-progress that addresses the issue partially but requires the unwrap fix before merge. |
||
| } | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
column_indexhere is a&ColumnIndexMetaData, so this pattern won’t match as written and thematches!check will never be true, preventing skipping of placeholderColumnIndexMetaData::NONE. This leads to attempting to write placeholder indexes instead of continuing.🤖 Was this useful? React with 👍 or 👎
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
value:incorrect-but-reasonable; category:bug; feedback:The Augment AI reviewer is not correct. If the type implements the Copy trait then
matches!()macro can compare the items without making sure that both of them are references.