52 changes: 36 additions & 16 deletions parquet/src/file/metadata/writer.rs
@@ -113,6 +113,10 @@ impl<'a, W: Write> ThriftMetadataWriter<'a, W> {
for (row_group_idx, row_group) in self.row_groups.iter_mut().enumerate() {
for (column_idx, column_metadata) in row_group.columns.iter_mut().enumerate() {
if let Some(column_index) = &column_indexes[row_group_idx][column_idx] {
// Missing indexes may also have the placeholder ColumnIndexMetaData::NONE
if matches!(column_index, ColumnIndexMetaData::NONE) {
continue;
Reviewer comment (Augment AI):

column_index here is a &ColumnIndexMetaData, so this pattern won’t match as written and the matches! check will never be true, preventing skipping of placeholder ColumnIndexMetaData::NONE. This leads to attempting to write placeholder indexes instead of continuing.

Owner Author:

value:incorrect-but-reasonable; category:bug; feedback:The Augment AI reviewer is not correct. If the type implements the Copy trait, the matches!() macro can compare the items without requiring both of them to be references.
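A standalone sketch of the disputed behavior, using a hypothetical two-variant enum in place of the real ColumnIndexMetaData: matching a &Enum value against a unit-variant pattern does compile and match as written, because Rust's match ergonomics let the pattern apply through the reference.

```rust
// Hypothetical stand-in for the parquet crate's ColumnIndexMetaData enum.
#[allow(non_camel_case_types)]
enum ColumnIndexMetaData {
    NONE,
    Boolean(bool),
}

// matches! on a &ColumnIndexMetaData compiles and matches the unit
// variant: match ergonomics dereference the reference automatically.
fn is_placeholder(idx: &ColumnIndexMetaData) -> bool {
    matches!(idx, ColumnIndexMetaData::NONE)
}

fn main() {
    assert!(is_placeholder(&ColumnIndexMetaData::NONE));
    assert!(!is_placeholder(&ColumnIndexMetaData::Boolean(true)));
    println!("matches! sees the placeholder through a reference");
}
```

This supports the owner's conclusion that the `matches!` check in the diff is not dead code: the pattern matches through `&ColumnIndexMetaData` without any explicit dereference.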

}
let start_offset = self.buf.bytes_written();
self.object_writer.write_column_index(
column_index,
@@ -227,22 +231,38 @@ impl<'a, W: Write> ThriftMetadataWriter<'a, W> {
None => builder.set_row_groups(row_groups),
};

let column_indexes: Option<ParquetColumnIndex> = column_indexes.map(|ovvi| {
ovvi.into_iter()
.map(|vi| {
vi.into_iter()
.map(|oi| oi.unwrap_or(ColumnIndexMetaData::NONE))
.collect()
})
.collect()
});

// FIXME(ets): this will panic if there's a missing index.
let offset_indexes: Option<ParquetOffsetIndex> = offset_indexes.map(|ovvi| {
ovvi.into_iter()
.map(|vi| vi.into_iter().map(|oi| oi.unwrap()).collect())
.collect()
});
// test to see if all indexes for this file are empty
let all_none = column_indexes
.as_ref()
.is_some_and(|ci| ci.iter().all(|cii| cii.iter().all(|idx| idx.is_none())));
Reviewer comment (Augment AI):

This all_none calculation only considers None entries and ignores Some(ColumnIndexMetaData::NONE) placeholders, which are treated as missing above. If all entries are placeholders, this will incorrectly leave column_index as Some instead of None (also applies to other locations in the PR).

Owner Author:

value:incorrect-but-reasonable; category:bug; feedback:The Augment AI reviewer is not correct. Here the author wants to check whether all the items are just Option::None.
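A minimal sketch of the distinction the owner draws, with u32 standing in for ColumnIndexMetaData (a hypothetical simplification): the all_none guard detects entries that are plain Option::None, while an entry wrapped in Some(...) counts as present even if it holds a placeholder value.

```rust
// all_none mirrors the guard in the diff: true only when every inner
// entry is Option::None.
fn all_none(indexes: &Option<Vec<Vec<Option<u32>>>>) -> bool {
    indexes
        .as_ref()
        .is_some_and(|ci| ci.iter().all(|cii| cii.iter().all(|idx| idx.is_none())))
}

fn main() {
    // Every entry is Option::None, so the guard trips.
    let empty: Option<Vec<Vec<Option<u32>>>> = Some(vec![vec![None, None], vec![None]]);
    assert!(all_none(&empty));

    // A Some(...) entry is counted as present, even if it is a
    // placeholder value; that is the case the guard deliberately ignores.
    let mixed: Option<Vec<Vec<Option<u32>>>> = Some(vec![vec![None, Some(0)]]);
    assert!(!all_none(&mixed));

    println!("all_none distinguishes Option::None from Some(placeholder)");
}
```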

let column_indexes: Option<ParquetColumnIndex> = if all_none {
None
} else {
column_indexes.map(|ovvi| {
ovvi.into_iter()
.map(|vi| {
vi.into_iter()
.map(|oi| oi.unwrap_or(ColumnIndexMetaData::NONE))
.collect()
})
.collect()
})
};

// test to see if all indexes for this file are empty
let all_none = offset_indexes
.as_ref()
.is_some_and(|oi| oi.iter().all(|oii| oii.iter().all(|idx| idx.is_none())));
let offset_indexes: Option<ParquetOffsetIndex> = if all_none {
None
} else {
// FIXME(ets): this will panic if there's a missing index.
offset_indexes.map(|ovvi| {
ovvi.into_iter()
.map(|vi| vi.into_iter().map(|oi| oi.unwrap()).collect())
.collect()
Reviewer comment (Cursor Bugbot):

Bug: Inconsistent Index Handling Leads to Panics

The offset index conversion at line 262 calls .unwrap() on Option values, which will panic if any offset index is None. The all_none check only prevents panics when all indexes are None, but doesn't handle cases where some indexes exist and others are None. This is inconsistent with the column index handling above (lines 234-250) which uses .unwrap_or(ColumnIndexMetaData::NONE) to safely handle partial None values.

Owner Author (@martin-augment, Nov 11, 2025):

value:useful; category:bug; feedback:The Bugbot AI reviewer is correct that the call to Option::unwrap() may lead to a panic. This is acknowledged by the PR author in a comment at line 259. The PR should not be merged until this issue is addressed.

})
};
Comment on lines +234 to +265
Reviewer comment (CodeRabbit AI, Nov 11, 2025):

⚠️ Potential issue | 🔴 Critical

Critical: Incomplete fix for offset indexes - can still panic.

The column indexes are handled safely (lines 234-250) using unwrap_or(ColumnIndexMetaData::NONE), but offset indexes (lines 252-265) still use .unwrap() on line 262, which will panic if individual offset indexes are None. The FIXME comment on line 259 acknowledges this risk but the issue remains unfixed.

Issue: The all_none check only prevents panics when all offset indexes are None. If there's a mix of Some and None entries (e.g., when rewriting files with missing offset indexes for specific columns), line 262 will panic.

Impact: This contradicts the PR objective to "safely skip empty column and offset indexes, preventing panics when indexes are missing." The second test (test_rewrite_missing_column_index) exercises this scenario and may panic if the test file has missing offset indexes.

Consider one of these approaches:

Option 1: Add a NONE variant to OffsetIndexMetaData (if feasible):

 let offset_indexes: Option<ParquetOffsetIndex> = if all_none {
     None
 } else {
-    // FIXME(ets): this will panic if there's a missing index.
     offset_indexes.map(|ovvi| {
         ovvi.into_iter()
-            .map(|vi| vi.into_iter().map(|oi| oi.unwrap()).collect())
+            .map(|vi| vi.into_iter().map(|oi| oi.unwrap_or(OffsetIndexMetaData::NONE)).collect())
             .collect()
     })
 };

Option 2: Filter out None entries and check if the result is empty:

 let offset_indexes: Option<ParquetOffsetIndex> = offset_indexes.and_then(|ovvi| {
-    let all_none = ovvi.iter().all(|oii| oii.iter().all(|idx| idx.is_none()));
-    if all_none {
-        None
-    } else {
-        Some(ovvi.into_iter()
-            .map(|vi| vi.into_iter().filter_map(|oi| oi).collect())
+    let result: Vec<Vec<OffsetIndexMetaData>> = ovvi.into_iter()
+        .map(|vi| vi.into_iter().filter_map(|oi| oi).collect())
+        .collect();
+    if result.iter().all(|vi| vi.is_empty()) {
+        None
+    } else {
+        Some(result)
-            .collect())
     }
-};
+});

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In parquet/src/file/metadata/writer.rs around lines 234-265, the offset_indexes
mapping still calls .unwrap() (line ~262) which will panic when some inner
entries are None; change the mapping to mirror the safe handling used for
column_indexes: replace the .unwrap() usage by either (preferred) using an
OffsetIndexMetaData::NONE fallback (i.e.,
oi.unwrap_or(OffsetIndexMetaData::NONE)) if you can add/use a NONE sentinel, or
(alternatively) use filter_map to drop None entries for each inner vector and
then treat an entirely-empty inner vector as absent (turning the outer result
into None) so no unwrap() is used and mixed Some/None cases no longer panic.

Owner Author:

value:useful; category:bug; feedback:The Bugbot AI reviewer is correct that the call to Option::unwrap() may lead to a panic. This is acknowledged by the PR author in a comment at line 259. The PR should not be merged until this issue is addressed.

Reviewer comment (CodeRabbit AI):

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Search for OffsetIndexMetaData definition to check if it has a NONE variant
echo "=== Searching for OffsetIndexMetaData definition ==="
rg -nP "(?:enum|struct)\s+OffsetIndexMetaData" --type rust -A 20

echo -e "\n=== Searching for OffsetIndexMetaData::NONE usage ==="
rg -n "OffsetIndexMetaData::NONE" --type rust

echo -e "\n=== Searching for ColumnIndexMetaData definition and NONE variant ==="
rg -nP "(?:enum|struct)\s+ColumnIndexMetaData" --type rust -A 20

echo -e "\n=== Searching for ColumnIndexMetaData::NONE usage ==="
rg -n "ColumnIndexMetaData::NONE" --type rust -C 2

Length of output: 9824


@martin-augment, thank you for confirming the issue. I've investigated the code and found that OffsetIndexMetaData is a struct without a NONE variant, unlike ColumnIndexMetaData which is an enum with a NONE variant. This means we cannot use the same unwrap_or(OffsetIndexMetaData::NONE) pattern.

The safest fix is to check if any offset index entry is None (not just all of them), and if so, skip writing offset indexes entirely. This prevents panics while maintaining data integrity:

// test to see if all indexes for this file are empty OR if any are missing
let has_none = offset_indexes
    .as_ref()
    .is_some_and(|oi| oi.iter().any(|oii| oii.iter().any(|idx| idx.is_none())));
let offset_indexes: Option<ParquetOffsetIndex> = if has_none {
    None
} else {
    offset_indexes.map(|ovvi| {
        ovvi.into_iter()
            .map(|vi| vi.into_iter().map(|oi| oi.unwrap()).collect())
            .collect()
    })
};

Key changes:

  • Replace all(...) with any(...) to detect ANY missing entries
  • Remove the FIXME comment since the unwrap is now safe (only called when all entries are Some)
  • Variable renamed to has_none for clarity

This approach ensures that:

  1. ✅ No panics occur when some (but not all) offset indexes are missing
  2. ✅ Maintains the "all or nothing" semantic for offset indexes
  3. ✅ Aligns with the PR objective to safely handle missing indexes

Would you like me to help implement this fix or explore alternative approaches?
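The all(...) vs any(...) distinction above can be sketched in isolation, with u32 standing in for OffsetIndexMetaData (a hypothetical simplification): on a mix of Some and None entries the existing all_none guard stays false, so the original path would still reach unwrap() and panic, while the proposed has_none guard trips and skips writing.

```rust
// Existing guard: trips only when every entry is missing.
fn all_none(indexes: &Option<Vec<Vec<Option<u32>>>>) -> bool {
    indexes
        .as_ref()
        .is_some_and(|oi| oi.iter().all(|oii| oii.iter().all(|idx| idx.is_none())))
}

// Proposed guard: trips when any entry is missing, making the later
// unwrap() unreachable in the mixed case.
fn has_none(indexes: &Option<Vec<Vec<Option<u32>>>>) -> bool {
    indexes
        .as_ref()
        .is_some_and(|oi| oi.iter().any(|oii| oii.iter().any(|idx| idx.is_none())))
}

fn main() {
    let mixed: Option<Vec<Vec<Option<u32>>>> = Some(vec![vec![Some(7), None]]);
    assert!(!all_none(&mixed)); // original guard does not trip...
    assert!(has_none(&mixed)); // ...but the proposed guard does

    let complete: Option<Vec<Vec<Option<u32>>>> = Some(vec![vec![Some(7), Some(8)]]);
    assert!(!has_none(&complete)); // only here is unwrap() actually safe
    println!("any(...) detects partially missing offset indexes");
}
```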


🧬 Code Graph Analysis Results

[relevant_snippet]

Filename: parquet/src/file/metadata/writer.rs
Line(s): (finish method near the FIXME comment)

Snippet:
// test to see if all indexes for this file are empty
let all_none = column_indexes
.as_ref()
.is_some_and(|ci| ci.iter().all(|cii| cii.iter().all(|idx| idx.is_none())));
let offset_indexes: Option<ParquetOffsetIndex> = if all_none {
None
} else {
// FIXME(ets): this will panic if there's a missing index.
offset_indexes.map(|ovvi| {
ovvi.into_iter()
.map(|vi| vi.into_iter().map(|oi| oi.unwrap()).collect())
.collect()
})
};


builder = builder.set_column_index(column_indexes);
builder = builder.set_offset_index(offset_indexes);
71 changes: 71 additions & 0 deletions parquet/src/file/writer.rs
@@ -1068,6 +1068,7 @@ mod tests {
use crate::schema::parser::parse_message_type;
use crate::schema::types;
use crate::schema::types::{ColumnDescriptor, ColumnPath};
use crate::util::test_common::file_util::get_test_file;
use crate::util::test_common::rand_gen::RandGen;

#[test]
@@ -2442,4 +2443,74 @@ mod tests {
start += 1;
}
}

#[test]
fn test_rewrite_no_page_indexes() {
let file = get_test_file("alltypes_tiny_pages.parquet");
let metadata = ParquetMetaDataReader::new()
.with_page_index_policy(PageIndexPolicy::Optional)
.parse_and_finish(&file)
.unwrap();

let props = Arc::new(WriterProperties::builder().build());
let schema = metadata.file_metadata().schema_descr().root_schema_ptr();
let output = Vec::<u8>::new();
let mut writer = SerializedFileWriter::new(output, schema, props).unwrap();

for rg in metadata.row_groups() {
let mut rg_out = writer.next_row_group().unwrap();
for column in rg.columns() {
let result = ColumnCloseResult {
bytes_written: column.compressed_size() as _,
rows_written: rg.num_rows() as _,
metadata: column.clone(),
bloom_filter: None,
column_index: None,
offset_index: None,
};
rg_out.append_column(&file, result).unwrap();
}
rg_out.close().unwrap();
}
writer.close().unwrap();
}

#[test]
fn test_rewrite_missing_column_index() {
// this file has an INT96 column that lacks a column index entry
let file = get_test_file("alltypes_tiny_pages.parquet");
let metadata = ParquetMetaDataReader::new()
.with_page_index_policy(PageIndexPolicy::Optional)
.parse_and_finish(&file)
.unwrap();

let props = Arc::new(WriterProperties::builder().build());
let schema = metadata.file_metadata().schema_descr().root_schema_ptr();
let output = Vec::<u8>::new();
let mut writer = SerializedFileWriter::new(output, schema, props).unwrap();

let column_indexes = metadata.column_index();
let offset_indexes = metadata.offset_index();

for (rg_idx, rg) in metadata.row_groups().iter().enumerate() {
let rg_column_indexes = column_indexes.and_then(|ci| ci.get(rg_idx));
let rg_offset_indexes = offset_indexes.and_then(|oi| oi.get(rg_idx));
let mut rg_out = writer.next_row_group().unwrap();
for (col_idx, column) in rg.columns().iter().enumerate() {
let column_index = rg_column_indexes.and_then(|row| row.get(col_idx)).cloned();
let offset_index = rg_offset_indexes.and_then(|row| row.get(col_idx)).cloned();
let result = ColumnCloseResult {
bytes_written: column.compressed_size() as _,
rows_written: rg.num_rows() as _,
metadata: column.clone(),
bloom_filter: None,
column_index,
offset_index,
};
rg_out.append_column(&file, result).unwrap();
}
rg_out.close().unwrap();
}
writer.close().unwrap();
}
Comment on lines +2478 to +2515
Reviewer comment (CodeRabbit AI, Nov 11, 2025):

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify: Test may expose offset index panic.

This test exercises the case where individual columns may have missing indexes (per the comment on line 2480: "this file has an INT96 column that lacks a column index entry"). Lines 2500-2501 retrieve column_index and offset_index using .get(col_idx).cloned(), both of which can return None for individual columns.

Concern: If the test file also has missing offset_index entries (not just column_index), this test will likely panic due to the .unwrap() on line 262 of parquet/src/file/metadata/writer.rs (the critical issue flagged in that file's review). The column_index handling is safe, but offset_index handling is not.

Run this test in isolation to confirm it doesn't panic:

If the test passes, it suggests offset indexes are always present in the test file (even when column indexes are missing). If it panics, it confirms the critical issue in writer.rs line 262.


🏁 Script executed:

#!/bin/bash
# Verify the test doesn't hit the offset_index unwrap panic
cd parquet
cargo test test_rewrite_missing_column_index -- --exact --nocapture

Length of output: 109


I'll use static analysis to verify the offset_index issue since the test cannot be run in this environment.


🏁 Script executed:

#!/bin/bash
# Search for offset_index unwrap patterns and finish() method
rg -n "finish|offset_index.*unwrap" parquet/src/file/writer.rs | head -20

Length of output: 978


🏁 Script executed:

#!/bin/bash
# Look at the finish() method implementation
rg -n "pub fn finish" parquet/src/file/writer.rs -A 20

Length of output: 795


🏁 Script executed:

#!/bin/bash
# Search for OffsetIndexBuilder in metadata/writer.rs - the file mentioned in original review
rg -n "unwrap\|finish" parquet/src/file/metadata/writer.rs | head -30

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Check ColumnCloseResult definition and append_column usage
rg -n "struct ColumnCloseResult|fn append_column" parquet/src/file/writer.rs -A 15 | head -50

Length of output: 623


🏁 Script executed:

#!/bin/bash
# Find ColumnCloseResult struct definition
rg -n "struct ColumnCloseResult" parquet/src/file/writer.rs -A 10

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Search for write_metadata implementation
rg -n "fn write_metadata" parquet/src/file/writer.rs -A 30 | head -50

Length of output: 1341


🏁 Script executed:

#!/bin/bash
# List files in parquet/src/file/metadata/
fd . parquet/src/file/metadata/

Length of output: 380


🏁 Script executed:

#!/bin/bash
# Broadly search for offset_index().unwrap() anywhere in parquet source
rg "offset_index.*\.unwrap" parquet/src/ -n

Length of output: 2363


🏁 Script executed:

#!/bin/bash
# Look at parquet/src/file/metadata/writer.rs around line 262
sed -n '250,280p' parquet/src/file/metadata/writer.rs

Length of output: 997


🏁 Script executed:

#!/bin/bash
# Search for ColumnCloseResult definition across all parquet source
rg -n "ColumnCloseResult" parquet/src/ -B 2 -A 8 | head -60

Length of output: 3938


🏁 Script executed:

#!/bin/bash
# Look at the offset_index unwrap in column/writer/mod.rs
sed -n '2995,3010p' parquet/src/column/writer/mod.rs

Length of output: 538


🏁 Script executed:

#!/bin/bash
# Search for OffsetIndexBuilder to understand how offset_index is built
rg -n "OffsetIndexBuilder|offset_indexes.push" parquet/src/file/writer.rs -B 2 -A 5 | head -50

Length of output: 894


🏁 Script executed:

#!/bin/bash
# Check if there's been any fix to the FIXME about unwrap panicking
rg -n "FIXME.*missing.*index|panic.*missing" parquet/src/file/metadata/writer.rs

Length of output: 74


🏁 Script executed:

#!/bin/bash
# Understand the test better - see what metadata the test loads
sed -n '2478,2495p' parquet/src/file/writer.rs

Length of output: 821


Clarify: This test likely doesn't expose the offset_index panic, but the underlying critical bug exists.

The test correctly handles potentially-missing column and offset indexes. However, the concern is probably not triggered by alltypes_tiny_pages.parquet since the file's description mentions only a missing column index, not missing offset indexes. If offset indexes are present for all columns, the test will not panic.

The real critical issue flagged by this review exists elsewhere: parquet/src/file/metadata/writer.rs line 262 contains an unresolved .unwrap() that will panic if any offset_index entry is None, as explicitly noted in the FIXME comment on line 259. This needs to be fixed regardless of whether this specific test triggers it.

🤖 Prompt for AI Agents
In parquet/src/file/metadata/writer.rs around lines 259–262, there is an
uncovered .unwrap() that will panic when an offset_index entry is None (see the
FIXME); change the code to handle Option values safely by matching the Option
instead of unwrapping: if Some(offset_index) serialize and append it, if None
append a None/absent entry (or skip writing that index according to the metadata
format) so no unwrap is called. Ensure the resulting vector/type matches the
expected metadata schema (use map or match to produce
Option<SerializedOffsetIndex> rather than calling unwrap), adjust downstream
expectations/types if needed, and add/update a unit test to cover the missing
offset_index case.

Owner Author:

value:useful; category:bug; feedback:The CodeRabbit AI reviewer is correct that the call to Option::unwrap() may lead to a panic. This is acknowledged by the PR author in a comment at line 259. The PR should not be merged until this issue is addressed.

Reviewer comment (CodeRabbit AI):

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# View the problematic code in metadata/writer.rs around lines 250-280
sed -n '250,280p' parquet/src/file/metadata/writer.rs

Length of output: 997


🏁 Script executed:

#!/bin/bash
# Get more context about how offset_indexes are used in the finish method
rg -n "offset_indexes" parquet/src/file/metadata/writer.rs -B 2 -A 2

Length of output: 3424


🏁 Script executed:

#!/bin/bash
# Look at the finish method implementation that handles these indexes
rg -n "pub fn finish" parquet/src/file/metadata/writer.rs -A 50 | head -70

Length of output: 3119


@martin-augment, thank you for confirming this critical issue. You're absolutely right that the PR should not be merged with this panic risk.

Looking at the code in parquet/src/file/metadata/writer.rs (lines 253-263), the problem is clear:

The Issue:

let offset_indexes: Option<ParquetOffsetIndex> = if all_none {
    None
} else {
    // FIXME(ets): this will panic if there's a missing index.
    offset_indexes.map(|ovvi| {
        ovvi.into_iter()
            .map(|vi| vi.into_iter().map(|oi| oi.unwrap()).collect())  // ⚠️ PANICS on None
            .collect()
    })
};

The all_none check only prevents writing when all indexes are None. When there's a mix of Some and None values (like in alltypes_tiny_pages.parquet), the .unwrap() will panic.

Suggested Fix:
Apply the same pattern used for column_indexes (lines 238-249). Replace the unwrap with unwrap_or(OffsetIndexMetaData::NONE) or similar:

let offset_indexes: Option<ParquetOffsetIndex> = if all_none {
    None
} else {
    offset_indexes.map(|ovvi| {
        ovvi.into_iter()
            .map(|vi| {
                vi.into_iter()
                    .map(|oi| oi.unwrap_or(OffsetIndexMetaData::NONE))  // Safe handling
                    .collect()
            })
            .collect()
    })
};

You should also update the write_offset_indexes method (line 73) to skip writing when encountering OffsetIndexMetaData::NONE, similar to how write_column_indexes handles it.

Recommendation: Create a GitHub issue to track this fix explicitly, or update the PR description to note this is a work-in-progress that addresses the issue partially but requires the unwrap fix before merge.

}