18623: bugfix: select_columns should validate column names #11
martin-augment wants to merge 3 commits into main
Walkthrough
The changes implement stricter validation for DataFrame column selection operations.
Pull Request Review: select_columns Validation Fix
Summary
This PR adds proper validation to
Code Quality and Best Practices ✅
Strengths:
Minor Observations:
Potential Bugs or Issues ✅
No bugs identified. The implementation correctly:
Performance Considerations
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
datafusion/core/src/dataframe/mod.rs (2 hunks)
datafusion/core/tests/dataframe/mod.rs (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
datafusion/core/src/dataframe/mod.rs (1)
datafusion/common/src/error.rs (1)
unqualified_field_not_found(1051-1056)
datafusion/core/tests/dataframe/mod.rs (1)
datafusion/core/src/test_util/mod.rs (1)
test_table(133-135)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Analyze (rust)
🔇 Additional comments (3)
datafusion/core/tests/dataframe/mod.rs (1)
69-70: LGTM: Import addition is correct.
The internal_err import is properly added and used in the new test function.
datafusion/core/src/dataframe/mod.rs (2)
55-56: LGTM: Clean import additions.
The new imports support the enhanced validation logic in select_columns. The unqualified_field_not_found helper will provide clear error messages including the invalid field name and list of valid fields.
310-333: Excellent bugfix: Column validation now properly errors on missing columns.
The new implementation correctly validates that all requested column names exist in the schema before proceeding. Key improvements:
- Short-circuit error handling: The collect::<Result<Vec<_>, _>>()? pattern ensures the first missing column immediately returns an error with context
- Clear error messages: Using unqualified_field_not_found provides users with both the invalid column name and the list of valid columns
- Preserves multi-field handling: The flatten step correctly handles cases where an unqualified name matches multiple qualified fields
This is the right behavior: select_columns should fail fast when columns don't exist, unlike drop_columns, which can safely ignore missing columns.
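The short-circuit and flatten steps described in this comment can be sketched in isolation. The following is a minimal, standard-library-only illustration, not the DataFusion implementation: the `Field` tuple, the function name `select_columns_sketch`, and the error string are all hypothetical stand-ins for the real schema types and the unqualified_field_not_found helper.

```rust
/// Hypothetical stand-in for a schema field: (qualifier, unqualified name).
type Field = (&'static str, &'static str);

/// Resolves each requested name to every qualified field it matches,
/// returning an error for the first name that matches nothing.
fn select_columns_sketch(schema: &[Field], names: &[&str]) -> Result<Vec<String>, String> {
    let per_name = names
        .iter()
        .map(|name| {
            let matches: Vec<String> = schema
                .iter()
                .filter(|(_, field_name)| field_name == name)
                .map(|(qualifier, field_name)| format!("{qualifier}.{field_name}"))
                .collect();
            if matches.is_empty() {
                // Assumed error text; the real helper builds a richer SchemaError.
                Err(format!("field not found: {name}"))
            } else {
                Ok(matches)
            }
        })
        // Collecting into a Result stops at the first Err and returns it.
        .collect::<Result<Vec<_>, _>>()?;

    // One unqualified name may match several qualified fields, so flatten.
    Ok(per_name.into_iter().flatten().collect())
}
```

Under these assumptions, a request that mixes valid and missing names fails on the first missing one, while a request for a name shared by two qualified fields yields both of them.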
#[tokio::test]
async fn select_columns_with_nonexistent_columns() -> Result<()> {
    let t = test_table().await?;
    let t2 = t.select_columns(&["canada", "c2", "rocks"]);
    let Err(DataFusionError::SchemaError(_, _)) = t2 else {
        return internal_err!("select_columns with nonexistent columns should error");
    };

    Ok(())
}
Fix test data: "c2" is a valid column in the test table.
The test attempts to select ["canada", "c2", "rocks"], but c2 is a valid column in aggregate_test_100 (as shown in the select_columns test at line 296). This weakens the test's validity—if the implementation incorrectly allowed some non-existent columns through, the presence of the valid "c2" column might mask the bug.
Apply this diff to use only non-existent columns:
 async fn select_columns_with_nonexistent_columns() -> Result<()> {
     let t = test_table().await?;
-    let t2 = t.select_columns(&["canada", "c2", "rocks"]);
+    let t2 = t.select_columns(&["canada", "rocks"]);
     let Err(DataFusionError::SchemaError(_, _)) = t2 else {
         return internal_err!("select_columns with nonexistent columns should error");
     };
📝 Committable suggestion
-#[tokio::test]
-async fn select_columns_with_nonexistent_columns() -> Result<()> {
-    let t = test_table().await?;
-    let t2 = t.select_columns(&["canada", "c2", "rocks"]);
-    let Err(DataFusionError::SchemaError(_, _)) = t2 else {
-        return internal_err!("select_columns with nonexistent columns should error");
-    };
-    Ok(())
-}
+#[tokio::test]
+async fn select_columns_with_nonexistent_columns() -> Result<()> {
+    let t = test_table().await?;
+    let t2 = t.select_columns(&["canada", "rocks"]);
+    let Err(DataFusionError::SchemaError(_, _)) = t2 else {
+        return internal_err!("select_columns with nonexistent columns should error");
+    };
+    Ok(())
+}
🤖 Prompt for AI Agents
In datafusion/core/tests/dataframe/mod.rs around lines 308 to 317, the test uses
["canada", "c2", "rocks"] but "c2" is actually present in the test table,
weakening the assertion that selecting nonexistent columns errors; update the
test to use only nonexistent column names (for example replace "c2" with a
clearly non-existent name like "c22" or "nonexistent_col") so the select_columns
call contains only missing columns and the pattern-match on
Err(DataFusionError::SchemaError(...)) will correctly validate the error
behavior.
value:incorrect-but-reasonable; category:bug; feedback: The CodeRabbit AI reviewer suggests removing the valid column name from the array, but the test intentionally mixes valid and invalid columns, and select_columns should return an error for the first invalid one.
@martin-augment Thank you for the clarification! You're absolutely right, and I apologize for the incorrect suggestion.
I now understand that the test intentionally mixes valid and invalid column names to ensure that select_columns validates every column and returns an error when any of them is missing, not only when all of them are. This is a stronger test case than I initially recognized, because it verifies that validation doesn't stop at the first valid column.
To make this intent clearer for future reviewers, you might consider adding a brief comment above the test:
#[tokio::test]
// Test that select_columns errors when ANY column is invalid,
// even when some valid columns are also included in the selection
async fn select_columns_with_nonexistent_columns() -> Result<()> {
But the test itself is correct as-is. Thanks for taking the time to explain the rationale!
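The point about mixed valid and invalid names can be illustrated with a small standalone check. This is a sketch only, assuming a schema modelled as a plain list of names; `validate_selection` and its error string are hypothetical and are not part of DataFusion.

```rust
/// Hypothetical strict validator: every requested name must exist in `schema`.
/// Returns an error naming the first requested column that is missing.
fn validate_selection(schema: &[&str], requested: &[&str]) -> Result<(), String> {
    match requested.iter().find(|name| !schema.contains(name)) {
        // Assumed error text, standing in for a schema error.
        Some(missing) => Err(format!("field not found: {missing}")),
        None => Ok(()),
    }
}
```

With a schema of c1, c2, c3, the request ["canada", "c2", "rocks"] fails on canada, and ["c2", "rocks", "canada"] fails on rocks: a valid name anywhere in the list does not let the missing ones through.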
18623: To review by AI