18623: bugfix: select_columns should validate column names by martin-augment · Pull Request #11 · martin-augment/datafusion

martin-augment · 2025-11-12T06:30:09Z

18623: To review by AI

… which would error on duplicates

coderabbitai · 2025-11-12T06:30:48Z

Walkthrough

The changes implement stricter validation for DataFrame column selection operations. The select_columns function now explicitly returns an error when requested column names are not found in the schema, replacing prior behavior that could silently ignore missing fields. A corresponding test validates this error handling.

Changes

Cohort / File(s)	Summary
Column selection validation `datafusion/core/src/dataframe/mod.rs`	Added `unqualified_field_not_found` import. Refactored `select_columns` logic from `flat_map` to `map` approach that explicitly returns error on missing fields instead of silently skipping them.
Error handling tests `datafusion/core/tests/dataframe/mod.rs`	Added `internal_err` to imports. Introduced `test_columns_with_nonexistent_columns` test that validates SchemaError is raised when selecting nonexistent columns via DataFrame API.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch pr-18623-2025-11-12-06-30-05

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

claude · 2025-11-12T06:31:52Z

Pull Request Review: select_columns Validation Fix

Summary

This PR adds proper validation to select_columns() to ensure it errors when non-existent columns are specified, bringing it into consistency with similar DataFrame operations. The fix properly addresses issue apache#18623.

Code Quality and Best Practices ✅

Strengths:

Clean transformation from flat_map to map + error checking - this is the right pattern for validation
Proper use of the existing unqualified_field_not_found utility function for consistent error messaging
The import statement follows the codebase conventions
Good use of iterator chains with proper error propagation via collect::<Result<Vec<_>, _>>()?

Minor Observations:

The code adds one additional intermediate collection step (Vec<Vec<_>>) before flattening, which is acceptable for validation purposes
The logic is clear and easy to understand, though it could theoretically be optimized slightly (see Performance section)

Potential Bugs or Issues ✅

No bugs identified. The implementation correctly:

Checks each column name individually
Returns an error on the first non-existent column
Maintains the existing behavior of handling qualified fields
Preserves the flattening behavior for columns that may match multiple qualified fields

Performance Considerations ⚠️

Current Implementation:
The change introduces an additional allocation and iteration:

map → collects into Vec<Result<Vec<_>, _>>
Early-return error checking via collect::<Result<Vec<_>, _>>()?
into_iter().flatten() → collects into final Vec<_>

Impact:

Minimal performance impact for typical use cases (small number of columns)
The additional allocation is negligible compared to the overall DataFrame operation cost
Error-checking happens early, which is good for fail-fast behavior

Optimization Opportunity (optional):
If this becomes a hot path, you could use a single-pass approach:

let mut fields = Vec::new();
for name in columns {
    let matching = self.plan.schema().qualified_fields_with_unqualified_name(name);
    if matching.is_empty() {
        return Err(unqualified_field_not_found(name, self.plan.schema()));
    }
    fields.extend(matching);
}

However, the current implementation is more idiomatic and readable, and the performance difference is likely negligible.

Security Concerns ✅

No security issues identified.

The change adds validation, which improves robustness
Error messages appropriately use the existing error framework
No risk of information leakage or unsafe operations

Test Coverage ⚠️

Positive:

A new test select_columns_with_nonexistent_columns() is added
The test properly validates that a SchemaError is returned

Suggestions for Improvement:

Test specificity: The test could be more specific about which error is returned:

let t2 = t.select_columns(&["canada", "c2", "rocks"]);
match t2 {
    Err(DataFusionError::SchemaError(SchemaError::FieldNotFound { field, .. }, _)) => {
        // Verify it's the first invalid column
        assert_eq!(field.name(), "canada");
    }
    _ => panic!("Expected SchemaError::FieldNotFound for 'canada'"),
}

Additional test cases to consider:
- Test with only non-existent columns: select_columns(&["invalid1", "invalid2"])
- Test with valid columns before invalid ones to ensure early error detection
- Test with empty column list: select_columns(&[])
- Edge case: very similar column names (e.g., "c1" exists, "c11" doesn't)
Consistency check: Consider adding a similar test for drop_columns() since the commit message mentions that "drop_columns does accept non-existent columns" - it would be good to document whether this is intentional divergent behavior and perhaps add a test documenting that behavior.

Consistency Observation 🤔

The PR commit message states: "drop_columns does accept non-existent columns. correct select_columns which would error on duplicates"

Question: Is the difference in behavior between select_columns() (now errors on invalid) and drop_columns() (silently ignores invalid) intentional?

select_columns(): Errors on non-existent columns (after this PR)
drop_columns(): Silently ignores non-existent columns (current behavior using flat_map)

Recommendation: Consider whether drop_columns() should also validate column names for consistency. If the divergent behavior is intentional, it should be documented in the method documentation with a note explaining why.

Overall Assessment

Verdict: LGTM with minor suggestions ✅

This is a solid bug fix that:

Correctly addresses the issue
Uses appropriate error handling patterns
Includes test coverage
Follows the codebase style

The suggestions above are minor enhancements rather than blockers. The PR is ready to merge as-is, though the additional test cases and consistency clarification would strengthen it further.

Reviewed by Claude Code 🤖

augmentcode

Review completed. No suggestions at this time.

Comment augment review to trigger a new review at any time.

coderabbitai

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cc49fc0 and 3c9777e.

📒 Files selected for processing (2)

datafusion/core/src/dataframe/mod.rs (2 hunks)
datafusion/core/tests/dataframe/mod.rs (2 hunks)

🧰 Additional context used

🧬 Code graph analysis (2)

datafusion/core/src/dataframe/mod.rs (1)

datafusion/common/src/error.rs (1)

unqualified_field_not_found (1051-1056)

datafusion/core/tests/dataframe/mod.rs (1)

datafusion/core/src/test_util/mod.rs (1)

test_table (133-135)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Analyze (rust)

🔇 Additional comments (3)

datafusion/core/tests/dataframe/mod.rs (1)

69-70: LGTM: Import addition is correct.

The internal_err import is properly added and used in the new test function.

datafusion/core/src/dataframe/mod.rs (2)

55-56: LGTM: Clean import additions.

The new imports support the enhanced validation logic in select_columns. The unqualified_field_not_found helper will provide clear error messages including the invalid field name and list of valid fields.

310-333: Excellent bugfix: Column validation now properly errors on missing columns.

The new implementation correctly validates that all requested column names exist in the schema before proceeding. Key improvements:

Short-circuit error handling: The collect::<Result<Vec<_>, _>>()? pattern ensures the first missing column immediately returns an error with context

Clear error messages: Using unqualified_field_not_found provides users with both the invalid column name and the list of valid columns

Preserves multi-field handling: The flatten step correctly handles cases where an unqualified name matches multiple qualified fields

This is the right behavior—select_columns should fail fast when columns don't exist, unlike drop_columns which can safely ignore missing columns.

coderabbitai · 2025-11-12T06:33:23Z

+#[tokio::test]
+async fn select_columns_with_nonexistent_columns() -> Result<()> {
+    let t = test_table().await?;
+    let t2 = t.select_columns(&["canada", "c2", "rocks"]);
+    let Err(DataFusionError::SchemaError(_, _)) = t2 else {
+        return internal_err!("select_columns with nonexistent columns should error");
+    };
+
+    Ok(())
+}


⚠️ Potential issue | 🟠 Major

Fix test data: "c2" is a valid column in the test table.

The test attempts to select ["canada", "c2", "rocks"], but c2 is a valid column in aggregate_test_100 (as shown in the select_columns test at line 296). This weakens the test's validity—if the implementation incorrectly allowed some non-existent columns through, the presence of the valid "c2" column might mask the bug.

Apply this diff to use only non-existent columns:

async fn select_columns_with_nonexistent_columns() -> Result<()> { let t = test_table().await?; - let t2 = t.select_columns(&["canada", "c2", "rocks"]); + let t2 = t.select_columns(&["canada", "rocks"]); let Err(DataFusionError::SchemaError(_, _)) = t2 else { return internal_err!("select_columns with nonexistent columns should error"); };

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

#[tokio::test]

async fn select_columns_with_nonexistent_columns() -> Result<()> {

let t = test_table().await?;

let t2 = t.select_columns(&["canada", "c2", "rocks"]);

let Err(DataFusionError::SchemaError(_, _)) = t2 else {

return internal_err!("select_columns with nonexistent columns should error");

};

Ok(())

}

#[tokio::test]

async fn select_columns_with_nonexistent_columns() -> Result<()> {

let t = test_table().await?;

let t2 = t.select_columns(&["canada", "rocks"]);

let Err(DataFusionError::SchemaError(_, _)) = t2 else {

return internal_err!("select_columns with nonexistent columns should error");

};

Ok(())

}

🤖 Prompt for AI Agents

In datafusion/core/tests/dataframe/mod.rs around lines 308 to 317, the test uses ["canada", "c2", "rocks"] but "c2" is actually present in the test table, weakening the assertion that selecting nonexistent columns errors; update the test to use only nonexistent column names (for example replace "c2" with a clearly non-existent name like "c22" or "nonexistent_col") so the select_columns call contains only missing columns and the pattern-match on Err(DataFusionError::SchemaError(...)) will correctly validate the error behavior.

value:incorrect-but-reasonable; category:bug; feedback: The CodeRabbit AI reviewer suggests to remove the valid column name from the array but the idea of the test is to have a mix of valid and invalid columns and the test should return an error for the first invalid one.

@martin-augment Thank you for the clarification! You're absolutely right, and I apologize for the incorrect suggestion.

I now understand that the test is intentionally using a mix of valid and invalid column names to ensure that select_columns properly validates all columns and returns an error when any of them don't exist—not just when all columns are missing. This is actually a stronger test case than I initially recognized, as it verifies the validation doesn't stop at the first valid column.

To make this intent clearer for future reviewers, you might consider adding a brief comment above the test:

#[tokio::test] // Test that select_columns errors when ANY column is invalid, // even when some valid columns are also included in the selection async fn select_columns_with_nonexistent_columns() -> Result<()> {

But the test itself is correct as-is. Thanks for taking the time to explain the rationale!

timsaucer added 3 commits November 11, 2025 08:39

Add check for invalid columns should return an error

44597fd

Add unit test coverage for invalid column selection and drop

91323e1

drop_columns does accept non-existent columns. correct select_columns…

3c9777e

… which would error on duplicates

augmentcode bot reviewed Nov 12, 2025

View reviewed changes

coderabbitai bot reviewed Nov 12, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

18623: bugfix: select_columns should validate column names#11

18623: bugfix: select_columns should validate column names#11
martin-augment wants to merge 3 commits intomainfrom
pr-18623-2025-11-12-06-30-05

martin-augment commented Nov 12, 2025

Uh oh!

coderabbitai bot commented Nov 12, 2025 •

edited

Loading

Uh oh!

claude bot commented Nov 12, 2025

Uh oh!

augmentcode bot left a comment

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Nov 12, 2025 •

edited

Loading

Uh oh!

martin-augment Nov 12, 2025

Uh oh!

coderabbitai bot Nov 12, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

martin-augment commented Nov 12, 2025

Uh oh!

coderabbitai bot commented Nov 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Uh oh!

claude bot commented Nov 12, 2025

Pull Request Review: select_columns Validation Fix

Summary

Code Quality and Best Practices ✅

Potential Bugs or Issues ✅

Performance Considerations ⚠️

Security Concerns ✅

Test Coverage ⚠️

Consistency Observation 🤔

Overall Assessment

Uh oh!

augmentcode bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Nov 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

martin-augment Nov 12, 2025

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Nov 12, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai bot commented Nov 12, 2025 •

edited

Loading

coderabbitai bot Nov 12, 2025 •

edited

Loading