Skip to content

18623: bugfix: select_columns should validate column names#11

Open
martin-augment wants to merge 3 commits intomainfrom
pr-18623-2025-11-12-06-30-05
Open

18623: bugfix: select_columns should validate column names#11
martin-augment wants to merge 3 commits intomainfrom
pr-18623-2025-11-12-06-30-05

Conversation

@martin-augment
Copy link
Copy Markdown
Owner

18623: To review by AI

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Nov 12, 2025

Walkthrough

The changes implement stricter validation for DataFrame column selection operations. The select_columns function now explicitly returns an error when requested column names are not found in the schema, replacing prior behavior that could silently ignore missing fields. A corresponding test validates this error handling.

Changes

Cohort / File(s) Summary
Column selection validation
datafusion/core/src/dataframe/mod.rs
Added unqualified_field_not_found import. Refactored select_columns logic from flat_map to map approach that explicitly returns error on missing fields instead of silently skipping them.
Error handling tests
datafusion/core/tests/dataframe/mod.rs
Added internal_err to imports. Introduced test_columns_with_nonexistent_columns test that validates SchemaError is raised when selecting nonexistent columns via DataFrame API.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch pr-18623-2025-11-12-06-30-05

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@claude
Copy link
Copy Markdown

claude bot commented Nov 12, 2025

Pull Request Review: select_columns Validation Fix

Summary

This PR adds proper validation to select_columns() to ensure it errors when non-existent columns are specified, bringing it into consistency with similar DataFrame operations. The fix properly addresses issue apache#18623.

Code Quality and Best Practices ✅

Strengths:

  • Clean transformation from flat_map to map + error checking - this is the right pattern for validation
  • Proper use of the existing unqualified_field_not_found utility function for consistent error messaging
  • The import statement follows the codebase conventions
  • Good use of iterator chains with proper error propagation via collect::<Result<Vec<_>, _>>()?

Minor Observations:

  • The code adds one additional intermediate collection step (Vec<Vec<_>>) before flattening, which is acceptable for validation purposes
  • The logic is clear and easy to understand, though it could theoretically be optimized slightly (see Performance section)

Potential Bugs or Issues ✅

No bugs identified. The implementation correctly:

  • Checks each column name individually
  • Returns an error on the first non-existent column
  • Maintains the existing behavior of handling qualified fields
  • Preserves the flattening behavior for columns that may match multiple qualified fields

Performance Considerations ⚠️

Current Implementation:
The change introduces an additional allocation and iteration:

  1. map → collects into Vec<Result<Vec<_>, _>>
  2. Early-return error checking via collect::<Result<Vec<_>, _>>()?
  3. into_iter().flatten() → collects into final Vec<_>

Impact:

  • Minimal performance impact for typical use cases (small number of columns)
  • The additional allocation is negligible compared to the overall DataFrame operation cost
  • Error-checking happens early, which is good for fail-fast behavior

Optimization Opportunity (optional):
If this becomes a hot path, you could use a single-pass approach:

let mut fields = Vec::new();
for name in columns {
    let matching = self.plan.schema().qualified_fields_with_unqualified_name(name);
    if matching.is_empty() {
        return Err(unqualified_field_not_found(name, self.plan.schema()));
    }
    fields.extend(matching);
}

However, the current implementation is more idiomatic and readable, and the performance difference is likely negligible.

Security Concerns ✅

No security issues identified.

  • The change adds validation, which improves robustness
  • Error messages appropriately use the existing error framework
  • No risk of information leakage or unsafe operations

Test Coverage ⚠️

Positive:

  • A new test select_columns_with_nonexistent_columns() is added
  • The test properly validates that a SchemaError is returned

Suggestions for Improvement:

  1. Test specificity: The test could be more specific about which error is returned:
let t2 = t.select_columns(&["canada", "c2", "rocks"]);
match t2 {
    Err(DataFusionError::SchemaError(SchemaError::FieldNotFound { field, .. }, _)) => {
        // Verify it's the first invalid column
        assert_eq!(field.name(), "canada");
    }
    _ => panic!("Expected SchemaError::FieldNotFound for 'canada'"),
}
  1. Additional test cases to consider:

    • Test with only non-existent columns: select_columns(&["invalid1", "invalid2"])
    • Test with valid columns before invalid ones to ensure early error detection
    • Test with empty column list: select_columns(&[])
    • Edge case: very similar column names (e.g., "c1" exists, "c11" doesn't)
  2. Consistency check: Consider adding a similar test for drop_columns() since the commit message mentions that "drop_columns does accept non-existent columns" - it would be good to document whether this is intentional divergent behavior and perhaps add a test documenting that behavior.

Consistency Observation 🤔

The PR commit message states: "drop_columns does accept non-existent columns. correct select_columns which would error on duplicates"

Question: Is the difference in behavior between select_columns() (now errors on invalid) and drop_columns() (silently ignores invalid) intentional?

  • select_columns(): Errors on non-existent columns (after this PR)
  • drop_columns(): Silently ignores non-existent columns (current behavior using flat_map)

Recommendation: Consider whether drop_columns() should also validate column names for consistency. If the divergent behavior is intentional, it should be documented in the method documentation with a note explaining why.

Overall Assessment

Verdict: LGTM with minor suggestions

This is a solid bug fix that:

  • Correctly addresses the issue
  • Uses appropriate error handling patterns
  • Includes test coverage
  • Follows the codebase style

The suggestions above are minor enhancements rather than blockers. The PR is ready to merge as-is, though the additional test cases and consistency clarification would strengthen it further.


Reviewed by Claude Code 🤖

Copy link
Copy Markdown

@augmentcode augmentcode bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review completed. No suggestions at this time.

Comment augment review to trigger a new review at any time.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cc49fc0 and 3c9777e.

📒 Files selected for processing (2)
  • datafusion/core/src/dataframe/mod.rs (2 hunks)
  • datafusion/core/tests/dataframe/mod.rs (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
datafusion/core/src/dataframe/mod.rs (1)
datafusion/common/src/error.rs (1)
  • unqualified_field_not_found (1051-1056)
datafusion/core/tests/dataframe/mod.rs (1)
datafusion/core/src/test_util/mod.rs (1)
  • test_table (133-135)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Analyze (rust)
🔇 Additional comments (3)
datafusion/core/tests/dataframe/mod.rs (1)

69-70: LGTM: Import addition is correct.

The internal_err import is properly added and used in the new test function.

datafusion/core/src/dataframe/mod.rs (2)

55-56: LGTM: Clean import additions.

The new imports support the enhanced validation logic in select_columns. The unqualified_field_not_found helper will provide clear error messages including the invalid field name and list of valid fields.


310-333: Excellent bugfix: Column validation now properly errors on missing columns.

The new implementation correctly validates that all requested column names exist in the schema before proceeding. Key improvements:

  • Short-circuit error handling: The collect::<Result<Vec<_>, _>>()? pattern ensures the first missing column immediately returns an error with context
  • Clear error messages: Using unqualified_field_not_found provides users with both the invalid column name and the list of valid columns
  • Preserves multi-field handling: The flatten step correctly handles cases where an unqualified name matches multiple qualified fields

This is the right behavior—select_columns should fail fast when columns don't exist, unlike drop_columns which can safely ignore missing columns.

Comment on lines +308 to +317
#[tokio::test]
async fn select_columns_with_nonexistent_columns() -> Result<()> {
let t = test_table().await?;
let t2 = t.select_columns(&["canada", "c2", "rocks"]);
let Err(DataFusionError::SchemaError(_, _)) = t2 else {
return internal_err!("select_columns with nonexistent columns should error");
};

Ok(())
}
Copy link
Copy Markdown

@coderabbitai coderabbitai bot Nov 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Fix test data: "c2" is a valid column in the test table.

The test attempts to select ["canada", "c2", "rocks"], but c2 is a valid column in aggregate_test_100 (as shown in the select_columns test at line 296). This weakens the test's validity—if the implementation incorrectly allowed some non-existent columns through, the presence of the valid "c2" column might mask the bug.

Apply this diff to use only non-existent columns:

 async fn select_columns_with_nonexistent_columns() -> Result<()> {
     let t = test_table().await?;
-    let t2 = t.select_columns(&["canada", "c2", "rocks"]);
+    let t2 = t.select_columns(&["canada", "rocks"]);
     let Err(DataFusionError::SchemaError(_, _)) = t2 else {
         return internal_err!("select_columns with nonexistent columns should error");
     };
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
#[tokio::test]
async fn select_columns_with_nonexistent_columns() -> Result<()> {
let t = test_table().await?;
let t2 = t.select_columns(&["canada", "c2", "rocks"]);
let Err(DataFusionError::SchemaError(_, _)) = t2 else {
return internal_err!("select_columns with nonexistent columns should error");
};
Ok(())
}
#[tokio::test]
async fn select_columns_with_nonexistent_columns() -> Result<()> {
let t = test_table().await?;
let t2 = t.select_columns(&["canada", "rocks"]);
let Err(DataFusionError::SchemaError(_, _)) = t2 else {
return internal_err!("select_columns with nonexistent columns should error");
};
Ok(())
}
🤖 Prompt for AI Agents
In datafusion/core/tests/dataframe/mod.rs around lines 308 to 317, the test uses
["canada", "c2", "rocks"] but "c2" is actually present in the test table,
weakening the assertion that selecting nonexistent columns errors; update the
test to use only nonexistent column names (for example replace "c2" with a
clearly non-existent name like "c22" or "nonexistent_col") so the select_columns
call contains only missing columns and the pattern-match on
Err(DataFusionError::SchemaError(...)) will correctly validate the error
behavior.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

value:incorrect-but-reasonable; category:bug; feedback: The CodeRabbit AI reviewer suggests to remove the valid column name from the array but the idea of the test is to have a mix of valid and invalid columns and the test should return an error for the first invalid one.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@martin-augment Thank you for the clarification! You're absolutely right, and I apologize for the incorrect suggestion.

I now understand that the test is intentionally using a mix of valid and invalid column names to ensure that select_columns properly validates all columns and returns an error when any of them don't exist—not just when all columns are missing. This is actually a stronger test case than I initially recognized, as it verifies the validation doesn't stop at the first valid column.

To make this intent clearer for future reviewers, you might consider adding a brief comment above the test:

#[tokio::test]
// Test that select_columns errors when ANY column is invalid,
// even when some valid columns are also included in the selection
async fn select_columns_with_nonexistent_columns() -> Result<()> {

But the test itself is correct as-is. Thanks for taking the time to explain the rationale!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants