Skip to content

avoid collisions in logistic_regression_objective tmp column name#1002

Merged
eordentlich merged 1 commit intoNVIDIA:mainfrom
eordentlich:eo_objective_fix
Jan 8, 2026
Merged

avoid collisions in logistic_regression_objective tmp column name#1002
eordentlich merged 1 commit intoNVIDIA:mainfrom
eordentlich:eo_objective_fix

Conversation

@eordentlich
Copy link
Collaborator

No description provided.

Signed-off-by: Erik Ordentlich <eordentlich@gmail.com>
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Greptile Summary

This PR fixes a potential column name collision issue in the logistic_regression_objective function. Previously, the function hardcoded "scaledFeatures" as the output column name for the StandardScaler, which could collide with existing columns in the input DataFrame.

Changes:

  • Dynamically generates a unique temporary column name (_objective_tmp) by checking against existing column names
  • Uses a while loop to prepend underscores until a unique name is found
  • Follows the same collision avoidance pattern already used in params.py (line 118-119)

Impact:

  • Prevents runtime errors when input DataFrames contain a column named "scaledFeatures"
  • Maintains backward compatibility as this is a utility function used in testing
  • No changes to public API or behavior

Confidence Score: 5/5

  • This PR is safe to merge with no risk - it's a defensive bug fix with no breaking changes
  • The change is a straightforward defensive fix that prevents potential column name collisions. The implementation follows an existing pattern in the codebase (params.py), is well-contained, and only affects a utility function used for testing. No edge cases or issues found.
  • No files require special attention

Important Files Changed

File Analysis

Filename Score Overview
python/src/spark_rapids_ml/metrics/utils.py 5/5 Adds collision avoidance for temporary column name used by StandardScaler; follows existing codebase patterns

Sequence Diagram

sequenceDiagram
    participant Caller
    participant logistic_regression_objective
    participant DataFrame
    participant StandardScaler
    
    Caller->>logistic_regression_objective: Call with df and lr_model
    logistic_regression_objective->>logistic_regression_objective: Transform df with model
    logistic_regression_objective->>logistic_regression_objective: Calculate log_loss
    
    alt Standardization is True
        logistic_regression_objective->>DataFrame: Get df.columns
        DataFrame-->>logistic_regression_objective: Return column_names
        logistic_regression_objective->>logistic_regression_objective: Initialize outputCol = "_objective_tmp"
        
        loop While outputCol in column_names
            logistic_regression_objective->>logistic_regression_objective: outputCol = "_" + outputCol
        end
        
        logistic_regression_objective->>StandardScaler: Create with inputCol and unique outputCol
        logistic_regression_objective->>StandardScaler: Fit on df
        StandardScaler-->>logistic_regression_objective: Return scaler_model with std
        logistic_regression_objective->>logistic_regression_objective: Adjust coefficients by stdev
    end
    
    logistic_regression_objective->>logistic_regression_objective: Calculate full_objective
    logistic_regression_objective-->>Caller: Return full_objective
Loading

@eordentlich
Copy link
Collaborator Author

build

@eordentlich eordentlich merged commit c8f9672 into NVIDIA:main Jan 8, 2026
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants