[Refactor] make faster _convert_tree_to_points use np.vectorize instead of merge in xbooster/lgb_constructor.py#11
Conversation
Reviewer's Guide

Refactors `_convert_tree_to_points` to vectorized scoring. Flow diagram for the optimized implementation:

flowchart TD
A[input_X_DataFrame] --> B[get_leafs_X_leaf_indices]
B --> C[get_shape_n_samples_n_trees]
C --> D[init_points_matrix_zeros_n_samples_by_n_trees]
C --> E[get_leaf_idx_values_from_X_leaf_indices]
D --> F[for_each_tree_index_t_0_to_n_trees_minus_1]
E --> F
subgraph Per_tree_vectorized_mapping
F --> G[filter_lgb_scorecard_with_points_where_Tree_equals_t]
G --> H[build_mapping_dict_Node_to_Points]
H --> I[apply_np_vectorize_mapping_to_leaf_idx_values_column_t]
I --> J[assign_result_to_points_matrix_all_rows_column_t]
end
J --> K[after_loop_build_result_DataFrame_from_points_matrix]
K --> L[set_result_index_to_X_index]
L --> M[set_result_columns_to_Score_0_to_Score_n_trees_minus_1]
M --> N[compute_total_Score_as_rowwise_sum_of_points_matrix]
N --> O[add_Score_column_to_result]
O --> P[return_result]
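The flow above can be sketched as a minimal, self-contained function. The name `convert_tree_to_points` and the toy `scorecard` layout are illustrative, mirroring the diagram and the diff shown later in the review rather than the actual xbooster API:

```python
import numpy as np
import pandas as pd

def convert_tree_to_points(X_leaf_indices: pd.DataFrame,
                           scorecard: pd.DataFrame) -> pd.DataFrame:
    """Map each sample's leaf index to points, one column per tree."""
    n_samples, n_trees = X_leaf_indices.shape
    points_matrix = np.zeros((n_samples, n_trees))
    leaf_idx_values = X_leaf_indices.values
    for t in range(n_trees):
        # Per-tree mapping dict instead of a per-tree pd.merge.
        tree_points = scorecard[scorecard["Tree"] == t]
        mapping = dict(zip(tree_points["Node"], tree_points["Points"]))
        points_matrix[:, t] = np.vectorize(mapping.get)(leaf_idx_values[:, t])
    result = pd.DataFrame(
        points_matrix,
        index=X_leaf_indices.index,
        columns=[f"Score_{t}" for t in range(n_trees)],
    )
    # Total score is the row-wise sum over all trees.
    result["Score"] = points_matrix.sum(axis=1)
    return result

# Toy example: 3 samples, 2 trees.
leaves = pd.DataFrame({"tree_0": [0, 1, 0], "tree_1": [2, 2, 3]})
scorecard = pd.DataFrame({
    "Tree":   [0, 0, 1, 1],
    "Node":   [0, 1, 2, 3],
    "Points": [10.0, 20.0, 5.0, 7.0],
})
scores = convert_tree_to_points(leaves, scorecard)
```

Each `Score_t` column holds the points of the leaf the sample fell into for tree `t`, and `Score` is their sum.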
Hey - I've found 2 issues, and left some high level feedback:
- The new lookup uses `Tree == t` where `t` is a zero-based column index, whereas the previous version derived `tree_number` from the column name; please double-check that `lgb_scorecard_with_points['Tree']` uses the same zero-based indexing and ordering as `X_leaf_indices.columns` to avoid misaligned scores.
- The old implementation applied `.round(4)` to leaf indices before merging, but the new code does exact dictionary lookups on `leaf_idx_values`; if leaf/node identifiers are floats, you may need to preserve the rounding (or another normalization) to avoid missing matches.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The new lookup uses `Tree == t` where `t` is a zero-based column index, whereas the previous version derived `tree_number` from the column name; please double-check that `lgb_scorecard_with_points['Tree']` uses the same zero-based indexing and ordering as `X_leaf_indices.columns` to avoid misaligned scores.
- The old implementation applied `.round(4)` to leaf indices before merging, but the new code does exact dictionary lookups on `leaf_idx_values`; if leaf/node identifiers are floats, you may need to preserve the rounding (or another normalization) to avoid missing matches.
## Individual Comments
### Comment 1
<location> `xbooster/lgb_constructor.py:500-502` </location>
<code_context>
+ n_samples, n_trees = X_leaf_indices.shape
+ points_matrix = np.zeros((n_samples, n_trees))
+ leaf_idx_values = X_leaf_indices.values
+ for t in range(n_trees):
# Get points for this tree
- subset_points_df = self.lgb_scorecard_with_points[
- self.lgb_scorecard_with_points["Tree"] == int(tree_number)
- ].copy()
-
- # Merge leaf indices with points
- merged_df = pd.merge(
- X_leaf_indices[[col]].round(4),
- subset_points_df[["Node", "Points"]],
- left_on=col,
- right_on="Node",
- how="left",
- )
- result[f"Score_{tree_number}"] = merged_df["Points"]
+ tree_points = self.lgb_scorecard_with_points[
+ self.lgb_scorecard_with_points["Tree"] == t
+ ]
</code_context>
<issue_to_address>
**issue (bug_risk):** Using the loop index `t` as the `Tree` filter may be brittle if column order / naming changes.
Previously, `tree_number` came from the column name and was used directly in the `Tree` filter, so the mapping depended only on the naming convention. Now the code assumes the column order in `X_leaf_indices` matches `Tree == t` for `t in range(n_trees)`. If columns are reordered, some trees are missing, or `Tree` values are not `0..n_trees-1`, scores could be assigned to the wrong tree with no error. Consider restoring the name-based mapping or otherwise explicitly aligning column names to `Tree` values instead of relying on positional order.
</issue_to_address>
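One hedged way to address this, sketched with made-up data (the `tree_<id>` column-naming convention is an assumption, not the actual xbooster layout): derive the tree id from each column name instead of trusting positional order, so reordered columns still score correctly:

```python
import numpy as np
import pandas as pd

# Columns deliberately out of order to show the name-based alignment.
X_leaf_indices = pd.DataFrame({"tree_1": [0, 1], "tree_0": [1, 0]})
scorecard = pd.DataFrame({
    "Tree":   [0, 0, 1, 1],
    "Node":   [0, 1, 0, 1],
    "Points": [10.0, 20.0, 5.0, 7.0],
})

n_samples, n_trees = X_leaf_indices.shape
points_matrix = np.zeros((n_samples, n_trees))
for t, col in enumerate(X_leaf_indices.columns):
    tree_number = int(col.split("_")[-1])  # align by name, not position
    tree_points = scorecard[scorecard["Tree"] == tree_number]
    mapping = dict(zip(tree_points["Node"], tree_points["Points"]))
    points_matrix[:, t] = np.vectorize(mapping.get)(X_leaf_indices[col].values)
```

With `Tree == t` positional filtering, the swapped columns above would silently pick up the wrong tree's points; keying on the column name keeps the mapping stable.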
### Comment 2
<location> `xbooster/lgb_constructor.py:506-507` </location>
<code_context>
+ self.lgb_scorecard_with_points["Tree"] == t
+ ]
+ # Mapping dictionary instead of merge
+ mapping_dict = dict(zip(tree_points["Node"], tree_points["Points"]))
+ points_matrix[:, t] = np.vectorize(mapping_dict.get)(leaf_idx_values[:, t])
+ result = pd.DataFrame(
</code_context>
<issue_to_address>
**issue (bug_risk):** Dropping the previous rounding on leaf indices may introduce subtle key-mismatch issues.
Previously, `X_leaf_indices[[col]].round(4)` ensured leaf indices matched the `Node` keys despite floating point noise. Now, `mapping_dict.get(leaf_idx_values[:, t])` relies on raw float values, so slight precision differences could cause failed lookups and `NaN`s, altering results. If the rounding was added for this reason, consider restoring a consistent normalization step (e.g., rounding or casting to int) before using these values as dict keys.
</issue_to_address>
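A small sketch of this failure mode and a possible fix, using a toy mapping and values (assuming integer `Node` keys): without normalization, float noise in the leaf indices makes exact dict lookups return `None`; rounding first restores the matches:

```python
import numpy as np

mapping = {0: 10.0, 1: 20.0}  # toy Node -> Points table
noisy = np.array([1e-8, 0.99999999, 1.0])  # float leaf ids with noise

# Exact lookups: only the clean 1.0 matches (1.0 hashes equal to 1);
# the two noisy values miss and come back as None.
raw = [mapping.get(v) for v in noisy]

# Normalize first (round, or cast to int) so every lookup hits a key.
points = np.vectorize(mapping.get)(np.round(noisy).astype(int))
```

In the actual code path, the `None` results would become `NaN` in the points matrix, silently altering total scores.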
Great work on these performance optimizations! All your changes from PRs #10, #11, #13, and #14 have been merged.

What's included:

Impact:

The RC branch (

Thank you for these excellent contributions!
Description
This PR optimizes the scoring engine by replacing the inefficient iterative `pd.merge` operations with vectorized dictionary mapping.

The previous implementation performed a `pd.merge` within a loop for every tree, which caused performance bottlenecks and memory overhead when scoring large datasets. The new approach pre-extracts the leaf indices as a NumPy array and applies the score mapping via a per-tree dictionary lookup (`np.vectorize(mapping_dict.get)`), resulting in a speedup.

Key Changes
Performance Benchmark
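The exact notebook numbers aren't reproduced here, but the comparison can be sketched as a self-contained micro-benchmark. The synthetic shapes and `tree_<t>` column names are made up, and only equivalence of the two outputs is asserted, since relative timings vary by machine:

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_samples, n_trees, n_leaves = 2000, 10, 31
leaves = pd.DataFrame(
    rng.integers(0, n_leaves, size=(n_samples, n_trees)),
    columns=[f"tree_{t}" for t in range(n_trees)],
)
scorecard = pd.DataFrame({
    "Tree": np.repeat(np.arange(n_trees), n_leaves),
    "Node": np.tile(np.arange(n_leaves), n_trees),
    "Points": rng.normal(size=n_trees * n_leaves),
})

def score_with_merge() -> pd.DataFrame:
    # Old style: one pd.merge per tree (how="left" preserves row order).
    out = {}
    for t in range(n_trees):
        sub = scorecard[scorecard["Tree"] == t]
        merged = pd.merge(
            leaves[[f"tree_{t}"]], sub[["Node", "Points"]],
            left_on=f"tree_{t}", right_on="Node", how="left",
        )
        out[f"Score_{t}"] = merged["Points"].to_numpy()
    return pd.DataFrame(out)

def score_with_dict() -> pd.DataFrame:
    # New style: per-tree dict lookup applied via np.vectorize.
    mat = np.zeros((n_samples, n_trees))
    vals = leaves.values
    for t in range(n_trees):
        sub = scorecard[scorecard["Tree"] == t]
        mapping = dict(zip(sub["Node"], sub["Points"]))
        mat[:, t] = np.vectorize(mapping.get)(vals[:, t])
    return pd.DataFrame(mat, columns=[f"Score_{t}" for t in range(n_trees)])

t_merge = timeit.timeit(score_with_merge, number=3)
t_dict = timeit.timeit(score_with_dict, number=3)
print(f"merge: {t_merge:.3f}s  dict-map: {t_dict:.3f}s")
```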
Measured using `%%timeit` on the example ipynb:

Summary by Sourcery
Optimize tree-to-points conversion in the LightGBM constructor for faster score computation by replacing per-tree DataFrame merges with a vectorized mapping approach.
Enhancements: