Commit b5bdca3

Release v0.1.4: Multiclass Support & Enhanced Tree Binning

- Add multiclass WOE encoding with one-vs-rest approach
- Implement decision tree as default binner for numerical features
- Add class-specific prediction methods (predict_proba_class, predict_ci_class)
- Fix NaN values in last bin for numerical features
- Add comprehensive type checking with ty configuration
- Update documentation and examples for multiclass support
- Fix all type issues in core library, tests, and examples
- Add new multiclass example demonstrating functionality

1 parent a1c937c · commit b5bdca3

14 files changed: +1895 −377 lines changed

CHANGELOG.md
Lines changed: 45 additions & 1 deletion

@@ -1,6 +1,50 @@
 # Changelog

-## Version 0.1.3.post1 (Current)
+## Version 0.1.4 (Current)
+
+**Multiclass Support & Enhanced Tree Binning**: Major feature additions and API improvements
+
+- **New Features**:
+  - **Multiclass WOE Support**: Added one-vs-rest Weight of Evidence encoding for multiclass targets
+    - Automatic detection of multiclass targets (3+ unique values, not continuous proportions)
+    - One-vs-rest binary encoding for each class against all others
+    - Multiple output columns per feature: `feature_class_0`, `feature_class_1`, etc.
+    - Support for both integer and string class labels
+    - Class-specific priors stored in `y_prior_` dictionary
+  - **Enhanced Tree Binning**: Improved decision tree-based numerical feature binning
+    - Fixed NaN values in last bin issue with proper right-inclusive binning `(a, b]`
+    - Added `get_tree_estimator(feature)` method to access underlying scikit-learn trees
+    - Optimized default parameters for credit scoring: `max_depth=3`, `random_state=42`
+    - Simplified default tree parameters (removed `min_samples_leaf`, `min_samples_split`)
+  - **Unified Binner Parameters**: Streamlined API with single `binner_kwargs` parameter
+    - Replaced separate `tree_kwargs` and `faiss_kwargs` with unified approach
+    - Backward compatibility maintained for existing parameter names
+    - Cleaner API: `FastWoe(binning_method="tree", binner_kwargs={"max_depth": 2})`
+
+- **API Changes**:
+  - **Default Binning Method**: Changed from `"kbins"` to `"tree"` for numerical features
+  - **New Method**: `get_tree_estimator(feature)` to access fitted decision tree estimators
+  - **Enhanced Target Detection**: Automatic multiclass detection with `is_multiclass_target` attribute
+  - **Class Information**: Added `classes_` and `n_classes_` attributes for multiclass targets
+
+- **Fixed**:
+  - **Tree Binning NaN Bug**: Resolved issue where last bin always contained NaN values
+  - **Binning Logic**: Implemented proper right-inclusive binning `(a, b]` instead of `np.digitize`
+  - **Split Point Handling**: Improved `_create_bin_edges_from_splits` to handle duplicate splits
+  - **Test Coverage**: Added comprehensive tests for multiclass and tree binning edge cases
+
+- **Documentation & Examples**:
+  - **New Example**: `examples/fastwoe_multiclass.py` demonstrating multiclass WOE usage
+  - **Comprehensive Tests**: Added `TestMulticlassWoe` class with 9 test methods
+  - **Updated Documentation**: Clarified multiclass WOE concept and usage patterns
+
+- **Performance & Reliability**:
+  - **Credit Scoring Optimization**: Default tree parameters optimized for 4-8 bins per feature
+  - **Reproducible Results**: `random_state=42` as default for consistent binning
+  - **Memory Efficiency**: Improved handling of multiclass target encoding
+  - **Error Handling**: Enhanced validation for multiclass target types
+
+## Version 0.1.3.post1

 **Enhanced Statistical Analysis**: Added IV standard errors and Series support
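The NaN-in-last-bin fix called out in this changelog comes down to bin-edge semantics. A minimal standalone sketch (the split points are hypothetical, not taken from the library) of why left-inclusive `np.digitize` indexing loses the maximum value while right-inclusive `(a, b]` binning via `pd.cut` keeps every value in a labeled bin:

```python
import numpy as np
import pandas as pd

# Hypothetical split points, e.g. thresholds from a depth-limited tree.
splits = [2.5, 5.0, 7.5]
values = np.array([1.0, 2.5, 5.0, 7.5, 9.0])

# Left-inclusive [a, b) semantics: np.digitize places a value equal to the
# last edge at index len(edges), one past the final bin, so mapping indices
# to bin labels leaves the maximum unlabeled (NaN).
edges_closed = [values.min(), 2.5, 5.0, 7.5, values.max()]
idx = np.digitize(values, edges_closed)
print(idx)  # the maximum lands at index 5, past the 4 real bins

# Right-inclusive (a, b] bins with open-ended outer edges: every value,
# including boundary values and the maximum, falls inside a bin.
edges = [-np.inf, *splits, np.inf]
binned = pd.cut(values, bins=edges, right=True)
assert not binned.isna().any()
```

This mirrors the fix's description only; the actual `_create_bin_edges_from_splits` implementation lives in the library.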

README.md
Lines changed: 79 additions & 1 deletion

@@ -8,13 +8,14 @@
 [![PyPI downloads](https://img.shields.io/pypi/dm/fastwoe.svg)](https://pypi.org/project/fastwoe/)
 [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)

-FastWoe is a Python library for efficient **Weight of Evidence (WOE)** encoding of categorical features and statistical inference. It's designed for machine learning practitioners seeking robust, interpretable feature engineering and likelihood-ratio-based inference for binary classification problems.
+FastWoe is a Python library for efficient **Weight of Evidence (WOE)** encoding of categorical features and statistical inference. It's designed for machine learning practitioners seeking robust, interpretable feature engineering and likelihood-ratio-based inference for binary and multiclass classification problems.

 ![FastWoe](https://github.com/xRiskLab/fastwoe/raw/main/ims/title.png)

 ## 🌟 Key Features

 - **Fast WOE Encoding**: Leverages scikit-learn's `TargetEncoder` for efficient computation
+- **Multiclass Support**: One-vs-rest WOE encoding for targets with 3+ classes
 - **Statistical Confidence Intervals**: Provides standard errors and confidence intervals for WOE values
 - **IV Standard Errors**: Statistical significance testing for Information Value with confidence intervals
 - **Cardinality Control**: Built-in preprocessing to handle high-cardinality categorical features

@@ -136,6 +137,83 @@ print("\nWOE Mapping for 'category':")
 print(mapping[['category', 'count', 'event_rate', 'woe', 'woe_se']])
 ```

+## 🎯 Multiclass Support
+
+FastWoe now supports **multiclass classification** using a one-vs-rest approach! For targets with 3+ classes, FastWoe automatically creates separate WOE encodings for each class against all others.
+
+### Multiclass Example
+
+```python
+import pandas as pd
+import numpy as np
+from fastwoe import FastWoe
+from sklearn.ensemble import RandomForestClassifier
+from sklearn.metrics import classification_report
+
+# Create multiclass data
+X = pd.DataFrame({
+    'job': ['teacher', 'engineer', 'artist', 'doctor'] * 25,
+    'age_group': ['<30', '30-50', '50+'] * 33 + ['<30'],
+    'income': np.random.normal(50000, 20000, 100),
+})
+y = pd.Series([0, 1, 2, 0, 1] * 20)  # 3 classes
+
+# Fit FastWoe with multiclass target
+woe_encoder = FastWoe()
+woe_encoder.fit(X, y)
+
+# Transform data - creates multiple columns per feature
+X_woe = woe_encoder.transform(X)
+print(f"Original features: {X.shape[1]}")
+print(f"WOE features: {X_woe.shape[1]}")  # 3x more columns
+print(f"Column names: {list(X_woe.columns)}")
+# Output: ['job_class_0', 'job_class_1', 'job_class_2', 'age_group_class_0', ...]
+
+# Get probabilities for all classes
+probs = woe_encoder.predict_proba(X)
+print(f"Probabilities shape: {probs.shape}")  # (n_samples, n_classes)
+
+# Get class-specific probabilities
+class_0_probs = woe_encoder.predict_proba_class(X, class_label=0)
+class_1_probs = woe_encoder.predict_proba_class(X, class_label=1)
+
+# Get confidence intervals for specific class
+class_0_ci = woe_encoder.predict_ci_class(X, class_label=0)
+print(f"Class 0 CI shape: {class_0_ci.shape}")  # (n_samples, 2) [lower, upper]
+
+# Train a classifier on WOE features
+rf = RandomForestClassifier(n_estimators=100, random_state=42)
+rf.fit(X_woe, y)
+predictions = rf.predict(X_woe)
+
+print("\nClassification Report:")
+print(classification_report(y, predictions))
+```
+
+### Multiclass Features
+
+- **One-vs-Rest Encoding**: Each class gets separate WOE scores against all others
+- **Class-Specific Methods**: `predict_proba_class()` and `predict_ci_class()` for individual classes
+- **Softmax Probabilities**: `predict_proba()` returns probabilities that sum to 1 across classes
+- **Comprehensive Statistics**: All existing methods work with multiclass (IV analysis, feature stats, etc.)
+- **String Labels**: Supports both integer and string class labels
+
+### Class-Specific Predictions
+
+```python
+# Method 1: Extract from full results
+all_probs = woe_encoder.predict_proba(X)
+class_0_probs = all_probs[:, 0]  # Extract class 0
+
+# Method 2: Use class-specific methods (recommended)
+class_0_probs = woe_encoder.predict_proba_class(X, class_label=0)
+class_0_ci = woe_encoder.predict_ci_class(X, class_label=0)
+
+# Practical usage examples
+high_risk_mask = woe_encoder.predict_proba_class(X, class_label=0) > 0.5
+high_confidence_mask = woe_encoder.predict_ci_class(X, class_label=2)[:, 0] > 0.3
+```
+
 ## 🔧 Advanced Usage

 > [!CAUTION]
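To make the one-vs-rest mechanics described in the README diff concrete, here is a hand-rolled sketch for a single categorical feature. This is illustrative only, not FastWoe's internal implementation (which bins numeric features and builds on scikit-learn's `TargetEncoder`); the Laplace-style smoothing constant `alpha` is our own addition to avoid `log(0)` on pure categories:

```python
import numpy as np
import pandas as pd

# Toy data: one categorical feature, three classes (0, 1, 2).
X = pd.Series(["a", "b", "a", "c", "b", "a", "c", "b"], name="job")
y = pd.Series([0, 1, 2, 0, 1, 0, 2, 1])

alpha = 0.5  # smoothing so no category produces a 0 or 1 event rate
woe_by_class, prior_log_odds = {}, {}
for k in sorted(y.unique()):
    y_bin = (y == k).astype(int)                  # class k vs the rest
    prior = y_bin.mean()
    prior_log_odds[k] = np.log(prior / (1 - prior))
    stats = y_bin.groupby(X).agg(["sum", "count"])
    rate = (stats["sum"] + alpha) / (stats["count"] + 2 * alpha)
    woe_by_class[k] = np.log(rate / (1 - rate)) - prior_log_odds[k]

# One WOE column per class, mirroring job_class_0, job_class_1, ...
woe_features = pd.DataFrame(
    {f"job_class_{k}": X.map(w) for k, w in woe_by_class.items()}
)

# Softmax over per-class log-odds (prior + evidence) yields probabilities
# that sum to 1 across classes, as predict_proba() does for multiclass.
scores = np.column_stack(
    [prior_log_odds[k] + woe_features[f"job_class_{k}"] for k in sorted(woe_by_class)]
)
probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
```

The real encoder's values will differ (different smoothing, binning, and priors), but the column layout and the softmax normalization match the behavior the README describes.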

examples/fastwoe_example.py
Lines changed: 8 additions & 1 deletion

@@ -12,12 +12,19 @@
 import numpy as np
 import pandas as pd
 import seaborn as sns
-import statsmodels.api as sm
 from scipy import stats
 from sklearn.linear_model import LogisticRegression
 from sklearn.metrics import roc_auc_score
 from sklearn.model_selection import train_test_split

+try:
+    import statsmodels.api as sm
+
+    STATSMODELS_AVAILABLE = True
+except ImportError:
+    STATSMODELS_AVAILABLE = False
+    print("Warning: statsmodels not available. Some features will be disabled.")
+
 warnings.filterwarnings("ignore")

 # Set style
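The optional-import guard added above turns `statsmodels` into a soft dependency. A small sketch of the pattern in use; the `fit_logit` helper is hypothetical, not part of the example script:

```python
# Guarded optional import: the script still runs if statsmodels is absent.
try:
    import statsmodels.api as sm

    STATSMODELS_AVAILABLE = True
except ImportError:
    STATSMODELS_AVAILABLE = False


def fit_logit(X, y):
    """Fit a statsmodels Logit when available; otherwise return None."""
    if not STATSMODELS_AVAILABLE:
        return None
    return sm.Logit(y, sm.add_constant(X)).fit(disp=0)
```

Callers check the flag (or the `None` return) instead of crashing at import time, which is why the commit moves the import out of the top-level import block.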

examples/fastwoe_explanation.ipynb
Lines changed: 42 additions & 33 deletions

@@ -95,7 +95,7 @@
 },
 {
  "cell_type": "code",
- "execution_count": 3,
+ "execution_count": null,
  "metadata": {},
  "outputs": [
  {
@@ -272,21 +272,24 @@
  "\n",
  "# Print results\n",
  "print(f\"\\nExplanation for sample {idx}:\")\n",
- "print(f\"True label: {explanation['true_label']}\")\n",
- "print(f\"Predicted label: {explanation['predicted_label']}\")\n",
- "print(f\"WOE Evidence: {explanation['total_woe']:.3f}\")\n",
- "print(f\"Interpretation: {explanation['interpretation']}\")\n",
- "\n",
- "# Show feature contributions\n",
- "if \"feature_contributions\" in explanation:\n",
- "    print(\"\\nFeature contributions:\")\n",
- "    for feature, woe_val in explanation[\"feature_contributions\"].items():\n",
- "        print(f\"  {feature}: {woe_val:.3f}\")"
+ "if explanation is not None:\n",
+ "    print(f\"True label: {explanation['true_label']}\")\n",
+ "    print(f\"Predicted label: {explanation['predicted_label']}\")\n",
+ "    print(f\"WOE Evidence: {explanation['total_woe']:.3f}\")\n",
+ "    print(f\"Interpretation: {explanation['interpretation']}\")\n",
+ "\n",
+ "    # Show feature contributions\n",
+ "    if \"feature_contributions\" in explanation:\n",
+ "        print(\"\\nFeature contributions:\")\n",
+ "        for feature, woe_val in explanation[\"feature_contributions\"].items():\n",
+ "            print(f\"  {feature}: {woe_val:.3f}\")\n",
+ "else:\n",
+ "    print(\"No explanation available\")"
  ]
 },
 {
  "cell_type": "code",
- "execution_count": 4,
+ "execution_count": null,
  "metadata": {},
  "outputs": [
  {
@@ -463,21 +466,24 @@
  "\n",
  "# Print results\n",
  "print(f\"\\nExplanation for sample {idx}:\")\n",
- "print(f\"True label: {explanation['true_label']}\")\n",
- "print(f\"Predicted label: {explanation['predicted_label']}\")\n",
- "print(f\"WOE Evidence: {explanation['total_woe']:.3f}\")\n",
- "print(f\"Interpretation: {explanation['interpretation']}\")\n",
- "\n",
- "# Show feature contributions\n",
- "if \"feature_contributions\" in explanation:\n",
- "    print(\"\\nFeature contributions:\")\n",
- "    for feature, woe_val in explanation[\"feature_contributions\"].items():\n",
- "        print(f\"  {feature}: {woe_val:.3f}\")"
+ "if explanation is not None:\n",
+ "    print(f\"True label: {explanation['true_label']}\")\n",
+ "    print(f\"Predicted label: {explanation['predicted_label']}\")\n",
+ "    print(f\"WOE Evidence: {explanation['total_woe']:.3f}\")\n",
+ "    print(f\"Interpretation: {explanation['interpretation']}\")\n",
+ "\n",
+ "    # Show feature contributions\n",
+ "    if \"feature_contributions\" in explanation:\n",
+ "        print(\"\\nFeature contributions:\")\n",
+ "        for feature, woe_val in explanation[\"feature_contributions\"].items():\n",
+ "            print(f\"  {feature}: {woe_val:.3f}\")\n",
+ "else:\n",
+ "    print(\"No explanation available\")"
  ]
 },
 {
  "cell_type": "code",
- "execution_count": 5,
+ "execution_count": null,
  "metadata": {},
  "outputs": [
  {
@@ -654,16 +660,19 @@
  "\n",
  "# Print results\n",
  "print(f\"\\nExplanation for sample {idx}:\")\n",
- "print(f\"True label: {explanation['true_label']}\")\n",
- "print(f\"Predicted label: {explanation['predicted_label']}\")\n",
- "print(f\"WOE Evidence: {explanation['total_woe']:.3f}\")\n",
- "print(f\"Interpretation: {explanation['interpretation']}\")\n",
- "\n",
- "# Show feature contributions\n",
- "if \"feature_contributions\" in explanation:\n",
- "    print(\"\\nFeature contributions:\")\n",
- "    for feature, woe_val in explanation[\"feature_contributions\"].items():\n",
- "        print(f\"  {feature}: {woe_val:.3f}\")"
+ "if explanation is not None:\n",
+ "    print(f\"True label: {explanation['true_label']}\")\n",
+ "    print(f\"Predicted label: {explanation['predicted_label']}\")\n",
+ "    print(f\"WOE Evidence: {explanation['total_woe']:.3f}\")\n",
+ "    print(f\"Interpretation: {explanation['interpretation']}\")\n",
+ "\n",
+ "    # Show feature contributions\n",
+ "    if \"feature_contributions\" in explanation:\n",
+ "        print(\"\\nFeature contributions:\")\n",
+ "        for feature, woe_val in explanation[\"feature_contributions\"].items():\n",
+ "            print(f\"  {feature}: {woe_val:.3f}\")\n",
+ "else:\n",
+ "    print(\"No explanation available\")"
  ]
 },
 {