A complete machine learning pipeline for optimizing bank telemarketing campaigns using cost-sensitive learning, achieving 81.1% customer capture rate while minimizing campaign costs to $0.516 per contact.
This project implements a data mining solution to predict term deposit subscriptions from telemarketing campaigns. By applying cost-sensitive optimization techniques, we develop a model that balances recall (capturing potential customers) against precision (avoiding wasted calls).
| Metric | Value | Description |
|---|---|---|
| Recall | 81.1% | Captures 81% of potential customers |
| Cost per Contact | $0.516 | Optimized using custom cost matrix |
| ROC-AUC | 0.804 | Strong discrimination ability |
| Optimal Threshold | 0.34 | Cost-optimized decision boundary |
| Model | Logistic Regression | Winner after comparative analysis |
For a 10,000-customer campaign:
- Expected acceptors: 1,130 (11.3% base rate)
- Our model captures: 917 customers (81.1%)
- Naive baseline captures: 376 customers (33.3%)
- Improvement: +541 customers (+144% lift)
Revenue Impact: At $100 profit per subscription, the 541 additional customers are worth $54,100 per 10,000 contacts
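The arithmetic behind these figures can be checked with a few lines of Python. This is only a sketch: the captured-customer counts are copied from the list above rather than recomputed, because the rounded recall percentages reproduce them only approximately, and the variable names are illustrative.

```python
# Figures taken from the README; variable names are illustrative.
contacts = 10_000
base_rate = 0.113              # share of contacts who subscribe
model_captured = 917           # customers captured by the optimized model
baseline_captured = 376        # customers captured by the naive baseline
profit_per_subscription = 100  # assumed dollar profit per subscription

expected_acceptors = round(contacts * base_rate)
extra_customers = model_captured - baseline_captured
lift_pct = 100 * extra_customers / baseline_captured
extra_profit = extra_customers * profit_per_subscription

print(f"Expected acceptors: {expected_acceptors:,}")
print(f"Additional customers: {extra_customers} (+{lift_pct:.0f}% lift)")
print(f"Additional profit per 10k contacts: ${extra_profit:,}")
```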
- Complete ML Pipeline: Data loading → EDA → Preprocessing → Modeling → Optimization → Evaluation
- 21 Professional Visualizations: Class distributions, correlations, model performance, feature importance
- Cost-Sensitive Learning: Custom cost matrix reflecting business priorities (FN:FP = 13.3:1)
- Model Comparison: Decision Tree vs Logistic Regression with hyperparameter tuning
- Interpretability: Feature importance analysis and actionable business recommendations
- Reproducible: Automated pipeline script + Jupyter notebook with detailed explanations
- Academic Report: 5,500-word report following scientific standards
bankml/
├── README.md # This file
├── requirements.txt # Python dependencies
├── notebook.ipynb # Main Jupyter notebook (primary deliverable)
├── run_all.py # Automated pipeline script
├── report.md # Academic report (~5,500 words)
├── report.docx # Word version for Google Docs
│
├── data/
│ └── bank-marketing.csv # UCI Bank Marketing Dataset (41K records)
│
├── assets/ # 21 visualization images
│ ├── 01_class_distribution.png # Target distribution
│ ├── 05_economic_indicators.png # Economic context analysis
│ ├── 12_duration_leakage_analysis.png # Data leakage explanation
│ ├── 17_cost_threshold_optimization.png # Cost optimization
│ ├── 18_dt_feature_importance.png # Decision Tree features
│ ├── 20_lr_coefficients.png # Logistic Regression coefficients
│ └── 21_roc_comparison_final.png # Final model comparison
│
├── models/ # Trained models and results
│ ├── best_decision_tree.pkl
│ ├── best_logistic_regression.pkl
│ └── final_results.json
│
└── docs/ # Documentation
    ├── EXECUTION_PLAN.md # Detailed methodology
    └── SUBMISSION.md # Assignment submission guide
- Python 3.8+
- pip
# Clone repository
git clone https://github.com/yourusername/bankml.git
cd bankml
# Install dependencies
pip install -r requirements.txt
Option 1: View Results (No Execution)
# Open Jupyter notebook to see complete analysis with outputs
jupyter notebook notebook.ipynb
# View academic report
cat report.md
# or open report.docx in Word/Google Docs
Option 2: Reproduce Full Pipeline
# Run complete pipeline (5-10 minutes)
python run_all.py
# This will:
# - Load and preprocess data
# - Generate 21 visualizations (saved to assets/)
# - Train and optimize models (Decision Tree + Logistic Regression)
# - Perform cost-sensitive threshold optimization
# - Save models to models/
# - Print performance metrics
Option 3: Interactive Exploration
# Launch Jupyter and run cells step-by-step
jupyter notebook notebook.ipynb
- Dataset: 41,188 telemarketing contacts from a Portuguese bank (May 2008 to November 2010)
- Features: 20 input variables (demographics, campaign info, economic indicators)
- Target: Term deposit subscription (11.3% positive class)
- Key Finding: Duration variable shows data leakage → excluded from models
- Missing Values: Mode/mean imputation
- Feature Engineering:
  - Created was_contacted_before (binary from pdays)
  - Log transformations: campaign_log, previous_log
  - Dropped duration (data leakage)
- Encoding: One-hot encoding for categorical variables
- Scaling: StandardScaler for numerical features
- Split: 75/25 stratified train-test split
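The preprocessing steps above could be wired together roughly as follows. This is a minimal sketch, not the project's actual code: it runs on a small synthetic frame instead of data/bank-marketing.csv, the helper `engineer_features` is hypothetical, and the use of `log1p` for the log transformations is an assumption. In the UCI dataset, `pdays == 999` marks clients who were never contacted before.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the real CSV; all values are made up.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(18, 90, n),
    "campaign": rng.integers(1, 20, n),
    "previous": rng.integers(0, 5, n),
    "pdays": rng.choice([999, 3, 6], n, p=[0.9, 0.05, 0.05]),
    "duration": rng.integers(0, 1000, n),
    "contact": rng.choice(["cellular", "telephone"], n),
    "y": np.where(np.arange(n) % 9 == 0, "yes", "no"),
})

def engineer_features(frame):
    """Hypothetical helper mirroring the feature-engineering bullets."""
    out = frame.copy()
    out["was_contacted_before"] = (out["pdays"] != 999).astype(int)
    out["campaign_log"] = np.log1p(out["campaign"])  # log1p is an assumption
    out["previous_log"] = np.log1p(out["previous"])
    return out.drop(columns=["duration"])            # leakage: known only after the call

numeric = ["age", "campaign_log", "previous_log", "was_contacted_before"]
categorical = ["contact"]

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="mean")),
                      ("scale", StandardScaler())]), numeric),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("onehot", OneHotEncoder(handle_unknown="ignore"))]), categorical),
])

X = engineer_features(df.drop(columns=["y"]))
y = (df["y"] == "yes").astype(int)

# 75/25 split that preserves the class ratio in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Fit imputers, scaler, and encoder on the training part only.
X_train_t = preprocess.fit_transform(X_train)
X_test_t = preprocess.transform(X_test)
```

Fitting the transformers on the training split alone keeps test-set statistics out of the imputation and scaling steps.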
- Decision Tree (Entropy): 33.3% recall
- Logistic Regression (L2): 64.4% recall ✓ Better baseline
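A baseline comparison of this kind might look like the sketch below. It uses synthetic data from `make_classification` with a class balance similar to the dataset's, and default estimator settings apart from the entropy criterion; the project's actual baseline settings are not stated here, so the numbers it prints will not match the figures above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced data standing in for the preprocessed features.
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.887], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

baselines = {
    "Decision Tree (entropy)": DecisionTreeClassifier(criterion="entropy", random_state=42),
    "Logistic Regression (L2)": LogisticRegression(max_iter=1000),  # L2 is the default penalty
}

results = {}
for name, model in baselines.items():
    model.fit(X_train, y_train)
    prob = model.predict_proba(X_test)[:, 1]
    results[name] = {
        "recall": recall_score(y_test, model.predict(X_test)),
        "roc_auc": roc_auc_score(y_test, prob),
    }
    print(f"{name}: recall={results[name]['recall']:.3f}, "
          f"ROC-AUC={results[name]['roc_auc']:.3f}")
```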
- 5-fold Stratified Cross-Validation
- Decision Tree: max_depth, min_samples_leaf, ccp_alpha
- Logistic Regression: C, penalty, solver
- Result: DT recall improved to 62.1%
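The tuning setup described above could be expressed with `GridSearchCV` and `StratifiedKFold` as sketched here. The grid values and the choice of recall as the scoring metric are assumptions for illustration; only the tuned parameter names come from the list above, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced data standing in for the preprocessed training set.
X, y = make_classification(n_samples=600, n_features=8, weights=[0.887], random_state=0)

# 5-fold stratified CV keeps the class ratio constant across folds.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

searches = {
    "decision_tree": GridSearchCV(
        DecisionTreeClassifier(criterion="entropy", random_state=42),
        param_grid={  # illustrative values, not the project's grid
            "max_depth": [3, 5, None],
            "min_samples_leaf": [1, 10, 50],
            "ccp_alpha": [0.0, 0.001, 0.01],
        },
        scoring="recall",  # assumed metric
        cv=cv,
    ),
    "logistic_regression": GridSearchCV(
        LogisticRegression(max_iter=1000),
        param_grid={  # illustrative values, not the project's grid
            "C": [0.01, 0.1, 1, 10],
            "penalty": ["l2"],
            "solver": ["lbfgs", "liblinear"],
        },
        scoring="recall",
        cv=cv,
    ),
}

for name, search in searches.items():
    search.fit(X, y)
    print(name, search.best_params_, round(search.best_score_, 3))
```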
Cost Matrix:
False Positive (unnecessary call): +1.5
False Negative (missed customer): +20.0
True Positive (successful sale): -5.0
True Negative (correct avoid): 0.0
Threshold Optimization: Swept thresholds from 0.01 to 0.99 to find the minimum expected cost
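A threshold sweep over this cost matrix can be sketched as follows. The function names `expected_cost` and `best_threshold` are hypothetical, and averaging the total cost over all contacts (to get a cost per contact) is an assumption about how the reported figure is defined.

```python
import numpy as np

# Cost matrix values from the README.
COST_FP, COST_FN, COST_TP, COST_TN = 1.5, 20.0, -5.0, 0.0

def expected_cost(y_true, y_prob, threshold):
    """Average cost per contact when calling everyone with y_prob >= threshold."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_prob) >= threshold
    tp = np.sum(y_pred & (y_true == 1))
    fp = np.sum(y_pred & (y_true == 0))
    fn = np.sum(~y_pred & (y_true == 1))
    tn = np.sum(~y_pred & (y_true == 0))
    total = COST_TP * tp + COST_FP * fp + COST_FN * fn + COST_TN * tn
    return total / len(y_true)

def best_threshold(y_true, y_prob, grid=None):
    """Return (threshold, cost) minimizing expected cost over the grid."""
    if grid is None:
        grid = np.linspace(0.01, 0.99, 99)  # 0.01, 0.02, ..., 0.99
    costs = np.array([expected_cost(y_true, y_prob, t) for t in grid])
    i = int(np.argmin(costs))
    return float(grid[i]), float(costs[i])

# Toy example: six contacts, two of them true subscribers.
y_true = [0, 0, 0, 0, 1, 1]
y_prob = [0.1, 0.2, 0.3, 0.6, 0.4, 0.9]
threshold, cost = best_threshold(y_true, y_prob)
print(f"best threshold={threshold:.2f}, cost per contact={cost:.3f}")
```

Because a missed customer costs about 13 times as much as an unnecessary call, the minimum-cost threshold tends to sit well below the default 0.5.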
| Model | Stage | Recall | Cost | ROC-AUC |
|---|---|---|---|---|
| Decision Tree | Baseline | 33.3% | - | 0.763 |
| Decision Tree | Tuned | 62.1% | - | 0.801 |
| Decision Tree | Optimized | 69.4% | 0.552 | 0.801 |
| Logistic Regression | Baseline | 64.4% | - | 0.804 |
| Logistic Regression | Tuned | 64.4% | - | 0.804 |
| Logistic Regression | Optimized | 81.1% | 0.516 | 0.804 |
Winner: Logistic Regression with threshold=0.34
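Using the winning model means comparing predicted probabilities against 0.34 instead of scikit-learn's default of 0.5. The sketch below shows the idea under stated assumptions: it trains a toy model on synthetic data and round-trips it through a temporary pickle file, whereas in the repository one would presumably load models/best_logistic_regression.pkl and pass it features preprocessed exactly as in training.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

THRESHOLD = 0.34  # cost-optimized decision boundary from the results table

# Toy model on synthetic data, standing in for the trained project model.
X, y = make_classification(n_samples=300, n_features=6, weights=[0.887], random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Round-trip through pickle, the persistence format used in models/.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "best_logistic_regression.pkl"
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        loaded = pickle.load(f)

# Call a customer when the predicted subscription probability reaches the threshold.
prob = loaded.predict_proba(X)[:, 1]
should_call = prob >= THRESHOLD

print(f"Contacts flagged at {THRESHOLD}: {int(should_call.sum())}")
print(f"Contacts flagged at default 0.5: {int((loaded.predict(X) == 1).sum())}")
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.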
Decision Tree Importance:
- nr.employed (Employment level): 67.4%
- cons.conf.idx (Consumer confidence): 13.0%
- was_contacted_before: 5.4%
Logistic Regression Coefficients:
- Most Positive: month_mar (+1.07), cons.price.idx (+0.77)
- Most Negative: emp.var.rate (-1.69), month_may (-0.72)
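Rankings like these can be read from the fitted estimators through `feature_importances_` and `coef_`. The following is an illustrative sketch on synthetic data with placeholder feature names, not the project's analysis code. Coefficients are only comparable across features when the inputs are on a common scale, which is why the features are standardized first.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with placeholder feature names.
X, y = make_classification(n_samples=500, n_features=5, n_informative=3, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
X = StandardScaler().fit_transform(X)

# Decision tree: impurity-based importances, normalized to sum to 1.
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
importance = pd.Series(tree.feature_importances_, index=feature_names)
importance = importance.sort_values(ascending=False)

# Logistic regression: signed coefficients on standardized features.
logreg = LogisticRegression(max_iter=1000).fit(X, y)
coefs = pd.Series(logreg.coef_[0], index=feature_names).sort_values()

print("Top tree features:\n", importance.head(3))
print("Most negative coefficient:", coefs.index[0], round(coefs.iloc[0], 2))
print("Most positive coefficient:", coefs.index[-1], round(coefs.iloc[-1], 2))
```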
- Economic Timing Strategy
  - Launch campaigns during stable employment periods
  - Monitor macroeconomic indicators (employment, confidence)
  - Expected impact: 15-20% improvement
- Warm Lead Prioritization
  - Previous contact increases acceptance 10x
  - Implement relationship nurturing programs
- Seasonal Concentration
  - Focus on March (highest positive coefficient)
  - Reduce May activity (negative coefficient despite high volume)
  - Secondary peaks: September, October, December
- Contact Method
  - Cellular > Telephone (telephone shows -0.64 coefficient)
  - Invest in cellular database maintenance
- De-emphasize Demographics
  - Economic context matters more than age/job
  - Simplify targeting logic
Source: UCI Machine Learning Repository - Bank Marketing Dataset
Citation:
Moro, S., Cortez, P., & Rita, P. (2014).
A Data-Driven Approach to Predict the Success of Bank Telemarketing.
Decision Support Systems, 62, 22-31.
Features:
- Client profile: age, job, marital, education, default, housing, loan
- Campaign: contact, month, day_of_week, duration, campaign, pdays, previous, poutcome
- Economic: emp.var.rate, cons.price.idx, cons.conf.idx, euribor3m, nr.employed
- Target: y (term deposit subscription: yes/no)
- Python 3.8+
- Data Processing: pandas, numpy
- Visualization: matplotlib, seaborn
- Machine Learning: scikit-learn
  - DecisionTreeClassifier
  - LogisticRegression
  - GridSearchCV
  - StandardScaler
- Model Persistence: pickle
- Documentation: Jupyter, Markdown
- notebook.ipynb - Complete analysis with code, outputs, and explanations
- report.md - Academic report (~5,500 words)
- EXECUTION_PLAN.md - Detailed methodology and experimental design
- SUBMISSION.md - Assignment submission guide
All 21 visualizations are available in the assets/ folder:
Exploratory Data Analysis (12 images):
- Class distribution, numerical distributions, correlations
- Age analysis, duration leakage demonstration
- Economic indicators, job/month seasonality
- Previous outcome impact, pdays distribution
Model Evaluation (9 images):
- Baseline confusion matrices and ROC curves
- Tuned model performance
- Cost-threshold optimization curves
- Feature importance (DT) and coefficients (LR)
- Final ROC comparison





