A comprehensive feature engineering framework that combines feature synthesis using genetic programming with feature selection using the NSGA-II multi-objective optimization algorithm.
This project provides an automated feature engineering solution that can:
- Synthesize new features using genetic programming with mathematical expressions
- Select optimal feature subsets using NSGA-II multi-objective optimization
- Handle both regression and classification tasks with automatic task detection
- Process CSV datasets with minimal configuration required
- Scale to high-dimensional data with multiprocessing support
- Provide configurable operators for crossover and mutation strategies
The system uses bio-inspired algorithms to discover meaningful feature combinations while balancing model performance against feature sparsity.
- Tree-based representation for mathematical expressions
- Function set: arithmetic (+, -, *, /), trigonometric (sin, cos, tanh), logarithmic (log), exponential (exp), and other operators such as absolute value (abs) and negation (-)
- Crossover operators: subtree, random, and point crossover
- Mutation operators: Subtree replacement, node mutation, parameter mutation, grow mutation
- Configurable depth constraints to control expression complexity
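For intuition, the tree-based expression representation can be sketched as nested operator nodes over feature-column terminals. This is an illustrative sketch only; the `Node` class below is hypothetical and not the project's actual `individual.py`:

```python
import numpy as np

# Minimal sketch of a GP expression tree: internal nodes apply an
# operator to child subtrees; terminals index input feature columns.
class Node:
    def __init__(self, op, children=(), col=None):
        self.op, self.children, self.col = op, children, col

    def evaluate(self, X):
        if self.op == "x":               # terminal: a feature column
            return X[:, self.col]
        args = [c.evaluate(X) for c in self.children]
        if self.op == "+":
            return args[0] + args[1]
        if self.op == "*":
            return args[0] * args[1]
        if self.op == "sin":
            return np.sin(args[0])
        raise ValueError(f"unknown operator {self.op!r}")

    def depth(self):
        # Depth constraints bound this value during evolution.
        return 1 + max((c.depth() for c in self.children), default=0)

# sin(x0) + x1 * x2  -- one candidate synthesized feature
expr = Node("+", [Node("sin", [Node("x", col=0)]),
                  Node("*", [Node("x", col=1), Node("x", col=2)])])
X = np.array([[0.0, 2.0, 3.0]])
print(expr.evaluate(X))   # [6.]  (sin(0) + 2*3)
print(expr.depth())       # 3
```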
- Multi-objective optimization balancing accuracy vs. sparsity/correlation/variance/information gain
- Pareto-optimal solutions providing trade-offs between objectives
- Population-based evolution with dominance ranking and crowding distance
- Cross-validation fitness evaluation for robust performance assessment
- Multiple crossover types: Single-point, two-point, uniform, arithmetic
- Adaptive mutation strategies with configurable rates and block operations
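On binary feature masks, the selection-side crossover operators reduce to simple array recombinations. A minimal sketch of two of them (illustrative only; the project's `crossover.py` may differ, though `swap_prob` mirrors the `uniform_swap_prob` config key):

```python
import numpy as np

rng = np.random.default_rng(42)

def single_point(p1, p2):
    # Cut both parents at one random position and swap the tails.
    cut = rng.integers(1, len(p1))
    return (np.concatenate([p1[:cut], p2[cut:]]),
            np.concatenate([p2[:cut], p1[cut:]]))

def uniform(p1, p2, swap_prob=0.3):
    # Each gene is independently swapped with probability swap_prob.
    swap = rng.random(len(p1)) < swap_prob
    c1, c2 = p1.copy(), p2.copy()
    c1[swap], c2[swap] = p2[swap], p1[swap]
    return c1, c2

p1 = np.array([1, 1, 1, 1, 1, 1])   # all features selected
p2 = np.array([0, 0, 0, 0, 0, 0])   # no features selected
c1, c2 = uniform(p1, p2)
print(c1, c2)   # complementary children mixing both parents
```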
- Dual Enhancement Pipeline: Synthesis → Selection in an integrated workflow
- Automatic Task Detection: Regression/classification based on target analysis
- Multiple ML Model Support: Linear, tree-based, neural networks, SVM, and more
- Rich Configuration System: JSON-based configs with extensive examples
- Parallel Processing: Multiprocessing support for large datasets
- Sklearn-Compatible: Standard transformer interface for easy integration
- Comprehensive Evaluation: Cross-validation, Pareto fronts, feature importance
- Python 3.13+
- pip or uv package manager
# Clone the repository
git clone <repository-url>
cd feature_selection_project
# Install with pip
pip install -e .
# Or using uv (recommended)
uv sync

Core dependencies:
- numpy>=2.3.4 - Numerical computing and array operations
- pandas>=2.3.3 - Data manipulation and CSV handling
- scikit-learn>=1.7.2 - ML models, metrics, and preprocessing
- matplotlib>=3.10.7 - Plotting and visualization
- tqdm>=4.67.1 - Progress bars for long-running operations
# Full enhancement (synthesis + selection) with Ridge regression (default behavior)
uv run main.py --csv-path data/California.csv
# Specify target column by name
uv run main.py --csv-path data/Happy.csv --target "Happiness_Index"
# Use different ML model
uv run main.py --csv-path data/Wine.csv --model rf
# Enable both synthesis and selection with custom parameters
uv run main.py --csv-path data/Happy.csv \
--synthesis-config configs/synthesis_config.json \
--selection-config configs/selection_config.json

# High-performance mode with multiprocessing
uv run main.py --csv-path data/Mnist.csv --use-multiprocessing --n-jobs -1
# Custom test split and scaling
uv run main.py --csv-path data/Diabetes.csv --test-size 0.3 --no-scale
# Quiet mode with specific random seed
uv run main.py --csv-path data/Wine.csv --quiet --random-state 123

{
"population_size": 100,
"generations": 50,
"secondary_objective": "sparsity",
"metric": "accuracy",
"crossover_type": "uniform",
"mutation_type": "adaptive",
"mutation_prob": 0.01,
"uniform_swap_prob": 0.3,
"objective_weights": [0.7, 0.3]
}

Secondary Objectives:
- "sparsity" - Minimize the number of selected features
- "correlation" - Minimize feature correlation
- "variance" - Maximize feature variance
- "information_gain" - Maximize information content
- "mutual_information" - Maximize mutual information
- "redundancy" - Minimize feature redundancy
- "minimum redundancy maximum relevance (mrmr)" - Minimize redundancy and maximize relevance
{
"population_size": 100,
"max_generations": 50,
"max_depth": 6,
"crossover_type": "subtree",
"mutation_type": "parameter",
"mutation_prob": 0.1,
"tournament_size": 3
}

Crossover Types:
- "subtree" - Standard GP subtree exchange (default)
- "random" - Creates new random subtrees
- "point" - Exchanges nodes at matching positions
Mutation Types:
"adaptive"- Starts with subtree mutation and gradually shifts to grow mutation (default)"subtree"- Replaces random subtree"node"- Changes individual nodes"parameter"- Mutates only terminals"grow"- Extends terminals into subtrees"random"- Randomly selects strategy
from feature_enhancer import FeatureEnhancer, DatasetLoader, FeatureSelector, MultiFeatureGA
from sklearn.ensemble import RandomForestRegressor
# Load and preprocess data
X, y = DatasetLoader.load_csv('data/California.csv')
X, y = DatasetLoader.preprocess_dataset(X, y)
# Configure enhancement
enhancer = FeatureEnhancer(
synthesis_config={
"population_size": 50,
"max_generations": 30,
"max_depth": 4,
"crossover_type": "subtree",
"mutation_type": "parameter"
},
selection_config={
"population_size": 100,
"generations": 50,
"secondary_objective": "sparsity",
"crossover_type": "uniform",
"mutation_type": "adaptive"
},
verbose=True
)
# Apply enhancement
model = RandomForestRegressor()
X_enhanced = enhancer.fit_transform(X, y, model)
# Analyze results
feature_info = enhancer.get_feature_info()
pareto_front = enhancer.get_pareto_front()

# Selection only workflow
selector = FeatureSelector(
model=model,
secondary_objective="correlation",
population_size=100,
generations=50
)
X_selected = selector.fit_transform(X, y)
selector.plot_pareto_front() # Visualize trade-offs
# Synthesis only workflow
synthesizer = MultiFeatureGA(
population_size=100,
max_generations=50,
max_depth=6
)
new_features = synthesizer.evolve_multiple_features(X, y, n_features=5)

uv run main.py --csv-path dataset.csv [OPTIONS]
Required Arguments:
--csv-path Path to CSV dataset
Target Configuration:
--target TARGET Target column name or index (default: -1)
Model Selection:
--model MODEL Model choice: auto, linear, logistic, rf, ridge,
lasso, knn, svm, dt, gb, mlp (default: ridge)
Configuration Files:
--synthesis-config FILE Path to synthesis configuration JSON
--selection-config FILE Path to selection configuration JSON
Data Processing:
--no-scale Disable feature scaling
--test-size FLOAT Test set proportion (default: 0.2)
Performance Options:
--use-multiprocessing Enable parallel processing
--n-jobs N Number of processes (-1 for all cores)
Reproducibility:
--random-state INT Random seed (default: 42)
Output Control:
--quiet Reduce output verbosity

feature_selection_project/
├── feature_enhancer/              # Main package
│   ├── __init__.py                # Package exports
│   ├── feature_enhancer.py        # Main FeatureEnhancer class
│   ├── dataset_utils.py           # Data loading and preprocessing
│   ├── utils.py                   # Utility functions
│   ├── feature_selection/         # NSGA-II implementation
│   │   ├── __init__.py
│   │   ├── feature_selector.py    # Main selector class
│   │   ├── nsga2.py               # NSGA-II algorithm
│   │   ├── individual.py          # Individual representation
│   │   ├── fitness.py             # Fitness functions
│   │   ├── crossover.py           # Crossover operators
│   │   └── mutation.py            # Mutation operators
│   └── feature_synthesis/         # Genetic Programming
│       ├── __init__.py
│       ├── feature_synthesis.py   # GP algorithms
│       ├── individual.py          # GP tree representation
│       ├── crossover.py           # GP crossover operators
│       └── mutation.py            # GP mutation operators
├── config_examples/               # Configuration examples and guides
│   ├── synthesis_config_example.json
│   └── selection_config_example.json
├── configs/                       # Pre-configured parameter sets
│   ├── quick/                     # Fast execution configs
│   │   ├── quick_selection_*.json   # Quick selection configs
│   │   └── quick_synthesis_*.json   # Quick synthesis configs
│   ├── medium/                    # Balanced performance configs
│   │   ├── medium_selection_*.json  # Medium selection configs
│   │   └── medium_synthesis_*.json  # Medium synthesis configs
│   └── slow/                      # High-quality, longer-running configs
│       ├── slow_selection_*.json    # Thorough selection configs
│       └── slow_synthesis_*.json    # Thorough synthesis configs
├── data/                          # Example datasets
│   ├── AutoMPG.csv                # Auto MPG regression dataset
│   ├── California.csv             # California housing prices
│   ├── Diabetes.csv               # Diabetes progression dataset
│   ├── Fish.csv                   # Fish weight regression
│   ├── Happy.csv                  # World happiness index
│   └── Wine.csv                   # Wine quality regression
├── comparison_results/            # Algorithm comparison outputs
│   ├── comparison_visualization.png   # Performance comparison plots
│   ├── latest_comparison_results.csv  # Detailed results data
│   └── summary_report.txt         # Analysis summary
├── main.py                        # Command-line interface
├── comparison_analysis.py         # Algorithm comparison tool
├── run_comparison.py              # Automated comparison runner
├── visualize_results.py           # Results visualization utility
├── pyproject.toml                 # Project configuration and dependencies
├── uv.lock                        # Dependency lock file
├── .gitignore                     # Git ignore patterns
├── .python-version                # Python version specification
└── README.md                      # This file
- Population Initialization: Random binary chromosomes representing feature subsets
- Fitness Evaluation: Cross-validated model performance + secondary objective
- Non-Dominated Sorting: Rank solutions by Pareto dominance
- Crowding Distance: Maintain population diversity
- Selection: Tournament selection based on rank and crowding distance
- Reproduction: Apply crossover and mutation operators
- Environmental Selection: Select best individuals for next generation
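The non-dominated sorting and crowding-distance steps above can be sketched compactly. This is a minimal illustrative implementation (simple O(n²) sorting, both objectives minimized), not the project's `nsga2.py`:

```python
import numpy as np

def non_dominated_sort(objs):
    # objs: (n, m) array of objective values, all minimized.
    # Returns Pareto fronts as lists of row indices, best front first.
    n = len(objs)
    dominated_by = [set() for _ in range(n)]
    dom_count = np.zeros(n, dtype=int)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if np.all(objs[i] <= objs[j]) and np.any(objs[i] < objs[j]):
                dominated_by[i].add(j)      # i dominates j
            elif np.all(objs[j] <= objs[i]) and np.any(objs[j] < objs[i]):
                dom_count[i] += 1           # i is dominated by j
    fronts, current = [], [i for i in range(n) if dom_count[i] == 0]
    while current:
        fronts.append(current)
        nxt = []
        for i in current:
            for j in dominated_by[i]:
                dom_count[j] -= 1
                if dom_count[j] == 0:
                    nxt.append(j)
        current = nxt
    return fronts

def crowding_distance(objs):
    # Larger distance = more isolated solution = preferred for diversity;
    # boundary solutions on each objective get infinite distance.
    n, m = objs.shape
    dist = np.zeros(n)
    for k in range(m):
        order = np.argsort(objs[:, k])
        dist[order[0]] = dist[order[-1]] = np.inf
        span = objs[order[-1], k] - objs[order[0], k]
        if span == 0:
            span = 1.0
        for idx in range(1, n - 1):
            dist[order[idx]] += (objs[order[idx + 1], k]
                                 - objs[order[idx - 1], k]) / span
    return dist

# Two minimized objectives: (1 - accuracy, number of selected features)
objs = np.array([[0.10, 8], [0.15, 4], [0.12, 6], [0.30, 9]])
print(non_dominated_sort(objs))   # [[0, 1, 2], [3]]
print(crowding_distance(objs))
```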
- Tree Initialization: Random mathematical expressions within depth constraints
- Fitness Evaluation: Feature usefulness via cross-validated model improvement
- Tournament Selection: Select parents based on fitness
- Tree Crossover: Exchange subtrees between parent expressions
- Tree Mutation: Modify nodes, parameters, or subtrees
- Population Replacement: Generational or steady-state strategies
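The tournament selection step above can be sketched in a few lines (an illustrative sketch; the `tournament_size` config key corresponds to `k` here):

```python
import random

def tournament_select(population, fitnesses, k=3, rng=random):
    # Draw k individuals uniformly at random; the fittest of the
    # sample becomes a parent. Larger k = stronger selection pressure.
    contenders = rng.sample(range(len(population)), k)
    return population[max(contenders, key=lambda i: fitnesses[i])]

random.seed(1)
population = ["expr_a", "expr_b", "expr_c", "expr_d"]
fitnesses  = [0.2, 0.9, 0.5, 0.1]
winners = [tournament_select(population, fitnesses) for _ in range(100)]
print(winners.count("expr_b"))   # the fittest expression wins most tournaments
```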
- Synthesis Phase: Generate N new features using GP
- Combination Phase: Merge original and synthesized features
- Selection Phase: Apply NSGA-II to find optimal feature subset
- Evaluation Phase: Cross-validate final feature set performance
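The four phases compose roughly as follows. This is a schematic sketch using plain scikit-learn, with trivial stand-ins for the GP synthesis output and the NSGA-II selection mask:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)

# 1) Synthesis phase (stand-in): two hand-written expressions in place
#    of GP-evolved features
synthesized = np.column_stack([X[:, 0] * X[:, 1], np.sin(X[:, 2])])

# 2) Combination phase: merge original and synthesized features
X_all = np.hstack([X, synthesized])

# 3) Selection phase (stand-in): a binary mask as NSGA-II would produce
mask = np.array([1, 1, 1, 0, 0, 1, 1], dtype=bool)
X_final = X_all[:, mask]

# 4) Evaluation phase: cross-validate the final feature set
score = cross_val_score(Ridge(), X_final, y, cv=5).mean()
print(round(score, 3))   # mean R^2 across folds
```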