Bio-Inspired Feature Engineering Toolkit

A comprehensive feature engineering framework that combines feature synthesis using genetic programming with feature selection using the NSGA-II multi-objective optimization algorithm.

🎯 Overview

This project provides an automated feature engineering solution that can:

  • Synthesize new features using genetic programming with mathematical expressions
  • Select optimal feature subsets using NSGA-II multi-objective optimization
  • Handle both regression and classification tasks with automatic task detection
  • Process CSV datasets with minimal configuration required
  • Scale to high-dimensional data with multiprocessing support
  • Provide configurable operators for crossover and mutation strategies

The system uses bio-inspired algorithms to discover meaningful feature combinations while balancing model performance against feature sparsity.

🧬 Algorithm Components

Feature Synthesis (Genetic Programming)

  • Tree-based representation for mathematical expressions
  • Function set: Arithmetic (+, -, *, /), trigonometric (sin, cos, tanh), logarithmic (log), exponential (exp), and other operators such as absolute value (abs) and negation (-)
  • Crossover operators: Subtree, random, and point crossover
  • Mutation operators: Subtree replacement, node mutation, parameter mutation, grow mutation
  • Configurable depth constraints to control expression complexity
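
The tree representation described above can be sketched as follows. This is a minimal illustration, not the project's actual `individual.py`; the class and operator names are hypothetical, and protected variants of `div` and `log` are assumed to keep evaluation numerically safe:

```python
import numpy as np

# Minimal GP expression tree: internal nodes hold operators,
# leaves hold a feature index or a constant.
class Node:
    def __init__(self, op=None, children=(), feature=None, const=None):
        self.op, self.children = op, list(children)
        self.feature, self.const = feature, const

    def evaluate(self, X):
        if self.feature is not None:            # terminal: a column of X
            return X[:, self.feature]
        if self.const is not None:              # terminal: a constant
            return np.full(X.shape[0], float(self.const))
        args = [c.evaluate(X) for c in self.children]
        if self.op == "add":
            return args[0] + args[1]
        if self.op == "mul":
            return args[0] * args[1]
        if self.op == "div":                    # protected division: x/0 -> 1.0
            return np.divide(args[0], args[1],
                             out=np.ones_like(args[0]), where=args[1] != 0)
        if self.op == "sin":
            return np.sin(args[0])
        if self.op == "log":                    # protected log of |x|
            return np.log(np.abs(args[0]) + 1e-9)
        raise ValueError(f"unknown operator: {self.op}")

# Encodes the candidate feature sin(x0) + x1 / x2
tree = Node(op="add", children=[
    Node(op="sin", children=[Node(feature=0)]),
    Node(op="div", children=[Node(feature=1), Node(feature=2)]),
])
X = np.array([[0.0, 2.0, 4.0], [np.pi / 2, 3.0, 0.0]])
new_feature = tree.evaluate(X)   # one synthesized value per row
```

Depth constraints then simply bound how deeply such trees may nest during crossover and mutation.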

Feature Selection (NSGA-II)

  • Multi-objective optimization balancing accuracy against a configurable secondary objective (sparsity, correlation, variance, or information gain)
  • Pareto-optimal solutions providing trade-offs between objectives
  • Population-based evolution with dominance ranking and crowding distance
  • Cross-validation fitness evaluation for robust performance assessment
  • Multiple crossover types: Single-point, two-point, uniform, arithmetic
  • Adaptive mutation strategies with configurable rates and block operations
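
The trade-off driving the search is Pareto dominance between candidate subsets. A hypothetical helper, assuming both objectives are minimized (e.g. validation error and feature count):

```python
def dominates(a, b):
    """True if solution a Pareto-dominates b (all objectives minimized):
    a is no worse on every objective and strictly better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

# (validation error, number of selected features)
s1 = (0.12, 5)
s2 = (0.15, 8)
s3 = (0.10, 9)
dominates(s1, s2)  # True: lower error and fewer features
dominates(s1, s3)  # False: s3 has lower error, so neither dominates the other
```

Solutions like s1 and s3, where neither dominates, end up together on the Pareto front the selector returns.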

🚀 Key Features

  • Dual Enhancement Pipeline: Synthesis → Selection in an integrated workflow
  • Automatic Task Detection: Regression/classification based on target analysis
  • Multiple ML Model Support: Linear, tree-based, neural networks, SVM, and more
  • Rich Configuration System: JSON-based configs with extensive examples
  • Parallel Processing: Multiprocessing support for large datasets
  • Sklearn-Compatible: Standard transformer interface for easy integration
  • Comprehensive Evaluation: Cross-validation, Pareto fronts, feature importance

📦 Installation

Prerequisites

  • Python 3.13+
  • pip or uv package manager

Install from Source

# Clone the repository
git clone <repository-url>
cd feature_selection_project

# Install with pip
pip install -e .

# Or using uv (recommended)
uv sync

Dependencies

Core dependencies:

  • numpy>=2.3.4 - Numerical computing and array operations
  • pandas>=2.3.3 - Data manipulation and CSV handling
  • scikit-learn>=1.7.2 - ML models, metrics, and preprocessing
  • matplotlib>=3.10.7 - Plotting and visualization
  • tqdm>=4.67.1 - Progress bars for long-running operations

πŸƒβ€β™‚οΈ Quick Start

Basic Usage

# Full feature enhancement (synthesis + selection) with Ridge regression (default behavior)
uv run main.py --csv-path data/California.csv

# Specify target column by name
uv run main.py --csv-path data/Happy.csv --target "Happiness_Index"

# Use different ML model
uv run main.py --csv-path data/Wine.csv --model rf

# Enable both synthesis and selection with custom parameters
uv run main.py --csv-path data/Happy.csv \
    --synthesis-config configs/synthesis_config.json \
    --selection-config configs/selection_config.json

Advanced Usage

# High-performance mode with multiprocessing
uv run main.py --csv-path data/Mnist.csv --use-multiprocessing --n-jobs -1

# Custom test split and scaling
uv run main.py --csv-path data/Diabetes.csv --test-size 0.3 --no-scale

# Quiet mode with specific random seed
uv run main.py --csv-path data/Wine.csv --quiet --random-state 123

βš™οΈ Configuration

Feature Selection Configuration

{
  "population_size": 100,
  "generations": 50,
  "secondary_objective": "sparsity",
  "metric": "accuracy",
  "crossover_type": "uniform",
  "mutation_type": "adaptive",
  "mutation_prob": 0.01,
  "uniform_swap_prob": 0.3,
  "objective_weights": [0.7, 0.3]
}

Secondary Objectives:

  • "sparsity" - Minimize number of selected features
  • "correlation" - Minimize feature correlation
  • "variance" - Maximize feature variance
  • "information_gain" - Maximize information content
  • "mutual_information" - Maximize mutual information
  • "redundancy" - Minimize feature redundancy
  • "minimum redundancy maximum relevance (mrmr)" - Minimize redundancy and maximize relevance
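
As an illustration of what two of these objectives compute, here are hypothetical sparsity and correlation helpers operating on a binary feature mask (a sketch, not the project's actual `fitness.py`):

```python
import numpy as np

def sparsity_objective(mask):
    """Fraction of features selected; lower means a sparser subset."""
    return mask.sum() / mask.size

def correlation_objective(X, mask):
    """Mean absolute pairwise correlation among selected features;
    lower means less redundancy within the subset."""
    cols = X[:, mask.astype(bool)]
    if cols.shape[1] < 2:
        return 0.0
    corr = np.corrcoef(cols, rowvar=False)
    upper = np.abs(corr[np.triu_indices(corr.shape[1], k=1)])
    return float(upper.mean())

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
mask = np.array([1, 0, 1, 1, 0, 0])   # 3 of 6 features selected
sparsity_objective(mask)              # 0.5
correlation_objective(X, mask)        # small for independent noise columns
```

NSGA-II then minimizes (or maximizes) such a value alongside cross-validated model performance.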

Feature Synthesis Configuration

{
  "population_size": 100,
  "max_generations": 50,
  "max_depth": 6,
  "crossover_type": "subtree",
  "mutation_type": "parameter",
  "mutation_prob": 0.1,
  "tournament_size": 3
}

Crossover Types:

  • "subtree" - Standard GP subtree exchange (default)
  • "random" - Creates new random subtrees
  • "point" - Exchanges nodes at positions

Mutation Types:

  • "adaptive" - Starts with subtree mutation and gradually shifts to grow mutation (default)
  • "subtree" - Replaces random subtree
  • "node" - Changes individual nodes
  • "parameter" - Mutates only terminals
  • "grow" - Extends terminals into subtrees
  • "random" - Randomly selects strategy
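
The adaptive strategy's shift from subtree to grow mutation could be scheduled roughly like this (a sketch under the assumption that the shift tracks generation progress; the actual schedule in `mutation.py` may differ):

```python
import random

def pick_mutation(generation, max_generations, rng=random):
    """Early generations favor disruptive subtree mutation (exploration);
    later generations favor grow mutation (local refinement)."""
    progress = generation / max_generations   # 0.0 at the start, 1.0 at the end
    return "grow" if rng.random() < progress else "subtree"

rng = random.Random(42)
early = [pick_mutation(5, 50, rng) for _ in range(1000)]
late = [pick_mutation(45, 50, rng) for _ in range(1000)]
# early draws are mostly "subtree", late draws mostly "grow"
```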

🔧 Python API

Basic Integration

from feature_enhancer import FeatureEnhancer, DatasetLoader
from sklearn.ensemble import RandomForestRegressor

# Load and preprocess data
X, y = DatasetLoader.load_csv('data/California.csv')
X, y = DatasetLoader.preprocess_dataset(X, y)

# Configure enhancement
enhancer = FeatureEnhancer(
    synthesis_config={
        "population_size": 50,
        "max_generations": 30,
        "max_depth": 4,
        "crossover_type": "subtree",
        "mutation_type": "parameter"
    },
    selection_config={
        "population_size": 100,
        "generations": 50,
        "secondary_objective": "sparsity",
        "crossover_type": "uniform",
        "mutation_type": "adaptive"
    },
    verbose=True
)

# Apply enhancement
model = RandomForestRegressor()
X_enhanced = enhancer.fit_transform(X, y, model)

# Analyze results
feature_info = enhancer.get_feature_info()
pareto_front = enhancer.get_pareto_front()

Advanced API Usage

# Selection-only workflow (FeatureSelector assumed exported at the package top level)
from feature_enhancer import FeatureSelector

selector = FeatureSelector(
    model=model,
    secondary_objective="correlation",
    population_size=100,
    generations=50
)

X_selected = selector.fit_transform(X, y)
selector.plot_pareto_front()  # Visualize trade-offs

# Synthesis-only workflow (MultiFeatureGA assumed exported at the package top level)
from feature_enhancer import MultiFeatureGA

synthesizer = MultiFeatureGA(
    population_size=100,
    max_generations=50,
    max_depth=6
)

new_features = synthesizer.evolve_multiple_features(X, y, n_features=5)

πŸŽ›οΈ Command Line Interface

uv run main.py --csv-path dataset.csv [OPTIONS]

Required Arguments:
  --csv-path                  Path to CSV dataset

Target Configuration:
  --target TARGET       Target column name or index (default: -1)

Model Selection:
  --model MODEL         Model choice: auto, linear, logistic, rf, ridge,
                           lasso, knn, svm, dt, gb, mlp (default: ridge)

Configuration Files:
  --synthesis-config FILE   Path to synthesis configuration JSON
  --selection-config FILE   Path to selection configuration JSON

Data Processing:
  --no-scale               Disable feature scaling
  --test-size FLOAT        Test set proportion (default: 0.2)

Performance Options:
  --use-multiprocessing    Enable parallel processing
  --n-jobs N               Number of processes (-1 for all cores)

Reproducibility:
  --random-state INT       Random seed (default: 42)

Output Control:
  --quiet              Reduce output verbosity

πŸ“ Project Structure

feature_selection_project/
├── feature_enhancer/              # Main package
│   ├── __init__.py               # Package exports
│   ├── feature_enhancer.py       # Main FeatureEnhancer class
│   ├── dataset_utils.py          # Data loading and preprocessing
│   ├── utils.py                  # Utility functions
│   ├── feature_selection/        # NSGA-II implementation
│   │   ├── __init__.py
│   │   ├── feature_selector.py   # Main selector class
│   │   ├── nsga2.py              # NSGA-II algorithm
│   │   ├── individual.py         # Individual representation
│   │   ├── fitness.py            # Fitness functions
│   │   ├── crossover.py          # Crossover operators
│   │   └── mutation.py           # Mutation operators
│   └── feature_synthesis/        # Genetic Programming
│       ├── __init__.py
│       ├── feature_synthesis.py  # GP algorithms
│       ├── individual.py         # GP tree representation
│       ├── crossover.py          # GP crossover operators
│       └── mutation.py           # GP mutation operators
├── config_examples/              # Configuration examples and guides
│   ├── synthesis_config_example.json
│   └── selection_config_example.json
├── configs/                      # Pre-configured parameter sets
│   ├── quick/                    # Fast execution configs
│   │   ├── quick_selection_*.json    # Quick selection configs
│   │   └── quick_synthesis_*.json    # Quick synthesis configs
│   ├── medium/                   # Balanced performance configs
│   │   ├── medium_selection_*.json   # Medium selection configs
│   │   └── medium_synthesis_*.json   # Medium synthesis configs
│   └── slow/                     # High-quality, longer-running configs
│       ├── slow_selection_*.json     # Thorough selection configs
│       └── slow_synthesis_*.json     # Thorough synthesis configs
├── data/                         # Example datasets
│   ├── AutoMPG.csv               # Auto MPG regression dataset
│   ├── California.csv            # California housing prices
│   ├── Diabetes.csv              # Diabetes progression dataset
│   ├── Fish.csv                  # Fish weight regression
│   ├── Happy.csv                 # World happiness index
│   └── Wine.csv                  # Wine quality regression
├── comparison_results/           # Algorithm comparison outputs
│   ├── comparison_visualization.png  # Performance comparison plots
│   ├── latest_comparison_results.csv # Detailed results data
│   └── summary_report.txt        # Analysis summary
├── main.py                       # Command-line interface
├── comparison_analysis.py        # Algorithm comparison tool
├── run_comparison.py             # Automated comparison runner
├── visualize_results.py          # Results visualization utility
├── pyproject.toml                # Project configuration and dependencies
├── uv.lock                       # Dependency lock file
├── .gitignore                    # Git ignore patterns
├── .python-version               # Python version specification
└── README.md                     # This file

🔬 Algorithm Details

NSGA-II Multi-Objective Optimization

  1. Population Initialization: Random binary chromosomes representing feature subsets
  2. Fitness Evaluation: Cross-validated model performance + secondary objective
  3. Non-Dominated Sorting: Rank solutions by Pareto dominance
  4. Crowding Distance: Maintain population diversity
  5. Selection: Tournament selection based on rank and crowding distance
  6. Reproduction: Apply crossover and mutation operators
  7. Environmental Selection: Select best individuals for next generation
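
Steps 3 and 4 can be sketched as follows. This is a simplified illustration with all objectives minimized; the project's `nsga2.py` likely uses the faster bookkeeping from the original NSGA-II formulation:

```python
def dominates(a, b):
    """Pareto dominance under minimization of every objective."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def non_dominated_fronts(objectives):
    """Partition solution indices into Pareto fronts: front 0 is non-dominated,
    front 1 is non-dominated once front 0 is removed, and so on."""
    remaining = set(range(len(objectives)))
    fronts = []
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(objectives[j], objectives[i])
                            for j in remaining if j != i)]
        fronts.append(sorted(front))
        remaining -= set(front)
    return fronts

# (cv error, n_features) for five candidate feature subsets
objs = [(0.10, 9), (0.12, 5), (0.15, 8), (0.11, 9), (0.20, 12)]
non_dominated_fronts(objs)  # [[0, 1], [2, 3], [4]]
```

Within a front, crowding distance then breaks ties by preferring solutions in sparsely populated regions of objective space.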

Genetic Programming Tree Evolution

  1. Tree Initialization: Random mathematical expressions within depth constraints
  2. Fitness Evaluation: Feature usefulness via cross-validated model improvement
  3. Tournament Selection: Select parents based on fitness
  4. Tree Crossover: Exchange subtrees between parent expressions
  5. Tree Mutation: Modify nodes, parameters, or subtrees
  6. Population Replacement: Generational or steady-state strategies
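
Step 3, tournament selection, can be sketched as follows (a hypothetical helper assuming higher fitness is better; `tournament_size` matches the synthesis config key above):

```python
import random

def tournament_select(population, fitnesses, tournament_size=3, rng=random):
    """Sample tournament_size individuals uniformly and return the fittest
    of the sample. Larger tournaments increase selection pressure."""
    contenders = rng.sample(range(len(population)), tournament_size)
    best = max(contenders, key=lambda i: fitnesses[i])
    return population[best]

rng = random.Random(0)
population = ["tree_a", "tree_b", "tree_c", "tree_d"]
fitnesses = [0.2, 0.9, 0.5, 0.1]
winners = [tournament_select(population, fitnesses, 3, rng) for _ in range(100)]
# "tree_b" wins whenever it is sampled; the weakest tree can never win a
# 3-way tournament here, yet selection still leaves room for mid-fitness trees
```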

Integration Workflow

  1. Synthesis Phase: Generate N new features using GP
  2. Combination Phase: Merge original and synthesized features
  3. Selection Phase: Apply NSGA-II to find optimal feature subset
  4. Evaluation Phase: Cross-validate final feature set performance
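
The four phases above can be traced end to end on a toy problem. The sketch below uses plain NumPy/scikit-learn stand-ins for the synthesis and selection results (in practice `FeatureEnhancer.fit_transform` performs phases 1–3 internally):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.sin(X[:, 0]) + X[:, 1] * X[:, 2] + rng.normal(scale=0.1, size=200)

# 1. Synthesis phase (stand-in): suppose GP discovered these two expressions
synthesized = np.column_stack([np.sin(X[:, 0]), X[:, 1] * X[:, 2]])

# 2. Combination phase: merge original and synthesized features
X_all = np.hstack([X, synthesized])

# 3. Selection phase (stand-in): a subset NSGA-II might return
mask = np.array([False, False, False, True, True])
X_final = X_all[:, mask]

# 4. Evaluation phase: cross-validate the final feature set
baseline = cross_val_score(Ridge(), X, y, cv=5).mean()
enhanced = cross_val_score(Ridge(), X_final, y, cv=5).mean()
# the enhanced R^2 should clearly exceed the raw-feature baseline here,
# since the synthesized columns linearize the true nonlinear target
```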