IlyasFardaouix/Sepsis-Detection

Sepsis Early Detection Project

=====================================

Overview


This project implements a machine learning pipeline for early detection of sepsis. It trains gradient boosting models (XGBoost, LightGBM, or CatBoost) on time-series features engineered from vital signs and lab values, and includes handling for imbalanced datasets.

Project Structure


The project directory is organized as follows:

SEPSIS DETECTION PROJECT/
│
├── data/
│   ├── raw/                    # Raw input data
│   └── processed/              # Processed data
│
├── models/                     # Saved models and feature selectors
│
├── src/
│   ├── features/               # Feature engineering modules
│   │   ├── time_series_features.py
│   │   └── feature_selector.py
│   ├── models/                 # ML model implementations
│   │   ├── gradient_boosting_pipeline.py
│   │   └── imbalanced_handler.py
│   ├── utils/                  # Utility functions
│   │   ├── config_loader.py
│   │   └── logger.py
│   └── data_loader.py          # Data loading utilities
│
├── notebooks/                  # Jupyter notebooks for exploration
│
├── configs/                    # Configuration files
│   └── config.yaml
│
├── results/                    # Evaluation results and plots
│
├── logs/                       # Log files
│
├── train.py                    # Training script
├── evaluate.py                 # Evaluation script
├── inference.py                # Inference script
├── generate_sample_data.py     # Sample data generator
├── requirements.txt            # Python dependencies
└── README.md                   # This file

Quick Start


Installation

# Clone or navigate to the project directory
cd "SEPSIS DETECTION PROJECT"

# Install dependencies
pip install -r requirements.txt

Generate Sample Data

If you don't have your own data, you can generate sample data for testing:

python generate_sample_data.py

This will create a sample dataset in data/raw/sepsis_data.csv with:

  • 1000 patients
  • ~15% sepsis rate (imbalanced dataset)
  • Time-series vital signs and lab values
  • Realistic clinical patterns

Train the Model

python train.py

The training script will:

  • Load and preprocess the data
  • Perform time-series feature engineering
  • Handle imbalanced data using SMOTE (or other methods)
  • Train a Gradient Boosting model (XGBoost by default)
  • Evaluate on the test set
  • Save the model and feature importance plots
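The steps above do not say how the test set is held out. With many hourly rows per patient, one common choice is to split by patient_id so that no patient contributes rows to both sets, and to apply resampling such as SMOTE to the training portion only. The sketch below illustrates the split; the function name `split_by_patient` and the 80/20 ratio are assumptions for illustration, not taken from train.py.

```python
import random

def split_by_patient(rows, test_fraction=0.2, seed=42):
    """Split row dicts into train/test so each patient_id lands in one set only."""
    patient_ids = sorted({row["patient_id"] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(patient_ids)
    n_test = max(1, round(len(patient_ids) * test_fraction))
    test_ids = set(patient_ids[:n_test])
    train = [row for row in rows if row["patient_id"] not in test_ids]
    test = [row for row in rows if row["patient_id"] in test_ids]
    return train, test

# Ten hypothetical patients with three hourly rows each.
rows = [{"patient_id": p, "hour": h} for p in range(10) for h in range(3)]
train_rows, test_rows = split_by_patient(rows)
```

Splitting on rows instead of patients would let near-identical consecutive measurements from one patient appear on both sides, which tends to inflate test metrics.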

Evaluate the Model

python evaluate.py

This will:

  • Load the trained model
  • Evaluate on the full dataset
  • Generate ROC curves, PR curves, and confusion matrices
  • Save evaluation metrics

Make Predictions

python inference.py --data data/raw/sepsis_data.csv --output results/predictions.csv
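For readers who want to adapt the command line, the two flags shown above could be parsed roughly as follows. This is a sketch only: the flag names come from the command above, but marking both as required and the help texts are assumptions, and inference.py may define further options.

```python
import argparse

def build_parser():
    """Parser for the two flags used in the inference command (assumed required)."""
    parser = argparse.ArgumentParser(description="Score patients with a trained model")
    parser.add_argument("--data", required=True, help="Path to the input CSV")
    parser.add_argument("--output", required=True, help="Path for the predictions CSV")
    return parser

# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args(
    ["--data", "data/raw/sepsis_data.csv", "--output", "results/predictions.csv"]
)
```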

Configuration


All configuration is managed through configs/config.yaml. Key settings include:

Model Configuration

  • Algorithm: Choose between xgboost, lightgbm, or catboost
  • Hyperparameters: Learning rate, max depth, n_estimators, etc.
  • Early stopping: Configure early stopping rounds
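To make the early stopping setting concrete, the sketch below shows the usual rule: training halts once the validation loss has failed to improve for a given number of consecutive rounds, and the best round so far is kept. The function name and the loss values are hypothetical; the boosting libraries implement this internally.

```python
def best_round_with_early_stopping(val_losses, patience):
    """Return (best_round, stopped_round) for a sequence of validation losses.

    Training stops once `patience` consecutive rounds fail to beat the best loss.
    """
    best_loss = float("inf")
    best_round = -1
    for round_idx, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            best_round = round_idx
        elif round_idx - best_round >= patience:
            return best_round, round_idx
    return best_round, len(val_losses) - 1

# Hypothetical validation losses, one per boosting round.
losses = [0.70, 0.62, 0.58, 0.57, 0.59, 0.58, 0.60, 0.61]
best_round, stopped_round = best_round_with_early_stopping(losses, patience=3)
```

With a small learning rate and a large n_estimators, as in the example configuration, early stopping is what keeps the effective number of trees in check.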

Feature Engineering

  • Time window: Window size for rolling features (default: 6 hours)
  • Lookback hours: How far back to look for features (default: 24 hours)
  • Feature selection: Enable/disable and set max features

Imbalanced Learning

  • Method: Choose from smote, adasyn, smoteenn, class_weight, or none
  • Sampling strategy: Control the resampling ratio
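For the class_weight method no rows are resampled; each class instead receives a weight in the loss. A minimal sketch of the widely used "balanced" heuristic, weight = n_samples / (n_classes * class_count), is shown below. Whether ImbalancedDataHandler uses exactly this formula is an assumption.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency ("balanced" heuristic)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# 15% positive rate, matching the sample data described earlier.
labels = [0] * 85 + [1] * 15
weights = balanced_class_weights(labels)
```

The ratio of the two weights equals the inverse ratio of the class counts, so each class contributes equally to the total weighted loss.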

Example Configuration

model:
  algorithm: "xgboost"
  learning_rate: 0.01
  max_depth: 6
  n_estimators: 1000

features:
  time_window_hours: 6
  lookback_hours: 24
  feature_selection: true
  max_features: 100

imbalanced_learning:
  method: "smote"
  k_neighbors: 5
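A configuration like the one above can be sanity-checked before training. The sketch below validates a dictionary with the same shape against the option values listed in this section; `validate_config` is a hypothetical helper, not part of src/utils/config_loader.py, and the rule that the rolling window should not exceed the lookback period is an assumption.

```python
ALLOWED_ALGORITHMS = {"xgboost", "lightgbm", "catboost"}
ALLOWED_IMBALANCE_METHODS = {"smote", "adasyn", "smoteenn", "class_weight", "none"}

def validate_config(config):
    """Return a list of problems; an empty list means no issue was found."""
    problems = []
    algorithm = config.get("model", {}).get("algorithm")
    if algorithm not in ALLOWED_ALGORITHMS:
        problems.append(f"unknown algorithm: {algorithm!r}")
    method = config.get("imbalanced_learning", {}).get("method")
    if method not in ALLOWED_IMBALANCE_METHODS:
        problems.append(f"unknown imbalanced_learning method: {method!r}")
    features = config.get("features", {})
    window = features.get("time_window_hours", 6)   # documented default
    lookback = features.get("lookback_hours", 24)   # documented default
    if window > lookback:  # assumed constraint, not stated in the README
        problems.append("time_window_hours exceeds lookback_hours")
    return problems

# Same values as the example configuration, written as a Python dict.
config = {
    "model": {"algorithm": "xgboost", "learning_rate": 0.01,
              "max_depth": 6, "n_estimators": 1000},
    "features": {"time_window_hours": 6, "lookback_hours": 24,
                 "feature_selection": True, "max_features": 100},
    "imbalanced_learning": {"method": "smote", "k_neighbors": 5},
}
problems = validate_config(config)
```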

Features


Time-Series Feature Engineering

The pipeline extracts comprehensive temporal features:

  1. Rolling Statistics

    • Mean, std, min, max, median
    • Percentiles (25th, 75th)
    • Coefficient of variation
  2. Trend Features

    • Linear trend slope
    • R-squared of trend
  3. Change Features

    • Absolute changes (1h, 3h, 6h)
    • Percentage changes
    • Rate of change
  4. Statistical Features

    • Recent statistics over lookback window
    • Skewness and kurtosis
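The feature families listed above can be illustrated on a single vital sign. The following is a dependency-free sketch, assuming one reading per hour; the function names are hypothetical and the actual implementation in src/features/time_series_features.py may differ.

```python
import statistics

def rolling_stats(values, window):
    """Mean, std, and coefficient of variation over the last `window` values."""
    recent = values[-window:]
    mean = statistics.fmean(recent)
    std = statistics.stdev(recent) if len(recent) > 1 else 0.0
    cv = std / mean if mean != 0 else 0.0
    return {"mean": mean, "std": std, "cv": cv}

def linear_trend(values):
    """Least-squares slope per step and R-squared for equally spaced values."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = statistics.fmean(values)
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values))
    syy = sum((y - y_mean) ** 2 for y in values)
    slope = sxy / sxx if sxx else 0.0
    r_squared = (sxy ** 2) / (sxx * syy) if sxx and syy else 0.0
    return slope, r_squared

def change_features(values, lag):
    """Absolute and percentage change versus the value `lag` steps earlier."""
    if len(values) <= lag:
        return {"abs_change": 0.0, "pct_change": 0.0}
    previous, current = values[-1 - lag], values[-1]
    abs_change = current - previous
    pct_change = abs_change / previous * 100 if previous != 0 else 0.0
    return {"abs_change": abs_change, "pct_change": pct_change}

# Hypothetical hourly heart-rate readings for one patient.
heart_rate = [72, 74, 75, 79, 84, 90, 97]
window_stats = rolling_stats(heart_rate, window=6)
slope, r_squared = linear_trend(heart_rate[-6:])
change_3h = change_features(heart_rate, lag=3)
```

A steadily rising heart rate shows up as a positive slope with R-squared close to 1, while a noisy but flat signal yields a slope near zero and a high coefficient of variation.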

Imbalanced Dataset Handling

Multiple techniques are supported:

  • SMOTE: Synthetic Minority Oversampling Technique
  • ADASYN: Adaptive Synthetic Sampling
  • SMOTEENN: SMOTE + Edited Nearest Neighbours
  • Class Weights: Automatic class weight calculation
  • None: Train without resampling
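The core idea behind SMOTE can be shown in a few lines: pick a minority sample, pick one of its nearest minority neighbours, and create a new point somewhere on the segment between them. The sketch below is a simplified illustration under that description, not the imbalanced-learn implementation; `smote_oversample` and the sample values are hypothetical.

```python
import math
import random

def smote_oversample(minority, n_new, k_neighbors=5, seed=0):
    """Create `n_new` synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by Euclidean distance to the base point.
        others = [point for point in minority if point is not base]
        others.sort(key=lambda point: math.dist(base, point))
        neighbour = rng.choice(others[:k_neighbors])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Hypothetical septic samples as [temperature, heart_rate].
minority = [[38.9, 112.0], [39.2, 118.0], [38.5, 105.0], [39.0, 121.0]]
new_points = smote_oversample(minority, n_new=6, k_neighbors=2)
```

Because every synthetic point is a convex combination of two real minority samples, it always lies within the range spanned by the existing minority data.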

Model Algorithms

Three gradient boosting libraries are supported:

  1. XGBoost: Extreme Gradient Boosting
  2. LightGBM: Light Gradient Boosting Machine
  3. CatBoost: Categorical Boosting
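All three libraries build on the same principle: trees are added one at a time, each fitted to the errors of the ensemble so far and scaled by the learning rate. The toy sketch below shows that loop with one-split trees (stumps) on a single feature and squared-error loss. It is a teaching example only; the real libraries use deeper trees, regularisation, and different loss handling, and the lactate values are made up.

```python
def fit_stump(xs, residuals):
    """Find the threshold on one feature that best splits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]

def fit_boosted_stumps(xs, ys, n_estimators=50, learning_rate=0.1):
    """Gradient boosting with stumps under squared-error loss."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_estimators):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        predictions = [
            p + learning_rate * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, predictions)
        ]
    return {"base": base, "learning_rate": learning_rate, "stumps": stumps}

def predict_score(model, x):
    """Sum the base value and the scaled contribution of every stump."""
    score = model["base"]
    for threshold, left_value, right_value in model["stumps"]:
        score += model["learning_rate"] * (left_value if x <= threshold else right_value)
    return score

# Hypothetical lactate values with sepsis labels.
lactate = [0.8, 1.0, 1.2, 1.5, 2.8, 3.5, 4.1, 5.0]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
model = fit_boosted_stumps(lactate, labels)
```

A lower learning rate shrinks each stump's contribution, which is why small values such as 0.01 are usually paired with a large n_estimators.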

Evaluation Metrics


The pipeline evaluates models using:

  • ROC-AUC: Area under the ROC curve
  • Average Precision: Summary of the precision-recall (PR) curve, approximating the area under it
  • Precision: Positive predictive value
  • Recall: Sensitivity
  • F1-Score: Harmonic mean of precision and recall
  • Confusion Matrix: Detailed classification breakdown
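To show how these metrics relate to each other, the sketch below computes them from scratch on a handful of hypothetical scores. In practice scikit-learn provides equivalent functions; the 0.5 decision threshold used here is an assumption.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    return tn, fp, fn, tp

def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and their harmonic mean."""
    _, fp, fn, tp = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1

def roc_auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives for n in negatives
    )
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and predicted sepsis probabilities.
y_true = [0, 0, 0, 0, 1, 1, 0, 1]
scores = [0.1, 0.3, 0.2, 0.6, 0.8, 0.4, 0.05, 0.9]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
auc = roc_auc(y_true, scores)
```

Precision, recall, and F1 depend on the chosen threshold, whereas ROC-AUC and average precision are computed from the ranking of scores alone. On imbalanced data such as sepsis labels, the PR-based metrics are usually the more informative of the two.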

Data Format


Your input data should be a CSV file with the following columns:

Required Columns

  • patient_id: Unique patient identifier
  • time: Timestamp for each measurement
  • sepsis_label: Binary target (0 = no sepsis, 1 = sepsis)

Feature Columns

Any numeric columns will be used as features. Common clinical features include:

  • Vital Signs: temperature, heart_rate, respiratory_rate, systolic_bp, oxygen_saturation
  • Lab Values: wbc_count, lactate, creatinine, bilirubin
  • Demographics: age, gender
  • Clinical: icu_admission, hours_since_admission

Example Data Format

patient_id,time,temperature,heart_rate,respiratory_rate,systolic_bp,oxygen_saturation,wbc_count,lactate,creatinine,bilirubin,age,gender,icu_admission,sepsis_label
0,2023-01-01 00:00:00,36.8,75,16,120,98,7.0,1.0,0.9,0.8,45,1,0,0
0,2023-01-01 01:00:00,37.2,78,17,118,97,7.2,1.1,0.9,0.8,45,1,0,0
...
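Before training on your own file, it can help to confirm that it matches this format. The sketch below checks the three required columns and the label values using only the standard library; `check_sepsis_csv` is a hypothetical helper and is not part of src/data_loader.py.

```python
import csv
import io

REQUIRED_COLUMNS = {"patient_id", "time", "sepsis_label"}

def check_sepsis_csv(handle):
    """Validate required columns and binary labels; return the number of data rows."""
    reader = csv.DictReader(handle)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")
    n_rows = 0
    # Data rows start on line 2 because line 1 is the header.
    for line_number, row in enumerate(reader, start=2):
        if row["sepsis_label"] not in {"0", "1"}:
            raise ValueError(f"line {line_number}: sepsis_label must be 0 or 1")
        n_rows += 1
    return n_rows

# A shortened version of the example rows, held in memory instead of a file.
sample = io.StringIO(
    "patient_id,time,temperature,heart_rate,sepsis_label\n"
    "0,2023-01-01 00:00:00,36.8,75,0\n"
    "0,2023-01-01 01:00:00,37.2,78,0\n"
)
n_rows = check_sepsis_csv(sample)
```

For a file on disk, pass an open handle instead, for example `open("data/raw/sepsis_data.csv", newline="")`.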

Advanced Usage


Custom Feature Engineering

The feature engineering logic lives in src/features/time_series_features.py and can be extended there. The engineer can also be called directly:

from src.features import TimeSeriesFeatureEngineer

engineer = TimeSeriesFeatureEngineer(
    time_window_hours=6,
    lookback_hours=24
)

features = engineer.create_features(df)

Custom Model Training

from src.models import GradientBoostingPipeline

pipeline = GradientBoostingPipeline(
    algorithm='xgboost',
    model_params={
        'max_depth': 8,
        'learning_rate': 0.01,
        'n_estimators': 2000
    }
)

pipeline.fit(X_train, y_train)
predictions = pipeline.predict(X_test)

Handling Imbalanced Data

from src.models import ImbalancedDataHandler

handler = ImbalancedDataHandler(
    method='smote',
    k_neighbors=5
)

X_resampled, y_resampled = handler.fit_resample(X_train, y_train)

Results


After training and evaluation, results are saved in the results/ directory:

  • test_metrics.csv: Performance metrics on test set
  • evaluation_metrics.csv: Full dataset evaluation metrics
  • feature_importance.png: Top 20 most important features
  • confusion_matrix.png: Confusion matrix visualization
  • roc_curve.png: ROC curve plot
  • pr_curve.png: Precision-Recall curve plot

Testing with Sample Data


The included generate_sample_data.py creates realistic synthetic data:

python generate_sample_data.py

Parameters can be adjusted:

  • n_patients: Number of patients (default: 1000)
  • sepsis_rate: Proportion with sepsis (default: 0.15)
  • hours_per_patient: Average hours of data (default: 48)
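The role of the three parameters can be illustrated with a much smaller generator. The sketch below is not the code in generate_sample_data.py: the function name, the two simulated columns, the drift constants, and the choice to label every row of a septic patient as 1 are all assumptions made for illustration.

```python
import random

def generate_sample_rows(n_patients=1000, sepsis_rate=0.15, hours_per_patient=48, seed=0):
    """Generate hourly rows per patient; septic patients drift towards abnormal values."""
    rng = random.Random(seed)
    rows = []
    for patient_id in range(n_patients):
        septic = rng.random() < sepsis_rate
        # Stay length varies around the requested average (20% spread is arbitrary).
        n_hours = max(1, round(rng.gauss(hours_per_patient, hours_per_patient * 0.2)))
        for hour in range(n_hours):
            # Fraction of the stay elapsed; drives the deterioration of septic patients.
            progress = hour / n_hours if septic else 0.0
            rows.append({
                "patient_id": patient_id,
                "hour": hour,
                "heart_rate": round(rng.gauss(78 + 30 * progress, 6), 1),
                "lactate": round(max(0.3, rng.gauss(1.2 + 2.5 * progress, 0.3)), 2),
                "sepsis_label": int(septic),
            })
    return rows

rows = generate_sample_rows(n_patients=200, seed=1)
```

Fixing the seed makes the dataset reproducible, which is useful when comparing imbalance-handling methods on the same synthetic cohort.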

Dependencies


Key dependencies:

  • numpy: Numerical computing
  • pandas: Data manipulation
  • scikit-learn: Machine learning utilities
  • xgboost: XGBoost algorithm
  • lightgbm: LightGBM algorithm
  • catboost: CatBoost algorithm
  • imbalanced-learn: Imbalanced dataset handling
  • matplotlib/seaborn: Visualization
  • scipy: Scientific computing

See requirements.txt for the complete list.

Model Performance


The pipeline is

About

Early sepsis risk prediction pipeline using machine learning on ICU clinical data.
