SLICES (beta): SSL Framework for ICU Embeddings

Tags: ICU · Self-Supervised Learning · PyTorch · Time Series · Benchmark · Healthcare AI · DuckDB · Polars · Lightning

SLICES (Self-Supervised Learning for Intensive Care Embeddings System) is a modular framework for learning patient embeddings from ICU time series with SSL and evaluating them on clinical tasks. Clinical researchers can run the benchmarks as-is, plug in their own encoders or SSL objectives via clear interfaces, and work entirely on local files; no cloud services or API keys are required.

Overview

Composable pipeline stages:

  1. Data conversion (optional): CSV.gz → Parquet
  2. Feature extraction: MIMIC-IV Parquet → stay-level Parquet (static, timeseries, labels) via DuckDB
  3. Preprocessing: Raw events → hourly-binned dense tensors with observation masks
  4. SSL pretraining: Unlabeled data; config-driven encoders (Transformer, Linear, SMART) and objectives (MAE, SMART)
  5. Downstream evaluation: Fine-tune on configurable tasks (mortality, phenotyping). New encoders or tasks only require implementing the relevant base class and registering it (see the sketch below); Hydra configs keep runs reproducible.
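
For example, adding a new encoder might look roughly like the following sketch. BaseEncoder and the registration mechanism are assumptions based on src/slices/models/encoders/base.py and factory.py; check those files for the real signatures.

import torch.nn as nn

from slices.models.encoders.base import BaseEncoder  # assumed class name (see base.py)

class GRUEncoder(BaseEncoder):
    """Hypothetical GRU backbone: maps (B, T, D) inputs to (B, T, H) embeddings."""

    def __init__(self, input_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.gru = nn.GRU(input_dim, hidden_dim, batch_first=True)

    def forward(self, x, mask=None):
        out, _ = self.gru(x)  # (B, T, hidden_dim)
        return out

The new class is then registered in the encoder factory (src/slices/models/encoders/factory.py) and selected via a Hydra config under configs/model/.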

Installation

This project uses the src layout (a Python packaging best practice) and the uv package manager for dependency management.

Prerequisites

  • Python 3.12+
  • uv package manager

Setup

  1. Clone the repository:

git clone <repository-url>
cd SLICES

  2. Install the package in editable mode with development dependencies:

uv sync --dev

This will:

  • Create a virtual environment
  • Install all dependencies
  • Install the package in editable mode (-e)

The commands in the rest of this README assume this environment is active (e.g., via source .venv/bin/activate) or can be prefixed with uv run.

  3. Verify installation:

python -c "from slices.data.extractors.base import BaseExtractor; print('Import successful!')"

Quick Start

SLICES supports two starting points: CSV files or Parquet files.

Option A: Starting with CSV Files (Two-Step Process)

If you have MIMIC-IV in CSV.gz format from PhysioNet:

Step 1: Convert CSV to Parquet

python scripts/convert_csv_to_parquet.py \
    data.csv_root=/path/to/mimic-iv-3.0 \
    data.parquet_root=/path/to/mimic-iv-parquet

Step 2: Extract Features

python scripts/extract_mimic_iv.py \
    data.parquet_root=/path/to/mimic-iv-parquet

Convenience shortcut (runs both steps):

python scripts/setup_mimic_iv.py data.csv_root=/path/to/mimic-iv-3.0

Option B: Starting with Parquet Files (Direct Extraction)

If you already have MIMIC-IV in Parquet format:

python scripts/extract_mimic_iv.py \
    data.parquet_root=/path/to/mimic-iv-parquet

Next Steps

Pretrain SSL Model

python scripts/pretrain.py data.parquet_root=/path/to/mimic-iv-parquet
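
For orientation, the MAE objective from pipeline stage 4 amounts to hiding a random subset of observed values, encoding the corrupted sequence, and reconstructing only what was hidden. Below is a minimal illustrative sketch; the real objective lives in src/slices/models/pretraining/mae.py, and the tensor names and masking ratio here are assumptions, not the project's API.

import torch

def mae_style_loss(encoder, decoder, x, obs_mask, mask_ratio=0.5):
    """Illustrative MAE-style loss, not the project's implementation.

    x:        (B, T, D) hourly-binned values
    obs_mask: (B, T, D) True where a value was actually observed
    """
    # Randomly pick observed entries to hide from the encoder.
    hide = (torch.rand_like(x) < mask_ratio) & obs_mask
    x_corrupt = x.masked_fill(hide, 0.0)

    z = encoder(x_corrupt)  # (B, T, H) latent sequence
    x_hat = decoder(z)      # (B, T, D) reconstruction

    # Score the reconstruction only on the hidden entries.
    return ((x_hat - x) ** 2)[hide].mean()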

Fine-tune on Downstream Task

python scripts/finetune.py checkpoint=outputs/encoder.pt

Project Structure

slices/                          # Repository root
├── src/
│   └── slices/                  # Main package (src layout)
│       ├── data/
│       │   ├── extractors/       # Dataset-specific extraction
│       │   │   └── base.py       # Abstract base extractor
│       │   ├── concepts/         # Concept dictionary YAMLs
│       │   ├── dataset.py        # PyTorch Dataset
│       │   ├── datamodule.py     # Lightning DataModule
│       │   └── transforms.py     # SSL augmentations
│       ├── models/
│       │   ├── encoders/         # Backbone architectures
│       │   │   ├── base.py       # Abstract base encoder
│       │   │   ├── factory.py    # Encoder factory
│       │   │   └── transformer.py # Transformer encoder
│       │   ├── pretraining/      # SSL objectives
│       │   │   ├── base.py       # Abstract SSL objective
│       │   │   ├── factory.py    # SSL objective factory
│       │   │   └── mae.py        # MAE objective
│       │   └── heads/            # Task heads (for finetuning)
│       │       ├── base.py       # Abstract BaseTaskHead
│       │       ├── factory.py    # Task head factory
│       │       └── mlp.py        # MLP and Linear task heads
│       └── training/              # Training utilities and Lightning modules
│           ├── pretrain_module.py # SSLPretrainModule
│           ├── finetune_module.py # FineTuneModule
│           └── utils.py
├── configs/                      # Hydra configs (outside src/)
│   ├── config.yaml               # Main config
│   ├── data/
│   │   └── mimic_iv.yaml
│   ├── model/
│   │   └── transformer.yaml
│   └── concepts/
│       └── core_features.yaml    # Concept dictionary
├── scripts/                       # Entry points (outside src/)
│   ├── convert_csv_to_parquet.py # Convert CSV.gz to Parquet
│   ├── setup_mimic_iv.py         # Convenience: convert + extract
│   ├── extract_mimic_iv.py       # Extract features from Parquet
│   ├── pretrain.py
│   └── finetune.py
└── tests/                         # Tests (outside src/)

Data Format

Extracted ICU stays are stored as separate Parquet files in the output directory:

  • static.parquet - Stay-level metadata (demographics, admission info)
    • Columns: stay_id, patient_id, age, gender, race, admission_type, los_days, etc.
  • timeseries.parquet - Dense hourly-binned time-series with masks
    • Columns: stay_id, timeseries (nested array, shape T×D), mask (nested array, shape T×D)
  • labels.parquet - Task labels
    • Columns: stay_id, plus one column per task (e.g., mortality_24h, mortality_48h)
  • metadata.yaml - Feature names, sequence length, task names, etc.
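
A quick way to sanity-check the extracted files is to open them with Polars (any Parquet reader works). A minimal sketch; the paths assume the default output_dir from the configuration section below:

import polars as pl

# Stay-level tables written by the extraction step.
static = pl.read_parquet("data/processed/mimic-iv-demo/static.parquet")
labels = pl.read_parquet("data/processed/mimic-iv-demo/labels.parquet")

print(static.select(["stay_id", "age", "gender"]).head())
print(labels.columns)  # stay_id plus one column per task, e.g. mortality_24h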

The ICUDataset loads these files and returns dictionaries with:

  • timeseries: FloatTensor of shape (seq_length, n_features)
  • mask: BoolTensor of shape (seq_length, n_features) - True = observed, False = missing/imputed
  • static: Dict with static features (age, gender, etc.)
  • label: FloatTensor with task label (if task_name specified)
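
Putting it together, iterating over batches might look like this. The ICUDataset constructor arguments below are assumptions; check src/slices/data/dataset.py for the actual signature.

from torch.utils.data import DataLoader

from slices.data.dataset import ICUDataset

# Constructor arguments are illustrative only.
dataset = ICUDataset("data/processed/mimic-iv-demo", task_name="mortality_24h")
loader = DataLoader(dataset, batch_size=32, shuffle=True)

batch = next(iter(loader))
print(batch["timeseries"].shape)  # (32, seq_length, n_features)
print(batch["mask"].dtype)        # torch.bool; True = observed
print(batch["label"].shape)       # one label per stay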

Configuration

SLICES uses Hydra for configuration management. All configs are in the configs/ directory.
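
Each entry-point script is a small Hydra application along these lines (a sketch of the pattern, not the actual contents of scripts/pretrain.py):

import hydra
from omegaconf import DictConfig

@hydra.main(config_path="../configs", config_name="config", version_base=None)
def main(cfg: DictConfig) -> None:
    # Every field is overridable from the CLI, e.g. data.parquet_root=/path/to/parquet
    print(cfg.data.parquet_root)

if __name__ == "__main__":
    main()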

Data Paths Configuration

Edit configs/data/mimic_iv.yaml or override via command line:

# Optional: Path to raw CSV.gz files (if starting from CSVs)
csv_root: null  # e.g., /data/mimic-iv-3.0

# Required: Path to Parquet files (used by extraction)
parquet_root: data/parquet/mimic-iv-demo

# Output for extracted features
output_dir: data/processed/mimic-iv-demo

Command-line overrides:

# Convert CSVs
python scripts/convert_csv_to_parquet.py \
    data.csv_root=/path/to/csv \
    data.parquet_root=/path/to/parquet

# Extract from Parquet
python scripts/extract_mimic_iv.py \
    data.parquet_root=/path/to/parquet

Expected Directory Structures

CSV format (from PhysioNet):

mimic-iv-3.0/
├── hosp/
│   ├── patients.csv.gz
│   ├── admissions.csv.gz
│   └── labevents.csv.gz
├── icu/
│   ├── icustays.csv.gz
│   ├── chartevents.csv.gz
│   └── inputevents.csv.gz
└── ...

Parquet format (after conversion or if pre-converted):

mimic-iv-parquet/
├── hosp/
│   ├── patients.parquet
│   ├── admissions.parquet
│   └── labevents.parquet
├── icu/
│   ├── icustays.parquet
│   ├── chartevents.parquet
│   └── inputevents.parquet
└── ...

Environment Variables for Conversion

Tune CSV-to-Parquet conversion performance with these environment variables:

# Maximum parallel workers (default: 4)
export SLICES_CONVERT_MAX_WORKERS=8

# DuckDB memory limit per worker (default: 3GB)
export SLICES_DUCKDB_MEM=4GB

# DuckDB threads per worker (default: 2)
export SLICES_DUCKDB_THREADS=4

Development

Running Tests

pytest tests/

Code Quality

# Format code
black src/ scripts/ tests/

# Lint
ruff check src/ scripts/ tests/

# Type check
mypy src/

References

  • MIMIC-IV: Johnson, A. E. W., et al. (2023). MIMIC-IV, a freely accessible electronic health record dataset. Scientific Data.
  • ricu: Bennett, N., et al. (2023). ricu: R's interface to intensive care data. GigaScience.
  • YAIB: van de Water, R., et al. (2024). Yet Another ICU Benchmark: A Flexible Multi-Center Framework for Clinical ML. ICLR.
  • ICareFM: [Preprint] Self-supervised learning for ICU time-series.

License

See LICENSE file for details.

Contributing

This framework was developed as part of a master's thesis. Contributions are welcome via issues and pull requests.
