Tags: ICU · Self-Supervised Learning · PyTorch · Time Series · Benchmark · Healthcare AI · DuckDB · Polars · Lightning
SLICES is a modular framework for learning patient embeddings from ICU time-series with SSL and evaluating them on clinical tasks. Clinical researchers can run benchmarks as-is, plug in their own encoders or SSL objectives via clear interfaces, and work entirely on local files—no cloud or API keys.
Composable pipeline stages:
- Data conversion (optional): CSV.gz → Parquet
- Feature extraction: MIMIC-IV Parquet → stay-level Parquet (static, timeseries, labels) via DuckDB
- Preprocessing: Raw events → hourly-binned dense tensors with observation masks
- SSL pretraining: Unlabeled data; config-driven encoders (Transformer, Linear, SMART) and objectives (MAE, SMART)
- Downstream evaluation: Fine-tune on configurable tasks (mortality, phenotyping). New encoders or tasks = implement base class + register; Hydra configs keep runs reproducible.
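The "implement base class + register" extension pattern can be sketched as follows. `BaseEncoder`, `ENCODER_REGISTRY`, and `register_encoder` are illustrative stand-ins here, not the actual interfaces under `src/slices/models/encoders/`:

```python
# Sketch of the base-class-plus-registry pattern (stand-in names; the real
# base class and factory live in src/slices/models/encoders/).
from abc import ABC, abstractmethod

ENCODER_REGISTRY = {}

def register_encoder(name):
    """Class decorator that records an encoder under a config-friendly name."""
    def deco(cls):
        ENCODER_REGISTRY[name] = cls
        return cls
    return deco

class BaseEncoder(ABC):
    @abstractmethod
    def encode(self, timeseries, mask):
        """Map a (T, D) series plus observation mask to an embedding."""

@register_encoder("mean_pool")
class MeanPoolEncoder(BaseEncoder):
    def encode(self, timeseries, mask):
        # Average observed values per feature, ignoring masked-out entries.
        T, D = len(timeseries), len(timeseries[0])
        out = []
        for d in range(D):
            vals = [timeseries[t][d] for t in range(T) if mask[t][d]]
            out.append(sum(vals) / len(vals) if vals else 0.0)
        return out

encoder = ENCODER_REGISTRY["mean_pool"]()
emb = encoder.encode([[1.0, 2.0], [3.0, 6.0]], [[True, False], [True, True]])
# emb == [2.0, 6.0]
```

A Hydra config then only needs to name the registered encoder; the factory looks it up by key.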
This project uses the src layout (Python packaging best practice) and uv for dependency management.
- Python 3.12+
- uv package manager
- Clone the repository:

```bash
git clone <repository-url>
cd SLICES
```

- Install the package in editable mode with development dependencies:

```bash
uv sync --dev
```

This will:

- Create a virtual environment
- Install all dependencies
- Install the package in editable mode (`-e`)

- Verify installation:

```bash
python -c "from slices.data.extractors.base import BaseExtractor; print('Import successful!')"
```

SLICES supports two starting points: CSV files or Parquet files.
If you have MIMIC-IV in CSV.gz format from PhysioNet:
Step 1: Convert CSV to Parquet
```bash
python scripts/convert_csv_to_parquet.py \
    data.csv_root=/path/to/mimic-iv-3.0 \
    data.parquet_root=/path/to/mimic-iv-parquet
```

Step 2: Extract Features

```bash
python scripts/extract_mimic_iv.py \
    data.parquet_root=/path/to/mimic-iv-parquet
```

Convenience shortcut (runs both steps):

```bash
python scripts/setup_mimic_iv.py data.csv_root=/path/to/mimic-iv-3.0
```

If you already have MIMIC-IV in Parquet format:

```bash
python scripts/extract_mimic_iv.py \
    data.parquet_root=/path/to/mimic-iv-parquet
```

Pretrain SSL Model

```bash
python scripts/pretrain.py data.parquet_root=/path/to/mimic-iv-parquet
```

Fine-tune on Downstream Task
```bash
uv run python scripts/finetune.py checkpoint=outputs/encoder.pt
```

```
slices/                              # Repository root
├── src/
│   └── slices/                      # Main package (src layout)
│       ├── data/
│       │   ├── extractors/          # Dataset-specific extraction
│       │   │   └── base.py          # Abstract base extractor
│       │   ├── concepts/            # Concept dictionary YAMLs
│       │   ├── dataset.py           # PyTorch Dataset
│       │   ├── datamodule.py        # Lightning DataModule
│       │   └── transforms.py        # SSL augmentations
│       ├── models/
│       │   ├── encoders/            # Backbone architectures
│       │   │   ├── base.py          # Abstract base encoder
│       │   │   ├── factory.py       # Encoder factory
│       │   │   └── transformer.py   # Transformer encoder
│       │   ├── pretraining/         # SSL objectives
│       │   │   ├── base.py          # Abstract SSL objective
│       │   │   ├── factory.py       # SSL objective factory
│       │   │   └── mae.py           # MAE objective
│       │   └── heads/               # Task heads (for finetuning)
│       │       ├── base.py          # Abstract BaseTaskHead
│       │       ├── factory.py       # Task head factory
│       │       └── mlp.py           # MLP and Linear task heads
│       └── training/                # Training utilities and Lightning modules
│           ├── pretrain_module.py   # SSLPretrainModule
│           ├── finetune_module.py   # FineTuneModule
│           └── utils.py
├── configs/                         # Hydra configs (outside src/)
│   ├── config.yaml                  # Main config
│   ├── data/
│   │   └── mimic_iv.yaml
│   ├── model/
│   │   └── transformer.yaml
│   └── concepts/
│       └── core_features.yaml       # Concept dictionary
├── scripts/                         # Entry points (outside src/)
│   ├── convert_csv_to_parquet.py    # Convert CSV.gz to Parquet
│   ├── setup_mimic_iv.py            # Convenience: convert + extract
│   ├── extract_mimic_iv.py          # Extract features from Parquet
│   ├── pretrain.py
│   └── finetune.py
└── tests/                           # Tests (outside src/)
```
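The MAE objective under `models/pretraining/` follows the standard mask-and-reconstruct recipe: hide a random subset of observed entries, reconstruct them, and score only the hidden positions. A pure-Python sketch of that recipe (the real `mae.py` works on tensors with a learned decoder; the constant "prediction" here is a stand-in):

```python
# Sketch of masked autoencoding on an hourly-binned series.
import random

def mae_step(timeseries, obs_mask, mask_ratio=0.5, seed=0):
    """Hide a fraction of observed entries and score reconstruction on them."""
    rng = random.Random(seed)
    T, D = len(timeseries), len(timeseries[0])
    observed = [(t, d) for t in range(T) for d in range(D) if obs_mask[t][d]]
    hidden = set(rng.sample(observed, int(mask_ratio * len(observed))))
    # The encoder only sees the corrupted view with hidden entries zeroed out.
    corrupted = [[0.0 if (t, d) in hidden else timeseries[t][d] for d in range(D)]
                 for t in range(T)]
    # Stand-in decoder predicting 0.0 everywhere; a real model predicts the
    # hidden values from the encoded corrupted view.
    loss = sum((timeseries[t][d] - 0.0) ** 2 for (t, d) in hidden)
    return corrupted, hidden, loss / max(len(hidden), 1)
```

Crucially, the loss is averaged only over hidden-but-observed positions, so the model is never penalized for values that were missing in the first place.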
Extracted ICU stays are stored as separate Parquet files in the output directory:
- `static.parquet` - Stay-level metadata (demographics, admission info)
  - Columns: `stay_id`, `patient_id`, `age`, `gender`, `race`, `admission_type`, `los_days`, etc.
- `timeseries.parquet` - Dense hourly-binned time-series with observation masks
  - Columns: `stay_id`, `timeseries` (nested array, shape T×D), `mask` (nested array, shape T×D)
- `labels.parquet` - Task labels
  - Columns: `stay_id`, plus one column per task (e.g., `mortality_24h`, `mortality_48h`)
- `metadata.yaml` - Feature names, sequence length, task names, etc.
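The hourly binning that produces the T×D `timeseries` and `mask` arrays can be sketched as below. Averaging repeated observations within a bin is an assumption for illustration; the actual aggregation rule belongs to the preprocessing stage:

```python
# Sketch: bin raw (hour_offset, feature, value) events into a dense T×D grid
# plus an observation mask (True where at least one value was recorded).
def bin_events(events, n_hours, features):
    idx = {f: d for d, f in enumerate(features)}
    grid = [[0.0] * len(features) for _ in range(n_hours)]
    mask = [[False] * len(features) for _ in range(n_hours)]
    counts = [[0] * len(features) for _ in range(n_hours)]
    for hour, feat, value in events:
        t, d = int(hour), idx[feat]
        counts[t][d] += 1
        # Incremental mean over observations landing in the same bin.
        grid[t][d] += (value - grid[t][d]) / counts[t][d]
        mask[t][d] = True
    return grid, mask

events = [(0.2, "hr", 80.0), (0.8, "hr", 90.0), (1.5, "sbp", 120.0)]
grid, mask = bin_events(events, n_hours=2, features=["hr", "sbp"])
# grid == [[85.0, 0.0], [0.0, 120.0]]
```

Unobserved cells keep a placeholder value (0.0 here) and are flagged False in the mask, which is why the mask must travel with the tensor downstream.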
The `ICUDataset` loads these files and returns dictionaries with:

- `timeseries`: FloatTensor of shape (seq_length, n_features)
- `mask`: BoolTensor of shape (seq_length, n_features); True = observed, False = missing/imputed
- `static`: dict of static features (age, gender, etc.)
- `label`: FloatTensor with the task label (if `task_name` is specified)
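A downstream consumer can use the mask to re-impute placeholder values however it likes. A stand-alone sketch with plain lists in place of tensors, using last-observation-carried-forward (LOCF is illustrative only, not necessarily what the pipeline does):

```python
# Sketch: LOCF imputation driven by the dataset's observation mask.
# Plain lists stand in for the FloatTensor/BoolTensor the real Dataset yields.
sample = {
    "timeseries": [[36.6, 0.0], [0.0, 80.0], [37.1, 0.0]],
    "mask":       [[True, False], [False, True], [True, False]],
}

def locf(timeseries, mask):
    """Carry the last observed value forward into unobserved cells."""
    T, D = len(timeseries), len(timeseries[0])
    filled = [row[:] for row in timeseries]
    for d in range(D):
        last = None
        for t in range(T):
            if mask[t][d]:
                last = filled[t][d]
            elif last is not None:
                filled[t][d] = last
    return filled

filled = locf(sample["timeseries"], sample["mask"])
# filled == [[36.6, 0.0], [36.6, 80.0], [37.1, 80.0]]
```

Cells with no prior observation are left at their placeholder value, which the mask still marks as unobserved.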
SLICES uses Hydra for configuration management. All configs are in the configs/ directory.
Edit `configs/data/mimic_iv.yaml` or override via the command line:
```yaml
# Optional: Path to raw CSV.gz files (if starting from CSVs)
csv_root: null  # e.g., /data/mimic-iv-3.0

# Required: Path to Parquet files (used by extraction)
parquet_root: data/parquet/mimic-iv-demo

# Output for extracted features
output_dir: data/processed/mimic-iv-demo
```

Command-line overrides:
```bash
# Convert CSVs
python scripts/convert_csv_to_parquet.py \
    data.csv_root=/path/to/csv \
    data.parquet_root=/path/to/parquet

# Extract from Parquet
python scripts/extract_mimic_iv.py \
    data.parquet_root=/path/to/parquet
```

CSV format (from PhysioNet):
```
mimic-iv-3.0/
├── hosp/
│   ├── patients.csv.gz
│   ├── admissions.csv.gz
│   └── labevents.csv.gz
├── icu/
│   ├── icustays.csv.gz
│   ├── chartevents.csv.gz
│   └── inputevents.csv.gz
└── ...
```
Parquet format (after conversion or if pre-converted):
```
mimic-iv-parquet/
├── hosp/
│   ├── patients.parquet
│   ├── admissions.parquet
│   └── labevents.parquet
├── icu/
│   ├── icustays.parquet
│   ├── chartevents.parquet
│   └── inputevents.parquet
└── ...
```
Tune CSV-to-Parquet conversion performance with environment variables:

```bash
# Maximum parallel workers (default: 4)
export SLICES_CONVERT_MAX_WORKERS=8

# DuckDB memory limit per worker (default: 3GB)
export SLICES_DUCKDB_MEM=4GB

# DuckDB threads per worker (default: 2)
export SLICES_DUCKDB_THREADS=4
```

Run the test suite:

```bash
pytest tests/
```

```bash
# Format code
black src/ scripts/ tests/

# Lint
ruff check src/ scripts/ tests/

# Type check
mypy src/
```

- MIMIC-IV: Johnson, A. E. W., et al. (2023). MIMIC-IV, a freely accessible electronic health record dataset. Scientific Data.
- ricu: Gygax, D. M., et al. (2023). ricu: R's interface to intensive care data. GigaScience.
- YAIB: Yèche, H., et al. (2024). YAIB: Yet Another ICU Benchmark. ICLR.
- ICareFM: [Preprint] Self-supervised learning for ICU time-series.
See LICENSE file for details.
This is a master's thesis project. Contributions welcome via issues and pull requests.
