This repository is the official implementation of *Multivariable Serum Creatinine Forecasting for Acute Kidney Injury Detection Using an Explainable Transformer-based Model*, by Cyprien Gille, Galaad Altares, Benjamin Colette, Karim Zouaoui Boudjeltia, Matei Mancas and Virginie Vandenbulcke (EMBC 2025).
If the code in this repository has been useful to you, please cite the original article using the *Cite this repository* button (located in the top right of the GitHub page, above *Releases*).
You can also cite the article directly using the reference below.
```bibtex
@inproceedings{gilleMultivariableSerumCreatinine2025,
  title = {Multivariable {{Serum Creatinine Forecasting}} for {{Acute Kidney Injury Detection Using}} an {{Explainable Transformer-based Model}}},
  booktitle = {2025 47th {{Annual International Conference}} of the {{IEEE Engineering}} in {{Medicine}} and {{Biology Society}} ({{EMBC}})},
  author = {Gille, Cyprien and Altares, Galaad and Colette, Benjamin and Boudjeltia, Karim Zouaoui and Mancas, Matei and Vandenbulcke, Virginie},
  year = {2025},
  month = jul,
  pages = {1--7},
  issn = {2694-0604},
  doi = {10.1109/EMBC58623.2025.11251723},
  keywords = {Accuracy,Forecasting,Injuries,Kidney,Mortality,Predictive models,Prognostics and health management,Time series analysis,Transformers,Usability}
}
```
All relevant files should have a docstring at the top and extensive comments to tell you what they do, but you can find an overview of the contents of this repository below.
| File | Description | Executable? |
|---|---|---|
| culling_reg.py | Task-aware postprocessing script that removes unusable stays and measures after preprocessing | Yes |
| eval_kfold_TMITS.py | Computes metrics for a trained T-MITS model | Yes |
| kfold_TMITS.py | Trains a T-MITS model, optionally with cross-validation | Yes |
| preprocess_eicu.py | Task-agnostic eICU preprocessing | Yes |
| preprocess_mimic.py | Task-agnostic MIMIC-IV preprocessing using pre-selected variables | Yes |
| .gitignore | Excludes generated files from version control | No |
| config.py | Dataclasses controlling the configuration of the main scripts | No |
| pyproject.toml | Description of the Python project and its dependencies (see the Installing section below) | No |
| uv.lock | uv lockfile indicating a working set of packages for this repository, for reproducibility | No |
| dataset_classes/dataset_base.py | Base ICU torch Dataset class | No |
| dataset_classes/dataset_regression.py | ICU torch Dataset class intended for regression | No |
| models/attention.py | Basic attention torch module | No |
| models/loss.py | Quantile Loss torch module | No |
| models/tmits.py | T-MITS torch module | No |
| models/transformer.py | Transformer wrapper torch module | No |
| models/UD.py | Up-dimensional embedding torch module | No |
| utility_functions/eval_utils.py | Utility functions used for evaluation | No |
| utility_functions/preprocessing_utils.py | Utility functions used for preprocessing (includes the pre-selected variables for preprocessing) | No |
| utility_functions/utils.py | Various utility functions | No |
| results/* | Trained model checkpoints and evaluation metrics | No |
All dependencies for this project are specified in the `pyproject.toml` file, following PEP 621.

You can install them using the uv Python package manager, a modern replacement for all of conda's features (and more). To do so, simply install uv and run the following command in this repository:
```
uv sync
```

This will create a virtual environment for this project and install its dependencies. This is the recommended and maintained way to set up this repository.
Since `pyproject.toml` is a standard format, you can also install this project by simply running the following command (we advise doing so in a virtual environment).

Note that this method is not recommended, because it might not install the best build of PyTorch for your hardware (PyTorch has its own package indexes), among other things.
```
pip install .
```

All results presented in our paper were obtained on either the MIMIC-IV dataset or the eICU-CRD dataset, both of which can be obtained freely after completing a short training course. Instructions can be found at the bottom of the two previous links.
Once obtained, the datasets should be placed in the same root directory as this repository, as such:

```
<top-directory>/
├── mimic-iv-2.2/
│   ├── icu/
│   ├── hosp/
│   ├── ...
├── eicu-crd-2.0/
│   ├── lab.csv
│   ├── ...
├── T-MITS/
│   ├── dataset_classes/
│   ├── ...
```
If you wish to place them elsewhere, you will have to modify the paths at the start of the preprocessing scripts.
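If you do move them, that is a small edit at the top of each preprocessing script; for illustration, something like the sketch below (the actual variable names in the scripts may differ):

```python
# Hypothetical path variables at the top of preprocess_mimic.py /
# preprocess_eicu.py; the real variable names may differ.
MIMIC_DIR = "/data/datasets/mimic-iv-2.2/"
EICU_DIR = "/data/datasets/eicu-crd-2.0/"
```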
If you don't want to decompress every file in each dataset, you can save space by decompressing only the following files:

- In the `mimic-iv-2.2/` directory: `hosp/admissions.csv`, `hosp/patients.csv`, and `icu/chartevents.csv`
- In the `eicu-crd-2.0/` directory: `lab.csv`, `vitalPeriodic.csv`, and `patient.csv`
All preprocessing scripts produce two .csv files: the dataset and a key mapping the original variable labels (such as "Heart Rate") to their reindexed integer ids. This is mainly used to tell scripts (culling, training) which variable interests you without having to know its internal id.
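For example, you could look up a variable's integer id from the key file like this (the key's filename and column names here are assumptions, not the actual ones):

```python
import pandas as pd

# Load the key produced by preprocessing (filename is illustrative).
key = pd.read_csv("key_mimic.csv")

# Find the reindexed integer id of a variable by its original label
# (the "label" and "id" column names are assumptions).
creatinine_id = key.loc[key["label"] == "Creatinine (serum)", "id"].item()
print(f"Creatinine (serum) -> id {creatinine_id}")
```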
To get the bottom-up (29 variables) or top-down (206 variables) version of the processed MIMIC-IV dataset, set the `top_down` flag at the top of `preprocess_mimic.py` and run it:

```
uv run preprocess_mimic.py
```

To get the processed eICU-CRD dataset, run:

```
uv run preprocess_eicu.py
```

Preprocessing is task-agnostic, so you still need to remove from the preprocessed dataset all stays and measures that are unusable for your task. For example, this includes stays with no regression target.
The culling script is also used to create cohorts based on several criteria, such as the first value of the target variable in a stay, which variable should be maskable during training without creating empty stays, or the maximum length of a stay. Note that, by default, any cohort-defining parameter changed from its default value is appended to the output filename for clarity (see `attr_in_paths` in `CullingConfig` in `config.py`).
To cull a dataset, adjust the `CullingConfig` at the top of `culling_reg.py` and run it:

```
uv run culling_reg.py
```

This will produce a ready-to-use `.csv`. It can also split each stay into a separate `.csv` file (which can be faster than filtering the dataset during training), and save the classification label of each stay (useful for stratified splitting) in a `.json` dictionary.
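For illustration, a cohort-defining adjustment could look like the sketch below; every field name here is hypothetical, so check the actual `CullingConfig` definition in `config.py` before editing:

```python
# Hypothetical adjustment of the culling configuration at the top of
# culling_reg.py; all field names below are illustrative, not the real ones.
from config import CullingConfig

config = CullingConfig(
    target_variable="Creatinine (serum)",  # hypothetical: regression target
    max_stay_length=200,                   # hypothetical: cap on stay length
    split_stays=True,                      # hypothetical: one .csv per stay
)
```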
As always, adjust the `TrainingConfig` at the top of the script and run it:

```
uv run kfold_TMITS.py
```

This will produce:

- A record of the training config (`.json`)
- A logfile of the training process (`.log`)

For each cross-validation fold, this will produce:

- Best model checkpoints (`.pth`)
- A record of the training and testing indices as numpy arrays (`.npy`)
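If you want to inspect these artifacts afterwards, here is a minimal sketch, assuming an experiment subfolder named `my_experiment` (the file names follow the patterns listed in the results section below):

```python
import numpy as np
import torch

# The "my_experiment" subfolder name is illustrative.
fold = 0
train_idx = np.load(f"results/my_experiment/T_MITS_train_idx_{fold}.npy")
test_idx = np.load(f"results/my_experiment/T_MITS_test_idx_{fold}.npy")

# Load the best checkpoint of this fold onto the CPU.
checkpoint = torch.load(f"results/my_experiment/T_MITS_{fold}.pth", map_location="cpu")
print(len(train_idx), len(test_idx))
```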
As always, adjust the `EvalConfig` at the top of the script and run it:

```
uv run eval_kfold_TMITS.py
```

This will reuse the saved training config to evaluate the trained model.

If all config booleans are set to `True`, this will produce (for both the train set and the test set):

- Regression and classification metrics (`.csv`)
- Normalized confusion matrices
- Ground truth and predicted value arrays, both for regression and for classification (`.npy`), aligned with the stay indexes saved during training right after splitting
- An `.xlsx` sheet with columns for the stay id, the true and predicted values, and the true and predicted classes
You can also call this script's main function with a path to a culled cohort as the `data_override` argument: this allows you to evaluate a model on a different cohort than the one it was trained on (for example, train on stays that started in stage 0 and evaluate on stays that ended in stages 1, 2, or 3). Note, however, that if none of the test indexes used during training are present in the overriding cohort, the script will raise an error.
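A minimal sketch of such an override, assuming the script's entry point is named `main` (check `eval_kfold_TMITS.py` for the actual name and signature):

```python
# Hypothetical invocation; the entry-point name and signature may differ.
from eval_kfold_TMITS import main

# Evaluate the trained model on a different culled cohort
# (the path below is illustrative).
main(data_override="culled_cohorts/other_cohort.csv")
```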
To promote ease of use and reproducibility, we provide the full outputs of our training and evaluation pipelines in the results folder. As such, you will find for each experiment (i.e. in each subfolder) the following files:
- `T_MITS_[X].pth` files: best model checkpoint from training fold `X`.
- `T_MITS_config.json`: serialized configuration object used for training this experiment.
- `T_MITS_[train/test]_idx_[X].npy` files: stay indices used for the train/test split for fold `X`, saved in binary using `numpy.save`.
- `confusion_test.png`: confusion matrix on the test set of the first fold.
- `metrics_test.csv`: evaluation metrics on the test sets of all folds.
- `true_pred_values_test_[X].xlsx` files: ground truths and predicted values and classes for each stay in the test set of fold `X`.
- `arrays/`: numpy arrays of all values (ground truths and predicted) for the test sets of all folds.
- `arrays_classif/`: numpy arrays of all classes (ground truths and predicted) for the test sets of all folds.
Note: the order of the stays, as dictated by the `idx` files and as reported in the `.xlsx` files, is consistent throughout all numpy arrays.
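For instance, a sketch of how these aligned arrays might be consumed (the file names inside `arrays/` are assumptions; only the `idx` naming pattern is documented above):

```python
import numpy as np

# The idx file follows the naming pattern documented above; the array
# file names inside arrays/ are assumptions for illustration.
test_idx = np.load("results/my_experiment/T_MITS_test_idx_0.npy")
y_true = np.load("results/my_experiment/arrays/true_test_0.npy")
y_pred = np.load("results/my_experiment/arrays/pred_test_0.npy")

# Position i refers to the same stay in all three arrays.
assert len(test_idx) == len(y_true) == len(y_pred)
print(f"Fold 0 test MAE: {np.mean(np.abs(y_true - y_pred)):.3f}")
```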
The code in this repository is available under a GPLv3 license.
