Some severe issues with the MIMIC-IV preprocessing #7
Description
While reproducing the preprocessing, I noticed a few severe issues with the provided notebooks.
datamerging.ipynb - Prescriptions are accidentally dropped completely
presc_df = presc_df.drop((presc_df['valuenum']=='3-10').index)
Afterwards, the table is empty: .index of the boolean Series is the full index, so every row is dropped.
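To illustrate the problem, here is a minimal sketch on a toy frame. The values are made up, and the boolean-mask filter is only one possible fix, not necessarily what the authors intended.

```python
import pandas as pd

# Toy stand-in for the prescriptions frame; the values are invented.
presc_df = pd.DataFrame({"valuenum": ["1", "3-10", "2.5"]})

# Pattern from the notebook: `.index` of a boolean Series is the full index,
# so every row is dropped regardless of the comparison result.
buggy = presc_df.drop((presc_df["valuenum"] == "3-10").index)

# Possible fix: keep only the rows where the comparison is False.
fixed = presc_df[presc_df["valuenum"] != "3-10"]

print(len(buggy), len(fixed))  # 0 2
```

An equivalent fix that keeps the drop call would be to select the matching rows first, i.e. presc_df.drop(presc_df[presc_df["valuenum"] == "3-10"].index).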
outputs.ipynb - Wrong labels!
outputs_label_list contains the entries "Chest Tube" and "Jackson Pratt", but these never appear as labels, the correct labels are "Chest Tube #1" and "Jackson Pratt #1"
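A quick way to catch this kind of mismatch is to compare the requested labels against the labels that actually occur. The sketch below uses a toy frame; the column name label, the extra "Foley" entry, and the data are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical excerpt of the output events; the real labels come from MIMIC-IV.
outputs_df = pd.DataFrame(
    {"label": ["Chest Tube #1", "Jackson Pratt #1", "Foley", "Chest Tube #1"]}
)
outputs_label_list = ["Chest Tube", "Jackson Pratt", "Foley"]

# Labels that are requested but never occur in the data.
present = set(outputs_df["label"].unique())
missing = [lab for lab in outputs_label_list if lab not in present]

print(missing)  # ['Chest Tube', 'Jackson Pratt']
```

Asserting that missing is empty before filtering would make the notebook fail loudly instead of silently dropping these outputs.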
prescriptions.ipynb - missing required filtering
- rows with non-float dose_val_rx are not dropped
- rows with NaT entries in starttime are not dropped
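One way to apply both filters is to coerce the columns and drop whatever fails to parse. This is a sketch on invented data, assuming dose_val_rx is stored as strings; it is not the notebook's actual code.

```python
import pandas as pd

# Toy prescriptions frame; the values are invented.
presc_df = pd.DataFrame(
    {
        "dose_val_rx": ["5", "0.5-1", "2.5", "10"],
        "starttime": [
            "2150-01-01 08:00:00",
            "2150-01-01 09:00:00",
            None,
            "2150-01-02 10:00:00",
        ],
    }
)

# Non-numeric doses (e.g. ranges like "0.5-1") become NaN.
presc_df["dose_val_rx"] = pd.to_numeric(presc_df["dose_val_rx"], errors="coerce")
# Missing or unparseable timestamps become NaT.
presc_df["starttime"] = pd.to_datetime(presc_df["starttime"], errors="coerce")

# Drop rows that failed either conversion.
presc_df = presc_df.dropna(subset=["dose_val_rx", "starttime"])

print(presc_df)
```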
inputevents.ipynb
The code for adding repeats in some cases does not add enough repeats due to a rounding issue. This can be tested via
min_diff = (pd.to_datetime(df_new1["endtime"])-df_new1["charttime"]).groupby(level=0).min()
assert all(min_diff < pd.Timedelta("30min")), f"Did not add enough steps!"
labevents.ipynb
- rows with NaN-valued valuenum are not dropped
admissions.ipynb
We filter for patients with a single admission; however, the other dataframes are later filtered by hadm_id instead of subject_id. The issue is that there appears to be corrupted data in at least one table, which gives rise to hadm_id values with multiple subject_id values associated with them. We can test it in datamerging via
assert all(merged_df.groupby("subject_id")["hadm_id"].nunique() == 1)
assert all(merged_df.groupby("hadm_id")["subject_id"].nunique() == 1)
Further, the hospital stay is limited to patients with a stay of 2-29 days. However, charttime does not agree with this: sometimes charttime starts before admittime, and the longest charttime is over 52 years.
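Both problems can be flagged and filtered in the merged frame. The following is a rough sketch on invented data; it assumes merged_df carries admittime and dischtime next to charttime, which may not match the actual notebook.

```python
import pandas as pd

# Toy merged frame; all values are invented for illustration.
merged_df = pd.DataFrame(
    {
        "subject_id": [1, 1, 2, 3],
        "hadm_id": [10, 10, 20, 20],
        "admittime": pd.to_datetime(["2150-01-01"] * 2 + ["2150-03-01"] * 2),
        "dischtime": pd.to_datetime(["2150-01-05"] * 2 + ["2150-03-10"] * 2),
        "charttime": pd.to_datetime(
            ["2150-01-02", "2149-12-31", "2150-03-02", "2150-03-03"]
        ),
    }
)

# hadm_ids that are linked to more than one subject_id.
n_subjects = merged_df.groupby("hadm_id")["subject_id"].nunique()
bad_hadm = n_subjects[n_subjects > 1].index.tolist()

# Rows whose charttime falls outside the admission window.
outside = ~merged_df["charttime"].between(
    merged_df["admittime"], merged_df["dischtime"]
)

# Keep only rows with a consistent hadm_id and an in-window charttime.
clean = merged_df[~merged_df["hadm_id"].isin(bad_hadm) & ~outside]

print(bad_hadm, int(outside.sum()), len(clean))  # [20] 1 1
```

Whether out-of-window rows should be dropped or the affected admissions excluded entirely is a design decision I would leave to the maintainers.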