The project provides three main CLI tools:
- `glucose-process`: Processes one or more glucose databases into a unified ML-ready format.
- `glucose-compare`: Compares two checkpoint CSV files and provides detailed statistics.
- `glucose-download`: Downloads publicly available glucose datasets.

`glucose-process [INPUT_FOLDERS]... [OPTIONS]`

- `INPUT_FOLDERS`: One or more paths to database folders (e.g., `DATA/uom_small`) or ZIP files (for AI-READI).

| Option | Shorthand | Description |
|---|---|---|
| `--config` | `-c` | Path to a YAML configuration file. Auto-loaded from `glucose_config.yaml` if present. |
| `--output` | `-o` | Filename for the final ML-ready CSV (placed in the `OUTPUT/` folder). |
| `--interval` | `-i` | Time discretization interval (minutes). |
| `--gap-max` | `-g` | Max gap size to interpolate (minutes). |
| `--min-length` | `-l` | Minimum sequence length to preserve. |
| `--remove-calibration/--keep-calibration` | | Remove calibration events to create interpolatable gaps (default: enabled). |
| `--calibration-period` | `-p` | Gap duration considered a calibration period (minutes, default: 165). |
| `--remove-after-calibration` | `-r` | Hours of data to remove after a calibration period (default: 24). |
| `--glucose-only` | | Filter output to only include glucose values. |
| `--fixed-frequency/--no-fixed-frequency` | | Enable or disable resampling to fixed time buckets (default: enabled). |
| `--last-step` | | Last processing step to execute (1–7). Omit or use 0 for all steps. |
| `--round-precision` | | Decimal digits for rounding numeric fields. Can be negative (default: 3). |
| `--verbose` | `-v` | Enable detailed logging. |
| `--stats/--no-stats` | | Show or suppress the summary statistics printout (default: shown). |
| `--save-intermediate` | `-s` | Export CSVs after each processing stage. |
| `--first-n-users` | | Limit processing to the first N users found. |

If --config is not provided, the tool automatically loads glucose_config.yaml from the current directory when it exists. CLI arguments always override config file values.
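The precedence rule above can be pictured as three layers merged in order. The following is a minimal sketch, not the tool's actual code: the function name `merge_settings`, the setting keys other than `output_file`, and the convention that an unset CLI option arrives as `None` are all assumptions made for illustration. The config file is assumed to be already parsed into a dictionary.

```python
from typing import Any, Dict, Mapping, Optional

# Hypothetical setting names; the default values are the ones listed in the options table.
DEFAULTS: Dict[str, Any] = {
    "calibration_period": 165,       # minutes
    "remove_after_calibration": 24,  # hours
    "round_precision": 3,            # decimal digits
}


def merge_settings(
    config: Optional[Mapping[str, Any]],
    cli_args: Mapping[str, Any],
) -> Dict[str, Any]:
    """Layer settings so that defaults < config file < CLI arguments."""
    merged = dict(DEFAULTS)
    merged.update(config or {})
    # Assumption: None means "this option was not passed on the command line".
    merged.update({key: value for key, value in cli_args.items() if value is not None})
    return merged


# Config sets two values; the CLI overrides only round_precision.
config = {"calibration_period": 120, "output_file": "from_config.csv"}
cli_args = {"calibration_period": None, "round_precision": 2}
settings = merge_settings(config, cli_args)
print(settings)
```

In this sketch the config value for `calibration_period` survives because the CLI did not set it, while `round_precision` takes the CLI value.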
The output filename is resolved in the following order:
1. `--output` CLI option: filename provided by the user, placed in `OUTPUT/`.
2. Config `output_file` setting: taken from the YAML config, placed in `OUTPUT/`.
3. Folder-name-based: generated from the input folder/ZIP names joined with underscores and suffixed with `_ml_ready.csv` (e.g., `OUTPUT/hupa_uom_ml_ready.csv`).
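The resolution order above can be sketched as a small function. This is an illustrative approximation rather than the project's implementation: `resolve_output_path` is a hypothetical name, and the sketch assumes that input names are joined in the order given and that a `.zip` extension is dropped from the name.

```python
from pathlib import PurePosixPath
from typing import Optional, Sequence

OUTPUT_DIR = PurePosixPath("OUTPUT")


def resolve_output_path(
    cli_output: Optional[str],
    config_output: Optional[str],
    inputs: Sequence[str],
) -> str:
    """Pick the output filename: CLI option, then config value, then input names."""
    if cli_output:
        name = cli_output
    elif config_output:
        name = config_output
    else:
        # Use the folder name as-is; for ZIP inputs, drop the ".zip" extension.
        parts = [
            PurePosixPath(p).stem if p.lower().endswith(".zip") else PurePosixPath(p).name
            for p in inputs
        ]
        name = "_".join(parts) + "_ml_ready.csv"
    return str(OUTPUT_DIR / name)


print(resolve_output_path(None, None, ["DATA/hupa", "DATA/uom"]))
print(resolve_output_path("combined_data.csv", "from_config.csv", ["DATA/hupa"]))
```

The first call falls through to the folder-name rule and yields `OUTPUT/hupa_uom_ml_ready.csv`; the second returns the CLI value because it takes priority over the config setting.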
The CLI supports combining different databases in a single run:
`glucose-process DATA/uom DATA/hupa DATA/dexcom_small -o combined_data.csv`

The preprocessor automatically:
- Detects the database type for each input.
- Tracks a global `sequence_id` to prevent collisions.
- Normalizes all data to the same time resolution and field set.
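To illustrate why a global `sequence_id` matters, here is one possible way to renumber sequences when several databases are combined. This is a sketch under stated assumptions, not the preprocessor's actual logic: `assign_global_sequence_ids` and the `glucose` field are hypothetical, and records are modeled as plain dictionaries.

```python
from typing import Dict, List

Record = Dict[str, object]


def assign_global_sequence_ids(datasets: List[List[Record]]) -> List[Record]:
    """Renumber per-dataset sequence IDs with one running counter."""
    combined: List[Record] = []
    next_id = 0
    for records in datasets:
        # A fresh mapping per dataset, so equal local IDs in different datasets stay distinct.
        local_to_global: Dict[object, int] = {}
        for record in records:
            local_id = record["sequence_id"]
            if local_id not in local_to_global:
                local_to_global[local_id] = next_id
                next_id += 1
            combined.append({**record, "sequence_id": local_to_global[local_id]})
    return combined


# Both datasets start numbering at 0, which would collide without renumbering.
uom = [{"sequence_id": 0, "glucose": 5.4}, {"sequence_id": 1, "glucose": 6.1}]
hupa = [{"sequence_id": 0, "glucose": 7.0}]
combined = assign_global_sequence_ids([uom, hupa])
print([record["sequence_id"] for record in combined])  # prints [0, 1, 2]
```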

`glucose-download [COMMAND] [OPTIONS]`

| Command | Description |
|---|---|
| `list` | List all datasets available for download. |
| `all` | Download all programmatically accessible datasets. |
| `by-name` | Download a single dataset by name. |
| `by-names` | Download multiple datasets by name. |
| `by-id` | Download a dataset by its numeric ID. |

Common options:
| Option | Description |
|---|---|
| `--force` | Re-download even if the file already exists. |

Examples:

    glucose-download list
    glucose-download by-name "HUPA"
    glucose-download by-names "HUPA" "T1D-UOM"
    glucose-download by-id 14
    glucose-download by-name "T1D-UOM" --force

Downloaded datasets are saved to the `DATA/` folder, in subdirectories whose names match their format converters (e.g., `DATA/hupa/`, `DATA/uom/`).

Note: Some datasets require credentials:
- PhysioNet datasets: Set `PHYSIONET_USERNAME` and `PHYSIONET_PASSWORD` in `.env`.
- Manual-access datasets (AI-READI, some JAEB DirecNet studies): Require registration on their respective portals.
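Before starting a PhysioNet download, it can help to confirm that both variables are actually set. The check below is a small sketch, assuming the values from `.env` have already been loaded into the process environment; `missing_physionet_credentials` is a hypothetical helper, not part of the tool.

```python
import os
from typing import List, Mapping

REQUIRED_VARIABLES = ("PHYSIONET_USERNAME", "PHYSIONET_PASSWORD")


def missing_physionet_credentials(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARIABLES if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
missing = missing_physionet_credentials({"PHYSIONET_USERNAME": "alice"})
if missing:
    print("Missing credentials:", ", ".join(missing))
```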

`glucose-compare [FILE1] [FILE2] [OPTIONS]`

This tool compares two checkpoint files to ensure processing results are consistent.

- `FILE1`: Path to the first checkpoint file.
- `FILE2`: Path to the second checkpoint file.

| Option | Shorthand | Description |
|---|---|---|
| `--key-columns` | `-k` | Key columns for row matching. |
| `--tolerance` | `-t` | Numeric tolerance for approximate matches. |
| `--no-streaming` | | Disable Polars streaming. |

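To make the roles of key columns and tolerance concrete, the sketch below matches rows by key and compares numeric fields approximately. It is a plain-Python illustration, not the Polars-based implementation the tool uses: `compare_rows`, the column names, and the choice of an absolute (rather than relative) tolerance are assumptions.

```python
import math
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, object]


def compare_rows(
    rows_a: List[Row],
    rows_b: List[Row],
    key_columns: Sequence[str],
    tolerance: float = 0.0,
) -> Dict[str, list]:
    """Match rows by key and report unmatched keys and differing columns."""

    def index(rows: List[Row]) -> Dict[Tuple, Row]:
        return {tuple(row[k] for k in key_columns): row for row in rows}

    a, b = index(rows_a), index(rows_b)
    report: Dict[str, list] = {
        "only_in_first": sorted(set(a) - set(b)),
        "only_in_second": sorted(set(b) - set(a)),
        "mismatches": [],
    }
    for key in sorted(set(a) & set(b)):
        for column, left in a[key].items():
            if column in key_columns:
                continue
            right = b[key].get(column)
            if isinstance(left, (int, float)) and isinstance(right, (int, float)):
                # Assumption: tolerance is an absolute difference.
                equal = math.isclose(left, right, rel_tol=0.0, abs_tol=tolerance)
            else:
                equal = left == right
            if not equal:
                report["mismatches"].append((key, column))
    return report


first = [
    {"user_id": "u1", "sequence_id": 0, "glucose": 5.4},
    {"user_id": "u1", "sequence_id": 1, "glucose": 6.1},
]
second = [
    {"user_id": "u1", "sequence_id": 0, "glucose": 5.4004},
    {"user_id": "u1", "sequence_id": 1, "glucose": 6.3},
]
report = compare_rows(first, second, ["user_id", "sequence_id"], tolerance=0.001)
print(report["mismatches"])
```

With a tolerance of 0.001, the first pair of rows counts as equal and only the second pair is reported as a mismatch.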
At the end of a successful run, `glucose-process` displays a summary (unless suppressed with `--no-stats`) including:
- Total records collected and preserved.
- Number of sequences created and filtered.
- Interpolation and gap statistics.
- Longest and average sequence lengths.