diff --git a/README.md b/README.md
index 552bd4e..fbc972c 100644
--- a/README.md
+++ b/README.md
@@ -23,7 +23,7 @@ git clone --recursive http://github.com/iSPIRT/depa-training
 cd depa-training
 ```
 
-Install the below listed dependencies by running the [install-prerequisites.sh](./install-prerequisites.sh) script.
+Install the required dependencies by running the [install-prerequisites.sh](./install-prerequisites.sh) script.
 
 ```bash
 ./install-prerequisites.sh
@@ -33,7 +33,7 @@ Note: You may need to restart your machine to ensure that the changes take effect.
 
 ## Build CCR containers
 
-To build your own CCR container images, use the following command from the root of the repository.
+To build your own Confidential Cleanroom (CCR) container images, use the following command from the root of the repository.
 
 ```bash
 ./ci/build.sh
@@ -44,7 +44,7 @@ This scripts build the following containers.
 - ```depa-training```: Container with the core CCR logic for joining datasets and running differentially private training.
 - ```depa-training-encfs```: Container for loading encrypted data into the CCR.
 
-Alternatively, you can use pre-built container images from the ispirt repository by setting the following environment variable. Docker hub has started throttling which may effect the upload/download time, especially when images are bigger size. So, It is advisable to use other container registries, we are using azure container registry as shown below
+Alternatively, you can use pre-built container images from the iSPIRT repository by setting the `CONTAINER_REGISTRY` environment variable and pulling the images. Docker Hub has started throttling pulls, which may affect upload/download times, especially for larger images, so it is advisable to use another container registry. We are using Azure Container Registry, as shown below.
 
 ```bash
 export CONTAINER_REGISTRY=ispirt.azurecr.io
 ./ci/pull-containers.sh
@@ -52,7 +52,7 @@ export CONTAINER_REGISTRY=ispirt.azurecr.io
 
 # Scenarios
 
-This repository contains two samples that illustrate the kinds of scenarios DEPA for Training can support.
+This repository contains sample demos illustrating a diverse set of scenarios that DEPA for Training can support.
 
 Follow the links to build and deploy these scenarios.
@@ -60,7 +60,7 @@
 |--------------|---------------|-----------------|--------------|-----------|------------|------------|------------|
 | [COVID-19](./scenarios/covid/README.md) | Training - Deep Learning | Binary Classification | Differentially Private | 3 | PII tabular data (CSV) | MLP (ONNX) | Horizontal (3)|
 | [BraTS](./scenarios/brats/README.md) | Training - Deep Learning | Image Segmentation | Differentially Private | 4 | MRI scans data (NIfTI/PNG) | UNet (Safetensors) | Vertical (4)|
-| [Credit Risk](./scenarios/credit-risk/README.md) | Training - Classical ML | Binary Classification | Differentially Private | 4 | PII tabular data (Parquet) | XGBoost (JSON) | Horizontal (4)|
+| [Credit Risk](./scenarios/credit-risk/README.md) | Training - Classical ML | Binary Classification | Differentially Private | 4 | PII tabular data (Parquet) | XGBoost (JSON) | Horizontal (6)|
 | [CIFAR-10](./scenarios/cifar10/README.md) | Training - Deep Learning | Multi-class Image Classification | NA | 1 | Non-PII image data (SafeTensors) | CNN (Safetensors) | NA (1)|
 | [MNIST](./scenarios/mnist/README.md) | Training - Deep Learning | Multi-class Image Classification | NA | 1 | Non-PII image data (HDF5) | CNN (ONNX) | NA (1)|
diff --git a/scenarios/brats/README.md b/scenarios/brats/README.md
index df2a8ff..16effbf 100644
--- a/scenarios/brats/README.md
+++ b/scenarios/brats/README.md
@@ -109,7 +109,7 @@ If all goes well, you should see output similar to the following output, and the
 ```bash
 train-1 | Merged dataset 'brats_A' into '/tmp/brats_joined'
-train-1 | Merged dataset 'brat_B' into '/tmp/brats_joined'
+train-1 | Merged dataset 'brats_B' into '/tmp/brats_joined'
 train-1 | Merged dataset 'brats_C' into '/tmp/brats_joined'
 train-1 | Merged dataset 'brats_D' into '/tmp/brats_joined'
 train-1 |
@@ -381,7 +381,7 @@ The outputs will be saved to the [output](./modeller/output/) directory.
 To check if the trained model is fresh, you can run the following command:
 
 ```bash
-stat $REPO_ROOT/scenarios/$SCENARIO/modeller/output/trained_model.pth
+stat $REPO_ROOT/scenarios/$SCENARIO/modeller/output/trained_model.safetensors
 ```
 
 ---
diff --git a/scenarios/cifar10/README.md b/scenarios/cifar10/README.md
index e92b08f..3983b37 100644
--- a/scenarios/cifar10/README.md
+++ b/scenarios/cifar10/README.md
@@ -355,7 +355,7 @@ The outputs will be saved to the [output](./modeller/output/) directory.
 To check if the trained model is fresh, you can run the following command:
 
 ```bash
-stat $REPO_ROOT/scenarios/$SCENARIO/modeller/output/trained_model.pth
+stat $REPO_ROOT/scenarios/$SCENARIO/modeller/output/trained_model.safetensors
 ```
 
 ---
diff --git a/scenarios/credit-risk/README.md b/scenarios/credit-risk/README.md
index a9d9772..b4af3f2 100644
--- a/scenarios/credit-risk/README.md
+++ b/scenarios/credit-risk/README.md
@@ -4,13 +4,13 @@
 | Scenario name | Scenario type | Task type | Privacy | No. of TDPs* | Data type (format) | Model type (format) | Join type (No. of datasets) |
 |--------------|---------------|-----------------|--------------|-----------|------------|------------|------------|
-| Credit Risk | Training - Classical ML | Binary Classification | Differentially Private | 4 | PII tabular data (Parquet) | XGBoost (JSON) | Horizontal (4)|
+| Credit Risk | Training - Classical ML | Binary Classification | Differentially Private | 4 | PII tabular data (Parquet) | XGBoost (JSON) | Horizontal (6)|
 
 ---
 
 ## Scenario Description
 
-This scenario involves training an XGBoost model on the [Home Credit Default Risk](https://www.kaggle.com/c/home-credit-default-risk) datasets [[1, 2]](README.md#references). We frame this scenario as involving four Training Data Providers (TDPs) - Bank A providing data for clients' credit applications, previous applications and payment installments, Bank B providing data on credit card balance, the Credit Bureau providing data on previous loans, and a Fintech providing data on point of sale (POS) cash balance. Here, Bank A is also the Training Data Consumer (TDC) who wishes to train the model on the joined datasets, in order to build a default risk prediction model.
+This scenario involves training an **XGBoost** model on the [Home Credit Default Risk](https://www.kaggle.com/c/home-credit-default-risk) datasets [[1, 2]](README.md#references). We frame this scenario as involving four Training Data Providers (TDPs): **Bank A**, providing data on clients' (i) credit applications, (ii) previous applications and (iii) payment installments; **Bank B**, providing data on credit card balance; the **Credit Bureau**, providing data on previous loans; and a **Fintech**, providing data on point of sale (POS) cash balance. Here, **Bank A** is also the Training Data Consumer (TDC), who wishes to train the model on the joined datasets in order to build a default risk prediction model.
 The end-to-end training pipeline consists of the following phases:
@@ -106,18 +106,12 @@ flowchart TD
 If all goes well, you should see output similar to the following output, and the trained model and evaluation metrics will be saved under the folder [output](./modeller/output).
 
 ```
-train-1 | Training samples: 43636
-train-1 | Validation samples: 10909
-train-1 | Test samples: 5455
-train-1 | Dataset constructed from config
-train-1 | Model loaded from ONNX file
-train-1 | Optimizer Adam loaded from config
-train-1 | Scheduler CyclicLR loaded from config
-train-1 | Custom loss function loaded from config
-train-1 | Epoch 1/1 completed | Training Loss: 0.1586
-train-1 | Epoch 1/1 completed | Validation Loss: 0.0860
-train-1 | Saving trained model to /mnt/remote/output/trained_model.onnx
-train-1 | Evaluation Metrics: {'test_loss': 0.08991911436687393, 'accuracy': 0.9523373052245646, 'f1_score': 0.9522986646537908}
+train-1 | Joined datasets: ['credit_applications', 'previous_applications', 'payment_installments', 'bureau_records', 'pos_cash_balance', 'credit_card_balance']
+train-1 | Loaded dataset splits | train: (6038, 63) | val: (755, 63) | test: (755, 63)
+train-1 | Trained Gradient Boosting model with 250 boosting rounds | Epsilon: 4.0
+train-1 | Saved model to /mnt/remote/output
+train-1 | Evaluation Metrics: {'accuracy': 0.5231788079470199, 'roc_auc': 0.47444826338639656}
+train-1 | Non-DP Evaluation Metrics: {'accuracy': 0.9152317880794701, 'roc_auc': 0.6496472503617945}
 train-1 | CCR Training complete!
 train-1 |
 train-1 exited with code 0
@@ -125,7 +119,7 @@
 ## Deploy on CCR
 
-In a more realistic scenario, this datasets will not be available in the clear to the TDC, and the TDC will be required to use a CCR for training. The following steps describe the process of sharing an encrypted dataset with TDCs and setting up a CCR in Azure for training. Please stay tuned for CCR on other cloud platforms.
+In a more realistic scenario, these datasets will not be available in the clear to the TDC, and the TDC will be required to use a CCR to train its model. The following steps describe the process of sharing encrypted datasets with TDCs and setting up a CCR in Azure for training. Please stay tuned for CCR support on other cloud platforms.
 
 To deploy in Azure, you will need the following.
@@ -302,8 +296,6 @@ export CONTRACT_SEQ_NO=15
 
 This script will deploy the container images from your container registry, including the encrypted filesystem sidecar. The sidecar will generate an SEV-SNP attestation report, generate an attestation token using the Microsoft Azure Attestation (MAA) service, retrieve dataset, model and output encryption keys from the TDP and TDC's Azure Key Vault, train the model, and save the resulting model into TDC's output filesystem image, which the TDC can later decrypt.
-
-
 **Note:** The completion of this script's execution simply creates a CCR instance, and doesn't indicate whether training has completed or not. The training process might still be ongoing. Poll the container logs (see below) to track progress until training is complete.
 
 ### 6\. Monitor Container Logs
@@ -361,7 +353,7 @@ The outputs will be saved to the [output](./modeller/output/) directory.
 To check if the trained model is fresh, you can run the following command:
 
 ```bash
-stat $REPO_ROOT/scenarios/$SCENARIO/modeller/output/trained_model.json
+stat $REPO_ROOT/scenarios/$SCENARIO/modeller/output/trained_model.json
 ```
 
 ---
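The `stat` freshness checks this patch updates only print the file's timestamps; a reviewer still has to eyeball them. A minimal sketch of turning that into a pass/fail check is below — the temp file and the five-minute threshold are illustrative assumptions, not part of the repository, and `stat -c %Y` is the GNU coreutils form (macOS would use `stat -f %m`):

```shell
# Hypothetical freshness check: compare the model file's modification time
# to the current time. The temp file stands in for
# $REPO_ROOT/scenarios/$SCENARIO/modeller/output/trained_model.json.
MODEL=$(mktemp)
touch "$MODEL"
AGE=$(( $(date +%s) - $(stat -c %Y "$MODEL") ))
if [ "$AGE" -lt 300 ]; then
  echo "model is fresh (${AGE}s old)"
else
  echo "model is stale (${AGE}s old)"
fi
```

The same pattern works for the `.safetensors` outputs in the BraTS and CIFAR-10 scenarios by swapping in the corresponding path.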