2 changes: 1 addition & 1 deletion .github/workflows/contract-service.yml
@@ -15,7 +15,7 @@ env:
CONTAINER_REGISTRY_USERNAME: ${{ secrets.CONTAINER_REGISTRY_USERNAME }}
CONTAINER_REGISTRY_PASSWORD: ${{ secrets.CONTAINER_REGISTRY_ACCESS_TOKEN }}
PLATFORM: "virtual"
CONTAINER_NAME: "contract-ledger"
CONTAINER_NAME: "contract-service"

jobs:
deploy-contract-service:
39 changes: 23 additions & 16 deletions README.md
@@ -1,29 +1,35 @@
# DEPA for Training

[DEPA for Training](https://depa.world) is a techno-legal framework that enables privacy-preserving sharing of bulk, de-identified datasets for large scale analytics and training. This repository contains a reference implementation of [Confidential Clean Rooms](https://depa.world/training/confidential_clean_room_design), which together with the [Contract Service](https://github.com/kapilvgit/contract-ledger/tree/main), forms the basis of this framework. The reference implementation is provided on an As-Is basis. It is work-in-progress and should not be used in production.
[DEPA for Training](https://depa.world) is a techno-legal framework that enables privacy-preserving sharing of bulk, de-identified datasets for large scale analytics and training. This repository contains a reference implementation of [Confidential Clean Rooms](https://depa.world/training/confidential_clean_room_design) (CCR), which together with the [Contract Service](https://github.com/iSPIRT/contract-service/tree/main), forms the basis of this framework. The reference implementation is provided on an As-Is basis. It is work-in-progress and should not be used in production.

# Getting Started

## [New] Interactive Demo

You can now try out DEPA-Training interactively using our [interactive GUI demo](./gui-demo/README.md). The demo requires a signed electronic contract and an Azure cloud subscription.

Start by setting up this project on GitHub Codespaces or your own development environment and then follow the [instructions](./gui-demo/README.md).

## GitHub Codespaces

The simplest way to setup a development environment is using [GitHub Codespaces](https://github.com/codespaces). The repository includes a [devcontainer.json](.devcontainer/devcontainer.json), which customizes your codespace to install all required dependencies. Please ensure you allocate at least 8 vCPUs and 64GB disk space in your codespace. Also, run the following command in the codespace to update submodules.
The simplest way to set up a development environment is using [GitHub Codespaces](https://github.com/codespaces). The repository includes a [devcontainer.json](.devcontainer/devcontainer.json), which customizes your codespace to install all required dependencies. Please ensure you allocate at least 8 vCPUs and 64GB disk space in your codespace. Also, run the following command in the codespace to update submodules.

```bash
git submodule update --init --recursive
```

## Local Development Environment

Alternatively, you can build and develop locally in a Linux environment (we have tested with Ubuntu 20.04 and 22.04), or Windows with WSL 2.
Alternatively, you can build and develop locally in a Linux environment (we have tested with Ubuntu 20.04, 22.04, 24.04), or Windows with WSL 2.

Clone this repo to your local machine / virtual machine as follows.
Clone this repo to your local machine / virtual machine as follows.

```bash
git clone --recursive http://github.com/iSPIRT/depa-training
cd depa-training
```

Install the below listed dependencies by running the [install-prerequisites.sh](./install-prerequisites.sh) script.
Install the required dependencies by running the [install-prerequisites.sh](./install-prerequisites.sh) script.

```bash
./install-prerequisites.sh
@@ -33,34 +39,35 @@ Note: You may need to restart your machine to ensure that the changes take effect.

## Build CCR containers

To build your own CCR container images, use the following command from the root of the repository.
To build your own Confidential Clean Room (CCR) container images, use the following command from the root of the repository.

```bash
./ci/build.sh
```

This scripts build the following containers.
This script builds the following containers.

- ```depa-training```: Container with the core CCR logic for joining datasets and running differentially private training.
- ```depa-training-encfs```: Container for loading encrypted data into the CCR.

- ```depa-training```: Container with the core CCR logic for joining datasets and running differentially private training.
- ```depa-training-encfs```: Container for loading encrypted data into the CCR.
Alternatively, you can pull and use pre-built container images from the iSPIRT container registry by setting the following environment variable. Docker Hub has started throttling image pulls, which can slow uploads and downloads, especially for larger images, so it is advisable to use another container registry. We use Azure Container Registry (ACR) as shown below:

Alternatively, you can use pre-built container images from the ispirt repository by setting the following environment variable. Docker hub has started throttling which may effect the upload/download time, especially when images are bigger size. So, It is advisable to use other container registries, we are using azure container registry as shown below
```bash
export CONTAINER_REGISTRY=ispirt.azurecr.io
./ci/pull-containers.sh
```

# Scenarios

This repository contains two samples that illustrate the kinds of scenarios DEPA for Training can support.
This repository contains sample demos illustrating a diverse set of scenarios that DEPA for Training can support.

Follow the links to build and deploy these scenarios.
Follow the links to build and deploy these scenarios.

| Scenario name | Scenario type | Task type | Privacy | No. of TDPs* | Data type (format) | Model type (format) | Join type (No. of datasets) |
| Scenario name | Scenario type | Task type | Privacy | No. of TDPs* | Data type (format) | Model type (format) | Join type (No. of datasets) |
|--------------|---------------|-----------------|--------------|-----------|------------|------------|------------|
| [COVID-19](./scenarios/covid/README.md) | Training - Deep Learning | Binary Classification | Differentially Private | 3 | PII tabular data (CSV) | MLP (ONNX) | Horizontal (3)|
| [BraTS](./scenarios/brats/README.md) | Training - Deep Learning | Image Segmentation | Differentially Private | 4 | MRI scans data (NIfTI/PNG) | UNet (Safetensors) | Vertical (4)|
| [Credit Risk](./scenarios/credit-risk/README.md) | Training - Classical ML | Binary Classification | Differentially Private | 4 | PII tabular data (Parquet) | XGBoost (JSON) | Horizontal (4)|
| [Credit Risk](./scenarios/credit-risk/README.md) | Training - Classical ML | Binary Classification | Differentially Private | 4 | PII tabular data (Parquet) | XGBoost (JSON) | Horizontal (6)|
| [CIFAR-10](./scenarios/cifar10/README.md) | Training - Deep Learning | Multi-class Image Classification | NA | 1 | Non-PII image data (SafeTensors) | CNN (Safetensors) | NA (1)|
| [MNIST](./scenarios/mnist/README.md) | Training - Deep Learning | Multi-class Image Classification | NA | 1 | Non-PII image data (HDF5) | CNN (ONNX) | NA (1)|

@@ -70,7 +77,7 @@ _*Training Data Providers (TDPs) involved in the scenario._

## Build your own Scenarios

A guide to build your own scenarios is coming soon. Stay tuned!
A guide to build your own scenarios is available [here](./build-your-own-scenario/README.md). Follow the steps to build and run your own unique training scenario!

Currently, DEPA for Training supports the following training frameworks, libraries and file formats (more will be included soon):

@@ -85,4 +92,4 @@ Note: Due to security reasons, we do not support Pickle based file formats such

This project welcomes feedback and contributions. Before you start, please take a moment to review our [Contribution Guidelines](./CONTRIBUTING.md). These guidelines provide information on how to contribute, set up your development environment, and submit your changes.

We look forward to your contributions and appreciate your efforts in making DEPA Training better for everyone.
We look forward to your contributions and appreciate your efforts in making DEPA Training better for everyone.
240 changes: 240 additions & 0 deletions build-your-own-scenario/README.md
@@ -0,0 +1,240 @@
# Build Your Own Scenario

## Overview

You can build and run your own training scenarios by following three simple steps:

1. Define your high-level scenario configuration and generate a scenario boilerplate from it.
2. Implement the data preprocessing and model saving code for the Training Data Providers (TDPs) and Training Data Consumer (TDC) respectively.
3. Tailor the various training configuration files applicable to your scenario.

Once the scenario is ready, deploy it locally and/or inside a Confidential Clean Room (CCR) following the standard deployment steps.

## Step 1: Define and build your scenario template

Make sure to first complete the setup steps mentioned in the main [README](../README.md) file.

### Define your scenario configuration

In the [config](./config/) directory, you will find example scenario configuration files. These files define the high-level scenario configuration -- the datasets owned by each TDP along with associated control parameters (such as privacy budget), the training framework to use, the data joining method to use, and the model format to save/load the model in.

In a similar fashion, you can define your own scenario configuration file following the generalized template below:

```json
{
"scenario_name": "your-scenario-name",
"tdps": [
{
"name": "data_provider_1",
"datasets": [
{
"name": "dataset_name",
"id": "unique/random-uuid-here",
"privacy": true|false,
"epsilon": 7.5|null,
"delta": 0.00001|null
}
]
}
],
"training_framework": "Train_DL|LLM_Finetune|Train_ML|Train_XGB",
"join_type": "SparkJoin|DirectoryJoin",
"model_format": "ONNX|Safetensors|HDF5|ccr_instantiate"
}
```

### Configuration Fields

- **scenario_name**: Unique name for your scenario (used for directory creation)
- **tdps**: Array of Training Data Providers (TDPs)
- **name**: TDP identifier
- **datasets**: Array of datasets brought by this TDP
- **name**: Dataset name
- **id**: Unique UUID for the dataset
- **privacy**: (Optional) Boolean indicating if privacy protection is required
- **epsilon**: (Optional) Privacy budget (ε) for differential privacy
- **delta**: (Optional) Privacy parameter (δ) for differential privacy
- **training_framework**: Training framework to use among available DEPA-Training options:
- `Train_DL`: Deep learning training
- `LLM_Finetune`: Fine-tuning of LLMs
- `Train_ML`: Classical machine learning training
- `Train_XGB`: XGBoost training
- **join_type**: Data joining method to use among available DEPA-Training options:
- `SparkJoin`: Spark-based data joining
- `DirectoryJoin`: Directory-based data joining
- **model_format**: Format of model brought by the TDC for training
- `ONNX`: Open Neural Network Exchange format
- `Safetensors`: Safetensors format
- `HDF5`: HDF5 format
- `ccr_instantiate`: Model is created inside the CCR (no file needed)
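
Putting the fields together, a configuration for a hypothetical two-TDP scenario could look like the following. All names, UUIDs, and parameter values here are illustrative assumptions, not taken from the shipped examples:

```json
{
  "scenario_name": "my-demo-scenario",
  "tdps": [
    {
      "name": "hospital_a",
      "datasets": [
        {
          "name": "patient_records",
          "id": "3f1c9d2e-8a4b-4c6d-9e0f-1a2b3c4d5e6f",
          "privacy": true,
          "epsilon": 7.5,
          "delta": 0.00001
        }
      ]
    },
    {
      "name": "hospital_b",
      "datasets": [
        {
          "name": "lab_results",
          "id": "7a8b9c0d-1e2f-4a3b-8c4d-5e6f7a8b9c0d",
          "privacy": false,
          "epsilon": null,
          "delta": null
        }
      ]
    }
  ],
  "training_framework": "Train_DL",
  "join_type": "SparkJoin",
  "model_format": "ONNX"
}
```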

### Generate your scenario directory

Generate a scenario directory from your scenario configuration file by running the `build-scenario.sh` script as follows:

```bash
./build-scenario.sh <path-to-scenario.json> [--force]
```

- `<path-to-scenario.json>`: Path to your scenario configuration JSON file
- `--force`: Optional flag to overwrite existing scenario directories

Example:

```bash
./build-scenario.sh config/credit-risk.json
```
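
Each dataset `id` in the configuration must be a fresh, unique UUID. One quick way to generate one (assuming `python3` is on your PATH) is:

```shell
# Generate a fresh UUID for a dataset "id" field.
python3 -c 'import uuid; print(uuid.uuid4())'
```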

### Generated Scenario Directory Structure

```
scenarios/your-scenario-name/
├── ci/
│ ├── Dockerfile.* # Dockerfiles for each TDP to prepare datasets
│ ├── Dockerfile.modelsave # Dockerfile for preparing the base model (if applicable)
│ ├── build.sh # Build script for all containers
│ ├── pull-containers.sh # Container pulling script
│ └── push-containers.sh # Container pushing script
├── src/
│ ├── preprocess_*.py # Preprocessing scripts for each TDP
│ └── save_base_model.py # Model saving script (if applicable)
├── contract/
│ └── contract.json # Contract template
├── policy/
│ └── policy-in-template.json # Policy template
├── config/
│ ├── consolidate_pipeline.sh # Pipeline consolidation script
│ ├── dataset_config.json # Dataset configuration
│ ├── eval_config.json # Evaluation configuration
│ ├── join_config.json # Data joining configuration (if applicable)
│ ├── model_config.json # Model configuration (if applicable)
│ ├── loss_config.json # Loss function configuration (DL only)
│ └── templates/ # Configuration templates for training pipeline
│ ├── pipeline_config_template.json
│ └── train_config_template.json
├── deployment/
│ ├── local/ # Local deployment commands
│ └── azure/ # Azure deployment commands
├── export-variables.sh # Environment variables for deployment
├── .gitignore # Git ignore file
```

## Step 2: Implement the data preprocessing and model saving code

Prior to training, the Training Data Providers (TDPs) and Training Data Consumer (TDC) need to prepare their datasets and models respectively.

### Data preprocessing

The folder ```scenarios/your-scenario-name/src``` contains boilerplate scripts for pre-processing the datasets. Acting as a Training Data Provider (TDP), prepare your datasets by modifying the scripts according to your requirements.

Corresponding Dockerfiles are also provided in the ```scenarios/your-scenario-name/ci``` directory. Modify the Dockerfiles to install the dependencies for your preprocessing scripts.
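
As an illustration, a TDP preprocessing script might simply drop incomplete rows from a raw CSV before the data is encrypted. Everything below (function name, paths, cleaning rule) is a hypothetical sketch, not the generated boilerplate:

```python
# Hypothetical sketch of a preprocess_<tdp>.py script: drop incomplete rows
# from a raw CSV. Paths and the cleaning rule are assumptions for illustration.
import csv

def preprocess(in_path: str, out_path: str) -> int:
    """Write only rows with no missing values to out_path; return the row count."""
    with open(in_path, newline="") as fin:
        reader = csv.DictReader(fin)
        fieldnames = reader.fieldnames
        rows = [r for r in reader
                if all(v is not None and v.strip() for v in r.values())]
    with open(out_path, "w", newline="") as fout:
        writer = csv.DictWriter(fout, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

In a real scenario this script runs inside the TDP's container built from the corresponding Dockerfile, with the input and output paths dictated by your dataset layout.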

### Model saving

If the TDC intends to bring a base model file for training, a boilerplate script for saving the model is provided in the ```scenarios/your-scenario-name/src``` directory. Acting as a TDC, modify the script to save your model in an appropriate format.

A corresponding Dockerfile is provided in the ```scenarios/your-scenario-name/ci``` directory. Modify the Dockerfile to install the dependencies for your model saving script.

## Step 3: Tailor the training configuration

The folder ```scenarios/your-scenario-name/config``` contains the training configuration files for the different training frameworks. Below is a list of configuration files that you can modify to suit your training requirements:

- `train_config_template.json`: Training configuration template
- `join_config.json`: Data joining configuration (applicable only if joining multiple datasets)
- `dataset_config.json`: Dataset configuration
- `model_config.json`: Model configuration (applicable for formats other than ONNX)
- `loss_config.json`: Loss function configuration (applicable only for DL scenarios)
- `eval_config.json`: Trained model validation configuration
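
After editing these files, it can help to confirm they still parse as valid JSON before building the containers. The helper below is a convenience sketch (it assumes `python3` is available), not part of the generated scenario:

```shell
# Check that each scenario config file is still valid JSON after editing.
check_json() {
  python3 -m json.tool "$1" > /dev/null 2>&1
}

for f in config/*.json config/templates/*.json; do
  [ -e "$f" ] || continue
  if check_json "$f"; then echo "OK: $f"; else echo "INVALID: $f"; fi
done
```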

## Step 4: Deploy your scenario

Now that you have the full scenario ready, you can deploy it following the same steps as the example scenarios:

### Build scenario container images

```bash
export SCENARIO=your-scenario-name
export REPO_ROOT="$(git rev-parse --show-toplevel)"
cd $REPO_ROOT/scenarios/$SCENARIO
./ci/build.sh
```

### Deploy locally

Assuming you have cleartext access to all the datasets, you can train the model _locally_ as follows:

```bash
cd $REPO_ROOT/scenarios/$SCENARIO/deployment/local
./preprocess.sh
./save-model.sh
./train.sh
```

### Deploy on CCR

Once the training scenario executes successfully in the local environment, you can train the model inside a _Confidential Clean Room (CCR)_ as follows. This reference implementation assumes Azure as the cloud platform. Stay tuned for CCR on other cloud platforms.

#### 1. Set up environment variables

Set up the necessary environment variables for your deployment in the ```scenarios/your-scenario-name/export-variables.sh``` file, then source it so the variables are set in your current terminal (running the script in a subshell would not affect your shell's environment).
```bash
cd $REPO_ROOT/scenarios/$SCENARIO
source export-variables.sh
```

#### 2. Create resources

```bash
cd $REPO_ROOT/scenarios/$SCENARIO/deployment/azure
./1-create-storage-containers.sh
./2-create-akv.sh
```

#### 3. Contract signing

Follow the instructions in the [contract-service](https://github.com/iSPIRT/contract-service/blob/main/README.md) repository for contract signing, using your scenario's contract template in `/scenarios/$SCENARIO/contract/contract.json`.

Once the contract is signed, export the contract sequence number as an environment variable in the same terminal where you set the environment variables for the deployment.

```bash
export CONTRACT_SEQ_NO=<contract-sequence-number>
```

#### 4. Data encryption and upload

```bash
cd $REPO_ROOT/scenarios/$SCENARIO/deployment/azure
./3-import-keys.sh
./4-encrypt-data.sh
./5-upload-encrypted-data.sh
```

#### 5. Deploy CCR

```bash
./deploy.sh -c $CONTRACT_SEQ_NO -p ../../config/pipeline_config.json
```

#### 6. Monitor container logs

```bash
az container logs \
--name "depa-training-$SCENARIO" \
--resource-group "$AZURE_RESOURCE_GROUP" \
--container-name depa-training
```

You will know training has completed when the logs print "CCR Training complete!".
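
If you would rather block until that marker appears than re-run the logs command by hand, a small polling helper can wrap it. The function, polling interval, and usage below are an illustrative sketch, not part of the deployment scripts:

```shell
# Poll a log-printing command until a marker string appears in its output.
wait_for_marker() {
  marker="$1"; shift
  until "$@" 2>/dev/null | grep -q "$marker"; do
    sleep 30
  done
  echo "marker found: $marker"
}

# Hypothetical usage with the logs command above:
# wait_for_marker "CCR Training complete!" \
#   az container logs --name "depa-training-$SCENARIO" \
#     --resource-group "$AZURE_RESOURCE_GROUP" --container-name depa-training
```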

#### 7. Download and decrypt model

```bash
./6-download-decrypt-model.sh
```

The outputs will be saved to the ```scenarios/your-scenario-name/modeller/output``` directory.

## Contribute

Have a scenario that you think would be useful for others? Raise a Pull Request to contribute to the DEPA-Training project, following the [contribution guidelines](../CONTRIBUTING.md). Ensure that no personal/proprietary data, credentials or information is included in the code you submit.