6 changes: 6 additions & 0 deletions ci/Dockerfile.train
@@ -15,6 +15,12 @@ RUN python3.9 get-pip.py
RUN pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cpu
RUN pip3 --default-timeout=1000 install pyspark pandas opacus onnx onnx2pytorch scikit-learn scipy matplotlib

## ADDED: For language tasks
RUN pip3 install transformers datasets peft

## ADDED: For vision tasks
RUN pip3 install --default-timeout=100 opencv-python pillow monai tqdm

RUN apt-get install -y jq

# Install contract ledger client
3 changes: 3 additions & 0 deletions scenarios/llm-finetune/.gitignore
@@ -0,0 +1,3 @@
/data/
/model/
/output/
74 changes: 74 additions & 0 deletions scenarios/llm-finetune/README.md
@@ -0,0 +1,74 @@
# LLM Fine-tuning with Differential Privacy

This scenario demonstrates how a Large Language Model (LLM) can be fine-tuned for medical question answering on the join of multiple (potentially PII-sensitive) datasets. The Training Data Consumer (TDC) building the model enters into a contractual agreement with multiple Training Data Providers (TDPs), and the model is fine-tuned on the joined datasets in a data-blind manner within the CCR, with privacy guarantees maintained using differential privacy. For demonstration purposes, this scenario uses open-source models and datasets from HuggingFace.

The end-to-end training pipeline consists of the following phases:

1. Data pre-processing
2. Data packaging, encryption and upload
3. Model packaging, encryption and upload
4. Encryption key import with key release policies
5. Deployment and execution of CCR
6. Model decryption

## Build container images

Build container images required for this sample as follows:

```bash
cd scenarios/llm-finetune
./ci/build.sh
```

This script builds the following container images:

- `preprocess-medqa, preprocess-chatdoctor, preprocess-medquad`: Containers that pre-process the text datasets.
- `ccr-model-save`: Container that saves the base model to be fine-tuned.

## Data pre-processing and de-identification

Acting as the TDP for each dataset, run the following script to de-identify the datasets:

```bash
cd scenarios/llm-finetune/deployment/docker
./preprocess.sh
```

This script downloads the three HuggingFace text datasets and pre-processes them ahead of the training process.
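
The loader scripts are COPY'd into the containers but not shown in this diff. A minimal sketch, assuming only the config keys visible in `medquad_config.json` (`SAVE_DIR`, `FORMAT`), of the save step such a script might perform; `save_dataset` is a hypothetical name, and in the real script the rows would come from `datasets.load_dataset(cfg["DATASET_NAME"])` rather than being passed in:

```python
import csv
import os


def save_dataset(cfg: dict, rows: list) -> str:
    """Write pre-processed rows to SAVE_DIR in the configured FORMAT.

    Only the csv branch is sketched here; `rows` stands in for the
    HuggingFace dataset the real script would download.
    """
    assert cfg["FORMAT"] == "csv", "only csv output is sketched"
    os.makedirs(cfg["SAVE_DIR"], exist_ok=True)
    path = os.path.join(cfg["SAVE_DIR"], "data.csv")
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)
    return path
```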

## Prepare model for training

Next, acting as a TDC, load and save a sample model using the following script:

```bash
./save-model.sh
```

This script will save the base model within `scenarios/llm-finetune/model/`.
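
The `Q_PREC` field in `model_repo_config.json` selects the weight quantization used when loading the base model. A small sketch of how the script might translate it into `from_pretrained` keyword arguments; the helper name is an assumption, since the real `load_base_model.py` is not shown in this diff:

```python
def quantization_kwargs(cfg: dict) -> dict:
    """Translate the Q_PREC config field into keyword arguments for
    AutoModelForCausalLM.from_pretrained (e.g. load_in_4bit=True)."""
    q_prec = cfg.get("Q_PREC")
    if q_prec in ("load_in_4bit", "load_in_8bit"):
        return {q_prec: True}
    # No (or unknown) quantization setting: load in full precision.
    return {}
```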

## Deploy locally

Assuming you have cleartext access to all the datasets, you can fine-tune the model as follows:

```bash
./train.sh
```

The script joins the datasets and fine-tunes the model using a pipeline configuration defined in [pipeline_config.json](./config/pipeline_config.json). The fine-tuning process uses:

- LoRA (Low-Rank Adaptation) for parameter-efficient fine-tuning
- Differential Privacy via Opacus
- Weight quantization for memory efficiency
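
Opacus handles the differential-privacy mechanics automatically during training. As a stand-alone illustration (not the Opacus API), the per-step behavior can be sketched with numpy, using the `MAX_GRAD_NORM` from `pipeline_config.json` and a noise multiplier like the `2.0` required by the contract:

```python
import numpy as np


def dp_sgd_step(per_sample_grads, max_grad_norm=1.0, noise_multiplier=2.0, rng=None):
    """One DP-SGD aggregation step: clip each per-sample gradient to
    max_grad_norm, sum, add calibrated Gaussian noise, and average."""
    if rng is None:
        rng = np.random.default_rng(0)
    clipped = []
    for g in per_sample_grads:
        norm = np.linalg.norm(g)
        # Scale down any gradient whose L2 norm exceeds the clipping bound.
        clipped.append(g * min(1.0, max_grad_norm / (norm + 1e-12)))
    total = np.sum(clipped, axis=0)
    noise = rng.normal(0.0, noise_multiplier * max_grad_norm, size=total.shape)
    return (total + noise) / len(per_sample_grads)
```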

If all goes well, you should see training progress output similar to:

```
Epoch: 1 | Step: 50 | Train loss: 2.342
Epoch: 1 | Step: 100 | Train loss: 2.123
...
Epoch 1 completed. Average loss: 2.156
```

The fine-tuned model will be saved under `scenarios/llm-finetune/output/`.

## Deploy on CCR
10 changes: 10 additions & 0 deletions scenarios/llm-finetune/ci/Dockerfile.chatdoctor
@@ -0,0 +1,10 @@
FROM ubuntu:20.04

ENV DEBIAN_FRONTEND="noninteractive"

RUN apt-get update && apt-get -y upgrade \
    && apt-get install -y python3 python3-pip

RUN pip3 install datasets pandas

COPY load_chatdoctor_dataset.py load_chatdoctor_dataset.py
10 changes: 10 additions & 0 deletions scenarios/llm-finetune/ci/Dockerfile.medqa
@@ -0,0 +1,10 @@
FROM ubuntu:20.04

ENV DEBIAN_FRONTEND="noninteractive"

RUN apt-get update && apt-get -y upgrade \
    && apt-get install -y python3 python3-pip

RUN pip3 install datasets pandas

COPY load_medqa_dataset.py load_medqa_dataset.py
10 changes: 10 additions & 0 deletions scenarios/llm-finetune/ci/Dockerfile.medquad
@@ -0,0 +1,10 @@
FROM ubuntu:20.04

ENV DEBIAN_FRONTEND="noninteractive"

RUN apt-get update && apt-get -y upgrade \
    && apt-get install -y python3 python3-pip

RUN pip3 install datasets pandas

COPY load_medquad_dataset.py load_medquad_dataset.py
21 changes: 21 additions & 0 deletions scenarios/llm-finetune/ci/Dockerfile.modelsave
@@ -0,0 +1,21 @@
FROM ubuntu:20.04

ENV DEBIAN_FRONTEND="noninteractive"

RUN apt-get update && apt-get -y upgrade \
&& apt-get install -y curl \
&& apt-get install -y python3.9 python3.9-dev python3.9-distutils

## Install pip
RUN curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
RUN python3.9 get-pip.py

# Install CPU-only version of PyTorch
RUN pip3 install torch --index-url https://download.pytorch.org/whl/cpu
RUN pip3 install transformers peft bitsandbytes

COPY load_base_model.py load_base_model.py
6 changes: 6 additions & 0 deletions scenarios/llm-finetune/ci/build.sh
@@ -0,0 +1,6 @@
#!/bin/bash

docker build -f ci/Dockerfile.medqa src -t preprocess-medqa:latest
docker build -f ci/Dockerfile.chatdoctor src -t preprocess-chatdoctor:latest
docker build -f ci/Dockerfile.medquad src -t preprocess-medquad:latest
docker build -f ci/Dockerfile.modelsave src -t ccr-model-save:latest
6 changes: 6 additions & 0 deletions scenarios/llm-finetune/ci/push-containers.sh
@@ -0,0 +1,6 @@
#!/bin/bash

containers=("preprocess-medqa:latest" "preprocess-chatdoctor:latest" "preprocess-medquad:latest" "ccr-model-save:latest")
for container in "${containers[@]}"
do
    docker tag "$container" "$CONTAINER_REGISTRY/$container"
    docker push "$CONTAINER_REGISTRY/$container"
done
5 changes: 5 additions & 0 deletions scenarios/llm-finetune/config/chatdoctor_config.json
@@ -0,0 +1,5 @@
{
"DATASET_NAME": "lavita/ChatDoctor-HealthCareMagic-100k",
"SAVE_DIR": "/mnt/output/chatdoctor/",
"FORMAT": "csv"
}
6 changes: 6 additions & 0 deletions scenarios/llm-finetune/config/medqa_config.json
@@ -0,0 +1,6 @@
{
"DATASET_NAME": "Malikeh1375/medical-question-answering-datasets",
"DATASET_SPLIT": "all-processed",
"SAVE_DIR": "/mnt/output/medqa/",
"FORMAT": "csv"
}
5 changes: 5 additions & 0 deletions scenarios/llm-finetune/config/medquad_config.json
@@ -0,0 +1,5 @@
{
"DATASET_NAME": "keivalya/MedQuad-MedicalQnADataset",
"SAVE_DIR": "/mnt/output/medquad/",
"FORMAT": "csv"
}
10 changes: 10 additions & 0 deletions scenarios/llm-finetune/config/model_repo_config.json
@@ -0,0 +1,10 @@
{
"MODEL_NAME": "facebook/opt-350m",
"REPO_READ_TOKEN": "",
"SAVE_DIR": "/mnt/model",
"Q_PREC": "load_in_4bit",
"REPO_WRITE_TOKEN": "",
"REPO_URL": "https://huggingface.co/your_username/your_model_repo",
"MODEL_DESCRIPTION": "A fine-tuned model for medical question answering.",
"MODEL_LICENSE": "apache-2.0"
}
72 changes: 72 additions & 0 deletions scenarios/llm-finetune/config/pipeline_config.json
@@ -0,0 +1,72 @@
{
"pipeline": [
{
"name": "TextJoin",
"config": {
"datasets": [
{
"id": "19517ba8-bab8-11ed-afa1-0242ac120002",
"name": "medqa",
"file": "medical-question-answering-datasets.csv",
"select_variables": ["input", "output"],
"num_rows": 200,
"mount_path": "/mnt/remote/medqa/"
},
{
"id": "216d5cc6-bab8-11ed-afa1-0242ac120002",
"name": "chatdoctor",
"file": "ChatDoctor-HealthCareMagic-100k.csv",
"select_variables": ["input", "output"],
"num_rows": 200,
"mount_path": "/mnt/remote/chatdoctor/"
},
{
"id": "2830a144-bab8-11ed-afa1-0242ac120002",
"name": "medquad",
"file": "MedQuad-MedicalQnADataset.csv",
"select_variables": ["Question", "Answer"],
"num_rows": 200,
"mount_path": "/mnt/remote/medquad/"
}
],
"joined_dataset": {
"output_folder": "/tmp/",
"output_file": "medqa_chatdoctor_medquad_joined.csv"
}
}
},
{
"name": "PrivateLLMFineTune",
"config": {
"device": "cpu",
"saved_model_dir": "/mnt/remote/model",
"model_name": "facebook/opt-350m",
"trained_model_output_path": "/mnt/remote/output",
"input_dataset_path": "/tmp/medqa_chatdoctor_medquad_joined.csv",
"sample_prompts": [
"What is the treatment for diabetes?",
"How can I manage my hypertension?",
"What are the symptoms of asthma?",
"What is the best diet for heart health?"
],
"MODEL_REPO_CONFIG": "location/to/model_repo_config",
"MAX_TOKENS": 64,
"BATCH_SIZE": 4,
"NUM_EPOCHS": 1,
"Q_PRECISION": null,
"LEARNING_RATE": 5e-5,
"EPSILON": 7.5,
"DELTA": 1e-5,
"MAX_GRAD_NORM": 1.0,
"MAX_PHYSICAL_BATCH_SIZE": 4,
"LORA_RANK": 8,
"LORA_ALPHA": 32,
"LORA_TARGET_MODULES": null,
"LORA_DROPOUT": 0.05,
"LORA_BIAS": "none",
"LOG_FREQ_STEPS": 50
}
}
]
}
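
The `TextJoin` stage configured above can be sketched in plain Python (a hypothetical stand-alone illustration; the actual CCR pipeline implementation is not part of this diff): each dataset's `select_variables` pair is normalized to common `input`/`output` columns, truncated to `num_rows`, and the rows are concatenated into one CSV.

```python
import csv
import itertools


def text_join(datasets, out_path):
    """Concatenate several QA CSVs into one file with unified
    'input'/'output' columns, taking at most num_rows from each."""
    with open(out_path, "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=["input", "output"])
        writer.writeheader()
        for ds in datasets:
            # select_variables names this dataset's (question, answer) columns.
            in_col, out_col = ds["select_variables"]
            with open(ds["file"], newline="") as f:
                reader = csv.DictReader(f)
                for row in itertools.islice(reader, ds["num_rows"]):
                    writer.writerow({"input": row[in_col], "output": row[out_col]})
```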

92 changes: 92 additions & 0 deletions scenarios/llm-finetune/contract/contract.json
@@ -0,0 +1,92 @@
{
"id": "f4f72a88-bab1-11ed-afa1-0242ac120002",
"schemaVersion": "0.1",
"startTime": "2023-03-14T00:00:00.000Z",
"expiryTime": "2024-03-14T00:00:00.000Z",
"tdc": "",
"tdps": [],
"ccrp": "did:web:ccrprovider.github.io",
"datasets": [
{
"id": "19517ba8-bab8-11ed-afa1-0242ac120002",
"name": "medqa",
"url": "https://ccrcontainer.blob.core.windows.net/medqa/data.img",
"provider": "",
"key": {
"type": "azure",
"properties": {
"kid": "MEDQAFilesystemEncryptionKey",
"authority": {
"endpoint": "sharedneu.neu.attest.azure.net"
},
"endpoint": ""
}
}
},
{
"id": "216d5cc6-bab8-11ed-afa1-0242ac120002",
"name": "chatdoctor",
"url": "https://ccrcontainer.blob.core.windows.net/chatdoctor/data.img",
"provider": "",
"key": {
"type": "azure",
"properties": {
"kid": "CHATDOCTORFilesystemEncryptionKey",
"authority": {
"endpoint": "sharedneu.neu.attest.azure.net"
},
"endpoint": ""
}
}
},
{
"id": "2830a144-bab8-11ed-afa1-0242ac120002",
"name": "medquad",
"url": "https://ccrcontainer.blob.core.windows.net/medquad/data.img",
"provider": "",
"key": {
"type": "azure",
"properties": {
"kid": "MEDQUADFilesystemEncryptionKey",
"authority": {
"endpoint": "sharedneu.neu.attest.azure.net"
},
"endpoint": ""
}
}
}
],
"purpose": "TRAINING",
"constraints": [
{
"privacy": [
{
"dataset": "19517ba8-bab8-11ed-afa1-0242ac120002",
"epsilon_threshold": "1.5",
"noise_multiplier": "2.0",
"delta": "0.01",
"epochs_per_report": "2"
},
{
"dataset": "216d5cc6-bab8-11ed-afa1-0242ac120002",
"epsilon_threshold": "1.5",
"noise_multiplier": "2.0",
"delta": "0.01",
"epochs_per_report": "2"
},
{
"dataset": "2830a144-bab8-11ed-afa1-0242ac120002",
"epsilon_threshold": "1.5",
"noise_multiplier": "2.0",
"delta": "0.01",
"epochs_per_report": "2"
}
]
}
],
"terms": {
"payment": {},
"revocation": {}
}
}

@@ -0,0 +1,7 @@
services:
model_save:
image: ${CONTAINER_REGISTRY:+$CONTAINER_REGISTRY/}ccr-model-save:latest
volumes:
- $MODEL_OUTPUT_PATH:/mnt/model
- $CONFIG_PATH:/mnt/config/model_repo_config.json
command: ["python3.9", "load_base_model.py"]
@@ -0,0 +1,19 @@
services:
medqa:
image: ${CONTAINER_REGISTRY:+$CONTAINER_REGISTRY/}preprocess-medqa:latest
volumes:
- $MEDQA_OUTPUT_PATH:/mnt/output/medqa
- $CONFIG_PATH:/mnt/config/
command: ["python3", "load_medqa_dataset.py"]
chatdoctor:
image: ${CONTAINER_REGISTRY:+$CONTAINER_REGISTRY/}preprocess-chatdoctor:latest
volumes:
- $CHATDOCTOR_OUTPUT_PATH:/mnt/output/chatdoctor
- $CONFIG_PATH:/mnt/config/
command: ["python3", "load_chatdoctor_dataset.py"]
medquad:
image: ${CONTAINER_REGISTRY:+$CONTAINER_REGISTRY/}preprocess-medquad:latest
volumes:
- $MEDQUAD_OUTPUT_PATH:/mnt/output/medquad
- $CONFIG_PATH:/mnt/config/
command: ["python3", "load_medquad_dataset.py"]