This project provides a privacy-preserving solution for record matching that eliminates the need for data holders to transfer their data externally in any form. By combining a federated learning approach using Support Vector Machines with a reference set, it achieves matching quality comparable to non-federated setups for plain record linkage.
federated_data_generator.py: Generates data for federated learning simulator
local_data_generator.py: Generates data for local training (no Federated Learning here)
prepare_job_config.sh: Generates the configuration files
run_experiment_simulator.sh: Runs the FL simulator
test_only_data_generator.py: Creates test datasets. Suitable for large test files
tester.py: Tests saved models on selected datasets
jobs: Contains Python and config files for the clients and server in the FL simulator
workspace: Contains files used during FL (such as configs and Python code) and files created after FL for the server and each client
This project was implemented using NVFlare.
For more information about the framework, you may refer to the Scikit-learn SVM example in the NVIDIA NVFlare repository, which provides a detailed walkthrough of how to use Scikit-learn's SVM with NVFlare.
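The idea behind federated SVM training in this setup can be sketched as follows: each client trains a local SVM and shares only its support vectors (not the raw data), and the server fits a global SVM on the pooled support vectors. This is a minimal illustration with synthetic data and plain scikit-learn calls; NVFlare orchestrates the real exchange.

```python
# Sketch of federated SVM via support-vector exchange (synthetic data).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def make_client_data(n=200):
    # Toy binary-classification data standing in for a client's records.
    X = rng.normal(size=(n, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    return X, y

# Local round: each client fits an SVM and extracts its support vectors.
client_updates = []
for _ in range(3):
    X, y = make_client_data()
    clf = SVC(kernel="rbf").fit(X, y)
    sv_X = clf.support_vectors_          # support vectors only, not raw data
    sv_y = y[clf.support_]               # their labels
    client_updates.append((sv_X, sv_y))

# Server round: pool the support vectors and fit the global model.
global_X = np.concatenate([u[0] for u in client_updates])
global_y = np.concatenate([u[1] for u in client_updates])
global_model = SVC(kernel="rbf").fit(global_X, global_y)
```

Only the support vectors leave each client, which is what makes the scheme attractive for privacy-sensitive record linkage.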
There are two options for data generation, depending on whether the data will be used for local training or in a federated environment.
For the first option you can run the script from the command line with the following command:
python local_data_generator.py --path [pathA, pathB, pathRef] --rows [rows] --metric [metric] --threshold [threshold] --filename [filename]
For example, to read 1000 rows from 'Data/BIASA_200000.csv', 'Data/BIASB_200000.csv', and 'Data/reference_set.csv', calculate distances using the cosine metric, and save the resulting dataset to '/tmp/dataset/data.csv', you would use the following command:
python local_data_generator.py --path Data/BIASA_200000.csv Data/BIASB_200000.csv Data/reference_set.csv --rows 1000 --metric cosine --filename /tmp/dataset/data.csv
For the second option (i.e., the dataset for federated learning), the data generation process is integrated directly into the federated_data_generator.py file. To change parameters such as the path, number of rows, metric, or filename, edit them directly in the code.
- The script reads data from three CSV files: two data files with the real records and a reference set with random names. The data files contain columns 'id', 'FirstName', 'LastName', and 'MiddleName'. The reference set contains a single column, 'name', holding a full name.
- The script calculates the Levenshtein distance (edit distance) between the names in the data files and the names in the reference set in order to create numerical data points.
- The script then calculates distance matrices using the cdist function from scipy.
- Next, labels are assigned: a pair of data points is labeled 1 when they share the same 'id' (they represent the same entity) and 0 when their 'id's differ (they are different entities).
- The script saves the resulting dataset to a CSV file located at /tmp/dataset/data.csv.
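The steps above can be sketched as follows. The names, record ids, and the Levenshtein helper are illustrative stand-ins (the real script reads the CSVs passed via --path and may use a different edit-distance implementation):

```python
# Sketch of the data-generation pipeline: edit-distance embeddings against a
# reference set, pairwise distances via scipy's cdist, and id-based labels.
import numpy as np
from scipy.spatial.distance import cdist

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

reference = ["Alice Smith", "Bob Jones", "Carol White"]   # reference set names
records_a = [(1, "Alyce Smith"), (2, "Bob Jones")]        # (id, full name)
records_b = [(1, "Alice Smyth"), (3, "Dan Brown")]

# 1. Embed every record as its vector of edit distances to the reference set.
emb_a = np.array([[levenshtein(n, r) for r in reference] for _, n in records_a])
emb_b = np.array([[levenshtein(n, r) for r in reference] for _, n in records_b])

# 2. Pairwise distances between the two parties' embeddings (cosine metric).
dists = cdist(emb_a, emb_b, metric="cosine")

# 3. Label a pair 1 if the ids match (same entity), else 0.
labels = np.array([[int(ia == ib) for ib, _ in records_b] for ia, _ in records_a])
```

Because the embeddings only encode distances to shared reference names, the parties never need to exchange the original records.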
A script is used to automatically generate the configuration files for a specific setting, following the approach suggested in the NVFlare documentation. This script simplifies the process and eliminates the need for manual copying and modification of the files.
You can run the script with the following command:
bash prepare_job_config.sh

Please note that this script will recreate the jobs/sklearn_svm_base/ folder for each client and for the server. For instance, if you set the number of clients in this bash file to 2, it will create a new folder under the jobs/ directory named sklearn_svm_2_uniform.
The newly created folder will contain the same files and code as in the jobs/sklearn_svm_base/ directory. If you wish to make more detailed modifications, such as changing the model kernel for the client and server, you will need to modify the jobs/sklearn_svm_base/app/config/config_fed_server.json file.
In this example, we chose the Radial Basis Function (RBF) kernel to experiment with two clients under the uniform data split.
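For reference, the kernel is set in the server config's assembler component. The fragment below follows the component names used in the upstream NVFlare sklearn-svm example; the exact names in this repository's jobs/sklearn_svm_base/app/config/config_fed_server.json may differ:

```json
{
  "components": [
    {
      "id": "svm_assembler",
      "path": "svm_assembler.SVMAssembler",
      "args": {
        "kernel": "rbf"
      }
    }
  ]
}
```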
Below is a sample config for site-1, saved to ./jobs/sklearn_svm_2_uniform/app_site-1/config/config_fed_client.json:
{
  "format_version": 2,
  "executors": [
    {
      "tasks": [
        "train"
      ],
      "executor": {
        "id": "Executor",
        "path": "nvflare.app_opt.sklearn.sklearn_executor.SKLearnExecutor",
        "args": {
          "learner_id": "svm_learner"
        }
      }
    }
  ],
  "task_result_filters": [],
  "task_data_filters": [],
  "components": [
    {
      "id": "svm_learner",
      "path": "svm_learner.SVMLearner",
      "args": {
        "data_path": "/tmp/dataset/data.csv",
        "train_start": 250000,
        "train_end": 500000,
        "valid_start": 0,
        "valid_end": 250000
      }
    }
  ]
}

Differential Privacy can also be added using Randomized Response. This is applied to the labels of the support vectors after local training, before they are sent to the server. The code for this option is currently commented out in the /jobs/sklearn_svm_base/app/custom/svm_learner.py file.
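Randomized response on binary labels can be sketched as follows: each label is kept with probability p and flipped otherwise, giving plausible deniability before the labels leave the client. This is an assumption-laden illustration; the commented-out code in svm_learner.py may differ in its details.

```python
# Sketch of randomized response applied to binary support-vector labels.
import numpy as np

def randomized_response(labels, p=0.9, rng=None):
    """Keep each 0/1 label with probability p; flip it otherwise."""
    rng = rng or np.random.default_rng()
    keep = rng.random(labels.shape) < p
    return np.where(keep, labels, 1 - labels)

# Example: perturb support-vector labels before sending them to the server.
sv_labels = np.array([0, 1, 1, 0, 1])
noisy_labels = randomized_response(sv_labels, p=0.9, rng=np.random.default_rng(0))
```

For binary labels, choosing p = e^ε / (1 + e^ε) makes this mechanism ε-differentially private with respect to each label.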
We can run the FL simulator with two clients under the uniform data split with
nvflare simulator ./jobs/sklearn_svm_2_uniform -w ./workspace -n 2 -t 2
or
bash run_experiment_simulator.sh
You can monitor the Precision and Recall metrics of the resulting global model through the clients' logs and TensorBoard. To launch TensorBoard, execute the following command:
python3 -m tensorboard.main --logdir='workspace'