JunzheShao98/Synthetic-CTR

Synthetic Data Generation for Cameroon Trauma Registry (CTR) using Synthpop and LLMs

Overview

This project focuses on generating high-quality synthetic data from the Cameroon Trauma Registry (CTR). The primary objective is to create datasets that retain statistical utility for analysis while maximizing privacy protection, enabling safe public sharing of valuable health information. We implemented and compared two distinct approaches:

  1. Synthpop: A traditional, machine learning-based method using sequential regression trees (implemented via the synthpop R package).
  2. LLM-based: A newer approach that fine-tunes large language models (LLMs) on the data, partially inspired by the GReaT method.

A comprehensive evaluation framework was used to assess both the utility (how well the synthetic data mirrors the real data statistically) and the privacy (how well individual records are protected) of the data generated by each method.


Given the promising privacy preservation metrics we've observed, this GitHub repository, with all original, sensitive data removed, will be made publicly accessible. Note that replicating all evaluation results presented in this project requires access to the original Cameroon Trauma Registry data.


1. Key Highlights of Implementation

  • Methods Compared:

    • Synthpop: Represents a well-established, relatively straightforward baseline statistical approach. It leverages sequential regression modeling (using CART) to generate synthetic data column by column. Its implementation is facilitated by a well-developed R package.
    • LLM-based Method: A novel approach using fine-tuned open-source LLMs (specifically Llama 3.1 in this implementation). This method involves converting tabular data into text, fine-tuning the model on this text, generating synthetic text, and converting it back to tabular data.
  • Data Preparation:

    • A specific subset of the CTR dataset was selected for this implementation to simplify quality assessment.
    • This subset included 2550 patient entries with non-missing values for trauma subtype.
    • A total of 52 variables, primarily demographic, were chosen based on having low rates of missing data.
      • [true_data_subset.csv] has been removed because this GitHub repository is public.
  • LLM-based Workflow Details ([figure: workflow diagram]):

    1. Serialization (Table to Text):
      • Input: A single row from the prepared CTR dataset (52 columns).
      • Conversion: Each cell value is paired with its column name (e.g., "age is 35, gender is Male"). These pairs are concatenated into a single descriptive text string, separated by commas.
      • Permutation: To mitigate the influence of column order inherent in autoregressive LLMs, the order of the "column_name is value" pairs within the string was randomly permuted 3 times for each original row entry during training dataset creation.
      • Patient ID: A random patient-identifier column was included to serve as an anchor at the start of prompts during the generation phase.
        • [output_synthetic_text_V2_M3.json]
      • The serialized text strings were formatted into a prompt-completion structure suitable for fine-tuning. A global instruction, "Complete the sentences generated by a csv file:", was used.
    2. Model Fine-tuning:
      • An open-source LLM (Llama 3.1) was fine-tuned using the prepared prompt-completion dataset.
        • [Finetune_Llama3_for_synthetic_data_generation.ipynb]
    3. Synthetic Data Generation:
      • The fine-tuned model was prompted to generate synthetic patient descriptions. Prompts typically started with the anchor phrase (e.g., patient_id is [New Random ID],).
    4. De-serialization (Text to Table):
      • The generated text descriptions were parsed back into a structured tabular format, creating the synthetic dataset.
        • Generated predictions/synthetic dataset: [generated_predictions.json]
        • [deserialization.py]
  • Evaluation Framework:

    • A comprehensive framework was established to assess both generated datasets (synthpop and LLM-based).
    • Data Utility: Measured statistical accuracy and distributional similarity (e.g., comparing histograms, correlation matrices, and the performance of predictive models trained on real vs. synthetic data).
    • Privacy: Assessed using metrics like exact match detection and resilience against Membership Inference Attacks (MIAs), which try to determine if a specific individual's record was used in the training data. We specifically utilized the sdmetrics library for evaluations.

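The serialization and de-serialization steps described above can be sketched in Python. This is a minimal illustration of the "column is value" text format with random pair permutation, not the project's actual code; the function names and the comma separator are assumptions.

```python
import random

def serialize_row(row: dict, n_permutations: int = 3, seed: int = 0) -> list:
    """Turn one tabular record into 'column is value' text strings.
    Several random permutations of the pairs are emitted per row to
    reduce the autoregressive LLM's sensitivity to column order."""
    rng = random.Random(seed)
    pairs = [f"{col} is {val}" for col, val in row.items()]
    out = []
    for _ in range(n_permutations):
        shuffled = pairs[:]
        rng.shuffle(shuffled)
        out.append(", ".join(shuffled))
    return out

def deserialize(text: str) -> dict:
    """Parse a generated 'column is value, ...' string back into a record.
    Naive split on ', ' -- values containing commas would need escaping."""
    row = {}
    for pair in text.split(", "):
        col, _, val = pair.partition(" is ")
        row[col] = val
    return row

# Example: a toy record with the patient_id anchor used in prompting.
row = {"patient_id": "A17", "age": "35", "gender": "Male"}
texts = serialize_row(row)
```

In the actual pipeline, each original row was permuted 3 times during training-set creation, and generation was prompted with the `patient_id is ...` anchor before parsing the output back into tabular form.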
2. Results

  • Synthpop Performance:

    • Utility: Demonstrated strong utility. It effectively preserved local data distributions (visualized via histograms) and pairwise variable correlations (visualized via heatmaps). Global data patterns were also well-maintained.
      • [figure: distribution plots]
    • Distinguishability tests (e.g., training a classifier to tell real from synthetic records) yielded an AUC of 0.52, barely above the 0.5 of random guessing, indicating the synthetic data was statistically very close to the real data.
    • Statistical Inference: Regression models (e.g., predicting an outcome variable) built using the synthpop data yielded results very similar to models built using the original CTR data, indicating good preservation of statistical relationships. [synthetic_data_generation.R]
    • Privacy: While synthpop avoided generating exact duplicates of real records (a minimum privacy requirement), it remains potentially vulnerable to inference attacks because its models explicitly capture relationships in the real data.
  • LLM-based Method Performance:

    • Utility: Also preserved most data structures and relationships effectively (checked via histograms and heatmaps). However, it showed slightly lower fidelity than synthpop in replicating exact pairwise correlations and regression-model fits. [synthetic_data_generation.R]

      • [figure: distribution plot]
    • Quantitative Utility Metrics: Achieved a Propensity Mean Squared Error (PMSE) of 0.1599 and a Nearest Neighbour Adversarial Accuracy (NNAA) of 0.8494, indicating reasonable (though not perfect) data utility according to these standard metrics. [synthetic_sdmetrics.ipynb]

    • Privacy: Provided demonstrably superior privacy protection compared to synthpop. This enhanced privacy is attributed partly to the inherent complexity and "black-box" nature of large language models, making it harder to reverse-engineer specific training data points.

    • Quantitative Privacy Metrics: Showed strong privacy characteristics with a high Nearest Neighbour Distance Ratio (NNDR) of 0.9271 (higher means synthetic points are relatively far from their nearest real neighbours) and a low Epsilon Identifiability Risk of 0.0966 (lower means less risk of re-identification).

    • Membership Inference Attack (MIA) Resistance: Evaluations using sdmetrics (averaged over 30 runs with randomly selected combinations of 3 key fields and 1 sensitive field) confirmed the significantly enhanced privacy performance of the LLM-based method against MIAs compared to synthpop.

      • [figure: MIA performance comparison]

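Two of the privacy checks reported above, exact-match detection and the Nearest Neighbour Distance Ratio (NNDR), can be illustrated with simplified sketches. These are common numeric formulations of the metrics, not the sdmetrics implementations used in the study, and they assume records are encoded as tuples of numbers.

```python
import math

def exact_match_rate(synthetic, real):
    """Fraction of synthetic records that exactly duplicate a real record.
    A non-zero rate violates the minimum privacy requirement noted above."""
    real_set = {tuple(r) for r in real}
    return sum(tuple(s) in real_set for s in synthetic) / len(synthetic)

def nndr(synthetic, real):
    """Nearest Neighbour Distance Ratio: for each synthetic record, the
    distance to its nearest real record divided by the distance to its
    second-nearest real record, averaged over all synthetic records.
    Values near 1 mean synthetic points do not sit suspiciously close
    to any single real record (better privacy)."""
    ratios = []
    for s in synthetic:
        d = sorted(math.dist(s, r) for r in real)
        if d[1] > 0:
            ratios.append(d[0] / d[1])
    return sum(ratios) / len(ratios)

# Toy example: one synthetic point equidistant from two real points (ratio 1.0)
# and one that duplicates a real record (ratio 0.0, and an exact match).
real = [(0.0, 0.0), (10.0, 10.0), (20.0, 0.0)]
synthetic = [(5.0, 5.0), (0.0, 0.0)]
```

The NNDR of 0.9271 reported for the LLM-based method was computed over the full 52-variable dataset with the study's evaluation tooling; this sketch only conveys the intuition behind the number.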
3. Conclusions

  • Effectiveness: Both synthpop and the LLM-based method are viable and effective techniques for generating synthetic versions of the Cameroon Trauma Registry data.

  • Trade-off: The primary finding is a clear trade-off between statistical fidelity (utility) and privacy protection between the two methods.

    • Synthpop: Excels at replicating the statistical properties of the original data, but carries residual privacy risks because its CART models explicitly fit relationships in the real data.
    • LLM-based Method: Offers substantially improved privacy guarantees, particularly against attacks like Membership Inference Attacks. This makes it a more robust choice when privacy is the major concern.
  • Controlling the Trade-off: The study suggests that for LLM-based methods, factors like the number of fine-tuning epochs can influence the trade-off between utility and privacy. Fine-tuning for too long might increase fidelity but potentially leak more private information, requiring careful calibration.

  • Future Potential: The LLM-based method demonstrates significant potential as an advanced technique for creating safe, shareable synthetic health datasets. Further research includes optimization (e.g., exploring different LLM architectures, fine-tuning strategies, and serialization techniques) and validation on diverse datasets.

