Digital Twin Simulation

This repository contains code for simulating digital twins with Large Language Models (LLMs), in order to reproduce the experiments in the Twin-2K-500 Mega Study (paper link). The project creates and simulates digital twins based on persona profiles and survey responses.

Overview

The digital twin simulation system creates virtual representations of individuals based on their survey responses and simulates their behavior in response to new survey questions. The system uses LLMs to generate realistic responses that maintain consistency with the original persona profiles.

The complete pipeline consists of:

Survey Processing: Converting Qualtrics QSF files and CSV responses into structured JSON
Persona Processing: Converting persona data to text format
Question Processing: Converting questions to text prompts
Simulation Input Creation: Combining personas and questions for LLM simulation
LLM Simulation: Running the actual digital twin simulations
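
As a rough illustration of the Simulation Input Creation step, the sketch below joins one persona text and one question text into a single prompt. The section labels echo the wording of the system instruction shown later in this README, but the function name `build_prompt` and the exact layout are assumptions, not the repository's actual format.

```python
def build_prompt(persona_text: str, question_text: str) -> str:
    """Combine a persona profile and a new question (hypothetical layout)."""
    return (
        "Persona Profile:\n"
        f"{persona_text.strip()}\n\n"
        "New Survey Question:\n"
        f"{question_text.strip()}\n"
    )


# Made-up persona and question text, for illustration only.
prompt = build_prompt("Q: Age? A: 34", "How often do you recycle?")
print(prompt)
```

Keeping the persona first and the new question last means the same persona text can be reused unchanged across every question in a study.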

Project Structure

.
├── configs/                   # Configuration files and Snakemake workflow
│   ├── idea_generation.yaml  # Configuration for idea generation study
│   ├── story_beliefs.yaml    # Configuration for story beliefs study
│   └── README.md             # Configuration documentation
├── Snakefile                 # Main Snakemake workflow definition
├── data/                     # Data organization by study
│   ├── idea_generation/      # Idea generation study data
│   │   ├── raw_data/         # Raw QSF and CSV files
│   │   ├── wave_qsf_json/    # Processed survey templates
│   │   └── response_json/    # Processed survey responses
│   ├── story_beliefs/        # Story beliefs study data
│   └── mega_persona_json/    # Persona profile data
├── processing_qualtrics_qsf/ # QSF file processing
│   ├── flow_elements_types/  # Flow element parsers
│   ├── question_types/       # Question type processors
│   └── parse_qsf.py         # Main QSF parser
├── processing_qualtrics_csv/ # CSV response processing
├── text_simulation/          # Main simulation code
│   ├── text_personas/        # Persona profile data (generated)
│   ├── text_questions/       # Survey questions (generated)
│   ├── text_simulation_input/ # Combined input files (generated)
│   └── text_simulation_output/ # Simulation results (generated)
├── evaluation/              # Evaluation folder  
├── scripts/                 # Utility scripts
│   └── run_pipeline.sh      # Main pipeline execution script
└── cache/                   # Cached data

Requirements

Python 3.11.7 or higher
Poetry for dependency management
Snakemake for workflow management
OpenAI API key (for LLM simulation)

Installation

Clone the repository:

git clone [repository-url]
cd Digital-Twin-Mega-Study

Install dependencies using Poetry:

poetry install

Download the persona dataset:

poetry run python download_dataset.py

Set up your OpenAI API key:

# Create a .env file with your API key (this overwrites any existing .env)
echo "OPENAI_API_KEY=your_actual_api_key_here" > .env
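
Before launching a long run, it can help to confirm that the key is actually visible. The sketch below is a minimal, standard-library-only check. It assumes a simple KEY=VALUE `.env` format, and the helper names `load_env_file` and `api_key_is_configured` are hypothetical. How the repository itself loads `.env` is not shown here.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for line in env_path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def api_key_is_configured(path: str = ".env") -> bool:
    """True if a non-placeholder key is in the environment or the .env file."""
    key = os.environ.get("OPENAI_API_KEY") or load_env_file(path).get("OPENAI_API_KEY", "")
    return bool(key) and key != "your_actual_api_key_here"


print("API key configured:", api_key_is_configured())
```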

Data Organization

Setting Up Your Study Data

For each study (e.g., "idea_generation", "story_beliefs"), organize your data as follows:

Create the study folder structure:

mkdir -p data/your_study_name/raw_data
mkdir -p data/your_study_name/wave_qsf_json
mkdir -p data/your_study_name/response_json

Place your raw files:
- Put your QSF files in: data/your_study_name/raw_data/
- Put your response CSV files in: data/your_study_name/raw_data/

Create a configuration file:
- Copy configs/idea_generation.yaml to configs/your_study_name.yaml
- Update the file paths to match your study structure
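
If you set up several studies, the folder creation can be scripted. The sketch below mirrors the three `mkdir -p` commands above. The function name `create_study_folders` and the `root` parameter are illustrative assumptions and not part of the repository.

```python
from pathlib import Path

# Sub-folders listed in the setup steps above.
SUBFOLDERS = ("raw_data", "wave_qsf_json", "response_json")


def create_study_folders(study_name: str, root: str = ".") -> list[Path]:
    """Create data/<study_name>/<subfolder> for each expected sub-folder."""
    created = []
    for sub in SUBFOLDERS:
        folder = Path(root) / "data" / study_name / sub
        folder.mkdir(parents=True, exist_ok=True)  # same effect as mkdir -p
        created.append(folder)
    return created
```

Calling `create_study_folders("your_study_name")` from the project root is safe to repeat, because existing folders are left untouched.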

Pipeline Execution Order

The workflow is organized as shown below and is managed with Snakemake.

graph TD
    A["QSF File<br/>(Survey Template)"] --> B["process_qsf"]
    C["CSV File<br/>(Survey Responses)"] --> D["process_csv"]
    E["Persona JSON<br/>(mega_persona_json)"] --> F["convert_personas"]
    
    B --> G["JSON Template<br/>(wave_qsf_json/)"]
    G --> D
    D --> H["Individual Response JSONs<br/>(response_json/)"]
    
    F --> I["Text Personas<br/>(text_personas/)"]
    H --> J["convert_questions"]
    J --> K["Text Questions<br/>(text_questions/)"]
    
    I --> L["create_simulation_input"]
    K --> L
    L --> M["Simulation Input<br/>(text_simulation_input/)"]
    
    M --> N["run_llm_simulation"]
    N --> O["Final Results<br/>(text_simulation_output/)"]
    
    subgraph "Qualtrics Processing"
        B
        D
    end
    
    subgraph "Text Simulation"
        F
        J
        L
        N
    end
    
    style A fill:#e1f5fe
    style C fill:#e1f5fe
    style E fill:#e1f5fe
    style O fill:#c8e6c9
    style B fill:#fff3e0
    style D fill:#fff3e0
    style F fill:#f3e5f5
    style J fill:#f3e5f5
    style L fill:#f3e5f5
    style N fill:#f3e5f5

Pipeline Steps:

process_qsf: Convert QSF → JSON template
process_csv: Process CSV responses → Individual response JSONs
convert_personas: Convert persona data → Text format
convert_questions: Convert questions → Text prompts
create_simulation_input: Combine personas + questions → Simulation inputs
run_llm_simulation: Execute LLM simulation → Final results
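
The ordering constraints between these steps can be checked with a small topological sort. The sketch below transcribes the dependencies from the diagram above into a plain dictionary. It only illustrates the ordering and is not how Snakemake or the Snakefile represents the rules.

```python
from graphlib import TopologicalSorter

# Each rule maps to the rules whose outputs it consumes (read off the diagram).
DEPENDENCIES = {
    "process_qsf": set(),
    "process_csv": {"process_qsf"},
    "convert_personas": set(),
    "convert_questions": {"process_csv"},
    "create_simulation_input": {"convert_personas", "convert_questions"},
    "run_llm_simulation": {"create_simulation_input"},
}

# Any valid order places every rule after all of its prerequisites.
order = list(TopologicalSorter(DEPENDENCIES).static_order())
print(order)
```

Several valid orders exist, since `convert_personas` is independent of the Qualtrics processing steps and can run at any point before `create_simulation_input`.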

Configuration Files

Each study requires a YAML configuration file in the configs/ directory. Here's the structure using idea_generation.yaml as an example:

# QSF Processing Configuration
qsf_to_json:
  input_file: data/idea_generation/raw_data/Measures_of_Creativity_Digital_Twins_Toubia.qsf
  output_file: data/idea_generation/wave_qsf_json/Measures_of_Creativity_Digital_Twins_Toubia_parsed.json
  exclude_blocks: ["consent and screening-digital twins"]  # Optional: blocks to exclude

# CSV Response Processing Configuration
csv_to_fill_answers:
  input_survey_template: data/idea_generation/wave_qsf_json/Measures_of_Creativity_Digital_Twins_Toubia_parsed.json
  input_csv_file: data/idea_generation/raw_data/response.csv
  output_dir: data/idea_generation/response_json
  limit: -1  # -1 for all participants, or specify a number

# Text Simulation Configuration
text_simulation:
  # Convert personas to text format
  personas_to_texts:
    persona_json_dir: data/mega_persona_json/mega_persona
    output_text_dir: text_simulation/text_personas
    persona_variant: full

  # Convert questions to text prompts
  questions_to_texts:
    input_path: data/idea_generation/response_json
    output_dir: text_simulation/text_questions
    include_reasoning: true  # true for creative tasks, false for factual

  # Create simulation input files
  create_text_simulation_input:
    persona_text_dir: data/full_persona_text
    question_prompts_dir: text_simulation/text_questions
    output_combined_prompts_dir: text_simulation/text_simulation_input

  # LLM Simulation Configuration
  LLM_simulation:
    input_dir: text_simulation/text_simulation_input
    output_dir: text_simulation/text_simulation_output
    question_json_base_dir: data/idea_generation/response_json
    output_updated_questions_dir: text_simulation/text_simulation_output/response_json_llm_imputed
    system_instruction: |
      You are an AI assistant. Your task is to answer the 'New Survey Question' as if you are the person described in the 'Persona Profile' (which consists of their past survey responses). 
      Adhere to the persona by being consistent with their previous answers and stated characteristics. 
      Follow all instructions provided for the new question carefully regarding the format of your answer.
    LLM_config:
      provider: "openai"
      model_name: "gpt-4.1"          # Model to use
      temperature: 1.0               # 1.0 for creative tasks, 0.0 for factual
      max_tokens: 16384
      max_retries: 10
    num_workers: 300                 # Parallel processing workers
    force_regenerate: false          # Set to true to overwrite existing outputs
    max_personas: -1                 # -1 for all, or specify number

Configuration Options Explained

exclude_blocks: List of survey blocks to exclude from processing
include_reasoning: Whether to include reasoning in question prompts (true for creative tasks)
temperature: LLM sampling temperature (0.0 = near-deterministic, 1.0 = more varied output, suited to creative tasks)
max_personas: Limit number of personas processed (-1 for all)
force_regenerate: Whether to overwrite existing simulation outputs
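
A quick sanity check of these options can catch typos before a long run. The sketch below validates a flat dictionary of the options described above. The function `validate_options` and the flat layout are assumptions made for illustration, since the real configuration nests these keys under different sections.

```python
def _is_int(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool)


def validate_options(options: dict) -> list[str]:
    """Return a list of problems; an empty list means the values look sane."""
    problems = []
    temperature = options.get("temperature")
    if (isinstance(temperature, bool)
            or not isinstance(temperature, (int, float))
            or temperature < 0):
        problems.append("temperature must be a non-negative number")
    max_personas = options.get("max_personas")
    if not _is_int(max_personas) or (max_personas != -1 and max_personas < 1):
        problems.append("max_personas must be -1 or a positive integer")
    for flag in ("include_reasoning", "force_regenerate"):
        if not isinstance(options.get(flag), bool):
            problems.append(f"{flag} must be true or false")
    blocks = options.get("exclude_blocks", [])
    if not isinstance(blocks, list) or not all(isinstance(b, str) for b in blocks):
        problems.append("exclude_blocks must be a list of block names")
    return problems
```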

Usage

The pipeline is managed using Snakemake, which builds a Directed Acyclic Graph (DAG) based on the workflow rules defined in Snakefile. Snakemake automatically determines which steps need to be run based on file dependencies and timestamps.
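
The timestamp rule can be pictured with a small sketch: a step needs to run if any output is missing or older than its newest input. This is a deliberately simplified model with a hypothetical helper name (`needs_rerun`). Snakemake's actual scheduling logic considers more than this.

```python
import os


def needs_rerun(inputs: list[str], outputs: list[str]) -> bool:
    """True if any output is missing or older than the newest input."""
    if any(not os.path.exists(path) for path in outputs):
        return True
    if not inputs or not outputs:
        return False
    newest_input = max(os.path.getmtime(path) for path in inputs)
    oldest_output = min(os.path.getmtime(path) for path in outputs)
    return oldest_output < newest_input
```

Note that this check only looks at data files, which is why editing a Python script alone does not make its outputs look stale under this model.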

Running the Complete Pipeline

# Run the complete pipeline for idea generation study
poetry run snakemake --configfile configs/idea_generation.yaml --cores 1

# Run with multiple cores for faster processing
poetry run snakemake --configfile configs/idea_generation.yaml --cores 4

# Run for a different study (e.g., story_beliefs)
poetry run snakemake --configfile configs/story_beliefs.yaml --cores 4

Available Snakemake Rules

The Snakefile defines the following rules that correspond to each pipeline step:

process_qsf: Convert QSF files to JSON templates
process_csv: Process CSV responses to individual JSON files
convert_personas: Convert persona data to text format
convert_questions: Convert questions to text prompts
create_simulation_input: Combine personas and questions
run_llm_simulation: Execute LLM digital twin simulation

Forcing Rule Execution

By default, Snakemake skips rules if output files are newer than input files. Since Python code changes aren't automatically detected, you may need to force execution:

# Force all rules to run regardless of file timestamps
poetry run snakemake --configfile configs/idea_generation.yaml --cores 4 --forceall

# Force a specific rule and all downstream dependencies
poetry run snakemake --configfile configs/idea_generation.yaml --cores 4 --forcerun process_qsf

# Force multiple specific rules
poetry run snakemake --configfile configs/idea_generation.yaml --cores 4 --forcerun process_qsf --forcerun convert_personas

Running Individual Pipeline Components

When you are working on a single component of the workflow, you can test or run it without executing the entire pipeline:

# Run only the persona conversion step
poetry run snakemake convert_personas --configfile configs/idea_generation.yaml --cores 4 --forcerun convert_personas

# Run only the QSF processing step
poetry run snakemake process_qsf --configfile configs/idea_generation.yaml --cores 4 --forcerun process_qsf

# Run up to question conversion (includes all prerequisite steps)
poetry run snakemake convert_questions --configfile configs/idea_generation.yaml --cores 4 --forcerun convert_questions

When you specify a target rule, Snakemake builds a DAG that includes only that rule and its dependencies, making it efficient for testing individual components.
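
To see which steps a given target pulls in, you can walk the dependency graph backwards from that target. The sketch below does this over a dependency map transcribed from the pipeline diagram in this README. The function name `rules_needed_for` is hypothetical, and the traversal only approximates what Snakemake does internally.

```python
# Each rule maps to the rules whose outputs it consumes.
DEPENDENCIES = {
    "process_qsf": set(),
    "process_csv": {"process_qsf"},
    "convert_personas": set(),
    "convert_questions": {"process_csv"},
    "create_simulation_input": {"convert_personas", "convert_questions"},
    "run_llm_simulation": {"create_simulation_input"},
}


def rules_needed_for(target: str, dependencies: dict[str, set[str]]) -> set[str]:
    """Collect the target rule plus everything it transitively depends on."""
    needed: set[str] = set()
    stack = [target]
    while stack:
        rule = stack.pop()
        if rule not in needed:
            needed.add(rule)
            stack.extend(dependencies.get(rule, ()))
    return needed


print(sorted(rules_needed_for("convert_questions", DEPENDENCIES)))
```

For `convert_questions`, the result excludes `convert_personas`, which matches the idea that unrelated branches of the DAG are skipped.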

Dry run the Workflow

To see what Snakemake plans to run before executing anything, use:

# Show which files would be created/updated (dry run)
poetry run snakemake --configfile configs/idea_generation.yaml --cores 1 --dry-run

# Generate a workflow diagram (requires graphviz)
poetry run snakemake --configfile configs/idea_generation.yaml --dag | dot -Tpng > workflow.png

# Show detailed execution plan
poetry run snakemake --configfile configs/idea_generation.yaml --cores 1 --dry-run --printshellcmds

Troubleshooting

Common Issues

Missing API Key:

# Make sure your .env file contains your key (this command overwrites .env):
echo "OPENAI_API_KEY=your_actual_api_key_here" > .env

File path errors:
- Ensure all paths in your configuration file are relative to the project root
- Check that raw data files exist in the specified locations

Memory issues with large datasets:
- Use --max_personas=N to limit the number of participants
- Adjust limit in your configuration file

Snakemake doesn't detect changes:
- Use --forcerun <rule_name> to force rebuilding specific rules and their dependencies
- Use --forceall to force complete rebuild of all rules
- Use --delete-all-output followed by a normal run to completely rebuild from scratch

LLM simulation fails:
- Check your OpenAI API key and quota
- Reduce num_workers if hitting rate limits
- Check max_retries setting in LLM_config
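
To make the interaction between rate limits and `max_retries` more concrete, here is a generic retry-with-exponential-backoff sketch. It is not the repository's retry logic. The function `call_with_retries`, its parameters, and the interpretation of `max_retries` as "retries after the first attempt" are all assumptions.

```python
import random
import time


def call_with_retries(func, max_retries: int = 10, base_delay: float = 1.0,
                      sleep=time.sleep):
    """Call func(); on failure, wait with exponential backoff and try again."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            # Delay doubles each attempt; small jitter spreads out workers.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```

With many parallel workers, each failed call waits progressively longer, so lowering `num_workers` reduces how often this path is hit in the first place.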

Output Verification

After running the pipeline, you should see these directories populated:

# For idea_generation study:
data/idea_generation/wave_qsf_json/          # Survey templates
data/idea_generation/response_json/          # Individual responses
text_simulation/text_personas/               # Text personas
text_simulation/text_questions/              # Text questions
text_simulation/text_simulation_input/       # Combined inputs
text_simulation/text_simulation_output/      # Final simulation results
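
This check can be automated. The sketch below reports which of the directories listed above are missing or empty. The helper `find_empty_outputs` is a hypothetical name, and the directory list is hard-coded for the idea_generation study.

```python
from pathlib import Path

# Output directories listed above for the idea_generation study.
EXPECTED_DIRS = [
    "data/idea_generation/wave_qsf_json",
    "data/idea_generation/response_json",
    "text_simulation/text_personas",
    "text_simulation/text_questions",
    "text_simulation/text_simulation_input",
    "text_simulation/text_simulation_output",
]


def find_empty_outputs(root: str = ".", expected=EXPECTED_DIRS) -> list[str]:
    """Return the expected directories that are missing or contain nothing."""
    problems = []
    for rel in expected:
        folder = Path(root) / rel
        if not folder.is_dir() or not any(folder.iterdir()):
            problems.append(rel)
    return problems


print("Missing or empty:", find_empty_outputs())
```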

ML Prediction with XGBoost

In addition to LLM-based digital twins, this repository includes a machine learning approach using XGBoost to predict survey responses based on persona features.

# Train XGBoost models
poetry run python ml_prediction/predict_answer_xgboost.py \
    --config ml_prediction/ml_prediction_config.yaml

# Evaluate predictions with MAD
poetry run python ml_prediction/prepare_xgboost_for_mad.py --run-mad

For more details and options, see ml_prediction/README.md.

Meta-Analysis of Results

After running simulations, analyze results across all studies and specifications:

# Combine all meta-analysis results into a single CSV
poetry run python mega_study_evaluation/combine_all_meta_analyses.py

This creates a single CSV file that combines all studies across all specifications and includes:

Specification types (e.g., "full_persona_without_reasoning")
Run dates parsed from directory names
All original meta-analysis metrics from individual studies

Results are saved to mega_study_evaluation/meta_analysis_results/combined_all_specifications_meta_analysis_{timestamp}.csv
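
Because each run writes a new timestamped file, you may want to pick up the most recent one programmatically. The sketch below selects the newest matching file by modification time. The helper `latest_combined_csv` is hypothetical, and it does not parse the timestamp in the filename, since that format is not documented here.

```python
from pathlib import Path

PATTERN = "combined_all_specifications_meta_analysis_*.csv"


def latest_combined_csv(results_dir: str = "mega_study_evaluation/meta_analysis_results"):
    """Return the most recently modified combined CSV, or None if none exist."""
    candidates = sorted(Path(results_dir).glob(PATTERN),
                        key=lambda path: path.stat().st_mtime)
    return candidates[-1] if candidates else None


print(latest_combined_csv())
```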

For individual study analysis and advanced meta-analysis options, see mega_study_evaluation/README.md.

Contributing

When adding new studies or modifying processing logic:

Create appropriate data folder structure
Add configuration file in configs/
Test processing pipeline with Snakemake
Document any new configuration options
Update this README if needed

For more detailed documentation on individual components, see the configs/README.md file.
