IDSpace is a synthetic data generation framework that produces a large number of identity documents from only a few documents in a target domain, without including any private information.
Our synthetic dataset is released on HuggingFace; you can download it from here.
Python>=3.10 is required to run the project. To install all the dependencies, either run the following command or manually install the dependencies in the requirements file.
pip install -r requirements.txt

Download the pretrained models from here and place the unzipped models folder inside the data/ directory.
In our experiments, we used the SIDTD template dataset as the target domain. You can download the SIDTD data from site1 or site2, then put the reals and fakes folders inside the data/templates/ directory.
To run the Bayesian Optimization baseline, run the following command under the root directory of the project.
python experiments/Bayesian_search.py target_samples with_model lambda0 lambda1 candidate_models

In the above command, target_samples is the number of samples to use (int), with_model indicates whether the optimization is model-guided (0 or 1), lambda0 and lambda1 control the weights of the similarity score and the consistency score in the total evaluation score, and candidate_models lists the names of the models (space-separated) that guide the optimization. Example commands are given below:
python experiments/Bayesian_search.py 20 1 1 1 resnet50
python experiments/Bayesian_search.py 20 0 1 1 ssim

You may also execute experiments/run.py, which provides sample commands for running Bayesian_search.py.
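As a rough illustration of how lambda0 and lambda1 could enter the evaluation, the sketch below combines a similarity score and a consistency score as a weighted sum. The linear form, the function name `total_score`, and the example score values are assumptions made for illustration; the actual scoring in Bayesian_search.py may differ.

```python
def total_score(similarity: float, consistency: float,
                lambda0: float = 1.0, lambda1: float = 1.0) -> float:
    """Weighted sum of two scores (assumed form, not taken from the repo)."""
    return lambda0 * similarity + lambda1 * consistency


# Hypothetical scores for one candidate parameter setting.
equal = total_score(0.5, 0.25)                         # lambda0 = lambda1 = 1
similarity_only = total_score(0.5, 0.25, lambda1=0.0)  # consistency ignored
print(equal, similarity_only)
```

Under this reading, setting lambda0 = lambda1 = 1 (as in the example commands) would give both scores equal influence.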
To run the Hyperband search baseline, run the following command under the root directory of the project.
python experiments/Hyperband_search.py param_r param_eta target_samples with_model lambda0 lambda1 candidate_models

Here, param_r and param_eta are the maximum resource and the successive-halving factor of the Hyperband search method. The other parameters (target_samples, with_model, lambda0, lambda1, and candidate_models) have the same meaning as in the Bayesian search. Example commands:
python experiments/Hyperband_search.py 700 3 20 1 1 1 resnet50
python experiments/Hyperband_search.py 700 3 20 0 1 1 ssim

Alternatively, experiments/run_hyperband.py runs Hyperband_search.py with example commands.
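To give a feel for what param_r and param_eta control, the sketch below computes the standard Hyperband bracket schedule: for each bracket, how many configurations start and how much resource each one initially receives. It assumes Hyperband_search.py follows the textbook schedule, which has not been verified against the repo; the function name `hyperband_brackets` is hypothetical.

```python
def hyperband_brackets(max_resource: int, eta: int) -> list[tuple[int, int, float]]:
    """Return (s, n, r) per bracket: index, initial configs, initial resource."""
    # s_max = floor(log_eta(max_resource)), computed with integers to avoid
    # floating-point rounding issues.
    s_max = 0
    while eta ** (s_max + 1) <= max_resource:
        s_max += 1

    brackets = []
    for s in range(s_max, -1, -1):
        # n = ceil((s_max + 1) * eta**s / (s + 1)), via integer ceiling division.
        total = (s_max + 1) * eta ** s
        n = (total + s) // (s + 1)
        # Each configuration in this bracket starts with max_resource / eta**s.
        r = max_resource / eta ** s
        brackets.append((s, n, r))
    return brackets


# Values from the example commands: param_r = 700, param_eta = 3.
for s, n, r in hyperband_brackets(700, 3):
    print(f"bracket s={s}: {n} configurations, {r:.1f} resource units each")
```

A larger param_eta discards configurations more aggressively in each halving round, while a larger param_r raises the budget a surviving configuration can reach.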
Download the CycleGAN repo here.
Download the data used for this baseline here.
To run the CycleGAN baseline, run the following command under the root directory of the project.
cd experiments
bash cycle_run.sh

This includes both CycleGAN training and result evaluation (test_cyclegan.py).
Download the diffusers repo here.
Download the data used for this baseline here.
To run the Diffusion inpainting baseline, run the following command under the root directory of the project.
cd experiments
python test_diffusion.py <dataset>

To train and test the inpainting models:
cd experiments/diffusion_inpainting
bash train_bash.sh
python infer_pipeline.py <number of training samples>

The folder also contains the data preprocessing code (prepare_data.py).
See README.md in template_image for more details.
See README.md in scanned_image for more details.
See README.md in mobile_image for more details.
git clone https://github.com/asu-cactus/IDSpace.git
cd IDSpace
pip install -r requirements.txt # Python >=3.10 is required
cd data/inputs/SVK

In this folder, SVK_configure.yaml contains all configuration settings used for both the Bayesian Optimization (BO) search and template-based image generation. The file SVK_pii_configure.yaml defines user-configurable parameters for generating personal data according to specified distributions.
Example Format
- Defines experiment-level settings
area: SVK # Country / document area identifier
template_path: data/inputs/SVK/template_SVK.png # Base ID template image
# Font resources
text_fonts_path: data/inputs/text_fonts/ # Fonts used for printed text
signature_fonts: data/inputs/signature_fonts # Fonts used for signatures

- Defines personal metadata
We support two input formats. A CSV file can be provided to supply user-defined personal metadata for populating each template field. Alternatively, a YAML file can be used to specify metadata distributions, in which case the system automatically generates personal metadata that satisfies the specified requirements.
personal_info: data/inputs/SVK/personal_info.csv # CSV containing personal information
# personal_info: data/inputs/SVK/SVK_pii_configure.yaml # YAML containing personal info distribution

- Specifies Bayesian Optimization settings

bo_settings:
  init_points: 1  # Number of random initialization points
  n_iter: 3  # Number of optimization iterations
  seed: 2  # Random seed for reproducibility

- Specifies the objective configuration
eval_args:
  target_samples: 1  # Number of samples used for parameter search
  with_model: 1  # Whether to use the model-guided method
  candidate_models:  # Guiding models
    - resnet50
  testing: true  # Enable testing at the end
  lambda0: 1  # Scoring weight for SSIM
  lambda1: 1  # Scoring weight for model consistency
  guided_datapaths: data/inputs/template_guided_datas.json  # Guided data info

- Defines output directories
output:
  best_settings_file: data/outputs/SVK/template_best_settings.json
  synthetic_images_path: data/outputs/SVK/positive  # Output directory for generated images
  synthetic_images_annotation_path: data/outputs/SVK/SVK_original_annotation.json  # Output annotation file

- Defines template segments
segments:
  portrait:
    type: image  # Image-based segment
    bbox:  # Bounding box (x1, y1, x2, y2)
      - 44
      - 176
      - 370
      - 576
    tunable: false  # Fixed placement
  # ---------- Text fields ----------
  name:
    type: text
    font_group: 1  # Font style group
    color_group: 1  # Color group reference
    tunable: true  # Optimized during BO search
    font_info:
      text_height:
        initial: 29  # Initial value / approximation
        margin: 5  # Allowed variation
    text_position:
      x:
        initial: 397
        margin: 5
      y:
        initial: 211
        margin: 5

If a user does not have personal metadata, or wants to generate images whose demographic attributes follow a specific distribution, they can customize the SVK_pii_configure.yaml file, and the system will automatically generate personal metadata that satisfies the specified requirements.
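As a rough sketch of what distribution-driven generation can look like, the snippet below samples an integer age uniformly from an inclusive range and derives a formatted date of birth from it. The bounds 18 and 85, the pattern "%d.%m.%Y", and the seed 42 mirror the example SVK_pii_configure.yaml in this section; the function name `sample_age_and_dob` and the reference date are hypothetical, and this is not IDSpace's actual generator.

```python
import random
from datetime import date, timedelta


def sample_age_and_dob(rng: random.Random, min_age: int, max_age: int,
                       date_pattern: str, today: date) -> tuple[int, str]:
    """Sample an integer age uniformly, then pick a matching date of birth."""
    age = rng.randint(min_age, max_age)  # inclusive on both ends

    # Latest possible birth date for this age: `today` shifted back `age` years
    # (29 Feb is clamped to 28 Feb so the date always exists).
    day = 28 if (today.month, today.day) == (2, 29) else today.day
    latest_dob = date(today.year - age, today.month, day)

    # Move back by less than a year so the person is still exactly `age` today.
    dob = latest_dob - timedelta(days=rng.randint(0, 364))
    return age, dob.strftime(date_pattern)


rng = random.Random(42)  # fixed seed for reproducibility
age, dob = sample_age_and_dob(rng, 18, 85, "%d.%m.%Y", today=date(2025, 1, 1))
print(age, dob)
```

Sampling the age first and deriving the date from it keeps the two fields consistent with each other.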
Example Format
- Number of generated examples and modules used

number_samples: 5
ui:
  default_modules: ["person_core", "dob", "doc_dates", "issue_location", "svk_numbers", "portrait_from_index"]
  default_profile: "SVK"
  default_seed: 42

- Output fields, keys, and formats
output_profiles:
  SVK:
    description: "SVK-style output format"
    fields:
      - last_name
      - first_name
      - dob
      - issue_date
      - expiry_date
      - gender
      - issue_place
      - doc_number
      - local_id_number
      - signature
      - portrait
    key_map:
      last_name: surname
      first_name: name
      dob: birth_date
      doc_number: number
      local_id_number: id_number
    formats:
      date_pattern: "%d.%m.%Y"
      gender_encoding:
        male: "M"
        female: "F"
        nonbinary: "X"

- Distribution of one attribute
modules:
  dob:
    description: "Generate integer age and DOB string."
    provides: ["age", "dob_iso", "dob"]
    requires: []
    params:
      age:
        dist:
          type: uniform_int
          params:
            min: 18
            max: 85
        enforce: sample

Run the pipeline:
To run the Bayesian Optimization (BO) search, a customized YAML configuration file must be provided.
To perform image generation, the best-performing parameter configuration produced by the BO search is used.
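To make the search space more concrete, the sketch below shows one plausible reading of the `initial` and `margin` entries of a tunable field: an inclusive range of `initial - margin` to `initial + margin`. This interpretation, the helper name `search_range`, and the flattened `name_params` dictionary are assumptions for illustration; the numbers are copied from the `name` segment of the example configuration.

```python
def search_range(spec: dict) -> tuple[int, int]:
    """Inclusive (low, high) range from an {'initial', 'margin'} entry."""
    return spec["initial"] - spec["margin"], spec["initial"] + spec["margin"]


# Values copied from the `name` segment; the flat layout is hypothetical.
name_params = {
    "text_height": {"initial": 29, "margin": 5},
    "x": {"initial": 397, "margin": 5},
    "y": {"initial": 211, "margin": 5},
}
ranges = {key: search_range(spec) for key, spec in name_params.items()}
print(ranges)
```

In an Optuna-based search, a range like this would typically be passed to `trial.suggest_int(name, low, high)`; whether Optuna_search_global.py does exactly that is not confirmed here.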
cd ../../../ # go to the main folder
python template_image/template_image/Optuna_search_global.py --config data/inputs/SVK/SVK_configure.yaml # BO parameter search
python template_image/template_image_generation.py --best_seting_path data/outputs/SVK/template_best_settings.json # template image generation

Output directory defined in SVK_configure.yaml:
output:
  best_settings_file: data/outputs/SVK/template_best_settings.json
  synthetic_images_path: data/outputs/SVK/positive  # Output directory for generated images
  synthetic_images_annotation_path: data/outputs/SVK/SVK_original_annotation.json  # Output annotation file

Example Output Structure
data/outputs/SVK/
├── positive/
│ ├── generated_1.png
│ └── generated_2.png
├── SVK_original_annotation.json
└── template_best_settings.json

Note: The fraud image generation code is adapted from the SIDTD repository, with only minimal modifications to better suit the structure and annotations of our dataset.
python template_image/fraud_generation/generate.py data/outputs/SVK positive SVK_original_annotation.json

Arguments:
- data/outputs/SVK: Base directory containing the source images and annotations
- positive: Relative path (under the base directory) to the folder with positive images to be converted into fraudulent ones
- SVK_original_annotation.json: Relative path (under the base directory) to the annotation file for the positive images
Output:
- Generates fraudulent images and annotations using two manipulations: inpaint-and-rewrite and crop-and-replace
- All generated images and annotations are saved under:
data/outputs/SVK