
Semantic-Guided Image Augmentation Pipeline

(Figure: Semantic Data Augmentation pipeline overview)

📄 Paper Repository: This is the official implementation of our paper "Semantic-Guided Autoregressive Diffusion-Based Data Augmentation Using Visual Instructions" (ISAS 2025) by Ege Yavuzcan, Ömer Kuş, and Abdurrahman Gümüş.

A modular, iterative framework that addresses the challenge of limited training data in deep learning by generating semantically consistent augmented images. Unlike traditional augmentation techniques (flipping, rotation, color jittering), which produce visually similar samples, our approach leverages vision-language models (VLMs) and diffusion-based image generation to create diverse yet contextually meaningful variations.

(Figure: example augmentations of eye images)

Key Contributions

  • VLM-Driven Captioning: Utilizes LLaVA to generate multiple semantic descriptions of input images
  • CLIP-Based Prompt Selection: Automatically selects the most relevant caption using cosine-similarity scoring (sketched below this list)
  • Autoregressive Refinement: Each augmented image becomes the input for subsequent iterations, enabling progressive semantic diversity
  • Classifier Improvement: Augmented datasets significantly improve classification model performance on limited data
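
A minimal sketch of the CLIP selection step, assuming the Hugging Face transformers CLIP API and the checkpoint named in config/config.yaml; the function below is illustrative, not the repository's PromptSelector:

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Sketch only: score candidate captions against the image by cosine similarity.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def select_best_caption(image: Image.Image, captions: list[str]) -> str:
    inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
    # After L2 normalization, cosine similarity is a plain dot product.
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    scores = (text_emb @ image_emb.T).squeeze(-1)
    return captions[int(scores.argmax())]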

Features

  • Iterative Refinement: Repeat caption → selection → augmentation for a configurable number of iterations
  • Batch Processing: CLI supports single-image or folder input and writes outputs to structured subfolders
  • Modular Architecture: Separate components for captioning (LLaVA; sketched after this list), prompt selection (CLIP), and augmentation (Stable Diffusion)
  • Detailed Logging: Records generated captions, cleaned prompts, similarity scores, and augmentation states
  • Configurable: All hyperparameters and model checkpoints defined in config/config.yaml
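
As an illustration of the captioning component, a hedged sketch of prompting LLaVA for three numbered captions through transformers (device_map="auto" requires accelerate; the exact prompt and decoding settings used by the repository live in config/config.yaml):

import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Sketch only: ask LLaVA for three numbered captions (not the repo's CaptionGenerator).
model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto")

def generate_captions(image: Image.Image) -> str:
    prompt = ("USER: <image>\nGive me three detailed image captions describing the "
              "main elements, context, and any notable objects or actions in this "
              "image.\nASSISTANT:")
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(
        model.device, torch.float16)
    output = model.generate(**inputs, max_new_tokens=200)
    return processor.decode(output[0], skip_special_tokens=True)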

Installation

# (Optional) Create and activate a Conda environment
conda create -n myenv python=3.10
conda activate myenv

# Clone and install in editable mode
git clone https://github.com/egeyavuzcan/semantic-data-augmentation
cd semantic-data-augmentation
pip install -r requirements.txt
pip install -e .

CLI Usage

sda-augment -i <input_path> -o <output_dir> -c config/config.yaml [--return_all]
  • -i, --input_path: Path to image file or directory of images.
  • -o, --output_dir: Directory for augmented outputs.
  • -c, --config: Pipeline configuration YAML.
  • --return_all: Save all intermediate results instead of only the final image.
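
For example, to augment every image in a folder and keep all intermediate iterations (the paths below are placeholders):

sda-augment -i data/raw_images -o outputs -c config/config.yaml --return_all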

Algorithm

  1. Load Configuration: Read models, thresholds, iteration count, and logging settings.
  2. Initialize Modules:
    • CaptionGenerator: Uses the LLaVA chat model to produce exactly three numbered captions.
    • PromptSelector: Encodes the captions and image via CLIP, computes cosine similarities, and picks the top-scoring prompt (stripping any ASSISTANT: prefix).
    • Augmenter: Wraps Stable Diffusion Img2Img for conditional image augmentation.
    • Refiner: Orchestrates iterative pipeline.
  3. Iterative Loop (per image):
    • Generate captions.
    • Select best prompt by CLIP score.
    • Augment image with chosen prompt.
    • Repeat for the configured number of iterations (see the sketch after this list).
  4. Output: Save results under <output_dir>/augmented_dataset with iteration suffixes.
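
Putting the pieces together, the refinement loop can be sketched as follows. This is a schematic of the algorithm, not the repository's Refiner: it assumes the diffusers Img2Img pipeline with the checkpoint from config/config.yaml, reuses the illustrative generate_captions / select_best_caption helpers sketched earlier, and adds a naive parser for the numbered captions:

import re
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "sd-legacy/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")

def parse_numbered_captions(raw: str) -> list[str]:
    # Naive split of "1. ... 2. ... 3. ..." out of the raw LLaVA output.
    text = raw.split("ASSISTANT:")[-1]
    return [c.strip() for c in re.split(r"\n?\d\.\s*", text) if c.strip()]

def refine(image: Image.Image, iterations: int = 3) -> list[Image.Image]:
    results, current = [], image
    for _ in range(iterations):
        captions = parse_numbered_captions(generate_captions(current))  # LLaVA, sketched above
        prompt = select_best_caption(current, captions)                 # CLIP, sketched above
        # Autoregressive step: the output becomes the next iteration's input.
        current = pipe(prompt=prompt, image=current,
                       guidance_scale=7.5, num_inference_steps=35).images[0]
        results.append(current)
    return results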

Configuration (config/config.yaml)

caption_generation:
  model_name_or_path: llava-hf/llava-1.5-7b-hf
  prompt_templates:
    general: "USER: <image>\nGive me three detailed image captions describing the main elements, context, and any notable objects or actions in this image."
  max_new_tokens: 200

prompt_selection:
  clip_model: openai/clip-vit-base-patch32
  threshold: 0.2

augmentation:
  model: sd-legacy/stable-diffusion-v1-5
  guidance_scale: 7.5
  num_inference_steps: 35

iterative_refinement:
  iterations: 3

logging:
  level: INFO
  log_dir: logs
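
The pipeline reads this file with a standard YAML loader; a minimal sketch, assuming PyYAML:

import yaml

# Sketch: load the configuration and pull a few hyperparameters.
with open("config/config.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["augmentation"]["guidance_scale"])      # 7.5
print(cfg["iterative_refinement"]["iterations"])  # 3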

Citation

@inproceedings{yavuzcan2025semantic,
  title={Semantic-Guided Autoregressive Diffusion-Based Data Augmentation Using Visual Instructions},
  author={Yavuzcan, Ege and Kuş, Ömer and Gümüş, Abdurrahman},
  booktitle={ISAS 2025},
  year={2025}
}

Contributing

Contributions welcome! Please open issues and pull requests.

License

MIT License
