📄 Paper Repository: This is the official implementation of our paper "Semantic-Guided Autoregressive Diffusion-Based Data Augmentation Using Visual Instructions" (ISAS 2025) by Ege Yavuzcan, Ömer Kuş, and Abdurrahman Gümüş.
A modular, iterative framework that addresses the challenge of limited training data in deep learning by generating semantically consistent augmented images. Unlike traditional augmentation techniques (flipping, rotation, color jittering) that produce visually similar samples, our approach leverages Vision Language Models and diffusion-based image generation to create diverse yet contextually meaningful variations.
- VLM-Driven Captioning: Utilizes LLaVA to generate multiple semantic descriptions of input images
- CLIP-Based Prompt Selection: Automatically selects the most relevant caption using cosine similarity scoring
- Autoregressive Refinement: Each augmented image becomes the input for subsequent iterations, enabling progressive semantic diversity
- Classifier Improvement: Augmented datasets significantly improve classification model performance on limited data
- Iterative Refinement: Repeat caption → selection → augmentation for a configurable number of iterations
- Batch Processing: CLI supports single image or folder input, outputs to structured subfolders
- Modular Architecture: Separate components for captioning (LLaVA), prompt selection (CLIP), and augmentation (Stable Diffusion)
- Detailed Logging: Records generated captions, cleaned prompts, similarity scores, and augmentation states
- Configurable: All hyperparameters and model checkpoints defined in `config/config.yaml`
```bash
# Clone and install in editable mode
git clone https://github.com/egeyavuzcan/semantic-data-augmentation
cd semantic-data-augmentation
pip install -e .
pip install atma

# (Optional) Create and activate Conda environment
conda create -n myenv python=3.10
conda activate myenv
pip install -r requirements.txt
```

Run the CLI on a single image or a folder of images:

```bash
sda-augment -i <input_path> -o <output_dir> -c config/config.yaml [--return_all]
```

- `-i, --input_path`: Path to an image file or a directory of images.
- `-o, --output_dir`: Directory for augmented outputs.
- `-c, --config`: Pipeline configuration YAML.
- `--return_all`: Save all intermediate results instead of the final image only.
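For example, to augment a folder of images and keep all intermediate iterations (the input and output paths below are hypothetical):

```bash
sda-augment -i data/raw_images -o outputs -c config/config.yaml --return_all
```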
- Load Configuration: Read models, thresholds, iteration count, and logging settings.
- Initialize Modules:
  - CaptionGenerator: Uses the LLaVA chat model to produce exactly three numbered captions.
  - PromptSelector: Encodes the captions and the image via CLIP, computes cosine similarities, and picks the top-scoring prompt (stripping any `ASSISTANT:` prefix).
  - Augmenter: Wraps Stable Diffusion Img2Img for conditional image augmentation.
  - Refiner: Orchestrates the iterative pipeline.
- Iterative Loop (per image):
  - Generate captions.
  - Select the best prompt by CLIP score.
  - Augment the image with the chosen prompt.
  - Repeat for the configured number of `iterations`.
- Output: Save results under `<output_dir>/augmented_dataset` with iteration suffixes (see the sketch below).
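The pipeline maps onto standard Hugging Face components. The sketch below is a minimal, illustrative version of one caption → selection → augmentation iteration using the `transformers` and `diffusers` APIs; the function names, the `strength` value, and the caption-parsing details are assumptions for illustration and do not mirror the package's actual classes.

```python
# Minimal sketch of one refinement iteration (hypothetical helper names,
# not the package's actual modules). Assumes recent transformers and diffusers.
import torch
from PIL import Image
from transformers import (AutoProcessor, LlavaForConditionalGeneration,
                          CLIPModel, CLIPProcessor)
from diffusers import StableDiffusionImg2ImgPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

# Captioning model (LLaVA)
llava_id = "llava-hf/llava-1.5-7b-hf"
llava_proc = AutoProcessor.from_pretrained(llava_id)
llava = LlavaForConditionalGeneration.from_pretrained(
    llava_id, torch_dtype=torch.float16).to(device)

# Prompt-selection model (CLIP)
clip_id = "openai/clip-vit-base-patch32"
clip_proc = CLIPProcessor.from_pretrained(clip_id)
clip = CLIPModel.from_pretrained(clip_id).to(device)

# Augmentation model (Stable Diffusion Img2Img)
sd = StableDiffusionImg2ImgPipeline.from_pretrained(
    "sd-legacy/stable-diffusion-v1-5", torch_dtype=torch.float16).to(device)

def generate_captions(image: Image.Image) -> list[str]:
    """Ask LLaVA for three numbered captions and split them into a list."""
    prompt = ("USER: <image>\nGive me three detailed image captions describing the "
              "main elements, context, and any notable objects or actions in this "
              "image. ASSISTANT:")
    inputs = llava_proc(images=image, text=prompt, return_tensors="pt").to(device, torch.float16)
    out = llava.generate(**inputs, max_new_tokens=200)
    text = llava_proc.decode(out[0], skip_special_tokens=True)
    answer = text.split("ASSISTANT:")[-1].strip()  # strip the ASSISTANT: prefix
    # Crude cleanup of leading "1." / "2." / "3." numbering (illustrative only).
    return [line.lstrip("123. ").strip() for line in answer.splitlines() if line.strip()]

def select_prompt(image: Image.Image, captions: list[str]) -> str:
    """Pick the caption with the highest CLIP image-text cosine similarity."""
    inputs = clip_proc(text=captions, images=image, return_tensors="pt", padding=True).to(device)
    with torch.no_grad():
        out = clip(**inputs)
    img_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    scores = (txt_emb @ img_emb.T).squeeze(-1)  # one cosine score per caption
    return captions[int(scores.argmax())]

def refine(image: Image.Image, iterations: int = 3) -> Image.Image:
    """Autoregressive loop: each augmented image seeds the next iteration."""
    for _ in range(iterations):
        prompt = select_prompt(image, generate_captions(image))
        # strength=0.6 is an assumed value; guidance_scale and steps match config.yaml.
        image = sd(prompt=prompt, image=image, strength=0.6,
                   guidance_scale=7.5, num_inference_steps=35).images[0]
    return image
```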
Example `config/config.yaml`:

```yaml
caption_generation:
  model_name_or_path: llava-hf/llava-1.5-7b-hf
  prompt_templates:
    general: "USER: <image>\nGive me three detailed image captions describing the main elements, context, and any notable objects or actions in this image."
  max_new_tokens: 200

prompt_selection:
  clip_model: openai/clip-vit-base-patch32
  threshold: 0.2

augmentation:
  model: sd-legacy/stable-diffusion-v1-5
  guidance_scale: 7.5
  num_inference_steps: 35

iterative_refinement:
  iterations: 3

logging:
  level: INFO
  log_dir: logs
```
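These values are read once at startup. A minimal sketch of loading such a config with PyYAML (the variable names below are illustrative, not the package's actual API):

```python
# Illustrative only: load config.yaml and pull out a few pipeline settings.
import yaml

with open("config/config.yaml") as f:
    cfg = yaml.safe_load(f)

clip_model = cfg["prompt_selection"]["clip_model"]      # openai/clip-vit-base-patch32
iterations = cfg["iterative_refinement"]["iterations"]  # 3
guidance = cfg["augmentation"]["guidance_scale"]        # 7.5
print(clip_model, iterations, guidance)
```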
If you use this work, please cite:

```bibtex
@inproceedings{yavuzcan2025semantic,
  title={Semantic-Guided Autoregressive Diffusion-Based Data Augmentation Using Visual Instructions},
  author={Yavuzcan, Ege and Kuş, Ömer and Gümüş, Abdurrahman},
  booktitle={ISAS 2025},
  year={2025}
}
```

Contributions welcome! Please open issues and pull requests.
MIT License
