# Text-to-Image Generation Using Conditional Diffusion Models


A from-scratch implementation of a text-to-image diffusion model, using DDPM with CLIP text conditioning for image generation.

- **Input:** "a small bird with red feathers and black wings"
- **Output:** a 64×64 RGB image matching the description


## How It Works

### Diffusion Process

```mermaid
%%{init: {'theme':'dark'}}%%
graph LR
    A[Clean Image] -->|Add Noise| B[x₁]
    B -->|Add Noise| C[x₂]
    C -->|Add Noise| D[xₜ]
    D -->|UNet Denoise| C
    C -->|UNet Denoise| B
    B -->|UNet Denoise| A

    style A fill:#1a472a,stroke:#2e7d32,color:#fff
    style D fill:#5c2e2e,stroke:#d32f2f,color:#fff
```
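A useful property of this forward (noising) process is that any $x_t$ can be sampled in closed form directly from the clean image $x_0$, using the standard DDPM notation $\bar\alpha_t = \prod_{s=1}^{t} \alpha_s$ for the cumulative noise schedule:

$$x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)$$

This is what makes training efficient: each step can jump straight to a random timestep instead of simulating the chain.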

### Training Objective

$$\mathcal{L} = \mathbb{E}_{x_0, \epsilon, t} \left[ \left\| \epsilon - \epsilon_\theta(x_t, t, c) \right\|^2 \right]$$

The network $\epsilon_\theta$ predicts the noise $\epsilon$ added at each timestep $t$, conditioned on the text embedding $c$.
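A minimal sketch of this objective in PyTorch (function and variable names here are illustrative, not the repository's actual API): noise a clean batch at random timesteps via the closed-form forward process, then score the model's noise prediction with MSE.

```python
import torch
import torch.nn.functional as F

def diffusion_loss(model, x0, text_emb, alpha_bar):
    """One training step's loss: add noise to x0 at random
    timesteps, then compare predicted vs. true noise (MSE)."""
    b = x0.shape[0]
    t = torch.randint(0, alpha_bar.shape[0], (b,), device=x0.device)
    eps = torch.randn_like(x0)
    # Closed-form forward process: x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps
    abar_t = alpha_bar[t].view(b, 1, 1, 1)
    x_t = abar_t.sqrt() * x0 + (1 - abar_t).sqrt() * eps
    eps_pred = model(x_t, t, text_emb)
    return F.mse_loss(eps_pred, eps)
```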


## Architecture

```mermaid
%%{init: {'theme':'dark'}}%%
flowchart TB
    T[Text Prompt] --> CLIP[CLIP Encoder]
    CLIP --> EMB[Embedding 512-D]
    N[Noise] --> UNET[Conditional UNet]
    EMB --> UNET
    TIME[Timestep] --> UNET
    UNET --> PRED[Predicted Noise]
    PRED --> IMG[Generated Image]

    style T fill:#1a472a,stroke:#2e7d32,color:#fff
    style IMG fill:#1a3d5c,stroke:#1976d2,color:#fff
    style UNET fill:#5c2e2e,stroke:#d32f2f,color:#fff
```

### UNet Details

```mermaid
%%{init: {'theme':'dark'}}%%
graph TB
    X[Noisy Image] --> CONV[Init Conv]
    T[Timestep] --> TIME[Time Embed]
    C[Text Embed] --> PROJ[Projection]

    CONV --> DOWN[Encoder]
    TIME --> DOWN
    DOWN --> BOTTLE[Bottleneck]
    PROJ --> BOTTLE
    BOTTLE --> UP[Decoder]
    TIME --> UP
    UP --> OUT[Output Conv]
    OUT --> NOISE[Predicted Noise]

    style X fill:#1a472a,stroke:#2e7d32,color:#fff
    style NOISE fill:#5c2e2e,stroke:#d32f2f,color:#fff
    style BOTTLE fill:#3d2e5c,stroke:#7b1fa2,color:#fff
```
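One common way to inject the time and text embeddings into a UNet block, shown above as the Time Embed and Projection paths, is to project each embedding to the channel dimension and add it to the feature map. This is a hedged sketch under that assumption; the repository's actual layer names and injection points may differ.

```python
import torch
import torch.nn as nn

class ConditionedBlock(nn.Module):
    """Conv block that injects time and text conditioning by
    projecting each embedding to the channel dimension and
    adding it to the feature map (broadcast over H, W)."""
    def __init__(self, channels, time_dim=128, text_dim=512):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.time_proj = nn.Linear(time_dim, channels)
        self.text_proj = nn.Linear(text_dim, channels)
        self.act = nn.SiLU()

    def forward(self, x, t_emb, c_emb):
        h = self.act(self.conv(x))
        h = h + self.time_proj(t_emb)[:, :, None, None]
        h = h + self.text_proj(c_emb)[:, :, None, None]
        return h
```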

## Project Structure

```text
Conditional-Diffusion-Text2Image/
├── data/                # Dataset loading
├── models/              # UNet + Diffusion
├── text_encoder/        # CLIP wrapper
├── utils/               # Helpers & schedulers
├── train.py             # Training script
├── sample.py            # Generation script
└── requirements.txt
```

## Dataset

CUB-200-2011 Birds Dataset (11,788 images, 200 species)

Load it via HuggingFace Datasets:

```python
from datasets import load_dataset

dataset = load_dataset("alkzar90/CC6204-Hackaton-Cub-Dataset")
```

## Installation

```bash
git clone https://github.com/kushalsai-01/Conditional-Diffusion-Text2Image.git
cd Conditional-Diffusion-Text2Image
pip install -r requirements.txt
```

## Training

### Quick Start (Synthetic Data)

```bash
python train.py --use_synthetic --epochs 20 --batch_size 8
```

### Full Training (CUB-200)

```bash
python train.py \
    --root_dir ./dataset \
    --epochs 100 \
    --batch_size 16 \
    --timesteps 1000 \
    --lr 1e-4
```

### Training Flow

```mermaid
%%{init: {'theme':'dark'}}%%
flowchart TD
    START([Train]) --> BATCH[Get Batch]
    BATCH --> ENCODE[Text → Embedding]
    ENCODE --> NOISE[Add Noise to Image]
    NOISE --> PREDICT[UNet Predicts Noise]
    PREDICT --> LOSS[MSE Loss]
    LOSS --> UPDATE[Update Weights]
    UPDATE --> CHECK{More Data?}
    CHECK -->|Yes| BATCH
    CHECK -->|No| END([Done])

    style START fill:#1a472a,stroke:#2e7d32,color:#fff
    style END fill:#1a472a,stroke:#2e7d32,color:#fff
    style LOSS fill:#5c2e2e,stroke:#d32f2f,color:#fff
```

## Generation

### Basic

```bash
python sample.py \
    --checkpoint checkpoints/model_final.pt \
    --prompts "a red bird with black wings"
```

### DDIM (Faster)

```bash
python sample.py \
    --checkpoint checkpoints/model_final.pt \
    --prompts "a blue bird on a branch" \
    --use_ddim \
    --ddim_steps 50
```
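DDIM's speedup comes from a deterministic reverse update that can skip timesteps. A minimal sketch of one reverse step, assuming the fully deterministic case (η = 0) and illustrative names rather than the repository's actual API: recover the x₀ estimate from the predicted noise, then re-noise it to the previous timestep's level.

```python
import torch

def ddim_step(x_t, eps_pred, abar_t, abar_prev):
    """One deterministic DDIM reverse step (eta = 0).
    abar_t / abar_prev are cumulative-alpha values at the
    current and previous (possibly skipped-to) timesteps."""
    x0_pred = (x_t - (1 - abar_t).sqrt() * eps_pred) / abar_t.sqrt()
    return abar_prev.sqrt() * x0_pred + (1 - abar_prev).sqrt() * eps_pred
```

With 50 such steps instead of 1000 stochastic DDPM steps, sampling is roughly 20× faster at some cost in diversity.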

## Results

### Training Progress

| Epoch | Loss  |
|-------|-------|
| 5     | 0.312 |
| 10    | 0.187 |
| 15    | 0.123 |
| 20    | 0.089 |

### Generated Samples

Prompt: "a small yellow bird with black wings"


## Technical Details

| Component     | Configuration            |
|---------------|--------------------------|
| Model         | Conditional UNet         |
| Parameters    | ~8M                      |
| Text Encoder  | CLIP ViT-B/32 (frozen)   |
| Image Size    | 64×64                    |
| Timesteps     | 1000 (DDPM) / 50 (DDIM)  |
| Batch Size    | 16                       |
| Learning Rate | 1e-4                     |
| Scheduler     | Cosine with warmup       |
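A minimal sketch of the cosine-with-warmup learning-rate schedule listed above; the warmup length and the helper's name here are assumptions for illustration, not values taken from the repository.

```python
import math

def lr_at_step(step, total_steps, base_lr=1e-4, warmup_steps=500):
    """Linear warmup to base_lr, then cosine decay toward 0."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))
```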

## References

- DDPM - Ho et al., 2020
- DDIM - Song et al., 2021
- CLIP - Radford et al., 2021

## License

MIT License

## About

Text-to-image generation using a Conditional Denoising Diffusion Probabilistic Model (DDPM) implemented in PyTorch. Images are synthesized by conditioning the denoising process on text embeddings using a UNet backbone. Built from scratch without external image-generation APIs for learning and research purposes.
