This project is a simplified implementation of the Imagen text-to-image model; more specifically, it is an example of a diffusion model for image generation. Building such a project is an effective way to learn the difficult topic of modern generative AI by doing, and by exploring how a functional implementation works.
When given an image caption, the Imagen text-to-image model generates an image that depicts the described scene. The model employs a cascading diffusion architecture: a T5 text encoder produces a caption encoding, which conditions a base image generator, followed by a series of super-resolution models that refine the base image.
Other interesting aspects of this model worth noting are the concepts of noise conditioning augmentation and dynamic thresholding.
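To make these two ideas concrete, here is a hypothetical, simplified sketch of both techniques. It is not this repository's actual code; the function names and parameters are illustrative, and the math follows the descriptions in the Imagen paper (noise conditioning augmentation corrupts the low-resolution input to a super-resolution stage and conditions the model on the noise level; dynamic thresholding clips predicted pixels using a percentile of their absolute values).

```python
import numpy as np

def noise_condition_augment(image, aug_level, rng=None):
    """Sketch of noise conditioning augmentation: corrupt the low-resolution
    input to a super-resolution stage with Gaussian noise at a chosen level,
    and return that level so the model can be conditioned on it."""
    rng = rng or np.random.default_rng(0)
    noisy = (np.sqrt(1.0 - aug_level) * image
             + np.sqrt(aug_level) * rng.standard_normal(image.shape))
    return noisy, aug_level

def dynamic_threshold(x0, percentile=0.9):
    """Sketch of dynamic thresholding: pick s as a percentile of |x0|;
    if s > 1, clip x0 to [-s, s] and rescale by s so pixels stay in [-1, 1]."""
    s = max(np.quantile(np.abs(x0), percentile), 1.0)
    return np.clip(x0, -s, s) / s
```

With a static threshold, predictions outside [-1, 1] would simply be clipped, washing out saturated pixels; the dynamic version rescales instead, which is what lets Imagen use large classifier-free guidance weights.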
This implementation is based on Phil Wang's implementation and was made possible by the large collection of learning materials, both theoretical and practical, available online, such as articles, tutorials, and blog posts.
If you want to read the original Imagen paper, you can find it here.
First, clone this repository:

```bash
$ git clone [PUT REPO HERE]
```

After that, create a virtual environment:

```bash
$ pip install virtualenv
$ virtualenv venv
```

Then activate the virtual environment and install all dependencies:

```bash
$ .\venv\Scripts\activate.bat  # for Windows
$ source venv/bin/activate     # for MacOS/Linux
$ pip install -r requirements.txt
```

To use main.py for the most basic functionality, navigate to the project directory and run:

```bash
$ python main.py
```

This command will create a small "Imagen" instance, train it on a minimal dataset, and then generate an image using the trained instance.
After execution, two directories will be created:

- `training_<TIMESTAMP>`. This Training Directory is created during training and includes:
  - A `parameters` subdirectory with configuration details.
  - `state_dicts` and `tmp` directories containing model checkpoints.
  - A `training_progress.txt` file that logs the training progress.
- `generated_images_<TIMESTAMP>`, which contains:
  - A `generated_images` folder with the images generated by the model.
  - A `captions.txt` file documenting the input captions, where each line index corresponds to an image number in the `generated_images` folder.
  - An `imagen_training_directory.txt` file specifying the Training Directory used to load the MinImagen instance and generate images.
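Because each run produces a new timestamped Training Directory, a script that wants to pick up the most recent run has to locate it first. The helper below is a hypothetical sketch (not part of this repository) that assumes the `training_<TIMESTAMP>` naming described above and that the timestamps sort lexicographically:

```python
from pathlib import Path

def latest_training_dir(root="."):
    """Return the most recent training_<TIMESTAMP> directory under `root`,
    or None if no such directory exists. Assumes timestamps in the names
    sort lexicographically (e.g. YYYYMMDD_HHMMSS)."""
    dirs = sorted(p for p in Path(root).glob("training_*") if p.is_dir())
    return dirs[-1] if dirs else None
```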
`main.py` runs both `train.py` and `inference.py` in sequence, with the former training the model and the latter generating the image.
To train a model, execute train.py with the appropriate command line arguments. The arguments include:

- `--PARAMETERS` or `-p`: Directory specifying the MinImagen configuration, structured like a `parameters` subdirectory within a Training Directory.
- `--NUM_WORKERS` or `-n`: Number of workers for the DataLoaders.
- `--BATCH_SIZE` or `-b`: Batch size for training.
- `--MAX_NUM_WORDS` or `-mw`: Maximum number of words allowed in a caption.
- `--IMG_SIDE_LEN` or `-s`: Final side length of the square images output by MinImagen.
- `--EPOCHS` or `-e`: Number of training epochs.
- `--T5_NAME` or `-t5`: Name of the T5 encoder to use.
- `--TRAIN_VALID_FRAC` or `-f`: Fraction of the dataset to use for training versus validation.
- `--TIMESTEPS` or `-t`: Number of timesteps in the Diffusion Process.
- `--OPTIM_LR` or `-lr`: Learning rate for the Adam optimizer.
- `--ACCUM_ITER` or `-ai`: Number of batches to accumulate for gradient accumulation.
- `--CHCKPT_NUM` or `-cn`: Interval of batches at which to create a temporary model checkpoint during training.
- `--VALID_NUM` or `-vn`: Number of validation images to use.
- `--RESTART_DIRECTORY` or `-rd`: Training Directory to load the MinImagen instance from when resuming training.
- `--TESTING` or `-test`: Run the script with a small MinImagen instance and a small dataset for testing.
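For readers writing their own scripts, a few of the flags above could be declared with `argparse` roughly as follows. This is a sketch, not the repository's actual parser, and the defaults here are placeholders rather than the project's real defaults:

```python
import argparse

def build_parser():
    # Sketch of a parser for a subset of the flags listed above.
    # Defaults are illustrative placeholders, not the project's real defaults.
    parser = argparse.ArgumentParser(description="MinImagen training (sketch)")
    parser.add_argument("--PARAMETERS", "-p", type=str, default=None,
                        help="MinImagen configuration directory")
    parser.add_argument("--BATCH_SIZE", "-b", type=int, default=2,
                        help="Batch size for training")
    parser.add_argument("--TIMESTEPS", "-t", type=int, default=25,
                        help="Number of timesteps in the Diffusion Process")
    parser.add_argument("--TESTING", "-test", action="store_true",
                        help="Run with a small instance and dataset")
    return parser
```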
For example:

```bash
python train.py --PARAMETERS ./parameters --BATCH_SIZE 2 --TIMESTEPS 25 --TESTING
```

To generate images using a model from a Training Directory, use inference.py with the following command line arguments:
- `--TRAINING_DIRECTORY` or `-d`: Training Directory from which to load the model for inference.
- `--CAPTIONS` or `-c`: Either a single caption to generate an image for, or a filepath to a `.txt` file containing a list of captions, each on a new line.
For example:

```bash
python inference.py --CAPTIONS captions.txt --TRAINING_DIRECTORY training_<TIMESTAMP>
```

If you wish to create your own training and inference scripts, take a look at the files train.py and inference.py to get some inspiration.
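As a starting point for a custom script, the sketch below reads a captions file in the format inference.py expects (one caption per line) and pairs each caption with its image number, following the convention described earlier that line index i in `captions.txt` corresponds to image i in the `generated_images` folder. The helper name is hypothetical, not part of the repository:

```python
def read_captions(path):
    """Read a captions file (one caption per line) and return a list where
    index i holds the caption for image i in the generated_images folder."""
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f]
```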