A LLaVA-style multimodal model that combines a pre-trained CLIP vision encoder with the TinyLlama language model to generate rich, detailed image descriptions.
- Image to Text: Takes an image as input and generates a human-like text description.
- Versatile: Ideal for automated image captioning, visual assistance, and visual Q&A systems.
The model follows a simple yet effective architecture to bridge the gap between vision and language.
Image → CLIP Vision Encoder → Bridge Network (MLP) → TinyLlama LLM → Text Description
Only the Bridge Network is trained, making the process highly efficient and resource-friendly.
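Conceptually, the Bridge Network is a small MLP that projects CLIP's patch features into TinyLlama's embedding space. The sketch below is illustrative only: the real module (class name, layer sizes, activation) lives in architecture_and_training.py, and the dimensions shown are simply the usual ones for CLIP ViT-B/32 (768) and TinyLlama-1.1B (2048).

```python
import torch
import torch.nn as nn

class BridgeMLP(nn.Module):
    """Projects frozen CLIP patch features into TinyLlama's token-embedding space.

    Illustrative sketch only; the actual module is defined in
    architecture_and_training.py and may differ in sizes and structure.
    """

    def __init__(self, clip_dim: int = 768, llm_dim: int = 2048, hidden_dim: int = 2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(clip_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, llm_dim),
        )

    def forward(self, clip_features: torch.Tensor) -> torch.Tensor:
        # clip_features: (batch, num_patches, clip_dim) from the CLIP vision encoder
        # returns visual "tokens" of shape (batch, num_patches, llm_dim) that can be
        # fed to TinyLlama alongside the text embeddings (LLaVA-style)
        return self.proj(clip_features)
```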
Here is an overview of the key files and directories in this project.
```
.
├── architecture_and_training.py   # Main model architecture, training, and inference logic
├── make_dataset.py                # Script to process the raw PixelProse dataset
├── testing.py                     # A simple script to run inference on images
├── .gitignore                     # Specifies files and folders for Git to ignore
├── README.md                      # You are here!
├── Test images/                   # Contains sample images for quick inference testing
└── License                        # MIT License
```
Follow these steps to set up the project, train the model, and run inference.
First, clone the required models and the PixelProse dataset from Hugging Face.
```bash
# Clone the CLIP vision encoder (ViT-B/32)
git clone https://huggingface.co/openai/clip-vit-base-patch32

# Clone the TinyLlama language model
git clone https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0

# Clone the PixelProse dataset
git clone https://huggingface.co/datasets/tomg-group-umd/pixelprose
```

Next, install the necessary Python packages. A virtual environment is recommended.
```bash
pip install torch transformers pillow pandas tqdm
```
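Optionally, you can sanity-check the clones and the installation before training. The snippet below is not part of the project's scripts; it simply assumes the Hugging Face repositories were cloned into the project root (adjust the paths if you cloned them elsewhere).

```python
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          CLIPImageProcessor, CLIPVisionModel)

# Paths assume the repos were cloned into the project root; adjust if not.
vision_encoder = CLIPVisionModel.from_pretrained("clip-vit-base-patch32")
image_processor = CLIPImageProcessor.from_pretrained("clip-vit-base-patch32")

tokenizer = AutoTokenizer.from_pretrained("TinyLlama-1.1B-Chat-v1.0")
llm = AutoModelForCausalLM.from_pretrained("TinyLlama-1.1B-Chat-v1.0")

print(vision_encoder.config.hidden_size, llm.config.hidden_size)  # e.g. 768 and 2048
```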
Run the script to process and prepare the images and captions from the cloned pixelprose directory.

```bash
python make_dataset.py
```
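PixelProse ships captions together with image URLs, so the preparation step essentially downloads each image, pairs it with its caption, and writes out an index the training script can read. The sketch below only illustrates that idea; the shard path and column names (url, vlm_caption) are assumptions and may not match what make_dataset.py actually does.

```python
from pathlib import Path
from urllib.request import urlopen

import pandas as pd  # reading parquet also requires pyarrow (pip install pyarrow)

# Illustrative assumptions: one parquet shard with "url" and "vlm_caption" columns.
shard = pd.read_parquet("pixelprose/data/part-00000.parquet")  # hypothetical shard path

out_dir = Path("dataset/images")
out_dir.mkdir(parents=True, exist_ok=True)

records = []
for i, row in shard.head(100).iterrows():      # small slice, just to show the flow
    img_path = out_dir / f"{i}.jpg"
    try:
        img_path.write_bytes(urlopen(row["url"], timeout=10).read())
    except OSError:
        continue                               # skip unreachable URLs
    records.append({"image": str(img_path), "caption": row["vlm_caption"]})

pd.DataFrame(records).to_csv("dataset/captions.csv", index=False)
```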
Start the training process. The script will train the bridge network and save checkpoints automatically.

```bash
python architecture_and_training.py
```
Use the testing.py script to generate captions. For your convenience, a Test images/ folder containing a few sample images is included in this repository so you can try the model immediately after training.

```bash
python testing.py
```

Here's how to programmatically generate a description for an image using a trained model:
```python
# Note: You will need to implement the model loading and preprocessing functions
from architecture_and_training import load_model_for_inference, preprocess_image

# Load a trained model from a checkpoint
model = load_model_for_inference('checkpoints/best_model.pth')

image = preprocess_image('Test images/sample_image.jpg')  # Example using a test image

# Generate a description for your image
response = model.generate(
    image=image,
    text_prompt="Describe this image in detail:",
    max_new_tokens=100
)

print(response)
```
Training configuration:

- Dataset: PixelProse (120K images with high-quality captions)
- Epochs: 3
- Batch Size: 4
- Learning Rate: 5e-5
- Optimizer: AdamW
- Optimizations: Gradient accumulation and mixed-precision training for memory efficiency (see the sketch after this list).
- Checkpoints: Saved every 500 steps and at the end of each training epoch.
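architecture_and_training.py already implements the full loop; the sketch below only shows how these settings (AdamW at 5e-5, gradient accumulation, mixed precision, 500-step checkpoints) typically fit together. bridge, train_loader, and compute_loss are placeholders for the project's real objects.

```python
import os
import torch

# Placeholders: bridge (the MLP being trained), train_loader, and compute_loss
# stand in for the real objects defined in architecture_and_training.py.
accum_steps = 8                                    # effective batch size = 4 * accum_steps
optimizer = torch.optim.AdamW(bridge.parameters(), lr=5e-5)   # only the bridge is updated
scaler = torch.cuda.amp.GradScaler()               # loss scaling for mixed precision
os.makedirs("checkpoints", exist_ok=True)

for epoch in range(3):
    for step, (images, captions) in enumerate(train_loader):
        with torch.cuda.amp.autocast():            # forward pass in reduced precision
            loss = compute_loss(bridge, images, captions) / accum_steps
        scaler.scale(loss).backward()

        if (step + 1) % accum_steps == 0:          # apply accumulated gradients
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()

        if (step + 1) % 500 == 0:                  # periodic checkpoint
            torch.save(bridge.state_dict(), f"checkpoints/step_{step + 1}.pth")

    torch.save(bridge.state_dict(), f"checkpoints/epoch_{epoch + 1}.pth")  # end-of-epoch checkpoint
```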
Requirements:

- Python 3.8+ (3.12+ recommended)
- PyTorch & Transformers
- A CUDA-enabled GPU is highly recommended for training.
- 16GB+ RAM
This project is licensed under the MIT License.