
VLM: A Multimodal Vision-Language Model 🖼️➡️📝

A LLaVA-style multimodal model that combines a pre-trained CLIP vision encoder with the TinyLlama language model to generate rich, detailed image descriptions.

What It Does

  • Image to Text: Takes an image as input and generates a human-like text description.
  • Versatile: Ideal for automated image captioning, visual assistance, and visual Q&A systems.

🏛️ Architecture

The model follows a simple yet effective architecture to bridge the gap between vision and language.

Image → CLIP Vision Encoder → Bridge Network (MLP) → TinyLlama LLM → Text Description

Only the Bridge Network is trained, making the process highly efficient and resource-friendly.
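
Concretely, this follows the LLaVA recipe: CLIP's patch embeddings are projected by a small MLP into TinyLlama's token-embedding space and fed to the language model as visual tokens. Below is a minimal sketch of that wiring with the Hugging Face transformers API; the BridgeMLP class, its layer sizes, and the local model paths are illustrative assumptions, not the repository's actual implementation (see architecture_and_training.py for that).

# Minimal sketch of the vision-language bridge (illustrative, not the repo's code)
import torch
import torch.nn as nn
from transformers import CLIPVisionModel, AutoModelForCausalLM

class BridgeMLP(nn.Module):
    """Projects CLIP patch embeddings into the LLM's token-embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_embeds: torch.Tensor) -> torch.Tensor:
        return self.proj(patch_embeds)

# Load the locally cloned models and freeze them; only the bridge is trainable.
vision = CLIPVisionModel.from_pretrained("./clip-vit-base-patch32")
llm = AutoModelForCausalLM.from_pretrained("./TinyLlama-1.1B-Chat-v1.0")
for p in vision.parameters():
    p.requires_grad = False
for p in llm.parameters():
    p.requires_grad = False

bridge = BridgeMLP(vision.config.hidden_size, llm.config.hidden_size)

# Forward pass: image -> patch embeddings -> visual tokens in the LLM's space.
pixel_values = torch.randn(1, 3, 224, 224)              # a preprocessed image batch
patch_embeds = vision(pixel_values).last_hidden_state   # (1, 50, 768) for ViT-B/32
visual_tokens = bridge(patch_embeds)                     # (1, 50, 2048) for TinyLlama

During training, these visual tokens are concatenated with the embedded text prompt before being passed through the frozen TinyLlama, and only the bridge receives gradients, which is what keeps training light.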

📁 Project Structure

Here is an overview of the key files and directories in this project.


.
├── architecture_and_training.py  # Main model architecture, training, and inference logic
├── make_dataset.py               # Script to process the raw PixelProse dataset
├── testing.py                    # A simple script to run inference on images
├── .gitignore                    # Specifies files and folders for Git to ignore
├── README.md                     # You are here!
├── Test images/                  # Contains sample images for quick inference testing
└── License                       # MIT License

🚀 Getting Started

Follow these steps to set up the project, train the model, and run inference.

1. Clone Repositories

First, clone the required models and the PixelProse dataset from Hugging Face.

# Clone the ViT vision encoder
git clone https://huggingface.co/openai/clip-vit-base-patch32

# Clone the TinyLlama language model
git clone https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0

# Clone the PixelProse dataset
git clone https://huggingface.co/datasets/tomg-group-umd/pixelprose

2. Install Dependencies

Install the necessary Python packages. A virtual environment is recommended.

pip install torch transformers pillow pandas tqdm

3. Prepare the Dataset

Run the script to process and prepare the images and captions from the cloned pixelprose directory.

python make_dataset.py

4. Train the Model

Start the training process. The script will train the bridge network and save checkpoints automatically.

python architecture_and_training.py

5. Run Inference

Use the testing.py script to generate captions. The Test images/ folder in this repository contains a few sample images so you can try the model immediately after training.

python testing.py

🛠️ Usage Example

Here's how to programmatically generate a description for an image using a trained model.

# Note: You will need to implement the model loading and preprocessing functions
from architecture_and_training import load_model_for_inference, preprocess_image

# Load a trained model from a checkpoint
model = load_model_for_inference('checkpoints/best_model.pth')
image = preprocess_image('Test images/sample_image.jpg') # Example using a test image

# Generate a description for your image
response = model.generate(
    image=image,
    text_prompt="Describe this image in detail:",
    max_new_tokens=100
)

print(response)

📊 Training Details

  • Dataset: PixelProse (a ~120K-image subset with high-quality captions)

  • Epochs: 3

  • Batch Size: 4

  • Learning Rate: 5e-5

  • Optimizer: AdamW

  • Optimizations: Gradient accumulation and mixed-precision training for memory efficiency (see the sketch after this list).

  • Checkpoints: Saved every 500 steps and at the end of each training epoch.
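
These settings map onto a fairly standard PyTorch loop. The sketch below shows how gradient accumulation, mixed precision, and the 500-step checkpointing could fit together; model (an assumed wrapper combining the frozen encoders with the bridge from the sketch above), train_loader, and the accumulation factor are illustrative assumptions, and the actual loop lives in architecture_and_training.py.

# Illustrative training loop for the bridge only.
# Assumed names: `model` wraps the frozen CLIP + TinyLlama with the trainable
# `bridge` and returns a transformers-style output with a .loss field;
# `train_loader` yields (pixel_values, input_ids, labels) batches of size 4.
import torch
from torch.optim import AdamW

ACCUM_STEPS = 8                                # assumed; effective batch = 4 * ACCUM_STEPS
optimizer = AdamW(bridge.parameters(), lr=5e-5)
scaler = torch.cuda.amp.GradScaler()           # mixed-precision training

step = 0
for epoch in range(3):
    for pixel_values, input_ids, labels in train_loader:
        with torch.cuda.amp.autocast():
            loss = model(pixel_values=pixel_values,
                         input_ids=input_ids,
                         labels=labels).loss / ACCUM_STEPS
        scaler.scale(loss).backward()

        if (step + 1) % ACCUM_STEPS == 0:      # gradient accumulation boundary
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()

        if (step + 1) % 500 == 0:              # checkpoint every 500 steps
            torch.save(bridge.state_dict(), f"checkpoints/step_{step + 1}.pth")
        step += 1

    torch.save(bridge.state_dict(), f"checkpoints/epoch_{epoch + 1}.pth")  # end of epoch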

📋 Requirements

  • Python 3.8+ (3.12+ recommended)

  • PyTorch & Transformers

  • A CUDA-enabled GPU is highly recommended for training.

  • 16GB+ RAM

📄 License

This project is licensed under the MIT License.
