A LLaVA-style multimodal model that combines a pre-trained CLIP vision encoder with the TinyLlama language model to generate rich, detailed image descriptions.
- Image to Text: Takes an image as input and generates a human-like text description.
- Versatile: Ideal for automated image captioning, visual assistance, and visual Q&A systems.
The model follows a simple yet effective architecture to bridge the gap between vision and language.
Image → CLIP Vision Encoder → Bridge Network (MLP) → TinyLlama LLM → Text Description
Only the Bridge Network is trained, making the process highly efficient and resource-friendly.
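Conceptually, the Bridge Network is a small MLP that projects CLIP's patch features into TinyLlama's embedding space. The sketch below is illustrative only: the real module (class name, layer sizes, activation) lives in architecture_and_training.py, and the dimensions shown are simply the usual ones for CLIP ViT-B/32 (768) and TinyLlama-1.1B (2048).

```python
import torch
import torch.nn as nn

class BridgeMLP(nn.Module):
    """Projects frozen CLIP patch features into TinyLlama's token-embedding space.

    Illustrative sketch only; the actual module is defined in
    architecture_and_training.py and may differ in sizes and structure.
    """

    def __init__(self, clip_dim: int = 768, llm_dim: int = 2048, hidden_dim: int = 2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(clip_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, llm_dim),
        )

    def forward(self, clip_features: torch.Tensor) -> torch.Tensor:
        # clip_features: (batch, num_patches, clip_dim) from the CLIP vision encoder
        # returns visual "tokens" of shape (batch, num_patches, llm_dim) that can be
        # fed to TinyLlama alongside the text embeddings (LLaVA-style)
        return self.proj(clip_features)
```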
Here is an overview of the key files and directories in this project.
```
.
├── architecture_and_training.py   # Main model architecture, training, and inference logic
├── make_dataset.py                # Script to process the raw PixelProse dataset
├── testing.py                     # A simple script to run inference on images
├── .gitignore                     # Specifies files and folders for Git to ignore
├── README.md                      # You are here!
├── Test images/                   # Contains sample images for quick inference testing
└── License                        # MIT License
```
Follow these steps to set up the project, train the model, and run inference.
First, clone the required models and the PixelProse dataset from Hugging Face.
```bash
# Clone the CLIP vision encoder (ViT-B/32)
git clone https://huggingface.co/openai/clip-vit-base-patch32

# Clone the TinyLlama language model
git clone https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0

# Clone the PixelProse dataset
git clone https://huggingface.co/datasets/tomg-group-umd/pixelprose
```

Next, install the necessary Python packages. A virtual environment is recommended.
```bash
pip install torch transformers pillow pandas tqdm
```
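Optionally, you can sanity-check the clones and the installation before training. The snippet below is not part of the project's scripts; it simply assumes the Hugging Face repositories were cloned into the project root (adjust the paths if you cloned them elsewhere).

```python
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          CLIPImageProcessor, CLIPVisionModel)

# Paths assume the repos were cloned into the project root; adjust if not.
vision_encoder = CLIPVisionModel.from_pretrained("clip-vit-base-patch32")
image_processor = CLIPImageProcessor.from_pretrained("clip-vit-base-patch32")

tokenizer = AutoTokenizer.from_pretrained("TinyLlama-1.1B-Chat-v1.0")
llm = AutoModelForCausalLM.from_pretrained("TinyLlama-1.1B-Chat-v1.0")

print(vision_encoder.config.hidden_size, llm.config.hidden_size)  # e.g. 768 and 2048
```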
Run the script to process and prepare the images and captions from the cloned pixelprose directory.

```bash
python make_dataset.py
```
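PixelProse ships captions together with image URLs, so the preparation step essentially downloads each image, pairs it with its caption, and writes out an index the training script can read. The sketch below only illustrates that idea; the shard path and column names (url, vlm_caption) are assumptions and may not match what make_dataset.py actually does.

```python
from pathlib import Path
from urllib.request import urlopen

import pandas as pd  # reading parquet also requires pyarrow (pip install pyarrow)

# Illustrative assumptions: one parquet shard with "url" and "vlm_caption" columns.
shard = pd.read_parquet("pixelprose/data/part-00000.parquet")  # hypothetical shard path

out_dir = Path("dataset/images")
out_dir.mkdir(parents=True, exist_ok=True)

records = []
for i, row in shard.head(100).iterrows():      # small slice, just to show the flow
    img_path = out_dir / f"{i}.jpg"
    try:
        img_path.write_bytes(urlopen(row["url"], timeout=10).read())
    except OSError:
        continue                               # skip unreachable URLs
    records.append({"image": str(img_path), "caption": row["vlm_caption"]})

pd.DataFrame(records).to_csv("dataset/captions.csv", index=False)
```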
Start the training process. The script will train the bridge network and save checkpoints automatically.

```bash
python architecture_and_training.py
```
Use the testing.py script to generate captions. For your convenience, a Test images/ folder containing a few sample images is included in this repository so you can try the model immediately after training.

```bash
python testing.py
```

Here's how to programmatically generate a description for an image using a trained model:
```python
# Note: You will need to implement the model loading and preprocessing functions
from architecture_and_training import load_model_for_inference, preprocess_image

# Load a trained model from a checkpoint
model = load_model_for_inference('checkpoints/best_model.pth')

image = preprocess_image('Test images/sample_image.jpg')  # Example using a test image

# Generate a description for your image
response = model.generate(
    image=image,
    text_prompt="Describe this image in detail:",
    max_new_tokens=100
)

print(response)
```
Training configuration:

- Dataset: PixelProse (120K images with high-quality captions)
- Epochs: 3
- Batch Size: 4
- Learning Rate: 5e-5
- Optimizer: AdamW
- Optimizations: Gradient accumulation and mixed-precision training for memory efficiency (see the sketch after this list).
- Checkpoints: Saved every 500 steps and at the end of each training epoch.
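architecture_and_training.py already implements the full loop; the sketch below only shows how these settings (AdamW at 5e-5, gradient accumulation, mixed precision, 500-step checkpoints) typically fit together. bridge, train_loader, and compute_loss are placeholders for the project's real objects.

```python
import os
import torch

# Placeholders: bridge (the MLP being trained), train_loader, and compute_loss
# stand in for the real objects defined in architecture_and_training.py.
accum_steps = 8                                    # effective batch size = 4 * accum_steps
optimizer = torch.optim.AdamW(bridge.parameters(), lr=5e-5)   # only the bridge is updated
scaler = torch.cuda.amp.GradScaler()               # loss scaling for mixed precision
os.makedirs("checkpoints", exist_ok=True)

for epoch in range(3):
    for step, (images, captions) in enumerate(train_loader):
        with torch.cuda.amp.autocast():            # forward pass in reduced precision
            loss = compute_loss(bridge, images, captions) / accum_steps
        scaler.scale(loss).backward()

        if (step + 1) % accum_steps == 0:          # apply accumulated gradients
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()

        if (step + 1) % 500 == 0:                  # periodic checkpoint
            torch.save(bridge.state_dict(), f"checkpoints/step_{step + 1}.pth")

    torch.save(bridge.state_dict(), f"checkpoints/epoch_{epoch + 1}.pth")  # end-of-epoch checkpoint
```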
Requirements:

- Python 3.8+ (3.12+ recommended)
- PyTorch & Transformers
- A CUDA-enabled GPU is highly recommended for training.
- 16GB+ RAM
This project is licensed under the MIT License.