
📸 Photo Label BLIP — AI Image Captioning & Smart Photo Naming

An AI-powered image captioning application that generates meaningful descriptions for photos using Salesforce BLIP (Bootstrapping Language-Image Pre-training) from HuggingFace Transformers.

This project demonstrates how vision-language models (VLMs) can be used to automatically analyze images and generate human-readable captions. It includes both:

  • 🌐 Interactive Web Interface using Gradio
  • 🖥 Local Python script for captioning images from a folder

The goal of the project is to automatically understand and label photos with natural language, making it easier to organize image collections.


🚀 Project Highlights

This project demonstrates practical experience with:

  • Python AI development
  • HuggingFace Transformers
  • Vision-Language Models (BLIP)
  • Image processing with PIL
  • Model inference pipelines
  • Gradio UI development
  • Local automation scripts
  • Git & GitHub project structure

The application takes an input image and generates a natural language caption describing the scene, enabling smarter photo organization and labeling.


🧠 AI Model Used

BLIP — Bootstrapping Language-Image Pre-training

The model used in this project is:

Salesforce/blip-image-captioning-base

BLIP is a vision-language transformer model trained to understand images and generate captions.

It works by:

  1. Encoding visual features from the image
  2. Aligning those features with language tokens
  3. Generating a natural-language description of the scene

This model enables tasks such as:

  • Image captioning
  • Visual question answering
  • Image understanding
  • Multimodal reasoning
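
The encode-align-generate pipeline above maps directly onto a few lines of HuggingFace Transformers code. The sketch below is illustrative, not the repository's exact script; it uses the same `Salesforce/blip-image-captioning-base` checkpoint (the first run downloads the weights) and a solid-color placeholder image so it is self-contained:

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

MODEL_ID = "Salesforce/blip-image-captioning-base"

# Load the processor (image preprocessing + tokenizer) and the model.
processor = BlipProcessor.from_pretrained(MODEL_ID)
model = BlipForConditionalGeneration.from_pretrained(MODEL_ID)

# Any RGB image works; replace with Image.open("Photos/Miata.jpg") for real photos.
image = Image.new("RGB", (384, 384), color="red")

# Step 1: encode visual features into model-ready tensors.
inputs = processor(images=image, return_tensors="pt")

# Steps 2-3: align features with language tokens and decode a caption.
output_ids = model.generate(**inputs, max_new_tokens=30)
caption = processor.decode(output_ids[0], skip_special_tokens=True)
print(caption)
```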

🏗 Project Structure

PHOTO_LABEL_BLIP
│
├── Photos/                     # Example images used for testing
│   ├── AI Meditation.jpg
│   ├── BlackRyu.jpg
│   ├── Generations.jpg
│   ├── Miata.jpg
│   └── TaiChiTigers.jpg
│
├── app/
│   └── gradio_img_app.py       # Web interface for captioning images
│
├── script/
│   └── local_image_cap.py      # Local script to caption images
│
├── requirements.txt            # Project dependencies
├── .gitignore                  # Ignored files for version control
│
└── dataset1.csv                # Optional dataset file

⚙️ Installation

Clone the repository:

git clone https://github.com/papasmurf79/photo_label_blip.git
cd photo_label_blip

Create a virtual environment:

python -m venv imgenv

Activate the environment:

Mac / Linux

source imgenv/bin/activate

Windows

imgenv\Scripts\activate

Install dependencies:

pip install -r requirements.txt
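
The exact version pins live in requirements.txt in the repository; based on the Technologies Used section, the dependency list presumably includes at least:

```text
transformers
torch
gradio
pillow
numpy
```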

🌐 Running the Web Application

Launch the Gradio interface:

python app/gradio_img_app.py

After starting, open the URL displayed in your terminal (usually):

http://127.0.0.1:7860

Upload an image and the AI will generate a caption.


🖥 Running the Local Caption Script

To caption an image locally:

python script/local_image_cap.py

You will be prompted to enter either the number or the filename of an image stored inside the Photos directory.

Example:

Enter image filename: Miata.jpg

Output:

Generated Caption:
"A red sports car parked on a road"
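
The number-or-filename prompt can be handled by a small resolver. A sketch, assuming the menu numbers photos starting at 1 (the `resolve_selection` name is illustrative, not from the repository):

```python
def resolve_selection(entry: str, photos: list[str]) -> str:
    """Map a user entry (menu number or filename) to a filename in Photos/."""
    entry = entry.strip()
    if entry.isdigit():
        index = int(entry) - 1          # menu is assumed to be 1-based
        if not 0 <= index < len(photos):
            raise ValueError(f"No photo numbered {entry}")
        return photos[index]
    if entry not in photos:
        raise ValueError(f"{entry!r} not found in Photos/")
    return entry

photos = ["AI Meditation.jpg", "BlackRyu.jpg", "Miata.jpg"]
print(resolve_selection("3", photos))          # → Miata.jpg
print(resolve_selection("Miata.jpg", photos))  # → Miata.jpg
```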

📷 Example Output

Input image:

Photos/Miata.jpg

Generated caption:

"A red sports car parked on a road"

The model analyzes the visual content and generates a natural language description.


🖥 Gradio App Demo

The project includes a lightweight interactive Gradio web interface that allows users to upload an image and instantly generate an AI caption using the BLIP vision-language model.

The interface makes the system accessible to non-technical users by providing a simple drag-and-drop workflow.

Upload Interface

Users can drag and drop an image or click to upload.

Gradio Upload Interface


Caption Generation Example

After uploading an image, the model analyzes the visual content and generates a natural language description.

Example result:

"the image of a black panther in the dark"

Gradio Caption Result


This interface demonstrates how vision-language models can be integrated into interactive AI applications, enabling real-time multimodal inference directly in the browser.

💡 Potential Future Improvements

Future enhancements could include:

  • Automatic AI-based photo renaming
  • Batch captioning for entire folders
  • Top-3 caption suggestions
  • Vector search for image similarity
  • Integration with cloud storage (AWS S3 / Google Drive)
  • Deploying the app with HuggingFace Spaces or Docker
  • Building a mobile interface
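
The first item on the list, automatic AI-based photo renaming, is a short step from the existing caption output: slugify the caption into a safe filename and rename the file. A sketch (function names are illustrative):

```python
import re
from pathlib import Path

def caption_to_filename(caption: str, suffix: str = ".jpg") -> str:
    """Turn a generated caption into a safe, descriptive filename."""
    slug = re.sub(r"[^a-z0-9]+", "_", caption.lower()).strip("_")
    return slug[:60] + suffix

def rename_from_caption(photo: Path, caption: str) -> Path:
    """Rename a photo in place using its AI-generated caption."""
    target = photo.with_name(caption_to_filename(caption, photo.suffix))
    return photo.rename(target)

print(caption_to_filename("A red sports car parked on a road"))
# → a_red_sports_car_parked_on_a_road.jpg
```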

📊 Skills Demonstrated

This project highlights practical skills in:

  • Python programming
  • AI model inference
  • Multimodal machine learning
  • Image processing
  • HuggingFace ecosystem
  • Gradio UI development
  • Git & GitHub version control
  • AI application architecture

🎯 Real-World Applications

AI image captioning systems like this are used in:

  • Photo organization tools
  • Accessibility tools for visually impaired users
  • Content moderation systems
  • Image search engines
  • Digital asset management platforms
  • AI assistants that understand visual data

📚 Technologies Used

  • Python
  • HuggingFace Transformers
  • BLIP Image Captioning Model
  • PyTorch
  • PIL (Python Imaging Library)
  • Gradio
  • NumPy

👨‍💻 Author

Developed as part of an AI lab exploring multimodal machine learning and image captioning systems.

This project demonstrates how modern vision-language models can be integrated into real-world AI applications using Python.


⭐ If You Found This Project Interesting

Feel free to:

  • Star the repository ⭐
  • Fork the project
  • Experiment with new models

AI + vision-language models are rapidly evolving — and projects like this are just the beginning.

