
🎥 Realtime Vision Captioning
Image Captioning · Visual Question Answering · Image Classification · Realtime Webcam
This repository contains a curated set of Jupyter notebooks demonstrating core computer vision and vision–language capabilities using pretrained models. The notebooks progress from offline image understanding tasks to a realtime webcam application that performs image captioning and image classification on live video streams.
- Image captioning
- Visual question answering (VQA)
- Image classification
- Realtime webcam captioning and classification
Semantic understanding of visual data is a foundational capability for modern, user-facing AI systems. Tasks such as image captioning, visual question answering, and image classification allow machines to describe scenes, answer questions about visual content, and recognize objects.
These techniques power real-world applications including accessibility tools, interactive AI interfaces, and automated perception systems. This repository demonstrates how such models can be applied both in offline experimentation and realtime interactive settings.
- Python
- PyTorch
- Torchvision
- Hugging Face Transformers (BLIP)
- Gradio
- PIL / NumPy
- ImageNet-pretrained models
Notebook: 01_image_captioning.ipynb
This notebook implements image captioning using BLIP from Hugging Face, generating natural language descriptions from images.
- Supports local images and image URLs
- Uses Gradio to provide an interactive, browser-based interface
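The core of the captioning step can be sketched in a few lines. This is a minimal sketch, not the notebook's exact code: the checkpoint name `Salesforce/blip-image-captioning-base` and the solid-color placeholder image are assumptions (the notebook loads local files or URLs instead).

```python
# Minimal BLIP captioning sketch using Hugging Face Transformers.
# Checkpoint name is an assumption; substitute the one used in the notebook.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# Placeholder image for illustration; in practice, open a local file or an image URL.
image = Image.new("RGB", (384, 384), "skyblue")
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=30)
caption = processor.decode(out[0], skip_special_tokens=True)
print(caption)
```

In the notebook, a function like this is wrapped in a Gradio interface so the image can be uploaded from the browser.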
Result
Image captioning using Gradio
Caption generated for a sample image
Notebook: 02_visual_question_answering.ipynb
This notebook demonstrates visual question answering using BLIP (VQA), enabling the model to answer natural language questions about image content.
- Accepts an image and a free-form question
- Produces concise, human-readable answers
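The VQA flow is the same pattern with a question string passed alongside the image. A minimal sketch, assuming the `Salesforce/blip-vqa-base` checkpoint (the notebook may use a different one) and a placeholder image:

```python
# Minimal BLIP VQA sketch: image + free-form question -> short answer.
# Checkpoint name is an assumption; substitute the notebook's checkpoint.
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

image = Image.new("RGB", (384, 384), "red")  # placeholder; use a real photo
question = "What color is this image?"
inputs = processor(images=image, text=question, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=10)
answer = processor.decode(out[0], skip_special_tokens=True)
print(answer)
```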
Result
Notebook: 03_image_classification_resnet50.ipynb
This notebook performs image classification using a ResNet-50 model pretrained on ImageNet.
- Outputs Top-K class predictions with confidence scores
- Uses Gradio for interactive testing and visualization
Result
Image classification using Gradio
Notebook: 04_realtime_webcam_caption_and_classify.ipynb
This notebook combines image captioning and image classification into a realtime webcam application, accessible directly through a web browser.
- Input: live webcam frames
- Output:
  - A natural language caption describing the scene
  - Top image classification predictions with confidence scores
The outputs update continuously as the camera view changes, allowing realtime observation of model behavior.
- Captioning runs on the full frame to capture scene context
- Classification operates on a center-focused crop to emphasize the primary object
- Frame throttling balances responsiveness and performance
- A Gradio interface provides adjustable controls:
  - Center zoom level
  - Number of Top-K predictions
  - Frame processing stride
  - Toggles to enable captioning, classification, or both
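The center-crop and frame-throttling logic described above can be sketched as small helpers. Function names and defaults here are illustrative, not the notebook's actual API:

```python
import numpy as np

def center_crop(frame: np.ndarray, zoom: float = 2.0) -> np.ndarray:
    """Return the central 1/zoom portion of an H x W x C frame.

    The classifier sees this crop so the primary object dominates the input,
    while the captioner still receives the full frame for scene context.
    """
    h, w = frame.shape[:2]
    ch, cw = max(1, int(h / zoom)), max(1, int(w / zoom))
    top, left = (h - ch) // 2, (w - cw) // 2
    return frame[top:top + ch, left:left + cw]

def should_process(frame_index: int, stride: int = 5) -> bool:
    """Throttle: run the models only on every `stride`-th frame."""
    return frame_index % stride == 0
```

With `zoom=2.0`, a 640x480 frame yields a 320x240 center crop; with `stride=5`, only every fifth frame reaches the models, which keeps the browser UI responsive while inference runs.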
Result
Each notebook is fully self-contained.
Open a notebook and run all cells in order.
For the realtime demo, webcam access is required.
- All models are pretrained
- No training or dataset setup is required
- The notebooks are intended for experimentation and demonstration purposes
This project is licensed under the MIT License.
See the LICENSE file for details.
For questions or collaboration, please contact:




