Developed an image captioning pipeline with the Hugging Face Transformers library to generate descriptive captions for images using the BLIP (Bootstrapping Language-Image Pre-training) model. Implemented preprocessing with Pillow for image loading and RGB conversion, and prepared model inputs with AutoProcessor so images were resized, normalized, and tokenized in a format the model accepts. Built the solution on transformers, Pillow, and PyTorch to automate image-to-text generation end to end. Gained hands-on experience with transformer-based vision-language models and their practical application to image captioning.
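The sketch below shows one way the inference side of such a pipeline can be wired together with Transformers, Pillow, and PyTorch. The checkpoint name and image path are illustrative assumptions (any BLIP captioning checkpoint on the Hugging Face Hub would work the same way); this is a minimal sketch, not the exact project code.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, BlipForConditionalGeneration

# Assumed checkpoint: the public BLIP base captioning model on the Hub
model_name = "Salesforce/blip-image-captioning-base"
processor = AutoProcessor.from_pretrained(model_name)
model = BlipForConditionalGeneration.from_pretrained(model_name)
model.eval()

def caption_image(image_path: str) -> str:
    # Pillow decodes the file; convert to RGB so the processor gets 3 channels
    image = Image.open(image_path).convert("RGB")
    # AutoProcessor resizes/normalizes the image and returns PyTorch tensors
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=30)
    # Decode the generated token ids back into a plain-text caption
    return processor.decode(output_ids[0], skip_special_tokens=True)

print(caption_image("example.jpg"))  # hypothetical input image
```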
Initial Caption Generated: the image of a cat and a dog
Caption Generated after Fine-Tuning: the image of tom and jerry from tom and jerry show
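A fine-tuned caption like the one above typically comes from continuing training on a small set of domain-specific image-caption pairs. The following is a minimal sketch of such a loop, assuming a hypothetical list of local image paths and target captions (the actual dataset, hyperparameters, and training setup are not specified in the source).

```python
import torch
from PIL import Image
from transformers import AutoProcessor, BlipForConditionalGeneration

model_name = "Salesforce/blip-image-captioning-base"  # assumed base checkpoint
processor = AutoProcessor.from_pretrained(model_name)
model = BlipForConditionalGeneration.from_pretrained(model_name)
model.train()

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Hypothetical fine-tuning pairs; replace with the real annotated dataset
pairs = [
    ("frames/tom_and_jerry_001.jpg", "tom and jerry from the tom and jerry show"),
]

for epoch in range(3):
    for image_path, caption in pairs:
        image = Image.open(image_path).convert("RGB")
        # The processor preprocesses the image and tokenizes the target caption
        inputs = processor(images=image, text=caption, return_tensors="pt")
        # Passing the caption ids as labels yields a language-modeling loss
        outputs = model(**inputs, labels=inputs["input_ids"])
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```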