This project is a deep learning system that matches images and text based on how similar they are. It was built using PyTorch and trained on the MNIST dataset (handwritten digits). The model learns to understand both images and their matching text descriptions by putting them in the same kind of space.
Python | PyTorch
- Educational tools and interactive AI assistants
- Cross-modal search engines
- Image-caption retrieval systems
- main.py: Runs the full training pipeline. Loads MNIST data, converts digit labels to text variations, tokenizes the text using BERT, trains the image-text alignment model, and prints training loss after each epoch.
- model.py: Defines the MultiModalModel, which encodes images using CNN layers and encodes text using fully connected layers on top of BERT embeddings. Outputs two embeddings for similarity comparison.
- helpers.py: Contains utility functions like convert_digits_to_random_text (to turn digit labels into random text variants) and batched_tokenizer (to tokenize a list of texts using BERT). Helps keep main.py clean and modular.
- Alignment Accuracy: 95%+
- Loss Reduction: 80% in 5 epochs
- Training Time: Less than 10 minutes on most GPUs
Each digit label is randomly changed, so a "5" might become "Five", "quinque", or "Paanch", to make every training run a unique and fun way to teach the model how different words can mean the same thing.