A system that generates natural-language descriptions from images of "what a robot sees" (CV => NLP).
This idea aligns with real robotics applications such as assistive robots, autonomous navigation, and systems that need to interpret their environment in a way that is understandable to humans.
The COCO dataset and a multimodal CV + NLP pipeline are used to train, validate, and test the models.
Robots often perceive their environment through cameras, but they lack the ability to summarize visual information in language.
Natural-language descriptions give robots:
- Human-friendly explanations of what they see.
- Context for decision making.
- Improved interpretability in assistive or domestic settings.
- A bridge between vision and communication.
The system aims to demonstrate how image captioning can serve as a foundational module for robot awareness.
As established above, the system uses the COCO dataset (Common Objects in Context), which contains:
- 118k training images.
- 5k validation images.
- 5 captions per image.
Source Link: https://cocodataset.org/#download
By the end of the project, the RoboCap system will:
- Generate readable captions from unseen images.
- Highlight actions and objects relevant to robot navigation.
Follow these steps to set up the environment, prepare the data, and run the RoboCap application.
To train the model, you must download the COCO 2017 Dataset. Specifically, you need:
- 2017 Train images
- 2017 Val images
- 2017 Train/Val annotations
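If you prefer to script the download, the optional sketch below fetches the three archives with Python's standard library and unpacks them into the layout described in the next step. The URLs are the ones listed on the COCO download page, so verify them there before running; note that train2017.zip alone is roughly 18 GB.

```python
# Optional helper: fetch and unpack the three COCO 2017 archives into the
# raw_data/ layout described below. Verify the URLs against
# https://cocodataset.org/#download before running.
import urllib.request
import zipfile
from pathlib import Path

ARCHIVES = {
    "http://images.cocodataset.org/zips/train2017.zip": "raw_data/images",
    "http://images.cocodataset.org/zips/val2017.zip": "raw_data/images",
    "http://images.cocodataset.org/annotations/annotations_trainval2017.zip": "raw_data",
}

for url, dest in ARCHIVES.items():
    archive = Path(url.rsplit("/", 1)[-1])
    if not archive.exists():                 # skip archives already fetched
        print(f"Downloading {url} ...")
        urllib.request.urlretrieve(url, archive)
    Path(dest).mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:     # the zips contain top-level
        zf.extractall(dest)                  # train2017/, val2017/, annotations/

# Create the empty output folder for the generated tensors.
Path("raw_data/pt_files").mkdir(parents=True, exist_ok=True)
```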
Create a directory named `raw_data` in the root of your project and structure your files exactly as shown below:

```
/raw_data
├── /images
│   ├── /train2017      # Place training images here
│   └── /val2017        # Place validation images here
├── /annotations        # Place downloaded .json annotations here
└── /pt_files           # Create this empty folder for output tensors
```
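With the files in place, a quick sanity check such as the one below confirms that the annotations load and that each image carries its five captions. It uses the `pycocotools` package (`pip install pycocotools`); `captions_train2017.json` is the standard file name shipped in the annotations archive.

```python
# Sanity check: load the caption annotations and inspect one image's captions.
# Requires: pip install pycocotools
from pycocotools.coco import COCO

coco = COCO("raw_data/annotations/captions_train2017.json")

img_id = coco.getImgIds()[0]                  # pick the first image
ann_ids = coco.getAnnIds(imgIds=[img_id])
captions = [a["caption"] for a in coco.loadAnns(ann_ids)]

print(f"Image {img_id} has {len(captions)} captions:")  # typically 5
for c in captions:
    print(" -", c)
```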
Note: We strongly recommend using a GPU with CUDA support for generating embeddings and training the model.
- **Generate Embeddings:** Open `notebooks/RoboCap_Lab.ipynb` and run all cells to process the images and captions. This generates `.pt` files (caption embeddings and image logits) and saves them into the `/raw_data/pt_files` directory you created earlier (a sketch of this preprocessing appears after this list).
- **Train the Model:** Create a new directory named `/results` in your project root (this is where model checkpoints will be saved). Open `notebooks/RoboCap_Encoding.ipynb` and run the notebook to fine-tune the BERT model.
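For orientation, here is a minimal sketch of that preprocessing for a single image-caption pair, assuming a ResNet-50 classifier for the image logits and `bert-base-uncased` for the caption embedding. The backbone choices and file names are illustrative assumptions only; the notebook is the authoritative implementation.

```python
# Minimal sketch of the kind of preprocessing RoboCap_Lab.ipynb performs for
# ONE image-caption pair. Backbones and file names are assumptions for
# illustration, not the notebook's exact implementation.
import torch
from PIL import Image
from torchvision import models
from transformers import BertModel, BertTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

# Vision side: a pretrained classifier whose outputs serve as image logits.
weights = models.ResNet50_Weights.DEFAULT
cnn = models.resnet50(weights=weights).to(device).eval()
preprocess = weights.transforms()

# Language side: BERT produces the caption embedding.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased").to(device).eval()

image = Image.open("raw_data/images/train2017/000000000009.jpg").convert("RGB")
caption = "A plate of food sitting on a table."

with torch.no_grad():
    logits = cnn(preprocess(image).unsqueeze(0).to(device))   # shape [1, 1000]
    tokens = tokenizer(caption, return_tensors="pt").to(device)
    embedding = bert(**tokens).last_hidden_state.mean(dim=1)  # shape [1, 768]

torch.save(
    {"logits": logits.cpu(), "caption_embedding": embedding.cpu()},
    "raw_data/pt_files/000000000009.pt",
)
```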
Open a terminal in your project root. We strongly recommend using a virtual environment.
- **Create and Activate Virtual Environment:**

  ```bash
  # Create the virtual environment
  python -m venv venv

  # Activate on Windows
  .\venv\Scripts\activate

  # Activate on Mac/Linux
  source venv/bin/activate
  ```
- **Prerequisite: Install NVIDIA CUDA Toolkit:** Before installing PyTorch, ensure your machine has the correct CUDA Toolkit installed to support GPU acceleration.
  - **Download Toolkit:** Visit the NVIDIA CUDA Toolkit Archive and install the version matching your system (e.g., 11.8).
- **Install PyTorch with CUDA:** Visit pytorch.org and copy the install command for your Compute Platform (CUDA version). Example for CUDA 11.8 (a verification snippet appears after this list):

  ```bash
  pip install torch torchvision --index-url https://download.pytorch.org/whl/cu118
  ```
- **Install Remaining Dependencies:**

  ```bash
  pip install -r requirements.txt
  ```
- **Run the Backend:** Run the server on host `0.0.0.0` so it is accessible by your mobile device:

  ```bash
  uvicorn api.main:app --host 0.0.0.0 --port 8000
  ```
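Tip: after the PyTorch install step above, you can confirm that the CUDA build was picked up before starting the server:

```python
# Verify that PyTorch was installed with CUDA support.
import torch

print(torch.__version__)          # the CUDA 11.8 wheel reports e.g. "2.x.x+cu118"
print(torch.cuda.is_available())  # should print True on a working GPU setup
```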
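For orientation only, the snippet below sketches the general shape of an upload-and-caption endpoint. The `/caption` route name and the `generate_caption` helper are hypothetical placeholders; `api/main.py` in the repository is the actual implementation.

```python
# Illustrative sketch only; see api/main.py for the real app.
# The /caption route and generate_caption() are hypothetical placeholders.
import io

from fastapi import FastAPI, File, UploadFile
from PIL import Image

app = FastAPI()

def generate_caption(image: Image.Image) -> str:
    # Placeholder: the real implementation runs the trained RoboCap model.
    raise NotImplementedError

@app.post("/caption")
async def caption(file: UploadFile = File(...)):
    # Decode the uploaded bytes into an RGB image and caption it.
    image = Image.open(io.BytesIO(await file.read())).convert("RGB")
    return {"caption": generate_caption(image)}
```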
Open a new terminal window and navigate to the frontend directory:

```bash
cd frontend/RoboCapFrontend
```
- **Configure Network IP:** To allow the mobile app to communicate with your computer, you must update the API configuration (a quick connectivity check appears after this list).
  - Run `ipconfig` (Windows) or `ifconfig` (Mac/Linux) in your terminal to find your computer's IPv4 Address (e.g., `192.168.1.87`).
  - Open the file `config.ts` and update the `API_BASE_URL`:

    ```ts
    // config.ts
    export const API_BASE_URL = "http://<YOUR_IPV4_ADDRESS>:8000";
    ```
- **Install & Run:**

  ```bash
  npm install
  npx expo start -c
  ```
- **Download Expo Go:** Install the Expo Go app on your mobile device (tested on iPhone 13).
- **Connect:** Scan the QR code displayed in your terminal with the Expo Go app.
- **Run RoboCap:** Take a picture using the app interface.
- **Wait:** The model typically takes 20-30 seconds to process the image and generate a caption.
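Before testing from the phone, you can sanity-check that the backend is reachable from another machine on the same network. The script below assumes the hypothetical `/caption` route from the backend sketch above; substitute your own IPv4 address and a local test image.

```python
# Connectivity check from another machine on the LAN.
# Assumes the hypothetical /caption route; substitute your IPv4 address.
import requests

url = "http://<YOUR_IPV4_ADDRESS>:8000/caption"

with open("test.jpg", "rb") as f:
    resp = requests.post(url, files={"file": f}, timeout=60)

print(resp.status_code)
print(resp.json())
```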
- Andrés Jaramillo Barón | A01029079
- Pedro Mauri Martínez | A01029143