This project presents a novel assistive system designed to enhance spatial awareness and navigation for visually impaired individuals in unfamiliar environments. Existing wayfinding aids are often limited in the spatial context they convey and cannot provide precise object-to-camera distances without extensive calibration or specialized hardware.
Our solution overcomes these limitations by integrating state-of-the-art deep learning models to accurately estimate metric object-to-camera distances directly from single monocular camera frames. The system then synthesizes this spatial information into coherent, natural language scene descriptions, delivered via an accessible web application.
- Accessible Navigation: A web application with a carousel-based navigation system (Gallery, Camera, Chat, Help) supporting keyboard, mouse, and touch interactions.
- Image Input: Users can provide images through a gallery selection or real-time camera capture for processing.
- Object Detection: Utilizes DETR (DEtection TRansformer) to identify and localize objects within images.
- Metric Depth Estimation: Employs UniDepthV2 to generate precise metric depth maps, indicating object distances from the camera.
- Scene Description Generation: Integrates Qwen-2.5-VL Vision-Language Model to synthesize detected objects and their metric distances into informative, natural language scene descriptions.
- Accessible Output: Provides visual annotations (bounding boxes, depth maps) and delivers scene descriptions and distance information through Text-to-Speech (TTS) for screen reader users.
- Multi-modal Interaction: Features a chat section using Qwen-2.5-VL for text and voice-based follow-up questions about processed scenes, maintaining conversation history for context-aware interaction.
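To illustrate the context-aware chat described above, here is a minimal sketch of how a conversation history might be maintained and packaged into an OpenRouter-style (OpenAI-compatible) chat request for Qwen-2.5-VL. The model identifier and helper name are assumptions for illustration, not the project's actual code.

```python
# Hypothetical helper for context-aware follow-up questions. The payload
# shape follows OpenRouter's OpenAI-compatible chat-completions format;
# the application's real code may differ.
MODEL = "qwen/qwen2.5-vl-72b-instruct"  # assumed model identifier

def build_chat_payload(history, user_message):
    """Append the user's follow-up question to the running history and
    return the request body for the chat-completions endpoint."""
    history.append({"role": "user", "content": user_message})
    return {"model": MODEL, "messages": list(history)}

# Seed the history with the generated scene description so follow-up
# questions are answered in context.
history = [
    {"role": "system", "content": "You describe scenes for a visually impaired user."},
    {"role": "assistant", "content": "A chair is about 1.2 m ahead, slightly left."},
]
payload = build_chat_payload(history, "Is there anything between me and the chair?")
# The payload would then be POSTed, with an OpenRouter API key, to
# https://openrouter.ai/api/v1/chat/completions
```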
The system is built upon a multi-stage deep learning pipeline:
- Metric Depth Estimation: UniDepthV2 is used to generate a metric depth map from a single monocular image, addressing limitations of prior methods that required specialized hardware or calibration.
- Object Detection: A lightweight DEtection TRansformer (DETR) identifies and locates objects in the environment.
- Vision-Language Model (VLM): The Qwen-2.5-VL model synthesizes the detected objects and their corresponding metric distances (obtained by sampling the depth map at object centers) into coherent, natural language scene descriptions.
This integrated approach aims to provide visually impaired users with a more comprehensive understanding of their surroundings in a computationally efficient and generalizable manner, even on low-power mobile devices.
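As a concrete illustration of the depth-sampling step, the sketch below reads a metric depth map (such as UniDepthV2 would produce) at the center of each DETR-style bounding box. The function and variable names are illustrative only, under the assumption of pixel-coordinate `(x_min, y_min, x_max, y_max)` boxes.

```python
def distances_at_box_centers(depth_map, boxes):
    """Sample a metric depth map (row-major list of rows, in metres) at
    the center pixel of each (x_min, y_min, x_max, y_max) bounding box."""
    h, w = len(depth_map), len(depth_map[0])
    distances = []
    for x_min, y_min, x_max, y_max in boxes:
        # Clamp the box center to valid pixel indices.
        cx = min(max(int((x_min + x_max) / 2), 0), w - 1)
        cy = min(max(int((y_min + y_max) / 2), 0), h - 1)
        distances.append(float(depth_map[cy][cx]))
    return distances

# Toy example: a 4 x 4 depth map where every pixel is 2.5 m away.
depth = [[2.5] * 4 for _ in range(4)]
print(distances_at_box_centers(depth, [(0, 0, 3, 3)]))  # prints [2.5]
```

Sampling a single center pixel keeps the step cheap; averaging a small window around the center would be a straightforward, more noise-robust variant.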
- Processor: Intel i5 or equivalent (or faster)
- RAM: Minimum 8GB
- Storage: At least 2GB free SSD space
- Camera: Webcam or smartphone camera (for image input)
- GPU: CUDA-compatible GPU (recommended for optimal performance)
- Internet: Stable internet connection
- Operating System: Windows 10/11, macOS 11+, Ubuntu 20.04+, or other equivalent Linux distributions (for mobile devices: iOS 14+ or Android 9+)
- Web Browsers: Chrome 90+, Firefox 88+, Safari 14+, Edge 90+
- Python 3.8+
- PyTorch 1.9+ (for DETR and UniDepthV2 models)
- Flask 2.0+
- OpenCV 4.5+
- Pillow 8.0+
- ngrok (for public deployment)
- TypeScript 4.4+
- Tailwind CSS 3.4+
- Valid OpenRouter API key (for Qwen-2.5-VL model access)
- Valid ngrok API key (for public backend deployment)
- Web Speech API (for Text-to-Speech functionality)
- Camera, microphone, and local storage permissions are required for the application.
The web application provides a user-friendly interface for visually impaired people to navigate using simple taps and swipes, with voice feedback through text-to-speech. Users can upload images or use their camera to capture scenes, which are then analyzed to provide detailed, distance-annotated descriptions.
The integrated chat functionality allows for personalized interaction and follow-up questions about the processed scenes.
This project was submitted by:
- BEN GEORGE
- CHARUKESH PRASANTH
- JASKARAN SINGH
- KARTHIK E.M.
- Upload image
- Detect objects using DETR
- Estimate metric depth using UniDepthV2
- Search
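The steps above could be wired together roughly as follows. `detect_objects` and `estimate_depth` are stand-ins for the real DETR and UniDepthV2 inference calls, and the returned values are invented for illustration; only the data flow is meant to mirror the pipeline.

```python
# Illustrative wiring of the processing steps; the stubbed return values
# are placeholders, not real model output.
def detect_objects(image):
    # Would run DETR and return labels with pixel bounding boxes.
    return [{"label": "chair", "box": (40, 60, 120, 200)}]

def estimate_depth(image):
    # Would run UniDepthV2 and return a per-pixel metric depth map (metres).
    return [[1.8] * 320 for _ in range(240)]

def process(image):
    detections = detect_objects(image)
    depth = estimate_depth(image)
    for det in detections:
        x0, y0, x1, y1 = det["box"]
        # Distance = depth sampled at the box center, clamped to the image.
        cy = min((y0 + y1) // 2, len(depth) - 1)
        cx = min((x0 + x1) // 2, len(depth[0]) - 1)
        det["distance_m"] = depth[cy][cx]
    return detections

print(process(None))  # each detection gains a "distance_m" field
```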
