DISTANCE PERCEPTION IN VISION-LANGUAGE MODELS FOR BLIND NAVIGATION AND SCENE INTERPRETATION

Overview

This project presents a novel assistive system designed to enhance spatial awareness and navigation for visually impaired individuals in unfamiliar environments. Current approaches for assisting visually impaired people with wayfinding are often limited by a lack of spatial awareness and the inability to provide precise object-to-camera distances without extensive calibration or specialized hardware.

Our solution overcomes these limitations by integrating state-of-the-art deep learning models to accurately estimate metric object-to-camera distances directly from single monocular camera frames. The system then synthesizes this spatial information into coherent, natural language scene descriptions, delivered via an accessible web application.

Key Features

  • Accessible Navigation: A web application with a carousel-based navigation system (Gallery, Camera, Chat, Help) supporting keyboard, mouse, and touch interactions.
  • Image Input: Users can provide images through a gallery selection or real-time camera capture for processing.
  • Object Detection: Utilizes DETR (DEtection TRansformer) to identify and localize objects within images.
  • Metric Depth Estimation: Employs UniDepthV2 to generate precise metric depth maps, indicating object distances from the camera.
  • Scene Description Generation: Integrates Qwen-2.5-VL Vision-Language Model to synthesize detected objects and their metric distances into informative, natural language scene descriptions.
  • Accessible Output: Provides visual annotations (bounding boxes, depth maps) and delivers scene descriptions and distance information through Text-to-Speech (TTS) for screen reader users.
  • Multi-modal Interaction: Features a chat section using Qwen-2.5-VL for text and voice-based follow-up questions about processed scenes, maintaining conversation history for context-aware interaction.

Methodology

The system is built upon a multi-stage deep learning pipeline:

  1. Metric Depth Estimation: UniDepthV2 is used to generate a metric depth map from a single monocular image, addressing limitations of prior methods that required specialized hardware or calibration.
  2. Object Detection: A lightweight DEtection TRansformer (DETR) identifies and locates objects in the environment.
  3. Vision-Language Model (VLM): The Qwen-2.5-VL model synthesizes the detected objects and their corresponding metric distances (obtained by sampling the depth map at object centers) into coherent, natural language scene descriptions.
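The distance-sampling step in stage 3 can be sketched as follows. This is a minimal illustration assuming a NumPy depth map and pixel-space bounding boxes; the function and variable names are illustrative, not taken from the project's code:

```python
import numpy as np

def distances_at_box_centers(depth_map, boxes):
    """Sample the metric depth at each bounding-box center.

    depth_map: (H, W) NumPy array of metric depths in meters (e.g. from UniDepthV2).
    boxes: iterable of (x_min, y_min, x_max, y_max) pixel boxes (e.g. from DETR).
    Returns a list of object-to-camera distances in meters, one per box.
    """
    h, w = depth_map.shape
    distances = []
    for x_min, y_min, x_max, y_max in boxes:
        # Center pixel of the box, clamped to the image bounds.
        cx = min(int((x_min + x_max) / 2), w - 1)
        cy = min(int((y_min + y_max) / 2), h - 1)
        distances.append(float(depth_map[cy, cx]))
    return distances

# Example: a synthetic 4x4 depth map; one box covers the top-left quadrant.
depth = np.array([[1.0, 1.0, 3.0, 3.0],
                  [1.0, 2.0, 3.0, 3.0],
                  [4.0, 4.0, 5.0, 5.0],
                  [4.0, 4.0, 5.0, 5.0]])
print(distances_at_box_centers(depth, [(0, 0, 2, 2)]))  # → [2.0]
```

In practice one might average a small window around the center rather than read a single pixel, to be robust to depth-map noise.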

This integrated approach aims to provide visually impaired users with a more comprehensive understanding of their surroundings in a computationally efficient and generalizable manner, even on low-power mobile devices.

Requirements and Prerequisites

Hardware Requirements

  • Processor: Intel i5 or equivalent (or faster)
  • RAM: Minimum 8GB
  • Storage: At least 2GB free SSD space
  • Camera: Webcam or smartphone camera (for image input)
  • GPU: CUDA-compatible GPU (recommended for optimal performance)
  • Internet: Stable internet connection

Software Requirements

  • Operating System: Windows 10/11, macOS 11+, Ubuntu 20.04+, or other equivalent Linux distributions
    (For mobile devices: iOS 14+ or Android 9+)
  • Web Browsers: Chrome 90+, Firefox 88+, Safari 14+, Edge 90+

Backend

  • Python 3.8+
  • PyTorch 1.9+ (for DETR and UniDepthV2 models)
  • Flask 2.0+
  • OpenCV 4.5+
  • Pillow 8.0+
  • ngrok (for public deployment)

Frontend

  • TypeScript 4.4+
  • Tailwind CSS 3.4+

APIs

  • Valid OpenRouter API key (for Qwen-2.5-VL model access)
  • Valid ngrok API key (for public backend deployment)
  • Web Speech API (for Text-to-Speech functionality)

Permissions

  • Camera, microphone, and local storage permissions are required for the application.

Usage

The web application provides a user-friendly interface for visually impaired people to navigate using simple taps and swipes, with voice feedback through text-to-speech. Users can upload images or use their camera to capture scenes, which are then analyzed to provide detailed, distance-annotated descriptions.

The integrated chat functionality allows for personalized interaction and follow-up questions about the processed scenes.
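A context-aware follow-up request for the chat section could be assembled as sketched below. The endpoint follows OpenRouter's public chat-completions API, but the model slug and message layout are assumptions, not the project's actual code:

```python
import json

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
MODEL = "qwen/qwen2.5-vl-72b-instruct"  # assumed slug; check OpenRouter's model list

def build_chat_payload(history, question):
    """Append the new user question to the stored conversation history.

    history: prior [{"role": ..., "content": ...}] turns about the processed scene.
    """
    messages = history + [{"role": "user", "content": question}]
    return {"model": MODEL, "messages": messages}

history = [
    {"role": "user", "content": "Describe the scene."},
    {"role": "assistant", "content": "A chair is 1.2 m ahead; a door is 3 m to the left."},
]
payload = build_chat_payload(history, "How far is the door?")
print(json.dumps(payload, indent=2))
```

The payload would then be POSTed to `OPENROUTER_URL` with an `Authorization: Bearer <API key>` header; keeping the full history in `messages` is what lets the model answer follow-ups in context.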

Team

This project was submitted by:

  • BEN GEORGE
  • CHARUKESH PRASANTH
  • JASKARAN SINGH
  • KARTHIK E.M.

Output

  1. Upload image
  2. Detect objects using DETR
  3. Estimate metric depth using UniDepthV2
  4. Search

Demo

https://youtu.be/sCEQoFfh3N8?si=xBD4eUmwrIIMGNH4
