DISTANCE PERCEPTION IN VISION-LANGUAGE MODELS FOR BLIND NAVIGATION AND SCENE INTERPRETATION

Overview

This project presents a novel assistive system designed to enhance spatial awareness and navigation for visually impaired individuals in unfamiliar environments. Current approaches for assisting visually impaired people with wayfinding are often limited by a lack of spatial awareness and the inability to provide precise object-to-camera distances without extensive calibration or specialized hardware.

Our solution overcomes these limitations by integrating state-of-the-art deep learning models to accurately estimate metric object-to-camera distances directly from single monocular camera frames. The system then synthesizes this spatial information into coherent, natural language scene descriptions, delivered via an accessible web application.

Key Features

  • Accessible Navigation: A web application with a carousel-based navigation system (Gallery, Camera, Chat, Help) supporting keyboard, mouse, and touch interactions.
  • Image Input: Users can provide images through a gallery selection or real-time camera capture for processing.
  • Object Detection: Utilizes DETR (DEtection TRansformer) to identify and localize objects within images.
  • Metric Depth Estimation: Employs UniDepthV2 to generate precise metric depth maps, indicating object distances from the camera.
  • Scene Description Generation: Integrates Qwen-2.5-VL Vision-Language Model to synthesize detected objects and their metric distances into informative, natural language scene descriptions.
  • Accessible Output: Provides visual annotations (bounding boxes, depth maps) and delivers scene descriptions and distance information through Text-to-Speech (TTS) for screen reader users.
  • Multi-modal Interaction: Features a chat section using Qwen-2.5-VL for text and voice-based follow-up questions about processed scenes, maintaining conversation history for context-aware interaction.

Methodology

The system is built upon a multi-stage deep learning pipeline:

  1. Metric Depth Estimation: UniDepthV2 is used to generate a metric depth map from a single monocular image, addressing limitations of prior methods that required specialized hardware or calibration.
  2. Object Detection: A lightweight DEtection TRansformer (DETR) identifies and locates objects in the environment.
  3. Vision-Language Model (VLM): The Qwen-2.5-VL model synthesizes the detected objects and their corresponding metric distances (obtained by sampling the depth map at object centers) into coherent, natural language scene descriptions.
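The distance-sampling step in stage 3 can be sketched as follows. This is a minimal illustration assuming a NumPy depth map and pixel-space bounding boxes; the function and variable names are illustrative, not taken from the project's code:

```python
import numpy as np

def distances_at_box_centers(depth_map, boxes):
    """Sample the metric depth at each bounding-box center.

    depth_map: (H, W) NumPy array of metric depths in meters (e.g. from UniDepthV2).
    boxes: iterable of (x_min, y_min, x_max, y_max) pixel boxes (e.g. from DETR).
    Returns a list of object-to-camera distances in meters, one per box.
    """
    h, w = depth_map.shape
    distances = []
    for x_min, y_min, x_max, y_max in boxes:
        # Center pixel of the box, clamped to the image bounds.
        cx = min(int((x_min + x_max) / 2), w - 1)
        cy = min(int((y_min + y_max) / 2), h - 1)
        distances.append(float(depth_map[cy, cx]))
    return distances

# Example: a synthetic 4x4 depth map; one box covers the top-left quadrant.
depth = np.array([[1.0, 1.0, 3.0, 3.0],
                  [1.0, 2.0, 3.0, 3.0],
                  [4.0, 4.0, 5.0, 5.0],
                  [4.0, 4.0, 5.0, 5.0]])
print(distances_at_box_centers(depth, [(0, 0, 2, 2)]))  # → [2.0]
```

In practice one might average a small window around the center rather than read a single pixel, to be robust to depth-map noise.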

This integrated approach aims to provide visually impaired users with a more comprehensive understanding of their surroundings in a computationally efficient and generalizable manner, even on low-power mobile devices.

Requirements and Prerequisites

Hardware Requirements

  • Processor: Intel i5 or equivalent (or faster)
  • RAM: Minimum 8GB
  • Storage: At least 2GB free SSD space
  • Camera: Webcam or smartphone camera (for image input)
  • GPU: CUDA-compatible GPU (recommended for optimal performance)
  • Internet: Stable internet connection

Software Requirements

  • Operating System: Windows 10/11, macOS 11+, Ubuntu 20.04+, or other equivalent Linux distributions
    (For mobile devices: iOS 14+ or Android 9+)
  • Web Browsers: Chrome 90+, Firefox 88+, Safari 14+, Edge 90+

Backend

  • Python 3.8+
  • PyTorch 1.9+ (for DETR and UniDepthV2 models)
  • Flask 2.0+
  • OpenCV 4.5+
  • Pillow 8.0+
  • ngrok (for public deployment)

Frontend

  • TypeScript 4.4+
  • Tailwind CSS 3.4+

APIs

  • Valid OpenRouter API key (for Qwen-2.5-VL model access)
  • Valid ngrok API key (for public backend deployment)
  • Web Speech API (for Text-to-Speech functionality)

Permissions

  • Camera, microphone, and local storage permissions are required for the application.

Usage

The web application provides a user-friendly interface for visually impaired people to navigate using simple taps and swipes, with voice feedback through text-to-speech. Users can upload images or use their camera to capture scenes, which are then analyzed to provide detailed, distance-annotated descriptions.

The integrated chat functionality allows for personalized interaction and follow-up questions about the processed scenes.
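A context-aware follow-up request for the chat section could be assembled as sketched below. The endpoint follows OpenRouter's public chat-completions API, but the model slug and message layout are assumptions, not the project's actual code:

```python
import json

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
MODEL = "qwen/qwen2.5-vl-72b-instruct"  # assumed slug; check OpenRouter's model list

def build_chat_payload(history, question):
    """Append the new user question to the stored conversation history.

    history: prior [{"role": ..., "content": ...}] turns about the processed scene.
    """
    messages = history + [{"role": "user", "content": question}]
    return {"model": MODEL, "messages": messages}

history = [
    {"role": "user", "content": "Describe the scene."},
    {"role": "assistant", "content": "A chair is 1.2 m ahead; a door is 3 m to the left."},
]
payload = build_chat_payload(history, "How far is the door?")
print(json.dumps(payload, indent=2))
```

The payload would then be POSTed to `OPENROUTER_URL` with an `Authorization: Bearer <API key>` header; keeping the full history in `messages` is what lets the model answer follow-ups in context.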

Team

This project was submitted by:

  • BEN GEORGE
  • CHARUKESH PRASANTH
  • JASKARAN SINGH
  • KARTHIK E.M.

Output

  1. Upload image
  2. Detect objects using DETR
  3. Estimate metric depth using UniDepthV2
  4. Search

Demo

https://youtu.be/sCEQoFfh3N8?si=xBD4eUmwrIIMGNH4
