
RoboBrain 2.0

Python 3.10+ · PyTorch 2.1+ · React 18 · License: MIT · CUDA

A Vision-Language Model for Robotic Understanding and Interaction

Advanced multi-modal AI system with interactive web interface, automatic task detection, and optimized memory management

Installation | Quick Start | Web Interface | Troubleshooting


🌟 Overview

RoboBrain 2.0 is a state-of-the-art vision-language model designed for robotic perception and interaction tasks. This implementation provides:

  • Interactive Web Interface: Modern React-based chat UI with dark mode support
  • Automatic Task Detection: AI-powered task classification using Groq's Llama 3.3
  • Multi-Turn Conversations: Persistent conversation history with image context
  • Optimized Memory Management: 8-bit quantization for efficient GPU usage
  • Local & Cloud Support: Run offline with downloaded weights or use Hugging Face

The system combines Qwen2.5-VL (7B/32B) for vision-language understanding with a Flask backend and React frontend for seamless interaction.


✨ Features

Core Capabilities

Feature                 Description
General QA              Answer questions about images with natural language
Object Grounding        Detect and localize objects with bounding boxes
Affordance Prediction   Identify interaction points for robotic manipulation
Trajectory Generation   Plan motion paths for task completion
Pointing Tasks          Localize specific points of interest in images

Technical Features

  • πŸ€– Auto Mode: Automatically detects task type from natural language prompts
  • πŸ’¬ Multi-Turn Memory: Maintains context across conversation
  • 🎨 Visual Output: Generates annotated images for spatial tasks
  • 🧠 Thinking Mode: Optional chain-of-thought reasoning display
  • πŸ”’ Session Management: Independent conversation sessions
  • πŸ“¦ 8-bit Quantization: Reduces memory from ~6GB to ~3GB
  • πŸŒ™ Dark Mode UI: Easy on the eyes for extended use

πŸ—οΈ Architecture

┌──────────────────────────────────────────────────────────┐
│                     Frontend (React)                     │
│  • Modern chat interface with dark mode                  │
│  • Image upload and preview                              │
│  • Task selection (Auto/Manual modes)                    │
│  • Real-time response streaming                          │
└─────────────────────┬────────────────────────────────────┘
                      │ HTTP/REST API
┌─────────────────────▼────────────────────────────────────┐
│                  Backend (Flask)                         │
│  • Session management                                    │
│  • Auto task detection (Groq Llama 3.3)                  │
│  • Inference orchestration                               │
│  • Memory optimization & cleanup                         │
└─────────────────────┬────────────────────────────────────┘
                      │
┌─────────────────────▼────────────────────────────────────┐
│            RoboBrain2.0 Inference Engine                 │
│  • Qwen2.5-VL model (7B/32B)                             │
│  • 8-bit quantization with bitsandbytes                  │
│  • Multi-turn conversation memory                        │
│  • Visual annotation & plotting                          │
└──────────────────────────────────────────────────────────┘
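
For orientation, here is a minimal sketch of how a Qwen2.5-VL checkpoint can be loaded with 8-bit quantization via transformers and bitsandbytes. The checkpoint ID and exact arguments are illustrative assumptions, not the literal code in RoboBrain2.0_lib/inference.py:

# Sketch: 8-bit quantized loading (illustrative; see RoboBrain2.0_lib/inference.py for the real code)
from transformers import AutoProcessor, BitsAndBytesConfig, Qwen2_5_VLForConditionalGeneration

MODEL_ID = "BAAI/RoboBrain2.0-7B"  # assumed checkpoint name

quant_config = BitsAndBytesConfig(load_in_8bit=True)   # bitsandbytes 8-bit weights
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    quantization_config=quant_config,
    device_map="auto",                                 # place layers on the available GPU
)
processor = AutoProcessor.from_pretrained(MODEL_ID)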

💻 Requirements

Hardware

Component    Minimum            Recommended
GPU VRAM     6 GB (RTX 2060)    8+ GB (RTX 3070+)
System RAM   16 GB              32 GB
Storage      20 GB free         50 GB free
CUDA         11.8+              12.1+

Software

  • Python 3.10 or higher
  • Node.js 16.x or higher
  • npm 8.x or higher
  • CUDA 11.8+ (for GPU acceleration)
  • Git

🚀 Installation

1. Clone Repository

git clone https://github.com/YasiruDEX/Robobrain-2.0.git
cd Robobrain-2.0

2. Backend Setup

Using Conda (Recommended)

# Create environment from environment.yml
conda env create -f environment.yml
conda activate robobrain2-env

# Install additional dependencies
pip install -r requirements.txt

Using pip + venv

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt

3. Frontend Setup

cd frontend
npm install
cd ..

4. Environment Configuration

Create a .env file in the project root:

cp .env.example .env

Edit .env and add your API keys:

# Hugging Face token (optional, for cloud model access)
HF_TOKEN=hf_your_token_here

# Groq API key (required for Auto Mode)
GROQ_API_KEY=your_groq_api_key_here

Get API Keys:

  • HF_TOKEN: create a Hugging Face access token at https://huggingface.co/settings/tokens
  • GROQ_API_KEY: create a key in the Groq console at https://console.groq.com/

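To confirm the keys are actually picked up, this short check uses python-dotenv (the same library the Troubleshooting section relies on); it is a sketch, not a repo script:

# Sketch: verify that .env is loaded (python-dotenv)
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
print("HF_TOKEN set:", bool(os.getenv("HF_TOKEN")))          # optional
print("GROQ_API_KEY set:", bool(os.getenv("GROQ_API_KEY")))  # required for Auto Mode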

🎯 Usage

Starting the Application

1. Start Backend

# Using the convenience script
./run_backend.sh

# Or manually
conda activate robobrain2-env
python backend.py

The backend will start on http://localhost:5001

2. Start Frontend

In a new terminal:

cd frontend
npm run dev

The web interface will open at http://localhost:5173

Web Interface

  1. Create Session: Click "New Chat" to start a conversation
  2. Upload Image (optional): Click the image icon to upload
  3. Select Mode:
    • Auto Mode: AI automatically detects the task type
    • Manual Mode: Choose specific task (General/Grounding/Affordance/Trajectory/Pointing)
  4. Send Message: Type your question and press Enter

Auto Mode (AI Task Detection)

When Auto Mode is enabled, the system uses Groq's Llama 3.3 to automatically classify your prompt:

  • "Where is the apple?" β†’ Grounding
  • "How can I grab this?" β†’ Affordance
  • "Plan a path to reach the cup" β†’ Trajectory
  • "Point to all the chairs" β†’ Pointing
  • "What color is the table?" β†’ General QA

The detected task is displayed in the response with a ✨ sparkle icon.
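
Under the hood this amounts to a single classification call. The sketch below shows one way to make it with the Groq SDK; the model name and prompt are assumptions for illustration, not the exact logic in backend.py:

# Sketch: prompt classification with the Groq SDK (illustrative; not the exact backend.py logic)
import os
from dotenv import load_dotenv
from groq import Groq

load_dotenv()
client = Groq(api_key=os.getenv("GROQ_API_KEY"))

def detect_task(prompt: str) -> str:
    """Return one of: general, grounding, affordance, trajectory, pointing."""
    resp = client.chat.completions.create(
        model="llama-3.3-70b-versatile",  # assumed Llama 3.3 model name on Groq
        messages=[
            {"role": "system", "content": "Classify the user's request as exactly one word: "
                                          "general, grounding, affordance, trajectory, or pointing."},
            {"role": "user", "content": prompt},
        ],
        temperature=0,
    )
    return resp.choices[0].message.content.strip().lower()

print(detect_task("Where is the apple?"))  # expected: grounding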

CLI Scripts

For command-line usage or testing:

# General question answering
python scripts/general.py --image path/to/image.jpg --prompt "What is in this image?"

# Object grounding
python scripts/grounding.py --image path/to/image.jpg --object "red apple"

# Affordance prediction
python scripts/affordance.py --image path/to/image.jpg --task "pick up the cup"

# Trajectory generation
python scripts/trajectory.py --image path/to/image.jpg --task "move to the door"

# Multi-turn conversation
python scripts/multi_turn.py

📡 API Documentation

Base URL

http://localhost:5001/api

Endpoints

Create Session

POST /session

Response:

{
  "session_id": "uuid",
  "sessionId": "uuid"
}

Send Message

POST /chat

Request Body:

{
  "session_id": "uuid",
  "message": "What is in this image?",
  "image": "filename.jpg",
  "task": "auto",
  "enable_thinking": true
}

Response:

{
  "answer": "The image shows...",
  "thinking": "[[coordinates]]",
  "output_image": "/result/annotated.jpg",
  "task": "grounding",
  "task_source": "auto"
}

Upload Image

POST /upload
Content-Type: multipart/form-data

Response:

{
  "path": "/absolute/path/to/image.jpg",
  "filename": "uuid_image.jpg",
  "url": "/uploads/uuid_image.jpg"
}

Delete Session

DELETE /session/<session_id>

Health Check

GET /health

Response:

{
  "status": "healthy",
  "model_loaded": true,
  "active_sessions": 2
}
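
Tying the endpoints together, here is a minimal client sketch using requests. Field names follow the endpoint examples above; the multipart field name "file" is an assumption:

# Sketch: end-to-end API usage with requests (field names follow the endpoint examples above)
import requests

BASE = "http://localhost:5001/api"

# 1. Create a conversation session
session = requests.post(f"{BASE}/session").json()

# 2. Upload an image (multipart field name "file" is an assumption)
with open("kitchen.jpg", "rb") as f:
    upload = requests.post(f"{BASE}/upload", files={"file": f}).json()

# 3. Send a message referencing the uploaded image
reply = requests.post(f"{BASE}/chat", json={
    "session_id": session["session_id"],
    "message": "Where is the apple?",
    "image": upload["filename"],
    "task": "auto",
    "enable_thinking": True,
}).json()

print(reply["task"], "->", reply["answer"])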

πŸ“ Project Structure

Robobrain-2.0/
├── backend.py                 # Flask API server
├── RoboBrain2.0_lib/          # Core inference library
│   ├── inference.py           # Model loading & inference
│   └── multi_turn.py          # Conversation memory
├── scripts/                   # CLI task scripts
│   ├── general.py
│   ├── grounding.py
│   ├── affordance.py
│   ├── trajectory.py
│   ├── multi_turn.py
│   └── utils.py               # Model utilities
├── frontend/                  # React web interface
│   ├── src/
│   │   ├── components/        # UI components
│   │   │   ├── ChatContainer.jsx
│   │   │   ├── Message.jsx
│   │   │   └── Sidebar.jsx
│   │   ├── api.js             # Backend API client
│   │   └── App.jsx
│   ├── package.json
│   └── vite.config.js
├── uploads/                   # Uploaded images
├── result/                    # Generated output images
├── conversations/             # Saved conversation JSON
├── weights/                   # Local model weights (optional)
├── requirements.txt           # Python dependencies
├── environment.yml            # Conda environment spec
├── .env                       # API keys (not in git)
├── .env.example               # Template for .env
└── README.md

🔧 Troubleshooting

GPU Out of Memory

Symptoms: CUDA out of memory error during inference

Solutions:

  1. The system automatically uses 8-bit quantization and reserves ~800MB headroom
  2. If still failing, reduce image resolution before uploading
  3. Close other GPU applications (browsers, games, etc.)
  4. Restart the backend to clear GPU cache (or free it in-process, as sketched below):
    pkill -9 python
    python backend.py
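
As an alternative to killing the process, cached GPU memory can be released in-process with standard PyTorch calls (a sketch, not a repo utility):

# Sketch: release cached GPU memory in-process (standard PyTorch/gc calls)
import gc
import torch

gc.collect()               # drop unreferenced Python objects first
torch.cuda.empty_cache()   # return cached blocks to the CUDA driver
print(f"{torch.cuda.memory_allocated() / 1e9:.2f} GB still allocated")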

Model Not Loading

Symptoms: Model loaded: False on startup

Solutions:

  1. Check if weights exist in weights/ directory
  2. Verify Hugging Face token in .env if using cloud weights
  3. Ensure sufficient disk space (20GB+)
  4. Check CUDA installation:
    python -c "import torch; print(torch.cuda.is_available())"

Frontend Not Connecting

Symptoms: "Failed to fetch" or connection errors in browser

Solutions:

  1. Verify backend is running on port 5001:
    curl http://localhost:5001/api/health
  2. Check if port 5001 is blocked by firewall
  3. Ensure CORS is enabled (already configured in backend.py)
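
For reference, flask-cors enables this with a single call; backend.py already ships an equivalent configuration:

# Sketch: the flask-cors pattern backend.py relies on (already configured there)
from flask import Flask
from flask_cors import CORS

app = Flask(__name__)
CORS(app)  # permit cross-origin requests, e.g. from the Vite dev server on :5173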

Auto Mode Not Working

Symptoms: Task detection fails or returns "general" for all prompts

Solutions:

  1. Verify GROQ_API_KEY is set in .env
  2. Check Groq API quota: https://console.groq.com/
  3. Test Groq connection:
    python -c "from groq import Groq; import os; from dotenv import load_dotenv; load_dotenv(); client = Groq(api_key=os.getenv('GROQ_API_KEY')); print('Connected')"

Port Already in Use

Symptoms: Address already in use when starting backend

Solutions:

# Find and kill process using port 5001
lsof -ti:5001 | xargs kill -9

# Or use the provided script
./kill_backend.sh

🤝 Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit changes (git commit -m 'Add amazing feature')
  4. Push to branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.


πŸ™ Credits

Developed by: Yasiru Jayasooriya (@YasiruDEX)

Built with:

  • Qwen2.5-VL (vision-language backbone)
  • PyTorch + bitsandbytes (inference and 8-bit quantization)
  • Flask (backend API)
  • React + Vite (web interface)
  • Groq Llama 3.3 (automatic task detection)

Special Thanks:

  • BAAI Team for RoboBrain model architecture
  • Hugging Face for model hosting and tools
  • The open-source AI community

⭐ Star this repo if you find it useful!

Report Bug · Request Feature
