A Vision-Language Model for Robotic Understanding and Interaction
Advanced multi-modal AI system with interactive web interface, automatic task detection, and optimized memory management
Installation | Quick Start | Web Interface | Troubleshooting
- Overview
- Features
- Architecture
- Requirements
- Installation
- Usage
- API Documentation
- Project Structure
- Troubleshooting
- Contributing
- License
- Credits
RoboBrain 2.0 is a state-of-the-art vision-language model designed for robotic perception and interaction tasks. This implementation provides:
- Interactive Web Interface: Modern React-based chat UI with dark mode support
- Automatic Task Detection: AI-powered task classification using Groq's Llama 3.3
- Multi-Turn Conversations: Persistent conversation history with image context
- Optimized Memory Management: 8-bit quantization for efficient GPU usage
- Local & Cloud Support: Run offline with downloaded weights or use Hugging Face
The system combines Qwen2.5-VL (7B/32B) for vision-language understanding with a Flask backend and React frontend for seamless interaction.
| Feature | Description |
|---|---|
| General QA | Answer questions about images with natural language |
| Object Grounding | Detect and localize objects with bounding boxes |
| Affordance Prediction | Identify interaction points for robotic manipulation |
| Trajectory Generation | Plan motion paths for task completion |
| Pointing Tasks | Localize specific points of interest in images |
- 🤖 Auto Mode: Automatically detects task type from natural language prompts
- 💬 Multi-Turn Memory: Maintains context across the conversation
- 🎨 Visual Output: Generates annotated images for spatial tasks
- 🧠 Thinking Mode: Optional chain-of-thought reasoning display
- 🔄 Session Management: Independent conversation sessions
- 📦 8-bit Quantization: Reduces memory from ~6 GB to ~3 GB
- 🌙 Dark Mode UI: Easy on the eyes for extended use
```
┌──────────────────────────────────────────────┐
│               Frontend (React)               │
│  • Modern chat interface with dark mode      │
│  • Image upload and preview                  │
│  • Task selection (Auto/Manual modes)        │
│  • Real-time response streaming              │
└──────────────────────┬───────────────────────┘
                       │ HTTP/REST API
┌──────────────────────┴───────────────────────┐
│                Backend (Flask)               │
│  • Session management                        │
│  • Auto task detection (Groq Llama 3.3)      │
│  • Inference orchestration                   │
│  • Memory optimization & cleanup             │
└──────────────────────┬───────────────────────┘
                       │
┌──────────────────────┴───────────────────────┐
│         RoboBrain2.0 Inference Engine        │
│  • Qwen2.5-VL model (7B/32B)                 │
│  • 8-bit quantization with bitsandbytes      │
│  • Multi-turn conversation memory            │
│  • Visual annotation & plotting              │
└──────────────────────────────────────────────┘
```
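For reference, loading a vision-language checkpoint in 8-bit with transformers and bitsandbytes looks roughly like the sketch below. This is a minimal illustration, not the project's actual loading code (which lives in `RoboBrain2.0_lib/inference.py`); the checkpoint id is an assumption.

```python
# Minimal sketch of 8-bit model loading (illustrative only; checkpoint id
# is an assumption — swap in a path under weights/ for offline use).
from transformers import (
    AutoProcessor,
    BitsAndBytesConfig,
    Qwen2_5_VLForConditionalGeneration,
)

MODEL_ID = "Qwen/Qwen2.5-VL-7B-Instruct"  # assumed checkpoint id

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # ~3 GB vs ~6 GB
    device_map="auto",  # place weights on the available GPU automatically
)
processor = AutoProcessor.from_pretrained(MODEL_ID)
```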
| Component | Minimum | Recommended |
|---|---|---|
| GPU VRAM | 6 GB (RTX 2060) | 8+ GB (RTX 3070+) |
| System RAM | 16 GB | 32 GB |
| Storage | 20 GB free | 50 GB free |
| CUDA | 11.8+ | 12.1+ |
- Python 3.10 or higher
- Node.js 16.x or higher
- npm 8.x or higher
- CUDA 11.8+ (for GPU acceleration)
- Git
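You can sanity-check the prerequisites from a terminal before installing:

```bash
# Confirm the prerequisites are installed and on your PATH
python --version   # expect 3.10+
node --version     # expect 16.x+
npm --version      # expect 8.x+
nvcc --version     # expect CUDA 11.8+ (GPU setups only)
git --version
```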
Clone the repository:

```bash
git clone https://github.com/YasiruDEX/Robobrain-2.0.git
cd Robobrain-2.0
```

Set up the Python environment with conda:

```bash
# Create environment from environment.yml
conda env create -f environment.yml
conda activate robobrain2-env

# Install additional dependencies
pip install -r requirements.txt
```

Or with venv:

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
```

Install the frontend dependencies:

```bash
cd frontend
npm install
cd ..
```

Create a .env file in the project root:

```bash
cp .env.example .env
```

Edit .env and add your API keys:

```bash
# Hugging Face token (optional, for cloud model access)
HF_TOKEN=hf_your_token_here

# Groq API key (required for Auto Mode)
GROQ_API_KEY=your_groq_api_key_here
```

Get API Keys:
- Hugging Face: https://huggingface.co/settings/tokens
- Groq: https://console.groq.com/keys
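To confirm the keys are actually picked up, a quick check with python-dotenv (the same mechanism the Groq connection test under Troubleshooting uses) can help:

```python
# Quick check that the .env keys are readable (run from the project root)
import os
from dotenv import load_dotenv

load_dotenv()  # loads .env from the current working directory
print("HF_TOKEN set:    ", bool(os.getenv("HF_TOKEN")))
print("GROQ_API_KEY set:", bool(os.getenv("GROQ_API_KEY")))
```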
Start the backend:

```bash
# Using the convenience script
./run_backend.sh

# Or manually
conda activate robobrain2-env
python backend.py
```

The backend will start on http://localhost:5001.

Start the frontend in a new terminal:

```bash
cd frontend
npm run dev
```

The web interface will open at http://localhost:5173.
- Create Session: Click "New Chat" to start a conversation
- Upload Image (optional): Click the image icon to upload
- Select Mode:
- Auto Mode: AI automatically detects the task type
- Manual Mode: Choose specific task (General/Grounding/Affordance/Trajectory/Pointing)
- Send Message: Type your question and press Enter
When Auto Mode is enabled, the system uses Groq's Llama 3.3 to automatically classify your prompt:
- "Where is the apple?" β Grounding
- "How can I grab this?" β Affordance
- "Plan a path to reach the cup" β Trajectory
- "Point to all the chairs" β Pointing
- "What color is the table?" β General QA
The detected task is displayed in the response with a β¨ sparkle icon.
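Under the hood this is a single classification call to Groq. A minimal sketch of the idea follows; the prompt wording, label set, and model id here are assumptions, not the exact backend.py implementation.

```python
# Sketch of Auto Mode task detection with the Groq SDK. The prompt wording,
# label set, and model id are assumptions, not the exact backend.py code.
import os

from dotenv import load_dotenv
from groq import Groq

TASKS = ("general", "grounding", "affordance", "trajectory", "pointing")

load_dotenv()
client = Groq(api_key=os.getenv("GROQ_API_KEY"))

def detect_task(prompt: str) -> str:
    """Classify a user prompt into one of the supported task types."""
    completion = client.chat.completions.create(
        model="llama-3.3-70b-versatile",  # assumed Groq model id
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Classify the user's request as exactly one of: "
                        + ", ".join(TASKS) + ". Reply with the label only."},
            {"role": "user", "content": prompt},
        ],
    )
    label = completion.choices[0].message.content.strip().lower()
    return label if label in TASKS else "general"  # safe fallback

print(detect_task("Where is the apple?"))  # expected: grounding
```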
For command-line usage or testing:
```bash
# General question answering
python scripts/general.py --image path/to/image.jpg --prompt "What is in this image?"

# Object grounding
python scripts/grounding.py --image path/to/image.jpg --object "red apple"

# Affordance prediction
python scripts/affordance.py --image path/to/image.jpg --task "pick up the cup"

# Trajectory generation
python scripts/trajectory.py --image path/to/image.jpg --task "move to the door"

# Multi-turn conversation
python scripts/multi_turn.py
```

Base URL: `http://localhost:5001/api`
`POST /session`

Response:

```json
{
  "session_id": "uuid",
  "sessionId": "uuid"
}
```

`POST /chat`

Request body:

```json
{
  "session_id": "uuid",
  "message": "What is in this image?",
  "image": "filename.jpg",
  "task": "auto",
  "enable_thinking": true
}
```

Response:

```json
{
  "answer": "The image shows...",
  "thinking": "[[coordinates]]",
  "output_image": "/result/annotated.jpg",
  "task": "grounding",
  "task_source": "auto"
}
```

`POST /upload`

Content-Type: multipart/form-data

Response:

```json
{
  "path": "/absolute/path/to/image.jpg",
  "filename": "uuid_image.jpg",
  "url": "/uploads/uuid_image.jpg"
}
```

`DELETE /session/<session_id>`

`GET /health`

Response:

```json
{
  "status": "healthy",
  "model_loaded": true,
  "active_sessions": 2
}
```
```
Robobrain-2.0/
├── backend.py               # Flask API server
├── RoboBrain2.0_lib/        # Core inference library
│   ├── inference.py         # Model loading & inference
│   └── multi_turn.py        # Conversation memory
├── scripts/                 # CLI task scripts
│   ├── general.py
│   ├── grounding.py
│   ├── affordance.py
│   ├── trajectory.py
│   ├── multi_turn.py
│   └── utils.py             # Model utilities
├── frontend/                # React web interface
│   ├── src/
│   │   ├── components/      # UI components
│   │   │   ├── ChatContainer.jsx
│   │   │   ├── Message.jsx
│   │   │   └── Sidebar.jsx
│   │   ├── api.js           # Backend API client
│   │   └── App.jsx
│   ├── package.json
│   └── vite.config.js
├── uploads/                 # Uploaded images
├── result/                  # Generated output images
├── conversations/           # Saved conversation JSON
├── weights/                 # Local model weights (optional)
├── requirements.txt         # Python dependencies
├── environment.yml          # Conda environment spec
├── .env                     # API keys (not in git)
├── .env.example             # Template for .env
└── README.md
```
Symptoms: `CUDA out of memory` error during inference
Solutions:
- The system automatically uses 8-bit quantization and reserves ~800 MB of headroom
- If errors persist, reduce the image resolution before uploading
- Close other GPU applications (browsers, games, etc.)
- Restart the backend to clear GPU cache:
```bash
pkill -9 python
python backend.py
```
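If a restart is inconvenient, PyTorch's allocator cache can often be released in-process first. This is a generic PyTorch pattern, not something backend.py is documented to expose:

```python
# Release cached, unused GPU memory without restarting the process
import gc
import torch

gc.collect()              # drop unreferenced tensors first
torch.cuda.empty_cache()  # return cached blocks to the driver
print(f"{torch.cuda.memory_allocated() / 1e9:.2f} GB still allocated")
```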
Symptoms: `Model loaded: False` on startup

Solutions:

- Check that weights exist in the `weights/` directory
- Verify the Hugging Face token in `.env` if using cloud weights
- Ensure sufficient disk space (20 GB+)
- Check the CUDA installation:

  ```bash
  python -c "import torch; print(torch.cuda.is_available())"
  ```
Symptoms: "Failed to fetch" or connection errors in browser
Solutions:
- Verify the backend is running on port 5001:

  ```bash
  curl http://localhost:5001/api/health
  ```

- Check that port 5001 is not blocked by a firewall
- Ensure CORS is enabled (already configured in `backend.py`)
Symptoms: Task detection fails or returns "general" for all prompts
Solutions:
- Verify `GROQ_API_KEY` is set in `.env`
- Check your Groq API quota: https://console.groq.com/
- Test the Groq connection:

  ```bash
  python -c "from groq import Groq; import os; from dotenv import load_dotenv; load_dotenv(); client = Groq(api_key=os.getenv('GROQ_API_KEY')); print('Connected')"
  ```
Symptoms: `Address already in use` when starting the backend
Solutions:
```bash
# Find and kill the process using port 5001
lsof -ti:5001 | xargs kill -9

# Or use the provided script
./kill_backend.sh
```

Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
Developed by: Yasiru Jayasooriya (@YasiruDEX)
Built with:
- Qwen2.5-VL - Vision-Language Model
- Groq - Fast LLM Inference for Auto Mode
- React - Frontend Framework
- Flask - Backend API
- Transformers - Model Library
- bitsandbytes - 8-bit Quantization
Special Thanks:
- BAAI Team for RoboBrain model architecture
- Hugging Face for model hosting and tools
- The open-source AI community
⭐ Star this repo if you find it useful!