A Vision-Language Model for Robotic Understanding and Interaction
Advanced multi-modal AI system with interactive web interface, automatic task detection, and optimized memory management
Installation | Quick Start | Web Interface | Troubleshooting
- Overview
- Features
- Architecture
- Requirements
- Installation
- Usage
- API Documentation
- Project Structure
- Troubleshooting
- Contributing
- License
- Credits
RoboBrain 2.0 is a state-of-the-art vision-language model designed for robotic perception and interaction tasks. This implementation provides:
- Interactive Web Interface: Modern React-based chat UI with dark mode support
- Automatic Task Detection: AI-powered task classification using Groq's Llama 3.3
- Multi-Turn Conversations: Persistent conversation history with image context
- Optimized Memory Management: 8-bit quantization for efficient GPU usage
- Local & Cloud Support: Run offline with downloaded weights or use Hugging Face
The system combines Qwen2.5-VL (7B/32B) for vision-language understanding with a Flask backend and React frontend for seamless interaction.
| Feature | Description |
|---|---|
| General QA | Answer questions about images with natural language |
| Object Grounding | Detect and localize objects with bounding boxes |
| Affordance Prediction | Identify interaction points for robotic manipulation |
| Trajectory Generation | Plan motion paths for task completion |
| Pointing Tasks | Localize specific points of interest in images |
- 🤖 Auto Mode: Automatically detects task type from natural language prompts
- 💬 Multi-Turn Memory: Maintains context across the conversation
- 🎨 Visual Output: Generates annotated images for spatial tasks
- 🧠 Thinking Mode: Optional chain-of-thought reasoning display
- 🔄 Session Management: Independent conversation sessions
- 📦 8-bit Quantization: Reduces memory from ~6 GB to ~3 GB
- 🌙 Dark Mode UI: Easy on the eyes for extended use
```
┌──────────────────────────────────────────────┐
│               Frontend (React)               │
│  • Modern chat interface with dark mode      │
│  • Image upload and preview                  │
│  • Task selection (Auto/Manual modes)        │
│  • Real-time response streaming              │
└──────────────────────┬───────────────────────┘
                       │ HTTP/REST API
┌──────────────────────┴───────────────────────┐
│                Backend (Flask)               │
│  • Session management                        │
│  • Auto task detection (Groq Llama 3.3)      │
│  • Inference orchestration                   │
│  • Memory optimization & cleanup             │
└──────────────────────┬───────────────────────┘
                       │
┌──────────────────────┴───────────────────────┐
│         RoboBrain2.0 Inference Engine        │
│  • Qwen2.5-VL model (7B/32B)                 │
│  • 8-bit quantization with bitsandbytes      │
│  • Multi-turn conversation memory            │
│  • Visual annotation & plotting              │
└──────────────────────────────────────────────┘
```
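For reference, loading a vision-language checkpoint in 8-bit with transformers and bitsandbytes looks roughly like the sketch below. This is a minimal illustration, not the project's actual loading code (which lives in `RoboBrain2.0_lib/inference.py`); the checkpoint id is an assumption.

```python
# Minimal sketch of 8-bit model loading (illustrative only; checkpoint id
# is an assumption — swap in a path under weights/ for offline use).
from transformers import (
    AutoProcessor,
    BitsAndBytesConfig,
    Qwen2_5_VLForConditionalGeneration,
)

MODEL_ID = "Qwen/Qwen2.5-VL-7B-Instruct"  # assumed checkpoint id

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # ~3 GB vs ~6 GB
    device_map="auto",  # place weights on the available GPU automatically
)
processor = AutoProcessor.from_pretrained(MODEL_ID)
```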
| Component | Minimum | Recommended |
|---|---|---|
| GPU VRAM | 6 GB (RTX 2060) | 8+ GB (RTX 3070+) |
| System RAM | 16 GB | 32 GB |
| Storage | 20 GB free | 50 GB free |
| CUDA | 11.8+ | 12.1+ |
- Python 3.10 or higher
- Node.js 16.x or higher
- npm 8.x or higher
- CUDA 11.8+ (for GPU acceleration)
- Git
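You can sanity-check the prerequisites from a terminal before installing:

```bash
# Confirm the prerequisites are installed and on your PATH
python --version   # expect 3.10+
node --version     # expect 16.x+
npm --version      # expect 8.x+
nvcc --version     # expect CUDA 11.8+ (GPU setups only)
git --version
```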
Clone the repository:

```bash
git clone https://github.com/YasiruDEX/Robobrain-2.0.git
cd Robobrain-2.0
```

Set up the Python environment with conda:

```bash
# Create environment from environment.yml
conda env create -f environment.yml
conda activate robobrain2-env

# Install additional dependencies
pip install -r requirements.txt
```

Or with venv:

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
```

Install the frontend dependencies:

```bash
cd frontend
npm install
cd ..
```

Create a .env file in the project root:

```bash
cp .env.example .env
```

Edit .env and add your API keys:

```bash
# Hugging Face token (optional, for cloud model access)
HF_TOKEN=hf_your_token_here

# Groq API key (required for Auto Mode)
GROQ_API_KEY=your_groq_api_key_here
```

Get API Keys:
- Hugging Face: https://huggingface.co/settings/tokens
- Groq: https://console.groq.com/keys
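To confirm the keys are actually picked up, a quick check with python-dotenv (the same mechanism the Groq connection test under Troubleshooting uses) can help:

```python
# Quick check that the .env keys are readable (run from the project root)
import os
from dotenv import load_dotenv

load_dotenv()  # loads .env from the current working directory
print("HF_TOKEN set:    ", bool(os.getenv("HF_TOKEN")))
print("GROQ_API_KEY set:", bool(os.getenv("GROQ_API_KEY")))
```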
Start the backend:

```bash
# Using the convenience script
./run_backend.sh

# Or manually
conda activate robobrain2-env
python backend.py
```

The backend will start on http://localhost:5001.

Start the frontend in a new terminal:

```bash
cd frontend
npm run dev
```

The web interface will open at http://localhost:5173.
- Create Session: Click "New Chat" to start a conversation
- Upload Image (optional): Click the image icon to upload
- Select Mode:
- Auto Mode: AI automatically detects the task type
- Manual Mode: Choose specific task (General/Grounding/Affordance/Trajectory/Pointing)
- Send Message: Type your question and press Enter
When Auto Mode is enabled, the system uses Groq's Llama 3.3 to automatically classify your prompt:
- "Where is the apple?" β Grounding
- "How can I grab this?" β Affordance
- "Plan a path to reach the cup" β Trajectory
- "Point to all the chairs" β Pointing
- "What color is the table?" β General QA
The detected task is displayed in the response with a β¨ sparkle icon.
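Under the hood this is a single classification call to Groq. A minimal sketch of the idea follows; the prompt wording, label set, and model id here are assumptions, not the exact backend.py implementation.

```python
# Sketch of Auto Mode task detection with the Groq SDK. The prompt wording,
# label set, and model id are assumptions, not the exact backend.py code.
import os

from dotenv import load_dotenv
from groq import Groq

TASKS = ("general", "grounding", "affordance", "trajectory", "pointing")

load_dotenv()
client = Groq(api_key=os.getenv("GROQ_API_KEY"))

def detect_task(prompt: str) -> str:
    """Classify a user prompt into one of the supported task types."""
    completion = client.chat.completions.create(
        model="llama-3.3-70b-versatile",  # assumed Groq model id
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Classify the user's request as exactly one of: "
                        + ", ".join(TASKS) + ". Reply with the label only."},
            {"role": "user", "content": prompt},
        ],
    )
    label = completion.choices[0].message.content.strip().lower()
    return label if label in TASKS else "general"  # safe fallback

print(detect_task("Where is the apple?"))  # expected: grounding
```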
For command-line usage or testing:
```bash
# General question answering
python scripts/general.py --image path/to/image.jpg --prompt "What is in this image?"

# Object grounding
python scripts/grounding.py --image path/to/image.jpg --object "red apple"

# Affordance prediction
python scripts/affordance.py --image path/to/image.jpg --task "pick up the cup"

# Trajectory generation
python scripts/trajectory.py --image path/to/image.jpg --task "move to the door"

# Multi-turn conversation
python scripts/multi_turn.py
```

Base URL: `http://localhost:5001/api`
`POST /session`

Response:

```json
{
  "session_id": "uuid",
  "sessionId": "uuid"
}
```

`POST /chat`

Request body:

```json
{
  "session_id": "uuid",
  "message": "What is in this image?",
  "image": "filename.jpg",
  "task": "auto",
  "enable_thinking": true
}
```

Response:

```json
{
  "answer": "The image shows...",
  "thinking": "[[coordinates]]",
  "output_image": "/result/annotated.jpg",
  "task": "grounding",
  "task_source": "auto"
}
```

`POST /upload`

Content-Type: multipart/form-data

Response:

```json
{
  "path": "/absolute/path/to/image.jpg",
  "filename": "uuid_image.jpg",
  "url": "/uploads/uuid_image.jpg"
}
```

`DELETE /session/<session_id>`

`GET /health`

Response:

```json
{
  "status": "healthy",
  "model_loaded": true,
  "active_sessions": 2
}
```
```
Robobrain-2.0/
├── backend.py               # Flask API server
├── RoboBrain2.0_lib/        # Core inference library
│   ├── inference.py         # Model loading & inference
│   └── multi_turn.py        # Conversation memory
├── scripts/                 # CLI task scripts
│   ├── general.py
│   ├── grounding.py
│   ├── affordance.py
│   ├── trajectory.py
│   ├── multi_turn.py
│   └── utils.py             # Model utilities
├── frontend/                # React web interface
│   ├── src/
│   │   ├── components/      # UI components
│   │   │   ├── ChatContainer.jsx
│   │   │   ├── Message.jsx
│   │   │   └── Sidebar.jsx
│   │   ├── api.js           # Backend API client
│   │   └── App.jsx
│   ├── package.json
│   └── vite.config.js
├── uploads/                 # Uploaded images
├── result/                  # Generated output images
├── conversations/           # Saved conversation JSON
├── weights/                 # Local model weights (optional)
├── requirements.txt         # Python dependencies
├── environment.yml          # Conda environment spec
├── .env                     # API keys (not in git)
├── .env.example             # Template for .env
└── README.md
```
Symptoms: `CUDA out of memory` error during inference
Solutions:
- The system automatically uses 8-bit quantization and reserves ~800 MB of headroom
- If errors persist, reduce the image resolution before uploading
- Close other GPU applications (browsers, games, etc.)
- Restart the backend to clear GPU cache:
```bash
pkill -9 python
python backend.py
```
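If a restart is inconvenient, PyTorch's allocator cache can often be released in-process first. This is a generic PyTorch pattern, not something backend.py is documented to expose:

```python
# Release cached, unused GPU memory without restarting the process
import gc
import torch

gc.collect()              # drop unreferenced tensors first
torch.cuda.empty_cache()  # return cached blocks to the driver
print(f"{torch.cuda.memory_allocated() / 1e9:.2f} GB still allocated")
```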
Symptoms: `Model loaded: False` on startup

Solutions:

- Check that weights exist in the `weights/` directory
- Verify the Hugging Face token in `.env` if using cloud weights
- Ensure sufficient disk space (20 GB+)
- Check the CUDA installation:

  ```bash
  python -c "import torch; print(torch.cuda.is_available())"
  ```
Symptoms: "Failed to fetch" or connection errors in browser
Solutions:
- Verify the backend is running on port 5001:

  ```bash
  curl http://localhost:5001/api/health
  ```

- Check that port 5001 is not blocked by a firewall
- Ensure CORS is enabled (already configured in `backend.py`)
Symptoms: Task detection fails or returns "general" for all prompts
Solutions:
- Verify `GROQ_API_KEY` is set in `.env`
- Check your Groq API quota: https://console.groq.com/
- Test the Groq connection:

  ```bash
  python -c "from groq import Groq; import os; from dotenv import load_dotenv; load_dotenv(); client = Groq(api_key=os.getenv('GROQ_API_KEY')); print('Connected')"
  ```
Symptoms: `Address already in use` when starting the backend
Solutions:
```bash
# Find and kill the process using port 5001
lsof -ti:5001 | xargs kill -9

# Or use the provided script
./kill_backend.sh
```

Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
Developed by: Yasiru Jayasooriya (@YasiruDEX)
Built with:
- Qwen2.5-VL - Vision-Language Model
- Groq - Fast LLM Inference for Auto Mode
- React - Frontend Framework
- Flask - Backend API
- Transformers - Model Library
- bitsandbytes - 8-bit Quantization
Special Thanks:
- BAAI Team for RoboBrain model architecture
- Hugging Face for model hosting and tools
- The open-source AI community
⭐ Star this repo if you find it useful!