A cutting-edge AI-powered research agent that combines web browsing, screenshot analysis, and NVIDIA's advanced AI models to provide comprehensive multimodal research insights.
- NVIDIA Nemotron Integration: Leverages NVIDIA's latest Nemotron LLMs for advanced reasoning
- Vision Analysis: Screenshot analysis using NVIDIA Vision models
- Multimodal Understanding: Combines text and visual content for comprehensive insights
- Intelligent Browsing: Automated web navigation with Playwright
- Content Extraction: Smart text extraction and cleaning
- Screenshot Capture: Automatic visual content capture
- Multi-URL Processing: Parallel processing of multiple sources
- Memory System: Persistent storage of research sessions
- Search History: Query tracking and suggestions
- User Preferences: Customizable settings and preferences
- Beautiful UI: Responsive, modern web interface
- Real-time Updates: Live progress tracking
- Interactive Results: Rich formatting and source attribution
```mermaid
graph TB
    A[Frontend Interface] --> B[FastAPI Backend]
    B --> C[Agent Core]
    C --> D[Browser Tools]
    C --> E[Vision Analysis]
    C --> F[NVIDIA AI Models]
    D --> G[Playwright Browser]
    E --> H[Vision Models]
    F --> I[Nemotron LLM]
    C --> J[Memory System]
    J --> K[SQLite Database]
```
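The diagram maps onto a simple pipeline: the FastAPI backend hands a query and URLs to the agent core, which drives the browser tools, sends screenshots to the vision models, asks the Nemotron LLM to synthesize a result, and persists the session to SQLite. The sketch below only illustrates that flow; the function and interface names are assumptions, not the project's actual internal API (the real orchestration lives in `app/agent/core.py`).

```python
# Illustrative only: interface names (browser, vision, llm, memory) are assumptions.
from dataclasses import dataclass, field
from typing import List


@dataclass
class ResearchResult:
    result: str
    sources_analyzed: List[str] = field(default_factory=list)
    vision_insights: List[str] = field(default_factory=list)


async def run_research(query: str, urls: List[str], browser, vision, llm, memory) -> ResearchResult:
    """Hypothetical pipeline mirroring the architecture diagram."""
    pages = [await browser.fetch(url) for url in urls]                     # Browser Tools -> Playwright
    insights = [await vision.analyze(page.screenshot) for page in pages]   # Vision Analysis
    summary = await llm.generate(                                          # Nemotron LLM
        prompt=query + "\n\n" + "\n\n".join(page.text for page in pages)
    )
    await memory.save_session(query=query, urls=urls, result=summary)      # Memory System -> SQLite
    return ResearchResult(result=summary, sources_analyzed=urls, vision_insights=insights)
```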
- Python 3.8+
- Node.js (for Playwright)
- NVIDIA API Access (for full functionality)
- Clone the repository:
```bash
git clone <repository-url>
cd multimodal-browser-agent
```
- Install Python dependencies:
```bash
pip install -r requirements.txt
```
- Install Playwright browsers:
```bash
playwright install
```
- Configure environment variables:
```bash
cp .env.example .env
# Edit .env with your NVIDIA API credentials
```
- Start the application:
```bash
python -m uvicorn app.main:app --reload --host 0.0.0.0 --port 8000
```
- Access the interface:
- Web UI: http://localhost:8000
- API Documentation: http://localhost:8000/docs
- Health Check: http://localhost:8000/health
Create a .env file with the following configuration:
```bash
# NVIDIA API Configuration
NVIDIA_API_KEY=your_nvidia_nim_api_key_here
NVIDIA_VISION_ENDPOINT=https://your-vision-api-endpoint.com/api/v1/vision
NIM_ENDPOINT=https://your-nim-endpoint.com/api/v1/generate

# Database Configuration
DATABASE_URL=sqlite:///./agent_memory.db

# FastAPI Configuration
API_HOST=0.0.0.0
API_PORT=8000
DEBUG=True

# Browser Configuration
HEADLESS_BROWSER=True
BROWSER_TIMEOUT=30000
```
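For reference, here is a minimal sketch of reading these variables from Python with python-dotenv; whether the application actually loads its settings this way is an assumption.

```python
# Sketch only: assumes python-dotenv is available; the project's real settings
# loading (e.g., in app/main.py or a config module) may differ.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

NVIDIA_API_KEY = os.getenv("NVIDIA_API_KEY", "")
NIM_ENDPOINT = os.getenv("NIM_ENDPOINT", "")
DATABASE_URL = os.getenv("DATABASE_URL", "sqlite:///./agent_memory.db")
HEADLESS_BROWSER = os.getenv("HEADLESS_BROWSER", "True") == "True"   # string -> bool
BROWSER_TIMEOUT = int(os.getenv("BROWSER_TIMEOUT", "30000"))         # milliseconds
```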
- Get API Access:
  - Visit NVIDIA Developer
  - Sign up for NIM (NVIDIA Inference Microservices) access
  - Obtain your API key
- Configure Endpoints:
  - Update `NVIDIA_API_KEY` with your API key
  - Set `NIM_ENDPOINT` to your NIM service URL
  - Configure `NVIDIA_VISION_ENDPOINT` for vision capabilities
POST /agent/research
Perform multimodal web research on provided URLs.
```json
{
  "query": "What are the latest AI trends?",
  "urls": [
    "https://example.com/ai-news",
    "https://example.com/tech-trends"
  ],
  "max_tokens": 512,
  "include_screenshots": true
}
```
Response:
```json
{
  "result": "Comprehensive analysis results...",
  "sources_analyzed": ["https://example.com/ai-news"],
  "vision_insights": ["Screenshot analysis results..."],
  "processing_time": 15.2,
  "timestamp": "2024-01-15T10:30:00Z"
}
```
POST /agent/vision
Analyze uploaded images using NVIDIA vision models.
- Upload an image file as the `file` form field
- Optional `query` form field for targeted analysis
GET /health
Check system health and service status.
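A quick way to exercise the health check from Python; the exact fields in the response body are not documented here, so the print simply shows whatever the service returns.

```python
import requests

# Ping the health endpoint; a 200 status means the API is up.
response = requests.get("http://localhost:8000/health", timeout=10)
response.raise_for_status()
print(response.json())  # service status payload (fields depend on the implementation)
```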
```python
import requests

response = requests.post("http://localhost:8000/agent/research", json={
    "query": "Compare the latest GPU architectures from NVIDIA",
    "urls": [
        "https://www.nvidia.com/en-us/geforce/graphics-cards/",
        "https://www.nvidia.com/en-us/data-center/a100/"
    ],
    "max_tokens": 1024,
    "include_screenshots": True
})

results = response.json()
print(results["result"])
```

```python
import requests

with open("screenshot.png", "rb") as f:
    response = requests.post(
        "http://localhost:8000/agent/vision",
        files={"file": f},
        data={"query": "What UI elements are visible in this screenshot?"}
    )

analysis = response.json()
print(analysis["analysis"])
```

```dockerfile
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
RUN playwright install --with-deps
COPY . .
EXPOSE 8000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
```
- NVIDIA DGX Cloud: Optimal for GPU-accelerated inference
- AWS/GCP/Azure: Standard cloud deployment
- Kubernetes: Scalable container orchestration
- Docker Compose: Multi-service local deployment
```
multimodal-browser-agent/
├── app/
│   ├── main.py              # FastAPI application
│   ├── agent/
│   │   ├── core.py          # Agent orchestration
│   │   ├── browser_tools.py # Playwright automation
│   │   └── vision.py        # Vision model integration
│   ├── models/
│   │   └── schemas.py       # API schemas
│   └── utils/
│       └── memory.py        # Memory management
├── frontend/
│   └── index.html           # Web interface
├── requirements.txt         # Python dependencies
├── .env.example             # Environment template
└── README.md                # Documentation
```
- Create virtual environment:
```bash
python -m venv venv
source venv/bin/activate  # Linux/Mac
# or
venv\Scripts\activate     # Windows
```
- Install dependencies:
```bash
pip install -r requirements.txt
pip install pytest pytest-asyncio  # For testing
```
- Run tests:
```bash
pytest
```
- Development server:
```bash
uvicorn app.main:app --reload --log-level debug
```
- New Agent Tools: Extend the modules in `app/agent/`
- API Endpoints: Add routes to `app/main.py` (see the sketch below)
- Frontend Features: Modify `frontend/index.html`
- Data Models: Update `app/models/schemas.py`
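As referenced above, here is a hedged sketch of what adding a new route to `app/main.py` might look like. The `app` object name, the example path, and the response model are assumptions for illustration, not the project's existing code.

```python
# Sketch only: assumes app/main.py exposes a FastAPI instance named `app`
# and that response schemas live in app/models/schemas.py.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()  # in the real project, import the existing instance instead


class EchoResponse(BaseModel):
    query: str
    note: str


@app.get("/agent/echo", response_model=EchoResponse)
async def echo(query: str) -> EchoResponse:
    """Minimal example route; replace the body with real agent logic."""
    return EchoResponse(query=query, note="new endpoint wired into the API")
```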
```bash
# Run all tests
pytest

# Run specific test file
pytest tests/test_agent.py

# Run with coverage
pytest --cov=app tests/
```

```bash
# Test API endpoints
pytest tests/test_api.py

# Test browser automation
pytest tests/test_browser.py
```
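For reference, a minimal sketch of what a test in `tests/test_api.py` might look like, using FastAPI's `TestClient`. The imported `app` object and the asserted status are assumptions about the project layout.

```python
# tests/test_api.py (sketch): assumes app/main.py exposes a FastAPI instance `app`.
from fastapi.testclient import TestClient

from app.main import app

client = TestClient(app)


def test_health_returns_ok():
    # The health endpoint should respond even without NVIDIA credentials configured.
    response = client.get("/health")
    assert response.status_code == 200
```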
```bash
# Using Locust
pip install locust
locust -f tests/load_test.py --host=http://localhost:8000
```
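A minimal `tests/load_test.py` sketch for the Locust command above; the payload mirrors the research example shown earlier, and the wait times are arbitrary assumptions rather than the project's actual load profile.

```python
# tests/load_test.py (sketch): run with
#   locust -f tests/load_test.py --host=http://localhost:8000
from locust import HttpUser, between, task


class ResearchUser(HttpUser):
    wait_time = between(1, 5)  # seconds between simulated requests

    @task
    def health(self):
        self.client.get("/health")

    @task
    def research(self):
        self.client.post("/agent/research", json={
            "query": "What are the latest AI trends?",
            "urls": ["https://example.com/ai-news"],
            "max_tokens": 256,
            "include_screenshots": False,
        })
```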
- Health Check: `/health` endpoint
- Agent Status: `/agent/status` endpoint
- Database Stats: Memory usage and session statistics
The application uses structured logging:
```python
import logging

logger = logging.getLogger(__name__)

# Logs are automatically formatted with timestamps and levels
logger.info("Research completed successfully")
logger.error("NVIDIA API error", extra={"status_code": 500})
```
- Processing time tracking
- Memory usage monitoring
- API response times
- Success/failure rates
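The metrics above could be captured in several ways; one possibility, shown purely as a sketch and not necessarily how this project measures performance, is a small FastAPI middleware that times each request.

```python
# Sketch only: records wall-clock processing time per request in a response header.
import time

from fastapi import FastAPI, Request

app = FastAPI()  # in the real project, attach to the existing instance


@app.middleware("http")
async def add_process_time_header(request: Request, call_next):
    start = time.perf_counter()
    response = await call_next(request)
    elapsed = time.perf_counter() - start
    response.headers["X-Process-Time"] = f"{elapsed:.3f}"  # seconds
    return response
```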
- Rate Limiting: Implement request rate limiting
- Authentication: Add API key authentication for production (see the sketch after this list)
- Input Validation: Comprehensive input sanitization
- CORS Configuration: Proper cross-origin settings
- Secure Storage: Encrypted database connections
- Data Retention: Configurable session cleanup
- User Privacy: No sensitive data logging
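As noted in the Authentication item above, here is a hedged sketch of simple API-key authentication using a FastAPI dependency. The header name, the `SERVICE_API_KEY` variable, and the protected route are assumptions for illustration, not the project's current behavior.

```python
# Sketch only: protects routes with an X-API-Key header checked against an env var.
import os

from fastapi import Depends, FastAPI, HTTPException
from fastapi.security import APIKeyHeader

api_key_header = APIKeyHeader(name="X-API-Key", auto_error=False)


async def require_api_key(api_key: str = Depends(api_key_header)) -> None:
    expected = os.getenv("SERVICE_API_KEY")  # hypothetical variable, not in .env.example
    if not expected or api_key != expected:
        raise HTTPException(status_code=401, detail="Invalid or missing API key")


app = FastAPI()  # in the real project, attach the dependency to existing routes


@app.get("/agent/status", dependencies=[Depends(require_api_key)])
async def agent_status() -> dict:
    return {"status": "ok"}
```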
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests
- Submit a pull request
- Follow PEP 8 for Python code
- Use type hints
- Add docstrings to functions
- Keep functions focused and modular
This project is licensed under the MIT License - see the LICENSE file for details.
- NVIDIA: For providing cutting-edge AI models and infrastructure
- FastAPI: For the excellent web framework
- Playwright: For robust browser automation
- LangChain: For AI application development tools
- Documentation: Check the `/docs` endpoint
- Issues: Report bugs via GitHub issues
- Discussions: Join community discussions
- Email: Contact the development team
Built with ❤️ using the NVIDIA AI Stack
Ready to revolutionize web research with multimodal AI? Get started today!