A specialized deep learning project that combines fine-tuned YOLO object detection with Gemini AI to enable intelligent UI navigation. The system generates accurate bounding boxes around UI elements using our custom-trained YOLO model, then leverages Gemini's advanced reasoning to select the most appropriate elements for executing user-specified tasks and interactions.
Demo video: `agent_demo.mp4`
This project implements a multi-stage approach to UI element detection and classification:

- **Data Processing and Preparation**
  - Download and format datasets for training
  - Automatic YOLO format conversion for bounding boxes
- **Model Training**
  - YOLO-based object detection
  - Binary classification models
  - Multi-class classification models
  - CLIP and CNN-based approaches
- **Inference**
  - Real-time inference capabilities
  - Support for different input sources
  - Intelligent agent-based inference system with Gemini AI integration
  - Interactive web interface using Streamlit
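These stages come together at inference time: the fine-tuned YOLO model proposes bounding boxes, and Gemini picks the element that matches the task. A minimal sketch of that handoff (assuming the `ultralytics` and `google-generativeai` packages; the weights path, Gemini model name, and prompt wording are illustrative, not the project's actual code):

```python
from PIL import Image
from ultralytics import YOLO
import google.generativeai as genai

# Stage 1: detect UI elements with the fine-tuned YOLO model
model = YOLO("path/to/custom_yolo_weights.pt")  # illustrative path
image = Image.open("screenshot.png")
result = model(image)[0]
boxes = result.boxes.xyxy.tolist()                   # [x1, y1, x2, y2] per element
labels = [result.names[int(c)] for c in result.boxes.cls]

# Stage 2: ask Gemini to pick the element that matches the task
genai.configure(api_key="YOUR_GEMINI_API_KEY")
gemini = genai.GenerativeModel("gemini-1.5-flash")   # model name is an assumption
task = "Click the submit button"
prompt = (
    f"Task: {task}\n"
    f"Detected elements (index, class, bbox): {list(enumerate(zip(labels, boxes)))}\n"
    "Reply with the index of the element to interact with."
)
response = gemini.generate_content([image, prompt])
print(response.text)
```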
Watch our intelligent agent system in action! The demo shows:
- Uploading a UI screenshot
- Natural language task specification
- Real-time element detection
- Intelligent element selection
- Task analysis and recommendations
Try it yourself:
```bash
# Start the Streamlit interface
streamlit run 3_4_real_time_inference_AGENT.py
```

Then:

1. Upload any UI screenshot
2. Enter a task (e.g., "Click the submit button")
3. Watch as the system analyzes and suggests the best action

Example tasks to try:
- "Click the login button"
- "Find the search box and enter 'products'"
- "Select the dropdown menu"
- "Click the close icon in the top right"
The system will intelligently:
- Identify relevant UI elements
- Suggest appropriate actions
- Handle complex multi-step tasks
- Provide clear visual feedback
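For illustration only, a suggestion returned by the agent might be structured along these lines (a hypothetical shape; the actual output of `3_4_real_time_inference_AGENT.py` may differ):

```python
# Hypothetical shape of an agent suggestion (illustrative only)
suggestion = {
    "task": "Click the login button",
    "selected_element": {
        "class": "button",               # YOLO-predicted element type
        "bbox": [412, 188, 512, 224],    # pixel coordinates [x1, y1, x2, y2]
        "confidence": 0.91,              # detection confidence
    },
    "action": "click",                   # or "zoom", "type", ...
    "reasoning": "The element labeled 'Login' best matches the task.",
    "follow_up_steps": [],               # populated for multi-step tasks
}
```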
- Install the required dependencies:

```bash
pip install -r requirements.txt
```

- Set up your Gemini API key:

```bash
# Create a .env file and add your Gemini API key
echo "GEMINI_API_KEY=your_api_key_here" > .env
```
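If you load the key from Python rather than the shell, a typical pattern (a sketch assuming the `python-dotenv` and `google-generativeai` packages) looks like this:

```python
import os

from dotenv import load_dotenv
import google.generativeai as genai

load_dotenv()  # reads GEMINI_API_KEY from the .env file
genai.configure(api_key=os.environ["GEMINI_API_KEY"])
```

Project structure: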
```
.
├── Data Processing
│   ├── 2_0_download_data.py
│   ├── 2_1_create_dataframe.py
│   └── 3_0_download_and_format_dataset.py
│
├── Binary Classification
│   ├── 2_2_prepare_binary_classification.py
│   ├── 2_3_train_binary_classification.py
│   ├── 2_4_train_binary_classification_clip.py
│   └── 2_5_train_binary_classification_cnn.py
│
├── Multiclass Classification
│   ├── 2_6_prepare_multiclass_classification.py
│   ├── 2_7_train_multiclass_classification.py
│   ├── 2_8_train_multiclass_classification_clip.py
│   └── 2_9_train_multiclass_classification_cnn.py
│
├── YOLO Training and Inference
│   ├── 3_1_train_yolo.py
│   ├── 3_2_inference.py
│   ├── 3_3_real_time_inference.py
│   └── 3_4_real_time_inference_AGENT.py
│
└── Notebooks
    ├── 3_0_explore_dataset.ipynb
    └── 3_5_gemini_second_dataset.ipynb
```
```bash
# Download and prepare the dataset
python 3_0_download_and_format_dataset.py

# Create dataframes for training
python 2_1_create_dataframe.py
```

```bash
# Train YOLO model
python 3_1_train_yolo.py --model_size n --epochs 100 --batch_size 16
# Train binary classification
python 2_3_train_binary_classification.py
# Train multiclass classification
python 2_7_train_multiclass_classification.py
```
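Internally, `3_1_train_yolo.py` presumably maps these flags onto an `ultralytics` training call along these lines (a minimal sketch; the dataset YAML name and image size are assumptions):

```python
from ultralytics import YOLO

# "--model_size n" selects the nano checkpoint; s, m, l, and x scale up
model = YOLO("yolov8n.pt")

# Dataset YAML path and image size are illustrative
model.train(data="ui_elements.yaml", epochs=100, batch=16, imgsz=640)
```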
```bash
# Regular inference
python 3_2_inference.py --model_path path/to/model --source path/to/images
# Real-time inference
python 3_3_real_time_inference.py --model_path path/to/model
# Interactive Agent-based inference with Streamlit UI
streamlit run 3_4_real_time_inference_AGENT.py
```

The project includes an advanced agent-based inference system (`3_4_real_time_inference_AGENT.py`) that combines YOLO object detection with Google's Gemini AI for intelligent UI interaction; a simplified sketch of its detection-and-reasoning loop follows the feature list below:
- **Interactive Web Interface**
  - Built with Streamlit for easy interaction
  - Real-time visualization of detection results
  - Configurable visualization settings
  - Task history tracking
- **Intelligent Task Analysis**
  - Natural language task processing
  - Context-aware element selection
  - Multi-step task planning
  - Confidence-based decision making
- **Advanced Visualization**
  - Customizable bounding box display
  - Adjustable confidence thresholds
  - Numeric or class-based labels
  - Selected element highlighting
- **Action Types**
  - Direct element clicks
  - Zoom recommendations for small/clustered elements
  - Text input suggestions
  - Multi-step interaction planning
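A simplified version of the agent loop might look like this (a sketch assuming `ultralytics`, `google-generativeai`, and `opencv-python`; the prompt wording, response parsing, and file names are illustrative):

```python
import re

import cv2
import google.generativeai as genai
from ultralytics import YOLO

def suggest_element(image_path: str, task: str, weights: str = "best.pt"):
    """Detect UI elements, ask Gemini for the best match, and highlight it."""
    # Stage 1: YOLO detection
    result = YOLO(weights)(image_path)[0]
    elements = [
        (i, result.names[int(cls)], [round(v) for v in box])
        for i, (cls, box) in enumerate(
            zip(result.boxes.cls, result.boxes.xyxy.tolist())
        )
    ]

    # Stage 2: Gemini selection (assumes genai.configure(api_key=...) ran earlier)
    gemini = genai.GenerativeModel("gemini-1.5-flash")  # model name is an assumption
    prompt = (
        f"Task: {task}\n"
        f"Elements (index, class, [x1, y1, x2, y2]): {elements}\n"
        "Answer with the single best index."
    )
    reply = gemini.generate_content(prompt).text
    index = int(re.search(r"\d+", reply).group())  # naive parsing, for illustration

    # Stage 3: visual feedback on the selected element
    img = cv2.imread(image_path)
    x1, y1, x2, y2 = elements[index][2]
    cv2.rectangle(img, (x1, y1), (x2, y2), (0, 255, 0), 3)
    cv2.imwrite("selected.png", img)
    return elements[index]
```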
The interface exposes the following configuration options:

- Model path and dataset YAML configuration
- Visualization settings:
  - Label display options
  - Box and label opacity
  - Font size customization
  - Confidence thresholds
- Task history management
- Real-time analysis updates
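Hooked up in Streamlit, these sidebar options might look roughly like this (a sketch; the widget labels and default values are assumptions, not the script's actual code):

```python
import streamlit as st

st.sidebar.header("Settings")
model_path = st.sidebar.text_input("Model path", "best.pt")
data_yaml = st.sidebar.text_input("Dataset YAML", "ui_elements.yaml")

label_style = st.sidebar.radio("Label display", ["Numeric", "Class names"])
box_opacity = st.sidebar.slider("Box opacity", 0.0, 1.0, 0.8)
font_size = st.sidebar.slider("Label font size", 8, 32, 14)
conf_threshold = st.sidebar.slider("Confidence threshold", 0.0, 1.0, 0.25)
```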
- **YOLO Detection**
  - YOLOv8 architecture
  - Multiple model sizes (nano to xlarge)
  - Custom-trained on UI element dataset
- **Classification Models**
  - Binary classification for element detection
  - Multi-class classification for element type identification
  - CLIP-based models for zero-shot learning
  - Custom CNN architectures
- **Gemini AI Integration**
  - Task understanding and decomposition
  - Context-aware element selection
  - Natural language interaction
  - Multi-modal analysis (image and text)
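For the CLIP-based classifiers listed above, a cropped element can be labeled zero-shot by comparing image and text embeddings (a sketch using the Hugging Face `transformers` CLIP API; the checkpoint and class prompts are assumptions, not the project's training setup):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

crop = Image.open("element_crop.png")  # a cropped UI element, path illustrative
labels = ["a button", "a text field", "a checkbox", "a dropdown menu"]

inputs = processor(text=labels, images=crop, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # image-to-text similarity scores
print(labels[logits.softmax(dim=1).argmax().item()])
```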
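For multi-modal analysis, the screenshot and task go to Gemini together; for complex tasks, the model can first be asked to decompose the task into sub-steps (a sketch with illustrative prompt wording):

```python
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_GEMINI_API_KEY")
gemini = genai.GenerativeModel("gemini-1.5-flash")  # model name is an assumption

screenshot = Image.open("screenshot.png")
task = "Find the search box and enter 'products'"

# Ask Gemini to break the task into ordered sub-steps before element selection
response = gemini.generate_content(
    [screenshot, f"Break the task '{task}' into ordered UI interaction steps."]
)
print(response.text)
```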
Contributions are welcome:

- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add some amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request