A computer vision control assistant that uses multiple AI models (Google's Gemini Pro Vision, Groq's LLaMA, and Ollama) to understand and control your computer through natural language commands and visual context.
*Demo video showing the AI assistant controlling the browser using natural language commands and visual feedback*
*AI navigating to YouTube using keyboard shortcuts and visual feedback*
*Active area highlighting with blue border showing the current focus area*
- 🎯 Natural language command processing
- ⌨️ Intelligent keyboard shortcut usage
- 🔷 Visual feedback with blue border highlighting
- 🖱️ Precise mouse control when needed
- 📊 Real-time status updates
- 🔄 Multiple AI model support (Gemini, Groq, Ollama)
- 🧠 Enhanced task planning and analysis
- 📝 Detailed result interpretation
- Command Input: Type natural language commands like "Open Chrome and go to YouTube"
- Task Planning:
  - Breaks down complex tasks into steps
  - Plans visual analysis when needed
  - Considers different scenarios
- Model Selection: Choose between Gemini, Groq, or Ollama models
- Visual Analysis:
  - AI analyzes the screen to understand the current state
  - Interprets text and UI elements
  - Provides detailed insights from search results
- Smart Execution (sketched after this list):
  - First attempts to use keyboard shortcuts
  - Falls back to mouse control if needed
- Visual Feedback:
  - Blue border highlights active areas
  - Status updates show progress
  - Real-time command execution feedback
- Result Analysis:
  - Interprets search results
  - Extracts key information
  - Provides concise summaries
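To illustrate the keyboard-first execution strategy, here is a minimal sketch using PyAutoGUI (the library the interpreter is built on). The `perform_step` helper and its step format are hypothetical illustrations, not the project's actual executor; only the PyAutoGUI calls themselves are real:

```python
# Minimal sketch of the keyboard-first execution strategy.
# `perform_step` and the step dict format are hypothetical;
# the PyAutoGUI calls (hotkey, write, click) are real.
import pyautogui

def perform_step(step: dict) -> None:
    """Try a keyboard shortcut first; fall back to typing or a mouse click."""
    if step.get("shortcut"):
        # e.g. {"shortcut": ["command", "l"]} to focus the browser address bar
        pyautogui.hotkey(*step["shortcut"])
    elif step.get("text"):
        # e.g. {"text": "youtube.com\n"} to type a URL and press Enter
        pyautogui.write(step["text"], interval=0.02)
    elif step.get("coords"):
        # Last resort: click a screen coordinate suggested by visual analysis
        x, y = step["coords"]
        pyautogui.click(x, y)

# Example: open a new browser tab, then navigate to YouTube
perform_step({"shortcut": ["command", "t"]})
perform_step({"text": "youtube.com\n"})
```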
- 🎯 Intelligent Computer Control: Control your computer using natural language commands and visual understanding
- 🖼️ Visual Context Understanding: Uses computer vision and screenshots to understand the current state of your computer
- 🤖 Multiple AI Models:
  - Google Gemini Pro Vision
  - Groq LLaMA Vision
  - Ollama Local Models
- 🧠 Advanced Task Planning:
  - Complex task breakdown
  - Visual analysis integration
  - Scenario consideration
- 📝 Result Analysis:
  - Search result interpretation
  - Data extraction and summarization
  - Contextual understanding
- ⌨️ Smart Input Prioritization:
  - Prioritizes keyboard shortcuts for efficiency
  - Falls back to mouse control when necessary
- 🔷 Visual Feedback (see the sketch after this list):
  - Blue border highlighting of active areas
  - Real-time status updates
  - Clear visual feedback for all actions
- 🔒 Security:
  - Secure API key management
  - Local screenshot processing
  - Permission-based access control
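As a taste of the blue-border feedback, here is a minimal sketch that draws a border on a screenshot using PyAutoGUI and Pillow. The `highlight_region` helper is hypothetical; the app's real overlay logic lives in `app/utils/screen.py`:

```python
# Sketch of blue-border highlighting on a screenshot, not the app's
# real overlay code. highlight_region is a hypothetical helper; it
# assumes Pillow (PIL) is installed alongside PyAutoGUI.
import pyautogui
from PIL import ImageDraw

def highlight_region(left: int, top: int, width: int, height: int):
    """Capture the screen and draw a blue border around the active area."""
    shot = pyautogui.screenshot()  # returns a PIL Image
    draw = ImageDraw.Draw(shot)
    draw.rectangle(
        [left, top, left + width, top + height],
        outline=(0, 120, 255),  # blue border
        width=4,
    )
    return shot

# Example: highlight an 800x600 area at the top-left of the screen
highlight_region(0, 0, 800, 600).save("highlighted.png")
```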
- Python 3.8 or higher
- macOS (the only platform currently supported)
- API keys:
  - Google Gemini API key (for the Gemini model)
  - Groq API key (for the Groq model)
  - Ollama running locally (for the Ollama model; a quick availability check is sketched below)
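If you plan to use the Ollama model, you can confirm the local server is reachable before launching the app. This is a minimal sketch assuming Ollama's default endpoint (`http://localhost:11434`) and the `requests` library; the helper is illustrative and not part of the app:

```python
# Quick check that a local Ollama server is reachable before selecting
# the Ollama model. Assumes Ollama's default endpoint; this helper is
# illustrative only.
import requests

def ollama_available(base_url: str = "http://localhost:11434") -> bool:
    try:
        # /api/tags lists locally installed models
        resp = requests.get(f"{base_url}/api/tags", timeout=2)
        return resp.status_code == 200
    except requests.RequestException:
        return False

if __name__ == "__main__":
    print("Ollama running:", ollama_available())
```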
- Clone the repository:

```bash
git clone https://github.com/atesbey-design/computer-vision.git
cd computer-vision
```

- Create and activate a virtual environment:

```bash
python3 -m venv venv
source venv/bin/activate
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Set up your API keys:
  - Get a Gemini API key from Google AI Studio
  - Get a Groq API key from Groq Console
  - The app will prompt you to enter the keys in settings
- Start the application:

```bash
python -m app.app
```
- Grant necessary permissions:
  - The app will request accessibility permissions
  - This is required for keyboard and mouse control
- Configure API keys:
  - Click the ⚙️ Settings button
  - Enter your Gemini and Groq API keys
  - Click Save
- Select your preferred AI model:
  - Choose between Gemini, Groq, or Ollama
  - Each model has its own strengths
- Use natural language commands like the following (how a command and screenshot reach a model is sketched after this list):
  - "Open Chrome and go to google.com"
  - "Click the search button"
  - "Type 'hello world' in the text field"
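To make the flow concrete, here is a minimal sketch of pairing a command with a screenshot and sending both to Gemini via the `google-generativeai` package. The model name follows the README ("Gemini Pro Vision"); the app's actual integration lives in `app/models/gemini.py`, and the `GEMINI_API_KEY` environment variable is an assumption of this standalone example (the app itself stores keys through its Settings dialog):

```python
# Standalone sketch of sending a command plus screenshot to Gemini.
# Assumes a GEMINI_API_KEY environment variable; the app's real
# integration is in app/models/gemini.py.
import os

import google.generativeai as genai
import pyautogui

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-pro-vision")

screenshot = pyautogui.screenshot()  # PIL Image of the current screen
command = "Open Chrome and go to google.com"

# generate_content accepts a mixed list of text and PIL images
response = model.generate_content(
    [f"User command: {command}\nDescribe the next step to take.", screenshot]
)
print(response.text)
```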
```
computer-vision/
├── app/
│   ├── models/
│   │   ├── gemini.py      # Gemini model integration
│   │   ├── groq.py        # Groq model integration
│   │   └── ollama.py      # Ollama model integration
│   ├── utils/
│   │   ├── screen.py      # Screen capture and highlighting
│   │   └── settings.py    # Settings management
│   ├── app.py             # Main application
│   ├── core.py            # Core functionality
│   ├── interpreter.py     # Command interpretation
│   └── ui.py              # User interface
├── requirements.txt       # Dependencies
├── setup.py               # Package configuration
└── README.md              # Documentation
```
- Multiple AI Models (a possible shared interface is sketched after this list):
  - Gemini: Google's vision model
  - Groq: High-performance LLaMA model
  - Ollama: Local model support
- Interpreter: Executes commands using PyAutoGUI
- Screen Manager: Handles screen capture and visual feedback
- Settings Manager: Manages configuration and API keys
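The per-model files under `app/models/` suggest a shared interface behind the backends. Below is a hypothetical sketch of what that abstraction could look like; the `VisionModel` class and `analyze()` signature are assumptions, not the repository's actual API:

```python
# Hypothetical sketch of a shared interface for the three model backends.
# VisionModel and analyze() are assumptions about the design, not the
# repository's actual API.
from abc import ABC, abstractmethod

class VisionModel(ABC):
    """Common contract for the Gemini, Groq, and Ollama backends."""

    @abstractmethod
    def analyze(self, screenshot_path: str, prompt: str) -> str:
        """Send a screenshot plus instruction to the model, return its reply."""

class OllamaModel(VisionModel):
    def __init__(self, endpoint: str = "http://localhost:11434"):
        self.endpoint = endpoint

    def analyze(self, screenshot_path: str, prompt: str) -> str:
        # A real implementation would POST the image and prompt to the
        # local Ollama server and parse the JSON response.
        raise NotImplementedError("illustrative sketch only")
```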
- Cross-Platform Support
  - Windows support
  - Linux support
- Enhanced Visual Understanding
  - Element recognition improvements
  - OCR integration
  - Better coordinate precision
- Advanced Features
  - Custom shortcut definitions
  - Macro recording and playback
  - Task automation sequences
  - Voice command support
- UI Improvements
  - Dark mode support
  - Customizable highlight colors
  - Better status visualization
  - Command history view
- Performance Optimizations
  - Faster screen capture
  - Reduced API calls
  - Better caching
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
This project is licensed under the MIT License - see the LICENSE file for details.
- Google's Gemini Pro Vision model
- Groq's LLaMA Vision model
- Ollama for local model support
- PyAutoGUI for computer control
- The open-source community
If you encounter any issues or have questions:
- Check the Issues page
- Create a new issue if needed
- Contact the maintainers directly
Made with ❤️ by Ates


