A computer vision control assistant that uses multiple AI models (Google's Gemini Pro Vision, Groq's LLaMA, and Ollama) to understand and control your computer through natural language commands and visual context.
*Demo video showing the AI assistant controlling the browser using natural language commands and visual feedback*
*AI navigating to YouTube using keyboard shortcuts and visual feedback*
*Active area highlighting with blue border showing the current focus area*
- 🎯 Natural language command processing
- ⌨️ Intelligent keyboard shortcut usage
- 🔷 Visual feedback with blue border highlighting
- 🖱️ Precise mouse control when needed
- 📊 Real-time status updates
- 🔄 Multiple AI model support (Gemini, Groq, Ollama)
- 🧠 Enhanced task planning and analysis
- 📝 Detailed result interpretation
- Command Input: Type natural language commands like "Open Chrome and go to YouTube"
- Task Planning:
  - Breaks down complex tasks into steps
  - Plans visual analysis when needed
  - Considers different scenarios
- Model Selection: Choose between Gemini, Groq, or Ollama models
- Visual Analysis:
  - AI analyzes the screen to understand the current state
  - Interprets text and UI elements
  - Provides detailed insights from search results
- Smart Execution (sketched after this list):
  - First attempts to use keyboard shortcuts
  - Falls back to mouse control if needed
- Visual Feedback:
  - Blue border highlights active areas
  - Status updates show progress
  - Real-time command execution feedback
- Result Analysis:
  - Interprets search results
  - Extracts key information
  - Provides concise summaries
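To illustrate the keyboard-first execution strategy, here is a minimal sketch using PyAutoGUI (the library the interpreter is built on). The `perform_step` helper and its step format are hypothetical illustrations, not the project's actual executor; only the PyAutoGUI calls themselves are real:

```python
# Minimal sketch of the keyboard-first execution strategy.
# `perform_step` and the step dict format are hypothetical;
# the PyAutoGUI calls (hotkey, write, click) are real.
import pyautogui

def perform_step(step: dict) -> None:
    """Try a keyboard shortcut first; fall back to typing or a mouse click."""
    if step.get("shortcut"):
        # e.g. {"shortcut": ["command", "l"]} to focus the browser address bar
        pyautogui.hotkey(*step["shortcut"])
    elif step.get("text"):
        # e.g. {"text": "youtube.com\n"} to type a URL and press Enter
        pyautogui.write(step["text"], interval=0.02)
    elif step.get("coords"):
        # Last resort: click a screen coordinate suggested by visual analysis
        x, y = step["coords"]
        pyautogui.click(x, y)

# Example: open a new browser tab, then navigate to YouTube
perform_step({"shortcut": ["command", "t"]})
perform_step({"text": "youtube.com\n"})
```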
- 🎯 Intelligent Computer Control: Control your computer using natural language commands and visual understanding
- 🖼️ Visual Context Understanding: Uses computer vision and screenshots to understand the current state of your computer
- 🤖 Multiple AI Models:
  - Google Gemini Pro Vision
  - Groq LLaMA Vision
  - Ollama Local Models
- 🧠 Advanced Task Planning:
  - Complex task breakdown
  - Visual analysis integration
  - Scenario consideration
- 📝 Result Analysis:
  - Search result interpretation
  - Data extraction and summarization
  - Contextual understanding
- ⌨️ Smart Input Prioritization:
  - Prioritizes keyboard shortcuts for efficiency
  - Falls back to mouse control when necessary
- 🔷 Visual Feedback (see the sketch after this list):
  - Blue border highlighting of active areas
  - Real-time status updates
  - Clear visual feedback for all actions
- 🔒 Security:
  - Secure API key management
  - Local screenshot processing
  - Permission-based access control
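As a taste of the blue-border feedback, here is a minimal sketch that draws a border on a screenshot using PyAutoGUI and Pillow. The `highlight_region` helper is hypothetical; the app's real overlay logic lives in `app/utils/screen.py`:

```python
# Sketch of blue-border highlighting on a screenshot, not the app's
# real overlay code. highlight_region is a hypothetical helper; it
# assumes Pillow (PIL) is installed alongside PyAutoGUI.
import pyautogui
from PIL import ImageDraw

def highlight_region(left: int, top: int, width: int, height: int):
    """Capture the screen and draw a blue border around the active area."""
    shot = pyautogui.screenshot()  # returns a PIL Image
    draw = ImageDraw.Draw(shot)
    draw.rectangle(
        [left, top, left + width, top + height],
        outline=(0, 120, 255),  # blue border
        width=4,
    )
    return shot

# Example: highlight an 800x600 area at the top-left of the screen
highlight_region(0, 0, 800, 600).save("highlighted.png")
```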
- Python 3.8 or higher
- macOS (the only platform currently supported)
- API keys:
  - Google Gemini API key (for the Gemini model)
  - Groq API key (for the Groq model)
  - Ollama running locally (for the Ollama model; a quick availability check is sketched below)
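If you plan to use the Ollama model, you can confirm the local server is reachable before launching the app. This is a minimal sketch assuming Ollama's default endpoint (`http://localhost:11434`) and the `requests` library; the helper is illustrative and not part of the app:

```python
# Quick check that a local Ollama server is reachable before selecting
# the Ollama model. Assumes Ollama's default endpoint; this helper is
# illustrative only.
import requests

def ollama_available(base_url: str = "http://localhost:11434") -> bool:
    try:
        # /api/tags lists locally installed models
        resp = requests.get(f"{base_url}/api/tags", timeout=2)
        return resp.status_code == 200
    except requests.RequestException:
        return False

if __name__ == "__main__":
    print("Ollama running:", ollama_available())
```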
- Clone the repository:

```bash
git clone https://github.com/atesbey-design/computer-vision.git
cd computer-vision
```

- Create and activate a virtual environment:

```bash
python3 -m venv venv
source venv/bin/activate
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Set up your API keys:
  - Get a Gemini API key from Google AI Studio
  - Get a Groq API key from Groq Console
  - The app will prompt you to enter the keys in settings
- Start the application:

```bash
python -m app.app
```
- Grant necessary permissions:
  - The app will request accessibility permissions
  - This is required for keyboard and mouse control
- Configure API keys:
  - Click the ⚙️ Settings button
  - Enter your Gemini and Groq API keys
  - Click Save
- Select your preferred AI model:
  - Choose between Gemini, Groq, or Ollama
  - Each model has its own strengths
- Use natural language commands like the following (how a command and screenshot reach a model is sketched after this list):
  - "Open Chrome and go to google.com"
  - "Click the search button"
  - "Type 'hello world' in the text field"
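To make the flow concrete, here is a minimal sketch of pairing a command with a screenshot and sending both to Gemini via the `google-generativeai` package. The model name follows the README ("Gemini Pro Vision"); the app's actual integration lives in `app/models/gemini.py`, and the `GEMINI_API_KEY` environment variable is an assumption of this standalone example (the app itself stores keys through its Settings dialog):

```python
# Standalone sketch of sending a command plus screenshot to Gemini.
# Assumes a GEMINI_API_KEY environment variable; the app's real
# integration is in app/models/gemini.py.
import os

import google.generativeai as genai
import pyautogui

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-pro-vision")

screenshot = pyautogui.screenshot()  # PIL Image of the current screen
command = "Open Chrome and go to google.com"

# generate_content accepts a mixed list of text and PIL images
response = model.generate_content(
    [f"User command: {command}\nDescribe the next step to take.", screenshot]
)
print(response.text)
```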
```
computer-vision/
├── app/
│   ├── models/
│   │   ├── gemini.py      # Gemini model integration
│   │   ├── groq.py        # Groq model integration
│   │   └── ollama.py      # Ollama model integration
│   ├── utils/
│   │   ├── screen.py      # Screen capture and highlighting
│   │   └── settings.py    # Settings management
│   ├── app.py             # Main application
│   ├── core.py            # Core functionality
│   ├── interpreter.py     # Command interpretation
│   └── ui.py              # User interface
├── requirements.txt       # Dependencies
├── setup.py               # Package configuration
└── README.md              # Documentation
```
- Multiple AI Models (a possible shared interface is sketched after this list):
  - Gemini: Google's vision model
  - Groq: High-performance LLaMA model
  - Ollama: Local model support
- Interpreter: Executes commands using PyAutoGUI
- Screen Manager: Handles screen capture and visual feedback
- Settings Manager: Manages configuration and API keys
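The per-model files under `app/models/` suggest a shared interface behind the backends. Below is a hypothetical sketch of what that abstraction could look like; the `VisionModel` class and `analyze()` signature are assumptions, not the repository's actual API:

```python
# Hypothetical sketch of a shared interface for the three model backends.
# VisionModel and analyze() are assumptions about the design, not the
# repository's actual API.
from abc import ABC, abstractmethod

class VisionModel(ABC):
    """Common contract for the Gemini, Groq, and Ollama backends."""

    @abstractmethod
    def analyze(self, screenshot_path: str, prompt: str) -> str:
        """Send a screenshot plus instruction to the model, return its reply."""

class OllamaModel(VisionModel):
    def __init__(self, endpoint: str = "http://localhost:11434"):
        self.endpoint = endpoint

    def analyze(self, screenshot_path: str, prompt: str) -> str:
        # A real implementation would POST the image and prompt to the
        # local Ollama server and parse the JSON response.
        raise NotImplementedError("illustrative sketch only")
```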
- Cross-Platform Support
  - Windows support
  - Linux support
- Enhanced Visual Understanding
  - Element recognition improvements
  - OCR integration
  - Better coordinate precision
- Advanced Features
  - Custom shortcut definitions
  - Macro recording and playback
  - Task automation sequences
  - Voice command support
- UI Improvements
  - Dark mode support
  - Customizable highlight colors
  - Better status visualization
  - Command history view
- Performance Optimizations
  - Faster screen capture
  - Reduced API calls
  - Better caching
Contributions are welcome! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
This project is licensed under the MIT License - see the LICENSE file for details.
- Google's Gemini Pro Vision model
- Groq's LLaMA Vision model
- Ollama for local model support
- PyAutoGUI for computer control
- The open-source community
If you encounter any issues or have questions:
- Check the Issues page
- Create a new issue if needed
- Contact the maintainers directly
Made with ❤️ by Ates


