LLM Vision OS is a Python application that uses the Gemini 1.5 Flash API for image analysis and Google Speech for speech synthesis. When you click 'start', it takes a screenshot every few seconds and sends it to Gemini Flash for analysis. You can ask questions and get answers about anything visible on your screen.
- Capture and analyze screenshots at specified intervals
- Real-time speech recognition and synthesis
- Export logs of analysis
- Simple GUI
- Python 3.10 or later
- Required Python packages (see `requirements.txt`)
- Google Cloud account with the Generative Language API and Cloud Text-to-Speech API both enabled
- Google Cloud API key set as an environment variable named `GOOGLE_API_KEY`
- By default, a screenshot is taken and processed every 2 seconds. You can change this interval in the interface.
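
To illustrate how the `GOOGLE_API_KEY` requirement and the 2-second interval fit together, here is a minimal sketch of a capture loop. It is not the application's actual code: `load_api_key`, `run_capture_loop`, and the `capture` and `analyze` callbacks are hypothetical names chosen for this example, and the real screenshot and Gemini calls are left to whatever you pass in.

```python
import os
import time
from typing import Callable, Optional

DEFAULT_INTERVAL_SECONDS = 2.0  # default interval stated in this README


def load_api_key(env_var: str = "GOOGLE_API_KEY") -> str:
    """Return the API key from the environment, or raise a clear error."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before starting.")
    return key


def run_capture_loop(
    capture: Callable[[], bytes],       # hypothetical: returns screenshot bytes
    analyze: Callable[[bytes], str],    # hypothetical: sends the image for analysis
    interval: float = DEFAULT_INTERVAL_SECONDS,
    max_iterations: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> list[str]:
    """Capture and analyze repeatedly, aiming for one cycle per `interval` seconds."""
    results: list[str] = []
    count = 0
    while max_iterations is None or count < max_iterations:
        started = time.monotonic()
        results.append(analyze(capture()))
        count += 1
        # Subtract the time spent on capture and analysis so that slow
        # requests do not stretch the cycle beyond the requested interval.
        elapsed = time.monotonic() - started
        sleep(max(0.0, interval - elapsed))
    return results
```

With stub callbacks, `run_capture_loop(lambda: b"img", lambda data: "ok", max_iterations=3)` runs three cycles and returns three results. Passing `sleep` as a parameter is only a convenience for testing the loop without waiting.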
