feat(templates): Implement Gemini Computer Use templates for TypeScript and Python #94
base: main
Implement browser control agents using the Gemini 2.5 Computer Use Preview model (gemini-2.5-computer-use-preview-10-2025) with Kernel's Computer Controls API.

## TypeScript Template Changes
- Refactored index.ts to use modular architecture
- Added loop.ts: Core sampling loop implementing Google's agent pattern
- Added session.ts: KernelBrowserSession for browser lifecycle management
- Added tools/computer.ts: Maps Gemini actions to Kernel Computer Controls
- Added tools/types/gemini.ts: Type definitions for Gemini actions
- Updated package.json: @google/genai ^1.0.0

## Python Template (New)
- main.py: Action handler with KernelBrowserSession context manager
- loop.py: Async sampling loop matching TypeScript implementation
- session.py: Async context manager for browser lifecycle
- tools/computer.py: Gemini-to-Kernel action mapping
- tools/types.py: Dataclasses and enums for Gemini actions

## Integration Pattern
Both templates implement Google's recommended Computer Use agent loop:
1. Send request with computerUse tool configured
2. Receive function_call from model (click_at, type_text_at, navigate, etc.)
3. Execute via Kernel Computer Controls API
4. Capture screenshot + URL, send as function_response
5. Loop until task complete

## Key Features
- Coordinate denormalization (Gemini 0-1000 → Kernel pixels)
- Screenshot pruning to manage context size
- Optional replay recording for debugging
- Safety decision handling (auto-acknowledge for automation)

Tested: Both templates successfully navigate to Wikipedia and extract the featured article title.
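The five-step loop in the Integration Pattern above can be sketched roughly as follows. This is a minimal illustration, not the code in loop.py: `FunctionCall`, `FunctionResponse`, `run_agent_loop`, and the scripted stand-ins for the model and the executor are all hypothetical names, and the real templates call the Gemini and Kernel APIs where the stubs appear.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class FunctionCall:
    """An action requested by the model, e.g. click_at or navigate (hypothetical type)."""
    name: str
    args: dict


@dataclass
class FunctionResponse:
    """What is reported back to the model after an action (hypothetical type)."""
    name: str
    url: str
    screenshot_b64: str


def run_agent_loop(
    next_call: Callable[[list], Optional[FunctionCall]],
    execute: Callable[[FunctionCall], FunctionResponse],
    max_iterations: int = 10,
) -> list:
    """Steps 2-5: ask for an action, execute it, report back, repeat."""
    history = []
    for _ in range(max_iterations):
        call = next_call(history)      # step 2: model returns a function_call, or nothing
        if call is None:               # step 5: no further action means the task is done
            break
        history.append(execute(call))  # steps 3-4: run it, record screenshot + URL
    return history


# Scripted stand-ins so the sketch runs without network access.
script = [
    FunctionCall("navigate", {"url": "https://en.wikipedia.org"}),
    FunctionCall("click_at", {"x": 500, "y": 300}),
]


def fake_model(history):
    return script[len(history)] if len(history) < len(script) else None


def fake_executor(call):
    return FunctionResponse(call.name, call.args.get("url", "about:blank"), "<base64 png>")


result = run_agent_loop(fake_model, fake_executor)
print([r.name for r in result])
```

The `max_iterations` cap is an assumption added here so that a model that never stops requesting actions cannot loop forever.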
Minor cleanup following Option A from the alignment analysis:

## Removed Unused Code
- Python: Removed unused EnvState class from types.py
- Python: Removed unused GeminiAction and Optional imports from loop.py
- TypeScript: Removed unused EnvState interface from gemini.ts

## Fixed Type Hints
- Python: Use GeminiFunctionArgs TypedDict instead of Dict[str, Any]
- Python: Export GeminiFunctionArgs from tools package

## Fixed Naming Consistency
- Python: Renamed SCREENSHOT_DELAY_MS to SCREENSHOT_DELAY_SECS (was already in seconds)

## Improved Error Messages
- Added deployment hint to GOOGLE_API_KEY error
- Added payload format hint to query validation error

Net result: -15 lines of unused code, better type safety, clearer errors.
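As an illustration of the type-hint change, a `TypedDict` lets a type checker flag misspelled or mistyped argument keys that a plain `Dict[str, Any]` would accept. The fields and the delay value below are assumptions chosen for the sketch; the actual `GeminiFunctionArgs` in the template may define different keys.

```python
from typing import TypedDict


class GeminiFunctionArgs(TypedDict, total=False):
    # Hypothetical fields; every key is optional because different
    # actions carry different arguments.
    x: int
    y: int
    text: str
    url: str


# Assumed value. The _SECS suffix states the unit that time.sleep() expects.
SCREENSHOT_DELAY_SECS = 0.5


def describe(name: str, args: GeminiFunctionArgs) -> str:
    """Render an action for logging, using coordinates when present."""
    if "x" in args and "y" in args:
        return f"{name} at ({args['x']}, {args['y']})"
    return name


print(describe("click_at", {"x": 500, "y": 300}))
print(describe("go_back", {}))
```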
Add the new Gemini Computer Use templates to the qa.md flow
Remove excessive JSDoc/docstring comments on private helpers and simple type definitions that don't need explanation. Keep only necessary comments that document non-obvious behavior.
…dling - Changed default viewport dimensions from 1024x768 to 1200x800. - Refactored session info construction in the stop method to avoid errors if the session wasn't started, ensuring safer handling of session data.
- Changed the variable used for capturing screenshots from `result.screenshot` to `result.base64_image` for predefined functions, ensuring compatibility with the updated response structure.
update default viewport size
Update DEFAULT_SCREEN_SIZE from 1024x768 to 1200x800 to match the actual browser viewport created in session.ts/session.py. Mismatched dimensions caused incorrect coordinate denormalization for Gemini's 0-1000 scale.
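To see why the mismatch matters, consider a click at the normalized center (500, 500). The sketch below assumes a simple linear mapping from the 0-1000 scale to pixels; `denormalize` is a hypothetical helper, not necessarily the function used in tools/computer.py.

```python
def denormalize(value: int, size: int) -> int:
    """Map a 0-1000 normalized coordinate onto a pixel axis of length `size`."""
    return int(value / 1000 * size)


# Same normalized point, two assumed screen sizes.
old = (denormalize(500, 1024), denormalize(500, 768))  # stale 1024x768 constant
new = (denormalize(500, 1200), denormalize(500, 800))  # actual 1200x800 viewport

print(old, new)
```

With the stale constant the click lands at (512, 384) instead of (600, 400), and the error grows toward the right and bottom edges of the viewport.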
…_SIZE Import DEFAULT_SCREEN_SIZE from types into session files so viewport dimensions are defined in one place. Users can now update dimensions by editing only the types file.
- Introduced `getSystemPrompt` function in both TypeScript and Python templates to generate the system prompt dynamically, including the current date. - Updated the sampling loop to utilize the new function, improving code readability and maintainability.
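A minimal sketch of such a function in Python, assuming a snake_case name and placeholder prompt wording (neither is taken from the template). Accepting the date as an optional parameter keeps the function deterministic under test.

```python
from datetime import date
from typing import Optional


def get_system_prompt(today: Optional[date] = None) -> str:
    """Build the system prompt, embedding the current date (hypothetical wording)."""
    if today is None:
        today = date.today()
    return (
        "You are a browser automation agent. "
        f"The current date is {today.isoformat()}."
    )


print(get_system_prompt(date(2025, 1, 2)))
```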
- Deleted the test case for the Gemini Computer Use template for Python, as it is no longer available. - Updated the list of unavailable template-language combinations accordingly.
Handled in this commit 08ff1aa
- Corrected the Google AI API key link in both Python and TypeScript README files. - Updated the Kernel documentation link to point to the Computer Controls section for better clarity.
- Updated the magnitude assignment in the ComputerTool class to use nullish coalescing (??) instead of logical OR (||) for better handling of undefined values.
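The fix itself is in TypeScript, but the same pitfall exists in Python, where `or` plays the role of `||` and an explicit `is None` check plays the role of `??`. The sketch below uses hypothetical function names and an assumed default of 3 to show how an explicit magnitude of 0 is silently replaced by the falsy-based fallback.

```python
DEFAULT_MAGNITUDE = 3  # assumed default, for illustration only


def magnitude_with_or(args: dict) -> int:
    # Like `||`: any falsy value, including an explicit 0, falls back to the default.
    return args.get("magnitude") or DEFAULT_MAGNITUDE


def magnitude_with_none_check(args: dict) -> int:
    # Like `??`: only a missing (None) value falls back to the default.
    value = args.get("magnitude")
    return DEFAULT_MAGNITUDE if value is None else value


print(magnitude_with_or({"magnitude": 0}))
print(magnitude_with_none_check({"magnitude": 0}))
```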
Conducted multiple agent reviews + deslop commands + bugbot fixes.
- Added descriptions for `open_web_browser` and `search` actions in both Python and TypeScript README files to enhance clarity on browser functionalities.
…uterTool - Eliminated the `screenshot` attribute from the `ToolResult` class and its usage in the `ComputerTool` class, streamlining the data structure and focusing on the `base64_image` representation.
Both templates for Gemini CUA worked very well in testing. replays.9.mp4
Sayan-
left a comment
overall lgtm. my comments aren't hard blocking but would prefer to address now.
pkg/templates/typescript/gemini-computer-use/tools/types/gemini.ts
Outdated
Resolves conflicts by including both Gemini and Yutori computer use templates for Python.
- Added an `error` field to the `QueryOutput` type in both Python and TypeScript templates to capture and return error messages. - Updated the `sampling_loop` function in both languages to handle exceptions and store error messages, improving feedback during execution. - Adjusted return values to include the new `error` field, ensuring consistent output structure across templates.
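One way to picture the change: the loop catches exceptions and returns them in the output instead of letting them propagate, so callers always receive the same structure. The sketch below is an assumption-laden illustration; `QueryOutput` is modeled as a dataclass with hypothetical `result` and `error` fields, and `run_query` stands in for the sampling loop.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class QueryOutput:
    result: str
    error: Optional[str] = None  # populated only when the run failed


def run_query(task: Callable[[], str]) -> QueryOutput:
    """Run a task and always return a QueryOutput, even on failure."""
    try:
        return QueryOutput(result=task())
    except Exception as exc:
        return QueryOutput(result="", error=str(exc))


def failing_task() -> str:
    raise RuntimeError("browser session closed")


ok = run_query(lambda: "Featured article: Example")
bad = run_query(failing_task)
print(ok)
print(bad)
```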
pkg/templates/typescript/gemini-computer-use/tools/types/gemini.ts
Outdated
refactor(loop): simplify Gemini client initialization in sampling loop - Removed environment variable dependencies from the Gemini client initialization in the `sampling_loop` function, streamlining the setup process. - The client is now instantiated solely with the API key, enhancing clarity and reducing complexity in the code.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
pkg/templates/typescript/gemini-computer-use/tools/types/gemini.ts
Outdated
…ns dynamically - Updated the `PREDEFINED_COMPUTER_USE_FUNCTIONS` in both Python and TypeScript templates to derive directly from the `GeminiAction` enum, ensuring consistency and reducing maintenance overhead when adding new actions.
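Deriving the list from the enum might look like the sketch below. The enum members shown are a subset of the supported actions listed in the PR description; the exact class layout in tools/types.py is an assumption.

```python
from enum import Enum


class GeminiAction(str, Enum):
    # Subset of the supported actions, for illustration.
    CLICK_AT = "click_at"
    HOVER_AT = "hover_at"
    TYPE_TEXT_AT = "type_text_at"
    NAVIGATE = "navigate"
    WAIT_5_SECONDS = "wait_5_seconds"


# Derived once from the enum, so adding a member updates the list automatically.
PREDEFINED_COMPUTER_USE_FUNCTIONS = [action.value for action in GeminiAction]


def is_predefined(name: str) -> bool:
    return name in PREDEFINED_COMPUTER_USE_FUNCTIONS


print(PREDEFINED_COMPUTER_USE_FUNCTIONS)
```

Compared with a hand-maintained list of strings, this removes the risk of the two definitions drifting apart when a new action is added.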
…ion option for python - Added error logging in the TypeScript template to print error messages when present in the result. - Implemented a local execution block in the Python template to run tests directly, including error handling and logging for better debugging feedback.
Summary
This PR introduces full-fledged Gemini Computer Use templates for both TypeScript and Python. The templates implement Google's Gemini 2.5 Computer Use model using Kernel's Computer Controls API for browser automation.
Changes
New Templates
- `gemini-computer-use` template with native Kernel integration
- `gemini-computer-use` template following the same architecture
Architecture
Both templates follow a modular design based on Google's computer-use-preview reference:
- `index.ts` / `main.py`
- `loop.ts` / `loop.py`
- `session.ts` / `session.py`
- `tools/computer.ts` / `tools/computer.py`
- `tools/types/`
Supported Actions
- `click_at`
- `hover_at`
- `type_text_at`
- `scroll_document`
- `scroll_at`
- `navigate`
- `go_back` / `go_forward`
- `key_combination`
- `drag_and_drop`
- `wait_5_seconds`
Key Features
- `npx tsx index.ts` / direct Python execution
CLI Updates
- `TemplateGeminiComputerUse` constant and template metadata in `pkg/create/templates.go`
Usage
Requirements
- `GOOGLE_API_KEY` - Google AI Studio
Testing
Templates have been manually tested with various browser automation tasks.
Related
Note
Adds native Gemini Computer Use support across both languages and aligns docs/tests/CLI metadata.
- `python/gemini-computer-use` with `main.py`, `loop.py`, `session.py`, and `tools/` implementing Kernel Computer Controls and Gemini sampling loop
- `typescript/gemini-computer-use`: replaces Stagehand with native Kernel APIs; adds `loop.ts`, `session.ts`, `tools/`; updates `index.ts`, README, and dependencies
- `GeminiComputerUse` as available for Python and TypeScript; sets concrete invoke commands for both languages
- `py-gemini-cua` to matrix, create/deploy steps, invoke commands; updates totals (18 apps/21 tests)
Written by Cursor Bugbot for commit 5e859dc.