SYLVA: Smart Yield-Logic Vision Agent

SYLVA is a vision-first autonomous web navigator built for the Gemini Live Agent Challenge. Unlike traditional automation tools, which depend on fragile DOM selectors, SYLVA perceives the web through raw pixels and uses Gemini's visual reasoning to drive interaction with sub-5s latency.

System Architecture

The following diagram illustrates the interaction between the Local Sentinel (Driver), the Reasoning Core (Bridge), and Google's AI models.


graph TD
    subgraph "Local Environment (User Machine)"
        Driver["SYLVA Driver (Sentinel)"]
        Browser["Target Browser (Playwright)"]
        Overlay["Integrated UI Overlay"]
    end

    subgraph "Google Cloud (Reasoning Core)"
        Bridge["SYLVA Bridge (FastAPI/Cloud Run)"]
        Gemini["Gemini 2.5 Flash Lite (Vertex AI)"]
    end

    %% Interaction Flow
    Browser -- "Raw Screenshots & Telemetry" --> Driver
    Driver -- "Optimized JPEG & State" --> Bridge
    Bridge -- "Visual Grounding Prompt" --> Gemini
    Gemini -- "Structured Action JSON" --> Bridge
    Bridge -- "SmartActionChip" --> Driver
    Driver -- "Direct Logic Execution" --> Browser
    Driver -- "Real-time Logs" --> Overlay

    %% Styling
    style Gemini fill:#4285F4,stroke:#333,stroke-width:2px,color:#fff
    style Bridge fill:#34A853,stroke:#333,stroke-width:2px,color:#fff
    style Driver fill:#FBBC05,stroke:#333,stroke-width:2px,color:#000
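The "Structured Action JSON" step in the diagram can be illustrated with a minimal validation sketch. The actual schema is not documented here, so the field names (`action`, `x`, `y`, `text`), the action set, and the `ActionChip` class are all hypothetical stand-ins rather than SYLVA's real `SmartActionChip` format.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Hypothetical action vocabulary; the real set used by SYLVA is not documented.
ALLOWED_ACTIONS = {"click", "type", "scroll", "done"}


@dataclass(frozen=True)
class ActionChip:
    """Illustrative stand-in for the action payload sent to the Driver."""
    action: str
    x: Optional[int] = None
    y: Optional[int] = None
    text: Optional[str] = None


def parse_action(raw: str) -> ActionChip:
    """Parse and validate a model response before it reaches the Driver."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    action = data.get("action")
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")

    x, y = data.get("x"), data.get("y")
    if action == "click":
        # Clicks need both coordinates on the normalized 0-1000 grid.
        for name, value in (("x", x), ("y", y)):
            is_int = isinstance(value, int) and not isinstance(value, bool)
            if not is_int or not 0 <= value <= 1000:
                raise ValueError(f"{name} must be an integer in 0..1000")

    return ActionChip(action=action, x=x, y=y, text=data.get("text"))
```

Rejecting malformed responses at the Bridge keeps the Driver from executing an action it cannot interpret.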

Technical Highlights

  • Vision-First Constraint: Zero reliance on CSS selectors or element IDs. The system uses a normalized 0-1000 coordinate system derived from raw pixel analysis.
  • Latency Optimization: Achieves a P95 round-trip latency of 4.09 s through multi-stage image optimization and the fast inference of Gemini 2.5 Flash Lite.
  • Resilient Navigation: Capable of handling modern, dynamic web applications where DOM structures frequently change.
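The normalized 0-1000 coordinate system can be sketched as a small conversion from model output to viewport pixels. This is an illustration under stated assumptions, not SYLVA's actual code: the `to_pixels` helper and its parameters are hypothetical, and it assumes the grid maps linearly onto the full screenshot.

```python
from typing import Tuple


def to_pixels(nx: int, ny: int, width: int, height: int,
              scale: int = 1000) -> Tuple[int, int]:
    """Map a point on the normalized 0..scale grid to pixel coordinates.

    `width` and `height` are the viewport size in pixels. The result is
    clamped so that the maximum normalized value still lands inside the
    viewport rather than one pixel past its edge.
    """
    if not (0 <= nx <= scale and 0 <= ny <= scale):
        raise ValueError("normalized coordinates must lie in 0..scale")
    if width <= 0 or height <= 0:
        raise ValueError("viewport dimensions must be positive")

    px = min(width - 1, round(nx * width / scale))
    py = min(height - 1, round(ny * height / scale))
    return px, py


# Example: the point (500, 250) on a 1280x720 viewport.
print(to_pixels(500, 250, 1280, 720))  # (640, 180)
```

The resulting pixel pair could then be passed to a pointer call such as Playwright's `page.mouse.click(x, y)`, which takes viewport coordinates and needs no selector.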

Verification & Performance

  • Environment: Deployed on Google Cloud Run using the gcloud CLI and Application Default Credentials (ADC).
  • Browser: Successfully tested on complex web workflows using Playwright.
  • Security: Implements a zero-secret policy using Google Cloud IAM roles.
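For readers who want to reproduce a P95 latency figure like the one cited under Technical Highlights, the following is a minimal sketch of a nearest-rank percentile over recorded round-trip times. How SYLVA actually measured its latency is not stated; the `percentile` helper is hypothetical and the sample values are placeholders, not project measurements.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Return the nearest-rank percentile of `samples` (0 < pct <= 100)."""
    if not samples:
        raise ValueError("at least one sample is required")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")

    ordered = sorted(samples)
    # Nearest-rank: the smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Placeholder round-trip times in seconds, for illustration only.
latencies = [float(n) for n in range(1, 21)]
print(percentile(latencies, 95))  # 19.0
```

Nearest-rank always returns an observed sample, which keeps the reported number tied to a real request rather than an interpolated value.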

Note: This document and the associated project components were created for the purposes of entering the Google Gemini Live Agent Challenge hackathon.