azaynul10/ian-accessibility-agent

IAN: Intelligent Accessibility Navigator

Voice-driven, autonomous web navigation powered by Gemini 2.5 Native Audio & Vision.

Welcome

Welcome to the official repository for IAN! 👋

Navigating the modern web can be an absolute nightmare for visually impaired users. Traditional screen readers break the moment they hit a messy DOM or a cluttered e-commerce site. IAN fixes this by bypassing the code entirely. You speak naturally to our Neo-brutalist React dashboard, and IAN physically "sees" the screen and clicks through the browser for you using headless Chromium.

Built with blood, sweat, and Google Cloud credits for the Google Gemini Live Agent Challenge.

Watch the Live Demo

Watch the IAN Demo

Click the image above to watch the agent navigate Amazon autonomously!

Dual-Model Architecture

To avoid hitting API rate limits and to keep the WebSocket event loop unblocked, IAN splits the brain into two separate components using the Google Agent Development Kit (ADK):

  • The Audio Orchestrator: Streams live PCM audio to gemini-2.5-flash-native-audio, which detects voice commands and extracts the user's intent zero-shot, with no task-specific training.
  • The Visual Navigator: A background thread running headless Playwright that uses gemini-2.5-flash to analyze browser screenshots and calculate the precise (X, Y) coordinates at which to click and type.
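
The split can be pictured as an asyncio loop that never runs vision work itself. The sketch below is a minimal illustration, not IAN's actual code: `extract_intent` and `navigate` are hypothetical stand-ins for the two Gemini-backed agents, and the hand-off uses `asyncio.to_thread` so the slow, blocking navigator runs off the event loop.

```python
import asyncio
import threading
import time


def navigate(intent: str) -> dict:
    """Hypothetical stand-in for the Visual Navigator (blocking work)."""
    time.sleep(0.05)  # simulates screenshot + vision call + click
    on_main = threading.current_thread() is threading.main_thread()
    return {"intent": intent, "on_main_thread": on_main}


async def extract_intent(audio_chunk: bytes) -> str:
    """Hypothetical stand-in for the Audio Orchestrator."""
    await asyncio.sleep(0)  # yields control, as a streaming call would
    return audio_chunk.decode("utf-8")


async def handle_command(audio_chunk: bytes) -> dict:
    intent = await extract_intent(audio_chunk)
    # Run the blocking navigator in a worker thread so the event loop
    # stays free to service WebSocket traffic in the meantime.
    return await asyncio.to_thread(navigate, intent)


async def main() -> dict:
    ticks = 0

    async def heartbeat() -> None:
        # Counts how often the loop gets to run while navigation is busy.
        nonlocal ticks
        while True:
            ticks += 1
            await asyncio.sleep(0.01)

    beat = asyncio.create_task(heartbeat())
    result = await handle_command(b"search for headphones")
    beat.cancel()
    result["ticks"] = ticks
    return result


if __name__ == "__main__":
    print(asyncio.run(main()))
```

The heartbeat task keeps ticking while `navigate` sleeps, which is the property the dual-model design is after: a slow vision step does not stall audio streaming.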

Architecture Diagram

The architecture is represented here as a Mermaid graph, which renders directly on GitHub:

graph TD
    A[User Voice Input] --> B[Audio Orchestrator]
    B --> C[Gemini 2.5 Flash Native Audio]
    C --> D[Intent Extraction]
    D --> E[WebSocket Event Loop]
    E --> F[Visual Navigator]
    F --> G[Headless Playwright Chromium]
    G --> H[Browser Screenshot]
    H --> I[Gemini 2.5 Flash Vision Analysis]
    I --> J["Calculate (X, Y) Coordinates"]
    J --> K[Click/Type Actions]
    K --> G
    subgraph "Google Agent Development Kit (ADK)"
        B
        F
    end
    L[Google Cloud Run] -.-> F

This diagram illustrates the flow from user input through audio processing, intent extraction, and visual navigation in a looped browser interaction.
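
Each pass through the screenshot loop ends with turning the vision model's answer into a pixel position. The sketch below assumes the model is prompted to return a bounding box as `[ymin, xmin, ymax, xmax]` normalized to 0 to 1000, a convention Gemini uses for object detection; whether IAN relies on it is an assumption, and `box_to_click_point` is a hypothetical helper name.

```python
from typing import Sequence, Tuple


def box_to_click_point(
    box: Sequence[float], width: int, height: int, scale: int = 1000
) -> Tuple[int, int]:
    """Map a normalized [ymin, xmin, ymax, xmax] box to the pixel at its center."""
    if len(box) != 4:
        raise ValueError("expected [ymin, xmin, ymax, xmax]")
    ymin, xmin, ymax, xmax = box
    if not (0 <= xmin <= xmax <= scale and 0 <= ymin <= ymax <= scale):
        raise ValueError(f"box out of range: {list(box)}")
    # Center of the box, rescaled from the 0..scale grid to screenshot pixels.
    x = round((xmin + xmax) * width / (2 * scale))
    y = round((ymin + ymax) * height / (2 * scale))
    # Clamp so a box touching the far edge never yields an off-screen click.
    return min(x, width - 1), min(y, height - 1)


if __name__ == "__main__":
    print(box_to_click_point([100, 200, 300, 400], width=1280, height=720))
```

With Playwright's Python API, the resulting point would typically be passed to `page.mouse.click(x, y)`.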

Currently Working On

🛠️ Optimizing the Agent

Right now, the focus is strictly on stability and hackathon delivery!
  • ✅ Finalizing the Dual-Model Architecture to prevent API rate limits.
  • ✅ Perfecting the AG-UI WebSocket Protocol for real-time Voice Activity Detection.
  • 🔄 Scaling the Playwright visual agent loop on Google Cloud Run.
  • 🔜 Adding support for multi-tab contextual memory.

👉 Vote for us on Devpost!

Tech Stack & Skills

Built with a thread-safe, non-blocking Python/React stack.

🛠️ Tech Stack & Skills

The Brains

  • AI Models: Gemini 2.5 Flash Native Audio, Gemini 2.5 Flash (Vision)
  • Frameworks: Google GenAI SDK, Google Agent Development Kit (ADK)

The Brawn (Backend)

  • Language & API: Python, FastAPI, WebSockets (AG-UI Protocol)
  • Browser Automation: Playwright (Headless Chromium with Stealth Mode)
  • Cloud Infrastructure: Google Cloud Run, Secret Manager

The Beauty (Frontend)

  • UI: React, Next.js (Neo-brutalist dashboard style)
  • Audio: Web Audio API (PCM 16kHz)
  • Communication: WebSocket for real-time bidirectional communication
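
The Web Audio API produces floating-point samples in the range -1.0 to 1.0, which have to become integer PCM before streaming. As an illustration of that conversion and of the data rate involved, here is a small sketch; the 16-bit sample width, mono channel layout, little-endian byte order, and helper names are assumptions, not details taken from IAN's code.

```python
import struct
from typing import Iterable

SAMPLE_RATE_HZ = 16_000  # the 16 kHz rate listed for the frontend audio
BYTES_PER_SAMPLE = 2     # assumption: 16-bit samples, mono


def float_to_pcm16(samples: Iterable[float]) -> bytes:
    """Clamp floats to [-1.0, 1.0] and pack them as little-endian int16."""
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


def chunk_size_bytes(duration_ms: int) -> int:
    """Bytes of mono PCM needed to hold `duration_ms` of audio."""
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * duration_ms // 1000


if __name__ == "__main__":
    print(len(float_to_pcm16([0.0, 0.5, -0.5])), chunk_size_bytes(100))
```

Under these assumptions, one second of audio is 32,000 bytes, so a 100 ms chunk sent over the WebSocket is 3,200 bytes.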

⚙️ How to Run Locally

Prerequisites

  • Python 3.11+
  • Node.js 18+
  • Two separate Google Gemini API keys (one for the audio agent and one for the vision agent, so their quotas stay separate)

Backend Setup

  1. Navigate to the backend directory:
    cd backend
  2. Install dependencies:
    pip install -r requirements.txt
  3. Install Playwright browsers:
    playwright install chromium
  4. Create a .env file and add your keys:
    GEMINI_API_KEY=your_audio_agent_key
    GEMINI_API_KEY_BROWSER=your_vision_agent_key
  5. Start the server:
    uvicorn main:app --reload
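
Since the backend expects two distinct keys, it can help to fail fast at startup when either one is missing or when the same key has been pasted twice. The following is a minimal sketch of such a check; `load_api_keys` is a hypothetical helper, and it assumes the variables from the `.env` file have already been loaded into the process environment.

```python
import os
from typing import Mapping, Optional, Tuple

AUDIO_KEY_VAR = "GEMINI_API_KEY"
VISION_KEY_VAR = "GEMINI_API_KEY_BROWSER"


def load_api_keys(env: Optional[Mapping[str, str]] = None) -> Tuple[str, str]:
    """Return (audio_key, vision_key) or raise with a readable message."""
    env = os.environ if env is None else env
    audio = env.get(AUDIO_KEY_VAR, "").strip()
    vision = env.get(VISION_KEY_VAR, "").strip()
    pairs = ((AUDIO_KEY_VAR, audio), (VISION_KEY_VAR, vision))
    missing = [name for name, value in pairs if not value]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    if audio == vision:
        raise RuntimeError("audio and vision keys are identical; expected two separate keys")
    return audio, vision


if __name__ == "__main__":
    # Demo with placeholder values instead of the real environment.
    demo = {AUDIO_KEY_VAR: "your_audio_agent_key", VISION_KEY_VAR: "your_vision_agent_key"}
    print(load_api_keys(demo))
```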

Frontend Setup

  1. Navigate to the frontend directory:
    cd frontend
  2. Install dependencies:
    npm install
  3. Start the development server:
    npm run dev

Open http://localhost:3000 and hold the green microphone button to speak!

Built with ❤️ for the Gemini Live Agent Challenge 2026.

Establish Connection

Built by Zaynul Abedin Miah – Tech Community Leader & AI Developer.

Let's collaborate, talk about AGI, or build something awesome together!

Twitter LinkedIn Facebook

"Stop parsing the DOM. Just look at the screen."
