Voice-driven, autonomous web navigation powered by Gemini 2.5 Native Audio & Vision.
Welcome to the official repository for IAN!
Navigating the modern web can be an absolute nightmare for visually impaired users. Traditional screen readers break the moment they hit a messy DOM or a cluttered e-commerce site. IAN fixes this by bypassing the code entirely. You speak naturally to our Neo-brutalist React dashboard, and IAN physically "sees" the screen and clicks through the browser for you using headless Chromium.
Built with blood, sweat, and Google Cloud credits for the Google Gemini Live Agent Challenge.
Click the image above to watch the agent navigate Amazon autonomously!
To prevent API rate limits and avoid blocking the WebSocket event loop, IAN splits the brain into two separate processes using the Google Agent Development Kit (ADK):
- The Audio Orchestrator: Streams live PCM audio to gemini-2.5-flash-native-audio to detect voice commands and extract the user's intent with zero-shot accuracy.
- The Visual Navigator: A background thread running headless Playwright that uses gemini-2.5-flash to analyze browser screenshots and calculate precise (X, Y) coordinates to click and type.
Here's the architecture as a Mermaid diagram (renders natively on GitHub):
```mermaid
graph TD
    A[User Voice Input] --> B[Audio Orchestrator]
    B --> C[Gemini 2.5 Flash Native Audio]
    C --> D[Intent Extraction]
    D --> E[WebSocket Event Loop]
    E --> F[Visual Navigator]
    F --> G[Headless Playwright Chromium]
    G --> H[Browser Screenshot]
    H --> I[Gemini 2.5 Flash Vision Analysis]
    I --> J["Calculate (X, Y) Coordinates"]
    J --> K[Click/Type Actions]
    K --> G
    subgraph ADK["Google Agent Development Kit (ADK)"]
        B
        F
    end
    L[Google Cloud Run] -.-> F
```
The diagram traces the flow from voice input through audio processing and intent extraction into the visual navigation loop that drives the browser.
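The hand-off between the two halves is what keeps the WebSocket event loop responsive: sync Playwright work runs on a worker thread fed by a thread-safe queue. A minimal sketch, assuming an intent-dict message shape (the `handled` list stands in for real browser actions):

```python
import asyncio
import queue
import threading

intent_q: "queue.Queue[dict | None]" = queue.Queue()  # orchestrator -> navigator
handled: list[dict] = []                              # stand-in for browser actions

def navigator_worker() -> None:
    """Background thread: sync Playwright work happens here, off the event loop."""
    while True:
        intent = intent_q.get()
        if intent is None:         # sentinel: shut down cleanly
            break
        handled.append(intent)     # real code would drive headless Chromium here

async def on_intent(intent: dict) -> None:
    """Runs on the WebSocket event loop; hands off without ever blocking it."""
    intent_q.put_nowait(intent)    # unbounded queue: put never blocks

worker = threading.Thread(target=navigator_worker, daemon=True)
worker.start()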

Built with a thread-safe, non-blocking Python/React stack.
- AI Models: Gemini 2.5 Flash Native Audio, Gemini 2.5 Flash (Vision)
- Frameworks: Google GenAI SDK, Google Agent Development Kit (ADK)
- Backend: Python, FastAPI, WebSockets (AG-UI Protocol)
- Browser Automation: Playwright (Headless Chromium with Stealth Mode)
- Frontend: React, Next.js (Neo-brutalist dashboard), Web Audio API (PCM 16kHz)
- Communication: WebSocket for real-time bidirectional streaming
- Cloud Infrastructure: Google Cloud Run, Secret Manager
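On the wire, "PCM 16kHz" conventionally means little-endian 16-bit mono samples, while the Web Audio API hands the frontend Float32 samples in [-1.0, 1.0]. Wherever that conversion lands server-side, it could look like this sketch (an assumption about the exact framing; the repo may instead convert in the browser before sending):

```python
import struct

def float32_to_pcm16(samples: list[float]) -> bytes:
    """Clamp [-1.0, 1.0] floats and pack as little-endian 16-bit mono PCM."""
    clamped = (max(-1.0, min(1.0, s)) for s in samples)
    return struct.pack(f"<{len(samples)}h", *(int(s * 32767) for s in clamped))
```

At 16 kHz, a 10 ms frame is 160 samples, i.e. 320 bytes per WebSocket message in this format.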
- Python 3.11+
- Node.js 18+
- 2 separate Google Gemini API Keys (to separate audio/vision quotas)
**Backend Setup**

- Navigate to the backend directory: `cd backend`
- Install dependencies: `pip install -r requirements.txt`
- Install Playwright browsers: `playwright install chromium`
- Create a `.env` file and add your keys:

  ```
  GEMINI_API_KEY=your_audio_agent_key
  GEMINI_API_KEY_BROWSER=your_vision_agent_key
  ```

- Start the server: `uvicorn main:app --reload`

**Frontend Setup**

- Navigate to the frontend directory: `cd frontend`
- Install dependencies: `npm install`
- Start the development server: `npm run dev`

Open http://localhost:3000 and hold the green microphone button to speak!
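The point of the two keys in `.env` is quota isolation: the continuous audio stream and the per-screenshot vision calls bill against separate keys, so one can't rate-limit the other. A minimal sketch of reading them on the backend (variable names match the `.env` above; the repo's actual loading code, e.g. via python-dotenv, may differ):

```python
import os

def load_gemini_keys() -> tuple[str, str]:
    """Read both keys from the environment; failing fast beats a silent 429 later."""
    audio_key = os.environ["GEMINI_API_KEY"]           # Audio Orchestrator quota
    vision_key = os.environ["GEMINI_API_KEY_BROWSER"]  # Visual Navigator quota
    return audio_key, vision_key
```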
Built with ❤️ for the Gemini Live Agent Challenge 2026.
Built by Zaynul Abedin Miah, Tech Community Leader & AI Developer.
Let's collaborate, talk about AGI, or build something awesome together!
"Stop parsing the DOM. Just look at the screen."


