A real-time, multimodal Live Agent that sees what you see, hears what you hear, and acts before you ask.
Category: Live Agents
Hackathon: Gemini Live Agent Challenge #GeminiLiveAgentChallenge
You can test Arqivon in under 2 minutes — no setup or build required.
- Download the APK: arqivon-release.apk (38 MB)
- Install on any Android device (Android 7+). You may need to allow "Install from unknown sources."
- Sign in with a Gmail account — Arqivon uses Google Sign-In via Firebase Auth. Any standard `@gmail.com` account will work.
- Grant permissions when prompted — microphone and camera access are required for live sessions.
| Step | What to do | What you'll see |
|---|---|---|
| 1. Assistant Mode | Tap Start Video Live Session → point camera at any object and ask "What is this?" | Real-time voice response + Smart Action Cards |
| 2. Translator Mode | Swipe to Translator tab → speak or point camera at text in another language | Live translation subtitle overlay |
| 3. Tutor Mode | Swipe to Tutor tab → point camera at a math problem or textbook page → ask "Solve this" | Step-by-step solution cards with final answer |
| 4. Support Mode | Swipe to Support tab → describe any technical issue verbally | Topic tracking + resolution logging |
| 5. Audio-Only | Tap Start Audio Live Session instead → have a voice-only conversation | Works without camera, pure voice agent |
| 6. PDF Export | In any mode, say "Export this as a PDF" after getting a response | PDF generated → native share sheet opens |
| 7. Mode Switch | Switch between modes mid-session using the top mode selector | Agent persona, voice, and tools change instantly |
- Use a Gmail account (`@gmail.com`) to sign in. Firebase Auth is configured for Google Sign-In.
- Allow microphone + camera permissions — the agent needs both for multimodal input.
- The backend is always running on Google Cloud Run (min 1 instance, zero cold start). No warm-up needed.
- Backend URL: `wss://arqivon-backend-653546103163.us-central1.run.app/ws`
- If the first connection takes a moment, that's the Gemini Live API session initializing (~2-3 seconds).
Current AI assistants are trapped behind a text box. You type, wait, read. But real life doesn't pause for you to type — you're holding groceries while reading a foreign menu, staring at a math problem on a whiteboard, or troubleshooting a device with both hands full. The world is multimodal; your AI assistant should be too.
Arqivon transforms your phone into an intelligent Living Lens. Point your camera at anything — a document in another language, a math problem, a broken appliance — and Arqivon simultaneously processes your live video feed and continuous voice through the Gemini Live API. It doesn't just describe what it sees; it takes action through 17 agentic tools, creates exportable PDFs, and remembers context across sessions.
| Feature | Traditional AI | Arqivon |
|---|---|---|
| Input | Text only | Simultaneous voice + camera (2fps JPEG + 16kHz PCM) |
| Interaction | Turn-based | Real-time with barge-in (native VAD) |
| Output | Text response | Voice + UI cards + translations + PDF exports |
| Context | Single session | Persistent memory via Firestore |
| Specialization | One-size-fits-all | 4 mode-specific agent personas with dedicated tools |
| Mode | What It Does | Key Tools |
|---|---|---|
| Assistant ✨ | Proactive multimodal assistant — detects actionable items in your camera feed, creates Smart Action Cards, maintains persistent memory | analyze_live_frame, create_ui_action, upsert_firestore_memory |
| Translator 🌐 | Broadcast-quality real-time translator — handles 100+ languages, document translation via camera, PDF export of translations | live_translate, detect_language, translation_card, export_document |
| Tutor 🎓 | Vision-enabled genius tutor — solves math/science/any subject fully, shows step-by-step solutions, grades work, exports solutions as PDF | analyze_homework, solve_problem, explain_concept, provide_hint, grade_step, tutor_card, export_document |
| Support 🎧 | Elite customer support agent — tracks conversation topics, escalates cases, logs resolutions, exports support notes | switch_topic, escalate_case, log_resolution, support_card, export_document |
| Technology | How We Use It |
|---|---|
| Gemini Live API (`gemini-2.5-flash-native-audio-latest`) | Real-time bidirectional audio + vision via `google-genai` SDK `aio.live.connect()` with native audio output |
| Google GenAI SDK (`google-genai>=1.64.0`) | `send_realtime_input()`, `send_client_content()`, `send_tool_response()` for multimodal streaming |
| Google Cloud Run | Production container hosting — min 1 instance (zero cold starts), 1hr WebSocket timeout, CPU always-on |
| Cloud Firestore | Sessions, persistent memories, translations, solutions, support topics, exported documents |
| Firebase Auth | Google Sign-In authentication with Firebase Admin SDK verification |
| Secret Manager | GEMINI_API_KEY secret injection into Cloud Run |
| Container Registry | Docker image storage for Cloud Run deployments |
| Cloud Storage | Media caching for camera frames and audio |
┌──────────────────────┐ Bidirectional WebSocket ┌─────────────────────────┐
│ Flutter App │ ◄──────────────────────────► │ FastAPI on Cloud Run │
│ (Riverpod) │ audio PCM + JPEG frames → │ │
│ │ ← audio + UI actions + │ ┌───────────────────┐ │
│ ┌────────────────┐ │ translations + tutor │ │ Mode-Aware System │ │
│ │ Mode Selector │ │ steps + exports │ │ Prompts │ │
│ ├────────────────┤ │ │ ├───────────────────┤ │
│ │ Camera 2fps │ │ │ │ Tool Registry │ │
│ ├────────────────┤ │ │ │ (17 tools) │ │
│ │ Mic 16kHz PCM │ │ │ ├───────────────────┤ │
│ ├────────────────┤ │ │ │ Gemini Live │ │
│ │ Mode-Specific │ │ │ │ Session │ │
│ │ UI Overlays │ │ │ │ (per-user) │ │
│ ├────────────────┤ │ │ ├───────────────────┤ │
│ │ PDF Export │ │ │ │ Firestore │ │
│ └────────────────┘ │ │ └───────────────────┘ │
└────────┬──────────────┘ └───────┬─────────────────┘
│ Firebase Auth / Firestore │
└──────────────────────────────────────────────────────┘
- User speaks + points camera → Flutter captures 16kHz PCM audio + 2fps JPEG frames
- WebSocket → Sends both streams simultaneously to FastAPI backend on Cloud Run
- Backend → Relays to Gemini Live API via the `google-genai` SDK's `aio.live.connect()`
- Gemini → Processes multimodal input, invokes function calls (17 registered tools)
- Tool Registry → Dispatches tool calls, routes typed results back to client
- Client → Renders mode-specific overlays (translation subtitles, tutor steps, export cards)
- Audio response → Gemini's native audio streams back through WebSocket to client speaker
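The relay step in the flow above can be sketched as two concurrent asyncio coroutines bridging the client connection and the Gemini session. This is a minimal, self-contained simulation using queues — names like `client_ws` and the `realtime_input` envelope are illustrative, not the actual `main.py` API:

```python
import asyncio

async def relay(client_ws: asyncio.Queue, gemini: asyncio.Queue, out: list) -> None:
    """Bridge client -> Gemini and Gemini -> client concurrently (simplified sketch)."""

    async def upstream():
        # Forward client audio/video chunks to the Gemini Live session.
        while (chunk := await client_ws.get()) is not None:
            await gemini.put({"realtime_input": chunk})
        await gemini.put(None)  # signal end of session

    async def downstream():
        # Stream Gemini's audio / tool messages back toward the client.
        while (msg := await gemini.get()) is not None:
            out.append(msg)

    # The production backend runs a third coroutine for the 12s heartbeat.
    await asyncio.gather(upstream(), downstream())
```

In the real backend a third coroutine sends the heartbeat, and the Gemini side is the live session object rather than a queue.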
- Barge-in / Interruption: Native VAD — interrupt the agent mid-sentence and it instantly re-focuses
- Continuous streaming: Audio and video flow simultaneously, not turn-based
- Mode switching: Change agent persona mid-session without disconnecting
- Each mode has a completely different system prompt and personality
- Mode-specific tool declarations — agents only see tools relevant to their role
- Visual identity via mode-colored UI accents (Indigo/Amber/Emerald/Blue)
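The per-mode prompt and tool selection can be sketched as a simple lookup — the mode names and tool names mirror this README, but the prompt strings and the shared/mode split here are illustrative stand-ins for the backend's real configuration:

```python
# Illustrative per-mode configuration; real system prompts live in the backend.
MODE_CONFIG = {
    "assistant":  {"prompt": "You are a proactive multimodal assistant.",
                   "tools": []},
    "translator": {"prompt": "You are a broadcast-quality real-time translator.",
                   "tools": ["live_translate", "detect_language",
                             "translation_card", "export_document"]},
    "tutor":      {"prompt": "You are a vision-enabled genius tutor.",
                   "tools": ["analyze_homework", "solve_problem", "explain_concept",
                             "provide_hint", "grade_step", "tutor_card", "export_document"]},
    "support":    {"prompt": "You are an elite customer support agent.",
                   "tools": ["switch_topic", "escalate_case", "log_resolution",
                             "support_card", "export_document"]},
}

# Tools every mode sees, per the tool table below.
SHARED_TOOLS = ["analyze_live_frame", "upsert_firestore_memory", "create_ui_action"]

def session_config(mode: str) -> dict:
    """Build the system prompt + tool declarations for a new Live session."""
    cfg = MODE_CONFIG[mode]
    # Agents only see tools relevant to their role, plus the shared set.
    tools = sorted(set(SHARED_TOOLS) | set(cfg["tools"]))
    return {"system_instruction": cfg["prompt"], "tools": tools}
```

Switching modes mid-session amounts to tearing down the live session and reconnecting with a different `session_config(mode)`.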
- WebSocket reconnection with exponential backoff (5 attempts, jitter)
- 12-second heartbeat keeps connections alive
- Graceful Gemini session recovery on API timeouts
- Force-restart audio recorder on Android audio focus changes
Arqivon/
├── backend/ # Python FastAPI backend
│ ├── main.py # WebSocket relay + Gemini Live sessions (759 LOC)
│ ├── tool_registry.py # 17 agentic tools across 4 modes (668 LOC)
│ ├── models.py # Pydantic schemas (AgentMode, messages, sessions)
│ ├── config.py # Environment-based settings
│ ├── requirements.txt # Python dependencies
│ ├── Dockerfile # Multi-stage Cloud Run container
│ ├── service.yaml # Cloud Run Knative deployment config
│ └── .env.example # Required env vars template
├── app/ # Flutter mobile app
│ ├── pubspec.yaml
│ └── lib/
│ ├── main.dart # App entry, auth gate, navigation
│ ├── config/
│ │ ├── constants.dart # Backend URL, audio/video settings
│ │ └── theme.dart # Material 3 light/dark themes
│ ├── models/
│ │ ├── agent_mode.dart # AgentMode enum + TutorStep, TranslationOverlay,
│ │ │ # SupportTopic, ExportDocument models
│ │ ├── ws_message.dart # WebSocket message types
│ │ ├── smart_action.dart # Smart Action Card model
│ │ └── session_model.dart # Archive session model
│ ├── services/
│ │ ├── websocket_service.dart # Production WS with backoff
│ │ ├── audio_service.dart # Mic capture (16kHz PCM) + playback
│ │ ├── auth_service.dart # Firebase Auth + Google Sign-In
│ │ ├── export_service.dart # PDF generation + sharing
│ │ └── firestore_service.dart # Sessions & memories CRUD
│ ├── providers/
│ │ ├── live_session_provider.dart # Mode-aware AsyncNotifier (core)
│ │ ├── settings_provider.dart # Theme, voice, mode, language
│ │ ├── session_provider.dart # Archive session list
│ │ ├── auth_provider.dart # Auth state
│ │ └── firebase_provider.dart # Firebase init
│ ├── widgets/
│ │ ├── tutor_guidance_card.dart # Step-by-step tutor card with solutions
│ │ ├── translation_overlay.dart # Live translation subtitle card
│ │ ├── export_document_card.dart # PDF export card
│ │ ├── smart_action_card.dart # Expandable action cards
│ │ ├── support_topic_tracker.dart # Topic trail tracker
│ │ ├── live_wave.dart # Animated orb/wave visualizer
│ │ ├── audio_visualizer.dart # Waveform bars
│ │ ├── mode_selector.dart # Horizontal mode picker
│ │ ├── connection_indicator.dart # Connection status dot
│ │ ├── glassmorphic_card.dart # Frosted-glass card wrapper
│ │ └── session_tile.dart # Mode-colored archive tiles
│ └── screens/
│ ├── live_screen.dart # Camera + audio + mode overlays (721 LOC)
│ ├── home_screen.dart # Home tab with mode selector
│ ├── login_screen.dart # Google/Apple Sign-In
│ ├── settings_screen.dart # Settings
│ └── archive_screen.dart # Past sessions with mode filter
├── firebase/
│ ├── firestore.rules # Strict uid-based data isolation
│ ├── storage.rules # User-scoped media (10MB max)
│ └── firebase.json # Firebase project config
├── deploy.sh # Automated Cloud Run deployment script
├── architecture.mmd # Mermaid diagram source
├── architecture.png # Rendered architecture diagram
├── demo_storyboard.md # Demo video script
└── README.md # This file
Codebase: ~8,600 lines across 38 source files (34 Dart + 4 Python)
- Flutter 3.x SDK (install)
- Python 3.11+
- A Gemini API key from Google AI Studio
- A Firebase project with Auth + Firestore enabled
- Google Cloud SDK (
gcloud) installed
git clone https://github.com/Medialordofficial/Arqivon.git
cd Arqivon
cd backend
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
# Configure environment
cp .env.example .env
# Edit .env and add your GEMINI_API_KEY
# Run locally
uvicorn main:app --reload --port 8080

Verify: `curl http://localhost:8080/health` → `{"status": "ok"}`
# Option A: Use the automated deploy script
chmod +x deploy.sh
./deploy.sh
# Option B: Manual deployment
cd backend
gcloud run deploy arqivon-backend \
--source . \
--region us-central1 \
--allow-unauthenticated \
--set-secrets=GEMINI_API_KEY=GEMINI_API_KEY:latest \
--max-instances=10 \
--min-instances=1 \
--no-cpu-throttling \
--timeout=3600 \
--concurrency=100 \
--memory=512Mi \
--cpu=1 \
--project=arqivon-inc

cd app
flutter pub get
# Update WebSocket URL in lib/config/constants.dart
# to point to your Cloud Run service URL
# Run on connected device
flutter run
# Or build APK
flutter build apk --debug

# Install FlutterFire CLI
dart pub global activate flutterfire_cli
# Configure Firebase for your project
cd app && flutterfire configure

Deploy Firestore rules:
cd firebase
firebase deploy --only firestore:rules,storage

Our backend runs on Google Cloud Run at:
https://arqivon-backend-653546103163.us-central1.run.app
Evidence in code:
- `backend/Dockerfile` — Multi-stage Docker build for Cloud Run
- `backend/service.yaml` — Cloud Run Knative service configuration
- `app/lib/config/constants.dart` — WebSocket URL pointing to Cloud Run
- `deploy.sh` — Automated Cloud Run deployment script
Google Cloud services visible in code:
- `backend/main.py` — `google.genai` SDK for Gemini Live API, Firebase Admin SDK for auth + Firestore
- `backend/config.py` — Secret Manager integration for `GEMINI_API_KEY`
- `backend/tool_registry.py` — Firestore writes for memories, translations, solutions, exports
| Category | Tool | Purpose |
|---|---|---|
| Shared (all modes) | `analyze_live_frame` | Analyze camera frame via Gemini vision |
| | `upsert_firestore_memory` | Save persistent memory to Firestore |
| | `create_ui_action` | Render Smart Action Card in Flutter UI |
| Translator | `live_translate` | Real-time translation with subtitle overlay |
| | `detect_language` | Language detection from speech/text |
| | `translation_card` | Saveable translation flashcard |
| | `export_document` | Export translations as PDF |
| Tutor | `analyze_homework` | Analyze homework/diagram from camera |
| | `solve_problem` | Complete step-by-step solution with final answer |
| | `explain_concept` | Rich concept explanation with examples |
| | `provide_hint` | Contextual hint without full answer |
| | `grade_step` | Grade student's work step-by-step |
| | `tutor_card` | Render progress/guidance card |
| | `export_document` | Export solutions as PDF |
| Support | `switch_topic` | Track mid-conversation topic changes |
| | `escalate_case` | Escalate unresolvable cases |
| | `log_resolution` | Log resolution outcome + satisfaction |
| | `support_card` | Render contextual support card |
| | `export_document` | Export support notes as PDF |
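The registry's dispatch step — routing a Gemini function call to a handler and tagging the result with its outbound message type — can be sketched as a name→handler map. The handler bodies and the exact mapping here are illustrative, not the real `tool_registry.py` implementations:

```python
from typing import Any, Callable

# name -> (handler, outbound message type); handler bodies are stand-ins.
TOOL_HANDLERS: dict[str, tuple[Callable[..., Any], str]] = {
    "live_translate": (lambda text, target: {"translated": f"[{target}] {text}"}, "TRANSLATION"),
    "solve_problem":  (lambda problem: {"steps": [f"Analyze: {problem}"]}, "TUTOR_STEP"),
    "switch_topic":   (lambda topic: {"topic": topic}, "SUPPORT_TOPIC"),
}

def dispatch(name: str, args: dict) -> dict:
    """Route a Gemini function call to its handler and wrap the typed result.

    The backend sends this payload to the Flutter client and also returns
    the handler result to Gemini via send_tool_response().
    """
    handler, msg_type = TOOL_HANDLERS[name]
    return {"type": msg_type, "payload": handler(**args)}
```

Typed message envelopes let the Flutter side pick the right overlay widget (translation subtitle, tutor card, topic tracker) without inspecting payload contents.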
| Integration | License | Usage |
|---|---|---|
| Flutter SDK | BSD-3-Clause | Mobile app framework |
| Riverpod | MIT | State management |
| FastAPI | MIT | Python backend framework |
| `google-genai` | Apache-2.0 | Google Gemini API SDK |
| `firebase-admin` | Apache-2.0 | Firebase Admin SDK for Python |
| Firebase SDKs | Apache-2.0 | Auth, Firestore, Storage |
| `pdf` (Dart) | Apache-2.0 | PDF generation |
| `share_plus` | BSD-3-Clause | Native share sheet |
| `record` | MIT | Audio recording |
| `just_audio` | MIT | Audio playback |
| `camera` | BSD-3-Clause | Camera access |
| `glassmorphism` | MIT | Frosted glass UI effects |
All packages used under their respective open-source licenses.
- Mode-Aware Backend: FastAPI manages per-user WebSocket connections, each spawning a Gemini Live API session with mode-specific system prompts and tool declarations. Three concurrent asyncio coroutines handle client→Gemini, Gemini→client, and heartbeat. Mode/language switching triggers live session reconnection with the correct persona.
- Tool Registry: `tool_registry.py` declares 17 `FunctionDeclaration` objects across 4 categories. When Gemini invokes a tool, the backend dispatcher routes it to the correct handler, converts results to typed outbound messages (`TRANSLATION`, `TUTOR_STEP`, `SUPPORT_TOPIC`, `UI_ACTION`, `EXPORT`), and sends tool responses back to Gemini via `send_tool_response()`.
- Flutter State: `LiveSessionNotifier` (a Riverpod `AutoDisposeAsyncNotifier`) owns the full lifecycle: connect → set mode → stream audio/video → receive mode-specific responses → render overlay widgets → persist session on disconnect. The `TabIndexNotifier` ensures audio/video stops when navigating away.
- Mode-Specific UI: Each mode gets its own overlay widget: `TranslationOverlayWidget` (source→target with formality), `TutorGuidanceCard` (numbered solution steps, final answer box, concept examples), `SupportTopicTracker` (topic trail), `ExportDocumentCard` (PDF preview + share). The `ModeSelectorStrip` enables instant mode switching.
- PDF Export Pipeline: When Gemini calls `export_document`, the backend routes the content to the client as an `EXPORT` message. The Flutter `ExportService` generates a formatted PDF using the `pdf` package and opens the native share sheet via `share_plus`.
- Production Resilience: The WebSocket service implements exponential backoff with jitter (5 attempts). A backend heartbeat (12s interval) keeps Cloud Run connections alive. The audio recorder force-restarts on Android audio focus changes. A fresh `AudioPlayer` per AI turn prevents playback bugs.
- Gemini Live API SDK migration: The `session.send()` method was deprecated in `google-genai>=1.64.0`. Migrated to `send_realtime_input()` for audio/video, `send_client_content()` for text, and `send_tool_response()` for function call results.
- Android audio focus: The `record` plugin silently dies when another app steals audio focus. Fixed with a force-restart mechanism in `ensureRecording()`.
- Cloud Run WebSocket timeout: The default 5-minute timeout kills long conversations. Set to 3600s with `--no-cpu-throttling` and `--min-instances=1` for always-on behavior.
- Multimodal rate limiting: Sending camera frames at >3fps triggers Gemini rate limits. Settled on 2fps as the sweet spot for real-time vision without throttling.
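The 2fps frame cap can be enforced client-side with a tiny throttle that drops frames arriving faster than the target rate — a minimal sketch, not the app's actual Dart implementation:

```python
class FrameThrottle:
    """Drop camera frames above a target rate (2 fps per the tuning above)."""

    def __init__(self, fps: float = 2.0):
        self.min_interval = 1.0 / fps      # seconds between sent frames
        self.last_sent = float("-inf")

    def should_send(self, now: float) -> bool:
        """Return True if a frame captured at time `now` may be sent."""
        if now - self.last_sent >= self.min_interval:
            self.last_sent = now
            return True
        return False  # skip this frame to stay under the rate limit
```

Taking the clock as an argument keeps the throttle deterministic and easy to test; the capture loop would pass a monotonic timestamp for each frame.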
MIT
