A real-time, voice-first conversational AI agent using Google Gemini, LiveKit, and a local RAG module for grounded, low-latency responses.
This repository contains a voice-first conversational AI agent that uses Google's Gemini Live API for real-time speech-to-text, language understanding, and text-to-speech, with LiveKit handling the low-latency audio transport over WebRTC. A local RAG (Retrieval-Augmented Generation) module grounds the agent's responses in a specific knowledge base, so its answers draw on the provided documentation.
For a detailed explanation of the RAG implementation, see RAG_DOCUMENTATION.md.
The system is composed of three main parts that run concurrently:
- React Frontend (`my-voice-app/`): A simple web interface that captures microphone audio and streams it to LiveKit. It also plays back the audio stream received from the agent.
- Token Server (`token_server.py`): A lightweight FastAPI server that issues JWTs (JSON Web Tokens) to the frontend, authorizing it to connect to a specific LiveKit room.
- Voice Agent (`agent.py`): A Python worker that connects to the same LiveKit room. It receives the audio stream, forwards it to the Gemini Live API, and executes tools (like RAG lookups) when requested by the model.
graph TD
subgraph Browser
A[React UI]
end
subgraph Local Services
B[Token Server @ FastAPI]
C[Voice Agent @ Python]
end
subgraph Cloud Services
D[LiveKit Cloud]
E[Google Gemini Live API]
end
subgraph Data
F[RAG Module @ FAISS]
G[ecommerce.json]
end
A --"1. GET /getToken"--> B
B --"2. Returns JWT"--> A
A --"3. Connect w/ JWT"--> D
C --"4. Connect w/ API Key"--> D
D --"5. Bridge Audio Stream"--> C
C --"6. Stream Audio"--> E
E --"7. Request Tool Call"--> C
C --"8. lookup_company_info()"--> F
F --"9. Search"--> G
F --"10. Return Context"--> C
C --"11. Send Context"--> E
E --"12. Stream Audio Response"--> C
C --"13. Stream Audio"--> D
D --"14. Stream to Browser"--> A
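
Steps 7 to 11 of the diagram (the model requests a tool, the agent runs the lookup, and the context goes back to the model) can be sketched roughly as follows. This is a simplified illustration, not the code in `agent.py` or `rag.py`. The in-memory `KNOWLEDGE_BASE`, the `handle_tool_call` helper, and the signature of `lookup_company_info` are all assumptions, and plain word overlap stands in for the FAISS vector search used by the real RAG module.

```python
from typing import Callable, Dict, List, Set

# Hypothetical in-memory stand-in for the contents of data/ecommerce.json.
KNOWLEDGE_BASE: List[str] = [
    "Orders ship within two business days.",
    "Returns are accepted within 30 days of delivery.",
    "Support is available by email around the clock.",
]


def _tokens(text: str) -> Set[str]:
    """Lowercase words with surrounding punctuation stripped."""
    words = (w.strip(".,?!").lower() for w in text.split())
    return {w for w in words if w}


def lookup_company_info(query: str, top_k: int = 1) -> List[str]:
    """Rank passages by word overlap with the query.

    Word overlap is a toy substitute for embedding similarity search.
    """
    q = _tokens(query)
    ranked = sorted(KNOWLEDGE_BASE,
                    key=lambda passage: len(q & _tokens(passage)),
                    reverse=True)
    return ranked[:top_k]


# Registry mapping tool names (as the model would request them) to functions.
TOOLS: Dict[str, Callable[..., List[str]]] = {
    "lookup_company_info": lookup_company_info,
}


def handle_tool_call(name: str, arguments: dict) -> dict:
    """Run the requested tool and wrap its result for return to the model."""
    if name not in TOOLS:
        return {"error": f"unknown tool: {name}"}
    return {"result": TOOLS[name](**arguments)}
```

For example, `handle_tool_call("lookup_company_info", {"query": "How many days do I have for returns?"})` selects the returns passage, which the agent would then send back to the model as grounding context.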
- Python 3.10+
- Node.js 18+ and npm 9+
- A Google AI Studio API key.
- A LiveKit Cloud project.
- Clone the repository:

      git clone https://github.com/Youssef-Ashraf-Dev/Voice-Agent.git
      cd Voice-Agent
- Install dependencies:

      # Set up Python virtual environment
      python -m venv .venv
      # Activate it (PowerShell; on macOS/Linux use: source .venv/bin/activate)
      .\.venv\Scripts\Activate.ps1
      # Install Python packages
      pip install -r requirements.txt
      # Install frontend packages
      cd my-voice-app
      npm install
      cd ..
- Configure environment variables: Create a file named `.env` in the root of the project directory and add your credentials. This file is ignored by Git.

      # Get this from Google AI Studio
      GOOGLE_API_KEY=AI...
      # Get these from your LiveKit Cloud project settings
      LIVEKIT_URL=wss://<your-project-name>.livekit.cloud
      LIVEKIT_API_KEY=API...
      LIVEKIT_API_SECRET=...
- Generate RAG embeddings: The first time you run the agent, it automatically generates and caches the embeddings for the knowledge base (`data/ecommerce.json`). You can also pre-generate them with:

      python -c "import rag; rag.get_stats()"

  If you modify `data/ecommerce.json`, you must delete the `embeddings_cache/` directory or run `python -c "import rag; rag.rebuild_cache()"` to force a regeneration.
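
Because the embeddings cache has to be refreshed by hand after the knowledge base changes, it can be useful to detect a stale cache before starting the agent. The sketch below shows one possible approach and is not part of the `rag` module as described here. It records a SHA-256 digest of the data file in a hypothetical marker file (`source.sha256`) inside the cache directory and compares it on the next run; the helper names are likewise assumptions.

```python
import hashlib
from pathlib import Path

MARKER_NAME = "source.sha256"  # hypothetical marker file, not created by the project


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def cache_is_stale(data_file: Path, cache_dir: Path) -> bool:
    """True if no digest was recorded or the data file changed since then."""
    marker = cache_dir / MARKER_NAME
    if not marker.exists():
        return True
    return marker.read_text().strip() != file_digest(data_file)


def record_digest(data_file: Path, cache_dir: Path) -> None:
    """Store the current digest; call this right after rebuilding the cache."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    (cache_dir / MARKER_NAME).write_text(file_digest(data_file))
```

A wrapper script could call `cache_is_stale(Path("data/ecommerce.json"), Path("embeddings_cache"))` and, when it returns `True`, run the rebuild command above followed by `record_digest(...)`.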
The system requires three separate terminal sessions to run correctly.
| Terminal | Command | Purpose |
|---|---|---|
| 1 | `python token_server.py` | Serves the LiveKit authentication token. |
| 2 | `python agent.py dev` | Runs the voice agent worker. |
| 3 | `cd my-voice-app; npm run dev` | Starts the frontend development server. |
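
Before launching the three processes, a quick pre-flight check can catch missing credentials early. The sketch below is a hypothetical helper, not a script shipped with the repository. It inspects the process environment, so it assumes the values from `.env` have already been loaded (how the project loads them is not covered here), and the `wss://` or `ws://` prefix check is only a rough sanity test.

```python
import os
from typing import List, Mapping

# Variable names taken from the .env template in the setup steps.
REQUIRED_VARS = (
    "GOOGLE_API_KEY",
    "LIVEKIT_URL",
    "LIVEKIT_API_KEY",
    "LIVEKIT_API_SECRET",
)


def missing_env_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variables that are unset or blank."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


def looks_like_websocket_url(url: str) -> bool:
    """Rough check that the LiveKit URL uses a WebSocket scheme."""
    return url.startswith(("wss://", "ws://"))


if __name__ == "__main__":
    missing = missing_env_vars()
    if missing:
        raise SystemExit("Missing environment variables: " + ", ".join(missing))
    if not looks_like_websocket_url(os.environ["LIVEKIT_URL"]):
        raise SystemExit("LIVEKIT_URL should start with wss:// or ws://")
    print("Environment looks complete.")
```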
Once all three processes are running:
- Open your browser to `http://localhost:5173`.
- Click the "Start Voice Chat" button.
- Allow microphone access when prompted.
- Start speaking. The agent will listen and respond.