
Deepgram Voice Agent

A real-time voice conversation agent built on Deepgram's Voice Agent API and OpenRouter LLMs. The FastAPI backend and browser frontend are designed around a streaming pipeline to keep end-to-end latency low.

Features

  • 🎙️ Real-time voice conversation with AI using Deepgram's Voice Agent API
  • 🤖 Multiple AI models from OpenRouter (Claude, GPT-4, Gemini, Llama, and more)
  • 🌐 Modern web interface with live transcript display
  • ⚡ Ultra-low latency streaming architecture
  • 🎨 Beautiful, responsive UI with real-time status indicators
  • 🔧 Configurable system prompts for custom agent behavior

Architecture

  • Backend: FastAPI with WebSocket support
  • Frontend: Vanilla JavaScript with Web Audio API
  • STT: Deepgram Nova-3
  • LLM: OpenRouter (configurable models)
  • TTS: Deepgram Aura-2
  • Audio Pipeline: Browser microphone → WebSocket → Deepgram Agent → WebSocket → Browser speakers
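The listen/think/speak stages above are configured in a single settings payload sent to Deepgram when the agent session starts. As a rough sketch (the field names below are illustrative assumptions, not the exact Deepgram Voice Agent schema — see Deepgram's docs and src/voice_agent.py for the real payload):

```python
# Illustrative agent settings sketch. Field names are assumptions,
# not the exact Deepgram Voice Agent schema -- check Deepgram's docs.
AGENT_SETTINGS = {
    "audio": {
        "input": {"encoding": "linear16", "sample_rate": 16000},
        "output": {"encoding": "linear16", "sample_rate": 24000},
    },
    "agent": {
        "listen": {"model": "nova-3"},                       # STT stage
        "think": {
            "provider": "open_router",                       # LLM stage
            "model": "anthropic/claude-3.5-sonnet",
        },
        "speak": {"model": "aura-2"},                        # TTS stage
    },
}
```

The key design point is that STT, LLM, and TTS are all orchestrated server-side by Deepgram; the backend only has to forward audio in both directions.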

Prerequisites

  • Python 3 (see pyproject.toml for the required version)
  • uv for dependency management
  • A Deepgram API key
  • An OpenRouter API key
  • A modern browser with microphone access

Installation

  1. Clone the repository (if you haven't already)

  2. Install dependencies using uv:

    uv sync
  3. Set up environment variables:

    Copy sample.env to .env and fill in your API keys:

    cp sample.env .env

    Edit .env with your actual keys:

    DEEPGRAM_API_KEY=your_deepgram_api_key_here
    OPENROUTER_API_KEY=your_openrouter_api_key_here
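If you're curious what loading these values looks like, here is a minimal stdlib-only sketch of parsing a .env file; the project itself may rely on python-dotenv or pydantic settings instead, and parse_env/load_env are hypothetical helpers:

```python
import os

def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env

def load_env(path: str = ".env") -> None:
    """Merge parsed values into os.environ without overwriting existing ones."""
    with open(path) as f:
        for key, value in parse_env(f.read()).items():
            os.environ.setdefault(key, value)
```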

Usage

  1. Start the server:

    uv run python -m src.main

    Or activate the virtual environment first:

    source .venv/bin/activate
    python -m src.main
  2. Open your browser:

    Navigate to http://localhost:8000

  3. Select a model from the dropdown menu

  4. Click "Connect & Start" and allow microphone access

  5. Start talking! The agent will transcribe your speech, process it with the selected LLM, and speak back to you

Available Models

The application supports various OpenRouter models including:

  • Anthropic: Claude 3.5 Sonnet, Claude 3 Opus
  • OpenAI: GPT-4o, GPT-4o Mini, GPT-4 Turbo
  • Google: Gemini 2.0 Flash, Gemini Pro 1.5
  • Meta: Llama 3.3 70B, Llama 3.1 405B
  • X.AI: Grok 2
  • Mistral: Mistral Large
  • Cohere: Command R+

Project Structure

deepgram_voice_agent/
├── src/
│   ├── main.py           # FastAPI application with WebSocket endpoints
│   ├── config.py         # Configuration and settings
│   ├── voice_agent.py    # Deepgram Voice Agent wrapper
│   └── static/
│       ├── index.html    # Frontend UI
│       └── app.js        # Real-time audio handling and WebSocket client
├── tests/
│   └── test_main.py      # Tests
├── .env                  # Your API keys (not in git)
├── sample.env            # Example environment file
├── pyproject.toml        # Project dependencies
└── README.md            # This file

How It Works

  1. Browser captures audio from your microphone using the Web Audio API
  2. Audio is streamed as PCM 16-bit data over WebSocket to the FastAPI backend
  3. Backend forwards audio to Deepgram's Voice Agent API
  4. Deepgram processes the audio pipeline:
    • Listen (STT): Transcribes your speech using Nova-3
    • Think (LLM): Processes with your selected OpenRouter model
    • Speak (TTS): Generates speech using Aura-2
  5. Agent's audio is streamed back through WebSocket
  6. Browser plays the audio in real-time
  7. Transcripts are displayed live on the screen
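In steps 2-5 the backend acts as a bidirectional relay: one task pumps microphone audio from the browser socket to Deepgram, while another pumps the agent's audio back. A simplified stdlib-only sketch, with asyncio queues standing in for the two WebSocket connections (the real code in src/main.py uses actual WebSockets):

```python
import asyncio

async def pump(src: asyncio.Queue, dst: asyncio.Queue) -> None:
    """Forward audio chunks from src to dst until a None sentinel arrives."""
    while (chunk := await src.get()) is not None:
        await dst.put(chunk)
    await dst.put(None)  # propagate end-of-stream downstream

async def relay(browser_in, deepgram_in, deepgram_out, browser_out) -> None:
    """Run both directions concurrently: mic audio up, agent audio down."""
    await asyncio.gather(
        pump(browser_in, deepgram_in),    # browser mic -> Deepgram agent
        pump(deepgram_out, browser_out),  # agent TTS   -> browser speakers
    )
```

Because the two directions run concurrently, the agent can start speaking while the user's next utterance is still being captured, which is what keeps the conversation feeling real-time.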

API Endpoints

REST Endpoints

  • GET / - Serve the web interface
  • GET /api/models - Get available OpenRouter models
  • GET /api/health - Health check endpoint

WebSocket

  • WS /ws/agent - Main WebSocket connection for voice agent communication

Configuration

Edit src/config.py to customize:

  • Default models
  • Audio settings (sample rate, encoding)
  • System prompts
  • Available models list
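A hedged sketch of what such a config module might look like; the names below (Settings, the defaults, the model list) are illustrative assumptions, not the actual contents of src/config.py:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    # Illustrative defaults -- the real values live in src/config.py.
    default_model: str = "anthropic/claude-3.5-sonnet"
    sample_rate: int = 16000          # input audio sample rate (Hz)
    encoding: str = "linear16"        # PCM 16-bit little-endian
    system_prompt: str = "You are a helpful voice assistant. Keep replies brief."
    available_models: tuple[str, ...] = (
        "anthropic/claude-3.5-sonnet",
        "openai/gpt-4o",
        "google/gemini-2.0-flash-001",
    )

settings = Settings()
```

Keeping the settings in a frozen dataclass makes them easy to import anywhere in the app while preventing accidental mutation at runtime.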

Development

Run in development mode with auto-reload:

uv run uvicorn src.main:app --reload --port 8000

Future Enhancements

  • Function calling / MCP tools integration
  • Session persistence and conversation history
  • Multiple TTS voice options
  • Voice activity detection (VAD) controls
  • Recording and export capabilities
  • Multi-language support

Troubleshooting

No audio from agent

  • Check your browser's speaker/audio output settings
  • Look for errors in the browser console
  • Ensure your Deepgram API key has TTS credits

Microphone not working

  • Grant microphone permissions in your browser
  • Check browser console for getUserMedia errors
  • Try using HTTPS (some browsers require it)

Connection errors

  • Verify your API keys in .env
  • Check that the server is running on the correct port
  • Look at server logs for detailed error messages

License

MIT

Credits

Built with FastAPI, Deepgram's Voice Agent API, and OpenRouter.