
SARVIK - Complete Project Documentation

Overview

SARVIK (Smart Assistant for Real-time Voice Interaction and Knowledge) is an advanced AI personal assistant developed by Ganpat University students: Karan, Krish, and Vaibhavi. The project implements a sophisticated microservices architecture combining voice processing, natural language understanding, text-to-speech synthesis, persistent context management, and real-world function calling capabilities.

🆕 Key Capabilities

  • Voice & Text Interaction: Natural conversation with real-time audio visualization
  • Dual LLM Providers: Switch between local Qwen3-4B (GPU) or cloud Groq API (Llama 3.3 70B)
  • Function Calling: 9+ integrated tools for real-world actions (weather, Gmail, Drive, Calendar)
  • Google Services Integration: Connect Gmail, Drive, and Calendar with OAuth
  • Context-Aware AI: Semantic search across conversation history for relevant responses
  • Voice Biometrics: Secure voice enrollment and verification

▶ Demo

Live Web Preview
Interactive web demo to explore UI & features (Frontend Only)

Watch Demo Video
Complete system showcase with backend, function calling & AI responses

πŸ—οΈ Architecture Overview

SARVIK follows a microservices architecture with 4 main components:

┌─────────────────────────────────────────────────────────────────────┐
│                         SARVIK ECOSYSTEM                            │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌──────────────────┐                                               │
│  │  MYAI-DESKTOP    │  ← Electron + React Frontend                  │
│  │  (Port 3000)     │     • Voice Input/Output                      │
│  └────────┬─────────┘     • Chat Interface                          │
│           │               • Audio Visualization                     │
│           ↓                                                         │
│  ┌──────────────────┐                                               │
│  │  MYAI-BACKEND    │  ← FastAPI Backend + AI Services              │
│  │  (Port 8000)     │     • Authentication                         │
│  └────────┬─────────┘     • Voice Processing (Whisper, SpeechBrain) │
│           │               • Context Management                      │
│           │               • Conversation Storage                    │
│           │                                                         │
│       ┌───┴────┬──────────────────┐                                 │
│       ↓        ↓                  ↓                                 │
│  ┌─────────┐ ┌─────────┐    ┌──────────┐                            │
│  │ LLM     │ │  TTS    │    │ DATABASES│                            │
│  │PROVIDERS│ │ SERVER  │    │          │                            │
│  │(8001)   │ │(8002)   │    │PostgreSQL│                            │
│  │         │ │         │    │  Redis   │                            │
│  │Qwen3-4B │ │ Piper   │    │  Qdrant  │                            │
│  │Groq API │ │         │    │          │                            │
│  └─────────┘ └─────────┘    └──────────┘                            │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘

📦 Components Summary

1. myai-desktop (Electron + React)

  • Purpose: User-facing desktop application
  • Port: 3000
  • Key Features:
    • Voice interaction with real-time audio visualization
    • Text-based chat interface
    • Voice enrollment and authentication
    • Conversation history management
    • Device audio management
  • Technology Stack: Electron, React, Styled Components, Three.js
  • Documentation: MYAI_DESKTOP.md

2. myai-backend (FastAPI Backend)

  • Purpose: Core API server and AI orchestration
  • Port: 8000
  • Key Features:
    • User authentication (Google OAuth + JWT)
    • Voice processing (Whisper ASR + SpeechBrain)
    • Context management (embedding-based semantic search)
    • Conversation storage (PostgreSQL + Qdrant)
    • LLM and TTS orchestration
    • Function calling orchestrator (detects and executes tool calls)
    • Google Services OAuth (Gmail, Drive, Calendar integration)
    • 9+ Built-in Tools (weather, email, calendar, file management)
  • Technology Stack: FastAPI, SQLAlchemy, Whisper, SpeechBrain, Sentence Transformers, Google APIs
  • Documentation: MYAI_BACKEND.md

3. llm-server (LLM Microservice)

  • Purpose: Dedicated LLM inference server with dual provider support
  • Port: 8001
  • Key Features:
    • Dual LLM Providers: Server-Hosted (Qwen3-4B GPU) or Cloud (Groq API)
    • Runtime Switching: Change providers without restart
    • Smart Routing: Automatic provider selection based on user preference
    • Streaming token generation (20-350 tokens/sec)
    • Concurrent request handling
    • Separate system prompts for voice/text modes
  • Technology Stack: FastAPI, llama-cpp-python, CUDA, Groq SDK
  • Documentation: 3_LLM_SERVER.md (includes provider integration details)

4. tts-server (TTS Microservice)

  • Purpose: Text-to-speech synthesis server
  • Port: 8002
  • Key Features:
    • Real-time audio synthesis using Piper TTS
    • Multiple voice support (lessac, sarah, alba, amy, david)
    • WebSocket-based audio streaming
    • Parallel sentence synthesis with sequencing
  • Technology Stack: FastAPI, Piper TTS, ONNX Runtime
  • Documentation: TTS_SERVER.md

🔄 Complete System Workflow

Voice Mode Query Flow

1. USER SPEAKS
   ↓
2. DESKTOP: Audio Recording
   - globalAudioManager.js captures audio
   - Sends to backend via /api/voice/process
   ↓
3. BACKEND: Voice Processing
   - Whisper transcribes audio → text
   - SpeechBrain verifies user identity
   - Stores user query in PostgreSQL
   - Generates 768D embedding via Sentence Transformers
   ↓
4. BACKEND: Context Building
   - Retrieves recent conversations (PostgreSQL)
   - Semantic search in Qdrant (768D vectors)
   - Combines context for LLM
   ↓
5. LLM PROVIDER SELECTION:
   - Backend checks user.llm_provider setting
   - Routes to Server-Hosted (Qwen3-4B) or Groq API (cloud LLM)
   
6. LLM GENERATION: Response Generation
   - Server-Hosted: POST /generate-stream (Qwen3-4B on GPU)
   - Groq API: client.chat.completions.create() (Cloud LLM)
   - Generates streaming response (20-350 tok/sec)
   - Returns tokens via SSE
   ↓
7. BACKEND: Parallel Processing
   - Buffers tokens into sentences
   - Sends sentences to TTS server
   - Streams tokens to desktop
   ↓
8. TTS SERVER: Audio Synthesis
   - Converts sentences to speech (Piper)
   - Broadcasts WAV audio via WebSocket
   ↓
9. DESKTOP: Playback
   - Receives text tokens (displays in chat)
   - Receives audio chunks (plays sequentially)
   - Updates conversation UI
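The parallel-processing step (7) can be sketched as a token stream that is forwarded immediately for display while a buffer accumulates complete sentences for TTS. This is an illustrative reimplementation, not the backend's actual code; the sentence-boundary regex is an assumption:

```python
import re

SENTENCE_END = re.compile(r'([.!?])\s*$')

def buffer_tokens(tokens):
    """Yield (kind, payload) events: every token for display,
    plus a (sequence, sentence) pair whenever a sentence completes."""
    buf = ""
    seq = 0
    for tok in tokens:
        buf += tok
        yield ("token", tok)              # streamed to the desktop as-is
        if SENTENCE_END.search(buf):      # sentence boundary reached
            seq += 1
            yield ("sentence", (seq, buf.strip()))
            buf = ""
    if buf.strip():                       # flush any trailing fragment
        seq += 1
        yield ("sentence", (seq, buf.strip()))

events = list(buffer_tokens(["Hi", " there", ".", " How", " are", " you", "?"]))
sentences = [p for k, p in events if k == "sentence"]
# sentences → [(1, "Hi there."), (2, "How are you?")]
```

The sequence number attached to each sentence is what later lets the desktop play parallel-synthesized audio chunks in order.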

Text Mode Query Flow

1. USER TYPES
   ↓
2. DESKTOP: Text Submission
   - Sends query to /api/query/process-llm-stream
   ↓
3. BACKEND: Context + LLM Provider Selection
   - Stores query in PostgreSQL
   - Builds context (recent + semantic)
   - Formats prompt with TEXT MODE system prompt
   - Checks user.llm_provider setting
   ↓
4. LLM GENERATION: Text Generation
   - Server-Hosted: Qwen3-4B on GPU (~35 tok/sec)
   - Groq API: Llama 3.3 70B cloud (~250 tok/sec)
   - Generates detailed response
   - Streams tokens back
   ↓
5. DESKTOP: Rendering
   - Markdown rendering with syntax highlighting
   - Code blocks, lists, formatting
   - Real-time token display

🗄️ Database Architecture

PostgreSQL (Relational Data)

Tables:

  • users - User accounts (Google OAuth)
    • llm_provider - Current provider ('server_hosted' or 'groq_api')
    • groq_api_key_encrypted - Fernet-encrypted API key
    • groq_api_key_expires_at - 30-day auto-expiry
    • groq_model - Selected Groq model
  • voice_profiles - Encrypted voice embeddings
  • conversations - All user conversations (user + assistant messages)
  • settings - User preferences (timezone, voice preference, etc.)
  • service_connections - OAuth connections for external services
    • service_name - Service type ('gmail', 'drive', 'calendar')
    • access_token - Encrypted OAuth access token
    • refresh_token - Encrypted OAuth refresh token
    • token_expires_at - Token expiry timestamp
    • scopes - Granted OAuth permissions
    • service_account_email - Google account used for this connection
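The service_connections shape can be tried in isolation. The sketch below is illustrative only (the backend uses SQLAlchemy models against PostgreSQL); it uses stdlib sqlite3 so the columns listed above can be exercised without a database server:

```python
import sqlite3

# Illustrative stand-in for the service_connections table described above.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE service_connections (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL,
        service_name TEXT NOT NULL,        -- 'gmail', 'drive', 'calendar'
        access_token TEXT NOT NULL,        -- stored encrypted in production
        refresh_token TEXT,
        token_expires_at TEXT,
        scopes TEXT,
        service_account_email TEXT,
        UNIQUE (user_id, service_name)     -- one connection per service per user
    )
""")
conn.execute(
    "INSERT INTO service_connections "
    "(user_id, service_name, access_token, service_account_email) "
    "VALUES (?, ?, ?, ?)",
    (1, "gmail", "<encrypted>", "user@gmail.com"),
)
row = conn.execute(
    "SELECT service_name, service_account_email FROM service_connections "
    "WHERE user_id = ?", (1,)
).fetchone()
# row → ("gmail", "user@gmail.com")
```

The UNIQUE constraint is an assumption that mirrors the "one OAuth connection per service per user" workflow described later.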

Redis (Session & Cache)

Usage:

  • Voice enrollment sessions (temporary)
  • Session tokens
  • Rate limiting
  • Caching

Qdrant (Vector Database)

Collections (Per-User):

  • conversations_{user_id} - 768D embeddings of conversations for semantic search

Why Separate Collections?

  • Data isolation per user
  • Privacy and security
  • Optimized search performance
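The per-user semantic search that Qdrant performs can be illustrated in pure Python: rank stored conversation embeddings by cosine similarity to the query embedding and keep the top matches. The tiny 3-D vectors here are stand-ins for the real 768-D ones, and the data is invented for the example:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def semantic_search(query_vec, collection, top_k=5):
    """collection: list of (text, embedding) pairs from one user's collection."""
    scored = [(cosine(query_vec, vec), text) for text, vec in collection]
    scored.sort(reverse=True)
    return [text for _, text in scored[:top_k]]

conversations_user_42 = [          # i.e. the conversations_{user_id} collection
    ("We discussed the weather API", [0.9, 0.1, 0.0]),
    ("Notes about calendar events",  [0.0, 0.2, 0.9]),
    ("More weather follow-up",       [0.8, 0.2, 0.1]),
]
hits = semantic_search([1.0, 0.0, 0.0], conversations_user_42, top_k=2)
# hits → ["We discussed the weather API", "More weather follow-up"]
```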

🔐 Security & Authentication

Authentication Flow

1. Desktop → Google OAuth Login
2. Backend validates with Google
3. Issues JWT token (10080 min expiry, i.e. 7 days)
4. Token stored in localStorage
5. All API calls include: Authorization: Bearer {token}
6. Backend verifies JWT on every request
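The issue/verify cycle above can be sketched with stdlib primitives. The backend presumably uses a JWT library; this HS256-style stand-in just shows the shape of signing, signature checking, and the 10080-minute expiry (SECRET is a placeholder):

```python
import base64, hashlib, hmac, json, time

SECRET = b"your-secret-key"   # placeholder, matching the .env example below

def b64url(data: bytes) -> bytes:
    return base64.urlsafe_b64encode(data).rstrip(b"=")

def issue_token(user_id: str, minutes: int = 10080) -> str:
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps(
        {"sub": user_id, "exp": int(time.time()) + minutes * 60}).encode())
    sig = b64url(hmac.new(SECRET, header + b"." + payload, hashlib.sha256).digest())
    return (header + b"." + payload + b"." + sig).decode()

def verify_token(token: str) -> dict:
    header, payload, sig = token.encode().split(b".")
    expected = b64url(hmac.new(SECRET, header + b"." + payload, hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        raise ValueError("bad signature")
    claims = json.loads(base64.urlsafe_b64decode(payload + b"=" * (-len(payload) % 4)))
    if claims["exp"] < time.time():
        raise ValueError("token expired")
    return claims

claims = verify_token(issue_token("user-123"))
# claims["sub"] → "user-123"
```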

Voice Authentication

1. Enrollment (3 phrases)
   - Extract voice embeddings (SpeechBrain)
   - Create centroid + variance
   - Encrypt and store in PostgreSQL
   
2. Verification
   - User says "Hey SARVIK"
   - Extract embedding
   - Compare with stored profile (cosine similarity)
   - Threshold: 0.70 (configurable)
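A minimal sketch of the enrollment/verification math, assuming the centroid is a simple mean of the phrase embeddings and acceptance is cosine similarity against the 0.70 threshold (real embeddings are 512-D SpeechBrain vectors; these 2-D ones are illustrative):

```python
import math

THRESHOLD = 0.70

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def enroll(phrase_embeddings):
    """Average the enrollment phrases into a centroid voice profile."""
    dims = len(phrase_embeddings[0])
    return [sum(e[i] for e in phrase_embeddings) / len(phrase_embeddings)
            for i in range(dims)]

def verify(embedding, centroid, threshold=THRESHOLD):
    return cosine(embedding, centroid) >= threshold

centroid = enroll([[1.0, 0.0], [0.9, 0.1], [1.0, 0.1]])   # 3 enrollment phrases
accepted = verify([0.95, 0.05], centroid)   # similar voice → accepted
rejected = verify([0.0, 1.0], centroid)     # dissimilar voice → rejected
```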

🌐 API Communication Map

Desktop → Backend

Authentication & User:

  • POST /api/auth/google - Google authentication
  • DELETE /api/auth/account - Delete user account

Voice Processing:

  • POST /api/voice/process - Voice transcription
  • POST /api/voice/enrollment/start - Start voice enrollment
  • POST /api/voice/enrollment/record-phrase - Record enrollment phrase
  • POST /api/voice/enrollment/complete - Complete enrollment
  • POST /api/voice/verify - Verify voice
  • PUT /api/voice-settings/preference - Update voice preference

Query Processing (with Function Calling):

  • POST /api/query/process-llm-stream - Text mode query (supports function calling)
  • POST /api/query/voice-stream-with-audio - Voice mode query (supports function calling)

Conversations:

  • GET /api/conversations/history - Get conversation history
  • DELETE /api/conversations/{id} - Delete conversation

LLM Provider Settings:

  • GET /api/llm-settings - Get current LLM provider settings
  • PUT /api/llm-settings/provider - Switch LLM provider (server_hosted/groq_api)
  • POST /api/llm-settings/groq-key - Save encrypted Groq API key
  • PUT /api/llm-settings/groq-model - Update Groq model selection
  • POST /api/llm-settings/test-connection - Test Groq API connection
  • DELETE /api/llm-settings/groq-key - Delete Groq API key

Service OAuth (Gmail, Drive, Calendar):

  • POST /oauth/connect/{service} - Initiate OAuth flow for service
  • GET /oauth/callback - OAuth callback handler
  • POST /oauth/disconnect/{service} - Disconnect service
  • GET /oauth/status - Get all service connection statuses

Backend → LLM Providers

Server-Hosted (localhost:8001):

  • GET /health - Check model status
  • POST /generate - Non-streaming generation (not currently used)
  • POST /generate-stream - Streaming generation (SSE) ← USED

Groq API (groq.com):

  • Via Groq SDK: client.chat.completions.create(stream=True) ← USED
  • Automatic format conversion: Qwen3 → OpenAI messages

Backend → TTS Server

  • GET /health - Check TTS status
  • GET /voices - List available voices
  • POST /synthesize-sentence - Synthesize sentence with voice and sequence
  • POST /flush - Flush remaining buffer

Desktop → TTS Server (WebSocket)

  • ws://localhost:8002/ws/audio-stream - Audio chunk streaming

🔧 Function Calling System (v2.2.0)

Overview

SARVIK now has real-world action capabilities through an integrated function calling system. The LLM can automatically detect when it needs external data or actions, execute tools, and use the results to provide informed responses.

Available Tools (9+)

Weather Tools:

  • get_weather - Get current weather for any location (with auto IP-based geolocation)

Gmail Tools:

  • gmail_read_emails - Read recent emails
  • gmail_search_emails - Search emails by query
  • gmail_send_email - Send emails

Google Drive Tools:

  • drive_list_files - List files in Drive
  • drive_search_files - Search Drive by name
  • drive_create_folder - Create folders

Google Calendar Tools:

  • calendar_list_events - List upcoming events
  • calendar_search_events - Search calendar events
  • calendar_create_event - Create new events

Function Calling Workflow

1. USER QUERY: "What's the weather in Gandhinagar?"
   ↓
2. BACKEND: Sends query to LLM with tool schemas
   ↓
3. LLM: Detects need for weather data
   → Outputs: <TOOL_CALL>call-123|get_weather|{"city": "Gandhinagar"}</TOOL_CALL>
   ↓
4. FUNCTION CALLING ORCHESTRATOR:
   - Detects tool call marker in stream
   - Parses: tool_name="get_weather", args={"city": "Gandhinagar"}
   - Executes tool via registry
   ↓
5. WEATHER TOOL:
   - Calls OpenWeatherMap API
   - Returns: {"temperature": 28, "condition": "Clear", ...}
   ↓
6. ORCHESTRATOR:
   - Injects result into context as "[TOOL RESULT]"
   - Sends updated prompt back to LLM
   ↓
7. LLM: Generates natural response
   → "The weather in Gandhinagar is clear with a temperature of 28°C."
   ↓
8. USER receives informed response
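The detection step (4) can be sketched as a small parser. This is a hedged reimplementation, assuming only the <TOOL_CALL>id|name|json-args</TOOL_CALL> marker format shown above; the orchestrator's actual code is not reproduced here:

```python
import json, re

# Matches the marker format: <TOOL_CALL>call-id|tool_name|{json args}</TOOL_CALL>
TOOL_CALL = re.compile(r"<TOOL_CALL>(.*?)\|(.*?)\|(.*?)</TOOL_CALL>", re.DOTALL)

def parse_tool_calls(text):
    """Return (call_id, tool_name, args) for each marker in the LLM output."""
    calls = []
    for call_id, name, raw_args in TOOL_CALL.findall(text):
        calls.append((call_id, name, json.loads(raw_args)))
    return calls

output = 'Let me check. <TOOL_CALL>call-123|get_weather|{"city": "Gandhinagar"}</TOOL_CALL>'
calls = parse_tool_calls(output)
# calls → [("call-123", "get_weather", {"city": "Gandhinagar"})]
```

Each parsed call would then be dispatched through the tool registry, and its result injected back into the prompt as "[TOOL RESULT]".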

Key Features

  • Automatic Detection: LLM decides when tools are needed
  • Parallel Execution: Multiple tool calls in complex queries
  • Error Handling: Graceful fallback if tools fail
  • Context Injection: Tool results seamlessly integrated
  • Security: OAuth-based authentication for Google services
  • IP Geolocation: Auto-detect user location for weather

Google Services Integration

Separate OAuth Flow:

  • Users connect Gmail/Drive/Calendar separately from SARVIK login
  • Can use different Google account than SARVIK login
  • Encrypted token storage with Fernet
  • Automatic token refresh before expiry

Connection Workflow:

1. User opens Settings → Service Connections
2. Clicks "Connect Gmail"
3. Backend generates OAuth URL
4. Opens Google authorization in browser
5. User grants permissions
6. OAuth callback stores encrypted tokens
7. Tools can now access Gmail data

Supported Scopes:

  • Gmail: Read, search, and send emails
  • Drive: List, search, and create files/folders
  • Calendar: Read, search, and create events

📊 Data Flow: Voice Query Example

Desktop (User speaks "What's the weather?")
   │
   ├─ Records audio (globalAudioManager)
   │
   └─ POST /api/voice/process (FormData: audio.webm)
        │
        ↓
Backend (myai-backend)
   │
   ├─ Whisper transcribes → "What's the weather?"
   ├─ Stores in PostgreSQL (conversations table)
   ├─ Generates 768D embedding (Sentence Transformers)
   ├─ Searches Qdrant for semantic matches
   ├─ Retrieves recent conversations (PostgreSQL)
   │
   └─ POST to LLM Server /generate-stream
        │  Payload: {
        │    "prompt": "<|im_start|>system\n{VOICE_SYSTEM_PROMPT}<|im_end|>...",
        │    "max_tokens": 512,
        │    "temperature": 0.7
        │  }
        ↓
LLM Server (llm-server)
   │
   ├─ Qwen3-4B processes prompt
   ├─ Generates tokens: ["I", " don't", " have", " real-time", ...]
   │
   └─ Streams via SSE: data: {"token": "I"}\n\n
        │
        ↓
Backend (Receives tokens, parallel processing)
   │
   ├─ Streams tokens to Desktop (SSE)
   │    └─ Desktop displays: "I don't have real-time..."
   │
   └─ Buffers into sentences
        │  "I don't have real-time internet access."
        │
        └─ POST to TTS Server /synthesize-sentence
             │  Payload: {
             │    "text": "I don't have real-time internet access.",
             │    "voice": "lessac",
             │    "sequence": 1
             │  }
             ↓
TTS Server (tts-server)
   │
   ├─ Piper TTS synthesizes sentence
   ├─ Generates WAV audio bytes
   │
   └─ Broadcasts via WebSocket to Desktop
        │  Message: {"audio": "<base64>", "sequence": 1}
        ↓
Desktop (Receives audio)
   │
   ├─ Decodes base64 → WAV
   ├─ Queues for sequential playback
   └─ Plays audio through speakers

🎯 Key Features & Technologies

Voice Processing

  • ASR: OpenAI Whisper (base model)
  • Voice Biometrics: SpeechBrain (ECAPA-TDNN)
  • Audio Quality: WebRTC VAD, noise reduction
  • Encryption: Fernet encryption for voice embeddings

Context Management

  • Embeddings:
    • 768D (all-mpnet-base-v2) for conversations
    • 384D (all-MiniLM-L6-v2) for reminders
    • 512D (SpeechBrain) for voice profiles
  • Storage:
    • PostgreSQL for structured data
    • Qdrant for vector similarity search
  • Context Building:
    • Recent conversations (last 10 messages)
    • Semantic search (top 5 matches)
    • Token counting (max 4000 tokens)
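The context-building rules above can be sketched as a budgeted assembly loop. This is a simplified illustration (token counting here is a whitespace approximation, not the real tokenizer, and the message data is invented):

```python
MAX_CONTEXT_TOKENS = 4000

def count_tokens(text):
    return len(text.split())            # crude stand-in for a real tokenizer

def build_context(recent, semantic_matches, budget=MAX_CONTEXT_TOKENS):
    """Last 10 recent messages plus top-5 semantic matches,
    stopping once the token budget is spent."""
    context, used = [], 0
    for message in recent[-10:] + semantic_matches[:5]:
        cost = count_tokens(message)
        if used + cost > budget:
            break
        context.append(message)
        used += cost
    return "\n".join(context)

ctx = build_context(
    recent=["user: hi", "assistant: hello"],
    semantic_matches=["user: earlier we talked about weather"],
)
# ctx contains both recent messages and the semantic match
```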

LLM Integration

  • Dual Providers:
    • Server-Hosted: Qwen3-4B (quantized GGUF) on local GPU
    • Groq API: Llama 3.3 70B, GPT OSS 120B/20B, Qwen3 32B (cloud)
  • Runtime Switching: Change providers without restart via user settings
  • Smart Routing: Backend factory pattern selects provider based on user.llm_provider
  • Format Conversion: Automatic Qwen3 → OpenAI format for Groq compatibility
  • Performance:
    • Server-Hosted: 35 tok/sec (GPU-dependent)
    • Groq API: 250 tok/sec (7x faster)
  • Inference: llama-cpp-python (local) + Groq SDK (cloud)
  • Modes:
    • Voice Mode: Brief, conversational responses
    • Text Mode: Detailed, formatted responses
  • Security: Fernet (AES-128) encryption for Groq API keys
  • Concurrency: Semaphore-based request limiting

TTS System

  • Engine: Piper TTS (ONNX-based)
  • Voices: 4 high-quality voices (lessac, ryan, kimberly, amy)
  • Optimization: Parallel sentence synthesis with sequencing
  • Streaming: WebSocket-based real-time audio delivery

📁 Project Structure

sarvik-working/
├── myai-backend/          # Main FastAPI backend
│   ├── app/
│   │   ├── api/           # API endpoints
│   │   │   ├── auth.py
│   │   │   ├── voice.py
│   │   │   ├── query.py
│   │   │   └── query_voice_audio.py
│   │   ├── core/          # Core configuration
│   │   │   ├── config.py
│   │   │   ├── database.py
│   │   │   └── security.py
│   │   ├── models/        # SQLAlchemy models
│   │   ├── services/      # Business logic
│   │   │   ├── llm_service.py
│   │   │   ├── llm_providers/  # LLM provider architecture
│   │   │   │   ├── base_provider.py
│   │   │   │   ├── server_provider.py
│   │   │   │   └── groq_provider.py
│   │   │   ├── tts_client.py
│   │   │   ├── context_manager.py
│   │   │   ├── embedding_service.py
│   │   │   └── voice_service.py
│   │   └── main.py        # FastAPI app
│   └── requirements.txt
│
├── myai-desktop/          # Electron + React app
│   ├── public/
│   │   ├── electron.js    # Electron main process
│   │   └── preload.js     # Preload script
│   ├── src/
│   │   ├── components/    # React components
│   │   ├── services/      # API & audio services
│   │   │   ├── apiService.js
│   │   │   ├── globalAudioManager.js
│   │   │   └── deviceManager.js
│   │   ├── context/       # React context
│   │   └── App.jsx        # Main app
│   └── package.json
│
├── llm-server/            # LLM microservice
│   ├── app/
│   │   ├── main.py        # FastAPI server
│   │   ├── llm_service.py # Qwen model wrapper
│   │   └── config.py      # LLM configuration
│   └── requirements.txt
│
├── tts-server/            # TTS microservice
│   ├── app/
│   │   ├── main.py        # FastAPI + WebSocket
│   │   ├── tts_service.py # Piper TTS wrapper
│   │   └── config.py      # TTS configuration
│   └── requirements.txt
│
└── documentation/         # Project documentation
    └── docs/              # NEW comprehensive docs
        ├── README.md      # This file
        ├── MYAI_BACKEND.md
        ├── MYAI_DESKTOP.md
        ├── LLM_SERVER.md
        └── TTS_SERVER.md

🚀 Startup Sequence

Step 1: Database Services (Manual/Docker)

  • PostgreSQL on default port 5432
  • Redis on default port 6379
  • Qdrant on default port 6333

Step 2: Backend Services (Order matters!)

  • Terminal 1 - LLM Server: Navigate to llm-server folder, run python run.py (Port 8001)
  • Terminal 2 - TTS Server: Navigate to tts-server folder, run docker-compose up -d (Port 8002)
  • Terminal 3 - Main Backend: Navigate to myai-backend folder, run python run.py (Port 8000)

Step 3: Desktop Application

  • Navigate to myai-desktop folder
  • Development mode: Run npm start (port 3000)
  • Full Electron app: Run npm run electron-dev

🔧 Environment Configuration

myai-backend/.env

DATABASE_URL=postgresql://user:pass@localhost:5432/sarvik
REDIS_URL=redis://localhost:6379/0
QDRANT_URL=http://localhost:6333
SECRET_KEY=your-secret-key
GOOGLE_CLIENT_ID=your-google-client-id
GOOGLE_CLIENT_SECRET=your-google-client-secret
LLM_SERVER_URL=http://localhost:8001
ENCRYPTION_KEY=dDkzTmY2YjJRZTF2VU1rZ0hSSnpYMFlhaUN0TGQ3cG8=  # For Groq API key encryption

llm-server/.env

MODEL_PATH=models/qwen3-4b-instruct-q4_0.gguf
GPU_LAYERS=35
MAX_CONCURRENT_REQUESTS=3
CONTEXT_SIZE=8192
BATCH_SIZE=512

tts-server/.env

TTS_MODEL_PATH=models/en_US-lessac-medium.onnx
TTS_PORT=8002
LOG_LEVEL=INFO
MAX_SENTENCE_LENGTH=500

📈 Performance Optimizations

Backend Optimizations

  1. Parallel Operations: Context building + query storage run concurrently
  2. Embedding Reuse: Single embedding generation for storage + search
  3. Async Model Loading: Background model loading with 10-min idle timeout
  4. Connection Pooling: SQLAlchemy connection pool for PostgreSQL

LLM Optimizations

  1. GPU Acceleration: CUDA-enabled llama-cpp-python
  2. Model Quantization: Q4_0 quantization (4-bit) for faster inference
  3. Concurrent Handling: Semaphore limiting to 3 concurrent requests
  4. Streaming: Token-by-token delivery for perceived speed

TTS Optimizations

  1. Parallel Synthesis: All sentences synthesized in parallel
  2. Sequence Numbers: Maintain playback order despite parallel processing
  3. WebSocket Streaming: Real-time audio delivery
  4. Sentence Buffering: Smart segmentation on punctuation
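The interplay of optimizations 1 and 2 can be illustrated with a small ordering buffer: sentences synthesized in parallel may arrive out of order, so playback holds chunks in a heap and releases them strictly by sequence number. This is a sketch of the idea, not the desktop's actual audio queue:

```python
import heapq

class OrderedPlayback:
    def __init__(self):
        self._pending = []      # min-heap of (sequence, audio)
        self._next_seq = 1
        self.played = []        # audio chunks in playback order

    def receive(self, sequence, audio):
        heapq.heappush(self._pending, (sequence, audio))
        # Play everything contiguous from the next expected sequence number.
        while self._pending and self._pending[0][0] == self._next_seq:
            _, chunk = heapq.heappop(self._pending)
            self.played.append(chunk)
            self._next_seq += 1

player = OrderedPlayback()
player.receive(2, "audio-2")    # arrives early, held back
player.receive(1, "audio-1")    # fills the gap, both play
player.receive(3, "audio-3")
# player.played → ["audio-1", "audio-2", "audio-3"]
```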

🔄 LLM Provider Integration (v2.1.0)

Overview

SARVIK now supports dual LLM providers with seamless runtime switching between local GPU inference and cloud-based API calls.

Provider Options

| Provider      | Model                 | Speed       | Context                                   | Privacy     |
|---------------|-----------------------|-------------|-------------------------------------------|-------------|
| Server-Hosted | Qwen3-4B (local GPU)  | 35 tok/sec  | 256K tokens (depends on hosted resources) | 100% Local  |
| Groq API      | Llama 3.3 70B (cloud) | 250 tok/sec | 128K tokens                               | Cloud-based |

How Provider Switching Works

User Journey:

  1. User opens Settings → LLM Provider
  2. Default: Server-Hosted (uses local GPU)
  3. To switch to Groq:
    • Enter Groq API key (from console.groq.com)
    • Select model (Llama 3.3 70B, GPT OSS 120B, etc.)
    • Click "Save" → Backend encrypts key
    • Click "Groq API" card → Provider switched
  4. Next query automatically uses selected provider (no restart!)

Backend Implementation:

Query arrives → Backend loads user from database
    ↓
Check user.llm_provider column:
    ├─ "server_hosted" → ServerLLMProvider
    │       ↓
    │  POST localhost:8001/generate-stream
    │       ↓
    │  Local Qwen3-4B (GPU)
    │
    └─ "groq_api" → GroqLLMProvider
            ↓
        Decrypt API key from database
            ↓
        Convert Qwen3 format → OpenAI format
            ↓
        Groq SDK: client.chat.completions.create()
            ↓
        Cloud Llama 3.3 70B
```
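The factory-pattern routing above can be sketched as follows. Class bodies and the dict-based user are illustrative stand-ins, not the backend's actual provider classes:

```python
class ServerLLMProvider:
    name = "server_hosted"
    def generate(self, prompt):
        return f"[local Qwen3-4B] {prompt}"    # would POST /generate-stream

class GroqLLMProvider:
    name = "groq_api"
    def generate(self, prompt):
        return f"[Groq cloud] {prompt}"        # would call the Groq SDK

# Registry keyed by the value stored in the user.llm_provider column.
PROVIDERS = {cls.name: cls for cls in (ServerLLMProvider, GroqLLMProvider)}

def get_provider(user):
    """Pick the provider from user settings, defaulting to server-hosted."""
    return PROVIDERS.get(user.get("llm_provider"), ServerLLMProvider)()

reply = get_provider({"llm_provider": "groq_api"}).generate("Hello")
# reply → "[Groq cloud] Hello"
```

Because the lookup happens per request, switching the column value takes effect on the very next query with no restart.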

Available Groq Models

  • llama-3.3-70b-versatile (Recommended - 70B params)
  • openai/gpt-oss-120b (Largest - 120B params)
  • openai/gpt-oss-20b (Fast - 20B params)
  • moonshotai/kimi-k2-instruct-0905 (Multilingual)
  • qwen/qwen3-32b (Balance - 32B params)

Security

  • API Key Encryption: Fernet (AES-128-CBC)
  • Auto-Expiry: Keys expire after 30 days
  • Secure Storage: Encrypted in PostgreSQL
  • No Plaintext: Keys only decrypted at request time
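The storage shape can be illustrated without the `cryptography` package. The XOR "cipher" below is NOT real encryption and is only a stdlib stand-in for Fernet; the point is the at-rest layout: ciphertext plus a 30-day expiry, with decryption deferred to request time:

```python
import base64, os, time

KEY = os.urandom(32)   # stand-in for the Fernet ENCRYPTION_KEY

def _xor(data: bytes) -> bytes:
    # Toy reversible transform standing in for Fernet encrypt/decrypt.
    return bytes(b ^ KEY[i % len(KEY)] for i, b in enumerate(data))

def store_api_key(plaintext: str) -> dict:
    return {
        "groq_api_key_encrypted": base64.b64encode(_xor(plaintext.encode())).decode(),
        "groq_api_key_expires_at": time.time() + 30 * 24 * 3600,   # 30 days
    }

def load_api_key(row: dict) -> str:
    if time.time() > row["groq_api_key_expires_at"]:
        raise ValueError("API key expired; user must re-enter it")
    return _xor(base64.b64decode(row["groq_api_key_encrypted"])).decode()

row = store_api_key("gsk_example_key")
assert row["groq_api_key_encrypted"] != "gsk_example_key"   # never plaintext at rest
key = load_api_key(row)
# key → "gsk_example_key"
```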

Database Schema Updates

-- users table (new columns)
llm_provider              VARCHAR(50)   -- 'server_hosted' or 'groq_api'
groq_api_key_encrypted    TEXT          -- Fernet-encrypted API key
groq_api_key_expires_at   TIMESTAMP     -- 30-day auto-expiry
groq_model                VARCHAR(100)  -- Selected Groq model

Format Conversion

Qwen3 Format (backend sends):

<|im_start|>system
You are SARVIK<|im_end|>
<|im_start|>user
Hello<|im_end|>

OpenAI Format (Groq expects):

[
  {"role": "system", "content": "You are SARVIK"},
  {"role": "user", "content": "Hello"}
]

Conversion happens automatically in GroqLLMProvider._parse_qwen_prompt_to_messages()
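A hedged reimplementation of that conversion step (the real method is GroqLLMProvider._parse_qwen_prompt_to_messages(), whose exact code is not shown here) can be written as a small regex parser:

```python
import re

# Matches one <|im_start|>role\ncontent<|im_end|> block of the Qwen3 format.
BLOCK = re.compile(r"<\|im_start\|>(\w+)\n(.*?)<\|im_end\|>", re.DOTALL)

def qwen_to_openai(prompt: str):
    """Convert a Qwen3 chat prompt into an OpenAI-style messages list."""
    return [{"role": role, "content": content.strip()}
            for role, content in BLOCK.findall(prompt)]

messages = qwen_to_openai(
    "<|im_start|>system\nYou are SARVIK<|im_end|>\n"
    "<|im_start|>user\nHello<|im_end|>"
)
# messages → [{"role": "system", "content": "You are SARVIK"},
#             {"role": "user", "content": "Hello"}]
```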

Performance Comparison

| Metric      | Server-Hosted | Groq API                |
|-------------|---------------|-------------------------|
| Speed       | 35 tok/sec    | 250 tok/sec (7x faster) |
| First Token | ~200ms        | ~180ms                  |
| Context     | 256K tokens   | 128K tokens             |
| Max Output  | 16,385 tokens | 8,192 tokens            |
| Cost        | Free (GPU)    | Free (14,400 req/day)   |
| Privacy     | 100% Local    | Cloud-based             |

For detailed implementation: See 3_LLM_SERVER.md - LLM Provider Integration


📚 Additional Documentation

  1. 1_MYAI_BACKEND.md - Backend Server Documentation

    • Architecture layers and components
    • All API endpoints with request/response formats
    • Services detailed explanation
    • Database architecture
    • Security and authentication
    • File locations: myai-backend/app/
  2. 2_MYAI_DESKTOP.md - Desktop Application Documentation

    • Electron + React architecture
    • Component structure and responsibilities
    • Services (API, Audio Manager, Device Manager)
    • State management (Auth and App contexts)
    • Audio system pipeline
    • File locations: myai-desktop/src/
  3. 3_LLM_SERVER.md - LLM Inference Server Documentation

    • Qwen3-4B model configuration
    • GPU offloading strategies
    • API endpoints for generation
    • Prompt engineering (Voice vs Text modes)
    • Streaming implementation
    • Performance optimization
    • File locations: llm-server/app/
  4. 4_TTS_SERVER.md - TTS Synthesis Server Documentation

    • Piper TTS voice models (4 voices)
    • API endpoints for synthesis
    • WebSocket protocol for audio streaming
    • Parallel synthesis architecture
    • Audio format specifications
    • File locations: tts-server/app/

Last Updated: November 2025

About

A smart personal assistant that listens, remembers context, and interacts with your digital life through natural voice conversation.
