SARVIK (Smart Assistant for Real-time Voice Interaction and Knowledge) is an advanced AI personal assistant developed by Ganpat University students: Karan, Krish, and Vaibhavi. The project implements a sophisticated microservices architecture combining voice processing, natural language understanding, text-to-speech synthesis, persistent context management, and real-world function calling capabilities.
- Voice & Text Interaction: Natural conversation with real-time audio visualization
- Dual LLM Providers: Switch between local Qwen3-4B (GPU) or cloud Groq API (Llama 3.3 70B)
- Function Calling: 9+ integrated tools for real-world actions (weather, Gmail, Drive, Calendar)
- Google Services Integration: Connect Gmail, Drive, and Calendar with OAuth
- Context-Aware AI: Semantic search across conversation history for relevant responses
- Voice Biometrics: Secure voice enrollment and verification
Interactive web demo to explore UI & features (Frontend Only)
Complete system showcase with backend, function calling & AI responses
SARVIK follows a microservices architecture with 4 main components:
                    SARVIK ECOSYSTEM
┌──────────────────┐
│   MYAI-DESKTOP   │ ← Electron + React Frontend
│   (Port 3000)    │   • Voice Input/Output
└────────┬─────────┘   • Chat Interface
         │             • Audio Visualization
         ▼
┌──────────────────┐
│   MYAI-BACKEND   │ ← FastAPI Backend + AI Services
│   (Port 8000)    │   • Authentication
└────────┬─────────┘   • Voice Processing (Whisper, SpeechBrain)
         │             • Context Management
         │             • Conversation Storage
         │
     ┌───┴───────┬───────────┐
     ▼           ▼           ▼
┌─────────┐ ┌─────────┐ ┌──────────┐
│   LLM   │ │   TTS   │ │ DATABASES│
│PROVIDERS│ │ SERVER  │ │          │
│ (8001)  │ │ (8002)  │ │PostgreSQL│
│         │ │         │ │  Redis   │
│Qwen3-4B │ │ Piper   │ │  Qdrant  │
│Groq API │ │         │ │          │
└─────────┘ └─────────┘ └──────────┘

- Purpose: User-facing desktop application
- Port: 3000
- Key Features:
- Voice interaction with real-time audio visualization
- Text-based chat interface
- Voice enrollment and authentication
- Conversation history management
- Device audio management
- Technology Stack: Electron, React, Styled Components, Three.js
- Documentation: MYAI_DESKTOP.md
- Purpose: Core API server and AI orchestration
- Port: 8000
- Key Features:
- User authentication (Google OAuth + JWT)
- Voice processing (Whisper ASR + SpeechBrain)
- Context management (embedding-based semantic search)
- Conversation storage (PostgreSQL + Qdrant)
- LLM and TTS orchestration
- Function calling orchestrator (detects and executes tool calls)
- Google Services OAuth (Gmail, Drive, Calendar integration)
- 9+ Built-in Tools (weather, email, calendar, file management)
- Technology Stack: FastAPI, SQLAlchemy, Whisper, SpeechBrain, Sentence Transformers, Google APIs
- Documentation: MYAI_BACKEND.md
- Purpose: Dedicated LLM inference server with dual provider support
- Port: 8001
- Key Features:
- Dual LLM Providers: Server-Hosted (Qwen3-4B GPU) or Cloud (Groq API)
- Runtime Switching: Change providers without restart
- Smart Routing: Automatic provider selection based on user preference
- Streaming token generation (20-350 tokens/sec)
- Concurrent request handling
- Separate system prompts for voice/text modes
- Technology Stack: FastAPI, llama-cpp-python, CUDA, Groq SDK
- Documentation: 3_LLM_SERVER.md (includes provider integration details)
- Purpose: Text-to-speech synthesis server
- Port: 8002
- Key Features:
- Real-time audio synthesis using Piper TTS
- Multiple voice support (lessac, sarah, alba, amy, david)
- WebSocket-based audio streaming
- Parallel sentence synthesis with sequencing
- Technology Stack: FastAPI, Piper TTS, ONNX Runtime
- Documentation: TTS_SERVER.md
1. USER SPEAKS
   ↓
2. DESKTOP: Audio Recording
   - globalAudioManager.js captures audio
   - Sends to backend via /api/voice/process
   ↓
3. BACKEND: Voice Processing
   - Whisper transcribes audio → text
   - SpeechBrain verifies user identity
   - Stores user query in PostgreSQL
   - Generates 768D embedding via Sentence Transformers
   ↓
4. BACKEND: Context Building
   - Retrieves recent conversations (PostgreSQL)
   - Semantic search in Qdrant (768D vectors)
   - Combines context for LLM
   ↓
5. LLM PROVIDER SELECTION
   - Backend checks user.llm_provider setting
   - Routes to Server-Hosted (Qwen3-4B) or Groq API (cloud LLM)
   ↓
6. LLM GENERATION: Response Generation
   - Server-Hosted: POST /generate-stream (Qwen3-4B on GPU)
   - Groq API: client.chat.completions.create() (cloud LLM)
   - Generates streaming response (20-350 tok/sec)
   - Returns tokens via SSE
   ↓
7. BACKEND: Parallel Processing
   - Buffers tokens into sentences
   - Sends sentences to TTS server
   - Streams tokens to desktop
   ↓
8. TTS SERVER: Audio Synthesis
   - Converts sentences to speech (Piper)
   - Broadcasts WAV audio via WebSocket
   ↓
9. DESKTOP: Playback
   - Receives text tokens (displays in chat)
   - Receives audio chunks (plays sequentially)
   - Updates conversation UI
1. USER TYPES
   ↓
2. DESKTOP: Text Submission
   - Sends query to /api/query/process-llm-stream
   ↓
3. BACKEND: Context + LLM Provider Selection
   - Stores query in PostgreSQL
   - Builds context (recent + semantic)
   - Formats prompt with TEXT MODE system prompt
   - Checks user.llm_provider setting
   ↓
4. LLM GENERATION: Text Generation
   - Server-Hosted: Qwen3-4B on GPU (slow)
   - Groq API: Llama 3.3 70B cloud (very fast)
   - Generates detailed response
   - Streams tokens back
   ↓
5. DESKTOP: Rendering
   - Markdown rendering with syntax highlighting
   - Code blocks, lists, formatting
   - Real-time token display
Collections/Tables:
- `users` - User accounts (Google OAuth)
  - `llm_provider` - Current provider ('server_hosted' or 'groq_api')
  - `groq_api_key_encrypted` - Fernet-encrypted API key
  - `groq_api_key_expires_at` - 30-day auto-expiry
  - `groq_model` - Selected Groq model
- `voice_profiles` - Encrypted voice embeddings
- `conversations` - All user conversations (user + assistant messages)
- `settings` - User preferences (timezone, voice preference, etc.)
- `service_connections` - OAuth connections for external services
  - `service_name` - Service type ('gmail', 'drive', 'calendar')
  - `access_token` - Encrypted OAuth access token
  - `refresh_token` - Encrypted OAuth refresh token
  - `token_expires_at` - Token expiry timestamp
  - `scopes` - Granted OAuth permissions
  - `service_account_email` - Google account used for the connection
Usage:
- Voice enrollment sessions (temporary)
- Session tokens
- Rate limiting
- Caching
Collections (Per-User):
- `conversations_{user_id}` - 768D embeddings of conversations for semantic search
Why Separate Collections?
- Data isolation per user
- Privacy and security
- Optimized search performance
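The per-user isolation described above can be sketched in plain Python (this is illustrative only, not the actual qdrant-client code): each user gets a collection named `conversations_{user_id}`, and a semantic search only ever scans that user's vectors.

```python
import math

# Illustrative sketch of per-user vector collections: names follow the
# conversations_{user_id} convention, and search is cosine similarity.
collections = {}

def collection_name(user_id):
    return f"conversations_{user_id}"

def upsert(user_id, text, vector):
    collections.setdefault(collection_name(user_id), []).append((text, vector))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def semantic_search(user_id, query_vector, top_k=5):
    # Only this user's collection is scanned -> data isolation by design.
    points = collections.get(collection_name(user_id), [])
    ranked = sorted(points, key=lambda p: cosine(p[1], query_vector), reverse=True)
    return [text for text, _ in ranked[:top_k]]
```

Because another user's vectors live in a different collection, they can never appear in the ranked results, which is the privacy property the separate-collection design buys.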
1. Desktop β Google OAuth Login
2. Backend validates with Google
3. Issues JWT token (10080-minute expiry, i.e. 7 days)
4. Token stored in localStorage
5. All API calls include: Authorization: Bearer {token}
6. Backend verifies JWT on every request
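The issue/verify cycle above can be sketched with a hand-rolled HS256 JWT (the real backend presumably uses a JWT library; `SECRET_KEY` and the 10080-minute expiry come from this README, everything else is illustrative):

```python
import base64, hashlib, hmac, json, time

SECRET_KEY = b"your-secret-key"  # placeholder, mirrors the .env example

def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def issue_token(user_id: str, expires_minutes: int = 10080) -> str:
    header = _b64(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64(json.dumps(
        {"sub": user_id, "exp": int(time.time()) + expires_minutes * 60}
    ).encode())
    signing_input = f"{header}.{payload}".encode()
    sig = _b64(hmac.new(SECRET_KEY, signing_input, hashlib.sha256).digest())
    return f"{header}.{payload}.{sig}"

def verify_token(token: str):
    header, payload, sig = token.split(".")
    signing_input = f"{header}.{payload}".encode()
    expected = _b64(hmac.new(SECRET_KEY, signing_input, hashlib.sha256).digest())
    if not hmac.compare_digest(sig, expected):
        return None  # signature mismatch: token was tampered with
    padded = payload + "=" * (-len(payload) % 4)
    claims = json.loads(base64.urlsafe_b64decode(padded))
    if claims["exp"] < time.time():
        return None  # past the expiry in step 3
    return claims
```

Step 6 corresponds to calling `verify_token` on the `Authorization: Bearer` value of every incoming request.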
1. Enrollment (3 phrases)
- Extract voice embeddings (SpeechBrain)
- Create centroid + variance
- Encrypt and store in PostgreSQL
2. Verification
- User says "Hey SARVIK"
- Extract embedding
- Compare with stored profile (cosine similarity)
- Threshold: 0.70 (configurable)
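The enrollment and verification steps above amount to a centroid plus a cosine-similarity threshold. A minimal sketch (real SpeechBrain embeddings are 512-D; the short vectors here are for illustration):

```python
import math

THRESHOLD = 0.70  # configurable, per the docs

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def enroll(phrase_embeddings):
    # Centroid of the enrollment phrases (three, per the flow above).
    dims = len(phrase_embeddings[0])
    return [sum(e[i] for e in phrase_embeddings) / len(phrase_embeddings)
            for i in range(dims)]

def verify(centroid, new_embedding, threshold=THRESHOLD):
    # Accept the speaker only if similarity clears the threshold.
    return cosine_similarity(centroid, new_embedding) >= threshold
```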
Authentication & User:
- `POST /api/auth/google` - Google authentication
- `DELETE /api/auth/account` - Delete user account
Voice Processing:
- `POST /api/voice/process` - Voice transcription
- `POST /api/voice/enrollment/start` - Start voice enrollment
- `POST /api/voice/enrollment/record-phrase` - Record enrollment phrase
- `POST /api/voice/enrollment/complete` - Complete enrollment
- `POST /api/voice/verify` - Verify voice
- `PUT /api/voice-settings/preference` - Update voice preference
Query Processing (with Function Calling):
- `POST /api/query/process-llm-stream` - Text mode query (supports function calling)
- `POST /api/query/voice-stream-with-audio` - Voice mode query (supports function calling)
Conversations:
- `GET /api/conversations/history` - Get conversation history
- `DELETE /api/conversations/{id}` - Delete conversation
LLM Provider Settings:
- `GET /api/llm-settings` - Get current LLM provider settings
- `PUT /api/llm-settings/provider` - Switch LLM provider (server_hosted/groq_api)
- `POST /api/llm-settings/groq-key` - Save encrypted Groq API key
- `PUT /api/llm-settings/groq-model` - Update Groq model selection
- `POST /api/llm-settings/test-connection` - Test Groq API connection
- `DELETE /api/llm-settings/groq-key` - Delete Groq API key
Service OAuth (Gmail, Drive, Calendar):
- `POST /oauth/connect/{service}` - Initiate OAuth flow for service
- `GET /oauth/callback` - OAuth callback handler
- `POST /oauth/disconnect/{service}` - Disconnect service
- `GET /oauth/status` - Get all service connection statuses
Server-Hosted (localhost:8001):
- `GET /health` - Check model status
- `POST /generate` - Non-streaming generation (not currently used)
- `POST /generate-stream` - Streaming generation (SSE) ← USED
Groq API (groq.com):
- Via Groq SDK: `client.chat.completions.create(stream=True)` ← USED
- Automatic format conversion: Qwen3 → OpenAI messages
- `GET /health` - Check TTS status
- `GET /voices` - List available voices
- `POST /synthesize-sentence` - Synthesize sentence with voice and sequence
- `POST /flush` - Flush remaining buffer
- `ws://localhost:8002/ws/audio-stream` - Audio chunk streaming
SARVIK now has real-world action capabilities through an integrated function calling system. The LLM can automatically detect when it needs external data or actions, execute tools, and use the results to provide informed responses.
Weather Tools:
- `get_weather` - Get current weather for any location (with automatic IP-based geolocation)
Gmail Tools:
- `gmail_read_emails` - Read recent emails
- `gmail_search_emails` - Search emails by query
- `gmail_send_email` - Send emails
Google Drive Tools:
- `drive_list_files` - List files in Drive
- `drive_search_files` - Search Drive by name
- `drive_create_folder` - Create folders
Google Calendar Tools:
- `calendar_list_events` - List upcoming events
- `calendar_search_events` - Search calendar events
- `calendar_create_event` - Create new events
1. USER QUERY: "What's the weather in Gandhinagar?"
   ↓
2. BACKEND: Sends query to LLM with tool schemas
   ↓
3. LLM: Detects need for weather data
   → Outputs: <TOOL_CALL>call-123|get_weather|{"city": "cityname"}</TOOL_CALL>
   ↓
4. FUNCTION CALLING ORCHESTRATOR:
   - Detects tool call marker in stream
   - Parses: tool_name="get_weather", args={"city": "cityname"}
   - Executes tool via registry
   ↓
5. WEATHER TOOL:
   - Calls OpenWeatherMap API
   - Returns: {"temperature": 28, "condition": "Clear", ...}
   ↓
6. ORCHESTRATOR:
   - Injects result into context as "[TOOL RESULT]"
   - Sends updated prompt back to LLM
   ↓
7. LLM: Generates natural response
   → "The weather in Gandhinagar is clear with a temperature of 28°C."
   ↓
8. USER receives informed response
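Step 4 of this flow can be sketched as a small parser plus dispatcher. The `call-id|tool_name|json-args` layout inside `<TOOL_CALL>` is taken from the example above; the `registry` dict stands in for the real tool registry:

```python
import json
import re

# Matches the <TOOL_CALL>id|name|{...json args...}</TOOL_CALL> marker that
# the LLM emits into the token stream.
TOOL_CALL_RE = re.compile(r"<TOOL_CALL>(.*?)\|(.*?)\|(.*?)</TOOL_CALL>", re.DOTALL)

def parse_tool_calls(stream_text):
    calls = []
    for call_id, name, raw_args in TOOL_CALL_RE.findall(stream_text):
        calls.append({"id": call_id, "name": name, "args": json.loads(raw_args)})
    return calls

def execute(calls, registry):
    results = []
    for call in calls:
        tool = registry.get(call["name"])
        if tool is None:
            # Graceful fallback when the LLM names an unregistered tool.
            results.append({"id": call["id"], "error": "unknown tool"})
        else:
            results.append({"id": call["id"], "result": tool(**call["args"])})
    return results
```

In step 6, each result would be serialized back into the prompt as a `[TOOL RESULT]` block before re-invoking the LLM.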
- Automatic Detection: LLM decides when tools are needed
- Parallel Execution: Multiple tool calls in complex queries
- Error Handling: Graceful fallback if tools fail
- Context Injection: Tool results seamlessly integrated
- Security: OAuth-based authentication for Google services
- IP Geolocation: Auto-detect user location for weather
Separate OAuth Flow:
- Users connect Gmail/Drive/Calendar separately from SARVIK login
- Can use different Google account than SARVIK login
- Encrypted token storage with Fernet
- Automatic token refresh before expiry
Connection Workflow:
1. User opens Settings β Service Connections
2. Clicks "Connect Gmail"
3. Backend generates OAuth URL
4. Opens Google authorization in browser
5. User grants permissions
6. OAuth callback stores encrypted tokens
7. Tools can now access Gmail data
Supported Scopes:
- Gmail: Read, search, and send emails
- Drive: List, search, and create files/folders
- Calendar: Read, search, and create events
Desktop (User speaks "What's the weather?")
  │
  ├─ Records audio (globalAudioManager)
  │
  └─ POST /api/voice/process (FormData: audio.webm)
        ↓
Backend (myai-backend)
  │
  ├─ Whisper transcribes → "What's the weather?"
  ├─ Stores in PostgreSQL (conversations table)
  ├─ Generates 768D embedding (Sentence Transformers)
  ├─ Searches Qdrant for semantic matches
  ├─ Retrieves recent conversations (PostgreSQL)
  │
  └─ POST to LLM Server /generate-stream
        Payload: {
          "prompt": "<|im_start|>system\n{VOICE_SYSTEM_PROMPT}<|im_end|>...",
          "max_tokens": 512,
          "temperature": 0.7
        }
        ↓
LLM Server (llm-server)
  │
  ├─ Qwen3-4B processes prompt
  ├─ Generates tokens: ["I", " don't", " have", " real-time", ...]
  │
  └─ Streams via SSE: data: {"token": "I"}\n\n
        ↓
Backend (receives tokens, parallel processing)
  │
  ├─ Streams tokens to Desktop (SSE)
  │    └─ Desktop displays: "I don't have real-time..."
  │
  ├─ Buffers into sentences
  │    "I don't have real-time internet access."
  │
  └─ POST to TTS Server /synthesize-sentence
        Payload: {
          "text": "I don't have real-time internet access.",
          "voice": "lessac",
          "sequence": 1
        }
        ↓
TTS Server (tts-server)
  │
  ├─ Piper TTS synthesizes sentence
  ├─ Generates WAV audio bytes
  │
  └─ Broadcasts via WebSocket to Desktop
        Message: {"audio": "<base64>", "sequence": 1}
        ↓
Desktop (receives audio)
  │
  ├─ Decodes base64 → WAV
  ├─ Queues for sequential playback
  └─ Plays audio through speakers
- ASR: OpenAI Whisper (base model)
- Voice Biometrics: SpeechBrain (ECAPA-TDNN)
- Audio Quality: WebRTC VAD, noise reduction
- Encryption: Fernet encryption for voice embeddings
- Embeddings:
- 768D (all-mpnet-base-v2) for conversations
- 384D (all-MiniLM-L6-v2) for reminders
- 512D (SpeechBrain) for voice profiles
- Storage:
- PostgreSQL for structured data
- Qdrant for vector similarity search
- Context Building:
- Recent conversations (last 10 messages)
- Semantic search (top 5 matches)
- Token counting (max 4000 tokens)
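The context-building rules above (last 10 recent messages, top-5 semantic matches, 4000-token cap) can be sketched as a simple budgeted merge. Whitespace splitting here is an illustrative stand-in for the real tokenizer:

```python
MAX_CONTEXT_TOKENS = 4000

def approx_tokens(text):
    # Rough stand-in for a real tokenizer: one token per whitespace word.
    return len(text.split())

def build_context(recent, semantic, max_tokens=MAX_CONTEXT_TOKENS):
    # Recent conversation first (last 10 messages), then top-5 semantic hits.
    candidates = recent[-10:] + semantic[:5]
    picked, used = [], 0
    for message in candidates:
        cost = approx_tokens(message)
        if used + cost > max_tokens:
            break  # stay inside the token budget
        picked.append(message)
        used += cost
    return "\n".join(picked)
```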
- Dual Providers:
- Server-Hosted: Qwen3-4B (quantized GGUF) on local GPU
- Groq API: Llama 3.3 70B, GPT OSS 120B/20B, Qwen3 32B (cloud)
- Runtime Switching: Change providers without restart via user settings
- Smart Routing: Backend factory pattern selects provider based on user.llm_provider
- Format Conversion: Automatic Qwen3 → OpenAI format for Groq compatibility
- Performance:
- Server-Hosted: 35 tok/sec (GPU-dependent)
- Groq API: 250 tok/sec (7x faster)
- Inference: llama-cpp-python (local) + Groq SDK (cloud)
- Modes:
- Voice Mode: Brief, conversational responses
- Text Mode: Detailed, formatted responses
- Security: Fernet (AES-128) encryption for Groq API keys
- Concurrency: Semaphore-based request limiting
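The smart-routing factory described above can be sketched as follows. Class names are modeled on the `llm_providers/` files in the repository layout, but the bodies are illustrative stubs, not the real implementations:

```python
class BaseLLMProvider:
    def generate_stream(self, prompt):
        raise NotImplementedError

class ServerLLMProvider(BaseLLMProvider):
    def generate_stream(self, prompt):
        # The real provider would POST to localhost:8001/generate-stream (SSE).
        yield from ["local", " reply"]

class GroqLLMProvider(BaseLLMProvider):
    def __init__(self, api_key, model):
        self.api_key, self.model = api_key, model

    def generate_stream(self, prompt):
        # The real provider would call the Groq SDK with stream=True.
        yield from ["cloud", " reply"]

def get_provider(user):
    # Runtime switching: a new provider object is chosen per request based
    # on user.llm_provider, so no server restart is needed.
    if user.get("llm_provider") == "groq_api":
        return GroqLLMProvider(user["groq_api_key"], user["groq_model"])
    return ServerLLMProvider()  # default: server_hosted
```

Because the choice happens per request, flipping the `llm_provider` column is all it takes for the next query to use the other backend.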
- Engine: Piper TTS (ONNX-based)
- Voices: 4 high-quality voices (lessac, ryan, kimberly, amy)
- Optimization: Parallel sentence synthesis with sequencing
- Streaming: WebSocket-based real-time audio delivery
sarvik-working/
├── myai-backend/                # Main FastAPI backend
│   ├── app/
│   │   ├── api/                 # API endpoints
│   │   │   ├── auth.py
│   │   │   ├── voice.py
│   │   │   ├── query.py
│   │   │   └── query_voice_audio.py
│   │   ├── core/                # Core configuration
│   │   │   ├── config.py
│   │   │   ├── database.py
│   │   │   └── security.py
│   │   ├── models/              # SQLAlchemy models
│   │   ├── services/            # Business logic
│   │   │   ├── llm_service.py
│   │   │   ├── llm_providers/   # LLM provider architecture
│   │   │   │   ├── base_provider.py
│   │   │   │   ├── server_provider.py
│   │   │   │   └── groq_provider.py
│   │   │   ├── tts_client.py
│   │   │   ├── context_manager.py
│   │   │   ├── embedding_service.py
│   │   │   └── voice_service.py
│   │   └── main.py              # FastAPI app
│   └── requirements.txt
│
├── myai-desktop/                # Electron + React app
│   ├── public/
│   │   ├── electron.js          # Electron main process
│   │   └── preload.js           # Preload script
│   ├── src/
│   │   ├── components/          # React components
│   │   ├── services/            # API & audio services
│   │   │   ├── apiService.js
│   │   │   ├── globalAudioManager.js
│   │   │   └── deviceManager.js
│   │   ├── context/             # React context
│   │   └── App.jsx              # Main app
│   └── package.json
│
├── llm-server/                  # LLM microservice
│   ├── app/
│   │   ├── main.py              # FastAPI server
│   │   ├── llm_service.py       # Qwen model wrapper
│   │   └── config.py            # LLM configuration
│   └── requirements.txt
│
├── tts-server/                  # TTS microservice
│   ├── app/
│   │   ├── main.py              # FastAPI + WebSocket
│   │   ├── tts_service.py       # Piper TTS wrapper
│   │   └── config.py            # TTS configuration
│   └── requirements.txt
│
├── documentation/               # Project documentation
├── docs/                        # NEW comprehensive docs
├── README.md                    # This file
├── MYAI_BACKEND.md
├── MYAI_DESKTOP.md
├── LLM_SERVER.md
└── TTS_SERVER.md
Step 1: Database Services (Manual/Docker)
- PostgreSQL on default port 5432
- Redis on default port 6379
- Qdrant on default port 6333
Step 2: Backend Services (Order matters!)
- Terminal 1 - LLM Server: navigate to the llm-server folder and run `python run.py` (Port 8001)
- Terminal 2 - TTS Server: navigate to the tts-server folder and run `docker-compose up -d` (Port 8002)
- Terminal 3 - Main Backend: navigate to the myai-backend folder and run `python run.py` (Port 8000)
Step 3: Desktop Application
- Navigate to myai-desktop folder
- Development mode: run `npm start` (Port 3000)
- Full Electron app: run `npm run electron-dev`
DATABASE_URL=postgresql://user:pass@localhost:5432/sarvik
REDIS_URL=redis://localhost:6379/0
QDRANT_URL=http://localhost:6333
SECRET_KEY=your-secret-key
GOOGLE_CLIENT_ID=your-google-client-id
GOOGLE_CLIENT_SECRET=your-google-client-secret
LLM_SERVER_URL=http://localhost:8001
ENCRYPTION_KEY=dDkzTmY2YjJRZTF2VU1rZ0hSSnpYMFlhaUN0TGQ3cG8=  # For Groq API key encryption

MODEL_PATH=models/qwen3-4b-instruct-q4_0.gguf
GPU_LAYERS=35
MAX_CONCURRENT_REQUESTS=3
CONTEXT_SIZE=8192
BATCH_SIZE=512

TTS_MODEL_PATH=models/en_US-lessac-medium.onnx
TTS_PORT=8002
LOG_LEVEL=INFO
MAX_SENTENCE_LENGTH=500

- Parallel Operations: Context building + query storage run concurrently
- Embedding Reuse: Single embedding generation for storage + search
- Async Model Loading: Background model loading with 10-min idle timeout
- Connection Pooling: SQLAlchemy connection pool for PostgreSQL
- GPU Acceleration: CUDA-enabled llama-cpp-python
- Model Quantization: Q4_0 quantization (4-bit) for faster inference
- Concurrent Handling: Semaphore limiting to 3 concurrent requests
- Streaming: Token-by-token delivery for perceived speed
- Parallel Synthesis: All sentences synthesized in parallel
- Sequence Numbers: Maintain playback order despite parallel processing
- WebSocket Streaming: Real-time audio delivery
- Sentence Buffering: Smart segmentation on punctuation
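The sentence-buffering and sequencing ideas above can be sketched together: tokens stream in, completed sentences are flushed with a sequence number so parallel TTS synthesis can still be played back in order. This is a simplified illustration (it splits only on `.`, `!`, `?` and ignores abbreviations and decimals):

```python
import re

class SentenceBuffer:
    def __init__(self):
        self.buffer = ""
        self.sequence = 0

    def feed(self, token):
        """Append a token; return any completed (sequence, sentence) pairs."""
        self.buffer += token
        out = []
        while True:
            # Pad with a space so a sentence ending at the buffer end matches.
            match = re.search(r"[.!?]\s", self.buffer + " ")
            if not match:
                break
            self.sequence += 1
            out.append((self.sequence, self.buffer[:match.end() - 1].strip()))
            self.buffer = self.buffer[match.end() - 1:].lstrip()
        return out

    def flush(self):
        """Emit whatever remains (mirrors the TTS server's /flush endpoint)."""
        if not self.buffer.strip():
            return None
        self.sequence += 1
        leftover, self.buffer = self.buffer.strip(), ""
        return (self.sequence, leftover)
```

Each emitted pair maps onto a `/synthesize-sentence` request with `text` and `sequence`, and the desktop reorders incoming audio chunks by that sequence number.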
SARVIK now supports dual LLM providers with seamless runtime switching between local GPU inference and cloud-based API calls.
| Provider | Model | Speed | Context | Privacy |
|---|---|---|---|---|
| Server-Hosted | Qwen3-4B (local GPU) | 35 tok/sec | 256K tokens (depends on hosted resources) | 100% Local |
| Groq API | Llama 3.3 70B (cloud) | 250 tok/sec | 128K tokens | Cloud-based |
User Journey:
- User opens Settings β LLM Provider
- Default: Server-Hosted (uses local GPU)
- To switch to Groq:
- Enter Groq API key (from console.groq.com)
- Select model (Llama 3.3 70B, GPT OSS 120B, etc.)
- Click "Save" β Backend encrypts key
- Click "Groq API" card β Provider switched
- Next query automatically uses selected provider (no restart!)
Backend Implementation:
Query arrives → Backend loads user from database
      ↓
Check user.llm_provider column:
      │
      ├─ "server_hosted" → ServerLLMProvider
      │        ↓
      │     POST localhost:8001/generate-stream
      │        ↓
      │     Local Qwen3-4B (GPU)
      │
      └─ "groq_api" → GroqLLMProvider
               ↓
            Decrypt API key from database
               ↓
            Convert Qwen3 format → OpenAI format
               ↓
            Groq SDK: client.chat.completions.create()
               ↓
            Cloud Llama 3.3 70B
- `llama-3.3-70b-versatile` (Recommended - 70B params)
- `openai/gpt-oss-120b` (Largest - 120B params)
- `openai/gpt-oss-20b` (Fast - 20B params)
- `moonshotai/kimi-k2-instruct-0905` (Multilingual)
- `qwen/qwen3-32b` (Balanced - 32B params)
- API Key Encryption: Fernet (AES-128-CBC)
- Auto-Expiry: Keys expire after 30 days
- Secure Storage: Encrypted in PostgreSQL
- No Plaintext: Keys only decrypted at request time
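The key-lifecycle rules above can be sketched with plain `datetime` arithmetic. Encryption itself (Fernet) is elided here; the field names mirror the `groq_api_key_encrypted` and `groq_api_key_expires_at` columns, and the dict stands in for a user row:

```python
from datetime import datetime, timedelta, timezone

KEY_LIFETIME_DAYS = 30  # per the 30-day auto-expiry above

def save_groq_key(user, encrypted_key, now=None):
    now = now or datetime.now(timezone.utc)
    user["groq_api_key_encrypted"] = encrypted_key
    user["groq_api_key_expires_at"] = now + timedelta(days=KEY_LIFETIME_DAYS)

def get_groq_key(user, now=None):
    """Return the stored key only while it is still inside its lifetime."""
    now = now or datetime.now(timezone.utc)
    expires_at = user.get("groq_api_key_expires_at")
    if expires_at is None or now >= expires_at:
        return None  # never saved, or past the 30-day window
    return user["groq_api_key_encrypted"]
```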
-- users table (new columns)
llm_provider VARCHAR(50)           -- 'server_hosted' or 'groq_api'
groq_api_key_encrypted TEXT        -- Fernet-encrypted API key
groq_api_key_expires_at TIMESTAMP  -- 30-day auto-expiry
groq_model VARCHAR(100)            -- Selected Groq model

Qwen3 Format (backend sends):
<|im_start|>system
You are SARVIK<|im_end|>
<|im_start|>user
Hello<|im_end|>
OpenAI Format (Groq expects):
[
{"role": "system", "content": "You are SARVIK"},
{"role": "user", "content": "Hello"}
]

Conversion happens automatically in GroqLLMProvider._parse_qwen_prompt_to_messages().
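A hedged sketch of what that conversion could look like (the actual `_parse_qwen_prompt_to_messages` implementation may differ): split the Qwen3 chat-template prompt on its `<|im_start|>`/`<|im_end|>` delimiters and emit OpenAI-style message dicts.

```python
import re

# Each block looks like: <|im_start|>role\ncontent<|im_end|>
BLOCK_RE = re.compile(r"<\|im_start\|>(\w+)\n(.*?)<\|im_end\|>", re.DOTALL)

def parse_qwen_prompt_to_messages(prompt):
    """Convert a Qwen3-formatted prompt into an OpenAI messages list."""
    return [{"role": role, "content": content.strip()}
            for role, content in BLOCK_RE.findall(prompt)]
```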
| Metric | Server-Hosted | Groq API |
|---|---|---|
| Speed | 35 tok/sec | 250 tok/sec (7x faster) |
| First Token | ~200ms | ~180ms |
| Context | 256K tokens | 128K tokens |
| Max Output | 16,385 tokens (max) | 8,192 tokens |
| Cost | Free (GPU) | Free (14,400 req/day) |
| Privacy | 100% Local | Cloud-based |
For detailed implementation: See 3_LLM_SERVER.md - LLM Provider Integration
1. 1_MYAI_BACKEND.md - Backend Server Documentation
   - Architecture layers and components
   - All API endpoints with request/response formats
   - Services detailed explanation
   - Database architecture
   - Security and authentication
   - File locations: `myai-backend/app/`
2. 2_MYAI_DESKTOP.md - Desktop Application Documentation
   - Electron + React architecture
   - Component structure and responsibilities
   - Services (API, Audio Manager, Device Manager)
   - State management (Auth and App contexts)
   - Audio system pipeline
   - File locations: `myai-desktop/src/`
3. 3_LLM_SERVER.md - LLM Inference Server Documentation
   - Qwen3-4B model configuration
   - GPU offloading strategies
   - API endpoints for generation
   - Prompt engineering (Voice vs Text modes)
   - Streaming implementation
   - Performance optimization
   - File locations: `llm-server/app/`
4. 4_TTS_SERVER.md - TTS Synthesis Server Documentation
   - Piper TTS voice models (4 voices)
   - API endpoints for synthesis
   - WebSocket protocol for audio streaming
   - Parallel synthesis architecture
   - Audio format specifications
   - File locations: `tts-server/app/`
Last Updated: November 2025