Local voice assistant service using low-resource satellite devices (Raspberry Pi or similar) for multi-room voice control. Audio processing runs on a local server with GPU acceleration while lightweight clients handle audio capture/playback. 🚀
Transcribed speech is sent to a webhook endpoint (n8n, Home Assistant, etc.) for command processing and optional TTS responses (audio playback). All audio data is encrypted using AES-256 for secure transmission over local networks. 🔐
Why VoiceLink AI?
- Local Processing - All audio and transcription stay on your local network: fast, with no cloud dependencies
- Custom Wake Words - Use Porcupine to create personalized wake words for your assistant
- Easy Custom Integration - Works seamlessly with n8n, Home Assistant, and other webhook-compatible platforms
- 🎯 Wake Word Detection - Porcupine custom wake word
- 🗣️ Speech-to-Text - Faster Whisper with GPU support
- 🏠 Multi-Room Support - Multiple simultaneous client devices
- 📍 Location-Aware - Device ID/name included in webhook payload
- 🔒 AES-256 Encryption - Secure audio transmission
- 🎵 Voice Activity Detection - Silero VAD for smart silence detection
- 🙍 Speaker Identification - Recognize different users (coming soon)
- 🔊 Audio Feedback - Success/error sounds and TTS responses routed to originating device
- 🔄 Auto-Reconnection - Clients reconnect automatically on network issues
```mermaid
graph LR
    A[🎙️ Client Devices<br/>Raspberry Pi / PC] -->|🔒 Audio Stream| B[🖥️ Server<br/>STT Processing]
    B -->|Transcribed Text| C[🔗 Webhook<br/>n8n / HA]
    C -->|TTS Response| B
    B -->|🔒 Audio Response| A
    style A fill:#667eea,stroke:#333,stroke-width:2px,color:#fff
    style B fill:#f093fb,stroke:#333,stroke-width:3px,color:#fff
    style C fill:#4facfe,stroke:#333,stroke-width:2px,color:#fff
```
My personal setup uses Home Assistant and n8n for processing voice commands and controlling smart home devices. Here's a brief overview of my setup:
- An n8n webhook receives transcribed text from the VoiceLink AI server (this project) running on a Windows PC. My personalized AI agent's response is sent to the ElevenLabs TTS API for audio generation, and the webhook responds with the audio (base64 in the `data` field). ElevenLabs is not free, but the quality is excellent; you can substitute any TTS service that returns audio data.
- The n8n agent uses the Home Assistant MCP tool (Home Assistant's built-in MCP server enabled) to access HA services and devices.
- My 'satellite' client devices are Raspberry Pi 3 boards running the client service.
- The satellite Pis also run MPD (Music Player Daemon) as an easy way for my AI agent to make TTS announcements (via the MCP server). I haven't been able to get Music Assistant to work with MPD yet, although the Pis are visible as players in Music Assistant.
Server:

```bash
cd server
pip install -r requirements.txt
cp .env.example .env
# Edit .env with your configuration
python server.py
```

Client:

```bash
cd client
pip install -r requirements.txt
cp .env.example .env
# Edit .env with your configuration
python client.py
```

This project uses Porcupine by Picovoice for wake word detection. Porcupine provides accurate, low-latency wake word recognition that runs efficiently on both server and edge devices. 🎪
Picovoice Porcupine is an excellent choice for wake word detection due to its accuracy, low resource usage, and ease of customization. Up to 3 devices and custom wake words are FREE (this project only needs 1 device - the server).
- Create Account - Sign up at Picovoice Console 👤
- Get Access Key - Copy your free access key from the dashboard 🎫
- Add to `.env` - Set `PORCUPINE_ACCESS_KEY` in your server's `.env` file ⚙️
You can create custom wake words for your assistant:
Option 1: Pre-Built Wake Words (Easiest) ⚡
- Use built-in wake words like "Hey Siri", "Alexa", "Computer", etc.
- Available in the Porcupine library without customization
Option 2: Custom Wake Words (Recommended) ⭐
- Go to Picovoice Console 🌐
- Navigate to Porcupine → Wake Words 📋
- Click Train New Wake Word 🎓
- Enter your custom phrase (e.g., "Hey Dude", "Jarvis", "Computer") 💬
- Download the `.ppn` model file 📥
- Place it in `server/resources/` 📁
- Update your server code to reference the new model file 🔧
Tips for Custom Wake Words: 💡
- Choose 2-3 syllable phrases for best accuracy ✅
- Avoid common words said in regular conversation 🚫
- Test with different accents and speaking styles 🌍
- The free tier includes custom wake word training 🆓
The included `Hey-Dude_en_windows_v3_0_0.ppn` model is a custom wake word example.
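For reference, here is a minimal sketch of how a server might load a custom keyword with the `pvporcupine` package. This is illustrative, not this project's exact code; the resource path and the audio callback are assumptions:

```python
# Sketch: load a custom .ppn wake word with pvporcupine (illustrative only).
import os

import pvporcupine

porcupine = pvporcupine.create(
    access_key=os.environ["PORCUPINE_ACCESS_KEY"],
    keyword_paths=["resources/Hey-Dude_en_windows_v3_0_0.ppn"],  # assumed path
)

# Feed 16 kHz, 16-bit mono PCM in frames of porcupine.frame_length samples.
def on_audio_frame(pcm_frame):
    # process() returns the index of the detected keyword, or -1 if none.
    if porcupine.process(pcm_frame) >= 0:
        print("Wake word detected!")
```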
See the `.env.example` files in the `server/` and `client/` directories. Copy each to `.env` and configure:
- 🖥️ Server: Set `WEBHOOK_URL`, `PORCUPINE_ACCESS_KEY`, and `ENCRYPTION_KEY` (see the example below)
- 📱 Client: Set server connection details, audio devices, and a matching `ENCRYPTION_KEY`
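For example, a server `.env` might look like this (all values are placeholders; your `.env.example` lists the full set of options):

```
# server/.env (placeholder values)
WEBHOOK_URL=http://192.168.1.50:5678/webhook/voicelink
PORCUPINE_ACCESS_KEY=your_picovoice_access_key
ENCRYPTION_KEY=your_generated_base64_key
```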
This encryption implementation uses AES-256-CBC with a random IV for each message, ensuring strong confidentiality for audio data. 🛡️ The shared secret key must be securely managed by the user. This implementation does not cover key exchange or advanced security features like authentication or certificates. This is a basic encryption layer to protect audio data in transit on local networks.
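For context, the scheme described above (AES-256-CBC with a fresh random IV per message) can be sketched with the `cryptography` package. This is an illustrative sketch, not the project's exact implementation; in particular, the IV-prefixed wire format is an assumption:

```python
# Sketch of AES-256-CBC with a random per-message IV (illustrative, not the
# project's exact code; the IV-prefixed wire format is an assumption).
import base64
import os

from cryptography.hazmat.primitives import padding
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

key = base64.b64decode(os.environ["ENCRYPTION_KEY"])  # 32 bytes for AES-256

def encrypt(plaintext: bytes) -> bytes:
    iv = os.urandom(16)  # fresh IV for every message
    padder = padding.PKCS7(128).padder()
    padded = padder.update(plaintext) + padder.finalize()
    enc = Cipher(algorithms.AES(key), modes.CBC(iv)).encryptor()
    return iv + enc.update(padded) + enc.finalize()  # prepend IV for the receiver

def decrypt(blob: bytes) -> bytes:
    iv, ciphertext = blob[:16], blob[16:]
    dec = Cipher(algorithms.AES(key), modes.CBC(iv)).decryptor()
    padded = dec.update(ciphertext) + dec.finalize()
    unpadder = padding.PKCS7(128).unpadder()
    return unpadder.update(padded) + unpadder.finalize()  # raises on key mismatch
```

A mismatched key typically surfaces as the "Invalid padding" error covered in the troubleshooting notes below.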
1. Generate Key 🔑
```
python support\encryption.py
```

Copy the output key (it looks like `Zy8xK3R5bGVzaGVldCBhbmQgdGhlIGZpbGU=`).
2. Add to Both .env Files 📝
Server .env:
```
ENCRYPTION_KEY=paste_your_key_here
```

Client .env:

```
ENCRYPTION_KEY=paste_the_same_key_here
```

3. Restart Services 🔄
✅ Verify it's working - Look for this line in the logs on startup:

```
Encryption enabled - audio data will be encrypted
```
- ✅ Audio stream data (client ↔ server)
- ✅ TTS responses and feedback sounds
- ❌ JSON control messages (unencrypted for debugging)
"Invalid padding" error → Keys don't match. Ensure both server and client have the exact same ENCRYPTION_KEY.
"No ENCRYPTION_KEY" warning → Add key to .env file and restart.
No encryption key? System will run with unencrypted audio and log a warning.
The server sends transcribed speech to a webhook endpoint (n8n, Home Assistant, etc.) and expects optional audio responses for TTS playback. 📡
Payload sent to the webhook:

```json
{
  "speaker": "John",
  "text": "turn on the kitchen lights",
  "timestamp": "2025-11-15T10:30:45.123456",
  "device_id": "kitchen_pi",
  "device_name": "Kitchen Pi"
}
```

Expected response:

```json
{
  "data": "base64_encoded_wav_audio_for_tts"
}
```

The `data` field is optional. If provided, the audio will be played back on the originating client device.
- 📥 Webhook Node - Receive POST requests
- ⚙️ Process Text - Parse command, trigger actions (Home Assistant nodes, HTTP requests, etc.)
- 🗣️ Text-to-Speech - Optional: generate response audio (e.g., ElevenLabs, Google TTS)
- 📤 Respond to Webhook - Return JSON with base64 audio in the `data` field (see the sketch below)
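If you'd rather test without n8n, a minimal endpoint can be sketched in Python with Flask. The route name and the `synthesize` helper below are hypothetical stand-ins, not part of this project; any TTS that returns WAV bytes will do:

```python
# Sketch of a VoiceLink-compatible webhook (illustrative; the route name and
# synthesize() are hypothetical stand-ins, not part of this project).
import base64

from flask import Flask, jsonify, request

app = Flask(__name__)

def synthesize(text: str) -> bytes:
    """Hypothetical TTS helper: return WAV bytes, or b"" for no audio reply."""
    return b""

@app.post("/voicelink")
def handle_command():
    payload = request.get_json()  # speaker, text, timestamp, device_id, device_name
    print(f"[{payload.get('device_name')}] {payload.get('text')}")
    # Route the command to your automation here (Home Assistant, scripts, ...).
    audio = synthesize(payload.get("text", ""))
    if audio:
        return jsonify({"data": base64.b64encode(audio).decode("ascii")})
    return jsonify({})  # no "data" field -> no TTS playback on the client
```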
- 💻 Server: Windows 10/11, Python 3.8+, CUDA-compatible GPU (optional)
- 🥧 Client: Raspberry Pi 3+ or any Linux/Windows device with a microphone
- 🌐 Network: WebSocket connectivity between clients and server
The server can be run as a Windows service using NSSM (Non-Sucking Service Manager). 🛠️ This allows the server to start automatically on boot and run in the background (no login required). 🔄
```
nssm install VoiceAssistant "C:\path\to\server\.venv\Scripts\python.exe" "C:\path\to\server\server.py"
```

- Picovoice Porcupine - Wake word detection
- Faster Whisper - Speech-to-text transcription
- Silero VAD - Voice activity detection