Add personalized voice recognition to Home Assistant Assist Pipeline
Give your Home Assistant voice assistant the ability to recognize who's speaking! Speaker-ID integrates seamlessly with the Home Assistant Assist Pipeline to enable truly personalized voice-controlled smart home experiences.
Speaker-ID is a lightweight AI service that identifies people by their voice - think facial recognition, but for voices! It's specifically designed to work with Home Assistant's Assist Pipeline, enabling your voice assistant to know who's speaking and respond accordingly.
Speaker-ID enhances your Home Assistant voice assistant by adding speaker identification:
"Hey Jarvis, good morning"
→ Speaker-ID identifies it's Alice
→ "Good morning Alice! It's 7:30 AM. Would you like me to start your morning routine?"
Key Features:
- Personalized responses: Different greetings, news, music for each family member
- Smart automation: Lights, climate, and scenes adjust based on who's speaking
- Voice-based access control: Only authorized voices can unlock doors or arm security
- Multi-user calendar/schedules: "What's on my calendar?" gets the right person's schedule
- Room presence tracking: Know who's in which room based on voice interactions
How it works:
- Enrollment: Each family member says a few sentences to train the system
- Integration: Connect Speaker-ID to your Home Assistant Assist Pipeline
- Voice Commands: When anyone speaks to your voice assistant, Speaker-ID identifies them
- Personalization: Home Assistant uses this info to provide personalized responses and automations
100% local - no cloud required! All voice processing happens on your network.
Don't worry if you're not familiar with Python! We provide simple Docker-based installation that works on any system.
- A computer/server running Docker (Raspberry Pi 4, NAS, or any Linux machine)
- Home Assistant installation
- Microphone or voice recordings
This is the easiest way to get started. It installs both Speaker-ID and the required database.
- Create a directory and download the compose file:
  mkdir speaker-id
  cd speaker-id
  curl -O https://raw.githubusercontent.com/YOUR-REPO/speaker-id/main/docker/compose.yaml
- Start the services:
  docker compose up -d
- Verify it's running:
  curl http://localhost:8080/health
You should see:
{"status":"ok"}
That's it! The service is now running on port 8080.
Note: The AI model is pre-downloaded during Docker build, so the service is ready immediately with no first-request delays.
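If you are scripting the setup (for example from a provisioning script), you can poll the same /health endpoint until the container answers. This is a minimal Python sketch, assuming the default port 8080 and the requests library; the retry count and delay are arbitrary choices.

```python
import time
import requests

def wait_for_speaker_id(base_url="http://localhost:8080", attempts=30, delay=2.0):
    """Poll /health until the service reports {"status": "ok"} or we give up."""
    for _ in range(attempts):
        try:
            response = requests.get(f"{base_url}/health", timeout=5)
            if response.ok and response.json().get("status") == "ok":
                return True
        except requests.RequestException:
            pass  # service not up yet, retry after a short pause
        time.sleep(delay)
    return False

if __name__ == "__main__":
    print("Speaker-ID ready!" if wait_for_speaker_id() else "Service did not come up in time")
```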
If you prefer more control or already have a Qdrant database:
- Start Qdrant database (if you don't have one):
  docker run -d -p 6333:6333 --name qdrant qdrant/qdrant
- Start Speaker-ID:
  # Use Resemblyzer (default - faster, lower memory)
  docker run -d \
    -p 8080:8080 \
    -e QDRANT_URL=http://qdrant:6333 \
    --name speaker-id \
    --link qdrant \
    ghcr.io/identitry/speaker-id:latest

  # Or use ECAPA model (more accurate)
  docker run -d \
    -p 8080:8080 \
    -e QDRANT_URL=http://qdrant:6333 \
    -e USE_ECAPA=true \
    --name speaker-id \
    --link qdrant \
    ghcr.io/identitry/speaker-id:latest
A modern, user-friendly interface with a dark blue/violet theme.
Open your browser to http://localhost:8080
Features:
- Identify - Upload or record audio to identify speakers
- Enroll - Add new speakers with multiple voice samples
- Manage - View and delete enrolled speakers
- Settings - System configuration and maintenance
- API - Documentation and code examples
Key Features:
- Record directly from microphone or upload files
- Adjustable confidence threshold with slider
- Visual results with confidence bars
- Bulk speaker management
- Real-time status indicators
See Web Interface Guide for complete details.
Enroll a new speaker:
curl -F "file=@alice_voice.wav" "http://localhost:8080/api/enroll?name=Alice"

Identify a speaker:
curl -F "file=@unknown_voice.wav" "http://localhost:8080/api/identify?threshold=0.82"

Response example:
{
  "speaker": "Alice",
  "confidence": 0.91,
  "topN": [
    {"name": "Alice", "score": 0.91},
    {"name": "Bob", "score": 0.65}
  ]
}

Once running, visit:
- Interactive API docs: http://localhost:8080/docs
- Alternative docs: http://localhost:8080/redoc
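The same endpoints are easy to call from your own scripts. The sketch below mirrors the curl examples above using Python's requests library; the file names and threshold are placeholders, and the response fields (speaker, confidence, topN) follow the example response shown earlier.

```python
import requests

BASE_URL = "http://localhost:8080"

def enroll(name: str, wav_path: str) -> dict:
    """Upload one voice sample for a speaker (same as the curl enroll example)."""
    with open(wav_path, "rb") as audio:
        response = requests.post(
            f"{BASE_URL}/api/enroll",
            params={"name": name},
            files={"file": audio},
        )
    response.raise_for_status()
    return response.json()

def identify(wav_path: str, threshold: float = 0.82) -> dict:
    """Ask Speaker-ID who is speaking in the given recording."""
    with open(wav_path, "rb") as audio:
        response = requests.post(
            f"{BASE_URL}/api/identify",
            params={"threshold": threshold},
            files={"file": audio},
        )
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    enroll("Alice", "alice_voice.wav")
    result = identify("unknown_voice.wav")
    print(result.get("speaker"), result.get("confidence"))
```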
This is the primary integration method - connect Speaker-ID directly to your Home Assistant voice assistant!
Speaker-ID works seamlessly with Home Assistant's Assist Pipeline to identify who's speaking during voice interactions. This enables:
- Natural conversations: "Hey Jarvis, turn on my lights" → automatically turns on the right person's lights
- Personalized responses: Each family member gets their own greetings, music, calendar, etc.
- Voice enrollment: Just say "enroll my voice" to add yourself to the system
- Multi-room tracking: Know who's in each room based on voice interactions
Quick Setup Example:
# configuration.yaml
assist_pipeline:

rest_command:
  identify_speaker_assist:
    url: "http://localhost:8080/api/identify"
    method: POST
    payload: "file={{ audio_file }}&threshold=0.82"

input_text:
  current_speaker:
    name: "Current Speaker"
    initial: "unknown"

Then create automations that respond differently based on input_text.current_speaker:
automation:
  - alias: "Personalized Good Morning via Voice"
    trigger:
      - platform: conversation
        command: "good morning"
    action:
      - choose:
          - conditions:
              - condition: state
                entity_id: input_text.current_speaker
                state: "Alice"
            sequence:
              - service: tts.speak
                data:
                  message: "Good morning Alice! Your calendar shows {{ state_attr('calendar.alice', 'message') }}"
              - service: light.turn_on
                target:
                  entity_id: light.bedroom_alice
                data:
                  brightness_pct: 50
          - conditions:
              - condition: state
                entity_id: input_text.current_speaker
                state: "Bob"
            sequence:
              - service: tts.speak
                data:
                  message: "Good morning Bob! Traffic to work is {{ states('sensor.commute_time') }} minutes"

Complete Assist Pipeline guide: docs/ha-integration.md
If you're not using the Assist Pipeline, you can also integrate via:
- REST Commands: Direct API calls from automations
- RESTful Sensors: Monitor enrolled speakers
- Node-RED: Visual workflow programming
See the full integration guide for all methods and detailed examples.
| Variable | Default | Description |
|---|---|---|
| QDRANT_URL | http://localhost:6333 | Qdrant database connection |
| USE_ECAPA | false | Use advanced ECAPA model (more accurate, slower) |
| AUDIO_ENHANCEMENT | true | Master switch: enable all audio preprocessing |
| SELECT_BEST_SEGMENT | true | Select most energetic 3-second segment |
| SCORE_CALIBRATION | true | Calibrate similarity scores for better discrimination |
| MIN_AUDIO_DURATION | 1.0 | Minimum audio duration in seconds |
| SAMPLE_RATE | 16000 | Audio sample rate (16 kHz recommended) |
| DEFAULT_THRESHOLD | 0.82 | Confidence threshold (0-1, higher = stricter) |
| LOG_LEVEL | INFO | Logging verbosity (DEBUG/INFO/WARNING) |
The Docker image includes both models - you choose which one to use at runtime with the USE_ECAPA environment variable.
Resemblyzer (Default: USE_ECAPA=false):
- ✅ Faster inference (~100 ms)
- ✅ Lower memory usage (~500 MB)
- ✅ Good for Raspberry Pi
- ✅ Good accuracy for most use cases
- ⚠️ Slightly less accurate than ECAPA
ECAPA-TDNN (Set USE_ECAPA=true):
- ✅ Significantly more accurate
- ✅ Better with noisy audio
- ✅ Better discrimination between similar voices
- ⚠️ Requires more CPU/RAM (~1.5 GB)
- ⚠️ Slower inference (~300 ms)
Switch models anytime:
# Stop current container
docker stop speaker-id && docker rm speaker-id
# Start with ECAPA
docker run -d -e USE_ECAPA=true ... ghcr.io/identitry/speaker-id:latest

Important: When switching models, you must re-enroll all speakers as embeddings are incompatible between models.
When identifying a speaker, you get a confidence score between 0 and 1:
- 0.90+: Very confident match (same person)
- 0.80-0.89: Good match (likely the same person)
- 0.70-0.79: Possible match (similar voice, might be the same person)
- Below 0.70: Poor match (probably different person)
Default threshold is 0.82 - you can adjust based on your needs:
- Higher threshold (0.90): Fewer false positives, but might miss correct identifications
- Lower threshold (0.70): More identifications, but more false positives
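If you consume the API from your own scripts, you can turn these score bands into an explicit decision. The helper below is only a sketch; the band boundaries simply restate the guidance above, and the threshold defaults to 0.82.

```python
def interpret_match(confidence: float, threshold: float = 0.82) -> str:
    """Map an /api/identify confidence score to the bands described above."""
    if confidence >= 0.90:
        quality = "very confident match"
    elif confidence >= 0.80:
        quality = "good match"
    elif confidence >= 0.70:
        quality = "possible match"
    else:
        quality = "poor match"
    accepted = "accepted" if confidence >= threshold else "rejected"
    return f"{quality} ({accepted} at threshold {threshold})"

print(interpret_match(0.91))  # very confident match (accepted at threshold 0.82)
print(interpret_match(0.75))  # possible match (rejected at threshold 0.82)
```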
If you're getting lower confidence scores than expected, try these improvements:
1. Use ECAPA-TDNN Model (Most Impactful)
# Docker
docker run -e USE_ECAPA=true ...
# Docker Compose
environment:
  - USE_ECAPA=true

ECAPA is significantly more accurate than the default Resemblyzer model, especially in noisy conditions.
2. Enroll Multiple Samples (Critical)
- Enroll 5-10 samples per person (not just 1-2)
- Use different sentences and speaking styles
- The system averages all samples to create a more robust profile
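A quick way to get 5-10 samples per person into the system is to batch the uploads. The sketch below posts every WAV file in a folder to the enroll endpoint shown earlier; the folder layout (one directory per speaker) is just an assumption for the example.

```python
from pathlib import Path
import requests

BASE_URL = "http://localhost:8080"
SAMPLES_DIR = Path("voice_samples")  # assumed layout: voice_samples/Alice/*.wav, voice_samples/Bob/*.wav

for speaker_dir in sorted(p for p in SAMPLES_DIR.iterdir() if p.is_dir()):
    for wav_file in sorted(speaker_dir.glob("*.wav")):
        with wav_file.open("rb") as audio:
            response = requests.post(
                f"{BASE_URL}/api/enroll",
                params={"name": speaker_dir.name},
                files={"file": audio},
            )
        response.raise_for_status()
        print(f"Enrolled {wav_file.name} for {speaker_dir.name}")
```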
3. Audio Enhancement (Enabled by Default)
The system automatically applies multiple advanced preprocessing techniques:
- Voice Activity Detection (VAD): Removes silence and non-speech segments
- Smart Segment Selection: Automatically selects the best 3-second segment based on energy (for recordings > 3.5s)
- Normalization: Ensures consistent volume levels across all recordings
- Pre-emphasis: Boosts high frequencies that are critical for speaker characteristics
- Score Calibration: Improves discrimination when similarity scores are compressed
Configure these features:
# All enabled by default
AUDIO_ENHANCEMENT=true # Master switch for all audio processing
SELECT_BEST_SEGMENT=true # Select most energetic segment
SCORE_CALIBRATION=true # Calibrate similarity scores
MIN_AUDIO_DURATION=1.0 # Minimum seconds of speech required

4. Audio Quality Tips
- Record in a quiet environment
- Speak clearly at normal volume
- Avoid background music or TV
- Use a decent microphone (phone mic is usually fine)
- Ensure 3-5 seconds of actual speech (not just silence)
5. Re-enroll After System Changes
If you change models (Resemblyzer ↔ ECAPA), you must re-enroll all speakers as the embeddings are incompatible.
Expected Confidence Ranges:
- With ECAPA + 5+ samples: 0.85-0.95 for correct matches
- With Resemblyzer + 3 samples: 0.75-0.88 for correct matches
- Lower scores suggest: similar voices, poor audio quality, or insufficient training samples
- Multiple samples: Record 3-5 different samples per person
- Variety: Different sentences, tones, and contexts
- Quality: Clear audio, minimal background noise
- Length: At least 3-5 seconds per sample
- Natural speech: Use normal speaking voice, not shouting or whispering
- Recommended: WAV, 16kHz, mono
- Supported: MP3, WAV, FLAC (auto-converted)
- Channels: Mono preferred, stereo auto-converted
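If your recordings are not already 16 kHz mono WAV, you can convert them before enrolling. This is a sketch using pydub, which is not part of Speaker-ID and needs ffmpeg installed; the service auto-converts supported formats anyway, so this is only for pre-processing your own archives.

```python
from pathlib import Path
from pydub import AudioSegment  # pip install pydub (requires ffmpeg for MP3/FLAC input)

def to_enrollment_wav(src: str, dst: str) -> None:
    """Convert any supported audio file to 16 kHz mono WAV."""
    audio = AudioSegment.from_file(src)
    audio = audio.set_frame_rate(16000).set_channels(1)
    audio.export(dst, format="wav")

# Example: convert a folder of MP3 recordings in place
for mp3 in Path("raw_recordings").glob("*.mp3"):
    to_enrollment_wav(str(mp3), str(mp3.with_suffix(".wav")))
```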
- ✅ All processing happens locally - no cloud connection
- ✅ Voice data never leaves your network
- ✅ No audio is stored (only mathematical "fingerprints")
- ⚠️ Backup your Qdrant database to preserve enrolled voices
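One way to back up the enrolled voiceprints is Qdrant's snapshot API. The sketch below triggers and downloads a snapshot over HTTP; the collection name ("speakers") is an assumption, so check your instance (GET /collections) for the name Speaker-ID actually uses.

```python
import requests

QDRANT_URL = "http://localhost:6333"
COLLECTION = "speakers"  # assumption - list collections via GET /collections to confirm

# Ask Qdrant to create a snapshot of the collection holding the voice embeddings
created = requests.post(f"{QDRANT_URL}/collections/{COLLECTION}/snapshots")
created.raise_for_status()
snapshot_name = created.json()["result"]["name"]

# Download the snapshot file so it can be stored off the host
download = requests.get(
    f"{QDRANT_URL}/collections/{COLLECTION}/snapshots/{snapshot_name}",
    stream=True,
)
download.raise_for_status()
with open(snapshot_name, "wb") as backup:
    for chunk in download.iter_content(chunk_size=1 << 20):
        backup.write(chunk)
print(f"Saved {snapshot_name}")
```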
- Check if port 8080 is available:
  netstat -an | grep 8080
- Check Docker logs:
  docker logs speaker-id
- Check confidence score: It might be just below threshold
- Re-enroll: Try enrolling with more/better voice samples
- Lower threshold: Try threshold=0.75 instead of 0.82
- Check audio quality: Ensure clear audio with minimal noise
- Use ECAPA model: Set USE_ECAPA=true
- More training data: Enroll 5-10 samples per person
- Rebuild centroids: Call /api/rebuild_centroids after adding samples
- Check audio format: Ensure 16kHz sample rate
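After adding new samples you can trigger the centroid rebuild from a script as well. The snippet below assumes the /api/rebuild_centroids endpoint mentioned above accepts a plain POST with no body; check the interactive docs at /docs for the exact method and parameters.

```python
import requests

# Method and parameters assumed - verify against http://localhost:8080/docs
response = requests.post("http://localhost:8080/api/rebuild_centroids")
response.raise_for_status()
print(response.json())
```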
- Full API documentation
- Technical details
- Report issues
- Web Interface Guide - Using the modern web UI
- API Reference - Complete API documentation with examples
- Home Assistant Integration - Detailed integration guide with Assist Pipeline
- AI Models Guide - Model selection, performance, and troubleshooting
- Solution Details - How it works under the hood
- Deployment Guide - Production deployment options
- Qdrant Database - Database management and backups
If you want to contribute or modify the code:
# Clone the repository
git clone https://github.com/YOUR-REPO/speaker-id.git
cd speaker-id
# Install dependencies
poetry install
# Run tests
poetry run pytest
# Start development server
poetry run uvicorn app.main:APP --reload

See solution_details.md or CONTRIBUTING.md for detailed development documentation.
Speaker-ID includes built-in Prometheus metrics at /metrics:
- Request counts and latencies
- Identification success/failure rates
- Per-speaker identification counts
Perfect for Grafana dashboards!
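To spot-check what is exported before wiring up a Prometheus scrape job, you can fetch the metrics endpoint directly. This small sketch only prints the metric descriptions; the specific metric names depend on the build, so none are assumed here.

```python
import requests

metrics = requests.get("http://localhost:8080/metrics")
metrics.raise_for_status()
for line in metrics.text.splitlines():
    if line.startswith("# HELP"):
        print(line)  # one line per exported metric, with its description
```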
- Core speaker identification
- Docker deployment
- Prometheus metrics
- Web UI for testing
- Home Assistant Add-on (maybe)
- Multi-language support (if requested)
- Real-time audio streaming
- Speaker diarization (who spoke when)
- MCP Server for allowing additional integrations
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
Check CONTRIBUTING.md for details
MIT License - see LICENSE for details
- Resemblyzer - Voice encoding
- SpeechBrain - ECAPA-TDNN model
- Qdrant - Vector database
- FastAPI - API framework
Made with ❤️ for the Home Assistant community

