A Model Context Protocol (MCP) server for general-purpose audio analysis. Provides transcription, speaker diarization, prosody analysis, speech pattern detection, and sentiment analysis as composable tools for Claude and other MCP clients.
- Transcription — Word-level timestamps using OpenAI Whisper (tiny through large-v3)
- Speaker Diarization — Identify and label individual speakers using pyannote.audio
- Prosody Analysis — Pitch, energy, and speaking pace extracted via librosa
- Speech Pattern Detection — Pause detection, filler word counting, and speaker overlap identification
- Sentiment Analysis — Per-segment sentiment scoring using HuggingFace Transformers
- Full Pipeline — A single `full_analysis` call runs all enabled stages end-to-end
- GPU Acceleration — CUDA-accelerated inference with automatic CPU fallback
- Low-VRAM Mode — Sequential model loading keeps peak VRAM under ~2 GB per step (supports 4–6 GB GPUs)
- Feature Flags — Enable only the capabilities you need to reduce resource usage
The server is built on FastMCP using Streamable HTTP transport. Each analysis capability is implemented as an independent tool backed by a dedicated processor, with a clean separation between the MCP interface layer and the ML inference layer. See docs/ARCHITECTURE.md for the component diagram and design decisions.
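The interface/processor split described above can be sketched as follows. This is an illustrative skeleton only, assuming hypothetical names (`Processor`, `TranscriptionProcessor`, `register_tool`, `call_tool`) that are not the server's actual identifiers:

```python
from dataclasses import dataclass
from typing import Protocol


class Processor(Protocol):
    """ML inference layer: one dedicated processor per capability."""
    def run(self, audio_path: str) -> dict: ...


@dataclass
class TranscriptionProcessor:
    model_size: str = "large-v3"

    def run(self, audio_path: str) -> dict:
        # A real implementation would invoke Whisper here.
        return {"tool": "transcribe_audio", "file": audio_path, "segments": []}


# MCP interface layer: tools are thin wrappers that dispatch to processors.
TOOL_REGISTRY: dict[str, Processor] = {}


def register_tool(name: str, processor: Processor) -> None:
    TOOL_REGISTRY[name] = processor


def call_tool(name: str, audio_path: str) -> dict:
    return TOOL_REGISTRY[name].run(audio_path)


register_tool("transcribe_audio", TranscriptionProcessor(model_size="medium"))
result = call_tool("transcribe_audio", "meeting.wav")
```

Keeping tools as thin dispatchers means the MCP layer can be tested without loading any ML model.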
| Tool | Description |
|---|---|
| `health_check` | Server health, GPU status, and feature availability |
| `feature_status` | Detailed status of all optional features and their models |
| `full_analysis` | Primary tool — complete pipeline: transcription + diarization + prosody + patterns + sentiment |
| `transcribe_audio` | Whisper speech-to-text with word-level timestamps |
| `identify_speakers` | pyannote.audio speaker diarization |
| `analyze_tone` | Prosody analysis: pitch, energy, and pace |
| `detect_speech_patterns` | Pauses, filler words, and speaker overlaps |
| `analyze_sentiment` | Per-segment sentiment analysis |
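Over the Streamable HTTP transport, clients invoke these tools with a standard MCP `tools/call` JSON-RPC request. The argument name `file_path` below is illustrative, not confirmed by this project; use `tools/list` to discover each tool's actual input schema:

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "transcribe_audio",
    "arguments": { "file_path": "/data/meeting.wav" }
  }
}
```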
- Python 3.10+
- ffmpeg (`apt install ffmpeg` or `brew install ffmpeg`)
- NVIDIA GPU recommended; CPU fallback available
- HuggingFace token (required for speaker diarization)
```bash
git clone https://github.com/krisoye/audio-analysis-mcp.git
cd audio-analysis-mcp
pip install -e ".[full]"

export HF_TOKEN="hf_your_token"      # Required for speaker diarization
export WHISPER_MODEL=medium          # tiny / base / small / medium / large-v3
export DEPLOYMENT_MODE=development

python -m audio_analysis_mcp
# Server starts at http://localhost:8420
```

Register the server with Claude Code:

```bash
claude mcp add --transport http audio-analysis http://localhost:8420/mcp -s user
```

Key environment variables:
| Variable | Default | Description |
|---|---|---|
| `AUDIO_ANALYSIS_HOST` | `127.0.0.1` | Bind address (`0.0.0.0` for external access) |
| `AUDIO_ANALYSIS_PORT` | `8420` | Server port |
| `WHISPER_MODEL` | `large-v3` | Model size: tiny, base, small, medium, large-v3 |
| `HF_TOKEN` | — | HuggingFace token (required for diarization) |
| `LOW_VRAM_MODE` | `false` | Sequential model loading for 4–6 GB GPUs |
| `ENABLE_TRANSCRIPTION` | `true` | Enable/disable Whisper transcription |
| `ENABLE_DIARIZATION` | `true` | Enable/disable speaker diarization |
| `ENABLE_SENTIMENT` | `true` | Enable/disable sentiment analysis |
| `ENABLE_PROSODY` | `true` | Enable/disable prosody analysis |
| `ENABLE_PATTERNS` | `true` | Enable/disable speech pattern detection |
See docs/CONFIGURATION.md for the complete reference including cache settings, timeouts, and deployment examples.
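The server's settings are built on Pydantic (see the dependency table below), but the boolean feature-flag handling can be sketched with the standard library alone. The helper name `env_flag` and its lenient fallback behavior are illustrative assumptions, not the server's actual parsing rules:

```python
import os


def env_flag(name: str, default: bool = True) -> bool:
    """Parse a boolean feature flag from the environment.

    Accepts common truthy/falsy spellings; anything else falls back
    to the default so a typo doesn't silently disable a stage.
    """
    raw = os.environ.get(name)
    if raw is None:
        return default
    value = raw.strip().lower()
    if value in {"1", "true", "yes", "on"}:
        return True
    if value in {"0", "false", "no", "off"}:
        return False
    return default


os.environ.pop("ENABLE_TRANSCRIPTION", None)   # unset -> default applies
os.environ["ENABLE_DIARIZATION"] = "false"     # explicitly disabled
flags = {
    "transcription": env_flag("ENABLE_TRANSCRIPTION"),
    "diarization": env_flag("ENABLE_DIARIZATION"),
}
```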
| Mode | Peak VRAM | Speed | Best For |
|---|---|---|---|
| Default (`LOW_VRAM_MODE=false`) | ~4–6 GB | Fastest | GPUs with 8+ GB VRAM |
| Low-VRAM (`LOW_VRAM_MODE=true`) | ~2 GB per step | Slower (sequential) | GPUs with 4–6 GB VRAM |
With LOW_VRAM_MODE=true, each analysis step loads its model, runs inference, unloads the model, and frees GPU memory before the next step begins. This adds ~30–60 seconds of overhead but eliminates out-of-memory errors on constrained hardware.
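The load → infer → unload cycle can be sketched as below. The `load_model`/`unload` functions stand in for the real Whisper/pyannote loaders (hypothetical names), and a CUDA build would additionally call `torch.cuda.empty_cache()` after each unload:

```python
import gc

loaded: list[str] = []   # stand-in for models resident in GPU memory
peak_resident = 0


def load_model(name: str) -> str:
    global peak_resident
    loaded.append(name)
    peak_resident = max(peak_resident, len(loaded))
    return name


def unload(name: str) -> None:
    loaded.remove(name)
    gc.collect()  # real code would also call torch.cuda.empty_cache()


def run_stage(name: str, results: dict) -> None:
    model = load_model(name)
    results[name] = f"ran {model}"   # inference happens here
    unload(model)                    # free memory before the next stage loads


results: dict[str, str] = {}
for stage in ["whisper", "pyannote", "sentiment"]:
    run_stage(stage, results)
```

Because each stage unloads before the next loads, at most one model is ever resident, which is what keeps peak VRAM bounded per step rather than summed across stages.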
```bash
# Run tests (440+ tests, 15 test files)
pytest

# Lint
ruff check src/

# Format
ruff format src/

# Type check
mypy src/
```

CI runs automatically on Python 3.10, 3.11, and 3.12 via GitHub Actions.
| Library | Purpose |
|---|---|
| FastMCP | MCP server framework (Streamable HTTP) |
| OpenAI Whisper | Speech-to-text with word-level timestamps |
| pyannote.audio | Speaker diarization |
| librosa | Prosody analysis (pitch, energy, pace) |
| HuggingFace Transformers | Sentiment analysis |
| PyTorch | GPU inference backend |
| Pydantic | Data models and settings |
MIT — see LICENSE.