
# Audio Analysis MCP Server


A Model Context Protocol (MCP) server for general-purpose audio analysis. Provides transcription, speaker diarization, prosody analysis, speech pattern detection, and sentiment analysis as composable tools for Claude and other MCP clients.

## Features

- **Transcription** — word-level timestamps using OpenAI Whisper (`tiny` through `large-v3`)
- **Speaker Diarization** — identify and label individual speakers using pyannote.audio
- **Prosody Analysis** — pitch, energy, and speaking pace extracted via librosa
- **Speech Pattern Detection** — pause detection, filler word counting, and speaker overlap identification
- **Sentiment Analysis** — per-segment sentiment scoring using HuggingFace Transformers
- **Full Pipeline** — a single `full_analysis` call runs all enabled stages end-to-end
- **GPU Acceleration** — CUDA-accelerated inference with automatic CPU fallback
- **Low-VRAM Mode** — sequential model loading keeps peak VRAM under ~2 GB per step (supports 4–6 GB GPUs)
- **Feature Flags** — enable only the capabilities you need to reduce resource usage

## Architecture

The server is built on FastMCP using Streamable HTTP transport. Each analysis capability is implemented as an independent tool backed by a dedicated processor, with a clean separation between the MCP interface layer and the ML inference layer. See `docs/ARCHITECTURE.md` for the component diagram and design decisions.
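The interface/processor split described above can be sketched in a few lines. This is an illustrative stand-in, not the project's actual code: the class and function names are hypothetical, and the processor stubs out the real Whisper call.

```python
from dataclasses import dataclass

# Hypothetical processor in the ML inference layer: owns the model
# lifecycle and knows nothing about MCP.
@dataclass
class TranscriptionProcessor:
    model_size: str = "medium"

    def run(self, audio_path: str) -> dict:
        # A real implementation would invoke Whisper here; stubbed for illustration.
        return {"model": self.model_size, "segments": [], "source": audio_path}

# Hypothetical MCP interface layer: a thin tool function that validates
# input and delegates to the processor.
def transcribe_audio_tool(audio_path: str, processor: TranscriptionProcessor) -> dict:
    if not audio_path:
        raise ValueError("audio_path is required")
    return processor.run(audio_path)

result = transcribe_audio_tool("meeting.wav", TranscriptionProcessor())
print(result["model"])  # medium
```

Keeping the tool function free of ML concerns is what lets each capability be enabled, disabled, or swapped independently.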

## MCP Tools

| Tool | Description |
|------|-------------|
| `health_check` | Server health, GPU status, and feature availability |
| `feature_status` | Detailed status of all optional features and their models |
| `full_analysis` | **Primary tool** — complete pipeline: transcription + diarization + prosody + patterns + sentiment |
| `transcribe_audio` | Whisper speech-to-text with word-level timestamps |
| `identify_speakers` | pyannote.audio speaker diarization |
| `analyze_tone` | Prosody analysis: pitch, energy, and pace |
| `detect_speech_patterns` | Pauses, filler words, and speaker overlaps |
| `analyze_sentiment` | Per-segment sentiment analysis |
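Under the MCP protocol, each tool is invoked via a JSON-RPC `tools/call` request. The sketch below builds such a payload for `full_analysis`; the argument name `file_path` is an assumption for illustration — discover the real parameter schema with a `tools/list` request first.

```python
import json

# Hypothetical payload: the actual argument names come from the server's
# published tool schemas, not from this example.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "full_analysis",
        "arguments": {"file_path": "interview.wav"},
    },
}
body = json.dumps(request)
print(body)
```

An MCP client library normally constructs this for you; the raw shape is shown only to make the tool boundary concrete.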

## Quick Start

### Prerequisites

- Python 3.10+
- `ffmpeg` (`apt install ffmpeg` or `brew install ffmpeg`)
- NVIDIA GPU recommended; CPU fallback available
- HuggingFace token (required for speaker diarization)
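A quick way to verify the prerequisites before installing is a small stdlib check like the one below (a hypothetical helper, not part of the project):

```python
import shutil
import sys

def check_prerequisites() -> list:
    """Return a list of missing prerequisites (empty list means ready)."""
    missing = []
    if shutil.which("ffmpeg") is None:  # ffmpeg must be on PATH
        missing.append("ffmpeg")
    if sys.version_info < (3, 10):      # server requires Python 3.10+
        missing.append("Python 3.10+")
    return missing

print(check_prerequisites())
```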

### Installation

```bash
git clone https://github.com/krisoye/audio-analysis-mcp.git
cd audio-analysis-mcp
pip install -e ".[full]"
```

### Configuration

```bash
export HF_TOKEN="hf_your_token"        # Required for speaker diarization
export WHISPER_MODEL=medium            # tiny / base / small / medium / large-v3
export DEPLOYMENT_MODE=development
```
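The server reads these variables through Pydantic settings (see Tech Stack); the stdlib sketch below only illustrates the same pattern, with fallbacks mirroring the documented defaults.

```python
import os

# Illustrative only — not the server's actual settings code.
whisper_model = os.environ.get("WHISPER_MODEL", "large-v3")
hf_token = os.environ.get("HF_TOKEN")  # required for speaker diarization
low_vram = os.environ.get("LOW_VRAM_MODE", "false").lower() == "true"

print(whisper_model, low_vram)
```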

### Run

```bash
python -m audio_analysis_mcp
# Server starts at http://localhost:8420
```

### Register with Claude Code

```bash
claude mcp add --transport http audio-analysis http://localhost:8420/mcp -s user
```
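For MCP clients configured via a JSON file rather than a CLI, a roughly equivalent entry might look like the fragment below. The exact key names vary by client and version, so treat this as a sketch rather than a definitive config.

```json
{
  "mcpServers": {
    "audio-analysis": {
      "type": "http",
      "url": "http://localhost:8420/mcp"
    }
  }
}
```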

## Configuration

Key environment variables:

| Variable | Default | Description |
|----------|---------|-------------|
| `AUDIO_ANALYSIS_HOST` | `127.0.0.1` | Bind address (`0.0.0.0` for external access) |
| `AUDIO_ANALYSIS_PORT` | `8420` | Server port |
| `WHISPER_MODEL` | `large-v3` | Model size: `tiny`, `base`, `small`, `medium`, `large-v3` |
| `HF_TOKEN` | (none) | HuggingFace token (required for diarization) |
| `LOW_VRAM_MODE` | `false` | Sequential model loading for 4–6 GB GPUs |
| `ENABLE_TRANSCRIPTION` | `true` | Enable/disable Whisper transcription |
| `ENABLE_DIARIZATION` | `true` | Enable/disable speaker diarization |
| `ENABLE_SENTIMENT` | `true` | Enable/disable sentiment analysis |
| `ENABLE_PROSODY` | `true` | Enable/disable prosody analysis |
| `ENABLE_PATTERNS` | `true` | Enable/disable speech pattern detection |

See `docs/CONFIGURATION.md` for the complete reference including cache settings, timeouts, and deployment examples.

## GPU Support

| Mode | Peak VRAM | Speed | Best For |
|------|-----------|-------|----------|
| Default (`LOW_VRAM_MODE=false`) | ~4–6 GB | Fastest | GPUs with 8+ GB VRAM |
| Low-VRAM (`LOW_VRAM_MODE=true`) | ~2 GB per step | Slower (sequential) | GPUs with 4–6 GB VRAM |

With LOW_VRAM_MODE=true, each analysis step loads its model, runs inference, unloads the model, and frees GPU memory before the next step begins. This adds ~30–60 seconds of overhead but eliminates out-of-memory errors on constrained hardware.
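The load-run-unload cycle can be sketched as below. The classes and step names are hypothetical stand-ins; real code would load actual models and, on GPU, also release the CUDA cache after each step.

```python
import gc

class StubModel:
    """Stand-in for a GPU model; real code would load Whisper/pyannote here."""
    def __init__(self, name: str):
        self.name = name

    def infer(self, audio: str) -> str:
        return f"{self.name} processed {audio}"

def run_step(name: str, audio: str) -> str:
    model = StubModel(name)        # load model (allocates VRAM in real code)
    try:
        return model.infer(audio)  # run inference
    finally:
        del model                  # unload model before the next step
        gc.collect()               # free memory
        # Real GPU code would also call torch.cuda.empty_cache() here.

results = [run_step(step, "call.wav")
           for step in ("transcription", "diarization", "prosody")]
print(results)
```

Because only one model is resident at a time, peak memory is bounded by the largest single model rather than the sum of all of them.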

## Development

```bash
# Run tests (440+ tests across 15 test files)
pytest

# Lint
ruff check src/

# Format
ruff format src/

# Type check
mypy src/
```

CI runs automatically on Python 3.10, 3.11, and 3.12 via GitHub Actions.

## Tech Stack

| Library | Purpose |
|---------|---------|
| FastMCP | MCP server framework (Streamable HTTP) |
| OpenAI Whisper | Speech-to-text with word-level timestamps |
| pyannote.audio | Speaker diarization |
| librosa | Prosody analysis (pitch, energy, pace) |
| HuggingFace Transformers | Sentiment analysis |
| PyTorch | GPU inference backend |
| Pydantic | Data models and settings |
## License

MIT — see LICENSE.
