A Model Context Protocol (MCP) server for general-purpose audio analysis. Provides transcription, speaker diarization, prosody analysis, speech pattern detection, and sentiment analysis as composable tools for Claude and other MCP clients.
- Transcription — Word-level timestamps using OpenAI Whisper (tiny through large-v3)
- Speaker Diarization — Identify and label individual speakers using pyannote.audio
- Prosody Analysis — Pitch, energy, and speaking pace extracted via librosa
- Speech Pattern Detection — Pause detection, filler word counting, and speaker overlap identification
- Sentiment Analysis — Per-segment sentiment scoring using HuggingFace Transformers
- Full Pipeline — A single `full_analysis` call runs all enabled stages end-to-end
- GPU Acceleration — CUDA-accelerated inference with automatic CPU fallback
- Low-VRAM Mode — Sequential model loading keeps peak VRAM under ~2 GB per step (supports 4–6 GB GPUs)
- Feature Flags — Enable only the capabilities you need to reduce resource usage
The server is built on FastMCP using Streamable HTTP transport. Each analysis capability is implemented as an independent tool backed by a dedicated processor, with a clean separation between the MCP interface layer and the ML inference layer. See docs/ARCHITECTURE.md for the component diagram and design decisions.
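The interface/processor split described above can be sketched as follows. This is an illustrative skeleton only, assuming hypothetical names (`Processor`, `TranscriptionProcessor`, `register_tool`, `call_tool`) that are not the server's actual identifiers:

```python
from dataclasses import dataclass
from typing import Protocol


class Processor(Protocol):
    """ML inference layer: one dedicated processor per capability."""
    def run(self, audio_path: str) -> dict: ...


@dataclass
class TranscriptionProcessor:
    model_size: str = "large-v3"

    def run(self, audio_path: str) -> dict:
        # A real implementation would invoke Whisper here.
        return {"tool": "transcribe_audio", "file": audio_path, "segments": []}


# MCP interface layer: tools are thin wrappers that dispatch to processors.
TOOL_REGISTRY: dict[str, Processor] = {}


def register_tool(name: str, processor: Processor) -> None:
    TOOL_REGISTRY[name] = processor


def call_tool(name: str, audio_path: str) -> dict:
    return TOOL_REGISTRY[name].run(audio_path)


register_tool("transcribe_audio", TranscriptionProcessor(model_size="medium"))
result = call_tool("transcribe_audio", "meeting.wav")
```

Keeping tools as thin dispatchers means the MCP layer can be tested without loading any ML model.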
| Tool | Description |
|---|---|
| `health_check` | Server health, GPU status, and feature availability |
| `feature_status` | Detailed status of all optional features and their models |
| `full_analysis` | Primary tool — complete pipeline: transcription + diarization + prosody + patterns + sentiment |
| `transcribe_audio` | Whisper speech-to-text with word-level timestamps |
| `identify_speakers` | pyannote.audio speaker diarization |
| `analyze_tone` | Prosody analysis: pitch, energy, and pace |
| `detect_speech_patterns` | Pauses, filler words, and speaker overlaps |
| `analyze_sentiment` | Per-segment sentiment analysis |
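Over the Streamable HTTP transport, clients invoke these tools with a standard MCP `tools/call` JSON-RPC request. The argument name `file_path` below is illustrative, not confirmed by this project; use `tools/list` to discover each tool's actual input schema:

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "transcribe_audio",
    "arguments": { "file_path": "/data/meeting.wav" }
  }
}
```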
- Python 3.10+
- ffmpeg (`apt install ffmpeg` or `brew install ffmpeg`)
- NVIDIA GPU recommended; CPU fallback available
- HuggingFace token (required for speaker diarization)
```bash
git clone https://github.com/krisoye/audio-analysis-mcp.git
cd audio-analysis-mcp
pip install -e ".[full]"

export HF_TOKEN="hf_your_token"      # Required for speaker diarization
export WHISPER_MODEL=medium          # tiny / base / small / medium / large-v3
export DEPLOYMENT_MODE=development

python -m audio_analysis_mcp
# Server starts at http://localhost:8420
```

Register the server with Claude Code:

```bash
claude mcp add --transport http audio-analysis http://localhost:8420/mcp -s user
```

Key environment variables:
| Variable | Default | Description |
|---|---|---|
| `AUDIO_ANALYSIS_HOST` | `127.0.0.1` | Bind address (`0.0.0.0` for external access) |
| `AUDIO_ANALYSIS_PORT` | `8420` | Server port |
| `WHISPER_MODEL` | `large-v3` | Model size: tiny, base, small, medium, large-v3 |
| `HF_TOKEN` | — | HuggingFace token (required for diarization) |
| `LOW_VRAM_MODE` | `false` | Sequential model loading for 4–6 GB GPUs |
| `ENABLE_TRANSCRIPTION` | `true` | Enable/disable Whisper transcription |
| `ENABLE_DIARIZATION` | `true` | Enable/disable speaker diarization |
| `ENABLE_SENTIMENT` | `true` | Enable/disable sentiment analysis |
| `ENABLE_PROSODY` | `true` | Enable/disable prosody analysis |
| `ENABLE_PATTERNS` | `true` | Enable/disable speech pattern detection |
See docs/CONFIGURATION.md for the complete reference including cache settings, timeouts, and deployment examples.
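The server's settings are built on Pydantic (see the dependency table below), but the boolean feature-flag handling can be sketched with the standard library alone. The helper name `env_flag` and its lenient fallback behavior are illustrative assumptions, not the server's actual parsing rules:

```python
import os


def env_flag(name: str, default: bool = True) -> bool:
    """Parse a boolean feature flag from the environment.

    Accepts common truthy/falsy spellings; anything else falls back
    to the default so a typo doesn't silently disable a stage.
    """
    raw = os.environ.get(name)
    if raw is None:
        return default
    value = raw.strip().lower()
    if value in {"1", "true", "yes", "on"}:
        return True
    if value in {"0", "false", "no", "off"}:
        return False
    return default


os.environ.pop("ENABLE_TRANSCRIPTION", None)   # unset -> default applies
os.environ["ENABLE_DIARIZATION"] = "false"     # explicitly disabled
flags = {
    "transcription": env_flag("ENABLE_TRANSCRIPTION"),
    "diarization": env_flag("ENABLE_DIARIZATION"),
}
```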
| Mode | Peak VRAM | Speed | Best For |
|---|---|---|---|
| Default (`LOW_VRAM_MODE=false`) | ~4–6 GB | Fastest | GPUs with 8+ GB VRAM |
| Low-VRAM (`LOW_VRAM_MODE=true`) | ~2 GB per step | Slower (sequential) | GPUs with 4–6 GB VRAM |
With LOW_VRAM_MODE=true, each analysis step loads its model, runs inference, unloads the model, and frees GPU memory before the next step begins. This adds ~30–60 seconds of overhead but eliminates out-of-memory errors on constrained hardware.
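The load → infer → unload cycle can be sketched as below. The `load_model`/`unload` functions stand in for the real Whisper/pyannote loaders (hypothetical names), and a CUDA build would additionally call `torch.cuda.empty_cache()` after each unload:

```python
import gc

loaded: list[str] = []   # stand-in for models resident in GPU memory
peak_resident = 0


def load_model(name: str) -> str:
    global peak_resident
    loaded.append(name)
    peak_resident = max(peak_resident, len(loaded))
    return name


def unload(name: str) -> None:
    loaded.remove(name)
    gc.collect()  # real code would also call torch.cuda.empty_cache()


def run_stage(name: str, results: dict) -> None:
    model = load_model(name)
    results[name] = f"ran {model}"   # inference happens here
    unload(model)                    # free memory before the next stage loads


results: dict[str, str] = {}
for stage in ["whisper", "pyannote", "sentiment"]:
    run_stage(stage, results)
```

Because each stage unloads before the next loads, at most one model is ever resident, which is what keeps peak VRAM bounded per step rather than summed across stages.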
```bash
# Run tests (440+ tests, 15 test files)
pytest

# Lint
ruff check src/

# Format
ruff format src/

# Type check
mypy src/
```

CI runs automatically on Python 3.10, 3.11, and 3.12 via GitHub Actions.
| Library | Purpose |
|---|---|
| FastMCP | MCP server framework (Streamable HTTP) |
| OpenAI Whisper | Speech-to-text with word-level timestamps |
| pyannote.audio | Speaker diarization |
| librosa | Prosody analysis (pitch, energy, pace) |
| HuggingFace Transformers | Sentiment analysis |
| PyTorch | GPU inference backend |
| Pydantic | Data models and settings |
MIT — see LICENSE.