28 changes: 28 additions & 0 deletions .env.example
@@ -140,6 +140,34 @@ QUICK_ACTIONS_TIMEOUT=120
# Git operations timeout in seconds
GIT_OPERATIONS_TIMEOUT=30

# === VOICE TRANSCRIPTION ===
# Enable voice message transcription
ENABLE_VOICE_MESSAGES=true

# Voice transcription provider: mistral, openai, or local
# - mistral: Uses Mistral Voxtral (requires MISTRAL_API_KEY)
# - openai: Uses OpenAI Whisper API (requires OPENAI_API_KEY)
# - local: Uses whisper.cpp binary (requires ffmpeg + whisper.cpp installed)
VOICE_PROVIDER=mistral

# API keys (only needed for cloud providers)
MISTRAL_API_KEY=
OPENAI_API_KEY=

# Override transcription model (optional)
# Defaults: voxtral-mini-latest (mistral), whisper-1 (openai), base (local)
VOICE_TRANSCRIPTION_MODEL=

# Maximum voice message size in MB
VOICE_MAX_FILE_SIZE_MB=20

# Local whisper.cpp settings (only used when VOICE_PROVIDER=local)
# Path to whisper.cpp binary (auto-detected from PATH if unset)
WHISPER_CPP_BINARY_PATH=
# Path to GGML model file, or model name like "base", "small", "medium"
# Named models resolve to ~/.cache/whisper-cpp/ggml-{name}.bin
WHISPER_CPP_MODEL_PATH=base

# === PROJECT THREAD MODE ===
# Enable strict routing by Telegram project topics
ENABLE_PROJECT_THREADS=false
2 changes: 1 addition & 1 deletion CLAUDE.md
@@ -102,7 +102,7 @@ Multi-project topics: `ENABLE_PROJECT_THREADS` (default false), `PROJECT_THREADS`

Output verbosity: `VERBOSE_LEVEL` (default 1, range 0-2). Controls how much of Claude's background activity is shown to the user in real-time. 0 = quiet (only final response, typing indicator still active), 1 = normal (tool names + reasoning snippets shown during execution), 2 = detailed (tool names with input summaries + longer reasoning text). Users can override per-session via `/verbose 0|1|2`. A persistent typing indicator is refreshed every ~2 seconds at all levels.

-Voice transcription: `ENABLE_VOICE_MESSAGES` (default true), `VOICE_PROVIDER` (`mistral`|`openai`, default `mistral`), `MISTRAL_API_KEY`, `OPENAI_API_KEY`, `VOICE_TRANSCRIPTION_MODEL`. Provider implementation is in `src/bot/features/voice_handler.py`.
+Voice transcription: `ENABLE_VOICE_MESSAGES` (default true), `VOICE_PROVIDER` (`mistral`|`openai`|`local`, default `mistral`), `MISTRAL_API_KEY`, `OPENAI_API_KEY`, `VOICE_TRANSCRIPTION_MODEL`. For local provider: `WHISPER_CPP_BINARY_PATH`, `WHISPER_CPP_MODEL_PATH` (requires ffmpeg + whisper.cpp installed). Provider implementation is in `src/bot/features/voice_handler.py`.

Feature flags in `src/config/features.py` control: MCP, git integration, file uploads, quick actions, session export, image uploads, voice messages, conversation mode, agentic mode, API server, scheduler.

2 changes: 1 addition & 1 deletion README.md
@@ -194,7 +194,7 @@ Enable with `ENABLE_API_SERVER=true` and `ENABLE_SCHEDULER=true`. See [docs/setu
- Directory sandboxing with path traversal prevention
- File upload handling with archive extraction
- Image/screenshot upload with analysis
-- Voice message transcription (Mistral Voxtral / OpenAI Whisper)
+- Voice message transcription (Mistral Voxtral / OpenAI Whisper / [local whisper.cpp](docs/local-whisper-cpp.md))
- Git integration with safe repository operations
- Quick actions system with context-aware buttons
- Session export in Markdown, HTML, and JSON formats
6 changes: 5 additions & 1 deletion docs/configuration.md
@@ -135,11 +135,15 @@ ENABLE_QUICK_ACTIONS=true

# Enable voice message transcription
ENABLE_VOICE_MESSAGES=true
-VOICE_PROVIDER=mistral # 'mistral' (default) or 'openai'
+VOICE_PROVIDER=mistral # 'mistral', 'openai', or 'local'
MISTRAL_API_KEY= # Required when VOICE_PROVIDER=mistral
OPENAI_API_KEY= # Required when VOICE_PROVIDER=openai
VOICE_TRANSCRIPTION_MODEL= # Default: voxtral-mini-latest (Mistral) or whisper-1 (OpenAI)
VOICE_MAX_FILE_SIZE_MB=20 # Max Telegram voice file size to download (1-200MB)

# Local whisper.cpp settings (only used when VOICE_PROVIDER=local)
WHISPER_CPP_BINARY_PATH= # Path to whisper.cpp binary (auto-detected from PATH if unset)
WHISPER_CPP_MODEL_PATH=base # Path to GGML model file or model name (base, small, medium, large)
```

#### Agentic Platform
170 changes: 170 additions & 0 deletions docs/local-whisper-cpp.md
@@ -0,0 +1,170 @@
# Local Voice Transcription with whisper.cpp

This guide explains how to build and configure [whisper.cpp](https://github.com/ggerganov/whisper.cpp) for **offline** voice message transcription — no API keys or cloud services required.

## Overview

When `VOICE_PROVIDER=local`, the bot transcribes Telegram voice messages entirely on your machine using:

| Component | Purpose |
|---|---|
| **ffmpeg** | Converts Telegram OGG/Opus audio to 16 kHz mono WAV |
| **whisper.cpp** | Runs OpenAI's Whisper model locally via optimised C/C++ |
| **GGML model** | Quantised model weights (downloaded once) |
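The three components chain together roughly like this. A minimal sketch assuming a `whisper-cli`-style binary as built later in this guide; it is illustrative only, not the bot's actual `voice_handler.py` code:

```python
import subprocess
import tempfile
from pathlib import Path

def ffmpeg_cmd(src: str, dst: str) -> list[str]:
    # -ar 16000 / -ac 1: 16 kHz mono WAV, the input format whisper.cpp expects
    return ["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1", dst]

def whisper_cmd(binary: str, model: str, wav: str) -> list[str]:
    # --no-timestamps keeps stdout as plain transcript text
    return [binary, "-m", model, "-f", wav, "--no-timestamps"]

def transcribe(ogg_path: str, model_path: str, binary: str = "whisper-cpp") -> str:
    """Convert a Telegram OGG/Opus file to WAV, then run whisper.cpp on it."""
    with tempfile.TemporaryDirectory() as tmp:
        wav = str(Path(tmp) / "audio.wav")
        subprocess.run(ffmpeg_cmd(ogg_path, wav), check=True, capture_output=True)
        out = subprocess.run(
            whisper_cmd(binary, model_path, wav),
            check=True, capture_output=True, text=True,
        )
        return out.stdout.strip()
```

Both steps are plain subprocess calls, so any ffmpeg and whisper.cpp build that accepts these flags will work.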

## Prerequisites

- A C/C++ toolchain (`gcc`/`clang`, `cmake`, `make`)
- `ffmpeg` installed and on PATH
- Disk space for a model: ~142 MB for `base`, ~1.5 GB for `medium` (see the model table below)

## 1. Install ffmpeg

### Ubuntu / Debian

```bash
sudo apt update && sudo apt install -y ffmpeg
```

### macOS (Homebrew)

```bash
brew install ffmpeg
```

### Alpine

```bash
apk add ffmpeg
```

Verify:

```bash
ffmpeg -version
```

## 2. Build whisper.cpp from source

```bash
# Clone the repository
git clone https://github.com/ggerganov/whisper.cpp.git
cd whisper.cpp

# Build with CMake (recommended)
cmake -B build
cmake --build build --config Release

# The binary is at build/bin/whisper-cli (or build/bin/main on older versions)
ls build/bin/whisper-cli
```

> **Tip:** For GPU acceleration add `-DWHISPER_CUBLAS=ON` (NVIDIA) or `-DWHISPER_METAL=ON` (Apple Silicon) to the cmake configure step; on recent whisper.cpp releases the NVIDIA flag is `-DGGML_CUDA=ON`, and Metal is enabled by default on Apple Silicon.

### Install system-wide (optional)

```bash
sudo cp build/bin/whisper-cli /usr/local/bin/whisper-cpp
```

Or add the build directory to your `PATH`:

```bash
export PATH="$PWD/build/bin:$PATH"
```

## 3. Download a GGML model

Models are hosted on Hugging Face. Pick one based on your hardware:

| Model | Size | RAM (approx.) | Quality |
|---|---|---|---|
| `tiny` | ~75 MB | ~400 MB | Fast but lower accuracy |
| `base` | ~142 MB | ~500 MB | Good balance (default) |
| `small` | ~466 MB | ~1 GB | Better accuracy |
| `medium` | ~1.5 GB | ~2.5 GB | High accuracy |
| `large-v3` | ~3 GB | ~5 GB | Best accuracy, slow on CPU |

```bash
# Create the model cache directory
mkdir -p ~/.cache/whisper-cpp

# Download the base model (recommended starting point)
curl -L -o ~/.cache/whisper-cpp/ggml-base.bin \
https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.bin

# Or download small for better accuracy
curl -L -o ~/.cache/whisper-cpp/ggml-small.bin \
https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-small.bin
```

## 4. Configure the bot

Add the following to your `.env`:

```bash
# Enable voice transcription with local provider
ENABLE_VOICE_MESSAGES=true
VOICE_PROVIDER=local

# Path to the whisper.cpp binary (omit if already on PATH as "whisper-cpp")
WHISPER_CPP_BINARY_PATH=/usr/local/bin/whisper-cpp

# Model: a name like "base", "small", "medium" or a full file path
# Named models resolve to ~/.cache/whisper-cpp/ggml-{name}.bin
WHISPER_CPP_MODEL_PATH=base
```
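The name-to-path rule described above can be sketched as follows. This is an illustrative assumption based on the documented behaviour; the authoritative logic lives in `src/bot/features/voice_handler.py`:

```python
from pathlib import Path

# Assumed default cache location for named models
CACHE_DIR = Path.home() / ".cache" / "whisper-cpp"

def resolve_model(value: str) -> Path:
    """Resolve WHISPER_CPP_MODEL_PATH to a GGML file path.

    A bare name like "base" maps into the cache directory;
    anything that looks like a file path is used as-is.
    """
    if "/" in value or value.endswith(".bin"):
        return Path(value).expanduser()
    return CACHE_DIR / f"ggml-{value}.bin"
```

So `WHISPER_CPP_MODEL_PATH=base` and `WHISPER_CPP_MODEL_PATH=~/.cache/whisper-cpp/ggml-base.bin` would point at the same file.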

### Minimal configuration

If `whisper-cpp` is on your PATH and you downloaded the `base` model to the default location, you only need:

```bash
VOICE_PROVIDER=local
```

## 5. Verify the setup

```bash
# Test ffmpeg conversion
ffmpeg -f lavfi -i "sine=frequency=440:duration=2" -ar 16000 -ac 1 /tmp/test.wav -y

# Test whisper.cpp
whisper-cpp -m ~/.cache/whisper-cpp/ggml-base.bin -f /tmp/test.wav --no-timestamps
```

You should see a transcription attempt (it will be empty or nonsensical for a sine wave, but the binary should run without errors).

## Troubleshooting

### `whisper.cpp binary not found on PATH`

The bot could not locate the binary. Either:
- Install it system-wide: `sudo cp build/bin/whisper-cli /usr/local/bin/whisper-cpp`
- Or set the full path: `WHISPER_CPP_BINARY_PATH=/path/to/whisper-cli`

### `whisper.cpp model not found`

The model file does not exist at the expected path. Download it:

```bash
mkdir -p ~/.cache/whisper-cpp
curl -L -o ~/.cache/whisper-cpp/ggml-base.bin \
https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-base.bin
```

### `ffmpeg is required but was not found`

Install ffmpeg for your platform (see step 1 above).

### Poor transcription quality

- Try a larger model (`small` or `medium` instead of `base`)
- Ensure audio is not too short (< 1 second) or too noisy
- whisper.cpp uses `--language auto` by default; this works well for most languages

### High CPU usage / slow transcription

- Use a smaller model (`tiny` or `base`)
- Enable GPU acceleration when building whisper.cpp (CUDA / Metal)
- Consider using the `mistral` or `openai` cloud providers for faster results on low-powered machines
15 changes: 13 additions & 2 deletions docs/setup.md
@@ -197,12 +197,23 @@ VOICE_PROVIDER=openai
OPENAI_API_KEY=your-openai-api-key
```

-If you installed via pip/uv, make sure voice extras are installed:
**Local whisper.cpp (offline, no API key needed):**
```bash
VOICE_PROVIDER=local
# Optional — auto-detected from PATH if unset
WHISPER_CPP_BINARY_PATH=/usr/local/bin/whisper-cpp
# Model name ("base", "small", "medium") or full path to .bin file
WHISPER_CPP_MODEL_PATH=base
```

Requires `ffmpeg` and a locally built `whisper.cpp` binary. See the full [local whisper.cpp setup guide](local-whisper-cpp.md) for build instructions and model downloads.

+If you installed via pip/uv, make sure voice extras are installed (cloud providers only):
```bash
pip install "claude-code-telegram[voice]"
```

-Optionally override the transcription model with `VOICE_TRANSCRIPTION_MODEL` (defaults to `voxtral-mini-latest` for Mistral, `whisper-1` for OpenAI).
+Optionally override the transcription model with `VOICE_TRANSCRIPTION_MODEL` (defaults to `voxtral-mini-latest` for Mistral, `whisper-1` for OpenAI, `base` for local).

### Notification Recipients

8 changes: 6 additions & 2 deletions src/bot/features/registry.py
@@ -78,10 +78,14 @@ def _initialize_features(self):
except Exception as e:
logger.error("Failed to initialize image handler", error=str(e))

-        # Voice transcription - requires provider-specific API key
+        # Voice transcription - requires provider-specific API key (or local)
         voice_key_available = (
+            self.config.voice_provider == "local"
+        ) or (
             self.config.voice_provider == "openai" and self.config.openai_api_key
-        ) or (self.config.voice_provider == "mistral" and self.config.mistral_api_key)
+        ) or (
+            self.config.voice_provider == "mistral" and self.config.mistral_api_key
+        )
if self.config.enable_voice_messages and voice_key_available:
try:
self.features["voice_handler"] = VoiceHandler(config=self.config)
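Read as a table, the chained condition above maps each provider to the credential it requires. A hypothetical table-driven equivalent (`REQUIRED_KEY` and `voice_key_available` are illustrative names, not helpers from this repo):

```python
from types import SimpleNamespace

# Which config attribute each provider needs; None means no key required.
REQUIRED_KEY = {
    "local": None,
    "openai": "openai_api_key",
    "mistral": "mistral_api_key",
}

def voice_key_available(config) -> bool:
    if config.voice_provider not in REQUIRED_KEY:
        return False  # unknown provider
    attr = REQUIRED_KEY[config.voice_provider]
    return attr is None or bool(getattr(config, attr, None))

# local needs no key; mistral with an empty key is unavailable
print(voice_key_available(SimpleNamespace(voice_provider="local")))    # True
print(voice_key_available(SimpleNamespace(voice_provider="mistral",
                                          mistral_api_key="")))        # False
```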