diff --git a/README.md b/README.md
index b733fbf8..241477b5 100644
--- a/README.md
+++ b/README.md
@@ -1,318 +1,265 @@
-# MimiClaw: Pocket AI Assistant on a $5 Chip
+# reSpeaker-claw: Voice AI Agent for ReSpeaker XVF3800
-
-
-
-
-
-
-
-
+
+
+
+
+
English | 中文 | 日本語
-**The world's first AI assistant(OpenClaw) on a $5 chip. No Linux. No Node.js. Just pure C**
-
-MimiClaw turns a tiny ESP32-S3 board into a personal AI assistant. Plug it into USB power, connect to WiFi, and talk to it through Telegram — it handles any task you throw at it and evolves over time with local memory — all on a chip the size of a thumb.
+reSpeaker-claw turns a ReSpeaker XVF3800–based device into a voice-first AI agent. It captures audio over I2S, performs local VAD, sends utterances to STT, and processes them through an embedded agent loop. The system combines real-time speech interaction with local memory, tool calling, scheduling, heartbeat processes, OTA updates, and proxy support, and returns responses via TTS through the speaker.
-## Meet MimiClaw
+## Meet reSpeaker-claw
- **Tiny** — No Linux, no Node.js, no bloat — just pure C
-- **Handy** — Message it from Telegram, it handles the rest
- **Loyal** — Learns from memory, remembers across reboots
-- **Energetic** — USB power, 0.5 W, runs 24/7
-- **Lovable** — One ESP32-S3 board, $5, nothing else
+- **Energetic** — USB power, low power draw, runs 24/7
+- **Freedom** — ReSpeaker XVF3800's mic array + your choice of speaker amp/DAC
+- **Handy** — Built-in voice channel, no extra hardware needed beyond the XVF3800 and a speaker path
-## How It Works
+## Highlights
-
+- Voice input: ReSpeaker XVF3800 microphone array over I2S
+- Voice output: TTS audio download, WAV decode, resample, and speaker playback over I2S
+- Multi-channel agent: voice, Telegram, Feishu, WebSocket
+- Local persistence: SPIFFS stores memory, profiles, sessions, cron jobs, and daily notes
+- Compatible LLM backends: official Anthropic/OpenAI APIs or third-party gateways that expose Anthropic-compatible or OpenAI-compatible endpoints
+- Configurable STT/TTS: plug in your own service URL, API key, model, voice, and language
+- Runtime overrides: change WiFi, provider, model, API base, proxy, and tokens from the serial CLI without editing code
-You send a message on Telegram. The ESP32-S3 picks it up over WiFi, feeds it into an agent loop — the LLM thinks, calls tools, reads memory — and sends the reply back. Supports both **Anthropic (Claude)** and **OpenAI (GPT)** as providers, switchable at runtime. Everything runs on a single $5 chip with all your data stored locally on flash.
## Quick Start
-### What You Need
+### Requirements
-- An **ESP32-S3 dev board** with 16 MB flash and 8 MB PSRAM (e.g. Xiaozhi AI board, ~$10)
-- A **USB Type-C cable**
-- A **Telegram bot token** — talk to [@BotFather](https://t.me/BotFather) on Telegram to create one
-- An **Anthropic API key** — from [console.anthropic.com](https://console.anthropic.com), or an **OpenAI API key** — from [platform.openai.com](https://platform.openai.com)
+- A ReSpeaker XVF3800 USB 4-Mic Array with a XIAO ESP32S3 board
+- A speaker / DAC / amp path on I2S output
+- A USB cable for flashing and serial monitoring
+- WiFi access
+- ESP-IDF v5.5+
+- Optional: Telegram bot token if you want Telegram
+- Optional: Feishu app credentials if you want Feishu
+- One LLM API key for an Anthropic-compatible or OpenAI-compatible endpoint
+- One STT service and one TTS service for voice mode
-### Install
+### Clone and Build Environment
-```bash
-# You need ESP-IDF v5.5+ installed first:
-# https://docs.espressif.com/projects/esp-idf/en/v5.5.2/esp32s3/get-started/
+Refer to the official guide to flash the I2S firmware:
+[SeeedStudio wiki](https://wiki.seeedstudio.com/respeaker_xvf3800_introduction/#flash-firmware)
+
+Then clone this project and set the target:
-git clone https://github.com/memovai/mimiclaw.git
-cd mimiclaw
+```bash
+git clone https://github.com/Seeed-Projects/reSpeaker-claw
+cd reSpeaker-claw
idf.py set-target esp32s3
```
-
-Ubuntu Install
+ESP-IDF v5.5+ must be installed before running `idf.py`: [ESP-IDF Install](https://docs.espressif.com/projects/esp-idf/en/v5.5.3/esp32s3/get-started/)
-Recommended baseline:
-
-- Ubuntu 22.04/24.04
-- Python >= 3.10
-- CMake >= 3.16
-- Ninja >= 1.10
-- Git >= 2.34
-- flex >= 2.6
-- bison >= 3.8
-- gperf >= 3.1
-- dfu-util >= 0.11
-- `libusb-1.0-0`, `libffi-dev`, `libssl-dev`
-
-Install and build on Ubuntu:
+Ubuntu helper scripts:
```bash
-sudo apt-get update
-sudo apt-get install -y git wget flex bison gperf python3 python3-pip python3-venv \
- cmake ninja-build ccache libffi-dev libssl-dev dfu-util libusb-1.0-0
-
./scripts/setup_idf_ubuntu.sh
./scripts/build_ubuntu.sh
```
-
-
-
-macOS Install
-
-Recommended baseline:
-
-- macOS 12/13/14
-- Xcode Command Line Tools
-- Homebrew
-- Python >= 3.10
-- CMake >= 3.16
-- Ninja >= 1.10
-- Git >= 2.34
-- flex >= 2.6
-- bison >= 3.8
-- gperf >= 3.1
-- dfu-util >= 0.11
-- `libusb`, `libffi`, `openssl`
-
-Install and build on macOS:
+macOS helper scripts:
```bash
-xcode-select --install
-/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
-
./scripts/setup_idf_macos.sh
./scripts/build_macos.sh
```
-
-
-### Configure
+## Configure
-MimiClaw uses a **two-layer config** system: build-time defaults in `mimi_secrets.h`, with runtime overrides via the serial CLI. CLI values are stored in NVS flash and take priority over build-time values.
+Copy the example secrets file:
```bash
-cp main/mimi_secrets.h.example main/mimi_secrets.h
+cp "main/mimi_secrets.h.example" "main/mimi_secrets.h"
```
-Edit `main/mimi_secrets.h`:
+Edit `main/mimi_secrets.h` and set the fields you actually use:
```c
+/* WiFi */
#define MIMI_SECRET_WIFI_SSID "YourWiFiName"
#define MIMI_SECRET_WIFI_PASS "YourWiFiPassword"
-#define MIMI_SECRET_TG_TOKEN "123456:ABC-DEF1234ghIkl-zyx57W2v1u123ew11"
-#define MIMI_SECRET_API_KEY "sk-ant-api03-xxxxx"
-#define MIMI_SECRET_MODEL_PROVIDER "anthropic" // "anthropic" or "openai"
-#define MIMI_SECRET_SEARCH_KEY "" // optional: Brave Search API key
-#define MIMI_SECRET_TAVILY_KEY "" // optional: Tavily API key (preferred)
-#define MIMI_SECRET_PROXY_HOST "" // optional: e.g. "10.0.0.1"
-#define MIMI_SECRET_PROXY_PORT "" // optional: e.g. "7897"
-```
-
-Then build and flash:
-
-```bash
-# Clean build (required after any mimi_secrets.h change)
-idf.py fullclean && idf.py build
-
-# Find your serial port
-ls /dev/cu.usb* # macOS
-ls /dev/ttyACM* # Linux
-# Flash and monitor (replace PORT with your port)
-# USB adapter: likely /dev/cu.usbmodem11401 (macOS) or /dev/ttyACM0 (Linux)
-idf.py -p PORT flash monitor
+/* Optional text channels */
+#define MIMI_SECRET_TG_TOKEN ""
+#define MIMI_SECRET_FEISHU_APP_ID ""
+#define MIMI_SECRET_FEISHU_APP_SECRET ""
+
+/* LLM */
+#define MIMI_SECRET_API_KEY "your-llm-key"
+#define MIMI_SECRET_MODEL "your-model"
+#define MIMI_SECRET_MODEL_PROVIDER "openai" /* or "anthropic" */
+
+/* Search and proxy */
+#define MIMI_SECRET_TAVILY_KEY ""
+#define MIMI_SECRET_SEARCH_KEY ""
+#define MIMI_SECRET_PROXY_HOST ""
+#define MIMI_SECRET_PROXY_PORT ""
+#define MIMI_SECRET_PROXY_TYPE "" /* "http" or "socks5" */
+
+/* Voice STT / TTS */
+#define MIMI_SECRET_STT_URL "https://your-stt-endpoint"
+#define MIMI_SECRET_STT_API_KEY "your-stt-key"
+#define MIMI_SECRET_STT_MODEL "your-stt-model"
+#define MIMI_SECRET_TTS_URL "https://your-tts-endpoint"
+#define MIMI_SECRET_TTS_API_KEY "your-tts-key"
+#define MIMI_SECRET_TTS_MODEL "your-tts-model"
+#define MIMI_SECRET_TTS_VOICE ""
+#define MIMI_SECRET_TTS_LANGUAGE "English"
+
+/* ReSpeaker XVF3800 I2S pin map */
+#define MIMI_VOICE_I2S_PORT 0
+#define MIMI_VOICE_I2S_BCLK GPIO_NUM_8
+#define MIMI_VOICE_I2S_WS GPIO_NUM_7
+#define MIMI_VOICE_I2S_DIN GPIO_NUM_43
+#define MIMI_VOICE_I2S_DOUT GPIO_NUM_44
```
-> **Important: Plug into the correct USB port!** Most ESP32-S3 boards have two USB-C ports. You must use the one labeled **USB** (native USB Serial/JTAG), **not** the one labeled **COM** (external UART bridge). Plugging into the wrong port will cause flash/monitor failures.
->
->
-> Show reference photo
->
->
->
->
+Notes:
-### CLI Commands (via UART/COM port)
+- `MIMI_SECRET_MODEL_PROVIDER` selects the request protocol, not just the vendor name.
+- Use `openai` for OpenAI-compatible gateways.
+- Use `anthropic` for Anthropic-compatible gateways.
+- Voice mode requires STT and TTS URL/key pairs to be configured.
+- LLM API base can be changed at runtime with `set_api_base`.
-Connect via serial to configure or debug. **Config commands** let you change settings without recompiling — just plug in a USB cable anywhere.
+## Adding STT and TTS
-**Runtime config** (saved to NVS, overrides build-time defaults):
+This project no longer treats speech as an afterthought. To enable the full ReSpeaker experience:
-```
-mimi> wifi_set MySSID MyPassword # change WiFi network
-mimi> set_tg_token 123456:ABC... # change Telegram bot token
-mimi> set_api_key sk-ant-api03-... # change API key (Anthropic or OpenAI)
-mimi> set_model_provider openai # switch provider (anthropic|openai)
-mimi> set_model gpt-4o # change LLM model
-mimi> set_proxy 127.0.0.1 7897 # set HTTP proxy
-mimi> clear_proxy # remove proxy
-mimi> set_search_key BSA... # set Brave Search API key
-mimi> set_tavily_key tvly-... # set Tavily API key (preferred)
-mimi> config_show # show all config (masked)
-mimi> config_reset # clear NVS, revert to build-time defaults
-```
+1. Configure `MIMI_SECRET_STT_URL`, `MIMI_SECRET_STT_API_KEY`, and `MIMI_SECRET_STT_MODEL`.
+2. Configure `MIMI_SECRET_TTS_URL`, `MIMI_SECRET_TTS_API_KEY`, `MIMI_SECRET_TTS_MODEL`, `MIMI_SECRET_TTS_VOICE`, and `MIMI_SECRET_TTS_LANGUAGE`.
+3. Set the XVF3800 input pins and your speaker output pins in the I2S section.
+4. If your DAC or amp sounds noisy, set `MIMI_VOICE_I2S_STD_SLOT_STYLE` to match the hardware timing style.
+5. If your room causes false triggers, tune `MIMI_VOICE_VAD_START_FRAMES`, `MIMI_VOICE_VAD_MIN_FRAMES`, and `MIMI_VOICE_STT_COOLDOWN_MS`.
+6. If your TTS audio is too long, tune `MIMI_VOICE_TTS_MAX_SECONDS`, `MIMI_VOICE_TTS_CHARS_PER_SEC`, and `MIMI_VOICE_TTS_MAX_CHARS`.
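+
+As a concrete sketch of steps 5 and 6, the tuning knobs can be set alongside the I2S defines. The values and comments below are assumptions inferred from the macro names, not measured recommendations; start from them and tune against your own room and playback chain:
+
+```c
+/* Illustrative values only: tune on your own hardware. */
+#define MIMI_VOICE_VAD_START_FRAMES 3     /* voiced frames required before capture starts */
+#define MIMI_VOICE_VAD_MIN_FRAMES   10    /* drop utterances shorter than this */
+#define MIMI_VOICE_STT_COOLDOWN_MS  1500  /* ignore retriggers right after a reply */
+#define MIMI_VOICE_TTS_MAX_SECONDS  30    /* cap on synthesized audio length */
+```
+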
-**Debug & maintenance:**
+The current firmware already contains the full voice channel:
-```
-mimi> wifi_status # am I connected?
-mimi> memory_read # see what the bot remembers
-mimi> memory_write "content" # write to MEMORY.md
-mimi> heap_info # how much RAM is free?
-mimi> session_list # list all chat sessions
-mimi> session_clear 12345 # wipe a conversation
-mimi> heartbeat_trigger # manually trigger a heartbeat check
-mimi> cron_start # start cron scheduler now
-mimi> restart # reboot
-```
-
-### USB (JTAG) vs UART: Which Port for What
-
-Most ESP32-S3 dev boards expose **two USB-C ports**:
-
-| Port | Use for |
-|------|---------|
-| **USB** (JTAG) | `idf.py flash`, JTAG debugging |
-| **COM** (UART) | **REPL CLI**, serial console |
-
-> **REPL requires the UART (COM) port.** The USB (JTAG) port does not support interactive REPL input.
+- inbound: mic PCM -> VAD -> STT -> message bus
+- outbound: agent text -> TTS -> playback
-
-Port details & recommended workflow
+## Flash and Monitor
-| Port | Label | Protocol |
-|------|-------|----------|
-| **USB** | USB / JTAG | Native USB Serial/JTAG |
-| **COM** | UART / COM | External UART bridge (CP2102/CH340) |
+After changing `main/mimi_secrets.h`, rebuild from a clean state:
-The ESP-IDF console/REPL is configured to use UART by default (`CONFIG_ESP_CONSOLE_UART_DEFAULT=y`).
-
-**If you have both ports connected simultaneously:**
-
-- USB (JTAG) handles flash/download and provides secondary serial output
-- UART (COM) provides the primary interactive console for the REPL
-- macOS: both appear as `/dev/cu.usbmodem*` or `/dev/cu.usbserial-*` — run `ls /dev/cu.usb*` to identify
-- Linux: USB (JTAG) → `/dev/ttyACM0`, UART → `/dev/ttyUSB0`
+```bash
+idf.py fullclean
+idf.py build
+```
-**Recommended workflow:**
+Find your serial port:
```bash
-# Flash via USB (JTAG) port
-idf.py -p /dev/cu.usbmodem11401 flash
-
-# Open REPL via UART (COM) port
-idf.py -p /dev/cu.usbserial-110 monitor
-# or use any serial terminal: screen, minicom, PuTTY at 115200 baud
+ls /dev/cu.usb* # macOS
+ls /dev/ttyACM* # Linux
```
-
+Flash and monitor:
-## Memory
+```bash
+idf.py -p PORT flash monitor
+```
-MimiClaw stores everything as plain text files you can read and edit:
+Replace `PORT` with your actual device path.
-| File | What it is |
-|------|------------|
-| `SOUL.md` | The bot's personality — edit this to change how it behaves |
-| `USER.md` | Info about you — name, preferences, language |
-| `MEMORY.md` | Long-term memory — things the bot should always remember |
-| `HEARTBEAT.md` | Task list the bot checks periodically and acts on autonomously |
-| `cron.json` | Scheduled jobs — recurring or one-shot tasks created by the AI |
-| `2026-02-05.md` | Daily notes — what happened today |
-| `tg_12345.jsonl` | Chat history — your conversation with the bot |
+## Serial CLI
-## Tools
+The serial CLI is the fastest way to change runtime settings stored in NVS:
-MimiClaw supports tool calling for both Anthropic and OpenAI — the LLM can call tools during a conversation and loop until the task is done (ReAct pattern).
+```text
+mimi> wifi_set MySSID MyPassword
+mimi> set_tg_token 123456:ABC...
+mimi> set_api_key your-llm-key
+mimi> set_api_base https://your-compatible-endpoint/v1
+mimi> set_model_provider openai
+mimi> set_model gpt-5.2
+mimi> set_proxy 127.0.0.1 7897
+mimi> clear_proxy
+mimi> set_search_key BSA...
+mimi> set_tavily_key tvly-...
+mimi> config_show
+mimi> config_reset
+```
-| Tool | Description |
-|------|-------------|
-| `web_search` | Search the web via Tavily (preferred) or Brave for current information |
-| `get_current_time` | Fetch current date/time via HTTP and set the system clock |
-| `cron_add` | Schedule a recurring or one-shot task (the LLM creates cron jobs on its own) |
-| `cron_list` | List all scheduled cron jobs |
-| `cron_remove` | Remove a cron job by ID |
+Maintenance commands:
+
+```text
+mimi> wifi_status
+mimi> memory_read
+mimi> memory_write "remember this"
+mimi> heap_info
+mimi> session_list
+mimi> session_clear 12345
+mimi> heartbeat_trigger
+mimi> cron_start
+mimi> restart
+```
-To enable web search, set a [Tavily API key](https://app.tavily.com/home) via `MIMI_SECRET_TAVILY_KEY` (preferred), or a [Brave Search API key](https://brave.com/search/api/) via `MIMI_SECRET_SEARCH_KEY` in `mimi_secrets.h`.
+## Compatible Provider Model
-## Cron Tasks
+`reSpeaker-claw` is not limited to the official Anthropic and OpenAI endpoints.
-MimiClaw has a built-in cron scheduler that lets the AI schedule its own tasks. The LLM can create recurring jobs ("every N seconds") or one-shot jobs ("at unix timestamp") via the `cron_add` tool. When a job fires, its message is injected into the agent loop — so the AI wakes up, processes the task, and responds.
+It supports:
-Jobs are persisted to SPIFFS (`cron.json`) and survive reboots. Example use cases: daily summaries, periodic reminders, scheduled check-ins.
+- Anthropic protocol compatible services, selected with `set_model_provider anthropic`
+- OpenAI protocol compatible services, selected with `set_model_provider openai`
+- Custom API bases through `set_api_base`
-## Heartbeat
+This makes it practical to use local gateways, regional cloud vendors, or unified API platforms without changing the agent loop.
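+
+For example, pointing the agent at a self-hosted OpenAI-compatible gateway takes a few serial CLI commands; the host, port, and model name below are placeholders:
+
+```text
+mimi> set_api_base http://192.168.1.50:8000/v1
+mimi> set_model_provider openai
+mimi> set_model your-local-model
+```
+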
-The heartbeat service periodically reads `HEARTBEAT.md` from SPIFFS and checks for actionable tasks. If uncompleted items are found (anything that isn't an empty line, a header, or a checked `- [x]` box), it sends a prompt to the agent loop so the AI can act on them autonomously.
+## Memory and Automation
-This turns MimiClaw into a proactive assistant — write tasks to `HEARTBEAT.md` and the bot will pick them up on the next heartbeat cycle (default: every 30 minutes).
+The agent persists its state in plain files on SPIFFS:
-## Also Included
+| File | Purpose |
+|------|---------|
+| `SOUL.md` | Assistant persona |
+| `USER.md` | User profile |
+| `MEMORY.md` | Long-term memory |
+| `HEARTBEAT.md` | Periodic autonomous task list |
+| `cron.json` | Scheduled jobs |
+| `tg_12345.jsonl` | Session history |
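+
+A minimal `HEARTBEAT.md` illustrates the convention: headers, blank lines, and checked `- [x]` items are ignored, while any other line counts as actionable on the next heartbeat cycle. The tasks shown are placeholders:
+
+```markdown
+# Standing tasks
+
+- [x] Send the morning weather summary
+- [ ] Remind me to water the plants
+- [ ] Check the project feed for new releases
+```
+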
-- **WebSocket gateway** on port 18789 — connect from your LAN with any WebSocket client
-- **OTA updates** — flash new firmware over WiFi, no USB needed
-- **Dual-core** — network I/O and AI processing run on separate CPU cores
-- **HTTP proxy** — CONNECT tunnel support for restricted networks
-- **Multi-provider** — supports both Anthropic (Claude) and OpenAI (GPT), switchable at runtime
-- **Cron scheduler** — the AI can schedule its own recurring and one-shot tasks, persisted across reboots
-- **Heartbeat** — periodically checks a task file and prompts the AI to act autonomously
-- **Tool use** — ReAct agent loop with tool calling for both providers
+Built-in automation features:
-## For Developers
+- `cron_add`, `cron_list`, `cron_remove`
+- heartbeat-driven proactive task handling
+- tool calling in the ReAct loop
+- local storage that survives reboot
-Technical details live in the `docs/` folder:
+## Tooling
-- **[docs/ARCHITECTURE.md](docs/ARCHITECTURE.md)** — system design, module map, task layout, memory budget, protocols, flash partitions
-- **[docs/TODO.md](docs/TODO.md)** — feature gap tracker and roadmap
-- **[docs/tool-setup/](docs/tool-setup/README.md)** — configuration guides for external service integrations (Tavily, etc.)
+Built-in tools include:
-## Contributing
+- `web_search`
+- `get_current_time`
+- `cron_add`
+- `cron_list`
+- `cron_remove`
+- SPIFFS file tools used by the agent runtime
-Please read **[CONTRIBUTING.md](CONTRIBUTING.md)** before opening issues or pull requests.
+For web search, configure either:
-## Contributors
+- `MIMI_SECRET_TAVILY_KEY`
+- `MIMI_SECRET_SEARCH_KEY`
-Thanks to everyone who has contributed to MimiClaw.
+## Acknowledgments
-
-
-
+This project builds on the original [mimiclaw](https://github.com/memovai/mimiclaw). reSpeaker-claw adapts that embedded agent foundation to ReSpeaker XVF3800 voice hardware, extends the STT / TTS pipeline, and continues the multi-channel agent architecture.
## License
MIT
-
-## Acknowledgments
-
-Inspired by [OpenClaw](https://github.com/openclaw/openclaw) and [Nanobot](https://github.com/HKUDS/nanobot). MimiClaw reimplements the core AI agent architecture for embedded hardware — no Linux, no server, just a $5 chip.
-
-## Star History
-
-[](https://star-history.com/#memovai/mimiclaw&Date)
diff --git a/README_CN.md b/README_CN.md
index f1cefa51..4a36f65f 100644
--- a/README_CN.md
+++ b/README_CN.md
@@ -1,339 +1,264 @@
-# MimiClaw: $5 芯片上的口袋 AI 助理
+# reSpeaker-claw:面向 ReSpeaker XVF3800 的语音 AI Agent
-
-
-
-
-
-
-
-
+
+
+
+
+
English | 中文 | 日本語
-**$5 芯片上的 AI 助理(OpenClaw)。没有 Linux,没有 Node.js,纯 C。**
+reSpeaker-claw 将基于 ReSpeaker XVF3800 的设备变成一个以语音为主入口的 AI Agent。它通过 I2S 采集音频,在本地执行 VAD,将话语送入 STT,并通过嵌入式 agent loop 处理。系统把实时语音交互、本地记忆、工具调用、调度、heartbeat、OTA 更新和代理支持整合在一起,最后通过 TTS 从扬声器返回响应。
-MimiClaw 把一块小小的 ESP32-S3 开发板变成你的私人 AI 助理。插上 USB 供电,连上 WiFi,通过 Telegram 跟它对话 — 它能处理你丢给它的任何任务,还会随时间积累本地记忆不断进化 — 全部跑在一颗拇指大小的芯片上。
+## 认识 reSpeaker-claw
-## 认识 MimiClaw
+- **小巧**:没有 Linux,没有 Node.js,没有臃肿依赖,只有纯 C
+- **忠诚**:从记忆中学习,重启后依然保留上下文
+- **高效**:USB 供电,低功耗,可 24/7 运行
+- **自由**:ReSpeaker XVF3800 麦克风阵列,配合你自己选择的功放或 DAC
+- **顺手**:内置语音通道,除了 XVF3800 和扬声器链路,不需要额外硬件
-- **小巧** — 没有 Linux,没有 Node.js,没有臃肿依赖 — 纯 C
-- **好用** — 在 Telegram 发消息,剩下的它来搞定
-- **忠诚** — 从记忆中学习,跨重启也不会忘
-- **能干** — USB 供电,0.5W,24/7 运行
-- **可爱** — 一块 ESP32-S3 开发板,$5,没了
+## 亮点
-## 工作原理
+- 语音输入:ReSpeaker XVF3800 麦克风阵列,通过 I2S 接入
+- 语音输出:TTS 音频下载、WAV 解码、重采样与 I2S 播放
+- 多通道 Agent:语音、Telegram、飞书、WebSocket
+- 本地持久化:SPIFFS 保存记忆、配置、会话、cron 任务和每日笔记
+- 兼容 LLM 后端:支持官方 Anthropic / OpenAI API,也支持兼容 Anthropic 或 OpenAI 协议的第三方网关
+- 可配置 STT / TTS:可接入你自己的服务 URL、API Key、模型、音色和语言
+- 运行时覆盖:可通过串口 CLI 修改 WiFi、provider、model、API base、代理和 token,无需改代码
-
+## 快速开始
-你在 Telegram 发一条消息,ESP32-S3 通过 WiFi 收到后送进 Agent 循环 — LLM 思考、调用工具、读取记忆 — 再把回复发回来。同时支持 **Anthropic (Claude)** 和 **OpenAI (GPT)** 两种提供商,运行时可切换。一切都跑在一颗 $5 的芯片上,所有数据存在本地 Flash。
+### 依赖条件
-## 快速开始
+- 一套 ReSpeaker XVF3800 USB 4-Mic Array 搭配 XIAO ESP32S3 开发板
+- 一路 I2S 输出到扬声器 / DAC / 功放
+- 一根用于烧录和串口监控的 USB 线
+- 可用的 WiFi
+- ESP-IDF v5.5+
+- 可选:如果你要使用 Telegram,需要 Telegram Bot Token
+- 可选:如果你要使用飞书,需要飞书应用凭证
+- 一个兼容 Anthropic 或 OpenAI 协议的 LLM API Key
+- 一套用于语音模式的 STT 服务和 TTS 服务
-### 你需要
+### 克隆与构建环境
-- 一块 **ESP32-S3 开发板**,16MB Flash + 8MB PSRAM(如小智 AI 开发板,~¥30)
-- 一根 **USB Type-C 数据线**
-- 一个 **Telegram Bot Token** — 在 Telegram 找 [@BotFather](https://t.me/BotFather) 创建
-- 一个 **Anthropic API Key** — 从 [console.anthropic.com](https://console.anthropic.com) 获取,或一个 **OpenAI API Key** — 从 [platform.openai.com](https://platform.openai.com) 获取
+先参考官方指南刷入 I2S 固件:
+[SeeedStudio wiki](https://wiki.seeedstudio.com/respeaker_xvf3800_introduction/#flash-firmware)
-### 安装
+然后克隆本项目并设置目标:
```bash
-# 需要先安装 ESP-IDF v5.5+:
-# https://docs.espressif.com/projects/esp-idf/en/v5.5.2/esp32s3/get-started/
-
-git clone https://github.com/memovai/mimiclaw.git
-cd mimiclaw
+git clone https://github.com/Seeed-Projects/reSpeaker-claw
+cd reSpeaker-claw
idf.py set-target esp32s3
```
-
-Ubuntu 安装
-
-建议基线:
-
-- Ubuntu 22.04/24.04
-- Python >= 3.10
-- CMake >= 3.16
-- Ninja >= 1.10
-- Git >= 2.34
-- flex >= 2.6
-- bison >= 3.8
-- gperf >= 3.1
-- dfu-util >= 0.11
-- `libusb-1.0-0`、`libffi-dev`、`libssl-dev`
+先安装 ESP-IDF:[ESP-IDF 安装](https://docs.espressif.com/projects/esp-idf/en/v5.5.3/esp32s3/get-started/)
-Ubuntu 安装与构建:
+Ubuntu 辅助脚本:
```bash
-sudo apt-get update
-sudo apt-get install -y git wget flex bison gperf python3 python3-pip python3-venv \
- cmake ninja-build ccache libffi-dev libssl-dev dfu-util libusb-1.0-0
-
./scripts/setup_idf_ubuntu.sh
./scripts/build_ubuntu.sh
```
-
-
-
-macOS 安装
-
-建议基线:
-
-- macOS 12/13/14
-- Xcode Command Line Tools
-- Homebrew
-- Python >= 3.10
-- CMake >= 3.16
-- Ninja >= 1.10
-- Git >= 2.34
-- flex >= 2.6
-- bison >= 3.8
-- gperf >= 3.1
-- dfu-util >= 0.11
-- `libusb`、`libffi`、`openssl`
-
-macOS 安装与构建:
+macOS 辅助脚本:
```bash
-xcode-select --install
-/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
-
./scripts/setup_idf_macos.sh
./scripts/build_macos.sh
```
-
-
-### 配置
+## 配置
-MimiClaw 使用**两层配置**:`mimi_secrets.h` 提供编译时默认值,串口 CLI 可在运行时覆盖。CLI 设置的值存在 NVS Flash 中,优先级高于编译时值。
+复制示例 secrets 文件:
```bash
-cp main/mimi_secrets.h.example main/mimi_secrets.h
+cp "main/mimi_secrets.h.example" "main/mimi_secrets.h"
```
-编辑 `main/mimi_secrets.h`:
+编辑 `main/mimi_secrets.h`,填写你实际需要的配置项:
```c
-#define MIMI_SECRET_WIFI_SSID "你的WiFi名"
-#define MIMI_SECRET_WIFI_PASS "你的WiFi密码"
-#define MIMI_SECRET_TG_TOKEN "123456:ABC-DEF1234ghIkl-zyx57W2v1u123ew11"
-#define MIMI_SECRET_API_KEY "sk-ant-api03-xxxxx"
-#define MIMI_SECRET_MODEL_PROVIDER "anthropic" // "anthropic" 或 "openai"
-#define MIMI_SECRET_SEARCH_KEY "" // 可选:Brave Search API key
-#define MIMI_SECRET_TAVILY_KEY "" // 可选:Tavily API key(优先)
-#define MIMI_SECRET_PROXY_HOST "10.0.0.1" // 可选:代理地址
-#define MIMI_SECRET_PROXY_PORT "7897" // 可选:代理端口
+/* WiFi */
+#define MIMI_SECRET_WIFI_SSID "YourWiFiName"
+#define MIMI_SECRET_WIFI_PASS "YourWiFiPassword"
+
+/* Optional text channels */
+#define MIMI_SECRET_TG_TOKEN ""
+#define MIMI_SECRET_FEISHU_APP_ID ""
+#define MIMI_SECRET_FEISHU_APP_SECRET ""
+
+/* LLM */
+#define MIMI_SECRET_API_KEY "your-llm-key"
+#define MIMI_SECRET_MODEL "your-model"
+#define MIMI_SECRET_MODEL_PROVIDER "openai" /* or "anthropic" */
+
+/* Search and proxy */
+#define MIMI_SECRET_TAVILY_KEY ""
+#define MIMI_SECRET_SEARCH_KEY ""
+#define MIMI_SECRET_PROXY_HOST ""
+#define MIMI_SECRET_PROXY_PORT ""
+#define MIMI_SECRET_PROXY_TYPE "" /* "http" or "socks5" */
+
+/* Voice STT / TTS */
+#define MIMI_SECRET_STT_URL "https://your-stt-endpoint"
+#define MIMI_SECRET_STT_API_KEY "your-stt-key"
+#define MIMI_SECRET_STT_MODEL "your-stt-model"
+#define MIMI_SECRET_TTS_URL "https://your-tts-endpoint"
+#define MIMI_SECRET_TTS_API_KEY "your-tts-key"
+#define MIMI_SECRET_TTS_MODEL "your-tts-model"
+#define MIMI_SECRET_TTS_VOICE ""
+#define MIMI_SECRET_TTS_LANGUAGE "English"
+
+/* ReSpeaker XVF3800 I2S pin map */
+#define MIMI_VOICE_I2S_PORT 0
+#define MIMI_VOICE_I2S_BCLK GPIO_NUM_8
+#define MIMI_VOICE_I2S_WS GPIO_NUM_7
+#define MIMI_VOICE_I2S_DIN GPIO_NUM_43
+#define MIMI_VOICE_I2S_DOUT GPIO_NUM_44
```
-然后编译烧录:
+说明:
-```bash
-# 完整编译(修改 mimi_secrets.h 后必须 fullclean)
-idf.py fullclean && idf.py build
-
-# 查找串口
-ls /dev/cu.usb* # macOS
-ls /dev/ttyACM* # Linux
-
-# 烧录并监控(将 PORT 替换为你的串口)
-# USB 转接器:大概率是 /dev/cu.usbmodem11401(macOS)或 /dev/ttyACM0(Linux)
-idf.py -p PORT flash monitor
-```
-
-> **注意:请插对 USB 口!** 大多数 ESP32-S3 开发板有两个 Type-C 接口,必须插标有 **USB** 的那个口(原生 USB Serial/JTAG),**不要**插标有 **COM** 的口(外部 UART 桥接)。插错口会导致烧录/监控失败。
->
->
-> 查看参考图片
->
->
->
->
+- `MIMI_SECRET_MODEL_PROVIDER` 选择的是请求协议,而不只是厂商名
+- 兼容 OpenAI 协议的网关使用 `openai`
+- 兼容 Anthropic 协议的网关使用 `anthropic`
+- 语音模式要求 STT 与 TTS 的 URL / Key 成对配置
+- LLM API base 可在运行时通过 `set_api_base` 修改
-### 代理配置(国内用户)
+## 添加 STT 和 TTS
-在国内需要代理才能访问 Telegram 和 Anthropic API。MimiClaw 内置 HTTP CONNECT 隧道支持。
+这个项目不再把语音当成附属功能。要启用完整的 ReSpeaker 体验:
-**前提**:局域网内有一个支持 HTTP CONNECT 的代理(Clash Verge、V2Ray 等),并开启了「允许局域网连接」。
+1. 配置 `MIMI_SECRET_STT_URL`、`MIMI_SECRET_STT_API_KEY` 和 `MIMI_SECRET_STT_MODEL`
+2. 配置 `MIMI_SECRET_TTS_URL`、`MIMI_SECRET_TTS_API_KEY`、`MIMI_SECRET_TTS_MODEL`、`MIMI_SECRET_TTS_VOICE` 和 `MIMI_SECRET_TTS_LANGUAGE`
+3. 在 I2S 配置段中设置 XVF3800 的输入引脚和扬声器输出引脚
+4. 如果 DAC 或功放播放出来像噪音,设置 `MIMI_VOICE_I2S_STD_SLOT_STYLE` 以匹配硬件时序
+5. 如果房间环境导致误触发,调节 `MIMI_VOICE_VAD_START_FRAMES`、`MIMI_VOICE_VAD_MIN_FRAMES` 和 `MIMI_VOICE_STT_COOLDOWN_MS`
+6. 如果 TTS 音频过长,调节 `MIMI_VOICE_TTS_MAX_SECONDS`、`MIMI_VOICE_TTS_CHARS_PER_SEC` 和 `MIMI_VOICE_TTS_MAX_CHARS`
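+
+作为第 5、6 步的具体示例,这些调参宏可以和 I2S 定义放在一起设置。下面的数值和注释是根据宏名推测的假设,并非实测推荐值,请以此为起点在自己的环境中调节:
+
+```c
+/* 仅为示例数值:请在自己的硬件上调节 */
+#define MIMI_VOICE_VAD_START_FRAMES 3     /* 触发采集所需的连续有声帧数 */
+#define MIMI_VOICE_VAD_MIN_FRAMES   10    /* 丢弃短于该帧数的话语 */
+#define MIMI_VOICE_STT_COOLDOWN_MS  1500  /* 回复结束后的一段时间内忽略再次触发 */
+#define MIMI_VOICE_TTS_MAX_SECONDS  30    /* 合成音频时长上限 */
+```
+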
-可以在 `mimi_secrets.h` 中编译时设置,也可以通过串口 CLI 随时修改:
+当前固件已经包含完整的语音通道:
-```
-mimi> set_proxy 192.168.1.83 7897 # 设置代理
-mimi> clear_proxy # 清除代理
-```
+- 输入方向:mic PCM -> VAD -> STT -> message bus
+- 输出方向:agent text -> TTS -> playback
-> **提示**:确保 ESP32-S3 和代理机器在同一局域网。Clash Verge 在「设置 → 允许局域网」中开启。
+## 烧录与监控
-### CLI 命令(通过 UART/COM 口连接)
+修改 `main/mimi_secrets.h` 后,建议从干净状态重新构建:
-通过串口连接即可配置和调试。**配置命令**让你无需重新编译就能修改设置 — 随时随地插上 USB 线就能改。
-
-**运行时配置**(存入 NVS,覆盖编译时默认值):
-
-```
-mimi> wifi_set MySSID MyPassword # 换 WiFi
-mimi> set_tg_token 123456:ABC... # 换 Telegram Bot Token
-mimi> set_api_key sk-ant-api03-... # 换 API Key(Anthropic 或 OpenAI)
-mimi> set_model_provider openai # 切换提供商(anthropic|openai)
-mimi> set_model gpt-4o # 换模型
-mimi> set_proxy 192.168.1.83 7897 # 设置代理
-mimi> clear_proxy # 清除代理
-mimi> set_search_key BSA... # 设置 Brave Search API Key
-mimi> set_tavily_key tvly-... # 设置 Tavily API Key(优先)
-mimi> config_show # 查看所有配置(脱敏显示)
-mimi> config_reset # 清除 NVS,恢复编译时默认值
+```bash
+idf.py fullclean
+idf.py build
```
-**调试与运维:**
+查找串口:
+```bash
+ls /dev/cu.usb* # macOS
+ls /dev/ttyACM* # Linux
```
-mimi> wifi_status # 连上了吗?
-mimi> memory_read # 看看它记住了什么
-mimi> memory_write "内容" # 写入 MEMORY.md
-mimi> heap_info # 还剩多少内存?
-mimi> session_list # 列出所有会话
-mimi> session_clear 12345 # 删除一个会话
-mimi> heartbeat_trigger # 手动触发一次心跳检查
-mimi> cron_start # 立即启动 cron 调度器
-mimi> restart # 重启
-```
-
-### USB (JTAG) 与 UART:哪个口做什么
-大多数 ESP32-S3 开发板有 **两个 USB-C 口**:
-
-| 端口 | 用途 |
-|------|------|
-| **USB**(JTAG) | `idf.py flash`、JTAG 调试 |
-| **COM**(UART) | **REPL 命令行**、串口控制台 |
-
-> **REPL 必须连接 UART(COM)口。** USB(JTAG)口不支持交互式 REPL 输入。
-
-
-端口详情与推荐工作流
-
-| 端口 | 标注 | 协议 |
-|------|------|------|
-| **USB** | USB / JTAG | 原生 USB Serial/JTAG |
-| **COM** | UART / COM | 外置 UART 桥接芯片(CP2102/CH340) |
-
-ESP-IDF 控制台默认配置为 UART 输出(`CONFIG_ESP_CONSOLE_UART_DEFAULT=y`)。
-
-**同时连接两个口时:**
-
-- USB(JTAG)口负责烧录/下载,并提供辅助串口输出
-- UART(COM)口提供主要的交互式控制台,用于 REPL
-- macOS 下两个口都会显示为 `/dev/cu.usbmodem*` 或 `/dev/cu.usbserial-*`,用 `ls /dev/cu.usb*` 区分
-- Linux 下 USB(JTAG)通常是 `/dev/ttyACM0`,UART 通常是 `/dev/ttyUSB0`
-
-**推荐工作流:**
+烧录并监控:
```bash
-# 通过 USB(JTAG)口烧录
-idf.py -p /dev/cu.usbmodem11401 flash
-
-# 通过 UART(COM)口打开 REPL
-idf.py -p /dev/cu.usbserial-110 monitor
-# 或使用任意串口工具:screen、minicom、PuTTY,波特率 115200
+idf.py -p PORT flash monitor
```
-
-
-## 记忆
-
-MimiClaw 把所有数据存为纯文本文件,可以直接读取和编辑:
-
-| 文件 | 说明 |
-|------|------|
-| `SOUL.md` | 机器人的人设 — 编辑它来改变行为方式 |
-| `USER.md` | 关于你的信息 — 姓名、偏好、语言 |
-| `MEMORY.md` | 长期记忆 — 它应该一直记住的事 |
-| `HEARTBEAT.md` | 待办清单 — 机器人定期检查并自主执行 |
-| `cron.json` | 定时任务 — AI 创建的周期性或一次性任务 |
-| `2026-02-05.md` | 每日笔记 — 今天发生了什么 |
-| `tg_12345.jsonl` | 聊天记录 — 你和它的对话 |
-
-## 工具
-
-MimiClaw 同时支持 Anthropic 和 OpenAI 的工具调用 — LLM 在对话中可以调用工具,循环执行直到任务完成(ReAct 模式)。
+将 `PORT` 替换为你的实际设备路径。
+
+## 串口 CLI
+
+串口 CLI 是修改 NVS 运行时配置的最快方式:
+
+```text
+mimi> wifi_set MySSID MyPassword
+mimi> set_tg_token 123456:ABC...
+mimi> set_api_key your-llm-key
+mimi> set_api_base https://your-compatible-endpoint/v1
+mimi> set_model_provider openai
+mimi> set_model gpt-5.2
+mimi> set_proxy 127.0.0.1 7897
+mimi> clear_proxy
+mimi> set_search_key BSA...
+mimi> set_tavily_key tvly-...
+mimi> config_show
+mimi> config_reset
+```
-| 工具 | 说明 |
-|------|------|
-| `web_search` | 通过 Tavily(优先)或 Brave 搜索网页,获取实时信息 |
-| `get_current_time` | 通过 HTTP 获取当前日期和时间,并设置系统时钟 |
-| `cron_add` | 创建定时或一次性任务(LLM 自主创建 cron 任务) |
-| `cron_list` | 列出所有已调度的 cron 任务 |
-| `cron_remove` | 按 ID 删除 cron 任务 |
+维护命令:
+
+```text
+mimi> wifi_status
+mimi> memory_read
+mimi> memory_write "remember this"
+mimi> heap_info
+mimi> session_list
+mimi> session_clear 12345
+mimi> heartbeat_trigger
+mimi> cron_start
+mimi> restart
+```
-启用网页搜索可在 `mimi_secrets.h` 中设置 [Tavily API key](https://app.tavily.com/home)(优先,`MIMI_SECRET_TAVILY_KEY`),或 [Brave Search API key](https://brave.com/search/api/)(`MIMI_SECRET_SEARCH_KEY`)。
+## 兼容 Provider 模型
-## 定时任务(Cron)
+`reSpeaker-claw` 不局限于官方 Anthropic 和 OpenAI 端点。
-MimiClaw 内置 cron 调度器,让 AI 可以自主安排任务。LLM 可以通过 `cron_add` 工具创建周期性任务("每 N 秒")或一次性任务("在某个时间戳")。任务触发时,消息会注入到 Agent 循环 — AI 自动醒来、处理任务并回复。
+它支持:
-任务持久化存储在 SPIFFS(`cron.json`),重启后不会丢失。典型用途:每日总结、定时提醒、定期巡检。
+- 兼容 Anthropic 协议的服务,通过 `set_model_provider anthropic` 选择
+- 兼容 OpenAI 协议的服务,通过 `set_model_provider openai` 选择
+- 通过 `set_api_base` 指向任意兼容 API base
-## 心跳(Heartbeat)
+这让你可以在不修改 agent loop 的情况下,直接使用本地网关、区域云厂商或统一 API 平台。
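+
+例如,把 Agent 指向一个自建的 OpenAI 兼容网关只需几条串口 CLI 命令;下面的地址、端口和模型名仅为占位示例:
+
+```text
+mimi> set_api_base http://192.168.1.50:8000/v1
+mimi> set_model_provider openai
+mimi> set_model your-local-model
+```
+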
-心跳服务会定期读取 SPIFFS 上的 `HEARTBEAT.md`,检查是否有待办事项。如果发现未完成的条目(非空行、非标题、非已勾选的 `- [x]`),就会向 Agent 循环发送提示,让 AI 自主处理。
+## 记忆与自动化
-这让 MimiClaw 变成一个主动型助理 — 把任务写入 `HEARTBEAT.md`,机器人会在下一次心跳周期自动拾取执行(默认每 30 分钟)。
+Agent 会将状态以纯文本文件形式持久化到 SPIFFS:
-## 其他功能
+| 文件 | 用途 |
+|------|------|
+| `SOUL.md` | 助手人格 |
+| `USER.md` | 用户资料 |
+| `MEMORY.md` | 长期记忆 |
+| `HEARTBEAT.md` | 周期性自主任务列表 |
+| `cron.json` | 调度任务 |
+| `tg_12345.jsonl` | 会话历史 |
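+
+一个最小的 `HEARTBEAT.md` 可以说明其约定:标题、空行和已勾选的 `- [x]` 条目会被忽略,其余条目都会在下一次心跳周期被视为待办并触发 Agent。示例任务仅为占位内容:
+
+```markdown
+# 常驻任务
+
+- [x] 发送今日天气摘要
+- [ ] 提醒我给植物浇水
+- [ ] 检查项目动态是否有新版本
+```
+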
-- **WebSocket 网关** — 端口 18789,局域网内用任意 WebSocket 客户端连接
-- **OTA 更新** — WiFi 远程刷固件,无需 USB
-- **双核** — 网络 I/O 和 AI 处理分别跑在不同 CPU 核心
-- **HTTP 代理** — CONNECT 隧道,适配受限网络
-- **多提供商** — 同时支持 Anthropic (Claude) 和 OpenAI (GPT),运行时可切换
-- **定时任务** — AI 可自主创建周期性和一次性任务,重启后持久保存
-- **心跳服务** — 定期检查任务文件,驱动 AI 自主执行
-- **工具调用** — ReAct Agent 循环,两种提供商均支持工具调用
+内置自动化能力:
-## 开发者
+- `cron_add`、`cron_list`、`cron_remove`
+- heartbeat 驱动的主动任务处理
+- ReAct loop 中的工具调用
+- 重启后仍可保留的本地状态
-技术细节在 `docs/` 文件夹:
+## 工具
-- **[docs/ARCHITECTURE.md](docs/ARCHITECTURE.md)** — 系统设计、模块划分、任务布局、内存分配、协议、Flash 分区
-- **[docs/TODO.md](docs/TODO.md)** — 功能差距和路线图
-- **[docs/im-integration/](docs/im-integration/README.md)** — IM 通道集成指南(飞书等)
+内置工具包括:
-## 贡献
+- `web_search`
+- `get_current_time`
+- `cron_add`
+- `cron_list`
+- `cron_remove`
+- Agent 运行时使用的 SPIFFS 文件工具
-提交 Issue 或 Pull Request 前,请先阅读 **[CONTRIBUTING.md](CONTRIBUTING.md)**。
+如需启用网页搜索,配置以下任一项:
-## 贡献者
+- `MIMI_SECRET_TAVILY_KEY`
+- `MIMI_SECRET_SEARCH_KEY`
-感谢所有为 MimiClaw 做出贡献的开发者。
+## 致谢
-
-
-
+本项目基于原始的 [mimiclaw](https://github.com/memovai/mimiclaw)。reSpeaker-claw 将那套嵌入式 agent 基础适配到 ReSpeaker XVF3800 语音硬件之上,扩展了 STT / TTS 流程,并延续了多通道 agent 架构。
## 许可证
MIT
-
-## 致谢
-
-灵感来自 [OpenClaw](https://github.com/openclaw/openclaw) 和 [Nanobot](https://github.com/HKUDS/nanobot)。MimiClaw 为嵌入式硬件重新实现了核心 AI Agent 架构 — 没有 Linux,没有服务器,只有一颗 $5 的芯片。
-
-## Star History
-
-
-
-
-
-
-
-
diff --git a/README_JA.md b/README_JA.md
index fe91a8a9..cfd2e2d7 100644
--- a/README_JA.md
+++ b/README_JA.md
@@ -1,324 +1,264 @@
-# MimiClaw: $5チップで動くポケットAIアシスタント
+# reSpeaker-claw: ReSpeaker XVF3800 向け音声 AI Agent
-
-
-
-
-
-
-
-
+
+
+
+
+
English | 中文 | 日本語
-**$5チップ上の世界初のAIアシスタント(OpenClaw)。Linuxなし、Node.jsなし、純粋なCのみ。**
-
-MimiClawは小さなESP32-S3ボードをパーソナルAIアシスタントに変えます。USB電源に接続し、WiFiにつなげて、Telegramから話しかけるだけ — どんなタスクも処理し、ローカルメモリで時間とともに成長します — すべて親指サイズのチップ上で。
+reSpeaker-claw は、ReSpeaker XVF3800 ベースのデバイスを音声ファーストの AI Agent に変えるプロジェクトです。I2S で音声を取り込み、ローカル VAD を実行し、発話を STT に送って組み込みの agent loop で処理します。システムはリアルタイム音声対話に加えて、ローカルメモリ、ツール呼び出し、スケジューリング、heartbeat、OTA 更新、プロキシ対応を統合し、最終的に TTS でスピーカーから応答を返します。
-## MimiClawの特徴
+## reSpeaker-claw とは
-- **超小型** — Linux不要、Node.js不要、無駄なし — 純粋なCのみ
-- **便利** — Telegramでメッセージを送るだけ、あとはお任せ
-- **忠実** — メモリから学習し、再起動しても忘れない
-- **省エネ** — USB給電、0.5W、24時間365日稼働
-- **お手頃** — ESP32-S3ボード1枚、$5、それだけ
+- **小さい**: Linux なし、Node.js なし、無駄な依存なし、純粋な C のみ
+- **記憶する**: メモリから学習し、再起動後も文脈を保持
+- **省電力**: USB 給電、より低消費電力で 24/7 稼働可能
+- **自由度が高い**: ReSpeaker XVF3800 のマイクアレイに、好みのアンプや DAC を組み合わせ可能
+- **扱いやすい**: 音声チャネルを内蔵し、XVF3800 とスピーカー経路以外の追加ハードウェアをほぼ必要としない
-## 仕組み
+## 特長
-
-
-Telegramでメッセージを送ると、ESP32-S3がWiFi経由で受信し、エージェントループに送ります — LLMが思考し、ツールを呼び出し、メモリを読み取り — 返答を送り返します。**Anthropic (Claude)** と **OpenAI (GPT)** の両方をサポートし、実行時に切り替え可能です。すべてが$5のチップ上で動作し、データはすべてローカルのFlashに保存されます。
+- 音声入力: ReSpeaker XVF3800 マイクアレイを I2S で接続
+- 音声出力: TTS 音声のダウンロード、WAV デコード、リサンプル、I2S 再生
+- マルチチャネル Agent: 音声、Telegram、Feishu、WebSocket
+- ローカル永続化: SPIFFS にメモリ、設定、セッション、cron ジョブ、日次メモを保存
+- 互換 LLM バックエンド: 公式 Anthropic / OpenAI API に加え、Anthropic 互換または OpenAI 互換エンドポイントも利用可能
+- 柔軟な STT / TTS 設定: URL、API Key、モデル、音色、言語を自由に差し替え可能
+- 実行時オーバーライド: WiFi、provider、model、API base、proxy、token をシリアル CLI から変更可能
## クイックスタート
### 必要なもの
-- **ESP32-S3開発ボード**(16MB Flash + 8MB PSRAM搭載、例:小智AIボード、約$10)
-- **USB Type-Cケーブル**
-- **Telegram Botトークン** — Telegramで[@BotFather](https://t.me/BotFather)に話しかけて作成
-- **Anthropic APIキー** — [console.anthropic.com](https://console.anthropic.com)から取得、または **OpenAI APIキー** — [platform.openai.com](https://platform.openai.com)から取得
+- reSpeaker XVF3800 USB 4 Microphone Array と XIAO ESP32S3 ボード
+- I2S 出力で接続するスピーカー / DAC / アンプ経路
+- 書き込みとシリアルモニタ用の USB ケーブル
+- WiFi 接続
+- ESP-IDF v5.5+
+- 任意: Telegram を使う場合は Telegram Bot Token
+- 任意: Feishu を使う場合は Feishu アプリ認証情報
+- Anthropic 互換または OpenAI 互換エンドポイント向けの LLM API Key
+- 音声モード用の STT サービスと TTS サービス
-### インストール
+### クローンとビルド環境
-```bash
-# まずESP-IDF v5.5+をインストールしてください:
-# https://docs.espressif.com/projects/esp-idf/en/v5.5.2/esp32s3/get-started/
+まず公式ガイドを参照して I2S ファームウェアを書き込んでください:
+[SeeedStudio wiki](https://wiki.seeedstudio.com/respeaker_xvf3800_introduction/#flash-firmware)
+
+その後、このプロジェクトをクローンしてターゲットを設定します:
-git clone https://github.com/memovai/mimiclaw.git
-cd mimiclaw
+```bash
+git clone https://github.com/Seeed-Projects/reSpeaker-claw
+cd reSpeaker-claw
idf.py set-target esp32s3
```
-
-Ubuntu インストール
-
-推奨ベースライン:
+なお、`idf.py` の実行には ESP-IDF v5.5+ が必要です。未インストールの場合はこちら: [ESP-IDF Install](https://docs.espressif.com/projects/esp-idf/en/v5.5.3/esp32s3/get-started/)
-- Ubuntu 22.04/24.04
-- Python >= 3.10
-- CMake >= 3.16
-- Ninja >= 1.10
-- Git >= 2.34
-- flex >= 2.6
-- bison >= 3.8
-- gperf >= 3.1
-- dfu-util >= 0.11
-- `libusb-1.0-0`, `libffi-dev`, `libssl-dev`
-
-Ubuntu でのインストールとビルド:
+Ubuntu 用ヘルパースクリプト:
```bash
-sudo apt-get update
-sudo apt-get install -y git wget flex bison gperf python3 python3-pip python3-venv \
- cmake ninja-build ccache libffi-dev libssl-dev dfu-util libusb-1.0-0
-
./scripts/setup_idf_ubuntu.sh
./scripts/build_ubuntu.sh
```
-
-
-
-macOS インストール
-
-推奨ベースライン:
-
-- macOS 12/13/14
-- Xcode Command Line Tools
-- Homebrew
-- Python >= 3.10
-- CMake >= 3.16
-- Ninja >= 1.10
-- Git >= 2.34
-- flex >= 2.6
-- bison >= 3.8
-- gperf >= 3.1
-- dfu-util >= 0.11
-- `libusb`, `libffi`, `openssl`
-
-macOS でのインストールとビルド:
+macOS 用ヘルパースクリプト:
```bash
-xcode-select --install
-/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
-
./scripts/setup_idf_macos.sh
./scripts/build_macos.sh
```
-
+## 設定
-### 設定
-
-MimiClawは**2層設定**を採用しています:`mimi_secrets.h`でビルド時のデフォルト値を設定し、シリアルCLIで実行時にオーバーライドできます。CLI設定値はNVS Flashに保存され、ビルド時の値より優先されます。
+まず secrets のサンプルファイルをコピーします:
```bash
-cp main/mimi_secrets.h.example main/mimi_secrets.h
+cp "main/mimi_secrets.h.example" "main/mimi_secrets.h"
```
-`main/mimi_secrets.h`を編集:
+`main/mimi_secrets.h` を編集し、実際に使う項目を設定します:
```c
-#define MIMI_SECRET_WIFI_SSID "WiFi名"
-#define MIMI_SECRET_WIFI_PASS "WiFiパスワード"
-#define MIMI_SECRET_TG_TOKEN "123456:ABC-DEF1234ghIkl-zyx57W2v1u123ew11"
-#define MIMI_SECRET_API_KEY "sk-ant-api03-xxxxx"
-#define MIMI_SECRET_MODEL_PROVIDER "anthropic" // "anthropic" または "openai"
-#define MIMI_SECRET_SEARCH_KEY "" // 任意:Brave Search APIキー
-#define MIMI_SECRET_TAVILY_KEY "" // 任意:Tavily APIキー(優先)
-#define MIMI_SECRET_PROXY_HOST "" // 任意:例 "10.0.0.1"
-#define MIMI_SECRET_PROXY_PORT "" // 任意:例 "7897"
+/* WiFi */
+#define MIMI_SECRET_WIFI_SSID "YourWiFiName"
+#define MIMI_SECRET_WIFI_PASS "YourWiFiPassword"
+
+/* Optional text channels */
+#define MIMI_SECRET_TG_TOKEN ""
+#define MIMI_SECRET_FEISHU_APP_ID ""
+#define MIMI_SECRET_FEISHU_APP_SECRET ""
+
+/* LLM */
+#define MIMI_SECRET_API_KEY "your-llm-key"
+#define MIMI_SECRET_MODEL "your-model"
+#define MIMI_SECRET_MODEL_PROVIDER "openai" /* or "anthropic" */
+
+/* Search and proxy */
+#define MIMI_SECRET_TAVILY_KEY ""
+#define MIMI_SECRET_SEARCH_KEY ""
+#define MIMI_SECRET_PROXY_HOST ""
+#define MIMI_SECRET_PROXY_PORT ""
+#define MIMI_SECRET_PROXY_TYPE "" /* "http" or "socks5" */
+
+/* Voice STT / TTS */
+#define MIMI_SECRET_STT_URL "https://your-stt-endpoint"
+#define MIMI_SECRET_STT_API_KEY "your-stt-key"
+#define MIMI_SECRET_STT_MODEL "your-stt-model"
+#define MIMI_SECRET_TTS_URL "https://your-tts-endpoint"
+#define MIMI_SECRET_TTS_API_KEY "your-tts-key"
+#define MIMI_SECRET_TTS_MODEL "your-tts-model"
+#define MIMI_SECRET_TTS_VOICE ""
+#define MIMI_SECRET_TTS_LANGUAGE "English"
+
+/* ReSpeaker XVF3800 I2S pin map */
+#define MIMI_VOICE_I2S_PORT 0
+#define MIMI_VOICE_I2S_BCLK GPIO_NUM_8
+#define MIMI_VOICE_I2S_WS GPIO_NUM_7
+#define MIMI_VOICE_I2S_DIN GPIO_NUM_43
+#define MIMI_VOICE_I2S_DOUT GPIO_NUM_44
```
-ビルドとフラッシュ:
+補足:
-```bash
-# フルビルド(mimi_secrets.h変更後はfullclean必須)
-idf.py fullclean && idf.py build
+- `MIMI_SECRET_MODEL_PROVIDER` はベンダ名ではなく、リクエストプロトコルを選択します
+- OpenAI 互換ゲートウェイには `openai` を使用します
+- Anthropic 互換ゲートウェイには `anthropic` を使用します
+- 音声モードでは STT と TTS の URL / Key を両方設定する必要があります
+- LLM API base は実行時に `set_api_base` で変更できます
-# シリアルポートを確認
-ls /dev/cu.usb* # macOS
-ls /dev/ttyACM* # Linux
+## STT と TTS の追加
-# フラッシュとモニター(PORTをあなたのポートに置き換え)
-# USBアダプタ:おそらく /dev/cu.usbmodem11401(macOS)または /dev/ttyACM0(Linux)
-idf.py -p PORT flash monitor
-```
+このプロジェクトでは、音声を後付け機能として扱っていません。完全な ReSpeaker 体験を有効にするには:
-> **重要:正しいUSBポートに接続してください!** ほとんどのESP32-S3ボードには2つのUSB-Cポートがあります。**USB**(ネイティブUSB Serial/JTAG)と書かれたポートを使用してください。**COM**(外部UARTブリッジ)と書かれたポートは使わないでください。間違ったポートに接続するとフラッシュ/モニターが失敗します。
->
->
-> 参考画像を表示
->
->
->
->
+1. `MIMI_SECRET_STT_URL`、`MIMI_SECRET_STT_API_KEY`、`MIMI_SECRET_STT_MODEL` を設定します
+2. `MIMI_SECRET_TTS_URL`、`MIMI_SECRET_TTS_API_KEY`、`MIMI_SECRET_TTS_MODEL`、`MIMI_SECRET_TTS_VOICE`、`MIMI_SECRET_TTS_LANGUAGE` を設定します
+3. I2S セクションで XVF3800 の入力ピンとスピーカー側の出力ピンを設定します
+4. DAC やアンプの音がノイズになる場合は、`MIMI_VOICE_I2S_STD_SLOT_STYLE` をハードウェアのタイミングに合わせて設定します
+5. 室内環境で誤検知が多い場合は、`MIMI_VOICE_VAD_START_FRAMES`、`MIMI_VOICE_VAD_MIN_FRAMES`、`MIMI_VOICE_STT_COOLDOWN_MS` を調整します
+6. TTS 音声が長すぎる場合は、`MIMI_VOICE_TTS_MAX_SECONDS`、`MIMI_VOICE_TTS_CHARS_PER_SEC`、`MIMI_VOICE_TTS_MAX_CHARS` を調整します
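+
+チューニングの一例です(値はあくまで例示で、実際のマクロ既定値とは異なる場合があります):
+
+```c
+/* VAD: 発話開始とみなす連続フレーム数(大きくすると誤検知が減る) */
+#define MIMI_VOICE_VAD_START_FRAMES  3
+/* これより短い発話は STT に送らない */
+#define MIMI_VOICE_VAD_MIN_FRAMES    10
+/* STT 呼び出し間の最小間隔(ミリ秒) */
+#define MIMI_VOICE_STT_COOLDOWN_MS   500
+```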
-### CLIコマンド(UART/COMポート経由)
+現在のファームウェアには、すでに完全な音声チャネルが含まれています:
-シリアル接続で設定やデバッグができます。**設定コマンド**により再コンパイル不要で設定変更可能 — USBケーブルを挿すだけ。
+- 入力方向: mic PCM -> VAD -> STT -> message bus
+- 出力方向: agent text -> TTS -> playback
-**実行時設定**(NVSに保存、ビルド時のデフォルト値をオーバーライド):
+## 書き込みとモニタ
-```
-mimi> wifi_set MySSID MyPassword # WiFiネットワークを変更
-mimi> set_tg_token 123456:ABC... # Telegram Botトークンを変更
-mimi> set_api_key sk-ant-api03-... # APIキーを変更(AnthropicまたはOpenAI)
-mimi> set_model_provider openai # プロバイダーを切替(anthropic|openai)
-mimi> set_model gpt-4o # LLMモデルを変更
-mimi> set_proxy 127.0.0.1 7897 # HTTPプロキシを設定
-mimi> clear_proxy # プロキシを削除
-mimi> set_search_key BSA... # Brave Search APIキーを設定
-mimi> set_tavily_key tvly-... # Tavily APIキーを設定(優先)
-mimi> config_show # 全設定を表示(マスク付き)
-mimi> config_reset # NVSをクリア、ビルド時デフォルトに戻す
-```
-
-**デバッグ・メンテナンス:**
+`main/mimi_secrets.h` を変更した後は、クリーンな状態から再ビルドしてください:
+```bash
+idf.py fullclean
+idf.py build
```
-mimi> wifi_status # 接続されていますか?
-mimi> memory_read # ボットが何を覚えているか確認
-mimi> memory_write "内容" # MEMORY.mdに書き込み
-mimi> heap_info # 空きRAMはどれくらい?
-mimi> session_list # 全チャットセッションを一覧
-mimi> session_clear 12345 # 会話を削除
-mimi> heartbeat_trigger # ハートビートチェックを手動トリガー
-mimi> cron_start # cronスケジューラを今すぐ開始
-mimi> restart # 再起動
-```
-
-### USB(JTAG)vs UART:どのポートで何をするか
-
-ほとんどの ESP32-S3 開発ボードには **2つの USB-C ポート**があります:
-| ポート | 用途 |
-|--------|------|
-| **USB**(JTAG) | `idf.py flash`、JTAGデバッグ |
-| **COM**(UART) | **REPL CLI**、シリアルコンソール |
-
-> **REPLにはUART(COM)ポートが必要です。** USB(JTAG)ポートは対話的なREPL入力をサポートしません。
-
-
-ポート詳細と推奨ワークフロー
-
-| ポート | ラベル | プロトコル |
-|--------|--------|------------|
-| **USB** | USB / JTAG | ネイティブ USB Serial/JTAG |
-| **COM** | UART / COM | 外部 UART ブリッジ(CP2102/CH340) |
-
-ESP-IDFコンソールはデフォルトでUART出力に設定されています(`CONFIG_ESP_CONSOLE_UART_DEFAULT=y`)。
-
-**両方のポートを同時に接続している場合:**
-
-- USB(JTAG)ポートはフラッシュ/ダウンロードを処理し、補助シリアル出力を提供
-- UART(COM)ポートはREPL用のメインインタラクティブコンソールを提供
-- macOS では両ポートとも `/dev/cu.usbmodem*` または `/dev/cu.usbserial-*` として表示 — `ls /dev/cu.usb*` で確認
-- Linux では USB(JTAG)は通常 `/dev/ttyACM0`、UART は通常 `/dev/ttyUSB0`
-
-**推奨ワークフロー:**
+シリアルポートを確認します:
```bash
-# USB(JTAG)ポートでフラッシュ
-idf.py -p /dev/cu.usbmodem11401 flash
-
-# UART(COM)ポートでREPLを開く
-idf.py -p /dev/cu.usbserial-110 monitor
-# または任意のシリアルターミナル:screen、minicom、PuTTY(ボーレート 115200)
+ls /dev/cu.usb* # macOS
+ls /dev/ttyACM* # Linux
```
-
-
-## メモリ
-
-MimiClawはすべてのデータをプレーンテキストファイルとして保存します。直接読み取り・編集可能です:
+書き込みとモニタ:
-| ファイル | 説明 |
-|----------|------|
-| `SOUL.md` | ボットの性格 — 編集して振る舞いを変更 |
-| `USER.md` | あなたの情報 — 名前、好み、言語 |
-| `MEMORY.md` | 長期記憶 — ボットが常に覚えておくべきこと |
-| `HEARTBEAT.md` | タスクリスト — ボットが定期的にチェックして自律的に実行 |
-| `cron.json` | スケジュールジョブ — AIが作成した定期・単発タスク |
-| `2026-02-05.md` | 日次メモ — 今日あったこと |
-| `tg_12345.jsonl` | チャット履歴 — ボットとの会話 |
-
-## ツール
+```bash
+idf.py -p PORT flash monitor
+```
-MimiClawはAnthropicとOpenAI両方のツール呼び出しをサポート — LLMは会話中にツールを呼び出し、タスクが完了するまでループします(ReActパターン)。
+`PORT` は実際のデバイスパスに置き換えてください(例: Linux では `/dev/ttyACM0`、macOS では `/dev/cu.usbmodem*`)。
+
+## シリアル CLI
+
+シリアル CLI は、NVS に保存される実行時設定を最も素早く変更する方法です:
+
+```text
+mimi> wifi_set MySSID MyPassword
+mimi> set_tg_token 123456:ABC...
+mimi> set_api_key your-llm-key
+mimi> set_api_base https://your-compatible-endpoint/v1
+mimi> set_model_provider openai
+mimi> set_model gpt-5.2
+mimi> set_proxy 127.0.0.1 7897
+mimi> clear_proxy
+mimi> set_search_key BSA...
+mimi> set_tavily_key tvly-...
+mimi> config_show
+mimi> config_reset
+```
-| ツール | 説明 |
-|--------|------|
-| `web_search` | Tavily(優先)またはBraveでウェブ検索し、最新情報を取得 |
-| `get_current_time` | HTTP経由で現在の日時を取得し、システムクロックを設定 |
-| `cron_add` | 定期または単発タスクをスケジュール(LLMが自律的にcronジョブを作成) |
-| `cron_list` | スケジュール済みのcronジョブを一覧表示 |
-| `cron_remove` | IDでcronジョブを削除 |
+メンテナンス用コマンド:
+
+```text
+mimi> wifi_status
+mimi> memory_read
+mimi> memory_write "remember this"
+mimi> heap_info
+mimi> session_list
+mimi> session_clear 12345
+mimi> heartbeat_trigger
+mimi> cron_start
+mimi> restart
+```
-ウェブ検索を有効にするには、`mimi_secrets.h`で[Tavily APIキー](https://app.tavily.com/home)(優先、`MIMI_SECRET_TAVILY_KEY`)または[Brave Search APIキー](https://brave.com/search/api/)(`MIMI_SECRET_SEARCH_KEY`)を設定してください。
+## 互換 Provider モデル
-## Cronタスク
+`reSpeaker-claw` は公式の Anthropic と OpenAI のエンドポイントだけに限定されません。
-MimiClawにはcronスケジューラが内蔵されており、AIが自律的にタスクをスケジュールできます。LLMは`cron_add`ツールで定期ジョブ(「N秒ごと」)や単発ジョブ(「UNIXタイムスタンプで指定」)を作成できます。ジョブが発火すると、メッセージがエージェントループに注入され、AIが起動してタスクを処理・応答します。
+対応内容:
-ジョブはSPIFFS(`cron.json`)に永続化され、再起動後も保持されます。活用例:日次サマリー、定期リマインダー、スケジュールチェック。
+- `set_model_provider anthropic` で選択する Anthropic 互換サービス
+- `set_model_provider openai` で選択する OpenAI 互換サービス
+- `set_api_base` で切り替える任意の API base
-## ハートビート
+これにより、agent loop を変更せずに、ローカルゲートウェイ、地域クラウド、統合 API プラットフォームを利用できます。
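+
+たとえば、自前の OpenAI 互換ゲートウェイへ切り替える場合は次のようになります(URL とモデル名は例示用のプレースホルダです):
+
+```text
+mimi> set_model_provider openai
+mimi> set_api_base https://gateway.example.com/v1
+mimi> set_model your-model-name
+```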
-ハートビートサービスはSPIFFS上の`HEARTBEAT.md`を定期的に読み取り、アクション可能なタスクがあるかチェックします。未完了の項目(空行、見出し、チェック済み`- [x]`以外)が見つかると、エージェントループにプロンプトを送信し、AIが自律的に処理します。
+## メモリと自動化
-これによりMimiClawはプロアクティブなアシスタントになります — `HEARTBEAT.md`にタスクを書き込めば、次のハートビートサイクルで自動的に拾い上げて実行します(デフォルト:30分ごと)。
+Agent は SPIFFS 上に状態をプレーンテキストファイルとして保存します:
-## その他の機能
+| ファイル | 用途 |
+|----------|------|
+| `SOUL.md` | アシスタント人格 |
+| `USER.md` | ユーザープロファイル |
+| `MEMORY.md` | 長期記憶 |
+| `HEARTBEAT.md` | 定期実行する自律タスクリスト |
+| `cron.json` | スケジュールジョブ |
+| `tg_12345.jsonl` | セッション履歴 |
-- **WebSocketゲートウェイ** — ポート18789、LAN内から任意のWebSocketクライアントで接続
-- **OTAアップデート** — WiFi経由でファームウェア更新、USB不要
-- **デュアルコア** — ネットワークI/OとAI処理が別々のCPUコアで動作
-- **HTTPプロキシ** — CONNECTトンネル対応、制限付きネットワークに対応
-- **マルチプロバイダー** — Anthropic (Claude) と OpenAI (GPT) の両方をサポート、実行時に切り替え可能
-- **Cronスケジューラ** — AIが定期・単発タスクを自律的にスケジュール、再起動後も永続化
-- **ハートビート** — タスクファイルを定期チェックし、AIを自律的に駆動
-- **ツール呼び出し** — ReActエージェントループ、両プロバイダーでツール呼び出し対応
+組み込みの自動化機能:
-## 開発者向け
+- `cron_add`、`cron_list`、`cron_remove`
+- heartbeat 駆動の能動的タスク処理
+- ReAct loop におけるツール呼び出し
+- 再起動後も保持されるローカル状態
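+
+たとえば、heartbeat に拾わせる `HEARTBEAT.md` は次のように書けます(内容は例示です。見出しとチェック済み項目はスキップされます):
+
+```markdown
+# Heartbeat Tasks
+- [ ] 昨日の日次メモを要約する
+- [x] チェック済みの項目はスキップされる
+```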
-技術的な詳細は`docs/`フォルダにあります:
+## ツール
-- **[docs/ARCHITECTURE.md](docs/ARCHITECTURE.md)** — システム設計、モジュール構成、タスクレイアウト、メモリバジェット、プロトコル、Flashパーティション
-- **[docs/TODO.md](docs/TODO.md)** — 機能ギャップとロードマップ
-- **[docs/im-integration/](docs/im-integration/README.md)** — IMチャネル統合ガイド(Feishuなど)
+組み込みツール:
-## 貢献
+- `web_search`
+- `get_current_time`
+- `cron_add`
+- `cron_list`
+- `cron_remove`
+- Agent ランタイムが使う SPIFFS ファイル操作ツール
-Issue や Pull Request を作成する前に、**[CONTRIBUTING.md](CONTRIBUTING.md)** をご確認ください。
+Web 検索を有効にするには、次のいずれかを設定します:
-## コントリビューター
+- `MIMI_SECRET_TAVILY_KEY`
+- `MIMI_SECRET_SEARCH_KEY`
-MimiClaw に貢献してくれた皆さんに感謝します。
+## 謝辞
-
-
-
+本プロジェクトは元の [mimiclaw](https://github.com/memovai/mimiclaw) を基盤としています。reSpeaker-claw は、その組み込み agent 基盤を ReSpeaker XVF3800 の音声ハードウェア向けに適応し、STT / TTS パイプラインを拡張しつつ、マルチチャネル agent アーキテクチャを継承しています。
## ライセンス
MIT
-
-## 謝辞
-
-[OpenClaw](https://github.com/openclaw/openclaw)と[Nanobot](https://github.com/HKUDS/nanobot)にインスパイアされました。MimiClawはコアAIエージェントアーキテクチャを組み込みハードウェア向けに再実装しました — Linuxなし、サーバーなし、$5のチップだけ。
-
-## Star History
-
-
-
-
-
-
-
-
diff --git a/assets/banner.png b/assets/banner.png
deleted file mode 100644
index c3cc4255..00000000
Binary files a/assets/banner.png and /dev/null differ
diff --git a/assets/esp32s3-usb-port.jpg b/assets/esp32s3-usb-port.jpg
deleted file mode 100644
index 706f6107..00000000
Binary files a/assets/esp32s3-usb-port.jpg and /dev/null differ
diff --git a/assets/mimiclaw.png b/assets/mimiclaw.png
deleted file mode 100644
index e22246e7..00000000
Binary files a/assets/mimiclaw.png and /dev/null differ
diff --git a/main/CMakeLists.txt b/main/CMakeLists.txt
index 5f3fe1ea..47afb9ec 100644
--- a/main/CMakeLists.txt
+++ b/main/CMakeLists.txt
@@ -21,10 +21,11 @@ idf_component_register(
"tools/tool_get_time.c"
"tools/tool_files.c"
"skills/skill_loader.c"
+ "voice/voice_channel.c"
INCLUDE_DIRS
"."
REQUIRES
nvs_flash esp_wifi esp_netif esp_http_client esp_http_server
esp_https_ota esp_event json spiffs console vfs app_update esp-tls
- esp_timer esp_websocket_client
+ esp_timer esp_websocket_client driver
)
diff --git a/main/agent/agent_loop.c b/main/agent/agent_loop.c
index 7e5eae64..2a513078 100644
--- a/main/agent/agent_loop.c
+++ b/main/agent/agent_loop.c
@@ -86,6 +86,30 @@ static void append_turn_context_prompt(char *prompt, size_t size, const mimi_msg
if (n < 0 || (size_t)n >= (size - off)) {
prompt[size - 1] = '\0';
}
+
+ if (msg->channel[0] && strcmp(msg->channel, MIMI_CHAN_VOICE) == 0) {
+ off = strnlen(prompt, size - 1);
+ if (off >= size - 1) {
+ return;
+ }
+
+ n = snprintf(
+ prompt + off, size - off,
+ "\n## Voice Output Constraints\n"
+ "This reply will be converted to speech (TTS) and played on a small speaker.\n"
+ "- Use English, natural spoken style.\n"
+ "- Keep it short: keep playback within ~%d seconds.\n"
+ "- Structure: at most 2 sentences + 1 short follow-up question.\n"
+ "- Length: <= %d characters total.\n"
+ "- No markdown, no lists, no code blocks, no URLs.\n"
+ "- Avoid long explanations; if the answer is long, give a 1–2 sentence summary and ask if the user wants more.\n",
+ (int)MIMI_VOICE_TTS_MAX_SECONDS,
+ (int)MIMI_VOICE_LLM_MAX_CHARS);
+
+ if (n < 0 || (size_t)n >= (size - off)) {
+ prompt[size - 1] = '\0';
+ }
+ }
}
static char *patch_tool_input_with_context(const llm_tool_call_t *call, const mimi_msg_t *msg)
@@ -218,7 +242,7 @@ static void agent_loop_task(void *arg)
while (iteration < MIMI_AGENT_MAX_TOOL_ITER) {
/* Send "working" indicator before each API call */
#if MIMI_AGENT_SEND_WORKING_STATUS
- if (!sent_working_status && strcmp(msg.channel, MIMI_CHAN_SYSTEM) != 0) {
+ if (!sent_working_status && strcmp(msg.channel, MIMI_CHAN_SYSTEM) != 0 && strcmp(msg.channel, MIMI_CHAN_VOICE) != 0) {
mimi_msg_t status = {0};
strncpy(status.channel, msg.channel, sizeof(status.channel) - 1);
strncpy(status.chat_id, msg.chat_id, sizeof(status.chat_id) - 1);
diff --git a/main/bus/message_bus.h b/main/bus/message_bus.h
index 1fc2d31d..b0fe8edb 100644
--- a/main/bus/message_bus.h
+++ b/main/bus/message_bus.h
@@ -10,6 +10,7 @@
#define MIMI_CHAN_WEBSOCKET "websocket"
#define MIMI_CHAN_CLI "cli"
#define MIMI_CHAN_SYSTEM "system"
+#define MIMI_CHAN_VOICE "voice"
/* Message types on the bus */
typedef struct {
diff --git a/main/cli/serial_cli.c b/main/cli/serial_cli.c
index 4968ff7d..904d90d4 100644
--- a/main/cli/serial_cli.c
+++ b/main/cli/serial_cli.c
@@ -123,6 +123,12 @@ static struct {
struct arg_end *end;
} api_key_args;
+/* --- set_api_base command --- */
+static struct {
+ struct arg_str *base;
+ struct arg_end *end;
+} api_base_args;
+
static int cmd_set_api_key(int argc, char **argv)
{
int nerrors = arg_parse(argc, argv, (void **)&api_key_args);
@@ -135,6 +141,18 @@ static int cmd_set_api_key(int argc, char **argv)
return 0;
}
+static int cmd_set_api_base(int argc, char **argv)
+{
+ int nerrors = arg_parse(argc, argv, (void **)&api_base_args);
+ if (nerrors != 0) {
+ arg_print_errors(stderr, api_base_args.end, argv[0]);
+ return 1;
+ }
+ llm_set_api_base(api_base_args.base->sval[0]);
+ printf("API base set.\n");
+ return 0;
+}
+
/* --- set_model command --- */
static struct {
struct arg_str *model;
@@ -535,6 +553,7 @@ static int cmd_config_show(int argc, char **argv)
print_config("WiFi Pass", MIMI_NVS_WIFI, MIMI_NVS_KEY_PASS, MIMI_SECRET_WIFI_PASS, true);
print_config("TG Token", MIMI_NVS_TG, MIMI_NVS_KEY_TG_TOKEN, MIMI_SECRET_TG_TOKEN, true);
print_config("API Key", MIMI_NVS_LLM, MIMI_NVS_KEY_API_KEY, MIMI_SECRET_API_KEY, true);
+ print_config("API Base", MIMI_NVS_LLM, MIMI_NVS_KEY_API_BASE, MIMI_SECRET_API_BASE, false);
print_config("Model", MIMI_NVS_LLM, MIMI_NVS_KEY_MODEL, MIMI_SECRET_MODEL, false);
print_config("Provider", MIMI_NVS_LLM, MIMI_NVS_KEY_PROVIDER, MIMI_SECRET_MODEL_PROVIDER, false);
print_config("Proxy Host", MIMI_NVS_PROXY, MIMI_NVS_KEY_PROXY_HOST, MIMI_SECRET_PROXY_HOST, false);
@@ -849,6 +868,17 @@ esp_err_t serial_cli_init(void)
};
esp_console_cmd_register(&api_key_cmd);
+ /* set_api_base */
+    api_base_args.base = arg_str1(NULL, NULL, "<base>", "LLM API base (http(s)://host[:port][/path])");
+ api_base_args.end = arg_end(1);
+ esp_console_cmd_t api_base_cmd = {
+ .command = "set_api_base",
+ .help = "Set LLM API base (e.g. https://api.anthropic.com/v1)",
+ .func = &cmd_set_api_base,
+ .argtable = &api_base_args,
+ };
+ esp_console_cmd_register(&api_base_cmd);
+
/* set_model */
model_args.model = arg_str1(NULL, NULL, "", "Model identifier");
model_args.end = arg_end(1);
@@ -1054,4 +1084,4 @@ esp_err_t serial_cli_init(void)
ESP_LOGI(TAG, "Serial CLI started");
return ESP_OK;
 }
diff --git a/main/llm/llm_proxy.c b/main/llm/llm_proxy.c
index c6fa1b88..adcd74d5 100644
--- a/main/llm/llm_proxy.c
+++ b/main/llm/llm_proxy.c
@@ -15,12 +15,37 @@ static const char *TAG = "llm";
#define LLM_API_KEY_MAX_LEN 320
#define LLM_MODEL_MAX_LEN 64
+#define LLM_API_BASE_MAX_LEN 256
+#define LLM_HOST_MAX_LEN 128
+#define LLM_PATH_MAX_LEN 128
#define LLM_DUMP_MAX_BYTES (16 * 1024)
#define LLM_DUMP_CHUNK_BYTES 320
static char s_api_key[LLM_API_KEY_MAX_LEN] = {0};
static char s_model[LLM_MODEL_MAX_LEN] = MIMI_LLM_DEFAULT_MODEL;
+static char s_model_id[LLM_MODEL_MAX_LEN] = {0};
static char s_provider[16] = MIMI_LLM_PROVIDER_DEFAULT;
+static char s_api_base[LLM_API_BASE_MAX_LEN] = {0};
+
+typedef enum {
+ LLM_PROTOCOL_ANTHROPIC = 0,
+ LLM_PROTOCOL_OPENAI = 1,
+} llm_protocol_t;
+
+static llm_protocol_t s_protocol = LLM_PROTOCOL_ANTHROPIC;
+static bool s_api_tls = true;
+static char s_api_host[LLM_HOST_MAX_LEN] = {0};
+static uint16_t s_api_port = 443;
+static char s_api_base_path[LLM_PATH_MAX_LEN] = {0};
+static char s_api_req_path[LLM_PATH_MAX_LEN + 32] = {0};
+static char s_api_host_header[LLM_HOST_MAX_LEN + 8] = {0};
+static char s_api_url[LLM_API_BASE_MAX_LEN + 64] = {0};
+static bool s_logged_proxy_bypass_warning = false;
+
+static const char *llm_protocol_name(llm_protocol_t p)
+{
+ return (p == LLM_PROTOCOL_OPENAI) ? "openai" : "anthropic";
+}
static void llm_log_payload(const char *label, const char *payload)
{
@@ -180,29 +205,157 @@ static esp_err_t http_event_handler(esp_http_client_event_t *evt)
return ESP_OK;
}
-/* ── Provider helpers ──────────────────────────────────────────── */
+/* ── Protocol config ─────────────────────────────────────────── */
-static bool provider_is_openai(void)
-{
- return strcmp(s_provider, "openai") == 0;
+typedef struct {
+ llm_protocol_t protocol;
+ const char *label; /* "openai" */
+ const char *prefix; /* "openai/" */
+ const char *suffix; /* "/chat/completions" */
+ const char *base; /* Default API base */
+} llm_proto_cfg_t;
+
+static const llm_proto_cfg_t PROTO_MAP[] = {
+ {LLM_PROTOCOL_OPENAI, "openai", "openai/", "/chat/completions", MIMI_LLM_API_BASE_OPENAI},
+ {LLM_PROTOCOL_ANTHROPIC, "anthropic", "anthropic/", "/messages", MIMI_LLM_API_BASE_ANTHROPIC}
+};
+
+static const llm_proto_cfg_t* get_current_proto(void) {
+ return &PROTO_MAP[s_protocol == LLM_PROTOCOL_OPENAI ? 0 : 1];
}
-static const char *llm_api_url(void)
-{
- return provider_is_openai() ? MIMI_OPENAI_API_URL : MIMI_LLM_API_URL;
+/* ── Helpers ─────────────────────────────────────────────────── */
+
+static bool llm_protocol_is_openai(void) {
+ return s_protocol == LLM_PROTOCOL_OPENAI;
}
-static const char *llm_api_host(void)
-{
- return provider_is_openai() ? "api.openai.com" : "api.anthropic.com";
+/* Validate api_base format without modifying global state */
+static esp_err_t llm_validate_api_base(const char *api_base) {
+ if (!api_base || api_base[0] == '\0') return ESP_ERR_INVALID_ARG;
+
+ /* Check for valid scheme */
+ const char *p;
+ if (strncmp(api_base, "https://", 8) == 0) {
+ p = api_base + 8;
+ } else if (strncmp(api_base, "http://", 7) == 0) {
+ p = api_base + 7;
+ } else {
+ return ESP_ERR_INVALID_ARG;
+ }
+
+ /* Basic format validation - ensure there's content after the scheme */
+ if (p[0] == '\0' || p[0] == '/' || p[0] == ':') {
+ return ESP_ERR_INVALID_ARG;
+ }
+
+ /* Check for valid host part (before colon or slash) */
+ const char *slash = strchr(p, '/');
+ const char *colon = strchr(p, ':');
+ if (colon && slash && colon > slash) colon = NULL; /* Colon is part of path */
+
+ const char *host_end = colon ? colon : (slash ? slash : p + strlen(p));
+ if (host_end == p) return ESP_ERR_INVALID_ARG; /* Empty host */
+
+ /* Validate port if present */
+ if (colon) {
+ char *endptr;
+ long port = strtol(colon + 1, &endptr, 10);
+ if (endptr == colon + 1 || (*endptr != '\0' && *endptr != '/') ||
+ port < 1 || port > 65535) {
+ return ESP_ERR_INVALID_ARG;
+ }
+ }
+
+ return ESP_OK;
}
-static const char *llm_api_path(void)
-{
- return provider_is_openai() ? "/v1/chat/completions" : "/v1/messages";
+/* Parse api_base: scheme (http/https), host[:port], optional base path. */
+static esp_err_t llm_parse_api_base(const char *api_base) {
+ if (!api_base || api_base[0] == '\0') return ESP_ERR_INVALID_ARG;
+
+ const char *p;
+ if (strncmp(api_base, "https://", 8) == 0) {
+ s_api_tls = true; p = api_base + 8; s_api_port = 443;
+ } else if (strncmp(api_base, "http://", 7) == 0) {
+ s_api_tls = false; p = api_base + 7; s_api_port = 80;
+ } else return ESP_ERR_INVALID_ARG;
+
+ const char *slash = strchr(p, '/');
+ const char *colon = strchr(p, ':');
+ if (colon && slash && colon > slash) colon = NULL; /* Colon is part of path */
+
+ const char *host_end = colon ? colon : (slash ? slash : p + strlen(p));
+ snprintf(s_api_host, sizeof(s_api_host), "%.*s", (int)(host_end - p), p);
+
+ if (colon) {
+ char *endptr;
+ long port = strtol(colon + 1, &endptr, 10);
+ if (endptr != colon + 1 && (*endptr == '\0' || *endptr == '/') &&
+ port >= 1 && port <= 65535) {
+ s_api_port = (uint16_t)port;
+ }
+ /* If port parsing fails, keep the default port (443 for HTTPS, 80 for HTTP) */
+ }
+
+ s_api_base_path[0] = '\0';
+ if (slash) {
+ safe_copy(s_api_base_path, sizeof(s_api_base_path), slash);
+ size_t len = strlen(s_api_base_path);
+ while (len > 0 && s_api_base_path[len - 1] == '/') s_api_base_path[--len] = '\0';
+ }
+ return ESP_OK;
+}
+
+/* Build derived request path, Host header, and full URL strings. */
+static void llm_build_request_targets(void) {
+ const llm_proto_cfg_t *cfg = get_current_proto();
+
+ snprintf(s_api_req_path, sizeof(s_api_req_path), "%s%s", s_api_base_path, cfg->suffix);
+ if (s_api_req_path[0] == '\0') strcpy(s_api_req_path, "/");
+
+ bool is_std = (s_api_tls && s_api_port == 443) || (!s_api_tls && s_api_port == 80);
+ if (is_std) {
+ snprintf(s_api_host_header, sizeof(s_api_host_header), "%s", s_api_host);
+ } else {
+ snprintf(s_api_host_header, sizeof(s_api_host_header), "%s:%u", s_api_host, s_api_port);
+ }
+
+ snprintf(s_api_url, sizeof(s_api_url), "%s://%s%s",
+ s_api_tls ? "https" : "http", s_api_host_header, s_api_req_path);
}
-/* ── Init ─────────────────────────────────────────────────────── */
+/* ── Derived config ──────────────────────────────────────────── */
+
+static void llm_recompute_effective_config(void) {
+ /* Determine protocol + model_id (prefix overrides provider), and update request targets. */
+ s_logged_proxy_bypass_warning = false; /* Reset warning flag when config changes */
+ s_protocol = (strcmp(s_provider, "openai") == 0) ? LLM_PROTOCOL_OPENAI : LLM_PROTOCOL_ANTHROPIC;
+ const char *model_id = s_model;
+
+ for (int i = 0; i < 2; i++) {
+ size_t len = strlen(PROTO_MAP[i].prefix);
+ if (strncmp(s_model, PROTO_MAP[i].prefix, len) == 0 && s_model[len] != '\0') {
+ s_protocol = PROTO_MAP[i].protocol;
+ model_id = s_model + len;
+ break;
+ }
+ }
+ safe_copy(s_model_id, sizeof(s_model_id), model_id);
+
+ const char *default_base = get_current_proto()->base;
+ const char *base = (s_api_base[0] != '\0') ? s_api_base : default_base;
+
+ if (llm_parse_api_base(base) != ESP_OK) {
+ ESP_LOGE(TAG, "Failed to parse API base: %s. Using default.", base);
+ llm_parse_api_base(default_base);
+ }
+
+ llm_build_request_targets();
+
+ ESP_LOGI(TAG, "Configured: Protocol=%s, Model=%s, URL=%s",
+ get_current_proto()->label, s_model_id, s_api_url);
+}
esp_err_t llm_proxy_init(void)
{
@@ -210,6 +363,9 @@ esp_err_t llm_proxy_init(void)
if (MIMI_SECRET_API_KEY[0] != '\0') {
safe_copy(s_api_key, sizeof(s_api_key), MIMI_SECRET_API_KEY);
}
+ if (MIMI_SECRET_API_BASE[0] != '\0') {
+ safe_copy(s_api_base, sizeof(s_api_base), MIMI_SECRET_API_BASE);
+ }
if (MIMI_SECRET_MODEL[0] != '\0') {
safe_copy(s_model, sizeof(s_model), MIMI_SECRET_MODEL);
}
@@ -225,6 +381,11 @@ esp_err_t llm_proxy_init(void)
if (nvs_get_str(nvs, MIMI_NVS_KEY_API_KEY, tmp, &len) == ESP_OK && tmp[0]) {
safe_copy(s_api_key, sizeof(s_api_key), tmp);
}
+ char base_tmp[LLM_API_BASE_MAX_LEN] = {0};
+ len = sizeof(base_tmp);
+ if (nvs_get_str(nvs, MIMI_NVS_KEY_API_BASE, base_tmp, &len) == ESP_OK && base_tmp[0]) {
+ safe_copy(s_api_base, sizeof(s_api_base), base_tmp);
+ }
char model_tmp[LLM_MODEL_MAX_LEN] = {0};
len = sizeof(model_tmp);
if (nvs_get_str(nvs, MIMI_NVS_KEY_MODEL, model_tmp, &len) == ESP_OK && model_tmp[0]) {
@@ -238,9 +399,9 @@ esp_err_t llm_proxy_init(void)
nvs_close(nvs);
}
- if (s_api_key[0]) {
- ESP_LOGI(TAG, "LLM proxy initialized (provider: %s, model: %s)", s_provider, s_model);
- } else {
+ llm_recompute_effective_config();
+
+ if (s_api_key[0] == '\0') {
ESP_LOGW(TAG, "No API key. Use CLI: set_api_key ");
}
return ESP_OK;
@@ -251,7 +412,7 @@ esp_err_t llm_proxy_init(void)
static esp_err_t llm_http_direct(const char *post_data, resp_buf_t *rb, int *out_status)
{
esp_http_client_config_t config = {
- .url = llm_api_url(),
+ .url = s_api_url,
.event_handler = http_event_handler,
.user_data = rb,
.timeout_ms = 120 * 1000,
@@ -265,14 +426,16 @@ static esp_err_t llm_http_direct(const char *post_data, resp_buf_t *rb, int *out
esp_http_client_set_method(client, HTTP_METHOD_POST);
esp_http_client_set_header(client, "Content-Type", "application/json");
- if (provider_is_openai()) {
+ if (llm_protocol_is_openai()) {
if (s_api_key[0]) {
char auth[LLM_API_KEY_MAX_LEN + 16];
snprintf(auth, sizeof(auth), "Bearer %s", s_api_key);
esp_http_client_set_header(client, "Authorization", auth);
}
} else {
- esp_http_client_set_header(client, "x-api-key", s_api_key);
+ if (s_api_key[0] != '\0') {
+ esp_http_client_set_header(client, "x-api-key", s_api_key);
+ }
esp_http_client_set_header(client, "anthropic-version", MIMI_LLM_API_VERSION);
}
esp_http_client_set_post_field(client, post_data, strlen(post_data));
@@ -287,80 +450,71 @@ static esp_err_t llm_http_direct(const char *post_data, resp_buf_t *rb, int *out
static esp_err_t llm_http_via_proxy(const char *post_data, resp_buf_t *rb, int *out_status)
{
- proxy_conn_t *conn = proxy_conn_open(llm_api_host(), 443, 30000);
+ proxy_conn_t *conn = proxy_conn_open(s_api_host, s_api_port, 30000);
if (!conn) return ESP_ERR_HTTP_CONNECT;
- int body_len = strlen(post_data);
- char header[1024];
- int hlen = 0;
- if (provider_is_openai()) {
- hlen = snprintf(header, sizeof(header),
- "POST %s HTTP/1.1\r\n"
- "Host: %s\r\n"
- "Content-Type: application/json\r\n"
- "Authorization: Bearer %s\r\n"
- "Content-Length: %d\r\n"
- "Connection: close\r\n\r\n",
- llm_api_path(), llm_api_host(), s_api_key, body_len);
+ /* Build request headers */
+ char h[1024];
+ int off = snprintf(h, sizeof(h), "POST %s HTTP/1.1\r\nHost: %s\r\nContent-Type: application/json\r\n",
+ s_api_req_path, s_api_host_header);
+
+ if (llm_protocol_is_openai()) {
+ if (s_api_key[0] != '\0') {
+ off += snprintf(h + off, sizeof(h) - off, "Authorization: Bearer %s\r\n", s_api_key);
+ }
} else {
- hlen = snprintf(header, sizeof(header),
- "POST %s HTTP/1.1\r\n"
- "Host: %s\r\n"
- "Content-Type: application/json\r\n"
- "x-api-key: %s\r\n"
- "anthropic-version: %s\r\n"
- "Content-Length: %d\r\n"
- "Connection: close\r\n\r\n",
- llm_api_path(), llm_api_host(), s_api_key, MIMI_LLM_API_VERSION, body_len);
- }
-
- if (proxy_conn_write(conn, header, hlen) < 0 ||
- proxy_conn_write(conn, post_data, body_len) < 0) {
+ if (s_api_key[0] != '\0') {
+ off += snprintf(h + off, sizeof(h) - off, "x-api-key: %s\r\n", s_api_key);
+ }
+ off += snprintf(h + off, sizeof(h) - off, "anthropic-version: %s\r\n", MIMI_LLM_API_VERSION);
+ }
+
+ off += snprintf(h + off, sizeof(h) - off, "Content-Length: %zu\r\nConnection: close\r\n\r\n", strlen(post_data));
+
+ /* Send */
+ if (off >= sizeof(h) || proxy_conn_write(conn, h, off) < 0 ||
+ proxy_conn_write(conn, post_data, strlen(post_data)) < 0) {
proxy_conn_close(conn);
return ESP_ERR_HTTP_WRITE_DATA;
}
- /* Read full response into buffer */
- char tmp[4096];
- while (1) {
- int n = proxy_conn_read(conn, tmp, sizeof(tmp), 120000);
- if (n <= 0) break;
+ /* Receive full response */
+ char tmp[1024];
+ int n;
+ while ((n = proxy_conn_read(conn, tmp, sizeof(tmp), 120000)) > 0) {
if (resp_buf_append(rb, tmp, n) != ESP_OK) break;
+ vTaskDelay(pdMS_TO_TICKS(1));
}
proxy_conn_close(conn);
- /* Parse status line */
- *out_status = 0;
- if (rb->len > 5 && strncmp(rb->data, "HTTP/", 5) == 0) {
- const char *sp = strchr(rb->data, ' ');
- if (sp) *out_status = atoi(sp + 1);
- }
+ /* Parse status */
+ *out_status = (rb->len > 12 && strncmp(rb->data, "HTTP/", 5) == 0) ? atoi(rb->data + 9) : 0;
- /* Strip HTTP headers, keep body only */
+ /* Strip headers */
char *body = strstr(rb->data, "\r\n\r\n");
if (body) {
body += 4;
- size_t blen = rb->len - (body - rb->data);
- memmove(rb->data, body, blen);
- rb->len = blen;
+ rb->len -= (body - rb->data);
+ memmove(rb->data, body, rb->len);
rb->data[rb->len] = '\0';
}
- /* Decode chunked transfer encoding if present */
resp_buf_decode_chunked(rb);
-
return ESP_OK;
}
-/* ── Shared HTTP dispatch ─────────────────────────────────────── */
-
static esp_err_t llm_http_call(const char *post_data, resp_buf_t *rb, int *out_status)
{
if (http_proxy_is_enabled()) {
- return llm_http_via_proxy(post_data, rb, out_status);
- } else {
- return llm_http_direct(post_data, rb, out_status);
+ if (s_api_tls) {
+ return llm_http_via_proxy(post_data, rb, out_status);
+ }
+ if (!s_logged_proxy_bypass_warning) {
+ ESP_LOGW(TAG, "Proxy configured but api_base is http; bypassing proxy");
+ s_logged_proxy_bypass_warning = true;
+ }
}
+ return llm_http_direct(post_data, rb, out_status);
}
static cJSON *convert_tools_openai(const char *tools_json)
@@ -554,18 +708,16 @@ esp_err_t llm_chat_tools(const char *system_prompt,
{
memset(resp, 0, sizeof(*resp));
- if (s_api_key[0] == '\0') return ESP_ERR_INVALID_STATE;
-
/* Build request body (non-streaming) */
cJSON *body = cJSON_CreateObject();
- cJSON_AddStringToObject(body, "model", s_model);
- if (provider_is_openai()) {
+ cJSON_AddStringToObject(body, "model", s_model_id);
+ if (strncasecmp(s_model_id, "gpt-5", 5) == 0 || strncasecmp(s_model_id, "o1", 2) == 0) {
cJSON_AddNumberToObject(body, "max_completion_tokens", MIMI_LLM_MAX_TOKENS);
} else {
cJSON_AddNumberToObject(body, "max_tokens", MIMI_LLM_MAX_TOKENS);
}
- if (provider_is_openai()) {
+ if (llm_protocol_is_openai()) {
cJSON *openai_msgs = convert_messages_openai(system_prompt, messages);
cJSON_AddItemToObject(body, "messages", openai_msgs);
@@ -596,8 +748,8 @@ esp_err_t llm_chat_tools(const char *system_prompt,
cJSON_Delete(body);
if (!post_data) return ESP_ERR_NO_MEM;
- ESP_LOGI(TAG, "Calling LLM API with tools (provider: %s, model: %s, body: %d bytes)",
- s_provider, s_model, (int)strlen(post_data));
+ ESP_LOGI(TAG, "Calling LLM API with tools (protocol: %s, model: %s, body: %d bytes)",
+ llm_protocol_name(s_protocol), s_model_id, (int)strlen(post_data));
llm_log_payload("LLM tools request", post_data);
/* HTTP call */
@@ -635,7 +787,7 @@ esp_err_t llm_chat_tools(const char *system_prompt,
return ESP_FAIL;
}
- if (provider_is_openai()) {
+ if (llm_protocol_is_openai()) {
cJSON *choices = cJSON_GetObjectItem(root, "choices");
cJSON *choice0 = choices && cJSON_IsArray(choices) ? cJSON_GetArrayItem(choices, 0) : NULL;
if (choice0) {
@@ -784,6 +936,27 @@ esp_err_t llm_set_api_key(const char *api_key)
return ESP_OK;
}
+esp_err_t llm_set_api_base(const char *api_base)
+{
+ /* Validate before persisting - use validation-only function */
+ esp_err_t err = llm_validate_api_base(api_base);
+ if (err != ESP_OK) {
+ ESP_LOGE(TAG, "Invalid API base format: %s", api_base ? api_base : "");
+ return err;
+ }
+
+ nvs_handle_t nvs;
+ ESP_ERROR_CHECK(nvs_open(MIMI_NVS_LLM, NVS_READWRITE, &nvs));
+ ESP_ERROR_CHECK(nvs_set_str(nvs, MIMI_NVS_KEY_API_BASE, api_base));
+ ESP_ERROR_CHECK(nvs_commit(nvs));
+ nvs_close(nvs);
+
+ safe_copy(s_api_base, sizeof(s_api_base), api_base);
+ llm_recompute_effective_config();
+ ESP_LOGI(TAG, "API base set");
+ return ESP_OK;
+}
+
esp_err_t llm_set_model(const char *model)
{
nvs_handle_t nvs;
@@ -793,6 +966,7 @@ esp_err_t llm_set_model(const char *model)
nvs_close(nvs);
safe_copy(s_model, sizeof(s_model), model);
+ llm_recompute_effective_config();
ESP_LOGI(TAG, "Model set to: %s", s_model);
return ESP_OK;
}
@@ -806,6 +980,7 @@ esp_err_t llm_set_provider(const char *provider)
nvs_close(nvs);
safe_copy(s_provider, sizeof(s_provider), provider);
+ llm_recompute_effective_config();
ESP_LOGI(TAG, "Provider set to: %s", s_provider);
return ESP_OK;
-}
+}
\ No newline at end of file
diff --git a/main/llm/llm_proxy.h b/main/llm/llm_proxy.h
index b667f624..7d333b84 100644
--- a/main/llm/llm_proxy.h
+++ b/main/llm/llm_proxy.h
@@ -17,6 +17,19 @@ esp_err_t llm_proxy_init(void);
*/
esp_err_t llm_set_api_key(const char *api_key);
+/**
+ * Save the LLM API base URL to NVS.
+ *
+ * Expected format: http(s)://host[:port][/path]
+ * Examples:
+ * - https://api.anthropic.com/v1
+ * - https://api.openai.com/v1
+ * - http://localhost:11434/v1
+ * - https://api.minimaxi.com/anthropic/v1
+ * - https://open.bigmodel.cn/api/paas/v4
+ */
+esp_err_t llm_set_api_base(const char *api_base);
+
/**
* Save the LLM provider to NVS. (e.g. "anthropic", "openai")
*/
@@ -58,4 +71,4 @@ void llm_response_free(llm_response_t *resp);
esp_err_t llm_chat_tools(const char *system_prompt,
cJSON *messages,
const char *tools_json,
- llm_response_t *resp);
+ llm_response_t *resp);
\ No newline at end of file
diff --git a/main/mimi.c b/main/mimi.c
index 0e8e8fa7..430b9b88 100644
--- a/main/mimi.c
+++ b/main/mimi.c
@@ -25,6 +25,7 @@
#include "cron/cron_service.h"
#include "heartbeat/heartbeat.h"
#include "skills/skill_loader.h"
+#include "voice/voice_channel.h"
static const char *TAG = "mimi";
@@ -60,41 +61,73 @@ static esp_err_t init_spiffs(void)
return ESP_OK;
}
-
+static void voice_speak_task(void *arg)
+{
+ char *text = (char *)arg;
+ if (text) {
+ esp_err_t err = voice_channel_speak_text(text);
+ if (err != ESP_OK) {
+ ESP_LOGW(TAG, "Voice playback failed: %s", esp_err_to_name(err));
+ }
+ free(text);
+ }
+ vTaskDelete(NULL);
+}
/* Outbound dispatch task: reads from outbound queue and routes to channels */
static void outbound_dispatch_task(void *arg)
{
- ESP_LOGI(TAG, "Outbound dispatch started");
+ (void)arg;
+ ESP_LOGI(TAG, "Outbound dispatch started on core %d", xPortGetCoreID());
while (1) {
- mimi_msg_t msg;
- if (message_bus_pop_outbound(&msg, UINT32_MAX) != ESP_OK) continue;
+ mimi_msg_t msg = {0};
+ if (message_bus_pop_outbound(&msg, UINT32_MAX) != ESP_OK) {
+ continue;
+ }
- ESP_LOGI(TAG, "Dispatching response to %s:%s", msg.channel, msg.chat_id);
+
+ ESP_LOGI(TAG, "Dispatching response to %s:%s",
+ msg.channel[0] ? msg.channel : "(unknown)",
+ msg.chat_id[0] ? msg.chat_id : "(empty)");
+ if (!msg.content || !msg.content[0]) {
+ free(msg.content);
+ continue;
+ }
if (strcmp(msg.channel, MIMI_CHAN_TELEGRAM) == 0) {
- esp_err_t send_err = telegram_send_message(msg.chat_id, msg.content);
- if (send_err != ESP_OK) {
- ESP_LOGE(TAG, "Telegram send failed for %s: %s", msg.chat_id, esp_err_to_name(send_err));
- } else {
- ESP_LOGI(TAG, "Telegram send success for %s (%d bytes)", msg.chat_id, (int)strlen(msg.content));
- }
+ telegram_send_message(msg.chat_id, msg.content);
+
} else if (strcmp(msg.channel, MIMI_CHAN_FEISHU) == 0) {
- esp_err_t send_err = feishu_send_message(msg.chat_id, msg.content);
- if (send_err != ESP_OK) {
- ESP_LOGE(TAG, "Feishu send failed for %s: %s", msg.chat_id, esp_err_to_name(send_err));
- } else {
- ESP_LOGI(TAG, "Feishu send success for %s (%d bytes)", msg.chat_id, (int)strlen(msg.content));
- }
+ feishu_send_message(msg.chat_id, msg.content);
+
} else if (strcmp(msg.channel, MIMI_CHAN_WEBSOCKET) == 0) {
- esp_err_t ws_err = ws_server_send(msg.chat_id, msg.content);
- if (ws_err != ESP_OK) {
- ESP_LOGW(TAG, "WS send failed for %s: %s", msg.chat_id, esp_err_to_name(ws_err));
+ ws_server_send(msg.chat_id, msg.content);
+
+ } else if (strcmp(msg.channel, MIMI_CHAN_VOICE) == 0) {
+ char *copy = strdup(msg.content);
+ if (!copy) {
+ ESP_LOGW(TAG, "No memory for voice speak task");
+ } else {
+ BaseType_t ok = xTaskCreatePinnedToCore(
+ voice_speak_task,
+ "voice_speak",
+ MIMI_VOICE_SPEAK_STACK,
+ copy,
+ MIMI_VOICE_SPEAK_PRIO,
+ NULL,
+ MIMI_VOICE_SPEAK_CORE
+ );
+ if (ok != pdPASS) {
+ ESP_LOGW(TAG, "Failed to create voice_speak task");
+ free(copy);
+ }
}
- } else if (strcmp(msg.channel, MIMI_CHAN_SYSTEM) == 0) {
- ESP_LOGI(TAG, "System message [%s]: %.128s", msg.chat_id, msg.content);
+
+ } else if (strcmp(msg.channel, MIMI_CHAN_CLI) == 0) {
+ printf("\n%s\n", msg.content);
+
} else {
- ESP_LOGW(TAG, "Unknown channel: %s", msg.channel);
+ ESP_LOGW(TAG, "Unknown outbound channel: %s", msg.channel);
}
free(msg.content);
@@ -134,6 +167,7 @@ void app_main(void)
ESP_ERROR_CHECK(tool_registry_init());
ESP_ERROR_CHECK(cron_service_init());
ESP_ERROR_CHECK(heartbeat_init());
+ ESP_ERROR_CHECK(voice_channel_init());
ESP_ERROR_CHECK(agent_loop_init());
/* Start Serial CLI first (works without WiFi) */
@@ -161,6 +195,7 @@ void app_main(void)
ESP_ERROR_CHECK(feishu_bot_start());
cron_service_start();
heartbeat_start();
+ voice_channel_start();
ESP_ERROR_CHECK(ws_server_start());
ESP_LOGI(TAG, "All services started!");
diff --git a/main/mimi_config.h b/main/mimi_config.h
index 9be7c087..68f99369 100644
--- a/main/mimi_config.h
+++ b/main/mimi_config.h
@@ -19,6 +19,9 @@
#ifndef MIMI_SECRET_API_KEY
#define MIMI_SECRET_API_KEY ""
#endif
+#ifndef MIMI_SECRET_LLM_API_URL
+#define MIMI_SECRET_LLM_API_URL ""
+#endif
#ifndef MIMI_SECRET_MODEL
#define MIMI_SECRET_MODEL ""
#endif
@@ -46,6 +49,36 @@
#ifndef MIMI_SECRET_TAVILY_KEY
#define MIMI_SECRET_TAVILY_KEY ""
#endif
+#ifndef MIMI_SECRET_STT_URL
+#define MIMI_SECRET_STT_URL ""
+#endif
+#ifndef MIMI_SECRET_STT_API_KEY
+#define MIMI_SECRET_STT_API_KEY ""
+#endif
+#ifndef MIMI_SECRET_STT_MODEL
+#define MIMI_SECRET_STT_MODEL ""
+#endif
+#ifndef MIMI_SECRET_TTS_URL
+#define MIMI_SECRET_TTS_URL ""
+#endif
+#ifndef MIMI_SECRET_TTS_API_KEY
+#define MIMI_SECRET_TTS_API_KEY ""
+#endif
+#ifndef MIMI_SECRET_TTS_VOICE
+#define MIMI_SECRET_TTS_VOICE "Cherry"
+#endif
+#ifndef MIMI_SECRET_TTS_MODEL
+#define MIMI_SECRET_TTS_MODEL ""
+#endif
+#ifndef MIMI_SECRET_TTS_LANGUAGE
+#define MIMI_SECRET_TTS_LANGUAGE "English"
+#endif
+
+/* Qwen voice API defaults (DashScope) */
+#define MIMI_QWEN_STT_URL "https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions"
+#define MIMI_QWEN_STT_MODEL "qwen3-asr-flash"
+#define MIMI_QWEN_TTS_URL "https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation"
+#define MIMI_QWEN_TTS_MODEL "qwen3-tts-flash"
/* WiFi */
#define MIMI_WIFI_MAX_RETRY 10
@@ -79,6 +112,40 @@
#define MIMI_MAX_TOOL_CALLS 4
#define MIMI_AGENT_SEND_WORKING_STATUS 1
+/* Voice UX (LLM -> TTS) */
+/* Rough speaking rate for Simplified Chinese TTS is often ~4–6 chars/sec depending on voice.
+ * Default limits aim to keep playback under ~20 seconds in typical conditions.
+ * Override these in mimi_secrets.h per your preferred voice/speed.
+ */
+#ifndef MIMI_VOICE_TTS_MAX_SECONDS
+#define MIMI_VOICE_TTS_MAX_SECONDS 20
+#endif
+
+#ifndef MIMI_VOICE_TTS_CHARS_PER_SEC
+#define MIMI_VOICE_TTS_CHARS_PER_SEC 7
+#endif
+
+#ifndef MIMI_VOICE_LLM_MAX_CHARS
+#define MIMI_VOICE_LLM_MAX_CHARS (MIMI_VOICE_TTS_MAX_SECONDS * MIMI_VOICE_TTS_CHARS_PER_SEC)
+#endif
+
+#ifndef MIMI_VOICE_TTS_MAX_CHARS
+#define MIMI_VOICE_TTS_MAX_CHARS (MIMI_VOICE_LLM_MAX_CHARS + 10)
+#endif
+
+/* Voice capture (VAD / STT trigger) */
+#ifndef MIMI_VOICE_VAD_START_FRAMES
+#define MIMI_VOICE_VAD_START_FRAMES 4 /* consecutive frames above threshold to enter speech */
+#endif
+
+#ifndef MIMI_VOICE_VAD_MIN_FRAMES
+#define MIMI_VOICE_VAD_MIN_FRAMES 50 /* minimum utterance frames before sending to STT */
+#endif
+
+#ifndef MIMI_VOICE_STT_COOLDOWN_MS
+#define MIMI_VOICE_STT_COOLDOWN_MS 2000 /* cooldown after an STT attempt to reduce re-trigger */
+#endif
+
/* Timezone (POSIX TZ format) */
#define MIMI_TIMEZONE "PST8PDT,M3.2.0,M11.1.0"
@@ -86,8 +153,8 @@
#define MIMI_LLM_DEFAULT_MODEL "claude-opus-4-5"
#define MIMI_LLM_PROVIDER_DEFAULT "anthropic"
#define MIMI_LLM_MAX_TOKENS 4096
-#define MIMI_LLM_API_URL "https://api.anthropic.com/v1/messages"
-#define MIMI_OPENAI_API_URL "https://api.openai.com/v1/chat/completions"
+#define MIMI_LLM_API_BASE_ANTHROPIC "https://api.anthropic.com/v1"
+#define MIMI_LLM_API_BASE_OPENAI "https://api.openai.com/v1"
#define MIMI_LLM_API_VERSION "2023-06-01"
#define MIMI_LLM_STREAM_BUF_SIZE (32 * 1024)
#define MIMI_LLM_LOG_VERBOSE_PAYLOAD 0
@@ -99,6 +166,22 @@
#define MIMI_OUTBOUND_PRIO 5
#define MIMI_OUTBOUND_CORE 0
+/* Voice speak task (TTS download + resample + playback) */
+#ifndef MIMI_VOICE_SPEAK_STACK
+#define MIMI_VOICE_SPEAK_STACK (12 * 1024)
+#endif
+#ifndef MIMI_VOICE_SPEAK_PRIO
+#define MIMI_VOICE_SPEAK_PRIO 5
+#endif
+#ifndef MIMI_VOICE_SPEAK_CORE
+#define MIMI_VOICE_SPEAK_CORE 1
+#endif
+
+/* WiFi reliability */
+#ifndef MIMI_WIFI_DISABLE_POWERSAVE
+#define MIMI_WIFI_DISABLE_POWERSAVE 1
+#endif
+
/* Memory / SPIFFS */
#define MIMI_SPIFFS_BASE "/spiffs"
#define MIMI_SPIFFS_CONFIG_DIR MIMI_SPIFFS_BASE "/config"
@@ -144,6 +227,7 @@
#define MIMI_NVS_KEY_FEISHU_APP_ID "app_id"
#define MIMI_NVS_KEY_FEISHU_APP_SECRET "app_secret"
#define MIMI_NVS_KEY_API_KEY "api_key"
+#define MIMI_NVS_KEY_API_BASE "api_base"
#define MIMI_NVS_KEY_TAVILY_KEY "tavily_key"
#define MIMI_NVS_KEY_MODEL "model"
#define MIMI_NVS_KEY_PROVIDER "provider"
diff --git a/main/mimi_secrets.h.example b/main/mimi_secrets.h.example
index ecebf54e..1852f66c 100644
--- a/main/mimi_secrets.h.example
+++ b/main/mimi_secrets.h.example
@@ -21,8 +21,9 @@
#define MIMI_SECRET_FEISHU_APP_ID ""
#define MIMI_SECRET_FEISHU_APP_SECRET ""
-/* Anthropic API */
+/* LLM */
#define MIMI_SECRET_API_KEY ""
+#define MIMI_SECRET_LLM_API_URL "" /* optional: full URL including scheme/host/port/path */
#define MIMI_SECRET_MODEL ""
#define MIMI_SECRET_MODEL_PROVIDER "anthropic"
@@ -33,5 +34,53 @@
/* Brave Search API */
#define MIMI_SECRET_SEARCH_KEY ""
+
+/* Voice STT / TTS services */
+#define MIMI_SECRET_STT_URL ""
+#define MIMI_SECRET_STT_API_KEY ""
+#define MIMI_SECRET_STT_MODEL ""
+#define MIMI_SECRET_TTS_URL ""
+#define MIMI_SECRET_TTS_API_KEY ""
+#define MIMI_SECRET_TTS_VOICE "Cherry"
+#define MIMI_SECRET_TTS_MODEL ""
+#define MIMI_SECRET_TTS_LANGUAGE "English"
+
+/* ReSpeaker XVF3800 I2S pin map (set per board) */
+#define MIMI_VOICE_I2S_PORT 0
+#define MIMI_VOICE_I2S_BCLK (-1)
+#define MIMI_VOICE_I2S_WS (-1)
+#define MIMI_VOICE_I2S_DIN (-1)
+#define MIMI_VOICE_I2S_DOUT (-1)
+
+/* I2S slot/timing style (set per DAC/codec):
+ * 0: Philips (I2S)
+ * 1: MSB (left-justified)
+ * 2: PCM (short frame sync)
+ */
+/* #define MIMI_VOICE_I2S_STD_SLOT_STYLE 1 */
+
+/* Optional: tune DMA and silence tail to suppress post-playback "thump" noises on some DAC/amps */
+/* #define MIMI_VOICE_I2S_DMA_DESC_NUM 6 */
+/* #define MIMI_VOICE_I2S_DMA_FRAME_NUM 240 */
+/* #define MIMI_VOICE_TX_SILENCE_TAIL_MS 400 */
+
+/* Optional: voice conversation pacing (LLM -> TTS)
+ * Target: <= 20s playback, <= 2 sentences + 1 follow-up question.
+ */
+/* #define MIMI_VOICE_TTS_MAX_SECONDS 20 */
+/* #define MIMI_VOICE_TTS_CHARS_PER_SEC 5 */
+/* #define MIMI_VOICE_LLM_MAX_CHARS 100 */
+/* #define MIMI_VOICE_TTS_MAX_CHARS 110 */
+
+/* Optional: reduce STT false triggers (VAD tuning) */
+/* #define MIMI_VOICE_VAD_START_FRAMES 3 */
+/* #define MIMI_VOICE_VAD_MIN_FRAMES 15 */
+/* #define MIMI_VOICE_STT_COOLDOWN_MS 1200 */
+
+/* Optional: WiFi reliability tuning (may increase power draw) */
+/* #define MIMI_WIFI_DISABLE_POWERSAVE 1 */
+
+/* Optional: move TTS/resample/playback off WiFi core to reduce bcn_timeout under load */
+/* #define MIMI_VOICE_SPEAK_CORE 1 */
/* Tavily Search API */
#define MIMI_SECRET_TAVILY_KEY ""
diff --git a/main/proxy/http_proxy.c b/main/proxy/http_proxy.c
index fdb75541..3745144d 100644
--- a/main/proxy/http_proxy.c
+++ b/main/proxy/http_proxy.c
@@ -104,6 +104,61 @@ bool http_proxy_is_enabled(void)
return s_proxy_host[0] != '\0' && s_proxy_port != 0;
}
+/* ── Raw tunnels (no TLS) ────────────────────────────────────── */
+
+static int open_connect_tunnel(const char *host, int port, int timeout_ms);
+static int open_socks5_tunnel(const char *host, int port, int timeout_ms);
+
+int proxy_tunnel_open(const char *host, int port, int timeout_ms)
+{
+ if (!http_proxy_is_enabled()) {
+ ESP_LOGE(TAG, "proxy_tunnel_open called but no proxy configured");
+ return -1;
+ }
+
+ if (!host || !host[0] || port <= 0 || port > 65535) {
+ ESP_LOGE(TAG, "proxy_tunnel_open invalid target");
+ return -1;
+ }
+
+ if (strcmp(s_proxy_type, "socks5") == 0) {
+ return open_socks5_tunnel(host, port, timeout_ms);
+ }
+ return open_connect_tunnel(host, port, timeout_ms);
+}
+
+int proxy_tunnel_write(int sock, const char *data, int len)
+{
+ if (sock < 0 || !data || len <= 0) return -1;
+
+ int written = 0;
+ while (written < len) {
+ int n = send(sock, data + written, len - written, 0);
+ if (n <= 0) return -1;
+ written += n;
+ }
+ return written;
+}
+
+int proxy_tunnel_read(int sock, char *buf, int len, int timeout_ms)
+{
+ if (sock < 0 || !buf || len <= 0) return -1;
+
+ struct timeval tv = { .tv_sec = timeout_ms / 1000, .tv_usec = (timeout_ms % 1000) * 1000 };
+ setsockopt(sock, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
+
+ int n = recv(sock, buf, len, 0);
+ if (n < 0) return -1;
+ return n;
+}
+
+void proxy_tunnel_close(int sock)
+{
+ if (sock >= 0) {
+ close(sock);
+ }
+}
+
/* ── Proxied TLS connection ───────────────────────────────────── */
struct proxy_conn {
diff --git a/main/proxy/http_proxy.h b/main/proxy/http_proxy.h
index f324700e..382dc742 100644
--- a/main/proxy/http_proxy.h
+++ b/main/proxy/http_proxy.h
@@ -24,6 +24,27 @@ esp_err_t http_proxy_set(const char *host, uint16_t port, const char *type);
*/
esp_err_t http_proxy_clear(void);
+/* ── Proxy tunnels (no TLS) ──────────────────────────────────── */
+
+/**
+ * Open a raw TCP tunnel to target host:port through the configured proxy.
+ *
+ * - If proxy type is "http": uses HTTP CONNECT
+ * - If proxy type is "socks5": uses SOCKS5 CONNECT
+ *
+ * Returns a socket fd on success, or -1 on failure.
+ */
+int proxy_tunnel_open(const char *host, int port, int timeout_ms);
+
+/** Write raw bytes through the tunnel. Returns bytes written or -1. */
+int proxy_tunnel_write(int sock, const char *data, int len);
+
+/** Read raw bytes from the tunnel. Returns bytes read or -1. */
+int proxy_tunnel_read(int sock, char *buf, int len, int timeout_ms);
+
+/** Close the tunnel socket. */
+void proxy_tunnel_close(int sock);
+
/* ── Proxied HTTPS connection ─────────────────────────────────── */
typedef struct proxy_conn proxy_conn_t;
diff --git a/main/tools/tool_cron.c b/main/tools/tool_cron.c
index 048e8902..5670678f 100644
--- a/main/tools/tool_cron.c
+++ b/main/tools/tool_cron.c
@@ -66,16 +66,21 @@ esp_err_t tool_cron_add_execute(const char *input_json, char *output, size_t out
job.delete_after_run = false;
} else if (strcmp(schedule_type, "at") == 0) {
job.kind = CRON_KIND_AT;
- cJSON *at_epoch = cJSON_GetObjectItem(root, "at_epoch");
- if (!at_epoch || !cJSON_IsNumber(at_epoch)) {
- snprintf(output, output_size, "Error: 'at' schedule requires 'at_epoch' (unix timestamp)");
- cJSON_Delete(root);
- return ESP_ERR_INVALID_ARG;
+ time_t now = time(NULL);
+ cJSON *delay_s = cJSON_GetObjectItem(root, "delay_s");
+ if (delay_s && cJSON_IsNumber(delay_s) && delay_s->valuedouble > 0) {
+ job.at_epoch = (int64_t)now + (int64_t)delay_s->valuedouble;
+ } else {
+ cJSON *at_epoch = cJSON_GetObjectItem(root, "at_epoch");
+ if (!at_epoch || !cJSON_IsNumber(at_epoch)) {
+ snprintf(output, output_size, "Error: 'at' schedule requires 'at_epoch' (unix timestamp) or positive 'delay_s'");
+ cJSON_Delete(root);
+ return ESP_ERR_INVALID_ARG;
+ }
+ job.at_epoch = (int64_t)at_epoch->valuedouble;
}
- job.at_epoch = (int64_t)at_epoch->valuedouble;
/* Check if already in the past */
- time_t now = time(NULL);
if (job.at_epoch <= now) {
snprintf(output, output_size, "Error: at_epoch %lld is in the past (now=%lld)",
(long long)job.at_epoch, (long long)now);
diff --git a/main/tools/tool_registry.c b/main/tools/tool_registry.c
index 6c82a3ef..e6251f8a 100644
--- a/main/tools/tool_registry.c
+++ b/main/tools/tool_registry.c
@@ -135,14 +135,15 @@ esp_err_t tool_registry_init(void)
/* Register cron_add */
mimi_tool_t ca = {
.name = "cron_add",
- .description = "Schedule a recurring or one-shot task. The message will trigger an agent turn when the job fires.",
+ .description = "Schedule a recurring or one-shot task. For relative reminders (e.g. 'in 2 minutes'), prefer delay_s to avoid timestamp math. The message will trigger an agent turn when the job fires.",
.input_schema_json =
"{\"type\":\"object\","
"\"properties\":{"
"\"name\":{\"type\":\"string\",\"description\":\"Short name for the job\"},"
"\"schedule_type\":{\"type\":\"string\",\"description\":\"'every' for recurring interval or 'at' for one-shot at a unix timestamp\"},"
"\"interval_s\":{\"type\":\"integer\",\"description\":\"Interval in seconds (required for 'every')\"},"
- "\"at_epoch\":{\"type\":\"integer\",\"description\":\"Unix timestamp to fire at (required for 'at')\"},"
+ "\"at_epoch\":{\"type\":\"integer\",\"description\":\"Unix timestamp to fire at (for 'at'). Prefer delay_s for relative reminders.\"},"
+ "\"delay_s\":{\"type\":\"integer\",\"description\":\"Delay in seconds from now (preferred for 'at' when user says 'in N minutes')\"},"
"\"message\":{\"type\":\"string\",\"description\":\"Message to inject when the job fires, triggering an agent turn\"},"
"\"channel\":{\"type\":\"string\",\"description\":\"Optional reply channel (e.g. 'telegram'). If omitted, current turn channel is used when available\"},"
"\"chat_id\":{\"type\":\"string\",\"description\":\"Optional reply chat_id. Required when channel='telegram'. If omitted during a Telegram turn, current chat_id is used\"}"
diff --git a/main/voice/voice_channel.c b/main/voice/voice_channel.c
new file mode 100644
index 00000000..039c6e26
--- /dev/null
+++ b/main/voice/voice_channel.c
@@ -0,0 +1,1613 @@
+#include "voice/voice_channel.h"
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <stddef.h>
+#include <stdint.h>
+#include <stdbool.h>
+
+#include "mimi_config.h"
+#include "bus/message_bus.h"
+#include "proxy/http_proxy.h"
+
+#include "freertos/FreeRTOS.h"
+#include "freertos/task.h"
+#include "freertos/semphr.h"
+
+#include "esp_log.h"
+#include "esp_err.h"
+#include "esp_http_client.h"
+#include "esp_crt_bundle.h"
+#include "esp_heap_caps.h"
+#include "driver/i2s_std.h"
+#include "driver/i2s_common.h"
+
+#include "cJSON.h"
+#include "mbedtls/base64.h"
+
+static const char *TAG = "voice";
+
+/*
+ * I2S timing / slot style selection:
+ * 0: Philips (I2S, 1-bit delay after WS edge)
+ * 1: MSB (left-justified, no 1-bit delay)
+ * 2: PCM (short frame sync, ws_width=1, ws_pol=true)
+ *
+ * Many DAC/codec parts are sensitive to this. If your audio sounds like loud
+ * static/hiss but speech is partially recognizable, this is a prime suspect.
+ */
+#ifndef MIMI_VOICE_I2S_STD_SLOT_STYLE
+#define MIMI_VOICE_I2S_STD_SLOT_STYLE 0
+#endif
+
+#ifndef MIMI_VOICE_I2S_DMA_DESC_NUM
+#define MIMI_VOICE_I2S_DMA_DESC_NUM 6
+#endif
+
+#ifndef MIMI_VOICE_I2S_DMA_FRAME_NUM
+#define MIMI_VOICE_I2S_DMA_FRAME_NUM 240
+#endif
+
+#ifndef MIMI_VOICE_TX_SILENCE_TAIL_MS
+#define MIMI_VOICE_TX_SILENCE_TAIL_MS 400
+#endif
+
+#define MIMI_VOICE_TX_BYTES_PER_FRAME (2U * sizeof(int32_t))
+#define MIMI_VOICE_TX_DMA_TOTAL_BYTES \
+ ((uint32_t)MIMI_VOICE_I2S_DMA_DESC_NUM * (uint32_t)MIMI_VOICE_I2S_DMA_FRAME_NUM * (uint32_t)MIMI_VOICE_TX_BYTES_PER_FRAME)
+
+#if MIMI_VOICE_I2S_STD_SLOT_STYLE == 1
+#define MIMI_VOICE_I2S_SLOT_DEFAULT_CONFIG(bits, mono_or_stereo) \
+ I2S_STD_MSB_SLOT_DEFAULT_CONFIG(bits, mono_or_stereo)
+#elif MIMI_VOICE_I2S_STD_SLOT_STYLE == 2
+#define MIMI_VOICE_I2S_SLOT_DEFAULT_CONFIG(bits, mono_or_stereo) \
+ I2S_STD_PCM_SLOT_DEFAULT_CONFIG(bits, mono_or_stereo)
+#else
+#define MIMI_VOICE_I2S_SLOT_DEFAULT_CONFIG(bits, mono_or_stereo) \
+ I2S_STD_PHILIPS_SLOT_DEFAULT_CONFIG(bits, mono_or_stereo)
+#endif
+
+static const char *i2s_slot_style_str(void)
+{
+#if MIMI_VOICE_I2S_STD_SLOT_STYLE == 1
+ return "MSB";
+#elif MIMI_VOICE_I2S_STD_SLOT_STYLE == 2
+ return "PCM";
+#else
+ return "PHILIPS";
+#endif
+}
+
+/* =========================
+ * Fallback config defaults
+ * ========================= */
+
+#ifndef MIMI_VOICE_ENABLED_DEFAULT
+#define MIMI_VOICE_ENABLED_DEFAULT 0
+#endif
+
+#ifndef MIMI_VOICE_CHAT_ID
+#define MIMI_VOICE_CHAT_ID "voice_local"
+#endif
+
+#ifndef MIMI_VOICE_SAMPLE_RATE
+#define MIMI_VOICE_SAMPLE_RATE 16000
+#endif
+
+#ifndef MIMI_VOICE_FRAME_MS
+#define MIMI_VOICE_FRAME_MS 20
+#endif
+
+#ifndef MIMI_VOICE_MAX_UTTERANCE_MS
+#define MIMI_VOICE_MAX_UTTERANCE_MS 10000
+#endif
+
+#ifndef MIMI_VOICE_SILENCE_END_MS
+#define MIMI_VOICE_SILENCE_END_MS 600
+#endif
+
+#ifndef MIMI_VOICE_VAD_THRESHOLD
+#define MIMI_VOICE_VAD_THRESHOLD 700
+#endif
+
+#ifndef MIMI_VOICE_CAPTURE_STACK
+#define MIMI_VOICE_CAPTURE_STACK (8 * 1024)
+#endif
+
+#ifndef MIMI_VOICE_TASK_PRIO
+#define MIMI_VOICE_TASK_PRIO 5
+#endif
+
+#ifndef MIMI_VOICE_CORE
+#define MIMI_VOICE_CORE 0
+#endif
+
+#ifndef MIMI_SECRET_STT_URL
+#define MIMI_SECRET_STT_URL ""
+#endif
+
+#ifndef MIMI_SECRET_STT_API_KEY
+#define MIMI_SECRET_STT_API_KEY ""
+#endif
+
+#ifndef MIMI_SECRET_STT_MODEL
+#define MIMI_SECRET_STT_MODEL "qwen3-asr-flash"
+#endif
+
+#ifndef MIMI_SECRET_TTS_URL
+#define MIMI_SECRET_TTS_URL ""
+#endif
+
+#ifndef MIMI_SECRET_TTS_API_KEY
+#define MIMI_SECRET_TTS_API_KEY ""
+#endif
+
+#ifndef MIMI_SECRET_TTS_MODEL
+#define MIMI_SECRET_TTS_MODEL "qwen3-tts-flash"
+#endif
+
+#ifndef MIMI_SECRET_TTS_VOICE
+#define MIMI_SECRET_TTS_VOICE "Cherry"
+#endif
+
+#ifndef MIMI_SECRET_TTS_LANGUAGE
+#define MIMI_SECRET_TTS_LANGUAGE "English"
+#endif
+
+#ifndef MIMI_SECRET_API_KEY
+#define MIMI_SECRET_API_KEY ""
+#endif
+
+/* TTS text constraints (can override in mimi_secrets.h) */
+#ifndef MIMI_VOICE_TTS_MAX_CHARS
+#define MIMI_VOICE_TTS_MAX_CHARS 140
+#endif
+
+#ifndef MIMI_VOICE_I2S_PORT
+#define MIMI_VOICE_I2S_PORT 0
+#endif
+
+#ifndef MIMI_VOICE_I2S_BCLK
+#define MIMI_VOICE_I2S_BCLK 42
+#endif
+
+#ifndef MIMI_VOICE_I2S_WS
+#define MIMI_VOICE_I2S_WS 41
+#endif
+
+#ifndef MIMI_VOICE_I2S_DIN
+#define MIMI_VOICE_I2S_DIN 40
+#endif
+
+#ifndef MIMI_VOICE_I2S_DOUT
+#define MIMI_VOICE_I2S_DOUT 39
+#endif
+
+/* XVF3800 fixed digital format in this design:
+ * 16 kHz, stereo, 32-bit samples over I2S.
+ */
+#define VOICE_I2S_CHANNELS 2
+#define VOICE_I2S_BYTES_PER_SAMPLE 4
+#define VOICE_I2S_BYTES_PER_STEREO_FRAME (VOICE_I2S_CHANNELS * VOICE_I2S_BYTES_PER_SAMPLE)
+#define VOICE_PCM_BITS 16
+
+typedef struct {
+ char *buf;
+ size_t len;
+ size_t cap;
+} http_resp_t;
+
+typedef struct {
+ uint16_t audio_format; /* 1 = PCM */
+ uint16_t channels;
+ uint32_t sample_rate;
+ uint16_t bits_per_sample;
+} wav_fmt_t;
+
+static bool s_enabled = false;
+static bool s_i2s_ready = false;
+static volatile bool s_is_playing = false;
+
+static i2s_chan_handle_t s_tx_chan = NULL;
+static i2s_chan_handle_t s_rx_chan = NULL;
+static TaskHandle_t s_capture_task = NULL;
+static SemaphoreHandle_t s_http_lock = NULL;
+
+/* =========================
+ * Secrets / config helpers
+ * ========================= */
+
+static const char *stt_api_url(void)
+{
+ return (MIMI_SECRET_STT_URL[0] != '\0') ? MIMI_SECRET_STT_URL : "";
+}
+
+static const char *stt_api_key(void)
+{
+ return (MIMI_SECRET_STT_API_KEY[0] != '\0') ? MIMI_SECRET_STT_API_KEY :
+ (MIMI_SECRET_API_KEY[0] != '\0') ? MIMI_SECRET_API_KEY : "";
+}
+
+static const char *stt_model(void)
+{
+ return (MIMI_SECRET_STT_MODEL[0] != '\0') ? MIMI_SECRET_STT_MODEL : "qwen3-asr-flash";
+}
+
+static const char *tts_api_url(void)
+{
+ return (MIMI_SECRET_TTS_URL[0] != '\0') ? MIMI_SECRET_TTS_URL : "";
+}
+
+static const char *tts_api_key(void)
+{
+ return (MIMI_SECRET_TTS_API_KEY[0] != '\0') ? MIMI_SECRET_TTS_API_KEY :
+ (MIMI_SECRET_API_KEY[0] != '\0') ? MIMI_SECRET_API_KEY : "";
+}
+
+static const char *tts_model(void)
+{
+ return (MIMI_SECRET_TTS_MODEL[0] != '\0') ? MIMI_SECRET_TTS_MODEL : "qwen3-tts-flash";
+}
+
+static const char *tts_voice(void)
+{
+ return (MIMI_SECRET_TTS_VOICE[0] != '\0') ? MIMI_SECRET_TTS_VOICE : "Cherry";
+}
+
+static const char *tts_language(void)
+{
+ return (MIMI_SECRET_TTS_LANGUAGE[0] != '\0') ? MIMI_SECRET_TTS_LANGUAGE : "English";
+}
+
+/* =========================
+ * HTTP helpers
+ * ========================= */
+
+static esp_err_t http_event_handler(esp_http_client_event_t *evt)
+{
+ http_resp_t *resp = (http_resp_t *)evt->user_data;
+ if (evt->event_id != HTTP_EVENT_ON_DATA || !resp || !evt->data || evt->data_len <= 0) {
+ return ESP_OK;
+ }
+
+ size_t need = resp->len + (size_t)evt->data_len + 1;
+ if (need > resp->cap) {
+ size_t new_cap = resp->cap ? resp->cap * 2 : 1024;
+ while (new_cap < need) {
+ new_cap *= 2;
+ }
+ char *tmp = realloc(resp->buf, new_cap);
+ if (!tmp) {
+ return ESP_ERR_NO_MEM;
+ }
+ resp->buf = tmp;
+ resp->cap = new_cap;
+ }
+
+ memcpy(resp->buf + resp->len, evt->data, evt->data_len);
+ resp->len += (size_t)evt->data_len;
+ resp->buf[resp->len] = '\0';
+ return ESP_OK;
+}
+
+static esp_err_t http_post_json(const char *url,
+ const char *bearer_key,
+ const char *json_body,
+ bool enable_sse,
+ http_resp_t *resp,
+ int *http_status_out)
+{
+ if (!url || !url[0] || !json_body || !resp) {
+ return ESP_ERR_INVALID_ARG;
+ }
+
+ memset(resp, 0, sizeof(*resp));
+
+ esp_http_client_config_t cfg = {
+ .url = url,
+ .method = HTTP_METHOD_POST,
+ .event_handler = http_event_handler,
+ .user_data = resp,
+ .crt_bundle_attach = esp_crt_bundle_attach,
+ .timeout_ms = 30000,
+ .buffer_size = 2048,
+ .buffer_size_tx = 2048,
+ };
+
+ esp_http_client_handle_t client = esp_http_client_init(&cfg);
+ if (!client) {
+ return ESP_FAIL;
+ }
+
+ esp_http_client_set_header(client, "Content-Type", "application/json");
+ if (bearer_key && bearer_key[0]) {
+ char auth[320];
+ snprintf(auth, sizeof(auth), "Bearer %s", bearer_key);
+ esp_http_client_set_header(client, "Authorization", auth);
+ }
+ esp_http_client_set_header(client, "X-DashScope-SSE", enable_sse ? "enable" : "disable");
+ esp_http_client_set_post_field(client, json_body, (int)strlen(json_body));
+
+ esp_err_t err = esp_http_client_perform(client);
+ if (err == ESP_OK && http_status_out) {
+ *http_status_out = esp_http_client_get_status_code(client);
+ }
+
+ esp_http_client_cleanup(client);
+ return err;
+}
+
+static esp_err_t http_get_binary(const char *url, http_resp_t *resp, int *http_status_out)
+{
+ if (!url || !url[0] || !resp) {
+ return ESP_ERR_INVALID_ARG;
+ }
+
+ memset(resp, 0, sizeof(*resp));
+
+ esp_http_client_config_t cfg = {
+ .url = url,
+ .method = HTTP_METHOD_GET,
+ .event_handler = http_event_handler,
+ .user_data = resp,
+ .crt_bundle_attach = esp_crt_bundle_attach,
+ .timeout_ms = 30000,
+ .buffer_size = 2048,
+ .buffer_size_tx = 1024,
+ };
+
+ esp_http_client_handle_t client = esp_http_client_init(&cfg);
+ if (!client) {
+ return ESP_FAIL;
+ }
+
+ esp_err_t err = esp_http_client_perform(client);
+ if (err == ESP_OK && http_status_out) {
+ *http_status_out = esp_http_client_get_status_code(client);
+ }
+
+ esp_http_client_cleanup(client);
+ return err;
+}
+
+/* =========================
+ * Audio helpers
+ * ========================= */
+
+static void *malloc_prefer_spiram(size_t bytes)
+{
+ if (bytes == 0) {
+ return NULL;
+ }
+
+ void *p = heap_caps_malloc(bytes, MALLOC_CAP_SPIRAM);
+ if (p) {
+ return p;
+ }
+ return malloc(bytes);
+}
+
+static bool utf8_is_continuation_byte(uint8_t b)
+{
+ return (b & 0xC0U) == 0x80U;
+}
+
+static bool utf8_starts_with(const char *s, size_t i, size_t len, const char *lit)
+{
+ size_t lit_len = strlen(lit);
+ if (i + lit_len > len) {
+ return false;
+ }
+ return memcmp(s + i, lit, lit_len) == 0;
+}
+
+static bool is_speech_cut_punct(const char *s, size_t i, size_t len)
+{
+ const uint8_t b = (uint8_t)s[i];
+ if (b == '\n' || b == '\r') {
+ return true;
+ }
+ if (b == '.' || b == '!' || b == '?' || b == ',' || b == ';' || b == ':') {
+ return true;
+ }
+
+ /* Common CJK punctuation in UTF-8 */
+ if (utf8_starts_with(s, i, len, "。") ||
+ utf8_starts_with(s, i, len, "!") ||
+ utf8_starts_with(s, i, len, "?") ||
+ utf8_starts_with(s, i, len, ",") ||
+ utf8_starts_with(s, i, len, ";") ||
+ utf8_starts_with(s, i, len, ":") ||
+ utf8_starts_with(s, i, len, "、")) {
+ return true;
+ }
+
+ return false;
+}
+
+static size_t utf8_truncate_for_tts(const char *text, size_t max_chars, size_t *out_char_count, bool *out_truncated)
+{
+ if (!text || max_chars == 0) {
+ if (out_char_count) *out_char_count = 0;
+ if (out_truncated) *out_truncated = false;
+ return 0;
+ }
+
+ const size_t len = strlen(text);
+ size_t char_count = 0;
+ size_t last_punct_cut = 0;
+ size_t i = 0;
+
+ while (i < len) {
+ if (char_count >= max_chars) {
+ break;
+ }
+
+ if (!utf8_is_continuation_byte((uint8_t)text[i])) {
+ char_count++;
+ if (is_speech_cut_punct(text, i, len)) {
+ /* Cut after this codepoint (best-effort) */
+ size_t j = i + 1;
+ while (j < len && utf8_is_continuation_byte((uint8_t)text[j])) {
+ j++;
+ }
+ last_punct_cut = j;
+ }
+ }
+ i++;
+ }
+
+ if (out_char_count) {
+ *out_char_count = char_count;
+ }
+
+ if (i >= len) {
+ if (out_truncated) *out_truncated = false;
+ return len;
+ }
+
+ /* Prefer cutting at punctuation, but avoid cutting too early */
+ size_t cut = i;
+ if (last_punct_cut > 0) {
+ const size_t min_reasonable = (max_chars >= 20) ? (max_chars / 2) : 0;
+ if (last_punct_cut >= min_reasonable) {
+ cut = last_punct_cut;
+ }
+ }
+
+ while (cut > 0 && utf8_is_continuation_byte((uint8_t)text[cut])) {
+ cut--;
+ }
+
+ if (out_truncated) *out_truncated = true;
+ return cut;
+}
+
+static char *voice_build_tts_text(const char *text)
+{
+ if (!text) {
+ return NULL;
+ }
+
+ size_t char_count = 0;
+ bool truncated = false;
+ size_t cut_bytes = utf8_truncate_for_tts(text, MIMI_VOICE_TTS_MAX_CHARS, &char_count, &truncated);
+
+ if (!truncated) {
+ return NULL; /* caller can use original text */
+ }
+
+ char *out = (char *)malloc(cut_bytes + 1);
+ if (!out) {
+ return NULL;
+ }
+ memcpy(out, text, cut_bytes);
+ out[cut_bytes] = '\0';
+
+ ESP_LOGW(TAG, "TTS text truncated: max=%u chars, cut_bytes=%u", (unsigned)MIMI_VOICE_TTS_MAX_CHARS, (unsigned)cut_bytes);
+ return out;
+}
+
+static int16_t fir5_s16_at_clamped(const int16_t *src, size_t src_samples, size_t idx)
+{
+ if (!src || src_samples == 0) {
+ return 0;
+ }
+
+ size_t i0 = (idx >= 2) ? (idx - 2) : 0;
+ size_t i1 = (idx >= 1) ? (idx - 1) : 0;
+ size_t i2 = idx;
+ if (i2 >= src_samples) i2 = src_samples - 1;
+ size_t i3 = (idx + 1 < src_samples) ? (idx + 1) : (src_samples - 1);
+ size_t i4 = (idx + 2 < src_samples) ? (idx + 2) : (src_samples - 1);
+
+ int32_t acc =
+ (int32_t)src[i0] * 1 +
+ (int32_t)src[i1] * 4 +
+ (int32_t)src[i2] * 6 +
+ (int32_t)src[i3] * 4 +
+ (int32_t)src[i4] * 1;
+
+ acc = acc / 16;
+ if (acc > INT16_MAX) acc = INT16_MAX;
+ if (acc < INT16_MIN) acc = INT16_MIN;
+ return (int16_t)acc;
+}
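+
+/* Why the [1,4,6,4,1]/16 kernel used above: it is the 4th-order binomial
+ * filter (four cascaded two-tap [1,1]/2 averagers), so its magnitude
+ * response is cos^4(pi*f/fs) with unity DC gain. For a 24 kHz source,
+ * content at 8 kHz (the Nyquist of the 16 kHz target) is attenuated by
+ * cos^4(60 deg) = 1/16, roughly -24 dB, which is what suppresses the
+ * resampling aliasing handled further below.
+ */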
+
+static esp_err_t i2s_tx_write_silence_ms(uint32_t ms)
+{
+ if (!s_i2s_ready || !s_tx_chan || ms == 0) {
+ return ESP_OK;
+ }
+
+ uint64_t frames_total = ((uint64_t)MIMI_VOICE_SAMPLE_RATE * (uint64_t)ms) / 1000ULL;
+ while (frames_total > 0) {
+ const size_t frames_chunk = (frames_total > 256) ? 256 : (size_t)frames_total;
+ int32_t zeros[256 * 2] = {0};
+
+ const uint8_t *p = (const uint8_t *)zeros;
+ size_t bytes_total = frames_chunk * 2 * sizeof(int32_t);
+ size_t bytes_sent = 0;
+
+ while (bytes_sent < bytes_total) {
+ size_t written = 0;
+ esp_err_t err = i2s_channel_write(s_tx_chan,
+ p + bytes_sent,
+ bytes_total - bytes_sent,
+ &written,
+ pdMS_TO_TICKS(1000));
+ if (err != ESP_OK) {
+ return err;
+ }
+ if (written == 0) {
+ return ESP_FAIL;
+ }
+ bytes_sent += written;
+ }
+
+ frames_total -= frames_chunk;
+ }
+
+ return ESP_OK;
+}
+
+static esp_err_t i2s_tx_overwrite_dma_with_zeros(void)
+{
+ if (!s_i2s_ready || !s_tx_chan) {
+ return ESP_ERR_INVALID_STATE;
+ }
+
+ uint32_t remaining = MIMI_VOICE_TX_DMA_TOTAL_BYTES;
+ while (remaining > 0) {
+ int32_t zeros[256 * 2] = {0};
+ size_t chunk = sizeof(zeros);
+ if (chunk > remaining) {
+ chunk = remaining;
+ }
+
+ const uint8_t *p = (const uint8_t *)zeros;
+ size_t sent = 0;
+ while (sent < chunk) {
+ size_t written = 0;
+ esp_err_t err = i2s_channel_write(s_tx_chan,
+ p + sent,
+ chunk - sent,
+ &written,
+ pdMS_TO_TICKS(1000));
+ if (err != ESP_OK) {
+ return err;
+ }
+ if (written == 0) {
+ return ESP_FAIL;
+ }
+ sent += written;
+ }
+
+ remaining -= (uint32_t)chunk;
+ }
+
+ return ESP_OK;
+}
+
+static void pcm_s32_stereo_to_s16_mono(const uint8_t *src, size_t src_len, int16_t *dst, size_t *out_samples)
+{
+ size_t frames = src_len / VOICE_I2S_BYTES_PER_STEREO_FRAME;
+ const int32_t *p = (const int32_t *)src;
+
+ for (size_t i = 0; i < frames; i++) {
+ int32_t l = p[i * 2 + 0];
+ int32_t r = p[i * 2 + 1];
+
+ /* The XVF3800, like many I2S MEMS front-ends, delivers valid audio in the high 16 bits of each 32-bit slot. */
+ int16_t ls = (int16_t)(l >> 16);
+ int16_t rs = (int16_t)(r >> 16);
+ int32_t mono = ((int32_t)ls + (int32_t)rs) / 2;
+
+ if (mono > INT16_MAX) mono = INT16_MAX;
+ if (mono < INT16_MIN) mono = INT16_MIN;
+ dst[i] = (int16_t)mono;
+ }
+
+ if (out_samples) {
+ *out_samples = frames;
+ }
+}
+
+static uint32_t pcm_energy_absavg(const int16_t *pcm, size_t samples)
+{
+ if (!pcm || samples == 0) return 0;
+
+ uint64_t sum = 0;
+ for (size_t i = 0; i < samples; i++) {
+ int32_t v = pcm[i];
+ if (v < 0) v = -v;
+ sum += (uint32_t)v;
+ }
+ return (uint32_t)(sum / samples);
+}
+
+static size_t wav_build_from_pcm16(const int16_t *pcm,
+ size_t pcm_bytes,
+ uint32_t sample_rate,
+ uint16_t channels,
+ uint8_t **out_buf)
+{
+ if (!pcm || !out_buf || pcm_bytes == 0) {
+ return 0;
+ }
+
+ const size_t wav_size = 44 + pcm_bytes;
+ uint8_t *buf = (uint8_t *)malloc(wav_size);
+ if (!buf) {
+ return 0;
+ }
+
+ const uint32_t byte_rate = sample_rate * channels * 2;
+ const uint16_t block_align = channels * 2;
+ const uint32_t riff_size = (uint32_t)(wav_size - 8);
+ const uint32_t data_size = (uint32_t)pcm_bytes;
+
+ memcpy(buf + 0, "RIFF", 4);
+ memcpy(buf + 4, &riff_size, 4);
+ memcpy(buf + 8, "WAVE", 4);
+
+ memcpy(buf + 12, "fmt ", 4);
+ uint32_t fmt_size = 16;
+ uint16_t audio_format = 1;
+ uint16_t bits_per_sample = 16;
+ memcpy(buf + 16, &fmt_size, 4);
+ memcpy(buf + 20, &audio_format, 2);
+ memcpy(buf + 22, &channels, 2);
+ memcpy(buf + 24, &sample_rate, 4);
+ memcpy(buf + 28, &byte_rate, 4);
+ memcpy(buf + 32, &block_align, 2);
+ memcpy(buf + 34, &bits_per_sample, 2);
+
+ memcpy(buf + 36, "data", 4);
+ memcpy(buf + 40, &data_size, 4);
+ memcpy(buf + 44, pcm, pcm_bytes);
+
+ *out_buf = buf;
+ return wav_size;
+}
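+
+/* Layout written above: the canonical 44-byte PCM WAV header, all
+ * multi-byte fields little-endian (which matches the ESP32's native byte
+ * order, so the raw memcpy of integer fields is safe here):
+ *
+ *   0  "RIFF"             22 channels (u16)
+ *   4  riff_size (u32)    24 sample_rate (u32)
+ *   8  "WAVE"             28 byte_rate (u32)
+ *   12 "fmt "             32 block_align (u16)
+ *   16 fmt_size = 16      34 bits_per_sample (u16)
+ *   20 audio_format = 1   36 "data", 40 data_size (u32), 44 PCM samples
+ */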
+
+static esp_err_t wav_find_data_chunk(const uint8_t *wav,
+ size_t wav_len,
+ wav_fmt_t *fmt,
+ const uint8_t **data_out,
+ size_t *data_len_out)
+{
+ if (!wav || wav_len < 44 || !fmt || !data_out || !data_len_out) {
+ return ESP_ERR_INVALID_ARG;
+ }
+
+ if (memcmp(wav, "RIFF", 4) != 0 || memcmp(wav + 8, "WAVE", 4) != 0) {
+ return ESP_ERR_INVALID_RESPONSE;
+ }
+
+ memset(fmt, 0, sizeof(*fmt));
+ *data_out = NULL;
+ *data_len_out = 0;
+
+ size_t pos = 12;
+ bool got_fmt = false;
+
+ while (pos + 8 <= wav_len) {
+ const uint8_t *chunk = wav + pos;
+ uint32_t chunk_size = 0;
+ memcpy(&chunk_size, chunk + 4, 4);
+
+ size_t chunk_data_pos = pos + 8;
+ if (chunk_data_pos > wav_len) {
+ break;
+ }
+
+ size_t available = wav_len - chunk_data_pos;
+ size_t declared = (size_t)chunk_size;
+ size_t actual = declared <= available ? declared : available;
+
+ char id[5] = {0};
+ memcpy(id, chunk, 4);
+ ESP_LOGI(TAG, "WAV chunk id=%s declared=%u actual=%u pos=%u",
+ id, (unsigned)chunk_size, (unsigned)actual, (unsigned)pos);
+
+ if (memcmp(chunk, "fmt ", 4) == 0) {
+ if (actual < 16) {
+ ESP_LOGW(TAG, "WAV fmt chunk too short: %u", (unsigned)actual);
+ } else {
+ memcpy(&fmt->audio_format, wav + chunk_data_pos + 0, 2);
+ memcpy(&fmt->channels, wav + chunk_data_pos + 2, 2);
+ memcpy(&fmt->sample_rate, wav + chunk_data_pos + 4, 4);
+ memcpy(&fmt->bits_per_sample, wav + chunk_data_pos + 14, 2);
+ got_fmt = true;
+ }
+ } else if (memcmp(chunk, "data", 4) == 0) {
+ *data_out = wav + chunk_data_pos;
+ *data_len_out = actual;
+ break;
+ }
+
+ size_t step = 8 + actual;
+ if (declared <= available) {
+ step += (declared & 1U);
+ }
+
+ if (step == 0 || pos + step <= pos) {
+ break;
+ }
+ pos += step;
+ }
+
+ if (!*data_out || *data_len_out == 0) {
+ return ESP_ERR_NOT_FOUND;
+ }
+
+ if (!got_fmt) {
+ ESP_LOGW(TAG, "WAV fmt chunk not found, assuming 16-bit mono PCM at the default sample rate");
+ fmt->audio_format = 1;
+ fmt->channels = 1;
+ fmt->sample_rate = MIMI_VOICE_SAMPLE_RATE;
+ fmt->bits_per_sample = 16;
+ }
+
+ if (fmt->audio_format != 1) {
+ ESP_LOGE(TAG, "Unsupported WAV audio_format=%u", (unsigned)fmt->audio_format);
+ return ESP_ERR_NOT_SUPPORTED;
+ }
+
+ if (fmt->bits_per_sample != 16) {
+ ESP_LOGE(TAG, "Unsupported WAV bits_per_sample=%u", (unsigned)fmt->bits_per_sample);
+ return ESP_ERR_NOT_SUPPORTED;
+ }
+
+ return ESP_OK;
+}
+
+/* =========================
+ * STT / TTS JSON helpers
+ * ========================= */
+
+static char *build_data_url_from_wav(const uint8_t *wav, size_t wav_len)
+{
+ if (!wav || wav_len == 0) {
+ return NULL;
+ }
+
+ size_t b64_len = 0;
+ int rc = mbedtls_base64_encode(NULL, 0, &b64_len, wav, wav_len);
+ if (rc != MBEDTLS_ERR_BASE64_BUFFER_TOO_SMALL && rc != 0) {
+ return NULL;
+ }
+
+ const char *prefix = "data:audio/wav;base64,";
+ size_t prefix_len = strlen(prefix);
+ char *out = (char *)malloc(prefix_len + b64_len + 1);
+ if (!out) {
+ return NULL;
+ }
+
+ memcpy(out, prefix, prefix_len);
+
+ size_t actual = 0;
+ rc = mbedtls_base64_encode((unsigned char *)(out + prefix_len),
+ b64_len,
+ &actual,
+ wav,
+ wav_len);
+ if (rc != 0) {
+ free(out);
+ return NULL;
+ }
+
+ out[prefix_len + actual] = '\0';
+ return out;
+}
+
+static esp_err_t parse_stt_response_text(const char *json, char *out_text, size_t out_size)
+{
+ if (!json || !out_text || out_size == 0) {
+ return ESP_ERR_INVALID_ARG;
+ }
+
+ out_text[0] = '\0';
+
+ cJSON *root = cJSON_Parse(json);
+ if (!root) {
+ return ESP_ERR_INVALID_RESPONSE;
+ }
+
+ cJSON *choices = cJSON_GetObjectItem(root, "choices");
+ cJSON *choice0 = (choices && cJSON_IsArray(choices)) ? cJSON_GetArrayItem(choices, 0) : NULL;
+ cJSON *message = choice0 ? cJSON_GetObjectItem(choice0, "message") : NULL;
+ cJSON *content = message ? cJSON_GetObjectItem(message, "content") : NULL;
+
+ if (cJSON_IsString(content) && content->valuestring) {
+ strlcpy(out_text, content->valuestring, out_size);
+ cJSON_Delete(root);
+ return ESP_OK;
+ }
+
+ cJSON_Delete(root);
+ return ESP_ERR_NOT_FOUND;
+}
+
+static esp_err_t parse_tts_audio_url(const char *json, char *out_url, size_t out_size)
+{
+ if (!json || !out_url || out_size == 0) {
+ return ESP_ERR_INVALID_ARG;
+ }
+
+ out_url[0] = '\0';
+
+ cJSON *root = cJSON_Parse(json);
+ if (!root) {
+ return ESP_ERR_INVALID_RESPONSE;
+ }
+
+ cJSON *output = cJSON_GetObjectItem(root, "output");
+ cJSON *audio = output ? cJSON_GetObjectItem(output, "audio") : NULL;
+ cJSON *url = audio ? cJSON_GetObjectItem(audio, "url") : NULL;
+
+ if (cJSON_IsString(url) && url->valuestring && url->valuestring[0]) {
+ strlcpy(out_url, url->valuestring, out_size);
+ cJSON_Delete(root);
+ return ESP_OK;
+ }
+
+ cJSON_Delete(root);
+ return ESP_ERR_NOT_FOUND;
+}
+
+/* =========================
+ * Bus integration
+ * ========================= */
+
+static void push_voice_inbound(const char *text)
+{
+ if (!text || !text[0]) {
+ return;
+ }
+
+ mimi_msg_t msg = {0};
+ strlcpy(msg.channel, MIMI_CHAN_VOICE, sizeof(msg.channel));
+ strlcpy(msg.chat_id, MIMI_VOICE_CHAT_ID, sizeof(msg.chat_id));
+ msg.content = strdup(text);
+
+ if (!msg.content) {
+ ESP_LOGE(TAG, "No memory for voice inbound text");
+ return;
+ }
+
+ if (message_bus_push_inbound(&msg) != ESP_OK) {
+ ESP_LOGW(TAG, "Inbound queue full, drop voice transcript");
+ free(msg.content);
+ }
+}
+
+/* =========================
+ * I2S init / playback
+ * ========================= */
+
+static esp_err_t i2s_init_xvf3800(void)
+{
+ esp_err_t err;
+
+ i2s_chan_config_t chan_cfg = {
+ .id = (i2s_port_t)MIMI_VOICE_I2S_PORT,
+ .role = I2S_ROLE_MASTER,
+ .dma_desc_num = MIMI_VOICE_I2S_DMA_DESC_NUM,
+ .dma_frame_num = MIMI_VOICE_I2S_DMA_FRAME_NUM,
+ .auto_clear_after_cb = false,
+ .auto_clear_before_cb = false,
+ .allow_pd = false,
+ .intr_priority = 0,
+ };
+
+ err = i2s_new_channel(&chan_cfg, &s_tx_chan, &s_rx_chan);
+ if (err != ESP_OK) {
+ ESP_LOGE(TAG, "i2s_new_channel failed: %s", esp_err_to_name(err));
+ return err;
+ }
+
+ i2s_std_config_t rx_cfg = {
+ .clk_cfg = I2S_STD_CLK_DEFAULT_CONFIG(MIMI_VOICE_SAMPLE_RATE),
+ .slot_cfg = I2S_STD_PHILIPS_SLOT_DEFAULT_CONFIG(I2S_DATA_BIT_WIDTH_32BIT, I2S_SLOT_MODE_STEREO),
+ .gpio_cfg = {
+ .mclk = I2S_GPIO_UNUSED,
+ .bclk = MIMI_VOICE_I2S_BCLK,
+ .ws = MIMI_VOICE_I2S_WS,
+ .dout = I2S_GPIO_UNUSED,
+ .din = MIMI_VOICE_I2S_DIN,
+ .invert_flags = {
+ .mclk_inv = false,
+ .bclk_inv = false,
+ .ws_inv = false,
+ },
+ },
+ };
+
+ i2s_std_config_t tx_cfg = {
+ .clk_cfg = I2S_STD_CLK_DEFAULT_CONFIG(MIMI_VOICE_SAMPLE_RATE),
+ .slot_cfg = I2S_STD_MSB_SLOT_DEFAULT_CONFIG(I2S_DATA_BIT_WIDTH_32BIT, I2S_SLOT_MODE_STEREO),
+ .gpio_cfg = {
+ .mclk = I2S_GPIO_UNUSED,
+ .bclk = MIMI_VOICE_I2S_BCLK,
+ .ws = MIMI_VOICE_I2S_WS,
+ .dout = MIMI_VOICE_I2S_DOUT,
+ .din = I2S_GPIO_UNUSED,
+ .invert_flags = {
+ .mclk_inv = false,
+ .bclk_inv = false,
+ .ws_inv = false,
+ },
+ },
+ };
+
+ err = i2s_channel_init_std_mode(s_rx_chan, &rx_cfg);
+ if (err != ESP_OK) {
+ ESP_LOGE(TAG, "i2s rx init failed: %s", esp_err_to_name(err));
+ return err;
+ }
+
+ err = i2s_channel_init_std_mode(s_tx_chan, &tx_cfg);
+ if (err != ESP_OK) {
+ ESP_LOGE(TAG, "i2s tx init failed: %s", esp_err_to_name(err));
+ return err;
+ }
+
+ /* Seed the TX DMA buffers with silence before enabling the channel;
+ * otherwise some DAC/amp combinations emit thumping or static noises caused
+ * by undefined initial DMA content, or by the last buffer repeating while
+ * the line is idle.
+ *
+ * Preloading is only allowed before the channel is enabled.
+ */
+ {
+ int32_t zeros[128 * 2] = {0};
+ size_t loaded = 0;
+ (void)i2s_channel_preload_data(s_tx_chan, zeros, sizeof(zeros), &loaded);
+ }
+
+ err = i2s_channel_enable(s_rx_chan);
+ if (err != ESP_OK) {
+ ESP_LOGE(TAG, "i2s rx enable failed: %s", esp_err_to_name(err));
+ return err;
+ }
+
+ err = i2s_channel_enable(s_tx_chan);
+ if (err != ESP_OK) {
+ ESP_LOGE(TAG, "i2s tx enable failed: %s", esp_err_to_name(err));
+ return err;
+ }
+
+ s_i2s_ready = true;
+ ESP_LOGI(TAG, "I2S ready: %dHz stereo s32 in / stereo s32 out (%s timing)",
+ MIMI_VOICE_SAMPLE_RATE,
+ i2s_slot_style_str());
+ return ESP_OK;
+}
+
+static int16_t *resample_s16_mono_linear(const int16_t *src,
+ size_t src_samples,
+ uint32_t src_rate,
+ uint32_t dst_rate,
+ size_t *out_samples)
+{
+ if (!src || src_samples == 0 || !out_samples || src_rate == 0 || dst_rate == 0) {
+ return NULL;
+ }
+
+ if (src_rate == dst_rate) {
+ int16_t *copy = (int16_t *)malloc_prefer_spiram(src_samples * sizeof(int16_t));
+ if (!copy) {
+ return NULL;
+ }
+ memcpy(copy, src, src_samples * sizeof(int16_t));
+ *out_samples = src_samples;
+ return copy;
+ }
+
+ const bool is_downsampling = src_rate > dst_rate;
+
+ size_t dst_samples = (size_t)(((uint64_t)src_samples * dst_rate) / src_rate);
+ if (dst_samples == 0) {
+ return NULL;
+ }
+
+ int16_t *dst = (int16_t *)malloc_prefer_spiram(dst_samples * sizeof(int16_t));
+ if (!dst) {
+ return NULL;
+ }
+
+ /* When downsampling (e.g. 24k -> 16k), naive linear interpolation folds
+ * high-frequency content above the new Nyquist back into the audible band
+ * (aliasing), often heard as hiss on sibilants and background noise.
+ *
+ * Apply a tiny 5-tap low-pass FIR [1,4,6,4,1]/16 on the source indices we
+ * touch. This is cheap and noticeably improves subjective quality without
+ * pulling in DSP dependencies.
+ */
+ for (size_t i = 0; i < dst_samples; i++) {
+ float src_pos = ((float)i * (float)src_rate) / (float)dst_rate;
+ size_t idx = (size_t)src_pos;
+ float frac = src_pos - (float)idx;
+
+ if (idx >= src_samples - 1) {
+ dst[i] = src[src_samples - 1];
+ } else {
+ float a = (float)(is_downsampling ? fir5_s16_at_clamped(src, src_samples, idx) : src[idx]);
+ float b = (float)(is_downsampling ? fir5_s16_at_clamped(src, src_samples, idx + 1) : src[idx + 1]);
+ float v = a + (b - a) * frac;
+
+ if (v > 32767.0f) v = 32767.0f;
+ if (v < -32768.0f) v = -32768.0f;
+ dst[i] = (int16_t)v;
+ }
+ }
+
+ *out_samples = dst_samples;
+ return dst;
+}
+
+static esp_err_t i2s_play_wav_pcm16(const uint8_t *wav, size_t wav_len)
+{
+ if (!s_i2s_ready || !s_tx_chan || !wav || wav_len == 0) {
+ return ESP_ERR_INVALID_STATE;
+ }
+
+ wav_fmt_t fmt;
+ const uint8_t *pcm = NULL;
+ size_t pcm_len = 0;
+
+ esp_err_t err = wav_find_data_chunk(wav, wav_len, &fmt, &pcm, &pcm_len);
+ if (err != ESP_OK) {
+ ESP_LOGE(TAG, "wav_find_data_chunk failed: %s", esp_err_to_name(err));
+ return err;
+ }
+
+ ESP_LOGI(TAG, "WAV fmt: format=%u channels=%u sample_rate=%u bits=%u data_len=%u",
+ (unsigned)fmt.audio_format,
+ (unsigned)fmt.channels,
+ (unsigned)fmt.sample_rate,
+ (unsigned)fmt.bits_per_sample,
+ (unsigned)pcm_len);
+
+ if (fmt.audio_format != 1 || fmt.bits_per_sample != 16) {
+ return ESP_ERR_NOT_SUPPORTED;
+ }
+
+ const int16_t *src16 = (const int16_t *)pcm;
+ size_t src_samples_total = pcm_len / sizeof(int16_t);
+
+ const int16_t *mono_src = NULL;
+ int16_t *mono_owned = NULL;
+ size_t mono_samples = 0;
+
+ if (fmt.channels == 1) {
+ mono_src = src16;
+ mono_samples = src_samples_total;
+ } else if (fmt.channels == 2) {
+ mono_samples = src_samples_total / 2;
+ mono_owned = (int16_t *)malloc_prefer_spiram(mono_samples * sizeof(int16_t));
+ if (!mono_owned) {
+ return ESP_ERR_NO_MEM;
+ }
+
+ for (size_t i = 0, j = 0; j < mono_samples; i += 2, j++) {
+ int32_t v = ((int32_t)src16[i] + (int32_t)src16[i + 1]) / 2;
+ mono_owned[j] = (int16_t)v;
+ }
+ mono_src = mono_owned;
+ } else {
+ return ESP_ERR_NOT_SUPPORTED;
+ }
+
+ const int16_t *play_src = NULL;
+ int16_t *play_owned = NULL;
+ size_t play_samples = 0;
+
+ if (fmt.sample_rate == MIMI_VOICE_SAMPLE_RATE) {
+ play_src = mono_src;
+ play_samples = mono_samples;
+ } else {
+ play_owned = resample_s16_mono_linear(
+ mono_src,
+ mono_samples,
+ fmt.sample_rate,
+ MIMI_VOICE_SAMPLE_RATE,
+ &play_samples
+ );
+ free(mono_owned);
+ mono_owned = NULL;
+
+ if (!play_owned || play_samples == 0) {
+ return ESP_ERR_NO_MEM;
+ }
+ play_src = play_owned;
+ }
+
+ ESP_LOGI(TAG, "Playback PCM: %u samples @ %u Hz (~%u ms)",
+ (unsigned)play_samples,
+ (unsigned)MIMI_VOICE_SAMPLE_RATE,
+ (unsigned)((play_samples * 1000ULL) / MIMI_VOICE_SAMPLE_RATE));
+
+ s_is_playing = true;
+
+ size_t frames_total = play_samples;
+ size_t frames_sent = 0;
+
+ while (frames_sent < frames_total) {
+ const size_t frames_chunk = (frames_total - frames_sent > 256) ? 256 : (frames_total - frames_sent);
+
+ int32_t tx_buf[256 * 2];
+ for (size_t i = 0; i < frames_chunk; i++) {
+ int16_t s16 = play_src[frames_sent + i];
+ int32_t s32 = ((int32_t)s16) << 16;
+ tx_buf[i * 2 + 0] = s32;
+ tx_buf[i * 2 + 1] = s32;
+ }
+
+ const uint8_t *p = (const uint8_t *)tx_buf;
+ size_t bytes_total = frames_chunk * 2 * sizeof(int32_t);
+ size_t bytes_sent = 0;
+
+ while (bytes_sent < bytes_total) {
+ size_t written = 0;
+ err = i2s_channel_write(s_tx_chan,
+ p + bytes_sent,
+ bytes_total - bytes_sent,
+ &written,
+ pdMS_TO_TICKS(1000));
+ if (err != ESP_OK) {
+ ESP_LOGE(TAG, "i2s write failed: %s", esp_err_to_name(err));
+ free(play_owned);
+ free(mono_owned);
+ s_is_playing = false;
+ return err;
+ }
+ if (written == 0) {
+ ESP_LOGE(TAG, "i2s write returned 0 bytes");
+ free(play_owned);
+ free(mono_owned);
+ s_is_playing = false;
+ return ESP_FAIL;
+ }
+ bytes_sent += written;
+ }
+
+ frames_sent += frames_chunk;
+ }
+
+ /* Leave a short silence tail so the TX engine doesn't keep repeating the
+ * last non-zero DMA buffer (often heard as continuous thumping while idle).
+ */
+ (void)i2s_tx_write_silence_ms(MIMI_VOICE_TX_SILENCE_TAIL_MS);
+ (void)i2s_tx_overwrite_dma_with_zeros();
+
+ free(play_owned);
+ free(mono_owned);
+ s_is_playing = false;
+ return ESP_OK;
+}
+
+/* =========================
+ * STT / TTS core
+ * ========================= */
+
+static esp_err_t stt_transcribe_pcm(const int16_t *pcm,
+ size_t pcm_bytes,
+ char *out_text,
+ size_t out_text_size)
+{
+ if (!pcm || pcm_bytes == 0 || !out_text || out_text_size == 0) {
+ return ESP_ERR_INVALID_ARG;
+ }
+ if (!stt_api_url()[0] || !stt_api_key()[0]) {
+ return ESP_ERR_INVALID_STATE;
+ }
+
+ out_text[0] = '\0';
+
+ uint8_t *wav = NULL;
+ size_t wav_len = wav_build_from_pcm16(pcm, pcm_bytes, MIMI_VOICE_SAMPLE_RATE, 1, &wav);
+ if (!wav || wav_len == 0) {
+ return ESP_ERR_NO_MEM;
+ }
+
+ char *data_url = build_data_url_from_wav(wav, wav_len);
+ free(wav);
+ if (!data_url) {
+ return ESP_ERR_NO_MEM;
+ }
+
+ cJSON *root = cJSON_CreateObject();
+ cJSON_AddStringToObject(root, "model", stt_model());
+ cJSON_AddBoolToObject(root, "stream", false);
+
+ cJSON *messages = cJSON_CreateArray();
+ cJSON *msg = cJSON_CreateObject();
+ cJSON_AddStringToObject(msg, "role", "user");
+
+ cJSON *content = cJSON_CreateArray();
+ cJSON *audio_item = cJSON_CreateObject();
+ cJSON_AddStringToObject(audio_item, "type", "input_audio");
+
+ cJSON *input_audio = cJSON_CreateObject();
+ cJSON_AddStringToObject(input_audio, "data", data_url);
+ cJSON_AddItemToObject(audio_item, "input_audio", input_audio);
+ cJSON_AddItemToArray(content, audio_item);
+
+ cJSON_AddItemToObject(msg, "content", content);
+ cJSON_AddItemToArray(messages, msg);
+ cJSON_AddItemToObject(root, "messages", messages);
+
+ cJSON *asr_options = cJSON_CreateObject();
+ cJSON_AddBoolToObject(asr_options, "enable_itn", false);
+ cJSON_AddItemToObject(root, "asr_options", asr_options);
+
+ char *body = cJSON_PrintUnformatted(root);
+ cJSON_Delete(root);
+ free(data_url);
+
+ if (!body) {
+ return ESP_ERR_NO_MEM;
+ }
+
+ http_resp_t resp = {0};
+ int http_status = 0;
+ esp_err_t err = http_post_json(stt_api_url(), stt_api_key(), body, false, &resp, &http_status);
+ free(body);
+
+ if (err != ESP_OK) {
+ ESP_LOGE(TAG, "STT HTTP failed: %s", esp_err_to_name(err));
+ free(resp.buf);
+ return err;
+ }
+ if (http_status < 200 || http_status >= 300) {
+ ESP_LOGE(TAG, "STT HTTP status=%d body=%s", http_status, resp.buf ? resp.buf : "");
+ free(resp.buf);
+ return ESP_FAIL;
+ }
+
+ err = parse_stt_response_text(resp.buf ? resp.buf : "", out_text, out_text_size);
+ if (err != ESP_OK) {
+ ESP_LOGE(TAG, "STT parse failed, body=%s", resp.buf ? resp.buf : "");
+ } else {
+ ESP_LOGI(TAG, "STT transcript: %s", out_text);
+ }
+
+ free(resp.buf);
+ return err;
+}
+
+static esp_err_t tts_stream_play(const char *text)
+{
+ if (!text || !text[0]) {
+ return ESP_ERR_INVALID_ARG;
+ }
+ if (!tts_api_url()[0] || !tts_api_key()[0]) {
+ return ESP_ERR_INVALID_STATE;
+ }
+
+ cJSON *body = cJSON_CreateObject();
+ cJSON_AddStringToObject(body, "model", tts_model());
+
+ cJSON *input = cJSON_CreateObject();
+ cJSON_AddStringToObject(input, "text", text);
+ cJSON_AddStringToObject(input, "voice", tts_voice());
+ cJSON_AddStringToObject(input, "language_type", tts_language());
+ cJSON_AddItemToObject(body, "input", input);
+
+ char *json = cJSON_PrintUnformatted(body);
+ cJSON_Delete(body);
+
+ if (!json) {
+ return ESP_ERR_NO_MEM;
+ }
+
+ http_resp_t resp = {0};
+ int http_status = 0;
+ esp_err_t err = http_post_json(tts_api_url(), tts_api_key(), json, false, &resp, &http_status);
+ free(json);
+
+ if (err != ESP_OK) {
+ ESP_LOGE(TAG, "TTS HTTP failed: %s", esp_err_to_name(err));
+ free(resp.buf);
+ return err;
+ }
+ if (http_status < 200 || http_status >= 300) {
+ ESP_LOGE(TAG, "TTS HTTP status=%d body=%s", http_status, resp.buf ? resp.buf : "");
+ free(resp.buf);
+ return ESP_FAIL;
+ }
+
+ char wav_url[1024] = {0};
+ err = parse_tts_audio_url(resp.buf ? resp.buf : "", wav_url, sizeof(wav_url));
+ if (err != ESP_OK) {
+ ESP_LOGE(TAG, "TTS parse failed, body=%s", resp.buf ? resp.buf : "");
+ free(resp.buf);
+ return err;
+ }
+ free(resp.buf);
+
+ ESP_LOGI(TAG, "TTS audio url: %s", wav_url);
+
+ http_resp_t wav_resp = {0};
+ http_status = 0;
+ err = http_get_binary(wav_url, &wav_resp, &http_status);
+ if (err != ESP_OK) {
+ ESP_LOGE(TAG, "TTS wav download failed: %s", esp_err_to_name(err));
+ free(wav_resp.buf);
+ return err;
+ }
+ if (http_status < 200 || http_status >= 300) {
+ ESP_LOGE(TAG, "TTS wav status=%d", http_status);
+ free(wav_resp.buf);
+ return ESP_FAIL;
+ }
+ ESP_LOGI(TAG, "TTS wav http_status=%d len=%d", http_status, (int)wav_resp.len);
+
+ if (wav_resp.len >= 12) {
+ ESP_LOGI(TAG, "TTS wav magic: %.4s / %.4s",
+ wav_resp.buf,
+ wav_resp.buf + 8);
+ }
+
+ if (wav_resp.len >= 4 && memcmp(wav_resp.buf, "RIFF", 4) != 0) {
+ ESP_LOGE(TAG, "TTS response is not WAV, preview: %.120s", wav_resp.buf);
+ }
+
+ err = i2s_play_wav_pcm16((const uint8_t *)wav_resp.buf, wav_resp.len);
+ free(wav_resp.buf);
+ return err;
+}
+
+/* =========================
+ * Voice capture loop
+ * ========================= */
+
+static void voice_capture_task(void *arg)
+{
+ (void)arg;
+
+ const size_t frame_samples = (MIMI_VOICE_SAMPLE_RATE * MIMI_VOICE_FRAME_MS) / 1000;
+ const size_t stereo_frame_bytes = frame_samples * VOICE_I2S_BYTES_PER_STEREO_FRAME;
+ const size_t mono16_frame_bytes = frame_samples * sizeof(int16_t);
+ const size_t max_frames = MIMI_VOICE_MAX_UTTERANCE_MS / MIMI_VOICE_FRAME_MS;
+ const size_t silence_frames_end = MIMI_VOICE_SILENCE_END_MS / MIMI_VOICE_FRAME_MS;
+
+ uint8_t *rx_buf = (uint8_t *)heap_caps_malloc(stereo_frame_bytes, MALLOC_CAP_SPIRAM);
+ int16_t *mono_frame = (int16_t *)heap_caps_malloc(mono16_frame_bytes, MALLOC_CAP_SPIRAM);
+ int16_t *utterance = (int16_t *)heap_caps_malloc(max_frames * frame_samples * sizeof(int16_t), MALLOC_CAP_SPIRAM);
+
+ if (!rx_buf || !mono_frame || !utterance) {
+ ESP_LOGE(TAG, "voice_capture_task alloc failed");
+ free(rx_buf);
+ free(mono_frame);
+ free(utterance);
+ vTaskDelete(NULL);
+ return;
+ }
+
+ bool in_speech = false;
+ size_t total_frames = 0;
+ size_t silence_frames = 0;
+ size_t start_frames = 0;
+ TickType_t cooldown_until = 0;
+
+ /* Simple adaptive noise floor */
+ uint32_t noise_floor = MIMI_VOICE_VAD_THRESHOLD / 2;
+ if (noise_floor < 100) noise_floor = 100;
+
+ while (1) {
+ if (!s_i2s_ready || !s_rx_chan) {
+ vTaskDelay(pdMS_TO_TICKS(100));
+ continue;
+ }
+
+ if (s_is_playing) {
+ /* Avoid self-trigger during playback */
+ vTaskDelay(pdMS_TO_TICKS(MIMI_VOICE_FRAME_MS));
+ continue;
+ }
+
+ TickType_t now = xTaskGetTickCount();
+ if (cooldown_until != 0 && now < cooldown_until) {
+ vTaskDelay(cooldown_until - now);
+ continue;
+ }
+
+ size_t bytes_read = 0;
+ esp_err_t err = i2s_channel_read(s_rx_chan,
+ rx_buf,
+ stereo_frame_bytes,
+ &bytes_read,
+ pdMS_TO_TICKS(1000));
+ if (err != ESP_OK || bytes_read == 0) {
+ continue;
+ }
+
+ size_t mono_samples = 0;
+ pcm_s32_stereo_to_s16_mono(rx_buf, bytes_read, mono_frame, &mono_samples);
+ if (mono_samples == 0) {
+ continue;
+ }
+
+ uint32_t energy = pcm_energy_absavg(mono_frame, mono_samples);
+
+ /* Update noise floor only when not in speech */
+ if (!in_speech) {
+ noise_floor = (noise_floor * 15 + energy) / 16;
+ }
+
+ uint32_t dynamic_threshold = noise_floor + MIMI_VOICE_VAD_THRESHOLD;
+ bool speech_now = (energy > dynamic_threshold);
+
+ if (!in_speech) {
+ if (!speech_now) {
+ start_frames = 0;
+ continue;
+ }
+ start_frames++;
+ if (start_frames < MIMI_VOICE_VAD_START_FRAMES) {
+ continue;
+ }
+ in_speech = true;
+ total_frames = 0;
+ silence_frames = 0;
+ start_frames = 0;
+ }
+
+ if (total_frames < max_frames) {
+ memcpy(&utterance[total_frames * frame_samples], mono_frame, mono16_frame_bytes);
+ total_frames++;
+ }
+
+ if (speech_now) {
+ silence_frames = 0;
+ } else {
+ silence_frames++;
+ }
+
+ bool end_by_silence = (silence_frames >= silence_frames_end);
+ bool end_by_limit = (total_frames >= max_frames);
+
+ if (!end_by_silence && !end_by_limit) {
+ continue;
+ }
+
+ in_speech = false;
+
+ /* Ignore ultra-short bursts */
+ if (total_frames < MIMI_VOICE_VAD_MIN_FRAMES) {
+ total_frames = 0;
+ silence_frames = 0;
+ cooldown_until = xTaskGetTickCount() + pdMS_TO_TICKS(MIMI_VOICE_STT_COOLDOWN_MS);
+ continue;
+ }
+
+ size_t pcm_bytes = total_frames * frame_samples * sizeof(int16_t);
+ char text[512] = {0};
+
+ if (xSemaphoreTake(s_http_lock, pdMS_TO_TICKS(30000)) == pdTRUE) {
+ esp_err_t stt_err = stt_transcribe_pcm(utterance, pcm_bytes, text, sizeof(text));
+ xSemaphoreGive(s_http_lock);
+
+ if (stt_err == ESP_OK && text[0]) {
+ ESP_LOGI(TAG, "Voice STT: %s", text);
+ push_voice_inbound(text);
+ } else {
+ ESP_LOGW(TAG, "STT failed or empty transcript");
+ }
+ }
+
+ total_frames = 0;
+ silence_frames = 0;
+ cooldown_until = xTaskGetTickCount() + pdMS_TO_TICKS(MIMI_VOICE_STT_COOLDOWN_MS);
+ }
+}
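+
+/* Tuning note for the capture loop above: noise_floor is an exponential
+ * moving average with alpha = 1/16, i.e. about a 16-frame time constant
+ * (e.g. ~320 ms of ambient history at 20 ms frames), so brief pauses do not
+ * drag the floor down while slow changes in room noise still track within a
+ * second or two. The trigger threshold is floor-relative
+ * (noise_floor + MIMI_VOICE_VAD_THRESHOLD), which keeps sensitivity roughly
+ * stable across quiet and noisy rooms.
+ */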
+
+/* =========================
+ * Public API
+ * ========================= */
+
+esp_err_t voice_channel_init(void)
+{
+ s_enabled = (MIMI_VOICE_ENABLED_DEFAULT != 0) ||
+ (stt_api_key()[0] && tts_api_key()[0]);
+
+ if (!s_enabled) {
+ ESP_LOGI(TAG, "Voice channel disabled (set STT/TTS API key or enable default)");
+ return ESP_OK;
+ }
+
+ esp_err_t err = i2s_init_xvf3800();
+ if (err != ESP_OK) {
+ ESP_LOGW(TAG, "Voice channel disabled: I2S init failed (%s)", esp_err_to_name(err));
+ s_enabled = false;
+ return ESP_OK; /* degrade gracefully; the rest of the firmware still runs */
+ }
+
+ s_http_lock = xSemaphoreCreateMutex();
+ if (!s_http_lock) {
+ ESP_LOGE(TAG, "Voice init failed: cannot allocate mutex");
+ s_enabled = false;
+ return ESP_ERR_NO_MEM;
+ }
+
+ return ESP_OK;
+}
+
+esp_err_t voice_channel_start(void)
+{
+ if (!s_enabled || !s_i2s_ready) {
+ return ESP_OK;
+ }
+
+ if (!s_capture_task) {
+ if (xTaskCreatePinnedToCore(voice_capture_task,
+ "voice_cap",
+ MIMI_VOICE_CAPTURE_STACK,
+ NULL,
+ MIMI_VOICE_TASK_PRIO,
+ &s_capture_task,
+ MIMI_VOICE_CORE) != pdPASS) {
+ return ESP_FAIL;
+ }
+ }
+
+ ESP_LOGI(TAG, "Voice channel started");
+ return ESP_OK;
+}
+
+esp_err_t voice_channel_speak_text(const char *text)
+{
+ if (!s_enabled || !s_i2s_ready || !text || text[0] == '\0') {
+ return ESP_ERR_INVALID_STATE;
+ }
+
+ if (xSemaphoreTake(s_http_lock, pdMS_TO_TICKS(30000)) != pdTRUE) {
+ return ESP_ERR_TIMEOUT;
+ }
+
+ char *tts_text = voice_build_tts_text(text);
+ esp_err_t err = tts_stream_play(tts_text ? tts_text : text);
+ free(tts_text);
+
+ xSemaphoreGive(s_http_lock);
+ return err;
+}
+
+bool voice_channel_is_enabled(void)
+{
+ return s_enabled;
+}
+
+void voice_channel_get_status(voice_channel_status_t *status)
+{
+ if (!status) {
+ return;
+ }
+
+ status->enabled = s_enabled;
+ status->i2s_ready = s_i2s_ready;
+ status->is_playing = s_is_playing;
+ status->stt_configured = (stt_api_url()[0] != '\0' && stt_api_key()[0] != '\0');
+ status->tts_configured = (tts_api_url()[0] != '\0' && tts_api_key()[0] != '\0');
+}
diff --git a/main/voice/voice_channel.h b/main/voice/voice_channel.h
new file mode 100644
index 00000000..ffcc2504
--- /dev/null
+++ b/main/voice/voice_channel.h
@@ -0,0 +1,32 @@
+#pragma once
+
+#include <stdbool.h>
+#include "esp_err.h"
+
+typedef struct {
+ bool enabled;
+ bool i2s_ready;
+ bool is_playing;
+ bool stt_configured;
+ bool tts_configured;
+} voice_channel_status_t;
+
+/*
+ * Voice channel for ReSpeaker XVF3800 over I2S.
+ *
+ * Inbound path:
+ * Mic PCM -> VAD utterance -> STT -> message_bus inbound (channel=voice)
+ *
+ * Outbound path:
+ * Agent text (channel=voice) -> TTS -> speaker playback (I2S)
+ */
+esp_err_t voice_channel_init(void);
+esp_err_t voice_channel_start(void);
+
+/*
+ * Convert text to speech and enqueue for playback.
+ */
+esp_err_t voice_channel_speak_text(const char *text);
+
+bool voice_channel_is_enabled(void);
+void voice_channel_get_status(voice_channel_status_t *status);
diff --git a/partitions.csv b/partitions.csv
index 24c87784..017cf429 100644
--- a/partitions.csv
+++ b/partitions.csv
@@ -4,5 +4,5 @@ otadata, data, ota, 0xF000, 0x2000
phy_init, data, phy, 0x11000, 0x1000
ota_0, app, ota_0, 0x20000, 0x200000
ota_1, app, ota_1, 0x220000, 0x200000
-spiffs, data, spiffs, 0x420000, 0xBD0000
-coredump, data, coredump,0xFF0000, 0x10000
+spiffs, data, spiffs, 0x420000, 0x3D0000
+coredump, data, coredump,0x7F0000, 0x10000
diff --git a/sdkconfig.defaults.esp32s3 b/sdkconfig.defaults.esp32s3
index 4774cd93..eed91926 100644
--- a/sdkconfig.defaults.esp32s3
+++ b/sdkconfig.defaults.esp32s3
@@ -2,7 +2,7 @@
CONFIG_IDF_TARGET="esp32s3"
-# Flash 16MB + QIO
-CONFIG_ESPTOOLPY_FLASHSIZE_16MB=y
+# Flash 8MB + QIO
+CONFIG_ESPTOOLPY_FLASHSIZE_8MB=y
CONFIG_ESPTOOLPY_FLASHMODE_QIO=y
# CPU 240MHz