diff --git a/README.md b/README.md
index b733fbf8..241477b5 100644
--- a/README.md
+++ b/README.md
@@ -1,318 +1,265 @@
-# MimiClaw: Pocket AI Assistant on a $5 Chip
+# reSpeaker-claw: Voice AI Agent for ReSpeaker XVF3800
-
-
-
-
-
-
-
-
+
+
+
+
+
English | 中文 | 日本語
-**The world's first AI assistant(OpenClaw) on a $5 chip. No Linux. No Node.js. Just pure C**
-
-MimiClaw turns a tiny ESP32-S3 board into a personal AI assistant. Plug it into USB power, connect to WiFi, and talk to it through Telegram — it handles any task you throw at it and evolves over time with local memory — all on a chip the size of a thumb.
+reSpeaker-claw turns a ReSpeaker XVF3800–based device into a voice-first AI agent. It captures audio over I2S, performs local VAD, sends utterances to STT, and processes them through an embedded agent loop. The system combines real-time speech interaction with local memory, tool calling, scheduling, heartbeat processes, OTA updates, and proxy support, and returns responses via TTS through the speaker.
-## Meet MimiClaw
+## Meet reSpeaker-claw
- **Tiny** — No Linux, no Node.js, no bloat — just pure C
-- **Handy** — Message it from Telegram, it handles the rest
- **Loyal** — Learns from memory, remembers across reboots
-- **Energetic** — USB power, 0.5 W, runs 24/7
-- **Lovable** — One ESP32-S3 board, $5, nothing else
+- **Energetic** — USB power, low power draw, runs 24/7
+- **Freedom** — ReSpeaker XVF3800's mic array + your choice of speaker amp/DAC
+- **Handy** — Built-in voice channel, no extra hardware needed beyond the XVF3800 and a speaker path
-## How It Works
+## Highlights
-
+- Voice input: ReSpeaker XVF3800 microphone array over I2S
+- Voice output: TTS audio download, WAV decode, resample, and speaker playback over I2S
+- Multi-channel agent: voice, Telegram, Feishu, WebSocket
+- Local persistence: SPIFFS stores memory, profiles, sessions, cron jobs, and daily notes
+- Compatible LLM backends: official Anthropic/OpenAI APIs or third-party gateways that expose Anthropic-compatible or OpenAI-compatible endpoints
+- Configurable STT/TTS: plug in your own service URL, API key, model, voice, and language
+- Runtime overrides: change WiFi, provider, model, API base, proxy, and tokens from the serial CLI without editing code
-You send a message on Telegram. The ESP32-S3 picks it up over WiFi, feeds it into an agent loop — the LLM thinks, calls tools, reads memory — and sends the reply back. Supports both **Anthropic (Claude)** and **OpenAI (GPT)** as providers, switchable at runtime. Everything runs on a single $5 chip with all your data stored locally on flash.
## Quick Start
-### What You Need
+### Requirements
-- An **ESP32-S3 dev board** with 16 MB flash and 8 MB PSRAM (e.g. Xiaozhi AI board, ~$10)
-- A **USB Type-C cable**
-- A **Telegram bot token** — talk to [@BotFather](https://t.me/BotFather) on Telegram to create one
-- An **Anthropic API key** — from [console.anthropic.com](https://console.anthropic.com), or an **OpenAI API key** — from [platform.openai.com](https://platform.openai.com)
+- A ReSpeaker XVF3800 USB 4-Mic Array with a XIAO ESP32S3 board
+- A speaker / DAC / amp path on I2S output
+- A USB cable for flashing and serial monitoring
+- WiFi access
+- ESP-IDF v5.5+
+- Optional: Telegram bot token if you want Telegram
+- Optional: Feishu app credentials if you want Feishu
+- One LLM API key for an Anthropic-compatible or OpenAI-compatible endpoint
+- One STT service and one TTS service for voice mode
-### Install
+### Clone and Build Environment
-```bash
-# You need ESP-IDF v5.5+ installed first:
-# https://docs.espressif.com/projects/esp-idf/en/v5.5.2/esp32s3/get-started/
+Refer to the official guide to flash the I2S firmware:
+[SeeedStudio wiki](https://wiki.seeedstudio.com/respeaker_xvf3800_introduction/#flash-firmware)
+
+Then clone this project and set the target:
-git clone https://github.com/memovai/mimiclaw.git
-cd mimiclaw
+```bash
+git clone https://github.com/Seeed-Projects/reSpeaker-claw
+cd reSpeaker-claw
idf.py set-target esp32s3
```
-
-Ubuntu Install
+ESP-IDF v5.5+ must be installed before running `idf.py`: [ESP-IDF Install](https://docs.espressif.com/projects/esp-idf/en/v5.5.3/esp32s3/get-started/)
-Recommended baseline:
-
-- Ubuntu 22.04/24.04
-- Python >= 3.10
-- CMake >= 3.16
-- Ninja >= 1.10
-- Git >= 2.34
-- flex >= 2.6
-- bison >= 3.8
-- gperf >= 3.1
-- dfu-util >= 0.11
-- `libusb-1.0-0`, `libffi-dev`, `libssl-dev`
-
-Install and build on Ubuntu:
+Ubuntu helper scripts:
```bash
-sudo apt-get update
-sudo apt-get install -y git wget flex bison gperf python3 python3-pip python3-venv \
- cmake ninja-build ccache libffi-dev libssl-dev dfu-util libusb-1.0-0
-
./scripts/setup_idf_ubuntu.sh
./scripts/build_ubuntu.sh
```
-
-
-
-macOS Install
-
-Recommended baseline:
-
-- macOS 12/13/14
-- Xcode Command Line Tools
-- Homebrew
-- Python >= 3.10
-- CMake >= 3.16
-- Ninja >= 1.10
-- Git >= 2.34
-- flex >= 2.6
-- bison >= 3.8
-- gperf >= 3.1
-- dfu-util >= 0.11
-- `libusb`, `libffi`, `openssl`
-
-Install and build on macOS:
+macOS helper scripts:
```bash
-xcode-select --install
-/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
-
./scripts/setup_idf_macos.sh
./scripts/build_macos.sh
```
-
-
-### Configure
+## Configure
-MimiClaw uses a **two-layer config** system: build-time defaults in `mimi_secrets.h`, with runtime overrides via the serial CLI. CLI values are stored in NVS flash and take priority over build-time values.
+Copy the example secrets file:
```bash
-cp main/mimi_secrets.h.example main/mimi_secrets.h
+cp "main/mimi_secrets.h.example" "main/mimi_secrets.h"
```
-Edit `main/mimi_secrets.h`:
+Edit `main/mimi_secrets.h` and set the fields you actually use:
```c
+/* WiFi */
#define MIMI_SECRET_WIFI_SSID "YourWiFiName"
#define MIMI_SECRET_WIFI_PASS "YourWiFiPassword"
-#define MIMI_SECRET_TG_TOKEN "123456:ABC-DEF1234ghIkl-zyx57W2v1u123ew11"
-#define MIMI_SECRET_API_KEY "sk-ant-api03-xxxxx"
-#define MIMI_SECRET_MODEL_PROVIDER "anthropic" // "anthropic" or "openai"
-#define MIMI_SECRET_SEARCH_KEY "" // optional: Brave Search API key
-#define MIMI_SECRET_TAVILY_KEY "" // optional: Tavily API key (preferred)
-#define MIMI_SECRET_PROXY_HOST "" // optional: e.g. "10.0.0.1"
-#define MIMI_SECRET_PROXY_PORT "" // optional: e.g. "7897"
-```
-
-Then build and flash:
-
-```bash
-# Clean build (required after any mimi_secrets.h change)
-idf.py fullclean && idf.py build
-
-# Find your serial port
-ls /dev/cu.usb* # macOS
-ls /dev/ttyACM* # Linux
-# Flash and monitor (replace PORT with your port)
-# USB adapter: likely /dev/cu.usbmodem11401 (macOS) or /dev/ttyACM0 (Linux)
-idf.py -p PORT flash monitor
+/* Optional text channels */
+#define MIMI_SECRET_TG_TOKEN ""
+#define MIMI_SECRET_FEISHU_APP_ID ""
+#define MIMI_SECRET_FEISHU_APP_SECRET ""
+
+/* LLM */
+#define MIMI_SECRET_API_KEY "your-llm-key"
+#define MIMI_SECRET_MODEL "your-model"
+#define MIMI_SECRET_MODEL_PROVIDER "openai" /* or "anthropic" */
+
+/* Search and proxy */
+#define MIMI_SECRET_TAVILY_KEY ""
+#define MIMI_SECRET_SEARCH_KEY ""
+#define MIMI_SECRET_PROXY_HOST ""
+#define MIMI_SECRET_PROXY_PORT ""
+#define MIMI_SECRET_PROXY_TYPE "" /* "http" or "socks5" */
+
+/* Voice STT / TTS */
+#define MIMI_SECRET_STT_URL "https://your-stt-endpoint"
+#define MIMI_SECRET_STT_API_KEY "your-stt-key"
+#define MIMI_SECRET_STT_MODEL "your-stt-model"
+#define MIMI_SECRET_TTS_URL "https://your-tts-endpoint"
+#define MIMI_SECRET_TTS_API_KEY "your-tts-key"
+#define MIMI_SECRET_TTS_MODEL "your-tts-model"
+#define MIMI_SECRET_TTS_VOICE ""
+#define MIMI_SECRET_TTS_LANGUAGE "English"
+
+/* ReSpeaker XVF3800 I2S pin map */
+#define MIMI_VOICE_I2S_PORT 0
+#define MIMI_VOICE_I2S_BCLK GPIO_NUM_8
+#define MIMI_VOICE_I2S_WS GPIO_NUM_7
+#define MIMI_VOICE_I2S_DIN GPIO_NUM_43
+#define MIMI_VOICE_I2S_DOUT GPIO_NUM_44
```
-> **Important: Plug into the correct USB port!** Most ESP32-S3 boards have two USB-C ports. You must use the one labeled **USB** (native USB Serial/JTAG), **not** the one labeled **COM** (external UART bridge). Plugging into the wrong port will cause flash/monitor failures.
->
->
-> Show reference photo
->
->
->
->
+Notes:
-### CLI Commands (via UART/COM port)
+- `MIMI_SECRET_MODEL_PROVIDER` selects the request protocol, not just the vendor name.
+- Use `openai` for OpenAI-compatible gateways.
+- Use `anthropic` for Anthropic-compatible gateways.
+- Voice mode requires STT and TTS URL/key pairs to be configured.
+- LLM API base can be changed at runtime with `set_api_base`.
-Connect via serial to configure or debug. **Config commands** let you change settings without recompiling — just plug in a USB cable anywhere.
+## Adding STT and TTS
-**Runtime config** (saved to NVS, overrides build-time defaults):
+This project no longer treats speech as an afterthought. To enable the full ReSpeaker experience:
-```
-mimi> wifi_set MySSID MyPassword # change WiFi network
-mimi> set_tg_token 123456:ABC... # change Telegram bot token
-mimi> set_api_key sk-ant-api03-... # change API key (Anthropic or OpenAI)
-mimi> set_model_provider openai # switch provider (anthropic|openai)
-mimi> set_model gpt-4o # change LLM model
-mimi> set_proxy 127.0.0.1 7897 # set HTTP proxy
-mimi> clear_proxy # remove proxy
-mimi> set_search_key BSA... # set Brave Search API key
-mimi> set_tavily_key tvly-... # set Tavily API key (preferred)
-mimi> config_show # show all config (masked)
-mimi> config_reset # clear NVS, revert to build-time defaults
-```
+1. Configure `MIMI_SECRET_STT_URL`, `MIMI_SECRET_STT_API_KEY`, and `MIMI_SECRET_STT_MODEL`.
+2. Configure `MIMI_SECRET_TTS_URL`, `MIMI_SECRET_TTS_API_KEY`, `MIMI_SECRET_TTS_MODEL`, `MIMI_SECRET_TTS_VOICE`, and `MIMI_SECRET_TTS_LANGUAGE`.
+3. Set the XVF3800 input pins and your speaker output pins in the I2S section.
+4. If your DAC or amp sounds noisy, set `MIMI_VOICE_I2S_STD_SLOT_STYLE` to match the hardware timing style.
+5. If your room causes false triggers, tune `MIMI_VOICE_VAD_START_FRAMES`, `MIMI_VOICE_VAD_MIN_FRAMES`, and `MIMI_VOICE_STT_COOLDOWN_MS`.
+6. If your TTS audio is too long, tune `MIMI_VOICE_TTS_MAX_SECONDS`, `MIMI_VOICE_TTS_CHARS_PER_SEC`, and `MIMI_VOICE_TTS_MAX_CHARS`.
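+
+As a concrete sketch of steps 5 and 6, the tuning knobs can be set alongside the I2S defines. The values and comments below are assumptions inferred from the macro names, not measured recommendations; start from them and tune against your own room and playback chain:
+
+```c
+/* Illustrative values only: tune on your own hardware. */
+#define MIMI_VOICE_VAD_START_FRAMES 3     /* voiced frames required before capture starts */
+#define MIMI_VOICE_VAD_MIN_FRAMES   10    /* drop utterances shorter than this */
+#define MIMI_VOICE_STT_COOLDOWN_MS  1500  /* ignore retriggers right after a reply */
+#define MIMI_VOICE_TTS_MAX_SECONDS  30    /* cap on synthesized audio length */
+```
+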
-**Debug & maintenance:**
+The current firmware already contains the full voice channel:
-```
-mimi> wifi_status # am I connected?
-mimi> memory_read # see what the bot remembers
-mimi> memory_write "content" # write to MEMORY.md
-mimi> heap_info # how much RAM is free?
-mimi> session_list # list all chat sessions
-mimi> session_clear 12345 # wipe a conversation
-mimi> heartbeat_trigger # manually trigger a heartbeat check
-mimi> cron_start # start cron scheduler now
-mimi> restart # reboot
-```
-
-### USB (JTAG) vs UART: Which Port for What
-
-Most ESP32-S3 dev boards expose **two USB-C ports**:
-
-| Port | Use for |
-|------|---------|
-| **USB** (JTAG) | `idf.py flash`, JTAG debugging |
-| **COM** (UART) | **REPL CLI**, serial console |
-
-> **REPL requires the UART (COM) port.** The USB (JTAG) port does not support interactive REPL input.
+- inbound: mic PCM -> VAD -> STT -> message bus
+- outbound: agent text -> TTS -> playback
-
-Port details & recommended workflow
+## Flash and Monitor
-| Port | Label | Protocol |
-|------|-------|----------|
-| **USB** | USB / JTAG | Native USB Serial/JTAG |
-| **COM** | UART / COM | External UART bridge (CP2102/CH340) |
+After changing `main/mimi_secrets.h`, rebuild from a clean state:
-The ESP-IDF console/REPL is configured to use UART by default (`CONFIG_ESP_CONSOLE_UART_DEFAULT=y`).
-
-**If you have both ports connected simultaneously:**
-
-- USB (JTAG) handles flash/download and provides secondary serial output
-- UART (COM) provides the primary interactive console for the REPL
-- macOS: both appear as `/dev/cu.usbmodem*` or `/dev/cu.usbserial-*` — run `ls /dev/cu.usb*` to identify
-- Linux: USB (JTAG) → `/dev/ttyACM0`, UART → `/dev/ttyUSB0`
+```bash
+idf.py fullclean
+idf.py build
+```
-**Recommended workflow:**
+Find your serial port:
```bash
-# Flash via USB (JTAG) port
-idf.py -p /dev/cu.usbmodem11401 flash
-
-# Open REPL via UART (COM) port
-idf.py -p /dev/cu.usbserial-110 monitor
-# or use any serial terminal: screen, minicom, PuTTY at 115200 baud
+ls /dev/cu.usb* # macOS
+ls /dev/ttyACM* # Linux
```
-
+Flash and monitor:
-## Memory
+```bash
+idf.py -p PORT flash monitor
+```
-MimiClaw stores everything as plain text files you can read and edit:
+Replace `PORT` with your actual device path.
-| File | What it is |
-|------|------------|
-| `SOUL.md` | The bot's personality — edit this to change how it behaves |
-| `USER.md` | Info about you — name, preferences, language |
-| `MEMORY.md` | Long-term memory — things the bot should always remember |
-| `HEARTBEAT.md` | Task list the bot checks periodically and acts on autonomously |
-| `cron.json` | Scheduled jobs — recurring or one-shot tasks created by the AI |
-| `2026-02-05.md` | Daily notes — what happened today |
-| `tg_12345.jsonl` | Chat history — your conversation with the bot |
+## Serial CLI
-## Tools
+The serial CLI is the fastest way to change runtime settings stored in NVS:
-MimiClaw supports tool calling for both Anthropic and OpenAI — the LLM can call tools during a conversation and loop until the task is done (ReAct pattern).
+```text
+mimi> wifi_set MySSID MyPassword
+mimi> set_tg_token 123456:ABC...
+mimi> set_api_key your-llm-key
+mimi> set_api_base https://your-compatible-endpoint/v1
+mimi> set_model_provider openai
+mimi> set_model gpt-5.2
+mimi> set_proxy 127.0.0.1 7897
+mimi> clear_proxy
+mimi> set_search_key BSA...
+mimi> set_tavily_key tvly-...
+mimi> config_show
+mimi> config_reset
+```
-| Tool | Description |
-|------|-------------|
-| `web_search` | Search the web via Tavily (preferred) or Brave for current information |
-| `get_current_time` | Fetch current date/time via HTTP and set the system clock |
-| `cron_add` | Schedule a recurring or one-shot task (the LLM creates cron jobs on its own) |
-| `cron_list` | List all scheduled cron jobs |
-| `cron_remove` | Remove a cron job by ID |
+Maintenance commands:
+
+```text
+mimi> wifi_status
+mimi> memory_read
+mimi> memory_write "remember this"
+mimi> heap_info
+mimi> session_list
+mimi> session_clear 12345
+mimi> heartbeat_trigger
+mimi> cron_start
+mimi> restart
+```
-To enable web search, set a [Tavily API key](https://app.tavily.com/home) via `MIMI_SECRET_TAVILY_KEY` (preferred), or a [Brave Search API key](https://brave.com/search/api/) via `MIMI_SECRET_SEARCH_KEY` in `mimi_secrets.h`.
+## Compatible Provider Model
-## Cron Tasks
+`reSpeaker-claw` is not limited to the official Anthropic and OpenAI endpoints.
-MimiClaw has a built-in cron scheduler that lets the AI schedule its own tasks. The LLM can create recurring jobs ("every N seconds") or one-shot jobs ("at unix timestamp") via the `cron_add` tool. When a job fires, its message is injected into the agent loop — so the AI wakes up, processes the task, and responds.
+It supports:
-Jobs are persisted to SPIFFS (`cron.json`) and survive reboots. Example use cases: daily summaries, periodic reminders, scheduled check-ins.
+- Anthropic protocol compatible services, selected with `set_model_provider anthropic`
+- OpenAI protocol compatible services, selected with `set_model_provider openai`
+- Custom API bases through `set_api_base`
-## Heartbeat
+This makes it practical to use local gateways, regional cloud vendors, or unified API platforms without changing the agent loop.
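+
+For example, pointing the agent at a self-hosted OpenAI-compatible gateway takes a few serial CLI commands; the host, port, and model name below are placeholders:
+
+```text
+mimi> set_api_base http://192.168.1.50:8000/v1
+mimi> set_model_provider openai
+mimi> set_model your-local-model
+```
+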
-The heartbeat service periodically reads `HEARTBEAT.md` from SPIFFS and checks for actionable tasks. If uncompleted items are found (anything that isn't an empty line, a header, or a checked `- [x]` box), it sends a prompt to the agent loop so the AI can act on them autonomously.
+## Memory and Automation
-This turns MimiClaw into a proactive assistant — write tasks to `HEARTBEAT.md` and the bot will pick them up on the next heartbeat cycle (default: every 30 minutes).
+The agent persists its state in plain files on SPIFFS:
-## Also Included
+| File | Purpose |
+|------|---------|
+| `SOUL.md` | Assistant persona |
+| `USER.md` | User profile |
+| `MEMORY.md` | Long-term memory |
+| `HEARTBEAT.md` | Periodic autonomous task list |
+| `cron.json` | Scheduled jobs |
+| `tg_12345.jsonl` | Session history |
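+
+A minimal `HEARTBEAT.md` illustrates the convention: headers, blank lines, and checked `- [x]` items are ignored, while any other line counts as actionable on the next heartbeat cycle. The tasks shown are placeholders:
+
+```markdown
+# Standing tasks
+
+- [x] Send the morning weather summary
+- [ ] Remind me to water the plants
+- [ ] Check the project feed for new releases
+```
+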
-- **WebSocket gateway** on port 18789 — connect from your LAN with any WebSocket client
-- **OTA updates** — flash new firmware over WiFi, no USB needed
-- **Dual-core** — network I/O and AI processing run on separate CPU cores
-- **HTTP proxy** — CONNECT tunnel support for restricted networks
-- **Multi-provider** — supports both Anthropic (Claude) and OpenAI (GPT), switchable at runtime
-- **Cron scheduler** — the AI can schedule its own recurring and one-shot tasks, persisted across reboots
-- **Heartbeat** — periodically checks a task file and prompts the AI to act autonomously
-- **Tool use** — ReAct agent loop with tool calling for both providers
+Built-in automation features:
-## For Developers
+- `cron_add`, `cron_list`, `cron_remove`
+- heartbeat-driven proactive task handling
+- tool calling in the ReAct loop
+- local storage that survives reboot
-Technical details live in the `docs/` folder:
+## Tooling
-- **[docs/ARCHITECTURE.md](docs/ARCHITECTURE.md)** — system design, module map, task layout, memory budget, protocols, flash partitions
-- **[docs/TODO.md](docs/TODO.md)** — feature gap tracker and roadmap
-- **[docs/tool-setup/](docs/tool-setup/README.md)** — configuration guides for external service integrations (Tavily, etc.)
+Built-in tools include:
-## Contributing
+- `web_search`
+- `get_current_time`
+- `cron_add`
+- `cron_list`
+- `cron_remove`
+- SPIFFS file tools used by the agent runtime
-Please read **[CONTRIBUTING.md](CONTRIBUTING.md)** before opening issues or pull requests.
+For web search, configure either:
-## Contributors
+- `MIMI_SECRET_TAVILY_KEY`
+- `MIMI_SECRET_SEARCH_KEY`
-Thanks to everyone who has contributed to MimiClaw.
+## Acknowledgments
-
-
-
+This project builds on the original [mimiclaw](https://github.com/memovai/mimiclaw). reSpeaker-claw adapts that embedded agent foundation to ReSpeaker XVF3800 voice hardware, extends the STT / TTS pipeline, and continues the multi-channel agent architecture.
## License
MIT
-
-## Acknowledgments
-
-Inspired by [OpenClaw](https://github.com/openclaw/openclaw) and [Nanobot](https://github.com/HKUDS/nanobot). MimiClaw reimplements the core AI agent architecture for embedded hardware — no Linux, no server, just a $5 chip.
-
-## Star History
-
-[](https://star-history.com/#memovai/mimiclaw&Date)
diff --git a/README_CN.md b/README_CN.md
index f1cefa51..4a36f65f 100644
--- a/README_CN.md
+++ b/README_CN.md
@@ -1,339 +1,264 @@
-# MimiClaw: $5 芯片上的口袋 AI 助理
+# reSpeaker-claw:面向 ReSpeaker XVF3800 的语音 AI Agent
-
-
-
-
-
-
-
-
+
+
+
+
+
English | 中文 | 日本語
-**$5 芯片上的 AI 助理(OpenClaw)。没有 Linux,没有 Node.js,纯 C。**
+reSpeaker-claw 将基于 ReSpeaker XVF3800 的设备变成一个以语音为主入口的 AI Agent。它通过 I2S 采集音频,在本地执行 VAD,将话语送入 STT,并通过嵌入式 agent loop 处理。系统把实时语音交互、本地记忆、工具调用、调度、heartbeat、OTA 更新和代理支持整合在一起,最后通过 TTS 从扬声器返回响应。
-MimiClaw 把一块小小的 ESP32-S3 开发板变成你的私人 AI 助理。插上 USB 供电,连上 WiFi,通过 Telegram 跟它对话 — 它能处理你丢给它的任何任务,还会随时间积累本地记忆不断进化 — 全部跑在一颗拇指大小的芯片上。
+## 认识 reSpeaker-claw
-## 认识 MimiClaw
+- **小巧**:没有 Linux,没有 Node.js,没有臃肿依赖,只有纯 C
+- **忠诚**:从记忆中学习,重启后依然保留上下文
+- **高效**:USB 供电,低功耗,可 24/7 运行
+- **自由**:ReSpeaker XVF3800 麦克风阵列,配合你自己选择的功放或 DAC
+- **顺手**:内置语音通道,除了 XVF3800 和扬声器链路,不需要额外硬件
-- **小巧** — 没有 Linux,没有 Node.js,没有臃肿依赖 — 纯 C
-- **好用** — 在 Telegram 发消息,剩下的它来搞定
-- **忠诚** — 从记忆中学习,跨重启也不会忘
-- **能干** — USB 供电,0.5W,24/7 运行
-- **可爱** — 一块 ESP32-S3 开发板,$5,没了
+## 亮点
-## 工作原理
+- 语音输入:ReSpeaker XVF3800 麦克风阵列,通过 I2S 接入
+- 语音输出:TTS 音频下载、WAV 解码、重采样与 I2S 播放
+- 多通道 Agent:语音、Telegram、飞书、WebSocket
+- 本地持久化:SPIFFS 保存记忆、配置、会话、cron 任务和每日笔记
+- 兼容 LLM 后端:支持官方 Anthropic / OpenAI API,也支持兼容 Anthropic 或 OpenAI 协议的第三方网关
+- 可配置 STT / TTS:可接入你自己的服务 URL、API Key、模型、音色和语言
+- 运行时覆盖:可通过串口 CLI 修改 WiFi、provider、model、API base、代理和 token,无需改代码
-
+## 快速开始
-你在 Telegram 发一条消息,ESP32-S3 通过 WiFi 收到后送进 Agent 循环 — LLM 思考、调用工具、读取记忆 — 再把回复发回来。同时支持 **Anthropic (Claude)** 和 **OpenAI (GPT)** 两种提供商,运行时可切换。一切都跑在一颗 $5 的芯片上,所有数据存在本地 Flash。
+### 依赖条件
-## 快速开始
+- 一套 ReSpeaker XVF3800 USB 4-Mic Array 搭配 XIAO ESP32S3 开发板
+- 一路 I2S 输出到扬声器 / DAC / 功放
+- 一根用于烧录和串口监控的 USB 线
+- 可用的 WiFi
+- ESP-IDF v5.5+
+- 可选:如果你要使用 Telegram,需要 Telegram Bot Token
+- 可选:如果你要使用飞书,需要飞书应用凭证
+- 一个兼容 Anthropic 或 OpenAI 协议的 LLM API Key
+- 一套用于语音模式的 STT 服务和 TTS 服务
-### 你需要
+### 克隆与构建环境
-- 一块 **ESP32-S3 开发板**,16MB Flash + 8MB PSRAM(如小智 AI 开发板,~¥30)
-- 一根 **USB Type-C 数据线**
-- 一个 **Telegram Bot Token** — 在 Telegram 找 [@BotFather](https://t.me/BotFather) 创建
-- 一个 **Anthropic API Key** — 从 [console.anthropic.com](https://console.anthropic.com) 获取,或一个 **OpenAI API Key** — 从 [platform.openai.com](https://platform.openai.com) 获取
+先参考官方指南刷入 I2S 固件:
+[SeeedStudio wiki](https://wiki.seeedstudio.com/respeaker_xvf3800_introduction/#flash-firmware)
-### 安装
+然后克隆本项目并设置目标:
```bash
-# 需要先安装 ESP-IDF v5.5+:
-# https://docs.espressif.com/projects/esp-idf/en/v5.5.2/esp32s3/get-started/
-
-git clone https://github.com/memovai/mimiclaw.git
-cd mimiclaw
+git clone https://github.com/Seeed-Projects/reSpeaker-claw
+cd reSpeaker-claw
idf.py set-target esp32s3
```
-
-Ubuntu 安装
-
-建议基线:
-
-- Ubuntu 22.04/24.04
-- Python >= 3.10
-- CMake >= 3.16
-- Ninja >= 1.10
-- Git >= 2.34
-- flex >= 2.6
-- bison >= 3.8
-- gperf >= 3.1
-- dfu-util >= 0.11
-- `libusb-1.0-0`、`libffi-dev`、`libssl-dev`
+先安装 ESP-IDF:[ESP-IDF 安装](https://docs.espressif.com/projects/esp-idf/en/v5.5.3/esp32s3/get-started/)
-Ubuntu 安装与构建:
+Ubuntu 辅助脚本:
```bash
-sudo apt-get update
-sudo apt-get install -y git wget flex bison gperf python3 python3-pip python3-venv \
- cmake ninja-build ccache libffi-dev libssl-dev dfu-util libusb-1.0-0
-
./scripts/setup_idf_ubuntu.sh
./scripts/build_ubuntu.sh
```
-
-
-
-macOS 安装
-
-建议基线:
-
-- macOS 12/13/14
-- Xcode Command Line Tools
-- Homebrew
-- Python >= 3.10
-- CMake >= 3.16
-- Ninja >= 1.10
-- Git >= 2.34
-- flex >= 2.6
-- bison >= 3.8
-- gperf >= 3.1
-- dfu-util >= 0.11
-- `libusb`、`libffi`、`openssl`
-
-macOS 安装与构建:
+macOS 辅助脚本:
```bash
-xcode-select --install
-/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
-
./scripts/setup_idf_macos.sh
./scripts/build_macos.sh
```
-
-
-### 配置
+## 配置
-MimiClaw 使用**两层配置**:`mimi_secrets.h` 提供编译时默认值,串口 CLI 可在运行时覆盖。CLI 设置的值存在 NVS Flash 中,优先级高于编译时值。
+复制示例 secrets 文件:
```bash
-cp main/mimi_secrets.h.example main/mimi_secrets.h
+cp "main/mimi_secrets.h.example" "main/mimi_secrets.h"
```
-编辑 `main/mimi_secrets.h`:
+编辑 `main/mimi_secrets.h`,填写你实际需要的配置项:
```c
-#define MIMI_SECRET_WIFI_SSID "你的WiFi名"
-#define MIMI_SECRET_WIFI_PASS "你的WiFi密码"
-#define MIMI_SECRET_TG_TOKEN "123456:ABC-DEF1234ghIkl-zyx57W2v1u123ew11"
-#define MIMI_SECRET_API_KEY "sk-ant-api03-xxxxx"
-#define MIMI_SECRET_MODEL_PROVIDER "anthropic" // "anthropic" 或 "openai"
-#define MIMI_SECRET_SEARCH_KEY "" // 可选:Brave Search API key
-#define MIMI_SECRET_TAVILY_KEY "" // 可选:Tavily API key(优先)
-#define MIMI_SECRET_PROXY_HOST "10.0.0.1" // 可选:代理地址
-#define MIMI_SECRET_PROXY_PORT "7897" // 可选:代理端口
+/* WiFi */
+#define MIMI_SECRET_WIFI_SSID "YourWiFiName"
+#define MIMI_SECRET_WIFI_PASS "YourWiFiPassword"
+
+/* Optional text channels */
+#define MIMI_SECRET_TG_TOKEN ""
+#define MIMI_SECRET_FEISHU_APP_ID ""
+#define MIMI_SECRET_FEISHU_APP_SECRET ""
+
+/* LLM */
+#define MIMI_SECRET_API_KEY "your-llm-key"
+#define MIMI_SECRET_MODEL "your-model"
+#define MIMI_SECRET_MODEL_PROVIDER "openai" /* or "anthropic" */
+
+/* Search and proxy */
+#define MIMI_SECRET_TAVILY_KEY ""
+#define MIMI_SECRET_SEARCH_KEY ""
+#define MIMI_SECRET_PROXY_HOST ""
+#define MIMI_SECRET_PROXY_PORT ""
+#define MIMI_SECRET_PROXY_TYPE "" /* "http" or "socks5" */
+
+/* Voice STT / TTS */
+#define MIMI_SECRET_STT_URL "https://your-stt-endpoint"
+#define MIMI_SECRET_STT_API_KEY "your-stt-key"
+#define MIMI_SECRET_STT_MODEL "your-stt-model"
+#define MIMI_SECRET_TTS_URL "https://your-tts-endpoint"
+#define MIMI_SECRET_TTS_API_KEY "your-tts-key"
+#define MIMI_SECRET_TTS_MODEL "your-tts-model"
+#define MIMI_SECRET_TTS_VOICE ""
+#define MIMI_SECRET_TTS_LANGUAGE "English"
+
+/* ReSpeaker XVF3800 I2S pin map */
+#define MIMI_VOICE_I2S_PORT 0
+#define MIMI_VOICE_I2S_BCLK GPIO_NUM_8
+#define MIMI_VOICE_I2S_WS GPIO_NUM_7
+#define MIMI_VOICE_I2S_DIN GPIO_NUM_43
+#define MIMI_VOICE_I2S_DOUT GPIO_NUM_44
```
-然后编译烧录:
+说明:
-```bash
-# 完整编译(修改 mimi_secrets.h 后必须 fullclean)
-idf.py fullclean && idf.py build
-
-# 查找串口
-ls /dev/cu.usb* # macOS
-ls /dev/ttyACM* # Linux
-
-# 烧录并监控(将 PORT 替换为你的串口)
-# USB 转接器:大概率是 /dev/cu.usbmodem11401(macOS)或 /dev/ttyACM0(Linux)
-idf.py -p PORT flash monitor
-```
-
-> **注意:请插对 USB 口!** 大多数 ESP32-S3 开发板有两个 Type-C 接口,必须插标有 **USB** 的那个口(原生 USB Serial/JTAG),**不要**插标有 **COM** 的口(外部 UART 桥接)。插错口会导致烧录/监控失败。
->
->
-> 查看参考图片
->
->
->
->
+- `MIMI_SECRET_MODEL_PROVIDER` 选择的是请求协议,而不只是厂商名
+- 兼容 OpenAI 协议的网关使用 `openai`
+- 兼容 Anthropic 协议的网关使用 `anthropic`
+- 语音模式要求 STT 与 TTS 的 URL / Key 成对配置
+- LLM API base 可在运行时通过 `set_api_base` 修改
-### 代理配置(国内用户)
+## 添加 STT 和 TTS
-在国内需要代理才能访问 Telegram 和 Anthropic API。MimiClaw 内置 HTTP CONNECT 隧道支持。
+这个项目不再把语音当成附属功能。要启用完整的 ReSpeaker 体验:
-**前提**:局域网内有一个支持 HTTP CONNECT 的代理(Clash Verge、V2Ray 等),并开启了「允许局域网连接」。
+1. 配置 `MIMI_SECRET_STT_URL`、`MIMI_SECRET_STT_API_KEY` 和 `MIMI_SECRET_STT_MODEL`
+2. 配置 `MIMI_SECRET_TTS_URL`、`MIMI_SECRET_TTS_API_KEY`、`MIMI_SECRET_TTS_MODEL`、`MIMI_SECRET_TTS_VOICE` 和 `MIMI_SECRET_TTS_LANGUAGE`
+3. 在 I2S 配置段中设置 XVF3800 的输入引脚和扬声器输出引脚
+4. 如果 DAC 或功放播放出来像噪音,设置 `MIMI_VOICE_I2S_STD_SLOT_STYLE` 以匹配硬件时序
+5. 如果房间环境导致误触发,调节 `MIMI_VOICE_VAD_START_FRAMES`、`MIMI_VOICE_VAD_MIN_FRAMES` 和 `MIMI_VOICE_STT_COOLDOWN_MS`
+6. 如果 TTS 音频过长,调节 `MIMI_VOICE_TTS_MAX_SECONDS`、`MIMI_VOICE_TTS_CHARS_PER_SEC` 和 `MIMI_VOICE_TTS_MAX_CHARS`
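+
+作为第 5、6 步的具体示例,这些调参宏可以和 I2S 定义放在一起设置。下面的数值和注释是根据宏名推测的假设,并非实测推荐值,请以此为起点在自己的环境中调节:
+
+```c
+/* 仅为示例数值:请在自己的硬件上调节 */
+#define MIMI_VOICE_VAD_START_FRAMES 3     /* 触发采集所需的连续有声帧数 */
+#define MIMI_VOICE_VAD_MIN_FRAMES   10    /* 丢弃短于该帧数的话语 */
+#define MIMI_VOICE_STT_COOLDOWN_MS  1500  /* 回复结束后的一段时间内忽略再次触发 */
+#define MIMI_VOICE_TTS_MAX_SECONDS  30    /* 合成音频时长上限 */
+```
+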
-可以在 `mimi_secrets.h` 中编译时设置,也可以通过串口 CLI 随时修改:
+当前固件已经包含完整的语音通道:
-```
-mimi> set_proxy 192.168.1.83 7897 # 设置代理
-mimi> clear_proxy # 清除代理
-```
+- 输入方向:mic PCM -> VAD -> STT -> message bus
+- 输出方向:agent text -> TTS -> playback
-> **提示**:确保 ESP32-S3 和代理机器在同一局域网。Clash Verge 在「设置 → 允许局域网」中开启。
+## 烧录与监控
-### CLI 命令(通过 UART/COM 口连接)
+修改 `main/mimi_secrets.h` 后,建议从干净状态重新构建:
-通过串口连接即可配置和调试。**配置命令**让你无需重新编译就能修改设置 — 随时随地插上 USB 线就能改。
-
-**运行时配置**(存入 NVS,覆盖编译时默认值):
-
-```
-mimi> wifi_set MySSID MyPassword # 换 WiFi
-mimi> set_tg_token 123456:ABC... # 换 Telegram Bot Token
-mimi> set_api_key sk-ant-api03-... # 换 API Key(Anthropic 或 OpenAI)
-mimi> set_model_provider openai # 切换提供商(anthropic|openai)
-mimi> set_model gpt-4o # 换模型
-mimi> set_proxy 192.168.1.83 7897 # 设置代理
-mimi> clear_proxy # 清除代理
-mimi> set_search_key BSA... # 设置 Brave Search API Key
-mimi> set_tavily_key tvly-... # 设置 Tavily API Key(优先)
-mimi> config_show # 查看所有配置(脱敏显示)
-mimi> config_reset # 清除 NVS,恢复编译时默认值
+```bash
+idf.py fullclean
+idf.py build
```
-**调试与运维:**
+查找串口:
+```bash
+ls /dev/cu.usb* # macOS
+ls /dev/ttyACM* # Linux
```
-mimi> wifi_status # 连上了吗?
-mimi> memory_read # 看看它记住了什么
-mimi> memory_write "内容" # 写入 MEMORY.md
-mimi> heap_info # 还剩多少内存?
-mimi> session_list # 列出所有会话
-mimi> session_clear 12345 # 删除一个会话
-mimi> heartbeat_trigger # 手动触发一次心跳检查
-mimi> cron_start # 立即启动 cron 调度器
-mimi> restart # 重启
-```
-
-### USB (JTAG) 与 UART:哪个口做什么
-大多数 ESP32-S3 开发板有 **两个 USB-C 口**:
-
-| 端口 | 用途 |
-|------|------|
-| **USB**(JTAG) | `idf.py flash`、JTAG 调试 |
-| **COM**(UART) | **REPL 命令行**、串口控制台 |
-
-> **REPL 必须连接 UART(COM)口。** USB(JTAG)口不支持交互式 REPL 输入。
-
-
-端口详情与推荐工作流
-
-| 端口 | 标注 | 协议 |
-|------|------|------|
-| **USB** | USB / JTAG | 原生 USB Serial/JTAG |
-| **COM** | UART / COM | 外置 UART 桥接芯片(CP2102/CH340) |
-
-ESP-IDF 控制台默认配置为 UART 输出(`CONFIG_ESP_CONSOLE_UART_DEFAULT=y`)。
-
-**同时连接两个口时:**
-
-- USB(JTAG)口负责烧录/下载,并提供辅助串口输出
-- UART(COM)口提供主要的交互式控制台,用于 REPL
-- macOS 下两个口都会显示为 `/dev/cu.usbmodem*` 或 `/dev/cu.usbserial-*`,用 `ls /dev/cu.usb*` 区分
-- Linux 下 USB(JTAG)通常是 `/dev/ttyACM0`,UART 通常是 `/dev/ttyUSB0`
-
-**推荐工作流:**
+烧录并监控:
```bash
-# 通过 USB(JTAG)口烧录
-idf.py -p /dev/cu.usbmodem11401 flash
-
-# 通过 UART(COM)口打开 REPL
-idf.py -p /dev/cu.usbserial-110 monitor
-# 或使用任意串口工具:screen、minicom、PuTTY,波特率 115200
+idf.py -p PORT flash monitor
```
-
-
-## 记忆
-
-MimiClaw 把所有数据存为纯文本文件,可以直接读取和编辑:
-
-| 文件 | 说明 |
-|------|------|
-| `SOUL.md` | 机器人的人设 — 编辑它来改变行为方式 |
-| `USER.md` | 关于你的信息 — 姓名、偏好、语言 |
-| `MEMORY.md` | 长期记忆 — 它应该一直记住的事 |
-| `HEARTBEAT.md` | 待办清单 — 机器人定期检查并自主执行 |
-| `cron.json` | 定时任务 — AI 创建的周期性或一次性任务 |
-| `2026-02-05.md` | 每日笔记 — 今天发生了什么 |
-| `tg_12345.jsonl` | 聊天记录 — 你和它的对话 |
-
-## 工具
-
-MimiClaw 同时支持 Anthropic 和 OpenAI 的工具调用 — LLM 在对话中可以调用工具,循环执行直到任务完成(ReAct 模式)。
+将 `PORT` 替换为你的实际设备路径。
+
+## 串口 CLI
+
+串口 CLI 是修改 NVS 运行时配置的最快方式:
+
+```text
+mimi> wifi_set MySSID MyPassword
+mimi> set_tg_token 123456:ABC...
+mimi> set_api_key your-llm-key
+mimi> set_api_base https://your-compatible-endpoint/v1
+mimi> set_model_provider openai
+mimi> set_model gpt-5.2
+mimi> set_proxy 127.0.0.1 7897
+mimi> clear_proxy
+mimi> set_search_key BSA...
+mimi> set_tavily_key tvly-...
+mimi> config_show
+mimi> config_reset
+```
-| 工具 | 说明 |
-|------|------|
-| `web_search` | 通过 Tavily(优先)或 Brave 搜索网页,获取实时信息 |
-| `get_current_time` | 通过 HTTP 获取当前日期和时间,并设置系统时钟 |
-| `cron_add` | 创建定时或一次性任务(LLM 自主创建 cron 任务) |
-| `cron_list` | 列出所有已调度的 cron 任务 |
-| `cron_remove` | 按 ID 删除 cron 任务 |
+维护命令:
+
+```text
+mimi> wifi_status
+mimi> memory_read
+mimi> memory_write "remember this"
+mimi> heap_info
+mimi> session_list
+mimi> session_clear 12345
+mimi> heartbeat_trigger
+mimi> cron_start
+mimi> restart
+```
-启用网页搜索可在 `mimi_secrets.h` 中设置 [Tavily API key](https://app.tavily.com/home)(优先,`MIMI_SECRET_TAVILY_KEY`),或 [Brave Search API key](https://brave.com/search/api/)(`MIMI_SECRET_SEARCH_KEY`)。
+## 兼容 Provider 模型
-## 定时任务(Cron)
+`reSpeaker-claw` 不局限于官方 Anthropic 和 OpenAI 端点。
-MimiClaw 内置 cron 调度器,让 AI 可以自主安排任务。LLM 可以通过 `cron_add` 工具创建周期性任务("每 N 秒")或一次性任务("在某个时间戳")。任务触发时,消息会注入到 Agent 循环 — AI 自动醒来、处理任务并回复。
+它支持:
-任务持久化存储在 SPIFFS(`cron.json`),重启后不会丢失。典型用途:每日总结、定时提醒、定期巡检。
+- 兼容 Anthropic 协议的服务,通过 `set_model_provider anthropic` 选择
+- 兼容 OpenAI 协议的服务,通过 `set_model_provider openai` 选择
+- 通过 `set_api_base` 指向任意兼容 API base
-## 心跳(Heartbeat)
+这让你可以在不修改 agent loop 的情况下,直接使用本地网关、区域云厂商或统一 API 平台。
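+
+例如,把 Agent 指向一个自建的 OpenAI 兼容网关只需几条串口 CLI 命令;下面的地址、端口和模型名仅为占位示例:
+
+```text
+mimi> set_api_base http://192.168.1.50:8000/v1
+mimi> set_model_provider openai
+mimi> set_model your-local-model
+```
+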
-心跳服务会定期读取 SPIFFS 上的 `HEARTBEAT.md`,检查是否有待办事项。如果发现未完成的条目(非空行、非标题、非已勾选的 `- [x]`),就会向 Agent 循环发送提示,让 AI 自主处理。
+## 记忆与自动化
-这让 MimiClaw 变成一个主动型助理 — 把任务写入 `HEARTBEAT.md`,机器人会在下一次心跳周期自动拾取执行(默认每 30 分钟)。
+Agent 会将状态以纯文本文件形式持久化到 SPIFFS:
-## 其他功能
+| 文件 | 用途 |
+|------|------|
+| `SOUL.md` | 助手人格 |
+| `USER.md` | 用户资料 |
+| `MEMORY.md` | 长期记忆 |
+| `HEARTBEAT.md` | 周期性自主任务列表 |
+| `cron.json` | 调度任务 |
+| `tg_12345.jsonl` | 会话历史 |
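+
+一个最小的 `HEARTBEAT.md` 可以说明其约定:标题、空行和已勾选的 `- [x]` 条目会被忽略,其余条目都会在下一次心跳周期被视为待办并触发 Agent。示例任务仅为占位内容:
+
+```markdown
+# 常驻任务
+
+- [x] 发送今日天气摘要
+- [ ] 提醒我给植物浇水
+- [ ] 检查项目动态是否有新版本
+```
+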
-- **WebSocket 网关** — 端口 18789,局域网内用任意 WebSocket 客户端连接
-- **OTA 更新** — WiFi 远程刷固件,无需 USB
-- **双核** — 网络 I/O 和 AI 处理分别跑在不同 CPU 核心
-- **HTTP 代理** — CONNECT 隧道,适配受限网络
-- **多提供商** — 同时支持 Anthropic (Claude) 和 OpenAI (GPT),运行时可切换
-- **定时任务** — AI 可自主创建周期性和一次性任务,重启后持久保存
-- **心跳服务** — 定期检查任务文件,驱动 AI 自主执行
-- **工具调用** — ReAct Agent 循环,两种提供商均支持工具调用
+内置自动化能力:
-## 开发者
+- `cron_add`、`cron_list`、`cron_remove`
+- heartbeat 驱动的主动任务处理
+- ReAct loop 中的工具调用
+- 重启后仍可保留的本地状态
-技术细节在 `docs/` 文件夹:
+## 工具
-- **[docs/ARCHITECTURE.md](docs/ARCHITECTURE.md)** — 系统设计、模块划分、任务布局、内存分配、协议、Flash 分区
-- **[docs/TODO.md](docs/TODO.md)** — 功能差距和路线图
-- **[docs/im-integration/](docs/im-integration/README.md)** — IM 通道集成指南(飞书等)
+内置工具包括:
-## 贡献
+- `web_search`
+- `get_current_time`
+- `cron_add`
+- `cron_list`
+- `cron_remove`
+- Agent 运行时使用的 SPIFFS 文件工具
-提交 Issue 或 Pull Request 前,请先阅读 **[CONTRIBUTING.md](CONTRIBUTING.md)**。
+如需启用网页搜索,配置以下任一项:
-## 贡献者
+- `MIMI_SECRET_TAVILY_KEY`
+- `MIMI_SECRET_SEARCH_KEY`
-感谢所有为 MimiClaw 做出贡献的开发者。
+## 致谢
-
-
-
+本项目基于原始的 [mimiclaw](https://github.com/memovai/mimiclaw)。reSpeaker-claw 将那套嵌入式 agent 基础适配到 ReSpeaker XVF3800 语音硬件之上,扩展了 STT / TTS 流程,并延续了多通道 agent 架构。
## 许可证
MIT
-
-## 致谢
-
-灵感来自 [OpenClaw](https://github.com/openclaw/openclaw) 和 [Nanobot](https://github.com/HKUDS/nanobot)。MimiClaw 为嵌入式硬件重新实现了核心 AI Agent 架构 — 没有 Linux,没有服务器,只有一颗 $5 的芯片。
-
-## Star History
-
-
-
-
-
-
-
-
diff --git a/README_JA.md b/README_JA.md
index fe91a8a9..cfd2e2d7 100644
--- a/README_JA.md
+++ b/README_JA.md
@@ -1,324 +1,264 @@
-# MimiClaw: $5チップで動くポケットAIアシスタント
+# reSpeaker-claw: ReSpeaker XVF3800 向け音声 AI Agent
-
-
-
-
-
-
-
-
+
+
+
+
+
English | 中文 | 日本語
-**$5チップ上の世界初のAIアシスタント(OpenClaw)。Linuxなし、Node.jsなし、純粋なCのみ。**
-
-MimiClawは小さなESP32-S3ボードをパーソナルAIアシスタントに変えます。USB電源に接続し、WiFiにつなげて、Telegramから話しかけるだけ — どんなタスクも処理し、ローカルメモリで時間とともに成長します — すべて親指サイズのチップ上で。
+reSpeaker-claw は、ReSpeaker XVF3800 ベースのデバイスを音声ファーストの AI Agent に変えるプロジェクトです。I2S で音声を取り込み、ローカル VAD を実行し、発話を STT に送って組み込みの agent loop で処理します。システムはリアルタイム音声対話に加えて、ローカルメモリ、ツール呼び出し、スケジューリング、heartbeat、OTA 更新、プロキシ対応を統合し、最終的に TTS でスピーカーから応答を返します。
-## MimiClawの特徴
+## reSpeaker-claw とは
-- **超小型** — Linux不要、Node.js不要、無駄なし — 純粋なCのみ
-- **便利** — Telegramでメッセージを送るだけ、あとはお任せ
-- **忠実** — メモリから学習し、再起動しても忘れない
-- **省エネ** — USB給電、0.5W、24時間365日稼働
-- **お手頃** — ESP32-S3ボード1枚、$5、それだけ
+- **小さい**: Linux なし、Node.js なし、無駄な依存なし、純粋な C のみ
+- **記憶する**: メモリから学習し、再起動後も文脈を保持
+- **省電力**: USB 給電、より低消費電力で 24/7 稼働可能
+- **自由度が高い**: ReSpeaker XVF3800 のマイクアレイに、好みのアンプや DAC を組み合わせ可能
+- **扱いやすい**: 音声チャネルを内蔵し、XVF3800 とスピーカー経路以外の追加ハードウェアをほぼ必要としない
-## 仕組み
+## 特長
-
-
-Telegramでメッセージを送ると、ESP32-S3がWiFi経由で受信し、エージェントループに送ります — LLMが思考し、ツールを呼び出し、メモリを読み取り — 返答を送り返します。**Anthropic (Claude)** と **OpenAI (GPT)** の両方をサポートし、実行時に切り替え可能です。すべてが$5のチップ上で動作し、データはすべてローカルのFlashに保存されます。
+- 音声入力: ReSpeaker XVF3800 マイクアレイを I2S で接続
+- 音声出力: TTS 音声のダウンロード、WAV デコード、リサンプル、I2S 再生
+- マルチチャネル Agent: 音声、Telegram、Feishu、WebSocket
+- ローカル永続化: SPIFFS にメモリ、設定、セッション、cron ジョブ、日次メモを保存
+- 互換 LLM バックエンド: 公式 Anthropic / OpenAI API に加え、Anthropic 互換または OpenAI 互換エンドポイントも利用可能
+- 柔軟な STT / TTS 設定: URL、API Key、モデル、音色、言語を自由に差し替え可能
+- 実行時オーバーライド: WiFi、provider、model、API base、proxy、token をシリアル CLI から変更可能
## クイックスタート
### 必要なもの
-- **ESP32-S3開発ボード**(16MB Flash + 8MB PSRAM搭載、例:小智AIボード、約$10)
-- **USB Type-Cケーブル**
-- **Telegram Botトークン** — Telegramで[@BotFather](https://t.me/BotFather)に話しかけて作成
-- **Anthropic APIキー** — [console.anthropic.com](https://console.anthropic.com)から取得、または **OpenAI APIキー** — [platform.openai.com](https://platform.openai.com)から取得
+- reSpeaker XVF3800 USB 4 Microphone Array と XIAO ESP32S3 ボード
+- I2S 出力で接続するスピーカー / DAC / アンプ経路
+- 書き込みとシリアルモニタ用の USB ケーブル
+- WiFi 接続
+- ESP-IDF v5.5+
+- 任意: Telegram を使う場合は Telegram Bot Token
+- 任意: Feishu を使う場合は Feishu アプリ認証情報
+- Anthropic 互換または OpenAI 互換エンドポイント向けの LLM API Key
+- 音声モード用の STT サービスと TTS サービス
-### インストール
+### クローンとビルド環境
-```bash
-# まずESP-IDF v5.5+をインストールしてください:
-# https://docs.espressif.com/projects/esp-idf/en/v5.5.2/esp32s3/get-started/
+まず公式ガイドを参照して I2S ファームウェアを書き込んでください:
+[SeeedStudio wiki](https://wiki.seeedstudio.com/respeaker_xvf3800_introduction/#flash-firmware)
+
+その後、このプロジェクトをクローンしてターゲットを設定します:
-git clone https://github.com/memovai/mimiclaw.git
-cd mimiclaw
+```bash
+git clone https://github.com/Seeed-Projects/reSpeaker-claw
+cd reSpeaker-claw
idf.py set-target esp32s3
```
-
-Ubuntu インストール
-
-推奨ベースライン:
+なお、`idf.py` の実行には ESP-IDF v5.5+ が必要です。未インストールの場合はこちら: [ESP-IDF Install](https://docs.espressif.com/projects/esp-idf/en/v5.5.3/esp32s3/get-started/)
-- Ubuntu 22.04/24.04
-- Python >= 3.10
-- CMake >= 3.16
-- Ninja >= 1.10
-- Git >= 2.34
-- flex >= 2.6
-- bison >= 3.8
-- gperf >= 3.1
-- dfu-util >= 0.11
-- `libusb-1.0-0`, `libffi-dev`, `libssl-dev`
-
-Ubuntu でのインストールとビルド:
+Ubuntu 用ヘルパースクリプト:
```bash
-sudo apt-get update
-sudo apt-get install -y git wget flex bison gperf python3 python3-pip python3-venv \
- cmake ninja-build ccache libffi-dev libssl-dev dfu-util libusb-1.0-0
-
./scripts/setup_idf_ubuntu.sh
./scripts/build_ubuntu.sh
```
-
-
-
-macOS インストール
-
-推奨ベースライン:
-
-- macOS 12/13/14
-- Xcode Command Line Tools
-- Homebrew
-- Python >= 3.10
-- CMake >= 3.16
-- Ninja >= 1.10
-- Git >= 2.34
-- flex >= 2.6
-- bison >= 3.8
-- gperf >= 3.1
-- dfu-util >= 0.11
-- `libusb`, `libffi`, `openssl`
-
-macOS でのインストールとビルド:
+macOS 用ヘルパースクリプト:
```bash
-xcode-select --install
-/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
-
./scripts/setup_idf_macos.sh
./scripts/build_macos.sh
```
-
+## 設定
-### 設定
-
-MimiClawは**2層設定**を採用しています:`mimi_secrets.h`でビルド時のデフォルト値を設定し、シリアルCLIで実行時にオーバーライドできます。CLI設定値はNVS Flashに保存され、ビルド時の値より優先されます。
+まず secrets のサンプルファイルをコピーします:
```bash
-cp main/mimi_secrets.h.example main/mimi_secrets.h
+cp "main/mimi_secrets.h.example" "main/mimi_secrets.h"
```
-`main/mimi_secrets.h`を編集:
+`main/mimi_secrets.h` を編集し、実際に使う項目を設定します:
```c
-#define MIMI_SECRET_WIFI_SSID "WiFi名"
-#define MIMI_SECRET_WIFI_PASS "WiFiパスワード"
-#define MIMI_SECRET_TG_TOKEN "123456:ABC-DEF1234ghIkl-zyx57W2v1u123ew11"
-#define MIMI_SECRET_API_KEY "sk-ant-api03-xxxxx"
-#define MIMI_SECRET_MODEL_PROVIDER "anthropic" // "anthropic" または "openai"
-#define MIMI_SECRET_SEARCH_KEY "" // 任意:Brave Search APIキー
-#define MIMI_SECRET_TAVILY_KEY "" // 任意:Tavily APIキー(優先)
-#define MIMI_SECRET_PROXY_HOST "" // 任意:例 "10.0.0.1"
-#define MIMI_SECRET_PROXY_PORT "" // 任意:例 "7897"
+/* WiFi */
+#define MIMI_SECRET_WIFI_SSID "YourWiFiName"
+#define MIMI_SECRET_WIFI_PASS "YourWiFiPassword"
+
+/* Optional text channels */
+#define MIMI_SECRET_TG_TOKEN ""
+#define MIMI_SECRET_FEISHU_APP_ID ""
+#define MIMI_SECRET_FEISHU_APP_SECRET ""
+
+/* LLM */
+#define MIMI_SECRET_API_KEY "your-llm-key"
+#define MIMI_SECRET_MODEL "your-model"
+#define MIMI_SECRET_MODEL_PROVIDER "openai" /* or "anthropic" */
+
+/* Search and proxy */
+#define MIMI_SECRET_TAVILY_KEY ""
+#define MIMI_SECRET_SEARCH_KEY ""
+#define MIMI_SECRET_PROXY_HOST ""
+#define MIMI_SECRET_PROXY_PORT ""
+#define MIMI_SECRET_PROXY_TYPE "" /* "http" or "socks5" */
+
+/* Voice STT / TTS */
+#define MIMI_SECRET_STT_URL "https://your-stt-endpoint"
+#define MIMI_SECRET_STT_API_KEY "your-stt-key"
+#define MIMI_SECRET_STT_MODEL "your-stt-model"
+#define MIMI_SECRET_TTS_URL "https://your-tts-endpoint"
+#define MIMI_SECRET_TTS_API_KEY "your-tts-key"
+#define MIMI_SECRET_TTS_MODEL "your-tts-model"
+#define MIMI_SECRET_TTS_VOICE ""
+#define MIMI_SECRET_TTS_LANGUAGE "English"
+
+/* ReSpeaker XVF3800 I2S pin map */
+#define MIMI_VOICE_I2S_PORT 0
+#define MIMI_VOICE_I2S_BCLK GPIO_NUM_8
+#define MIMI_VOICE_I2S_WS GPIO_NUM_7
+#define MIMI_VOICE_I2S_DIN GPIO_NUM_43
+#define MIMI_VOICE_I2S_DOUT GPIO_NUM_44
```
-ビルドとフラッシュ:
+補足:
-```bash
-# フルビルド(mimi_secrets.h変更後はfullclean必須)
-idf.py fullclean && idf.py build
+- `MIMI_SECRET_MODEL_PROVIDER` はベンダ名ではなく、リクエストプロトコルを選択します
+- OpenAI 互換ゲートウェイには `openai` を使用します
+- Anthropic 互換ゲートウェイには `anthropic` を使用します
+- 音声モードでは STT と TTS の URL / Key を両方設定する必要があります
+- LLM API base は実行時に `set_api_base` で変更できます
-# シリアルポートを確認
-ls /dev/cu.usb* # macOS
-ls /dev/ttyACM* # Linux
+## STT と TTS の追加
-# フラッシュとモニター(PORTをあなたのポートに置き換え)
-# USBアダプタ:おそらく /dev/cu.usbmodem11401(macOS)または /dev/ttyACM0(Linux)
-idf.py -p PORT flash monitor
-```
+このプロジェクトでは、音声を後付け機能として扱っていません。完全な ReSpeaker 体験を有効にするには:
-> **重要:正しいUSBポートに接続してください!** ほとんどのESP32-S3ボードには2つのUSB-Cポートがあります。**USB**(ネイティブUSB Serial/JTAG)と書かれたポートを使用してください。**COM**(外部UARTブリッジ)と書かれたポートは使わないでください。間違ったポートに接続するとフラッシュ/モニターが失敗します。
->
->
-> 参考画像を表示
->
->
->
->
+1. `MIMI_SECRET_STT_URL`、`MIMI_SECRET_STT_API_KEY`、`MIMI_SECRET_STT_MODEL` を設定します
+2. `MIMI_SECRET_TTS_URL`、`MIMI_SECRET_TTS_API_KEY`、`MIMI_SECRET_TTS_MODEL`、`MIMI_SECRET_TTS_VOICE`、`MIMI_SECRET_TTS_LANGUAGE` を設定します
+3. I2S セクションで XVF3800 の入力ピンとスピーカー側の出力ピンを設定します
+4. DAC やアンプの音がノイズになる場合は、`MIMI_VOICE_I2S_STD_SLOT_STYLE` をハードウェアのタイミングに合わせて設定します
+5. 室内環境で誤検知が多い場合は、`MIMI_VOICE_VAD_START_FRAMES`、`MIMI_VOICE_VAD_MIN_FRAMES`、`MIMI_VOICE_STT_COOLDOWN_MS` を調整します
+6. TTS 音声が長すぎる場合は、`MIMI_VOICE_TTS_MAX_SECONDS`、`MIMI_VOICE_TTS_CHARS_PER_SEC`、`MIMI_VOICE_TTS_MAX_CHARS` を調整します
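+
+チューニングの一例です(値はあくまで例示で、実際のマクロ既定値とは異なる場合があります):
+
+```c
+/* VAD: 発話開始とみなす連続フレーム数(大きくすると誤検知が減る) */
+#define MIMI_VOICE_VAD_START_FRAMES  3
+/* これより短い発話は STT に送らない */
+#define MIMI_VOICE_VAD_MIN_FRAMES    10
+/* STT 呼び出し間の最小間隔(ミリ秒) */
+#define MIMI_VOICE_STT_COOLDOWN_MS   500
+```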
-### CLIコマンド(UART/COMポート経由)
+現在のファームウェアには、すでに完全な音声チャネルが含まれています:
-シリアル接続で設定やデバッグができます。**設定コマンド**により再コンパイル不要で設定変更可能 — USBケーブルを挿すだけ。
+- 入力方向: mic PCM -> VAD -> STT -> message bus
+- 出力方向: agent text -> TTS -> playback
-**実行時設定**(NVSに保存、ビルド時のデフォルト値をオーバーライド):
+## 書き込みとモニタ
-```
-mimi> wifi_set MySSID MyPassword # WiFiネットワークを変更
-mimi> set_tg_token 123456:ABC... # Telegram Botトークンを変更
-mimi> set_api_key sk-ant-api03-... # APIキーを変更(AnthropicまたはOpenAI)
-mimi> set_model_provider openai # プロバイダーを切替(anthropic|openai)
-mimi> set_model gpt-4o # LLMモデルを変更
-mimi> set_proxy 127.0.0.1 7897 # HTTPプロキシを設定
-mimi> clear_proxy # プロキシを削除
-mimi> set_search_key BSA... # Brave Search APIキーを設定
-mimi> set_tavily_key tvly-... # Tavily APIキーを設定(優先)
-mimi> config_show # 全設定を表示(マスク付き)
-mimi> config_reset # NVSをクリア、ビルド時デフォルトに戻す
-```
-
-**デバッグ・メンテナンス:**
+`main/mimi_secrets.h` を変更した後は、クリーンな状態から再ビルドしてください:
+```bash
+idf.py fullclean
+idf.py build
```
-mimi> wifi_status # 接続されていますか?
-mimi> memory_read # ボットが何を覚えているか確認
-mimi> memory_write "内容" # MEMORY.mdに書き込み
-mimi> heap_info # 空きRAMはどれくらい?
-mimi> session_list # 全チャットセッションを一覧
-mimi> session_clear 12345 # 会話を削除
-mimi> heartbeat_trigger # ハートビートチェックを手動トリガー
-mimi> cron_start # cronスケジューラを今すぐ開始
-mimi> restart # 再起動
-```
-
-### USB(JTAG)vs UART:どのポートで何をするか
-
-ほとんどの ESP32-S3 開発ボードには **2つの USB-C ポート**があります:
-| ポート | 用途 |
-|--------|------|
-| **USB**(JTAG) | `idf.py flash`、JTAGデバッグ |
-| **COM**(UART) | **REPL CLI**、シリアルコンソール |
-
-> **REPLにはUART(COM)ポートが必要です。** USB(JTAG)ポートは対話的なREPL入力をサポートしません。
-
-
-ポート詳細と推奨ワークフロー
-
-| ポート | ラベル | プロトコル |
-|--------|--------|------------|
-| **USB** | USB / JTAG | ネイティブ USB Serial/JTAG |
-| **COM** | UART / COM | 外部 UART ブリッジ(CP2102/CH340) |
-
-ESP-IDFコンソールはデフォルトでUART出力に設定されています(`CONFIG_ESP_CONSOLE_UART_DEFAULT=y`)。
-
-**両方のポートを同時に接続している場合:**
-
-- USB(JTAG)ポートはフラッシュ/ダウンロードを処理し、補助シリアル出力を提供
-- UART(COM)ポートはREPL用のメインインタラクティブコンソールを提供
-- macOS では両ポートとも `/dev/cu.usbmodem*` または `/dev/cu.usbserial-*` として表示 — `ls /dev/cu.usb*` で確認
-- Linux では USB(JTAG)は通常 `/dev/ttyACM0`、UART は通常 `/dev/ttyUSB0`
-
-**推奨ワークフロー:**
+シリアルポートを確認します:
```bash
-# USB(JTAG)ポートでフラッシュ
-idf.py -p /dev/cu.usbmodem11401 flash
-
-# UART(COM)ポートでREPLを開く
-idf.py -p /dev/cu.usbserial-110 monitor
-# または任意のシリアルターミナル:screen、minicom、PuTTY(ボーレート 115200)
+ls /dev/cu.usb* # macOS
+ls /dev/ttyACM* # Linux
```
-
-
-## メモリ
-
-MimiClawはすべてのデータをプレーンテキストファイルとして保存します。直接読み取り・編集可能です:
+書き込みとモニタ:
-| ファイル | 説明 |
-|----------|------|
-| `SOUL.md` | ボットの性格 — 編集して振る舞いを変更 |
-| `USER.md` | あなたの情報 — 名前、好み、言語 |
-| `MEMORY.md` | 長期記憶 — ボットが常に覚えておくべきこと |
-| `HEARTBEAT.md` | タスクリスト — ボットが定期的にチェックして自律的に実行 |
-| `cron.json` | スケジュールジョブ — AIが作成した定期・単発タスク |
-| `2026-02-05.md` | 日次メモ — 今日あったこと |
-| `tg_12345.jsonl` | チャット履歴 — ボットとの会話 |
-
-## ツール
+```bash
+idf.py -p PORT flash monitor
+```
-MimiClawはAnthropicとOpenAI両方のツール呼び出しをサポート — LLMは会話中にツールを呼び出し、タスクが完了するまでループします(ReActパターン)。
+`PORT` は実際のデバイスパスに置き換えてください(例: Linux では `/dev/ttyACM0`、macOS では `/dev/cu.usbmodem*`)。
+
+## シリアル CLI
+
+シリアル CLI は、NVS に保存される実行時設定を最も素早く変更する方法です:
+
+```text
+mimi> wifi_set MySSID MyPassword
+mimi> set_tg_token 123456:ABC...
+mimi> set_api_key your-llm-key
+mimi> set_api_base https://your-compatible-endpoint/v1
+mimi> set_model_provider openai
+mimi> set_model gpt-5.2
+mimi> set_proxy 127.0.0.1 7897
+mimi> clear_proxy
+mimi> set_search_key BSA...
+mimi> set_tavily_key tvly-...
+mimi> config_show
+mimi> config_reset
+```
-| ツール | 説明 |
-|--------|------|
-| `web_search` | Tavily(優先)またはBraveでウェブ検索し、最新情報を取得 |
-| `get_current_time` | HTTP経由で現在の日時を取得し、システムクロックを設定 |
-| `cron_add` | 定期または単発タスクをスケジュール(LLMが自律的にcronジョブを作成) |
-| `cron_list` | スケジュール済みのcronジョブを一覧表示 |
-| `cron_remove` | IDでcronジョブを削除 |
+メンテナンス用コマンド:
+
+```text
+mimi> wifi_status
+mimi> memory_read
+mimi> memory_write "remember this"
+mimi> heap_info
+mimi> session_list
+mimi> session_clear 12345
+mimi> heartbeat_trigger
+mimi> cron_start
+mimi> restart
+```
-ウェブ検索を有効にするには、`mimi_secrets.h`で[Tavily APIキー](https://app.tavily.com/home)(優先、`MIMI_SECRET_TAVILY_KEY`)または[Brave Search APIキー](https://brave.com/search/api/)(`MIMI_SECRET_SEARCH_KEY`)を設定してください。
+## 互換 Provider モデル
-## Cronタスク
+`reSpeaker-claw` は公式の Anthropic と OpenAI のエンドポイントだけに限定されません。
-MimiClawにはcronスケジューラが内蔵されており、AIが自律的にタスクをスケジュールできます。LLMは`cron_add`ツールで定期ジョブ(「N秒ごと」)や単発ジョブ(「UNIXタイムスタンプで指定」)を作成できます。ジョブが発火すると、メッセージがエージェントループに注入され、AIが起動してタスクを処理・応答します。
+対応内容:
-ジョブはSPIFFS(`cron.json`)に永続化され、再起動後も保持されます。活用例:日次サマリー、定期リマインダー、スケジュールチェック。
+- `set_model_provider anthropic` で選択する Anthropic 互換サービス
+- `set_model_provider openai` で選択する OpenAI 互換サービス
+- `set_api_base` で切り替える任意の API base
-## ハートビート
+これにより、agent loop を変更せずに、ローカルゲートウェイ、地域クラウド、統合 API プラットフォームを利用できます。
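+
+たとえば、自前の OpenAI 互換ゲートウェイへ切り替える場合は次のようになります(URL とモデル名は例示用のプレースホルダです):
+
+```text
+mimi> set_model_provider openai
+mimi> set_api_base https://gateway.example.com/v1
+mimi> set_model your-model-name
+```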
-ハートビートサービスはSPIFFS上の`HEARTBEAT.md`を定期的に読み取り、アクション可能なタスクがあるかチェックします。未完了の項目(空行、見出し、チェック済み`- [x]`以外)が見つかると、エージェントループにプロンプトを送信し、AIが自律的に処理します。
+## メモリと自動化
-これによりMimiClawはプロアクティブなアシスタントになります — `HEARTBEAT.md`にタスクを書き込めば、次のハートビートサイクルで自動的に拾い上げて実行します(デフォルト:30分ごと)。
+Agent は SPIFFS 上に状態をプレーンテキストファイルとして保存します:
-## その他の機能
+| ファイル | 用途 |
+|----------|------|
+| `SOUL.md` | アシスタント人格 |
+| `USER.md` | ユーザープロファイル |
+| `MEMORY.md` | 長期記憶 |
+| `HEARTBEAT.md` | 定期実行する自律タスクリスト |
+| `cron.json` | スケジュールジョブ |
+| `tg_12345.jsonl` | セッション履歴 |
-- **WebSocketゲートウェイ** — ポート18789、LAN内から任意のWebSocketクライアントで接続
-- **OTAアップデート** — WiFi経由でファームウェア更新、USB不要
-- **デュアルコア** — ネットワークI/OとAI処理が別々のCPUコアで動作
-- **HTTPプロキシ** — CONNECTトンネル対応、制限付きネットワークに対応
-- **マルチプロバイダー** — Anthropic (Claude) と OpenAI (GPT) の両方をサポート、実行時に切り替え可能
-- **Cronスケジューラ** — AIが定期・単発タスクを自律的にスケジュール、再起動後も永続化
-- **ハートビート** — タスクファイルを定期チェックし、AIを自律的に駆動
-- **ツール呼び出し** — ReActエージェントループ、両プロバイダーでツール呼び出し対応
+組み込みの自動化機能:
-## 開発者向け
+- `cron_add`、`cron_list`、`cron_remove`
+- heartbeat 駆動の能動的タスク処理
+- ReAct loop におけるツール呼び出し
+- 再起動後も保持されるローカル状態
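+
+たとえば、heartbeat に拾わせる `HEARTBEAT.md` は次のように書けます(内容は例示です。見出しとチェック済み項目はスキップされます):
+
+```markdown
+# Heartbeat Tasks
+- [ ] 昨日の日次メモを要約する
+- [x] チェック済みの項目はスキップされる
+```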
-技術的な詳細は`docs/`フォルダにあります:
+## ツール
-- **[docs/ARCHITECTURE.md](docs/ARCHITECTURE.md)** — システム設計、モジュール構成、タスクレイアウト、メモリバジェット、プロトコル、Flashパーティション
-- **[docs/TODO.md](docs/TODO.md)** — 機能ギャップとロードマップ
-- **[docs/im-integration/](docs/im-integration/README.md)** — IMチャネル統合ガイド(Feishuなど)
+組み込みツール:
-## 貢献
+- `web_search`
+- `get_current_time`
+- `cron_add`
+- `cron_list`
+- `cron_remove`
+- Agent ランタイムが使う SPIFFS ファイル操作ツール
-Issue や Pull Request を作成する前に、**[CONTRIBUTING.md](CONTRIBUTING.md)** をご確認ください。
+Web 検索を有効にするには、次のいずれかを設定します:
-## コントリビューター
+- `MIMI_SECRET_TAVILY_KEY`
+- `MIMI_SECRET_SEARCH_KEY`
-MimiClaw に貢献してくれた皆さんに感謝します。
+## 謝辞
-
-
-
+本プロジェクトは元の [mimiclaw](https://github.com/memovai/mimiclaw) を基盤としています。reSpeaker-claw は、その組み込み agent 基盤を ReSpeaker XVF3800 の音声ハードウェア向けに適応し、STT / TTS パイプラインを拡張しつつ、マルチチャネル agent アーキテクチャを継承しています。
## ライセンス
MIT
-
-## 謝辞
-
-[OpenClaw](https://github.com/openclaw/openclaw)と[Nanobot](https://github.com/HKUDS/nanobot)にインスパイアされました。MimiClawはコアAIエージェントアーキテクチャを組み込みハードウェア向けに再実装しました — Linuxなし、サーバーなし、$5のチップだけ。
-
-## Star History
-
-
-
-
-
-
-
-
diff --git a/assets/banner.png b/assets/banner.png
deleted file mode 100644
index c3cc4255..00000000
Binary files a/assets/banner.png and /dev/null differ
diff --git a/assets/esp32s3-usb-port.jpg b/assets/esp32s3-usb-port.jpg
deleted file mode 100644
index 706f6107..00000000
Binary files a/assets/esp32s3-usb-port.jpg and /dev/null differ
diff --git a/assets/mimiclaw.png b/assets/mimiclaw.png
deleted file mode 100644
index e22246e7..00000000
Binary files a/assets/mimiclaw.png and /dev/null differ
diff --git a/main/CMakeLists.txt b/main/CMakeLists.txt
index 5f3fe1ea..47afb9ec 100644
--- a/main/CMakeLists.txt
+++ b/main/CMakeLists.txt
@@ -21,10 +21,11 @@ idf_component_register(
"tools/tool_get_time.c"
"tools/tool_files.c"
"skills/skill_loader.c"
+ "voice/voice_channel.c"
INCLUDE_DIRS
"."
REQUIRES
nvs_flash esp_wifi esp_netif esp_http_client esp_http_server
esp_https_ota esp_event json spiffs console vfs app_update esp-tls
- esp_timer esp_websocket_client
+ esp_timer esp_websocket_client driver
)
diff --git a/main/agent/agent_loop.c b/main/agent/agent_loop.c
index 7e5eae64..2a513078 100644
--- a/main/agent/agent_loop.c
+++ b/main/agent/agent_loop.c
@@ -86,6 +86,30 @@ static void append_turn_context_prompt(char *prompt, size_t size, const mimi_msg
if (n < 0 || (size_t)n >= (size - off)) {
prompt[size - 1] = '\0';
}
+
+ if (msg->channel[0] && strcmp(msg->channel, MIMI_CHAN_VOICE) == 0) {
+ off = strnlen(prompt, size - 1);
+ if (off >= size - 1) {
+ return;
+ }
+
+ n = snprintf(
+ prompt + off, size - off,
+ "\n## Voice Output Constraints\n"
+ "This reply will be converted to speech (TTS) and played on a small speaker.\n"
+ "- Use English, natural spoken style.\n"
+ "- Keep it short: keep playback within ~%d seconds.\n"
+ "- Structure: at most 2 sentences + 1 short follow-up question.\n"
+ "- Length: <= %d characters total.\n"
+ "- No markdown, no lists, no code blocks, no URLs.\n"
+ "- Avoid long explanations; if the answer is long, give a 1–2 sentence summary and ask if the user wants more.\n",
+ (int)MIMI_VOICE_TTS_MAX_SECONDS,
+ (int)MIMI_VOICE_LLM_MAX_CHARS);
+
+ if (n < 0 || (size_t)n >= (size - off)) {
+ prompt[size - 1] = '\0';
+ }
+ }
}
static char *patch_tool_input_with_context(const llm_tool_call_t *call, const mimi_msg_t *msg)
@@ -218,7 +242,7 @@ static void agent_loop_task(void *arg)
while (iteration < MIMI_AGENT_MAX_TOOL_ITER) {
/* Send "working" indicator before each API call */
#if MIMI_AGENT_SEND_WORKING_STATUS
- if (!sent_working_status && strcmp(msg.channel, MIMI_CHAN_SYSTEM) != 0) {
+ if (!sent_working_status && strcmp(msg.channel, MIMI_CHAN_SYSTEM) != 0 && strcmp(msg.channel, MIMI_CHAN_VOICE) != 0) {
mimi_msg_t status = {0};
strncpy(status.channel, msg.channel, sizeof(status.channel) - 1);
strncpy(status.chat_id, msg.chat_id, sizeof(status.chat_id) - 1);
diff --git a/main/bus/message_bus.h b/main/bus/message_bus.h
index 1fc2d31d..b0fe8edb 100644
--- a/main/bus/message_bus.h
+++ b/main/bus/message_bus.h
@@ -10,6 +10,7 @@
#define MIMI_CHAN_WEBSOCKET "websocket"
#define MIMI_CHAN_CLI "cli"
#define MIMI_CHAN_SYSTEM "system"
+#define MIMI_CHAN_VOICE "voice"
/* Message types on the bus */
typedef struct {
diff --git a/main/cli/serial_cli.c b/main/cli/serial_cli.c
index 4968ff7d..904d90d4 100644
--- a/main/cli/serial_cli.c
+++ b/main/cli/serial_cli.c
@@ -123,6 +123,12 @@ static struct {
struct arg_end *end;
} api_key_args;
+/* --- set_api_base command --- */
+static struct {
+ struct arg_str *base;
+ struct arg_end *end;
+} api_base_args;
+
static int cmd_set_api_key(int argc, char **argv)
{
int nerrors = arg_parse(argc, argv, (void **)&api_key_args);
@@ -135,6 +141,18 @@ static int cmd_set_api_key(int argc, char **argv)
return 0;
}
+static int cmd_set_api_base(int argc, char **argv)
+{
+ int nerrors = arg_parse(argc, argv, (void **)&api_base_args);
+ if (nerrors != 0) {
+ arg_print_errors(stderr, api_base_args.end, argv[0]);
+ return 1;
+ }
+ llm_set_api_base(api_base_args.base->sval[0]);
+ printf("API base set.\n");
+ return 0;
+}
+
/* --- set_model command --- */
static struct {
struct arg_str *model;
@@ -535,6 +553,7 @@ static int cmd_config_show(int argc, char **argv)
print_config("WiFi Pass", MIMI_NVS_WIFI, MIMI_NVS_KEY_PASS, MIMI_SECRET_WIFI_PASS, true);
print_config("TG Token", MIMI_NVS_TG, MIMI_NVS_KEY_TG_TOKEN, MIMI_SECRET_TG_TOKEN, true);
print_config("API Key", MIMI_NVS_LLM, MIMI_NVS_KEY_API_KEY, MIMI_SECRET_API_KEY, true);
+ print_config("API Base", MIMI_NVS_LLM, MIMI_NVS_KEY_API_BASE, MIMI_SECRET_API_BASE, false);
print_config("Model", MIMI_NVS_LLM, MIMI_NVS_KEY_MODEL, MIMI_SECRET_MODEL, false);
print_config("Provider", MIMI_NVS_LLM, MIMI_NVS_KEY_PROVIDER, MIMI_SECRET_MODEL_PROVIDER, false);
print_config("Proxy Host", MIMI_NVS_PROXY, MIMI_NVS_KEY_PROXY_HOST, MIMI_SECRET_PROXY_HOST, false);
@@ -849,6 +868,17 @@ esp_err_t serial_cli_init(void)
};
esp_console_cmd_register(&api_key_cmd);
+ /* set_api_base */
+    api_base_args.base = arg_str1(NULL, NULL, "<base>", "LLM API base (http(s)://host[:port][/path])");
+ api_base_args.end = arg_end(1);
+ esp_console_cmd_t api_base_cmd = {
+ .command = "set_api_base",
+ .help = "Set LLM API base (e.g. https://api.anthropic.com/v1)",
+ .func = &cmd_set_api_base,
+ .argtable = &api_base_args,
+ };
+ esp_console_cmd_register(&api_base_cmd);
+
/* set_model */
model_args.model = arg_str1(NULL, NULL, "", "Model identifier");
model_args.end = arg_end(1);
@@ -1054,4 +1084,4 @@ esp_err_t serial_cli_init(void)
ESP_LOGI(TAG, "Serial CLI started");
return ESP_OK;
 }
diff --git a/main/llm/llm_proxy.c b/main/llm/llm_proxy.c
index c6fa1b88..adcd74d5 100644
--- a/main/llm/llm_proxy.c
+++ b/main/llm/llm_proxy.c
@@ -15,12 +15,37 @@ static const char *TAG = "llm";
#define LLM_API_KEY_MAX_LEN 320
#define LLM_MODEL_MAX_LEN 64
+#define LLM_API_BASE_MAX_LEN 256
+#define LLM_HOST_MAX_LEN 128
+#define LLM_PATH_MAX_LEN 128
#define LLM_DUMP_MAX_BYTES (16 * 1024)
#define LLM_DUMP_CHUNK_BYTES 320
static char s_api_key[LLM_API_KEY_MAX_LEN] = {0};
static char s_model[LLM_MODEL_MAX_LEN] = MIMI_LLM_DEFAULT_MODEL;
+static char s_model_id[LLM_MODEL_MAX_LEN] = {0};
static char s_provider[16] = MIMI_LLM_PROVIDER_DEFAULT;
+static char s_api_base[LLM_API_BASE_MAX_LEN] = {0};
+
+typedef enum {
+ LLM_PROTOCOL_ANTHROPIC = 0,
+ LLM_PROTOCOL_OPENAI = 1,
+} llm_protocol_t;
+
+static llm_protocol_t s_protocol = LLM_PROTOCOL_ANTHROPIC;
+static bool s_api_tls = true;
+static char s_api_host[LLM_HOST_MAX_LEN] = {0};
+static uint16_t s_api_port = 443;
+static char s_api_base_path[LLM_PATH_MAX_LEN] = {0};
+static char s_api_req_path[LLM_PATH_MAX_LEN + 32] = {0};
+static char s_api_host_header[LLM_HOST_MAX_LEN + 8] = {0};
+static char s_api_url[LLM_API_BASE_MAX_LEN + 64] = {0};
+static bool s_logged_proxy_bypass_warning = false;
+
+static const char *llm_protocol_name(llm_protocol_t p)
+{
+ return (p == LLM_PROTOCOL_OPENAI) ? "openai" : "anthropic";
+}
static void llm_log_payload(const char *label, const char *payload)
{
@@ -180,29 +205,157 @@ static esp_err_t http_event_handler(esp_http_client_event_t *evt)
return ESP_OK;
}
-/* ── Provider helpers ──────────────────────────────────────────── */
+/* ── Protocol config ─────────────────────────────────────────── */
-static bool provider_is_openai(void)
-{
- return strcmp(s_provider, "openai") == 0;
+typedef struct {
+ llm_protocol_t protocol;
+ const char *label; /* "openai" */
+ const char *prefix; /* "openai/" */
+ const char *suffix; /* "/chat/completions" */
+ const char *base; /* Default API base */
+} llm_proto_cfg_t;
+
+static const llm_proto_cfg_t PROTO_MAP[] = {
+ {LLM_PROTOCOL_OPENAI, "openai", "openai/", "/chat/completions", MIMI_LLM_API_BASE_OPENAI},
+ {LLM_PROTOCOL_ANTHROPIC, "anthropic", "anthropic/", "/messages", MIMI_LLM_API_BASE_ANTHROPIC}
+};
+
+static const llm_proto_cfg_t* get_current_proto(void) {
+ return &PROTO_MAP[s_protocol == LLM_PROTOCOL_OPENAI ? 0 : 1];
}
-static const char *llm_api_url(void)
-{
- return provider_is_openai() ? MIMI_OPENAI_API_URL : MIMI_LLM_API_URL;
+/* ── Helpers ─────────────────────────────────────────────────── */
+
+static bool llm_protocol_is_openai(void) {
+ return s_protocol == LLM_PROTOCOL_OPENAI;
}
-static const char *llm_api_host(void)
-{
- return provider_is_openai() ? "api.openai.com" : "api.anthropic.com";
+/* Validate api_base format without modifying global state */
+static esp_err_t llm_validate_api_base(const char *api_base) {
+ if (!api_base || api_base[0] == '\0') return ESP_ERR_INVALID_ARG;
+
+ /* Check for valid scheme */
+ const char *p;
+ if (strncmp(api_base, "https://", 8) == 0) {
+ p = api_base + 8;
+ } else if (strncmp(api_base, "http://", 7) == 0) {
+ p = api_base + 7;
+ } else {
+ return ESP_ERR_INVALID_ARG;
+ }
+
+ /* Basic format validation - ensure there's content after the scheme */
+ if (p[0] == '\0' || p[0] == '/' || p[0] == ':') {
+ return ESP_ERR_INVALID_ARG;
+ }
+
+ /* Check for valid host part (before colon or slash) */
+ const char *slash = strchr(p, '/');
+ const char *colon = strchr(p, ':');
+ if (colon && slash && colon > slash) colon = NULL; /* Colon is part of path */
+
+ const char *host_end = colon ? colon : (slash ? slash : p + strlen(p));
+ if (host_end == p) return ESP_ERR_INVALID_ARG; /* Empty host */
+
+ /* Validate port if present */
+ if (colon) {
+ char *endptr;
+ long port = strtol(colon + 1, &endptr, 10);
+ if (endptr == colon + 1 || (*endptr != '\0' && *endptr != '/') ||
+ port < 1 || port > 65535) {
+ return ESP_ERR_INVALID_ARG;
+ }
+ }
+
+ return ESP_OK;
}
-static const char *llm_api_path(void)
-{
- return provider_is_openai() ? "/v1/chat/completions" : "/v1/messages";
+/* Parse api_base: scheme (http/https), host[:port], optional base path. */
+static esp_err_t llm_parse_api_base(const char *api_base) {
+ if (!api_base || api_base[0] == '\0') return ESP_ERR_INVALID_ARG;
+
+ const char *p;
+ if (strncmp(api_base, "https://", 8) == 0) {
+ s_api_tls = true; p = api_base + 8; s_api_port = 443;
+ } else if (strncmp(api_base, "http://", 7) == 0) {
+ s_api_tls = false; p = api_base + 7; s_api_port = 80;
+ } else return ESP_ERR_INVALID_ARG;
+
+ const char *slash = strchr(p, '/');
+ const char *colon = strchr(p, ':');
+ if (colon && slash && colon > slash) colon = NULL; /* Colon is part of path */
+
+ const char *host_end = colon ? colon : (slash ? slash : p + strlen(p));
+ snprintf(s_api_host, sizeof(s_api_host), "%.*s", (int)(host_end - p), p);
+
+ if (colon) {
+ char *endptr;
+ long port = strtol(colon + 1, &endptr, 10);
+ if (endptr != colon + 1 && (*endptr == '\0' || *endptr == '/') &&
+ port >= 1 && port <= 65535) {
+ s_api_port = (uint16_t)port;
+ }
+ /* If port parsing fails, keep the default port (443 for HTTPS, 80 for HTTP) */
+ }
+
+ s_api_base_path[0] = '\0';
+ if (slash) {
+ safe_copy(s_api_base_path, sizeof(s_api_base_path), slash);
+ size_t len = strlen(s_api_base_path);
+ while (len > 0 && s_api_base_path[len - 1] == '/') s_api_base_path[--len] = '\0';
+ }
+ return ESP_OK;
+}
+
+/* Build derived request path, Host header, and full URL strings. */
+static void llm_build_request_targets(void) {
+ const llm_proto_cfg_t *cfg = get_current_proto();
+
+ snprintf(s_api_req_path, sizeof(s_api_req_path), "%s%s", s_api_base_path, cfg->suffix);
+ if (s_api_req_path[0] == '\0') strcpy(s_api_req_path, "/");
+
+ bool is_std = (s_api_tls && s_api_port == 443) || (!s_api_tls && s_api_port == 80);
+ if (is_std) {
+ snprintf(s_api_host_header, sizeof(s_api_host_header), "%s", s_api_host);
+ } else {
+ snprintf(s_api_host_header, sizeof(s_api_host_header), "%s:%u", s_api_host, s_api_port);
+ }
+
+ snprintf(s_api_url, sizeof(s_api_url), "%s://%s%s",
+ s_api_tls ? "https" : "http", s_api_host_header, s_api_req_path);
}
-/* ── Init ─────────────────────────────────────────────────────── */
+/* ── Derived config ──────────────────────────────────────────── */
+
+static void llm_recompute_effective_config(void) {
+ /* Determine protocol + model_id (prefix overrides provider), and update request targets. */
+ s_logged_proxy_bypass_warning = false; /* Reset warning flag when config changes */
+ s_protocol = (strcmp(s_provider, "openai") == 0) ? LLM_PROTOCOL_OPENAI : LLM_PROTOCOL_ANTHROPIC;
+ const char *model_id = s_model;
+
+ for (int i = 0; i < 2; i++) {
+ size_t len = strlen(PROTO_MAP[i].prefix);
+ if (strncmp(s_model, PROTO_MAP[i].prefix, len) == 0 && s_model[len] != '\0') {
+ s_protocol = PROTO_MAP[i].protocol;
+ model_id = s_model + len;
+ break;
+ }
+ }
+ safe_copy(s_model_id, sizeof(s_model_id), model_id);
+
+ const char *default_base = get_current_proto()->base;
+ const char *base = (s_api_base[0] != '\0') ? s_api_base : default_base;
+
+ if (llm_parse_api_base(base) != ESP_OK) {
+ ESP_LOGE(TAG, "Failed to parse API base: %s. Using default.", base);
+ llm_parse_api_base(default_base);
+ }
+
+ llm_build_request_targets();
+
+ ESP_LOGI(TAG, "Configured: Protocol=%s, Model=%s, URL=%s",
+ get_current_proto()->label, s_model_id, s_api_url);
+}
esp_err_t llm_proxy_init(void)
{
@@ -210,6 +363,9 @@ esp_err_t llm_proxy_init(void)
if (MIMI_SECRET_API_KEY[0] != '\0') {
safe_copy(s_api_key, sizeof(s_api_key), MIMI_SECRET_API_KEY);
}
+ if (MIMI_SECRET_API_BASE[0] != '\0') {
+ safe_copy(s_api_base, sizeof(s_api_base), MIMI_SECRET_API_BASE);
+ }
if (MIMI_SECRET_MODEL[0] != '\0') {
safe_copy(s_model, sizeof(s_model), MIMI_SECRET_MODEL);
}
@@ -225,6 +381,11 @@ esp_err_t llm_proxy_init(void)
if (nvs_get_str(nvs, MIMI_NVS_KEY_API_KEY, tmp, &len) == ESP_OK && tmp[0]) {
safe_copy(s_api_key, sizeof(s_api_key), tmp);
}
+ char base_tmp[LLM_API_BASE_MAX_LEN] = {0};
+ len = sizeof(base_tmp);
+ if (nvs_get_str(nvs, MIMI_NVS_KEY_API_BASE, base_tmp, &len) == ESP_OK && base_tmp[0]) {
+ safe_copy(s_api_base, sizeof(s_api_base), base_tmp);
+ }
char model_tmp[LLM_MODEL_MAX_LEN] = {0};
len = sizeof(model_tmp);
if (nvs_get_str(nvs, MIMI_NVS_KEY_MODEL, model_tmp, &len) == ESP_OK && model_tmp[0]) {
@@ -238,9 +399,9 @@ esp_err_t llm_proxy_init(void)
nvs_close(nvs);
}
- if (s_api_key[0]) {
- ESP_LOGI(TAG, "LLM proxy initialized (provider: %s, model: %s)", s_provider, s_model);
- } else {
+ llm_recompute_effective_config();
+
+ if (s_api_key[0] == '\0') {
ESP_LOGW(TAG, "No API key. Use CLI: set_api_key ");
}
return ESP_OK;
@@ -251,7 +412,7 @@ esp_err_t llm_proxy_init(void)
static esp_err_t llm_http_direct(const char *post_data, resp_buf_t *rb, int *out_status)
{
esp_http_client_config_t config = {
- .url = llm_api_url(),
+ .url = s_api_url,
.event_handler = http_event_handler,
.user_data = rb,
.timeout_ms = 120 * 1000,
@@ -265,14 +426,16 @@ static esp_err_t llm_http_direct(const char *post_data, resp_buf_t *rb, int *out
esp_http_client_set_method(client, HTTP_METHOD_POST);
esp_http_client_set_header(client, "Content-Type", "application/json");
- if (provider_is_openai()) {
+ if (llm_protocol_is_openai()) {
if (s_api_key[0]) {
char auth[LLM_API_KEY_MAX_LEN + 16];
snprintf(auth, sizeof(auth), "Bearer %s", s_api_key);
esp_http_client_set_header(client, "Authorization", auth);
}
} else {
- esp_http_client_set_header(client, "x-api-key", s_api_key);
+ if (s_api_key[0] != '\0') {
+ esp_http_client_set_header(client, "x-api-key", s_api_key);
+ }
esp_http_client_set_header(client, "anthropic-version", MIMI_LLM_API_VERSION);
}
esp_http_client_set_post_field(client, post_data, strlen(post_data));
@@ -287,80 +450,71 @@ static esp_err_t llm_http_direct(const char *post_data, resp_buf_t *rb, int *out
static esp_err_t llm_http_via_proxy(const char *post_data, resp_buf_t *rb, int *out_status)
{
- proxy_conn_t *conn = proxy_conn_open(llm_api_host(), 443, 30000);
+ proxy_conn_t *conn = proxy_conn_open(s_api_host, s_api_port, 30000);
if (!conn) return ESP_ERR_HTTP_CONNECT;
- int body_len = strlen(post_data);
- char header[1024];
- int hlen = 0;
- if (provider_is_openai()) {
- hlen = snprintf(header, sizeof(header),
- "POST %s HTTP/1.1\r\n"
- "Host: %s\r\n"
- "Content-Type: application/json\r\n"
- "Authorization: Bearer %s\r\n"
- "Content-Length: %d\r\n"
- "Connection: close\r\n\r\n",
- llm_api_path(), llm_api_host(), s_api_key, body_len);
+ /* Build request headers */
+ char h[1024];
+ int off = snprintf(h, sizeof(h), "POST %s HTTP/1.1\r\nHost: %s\r\nContent-Type: application/json\r\n",
+ s_api_req_path, s_api_host_header);
+
+ if (llm_protocol_is_openai()) {
+ if (s_api_key[0] != '\0') {
+ off += snprintf(h + off, sizeof(h) - off, "Authorization: Bearer %s\r\n", s_api_key);
+ }
} else {
- hlen = snprintf(header, sizeof(header),
- "POST %s HTTP/1.1\r\n"
- "Host: %s\r\n"
- "Content-Type: application/json\r\n"
- "x-api-key: %s\r\n"
- "anthropic-version: %s\r\n"
- "Content-Length: %d\r\n"
- "Connection: close\r\n\r\n",
- llm_api_path(), llm_api_host(), s_api_key, MIMI_LLM_API_VERSION, body_len);
- }
-
- if (proxy_conn_write(conn, header, hlen) < 0 ||
- proxy_conn_write(conn, post_data, body_len) < 0) {
+ if (s_api_key[0] != '\0') {
+ off += snprintf(h + off, sizeof(h) - off, "x-api-key: %s\r\n", s_api_key);
+ }
+ off += snprintf(h + off, sizeof(h) - off, "anthropic-version: %s\r\n", MIMI_LLM_API_VERSION);
+ }
+
+ off += snprintf(h + off, sizeof(h) - off, "Content-Length: %zu\r\nConnection: close\r\n\r\n", strlen(post_data));
+
+ /* Send */
+ if (off >= sizeof(h) || proxy_conn_write(conn, h, off) < 0 ||
+ proxy_conn_write(conn, post_data, strlen(post_data)) < 0) {
proxy_conn_close(conn);
return ESP_ERR_HTTP_WRITE_DATA;
}
- /* Read full response into buffer */
- char tmp[4096];
- while (1) {
- int n = proxy_conn_read(conn, tmp, sizeof(tmp), 120000);
- if (n <= 0) break;
+ /* Receive full response */
+ char tmp[1024];
+ int n;
+ while ((n = proxy_conn_read(conn, tmp, sizeof(tmp), 120000)) > 0) {
if (resp_buf_append(rb, tmp, n) != ESP_OK) break;
+ vTaskDelay(pdMS_TO_TICKS(1));
}
proxy_conn_close(conn);
- /* Parse status line */
- *out_status = 0;
- if (rb->len > 5 && strncmp(rb->data, "HTTP/", 5) == 0) {
- const char *sp = strchr(rb->data, ' ');
- if (sp) *out_status = atoi(sp + 1);
- }
+ /* Parse status */
+ *out_status = (rb->len > 12 && strncmp(rb->data, "HTTP/", 5) == 0) ? atoi(rb->data + 9) : 0;
- /* Strip HTTP headers, keep body only */
+ /* Strip headers */
char *body = strstr(rb->data, "\r\n\r\n");
if (body) {
body += 4;
- size_t blen = rb->len - (body - rb->data);
- memmove(rb->data, body, blen);
- rb->len = blen;
+ rb->len -= (body - rb->data);
+ memmove(rb->data, body, rb->len);
rb->data[rb->len] = '\0';
}
- /* Decode chunked transfer encoding if present */
resp_buf_decode_chunked(rb);
-
return ESP_OK;
}
-/* ── Shared HTTP dispatch ─────────────────────────────────────── */
-
static esp_err_t llm_http_call(const char *post_data, resp_buf_t *rb, int *out_status)
{
if (http_proxy_is_enabled()) {
- return llm_http_via_proxy(post_data, rb, out_status);
- } else {
- return llm_http_direct(post_data, rb, out_status);
+ if (s_api_tls) {
+ return llm_http_via_proxy(post_data, rb, out_status);
+ }
+ if (!s_logged_proxy_bypass_warning) {
+ ESP_LOGW(TAG, "Proxy configured but api_base is http; bypassing proxy");
+ s_logged_proxy_bypass_warning = true;
+ }
}
+ return llm_http_direct(post_data, rb, out_status);
}
static cJSON *convert_tools_openai(const char *tools_json)
@@ -554,18 +708,16 @@ esp_err_t llm_chat_tools(const char *system_prompt,
{
memset(resp, 0, sizeof(*resp));
- if (s_api_key[0] == '\0') return ESP_ERR_INVALID_STATE;
-
/* Build request body (non-streaming) */
cJSON *body = cJSON_CreateObject();
- cJSON_AddStringToObject(body, "model", s_model);
- if (provider_is_openai()) {
+ cJSON_AddStringToObject(body, "model", s_model_id);
+ if (strncasecmp(s_model_id, "gpt-5", 5) == 0 || strncasecmp(s_model_id, "o1", 2) == 0) {
cJSON_AddNumberToObject(body, "max_completion_tokens", MIMI_LLM_MAX_TOKENS);
} else {
cJSON_AddNumberToObject(body, "max_tokens", MIMI_LLM_MAX_TOKENS);
}
- if (provider_is_openai()) {
+ if (llm_protocol_is_openai()) {
cJSON *openai_msgs = convert_messages_openai(system_prompt, messages);
cJSON_AddItemToObject(body, "messages", openai_msgs);
@@ -596,8 +748,8 @@ esp_err_t llm_chat_tools(const char *system_prompt,
cJSON_Delete(body);
if (!post_data) return ESP_ERR_NO_MEM;
- ESP_LOGI(TAG, "Calling LLM API with tools (provider: %s, model: %s, body: %d bytes)",
- s_provider, s_model, (int)strlen(post_data));
+ ESP_LOGI(TAG, "Calling LLM API with tools (protocol: %s, model: %s, body: %d bytes)",
+ llm_protocol_name(s_protocol), s_model_id, (int)strlen(post_data));
llm_log_payload("LLM tools request", post_data);
/* HTTP call */
@@ -635,7 +787,7 @@ esp_err_t llm_chat_tools(const char *system_prompt,
return ESP_FAIL;
}
- if (provider_is_openai()) {
+ if (llm_protocol_is_openai()) {
cJSON *choices = cJSON_GetObjectItem(root, "choices");
cJSON *choice0 = choices && cJSON_IsArray(choices) ? cJSON_GetArrayItem(choices, 0) : NULL;
if (choice0) {
@@ -784,6 +936,27 @@ esp_err_t llm_set_api_key(const char *api_key)
return ESP_OK;
}
+esp_err_t llm_set_api_base(const char *api_base)
+{
+ /* Validate before persisting - use validation-only function */
+ esp_err_t err = llm_validate_api_base(api_base);
+ if (err != ESP_OK) {
+ ESP_LOGE(TAG, "Invalid API base format: %s", api_base ? api_base : "");
+ return err;
+ }
+
+ nvs_handle_t nvs;
+ ESP_ERROR_CHECK(nvs_open(MIMI_NVS_LLM, NVS_READWRITE, &nvs));
+ ESP_ERROR_CHECK(nvs_set_str(nvs, MIMI_NVS_KEY_API_BASE, api_base));
+ ESP_ERROR_CHECK(nvs_commit(nvs));
+ nvs_close(nvs);
+
+ safe_copy(s_api_base, sizeof(s_api_base), api_base);
+ llm_recompute_effective_config();
+ ESP_LOGI(TAG, "API base set");
+ return ESP_OK;
+}
+
esp_err_t llm_set_model(const char *model)
{
nvs_handle_t nvs;
@@ -793,6 +966,7 @@ esp_err_t llm_set_model(const char *model)
nvs_close(nvs);
safe_copy(s_model, sizeof(s_model), model);
+ llm_recompute_effective_config();
ESP_LOGI(TAG, "Model set to: %s", s_model);
return ESP_OK;
}
@@ -806,6 +980,7 @@ esp_err_t llm_set_provider(const char *provider)
nvs_close(nvs);
safe_copy(s_provider, sizeof(s_provider), provider);
+ llm_recompute_effective_config();
ESP_LOGI(TAG, "Provider set to: %s", s_provider);
return ESP_OK;
-}
+}
\ No newline at end of file
diff --git a/main/llm/llm_proxy.h b/main/llm/llm_proxy.h
index b667f624..7d333b84 100644
--- a/main/llm/llm_proxy.h
+++ b/main/llm/llm_proxy.h
@@ -17,6 +17,19 @@ esp_err_t llm_proxy_init(void);
*/
esp_err_t llm_set_api_key(const char *api_key);
+/**
+ * Save the LLM API base URL to NVS.
+ *
+ * Expected format: http(s)://host[:port][/path]
+ * Examples:
+ * - https://api.anthropic.com/v1
+ * - https://api.openai.com/v1
+ * - http://localhost:11434/v1
+ * - https://api.minimaxi.com/anthropic/v1
+ * - https://open.bigmodel.cn/api/paas/v4
+ */
+esp_err_t llm_set_api_base(const char *api_base);
+
/**
* Save the LLM provider to NVS. (e.g. "anthropic", "openai")
*/
@@ -58,4 +71,4 @@ void llm_response_free(llm_response_t *resp);
esp_err_t llm_chat_tools(const char *system_prompt,
cJSON *messages,
const char *tools_json,
- llm_response_t *resp);
+ llm_response_t *resp);
\ No newline at end of file
diff --git a/main/mimi.c b/main/mimi.c
index 0e8e8fa7..430b9b88 100644
--- a/main/mimi.c
+++ b/main/mimi.c
@@ -25,6 +25,7 @@
#include "cron/cron_service.h"
#include "heartbeat/heartbeat.h"
#include "skills/skill_loader.h"
+#include "voice/voice_channel.h"
static const char *TAG = "mimi";
@@ -60,41 +61,73 @@ static esp_err_t init_spiffs(void)
return ESP_OK;
}
-
+static void voice_speak_task(void *arg)
+{
+ char *text = (char *)arg;
+ if (text) {
+ esp_err_t err = voice_channel_speak_text(text);
+ if (err != ESP_OK) {
+ ESP_LOGW(TAG, "Voice playback failed: %s", esp_err_to_name(err));
+ }
+ free(text);
+ }
+ vTaskDelete(NULL);
+}
/* Outbound dispatch task: reads from outbound queue and routes to channels */
static void outbound_dispatch_task(void *arg)
{
- ESP_LOGI(TAG, "Outbound dispatch started");
+ (void)arg;
+ ESP_LOGI(TAG, "Outbound dispatch started on core %d", xPortGetCoreID());
while (1) {
- mimi_msg_t msg;
- if (message_bus_pop_outbound(&msg, UINT32_MAX) != ESP_OK) continue;
+ mimi_msg_t msg = {0};
+ if (message_bus_pop_outbound(&msg, UINT32_MAX) != ESP_OK) {
+ continue;
+ }
- ESP_LOGI(TAG, "Dispatching response to %s:%s", msg.channel, msg.chat_id);
+
+ ESP_LOGI(TAG, "Dispatching response to %s:%s",
+ msg.channel[0] ? msg.channel : "(unknown)",
+ msg.chat_id[0] ? msg.chat_id : "(empty)");
+ if (!msg.content || !msg.content[0]) {
+ free(msg.content);
+ continue;
+ }
if (strcmp(msg.channel, MIMI_CHAN_TELEGRAM) == 0) {
- esp_err_t send_err = telegram_send_message(msg.chat_id, msg.content);
- if (send_err != ESP_OK) {
- ESP_LOGE(TAG, "Telegram send failed for %s: %s", msg.chat_id, esp_err_to_name(send_err));
- } else {
- ESP_LOGI(TAG, "Telegram send success for %s (%d bytes)", msg.chat_id, (int)strlen(msg.content));
- }
+ telegram_send_message(msg.chat_id, msg.content);
+
} else if (strcmp(msg.channel, MIMI_CHAN_FEISHU) == 0) {
- esp_err_t send_err = feishu_send_message(msg.chat_id, msg.content);
- if (send_err != ESP_OK) {
- ESP_LOGE(TAG, "Feishu send failed for %s: %s", msg.chat_id, esp_err_to_name(send_err));
- } else {
- ESP_LOGI(TAG, "Feishu send success for %s (%d bytes)", msg.chat_id, (int)strlen(msg.content));
- }
+ feishu_send_message(msg.chat_id, msg.content);
+
} else if (strcmp(msg.channel, MIMI_CHAN_WEBSOCKET) == 0) {
- esp_err_t ws_err = ws_server_send(msg.chat_id, msg.content);
- if (ws_err != ESP_OK) {
- ESP_LOGW(TAG, "WS send failed for %s: %s", msg.chat_id, esp_err_to_name(ws_err));
+ ws_server_send(msg.chat_id, msg.content);
+
+ } else if (strcmp(msg.channel, MIMI_CHAN_VOICE) == 0) {
+ char *copy = strdup(msg.content);
+ if (!copy) {
+ ESP_LOGW(TAG, "No memory for voice speak task");
+ } else {
+ BaseType_t ok = xTaskCreatePinnedToCore(
+ voice_speak_task,
+ "voice_speak",
+ MIMI_VOICE_SPEAK_STACK,
+ copy,
+ MIMI_VOICE_SPEAK_PRIO,
+ NULL,
+ MIMI_VOICE_SPEAK_CORE
+ );
+ if (ok != pdPASS) {
+ ESP_LOGW(TAG, "Failed to create voice_speak task");
+ free(copy);
+ }
}
- } else if (strcmp(msg.channel, MIMI_CHAN_SYSTEM) == 0) {
- ESP_LOGI(TAG, "System message [%s]: %.128s", msg.chat_id, msg.content);
+
+ } else if (strcmp(msg.channel, MIMI_CHAN_CLI) == 0) {
+ printf("\n%s\n", msg.content);
+
} else {
- ESP_LOGW(TAG, "Unknown channel: %s", msg.channel);
+ ESP_LOGW(TAG, "Unknown outbound channel: %s", msg.channel);
}
free(msg.content);
@@ -134,6 +167,7 @@ void app_main(void)
ESP_ERROR_CHECK(tool_registry_init());
ESP_ERROR_CHECK(cron_service_init());
ESP_ERROR_CHECK(heartbeat_init());
+ ESP_ERROR_CHECK(voice_channel_init());
ESP_ERROR_CHECK(agent_loop_init());
/* Start Serial CLI first (works without WiFi) */
@@ -161,6 +195,7 @@ void app_main(void)
ESP_ERROR_CHECK(feishu_bot_start());
cron_service_start();
heartbeat_start();
+ voice_channel_start();
ESP_ERROR_CHECK(ws_server_start());
ESP_LOGI(TAG, "All services started!");
diff --git a/main/mimi_config.h b/main/mimi_config.h
index 9be7c087..68f99369 100644
--- a/main/mimi_config.h
+++ b/main/mimi_config.h
@@ -19,6 +19,9 @@
#ifndef MIMI_SECRET_API_KEY
#define MIMI_SECRET_API_KEY ""
#endif
+#ifndef MIMI_SECRET_LLM_API_URL
+#define MIMI_SECRET_LLM_API_URL ""
+#endif
#ifndef MIMI_SECRET_MODEL
#define MIMI_SECRET_MODEL ""
#endif
@@ -46,6 +49,36 @@
#ifndef MIMI_SECRET_TAVILY_KEY
#define MIMI_SECRET_TAVILY_KEY ""
#endif
+#ifndef MIMI_SECRET_STT_URL
+#define MIMI_SECRET_STT_URL ""
+#endif
+#ifndef MIMI_SECRET_STT_API_KEY
+#define MIMI_SECRET_STT_API_KEY ""
+#endif
+#ifndef MIMI_SECRET_STT_MODEL
+#define MIMI_SECRET_STT_MODEL ""
+#endif
+#ifndef MIMI_SECRET_TTS_URL
+#define MIMI_SECRET_TTS_URL ""
+#endif
+#ifndef MIMI_SECRET_TTS_API_KEY
+#define MIMI_SECRET_TTS_API_KEY ""
+#endif
+#ifndef MIMI_SECRET_TTS_VOICE
+#define MIMI_SECRET_TTS_VOICE "Cherry"
+#endif
+#ifndef MIMI_SECRET_TTS_MODEL
+#define MIMI_SECRET_TTS_MODEL ""
+#endif
+#ifndef MIMI_SECRET_TTS_LANGUAGE
+#define MIMI_SECRET_TTS_LANGUAGE "English"
+#endif
+
+/* Qwen voice API defaults (DashScope) */
+#define MIMI_QWEN_STT_URL "https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions"
+#define MIMI_QWEN_STT_MODEL "qwen3-asr-flash"
+#define MIMI_QWEN_TTS_URL "https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation"
+#define MIMI_QWEN_TTS_MODEL "qwen3-tts-flash"
/* WiFi */
#define MIMI_WIFI_MAX_RETRY 10
@@ -79,6 +112,40 @@
#define MIMI_MAX_TOOL_CALLS 4
#define MIMI_AGENT_SEND_WORKING_STATUS 1
+/* Voice UX (LLM -> TTS) */
+/* Rough speaking rate for Simplified Chinese TTS is often ~4–6 chars/sec depending on voice.
+ * Default limits aim to keep playback under ~20 seconds in typical conditions.
+ * Override these in mimi_secrets.h per your preferred voice/speed.
+ */
+#ifndef MIMI_VOICE_TTS_MAX_SECONDS
+#define MIMI_VOICE_TTS_MAX_SECONDS 20
+#endif
+
+#ifndef MIMI_VOICE_TTS_CHARS_PER_SEC
+#define MIMI_VOICE_TTS_CHARS_PER_SEC 7
+#endif
+
+#ifndef MIMI_VOICE_LLM_MAX_CHARS
+#define MIMI_VOICE_LLM_MAX_CHARS (MIMI_VOICE_TTS_MAX_SECONDS * MIMI_VOICE_TTS_CHARS_PER_SEC)
+#endif
+
+#ifndef MIMI_VOICE_TTS_MAX_CHARS
+#define MIMI_VOICE_TTS_MAX_CHARS (MIMI_VOICE_LLM_MAX_CHARS + 10)
+#endif
+
+/* Voice capture (VAD / STT trigger) */
+#ifndef MIMI_VOICE_VAD_START_FRAMES
+#define MIMI_VOICE_VAD_START_FRAMES 4 /* consecutive frames above threshold to enter speech */
+#endif
+
+#ifndef MIMI_VOICE_VAD_MIN_FRAMES
+#define MIMI_VOICE_VAD_MIN_FRAMES 50 /* minimum utterance frames before sending to STT */
+#endif
+
+#ifndef MIMI_VOICE_STT_COOLDOWN_MS
+#define MIMI_VOICE_STT_COOLDOWN_MS 2000 /* cooldown after an STT attempt to reduce re-trigger */
+#endif
+
/* Timezone (POSIX TZ format) */
#define MIMI_TIMEZONE "PST8PDT,M3.2.0,M11.1.0"
@@ -86,8 +153,8 @@
#define MIMI_LLM_DEFAULT_MODEL "claude-opus-4-5"
#define MIMI_LLM_PROVIDER_DEFAULT "anthropic"
#define MIMI_LLM_MAX_TOKENS 4096
-#define MIMI_LLM_API_URL "https://api.anthropic.com/v1/messages"
-#define MIMI_OPENAI_API_URL "https://api.openai.com/v1/chat/completions"
+#define MIMI_LLM_API_BASE_ANTHROPIC "https://api.anthropic.com/v1"
+#define MIMI_LLM_API_BASE_OPENAI "https://api.openai.com/v1"
#define MIMI_LLM_API_VERSION "2023-06-01"
#define MIMI_LLM_STREAM_BUF_SIZE (32 * 1024)
#define MIMI_LLM_LOG_VERBOSE_PAYLOAD 0
@@ -99,6 +166,22 @@
#define MIMI_OUTBOUND_PRIO 5
#define MIMI_OUTBOUND_CORE 0
+/* Voice speak task (TTS download + resample + playback) */
+#ifndef MIMI_VOICE_SPEAK_STACK
+#define MIMI_VOICE_SPEAK_STACK (12 * 1024)
+#endif
+#ifndef MIMI_VOICE_SPEAK_PRIO
+#define MIMI_VOICE_SPEAK_PRIO 5
+#endif
+#ifndef MIMI_VOICE_SPEAK_CORE
+#define MIMI_VOICE_SPEAK_CORE 1
+#endif
+
+/* WiFi reliability */
+#ifndef MIMI_WIFI_DISABLE_POWERSAVE
+#define MIMI_WIFI_DISABLE_POWERSAVE 1
+#endif
+
/* Memory / SPIFFS */
#define MIMI_SPIFFS_BASE "/spiffs"
#define MIMI_SPIFFS_CONFIG_DIR MIMI_SPIFFS_BASE "/config"
@@ -144,6 +227,7 @@
#define MIMI_NVS_KEY_FEISHU_APP_ID "app_id"
#define MIMI_NVS_KEY_FEISHU_APP_SECRET "app_secret"
#define MIMI_NVS_KEY_API_KEY "api_key"
+#define MIMI_NVS_KEY_API_BASE "api_base"
#define MIMI_NVS_KEY_TAVILY_KEY "tavily_key"
#define MIMI_NVS_KEY_MODEL "model"
#define MIMI_NVS_KEY_PROVIDER "provider"
diff --git a/main/mimi_secrets.h.example b/main/mimi_secrets.h.example
index ecebf54e..1852f66c 100644
--- a/main/mimi_secrets.h.example
+++ b/main/mimi_secrets.h.example
@@ -21,8 +21,9 @@
#define MIMI_SECRET_FEISHU_APP_ID ""
#define MIMI_SECRET_FEISHU_APP_SECRET ""
-/* Anthropic API */
+/* LLM */
#define MIMI_SECRET_API_KEY ""
+#define MIMI_SECRET_LLM_API_URL "" /* optional: full URL including scheme/host/port/path */
#define MIMI_SECRET_MODEL ""
#define MIMI_SECRET_MODEL_PROVIDER "anthropic"
@@ -33,5 +34,53 @@
/* Brave Search API */
#define MIMI_SECRET_SEARCH_KEY ""
+
+/* Voice STT / TTS services */
+#define MIMI_SECRET_STT_URL ""
+#define MIMI_SECRET_STT_API_KEY ""
+#define MIMI_SECRET_STT_MODEL ""
+#define MIMI_SECRET_TTS_URL ""
+#define MIMI_SECRET_TTS_API_KEY ""
+#define MIMI_SECRET_TTS_VOICE "Cherry"
+#define MIMI_SECRET_TTS_MODEL ""
+#define MIMI_SECRET_TTS_LANGUAGE "English"
+
+/* ReSpeaker XVF3800 I2S pin map (set per board) */
+#define MIMI_VOICE_I2S_PORT 0
+#define MIMI_VOICE_I2S_BCLK (-1)
+#define MIMI_VOICE_I2S_WS (-1)
+#define MIMI_VOICE_I2S_DIN (-1)
+#define MIMI_VOICE_I2S_DOUT (-1)
+
+/* I2S slot/timing style (set per DAC/codec):
+ * 0: Philips (I2S)
+ * 1: MSB (left-justified)
+ * 2: PCM (short frame sync)
+ */
+/* #define MIMI_VOICE_I2S_STD_SLOT_STYLE 1 */
+
+/* Optional: tune DMA and silence tail to suppress post-playback "thump" noises on some DAC/amps */
+/* #define MIMI_VOICE_I2S_DMA_DESC_NUM 6 */
+/* #define MIMI_VOICE_I2S_DMA_FRAME_NUM 240 */
+/* #define MIMI_VOICE_TX_SILENCE_TAIL_MS 400 */
+
+/* Optional: voice conversation pacing (LLM -> TTS)
+ * Target: <= 20s playback, <= 2 sentences + 1 follow-up question.
+ */
+/* #define MIMI_VOICE_TTS_MAX_SECONDS 20 */
+/* #define MIMI_VOICE_TTS_CHARS_PER_SEC 5 */
+/* #define MIMI_VOICE_LLM_MAX_CHARS 100 */
+/* #define MIMI_VOICE_TTS_MAX_CHARS 110 */
+
+/* Optional: reduce STT false triggers (VAD tuning) */
+/* #define MIMI_VOICE_VAD_START_FRAMES 3 */
+/* #define MIMI_VOICE_VAD_MIN_FRAMES 15 */
+/* #define MIMI_VOICE_STT_COOLDOWN_MS 1200 */
+
+/* Optional: WiFi reliability tuning (may increase power draw) */
+/* #define MIMI_WIFI_DISABLE_POWERSAVE 1 */
+
+/* Optional: move TTS/resample/playback off WiFi core to reduce bcn_timeout under load */
+/* #define MIMI_VOICE_SPEAK_CORE 1 */
/* Tavily Search API */
#define MIMI_SECRET_TAVILY_KEY ""
diff --git a/main/proxy/http_proxy.c b/main/proxy/http_proxy.c
index fdb75541..3745144d 100644
--- a/main/proxy/http_proxy.c
+++ b/main/proxy/http_proxy.c
@@ -104,6 +104,61 @@ bool http_proxy_is_enabled(void)
return s_proxy_host[0] != '\0' && s_proxy_port != 0;
}
+/* ── Raw tunnels (no TLS) ────────────────────────────────────── */
+
+static int open_connect_tunnel(const char *host, int port, int timeout_ms);
+static int open_socks5_tunnel(const char *host, int port, int timeout_ms);
+
+int proxy_tunnel_open(const char *host, int port, int timeout_ms)
+{
+ if (!http_proxy_is_enabled()) {
+ ESP_LOGE(TAG, "proxy_tunnel_open called but no proxy configured");
+ return -1;
+ }
+
+ if (!host || !host[0] || port <= 0 || port > 65535) {
+ ESP_LOGE(TAG, "proxy_tunnel_open invalid target");
+ return -1;
+ }
+
+ if (strcmp(s_proxy_type, "socks5") == 0) {
+ return open_socks5_tunnel(host, port, timeout_ms);
+ }
+ return open_connect_tunnel(host, port, timeout_ms);
+}
+
+int proxy_tunnel_write(int sock, const char *data, int len)
+{
+ if (sock < 0 || !data || len <= 0) return -1;
+
+ int written = 0;
+ while (written < len) {
+ int n = send(sock, data + written, len - written, 0);
+ if (n <= 0) return -1;
+ written += n;
+ }
+ return written;
+}
+
+int proxy_tunnel_read(int sock, char *buf, int len, int timeout_ms)
+{
+ if (sock < 0 || !buf || len <= 0) return -1;
+
+ struct timeval tv = { .tv_sec = timeout_ms / 1000, .tv_usec = (timeout_ms % 1000) * 1000 };
+ setsockopt(sock, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
+
+ int n = recv(sock, buf, len, 0);
+ if (n < 0) return -1;
+ return n;
+}
+
+void proxy_tunnel_close(int sock)
+{
+ if (sock >= 0) {
+ close(sock);
+ }
+}
+
/* ── Proxied TLS connection ───────────────────────────────────── */
struct proxy_conn {
diff --git a/main/proxy/http_proxy.h b/main/proxy/http_proxy.h
index f324700e..382dc742 100644
--- a/main/proxy/http_proxy.h
+++ b/main/proxy/http_proxy.h
@@ -24,6 +24,27 @@ esp_err_t http_proxy_set(const char *host, uint16_t port, const char *type);
*/
esp_err_t http_proxy_clear(void);
+/* ── Proxy tunnels (no TLS) ──────────────────────────────────── */
+
+/**
+ * Open a raw TCP tunnel to target host:port through the configured proxy.
+ *
+ * - If proxy type is "http": uses HTTP CONNECT
+ * - If proxy type is "socks5": uses SOCKS5 CONNECT
+ *
+ * Returns a socket fd on success, or -1 on failure.
+ */
+int proxy_tunnel_open(const char *host, int port, int timeout_ms);
+
+/** Write raw bytes through the tunnel. Returns bytes written or -1. */
+int proxy_tunnel_write(int sock, const char *data, int len);
+
+/** Read raw bytes from the tunnel. Returns bytes read or -1. */
+int proxy_tunnel_read(int sock, char *buf, int len, int timeout_ms);
+
+/** Close the tunnel socket. */
+void proxy_tunnel_close(int sock);
+
/* ── Proxied HTTPS connection ─────────────────────────────────── */
typedef struct proxy_conn proxy_conn_t;
diff --git a/main/tools/tool_cron.c b/main/tools/tool_cron.c
index 048e8902..5670678f 100644
--- a/main/tools/tool_cron.c
+++ b/main/tools/tool_cron.c
@@ -66,16 +66,21 @@ esp_err_t tool_cron_add_execute(const char *input_json, char *output, size_t out
job.delete_after_run = false;
} else if (strcmp(schedule_type, "at") == 0) {
job.kind = CRON_KIND_AT;
- cJSON *at_epoch = cJSON_GetObjectItem(root, "at_epoch");
- if (!at_epoch || !cJSON_IsNumber(at_epoch)) {
- snprintf(output, output_size, "Error: 'at' schedule requires 'at_epoch' (unix timestamp)");
- cJSON_Delete(root);
- return ESP_ERR_INVALID_ARG;
+ time_t now = time(NULL);
+ cJSON *delay_s = cJSON_GetObjectItem(root, "delay_s");
+ if (delay_s && cJSON_IsNumber(delay_s) && delay_s->valuedouble > 0) {
+ job.at_epoch = (int64_t)now + (int64_t)delay_s->valuedouble;
+ } else {
+ cJSON *at_epoch = cJSON_GetObjectItem(root, "at_epoch");
+ if (!at_epoch || !cJSON_IsNumber(at_epoch)) {
+ snprintf(output, output_size, "Error: 'at' schedule requires 'at_epoch' (unix timestamp) or positive 'delay_s'");
+ cJSON_Delete(root);
+ return ESP_ERR_INVALID_ARG;
+ }
+ job.at_epoch = (int64_t)at_epoch->valuedouble;
}
- job.at_epoch = (int64_t)at_epoch->valuedouble;
/* Check if already in the past */
- time_t now = time(NULL);
if (job.at_epoch <= now) {
snprintf(output, output_size, "Error: at_epoch %lld is in the past (now=%lld)",
(long long)job.at_epoch, (long long)now);
diff --git a/main/tools/tool_registry.c b/main/tools/tool_registry.c
index 6c82a3ef..e6251f8a 100644
--- a/main/tools/tool_registry.c
+++ b/main/tools/tool_registry.c
@@ -135,14 +135,15 @@ esp_err_t tool_registry_init(void)
/* Register cron_add */
mimi_tool_t ca = {
.name = "cron_add",
- .description = "Schedule a recurring or one-shot task. The message will trigger an agent turn when the job fires.",
+ .description = "Schedule a recurring or one-shot task. For relative reminders (e.g. 'in 2 minutes'), prefer delay_s to avoid timestamp math. The message will trigger an agent turn when the job fires.",
.input_schema_json =
"{\"type\":\"object\","
"\"properties\":{"
"\"name\":{\"type\":\"string\",\"description\":\"Short name for the job\"},"
"\"schedule_type\":{\"type\":\"string\",\"description\":\"'every' for recurring interval or 'at' for one-shot at a unix timestamp\"},"
"\"interval_s\":{\"type\":\"integer\",\"description\":\"Interval in seconds (required for 'every')\"},"
- "\"at_epoch\":{\"type\":\"integer\",\"description\":\"Unix timestamp to fire at (required for 'at')\"},"
+ "\"at_epoch\":{\"type\":\"integer\",\"description\":\"Unix timestamp to fire at (for 'at'). Prefer delay_s for relative reminders.\"},"
+ "\"delay_s\":{\"type\":\"integer\",\"description\":\"Delay in seconds from now (preferred for 'at' when user says 'in N minutes')\"},"
"\"message\":{\"type\":\"string\",\"description\":\"Message to inject when the job fires, triggering an agent turn\"},"
"\"channel\":{\"type\":\"string\",\"description\":\"Optional reply channel (e.g. 'telegram'). If omitted, current turn channel is used when available\"},"
"\"chat_id\":{\"type\":\"string\",\"description\":\"Optional reply chat_id. Required when channel='telegram'. If omitted during a Telegram turn, current chat_id is used\"}"
diff --git a/main/voice/voice_channel.c b/main/voice/voice_channel.c
new file mode 100644
index 00000000..039c6e26
--- /dev/null
+++ b/main/voice/voice_channel.c
@@ -0,0 +1,1613 @@
+#include "voice/voice_channel.h"
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <stddef.h>
+#include <stdint.h>
+#include <stdbool.h>
+
+#include "mimi_config.h"
+#include "bus/message_bus.h"
+#include "proxy/http_proxy.h"
+
+#include "freertos/FreeRTOS.h"
+#include "freertos/task.h"
+#include "freertos/semphr.h"
+
+#include "esp_log.h"
+#include "esp_err.h"
+#include "esp_http_client.h"
+#include "esp_crt_bundle.h"
+#include "esp_heap_caps.h"
+#include "driver/i2s_std.h"
+#include "driver/i2s_common.h"
+
+#include "cJSON.h"
+#include "mbedtls/base64.h"
+
+static const char *TAG = "voice";
+
+/*
+ * I2S timing / slot style selection:
+ * 0: Philips (I2S, 1-bit delay after WS edge)
+ * 1: MSB (left-justified, no 1-bit delay)
+ * 2: PCM (short frame sync, ws_width=1, ws_pol=true)
+ *
+ * Many DAC/codec parts are sensitive to this. If your audio sounds like loud
+ * static/hiss but speech is partially recognizable, this is a prime suspect.
+ */
+#ifndef MIMI_VOICE_I2S_STD_SLOT_STYLE
+#define MIMI_VOICE_I2S_STD_SLOT_STYLE 0
+#endif
+
+#ifndef MIMI_VOICE_I2S_DMA_DESC_NUM
+#define MIMI_VOICE_I2S_DMA_DESC_NUM 6
+#endif
+
+#ifndef MIMI_VOICE_I2S_DMA_FRAME_NUM
+#define MIMI_VOICE_I2S_DMA_FRAME_NUM 240
+#endif
+
+#ifndef MIMI_VOICE_TX_SILENCE_TAIL_MS
+#define MIMI_VOICE_TX_SILENCE_TAIL_MS 400
+#endif
+
+#define MIMI_VOICE_TX_BYTES_PER_FRAME (2U * sizeof(int32_t))
+#define MIMI_VOICE_TX_DMA_TOTAL_BYTES \
+ ((uint32_t)MIMI_VOICE_I2S_DMA_DESC_NUM * (uint32_t)MIMI_VOICE_I2S_DMA_FRAME_NUM * (uint32_t)MIMI_VOICE_TX_BYTES_PER_FRAME)
+
+#if MIMI_VOICE_I2S_STD_SLOT_STYLE == 1
+#define MIMI_VOICE_I2S_SLOT_DEFAULT_CONFIG(bits, mono_or_stereo) \
+ I2S_STD_MSB_SLOT_DEFAULT_CONFIG(bits, mono_or_stereo)
+#elif MIMI_VOICE_I2S_STD_SLOT_STYLE == 2
+#define MIMI_VOICE_I2S_SLOT_DEFAULT_CONFIG(bits, mono_or_stereo) \
+ I2S_STD_PCM_SLOT_DEFAULT_CONFIG(bits, mono_or_stereo)
+#else
+#define MIMI_VOICE_I2S_SLOT_DEFAULT_CONFIG(bits, mono_or_stereo) \
+ I2S_STD_PHILIPS_SLOT_DEFAULT_CONFIG(bits, mono_or_stereo)
+#endif
+
+static const char *i2s_slot_style_str(void)
+{
+#if MIMI_VOICE_I2S_STD_SLOT_STYLE == 1
+ return "MSB";
+#elif MIMI_VOICE_I2S_STD_SLOT_STYLE == 2
+ return "PCM";
+#else
+ return "PHILIPS";
+#endif
+}
+
+/* =========================
+ * Fallback config defaults
+ * ========================= */
+
+#ifndef MIMI_VOICE_ENABLED_DEFAULT
+#define MIMI_VOICE_ENABLED_DEFAULT 0
+#endif
+
+#ifndef MIMI_VOICE_CHAT_ID
+#define MIMI_VOICE_CHAT_ID "voice_local"
+#endif
+
+#ifndef MIMI_VOICE_SAMPLE_RATE
+#define MIMI_VOICE_SAMPLE_RATE 16000
+#endif
+
+#ifndef MIMI_VOICE_FRAME_MS
+#define MIMI_VOICE_FRAME_MS 20
+#endif
+
+#ifndef MIMI_VOICE_MAX_UTTERANCE_MS
+#define MIMI_VOICE_MAX_UTTERANCE_MS 10000
+#endif
+
+#ifndef MIMI_VOICE_SILENCE_END_MS
+#define MIMI_VOICE_SILENCE_END_MS 600
+#endif
+
+#ifndef MIMI_VOICE_VAD_THRESHOLD
+#define MIMI_VOICE_VAD_THRESHOLD 700
+#endif
+
+#ifndef MIMI_VOICE_CAPTURE_STACK
+#define MIMI_VOICE_CAPTURE_STACK (8 * 1024)
+#endif
+
+#ifndef MIMI_VOICE_TASK_PRIO
+#define MIMI_VOICE_TASK_PRIO 5
+#endif
+
+#ifndef MIMI_VOICE_CORE
+#define MIMI_VOICE_CORE 0
+#endif
+
+#ifndef MIMI_SECRET_STT_URL
+#define MIMI_SECRET_STT_URL ""
+#endif
+
+#ifndef MIMI_SECRET_STT_API_KEY
+#define MIMI_SECRET_STT_API_KEY ""
+#endif
+
+#ifndef MIMI_SECRET_STT_MODEL
+#define MIMI_SECRET_STT_MODEL "qwen3-asr-flash"
+#endif
+
+#ifndef MIMI_SECRET_TTS_URL
+#define MIMI_SECRET_TTS_URL ""
+#endif
+
+#ifndef MIMI_SECRET_TTS_API_KEY
+#define MIMI_SECRET_TTS_API_KEY ""
+#endif
+
+#ifndef MIMI_SECRET_TTS_MODEL
+#define MIMI_SECRET_TTS_MODEL "qwen3-tts-flash"
+#endif
+
+#ifndef MIMI_SECRET_TTS_VOICE
+#define MIMI_SECRET_TTS_VOICE "Cherry"
+#endif
+
+#ifndef MIMI_SECRET_TTS_LANGUAGE
+#define MIMI_SECRET_TTS_LANGUAGE "English"
+#endif
+
+#ifndef MIMI_SECRET_API_KEY
+#define MIMI_SECRET_API_KEY ""
+#endif
+
+/* TTS text constraints (can override in mimi_secrets.h) */
+#ifndef MIMI_VOICE_TTS_MAX_CHARS
+#define MIMI_VOICE_TTS_MAX_CHARS 140
+#endif
+
+#ifndef MIMI_VOICE_I2S_PORT
+#define MIMI_VOICE_I2S_PORT 0
+#endif
+
+#ifndef MIMI_VOICE_I2S_BCLK
+#define MIMI_VOICE_I2S_BCLK 42
+#endif
+
+#ifndef MIMI_VOICE_I2S_WS
+#define MIMI_VOICE_I2S_WS 41
+#endif
+
+#ifndef MIMI_VOICE_I2S_DIN
+#define MIMI_VOICE_I2S_DIN 40
+#endif
+
+#ifndef MIMI_VOICE_I2S_DOUT
+#define MIMI_VOICE_I2S_DOUT 39
+#endif
+
+/* XVF3800 fixed digital format in this design:
+ * 16 kHz, stereo, 32-bit samples over I2S.
+ */
+#define VOICE_I2S_CHANNELS 2
+#define VOICE_I2S_BYTES_PER_SAMPLE 4
+#define VOICE_I2S_BYTES_PER_STEREO_FRAME (VOICE_I2S_CHANNELS * VOICE_I2S_BYTES_PER_SAMPLE)
+#define VOICE_PCM_BITS 16
+
+typedef struct {
+ char *buf;
+ size_t len;
+ size_t cap;
+} http_resp_t;
+
+typedef struct {
+ uint16_t audio_format; /* 1 = PCM */
+ uint16_t channels;
+ uint32_t sample_rate;
+ uint16_t bits_per_sample;
+} wav_fmt_t;
+
+static bool s_enabled = false;
+static bool s_i2s_ready = false;
+static volatile bool s_is_playing = false;
+
+static i2s_chan_handle_t s_tx_chan = NULL;
+static i2s_chan_handle_t s_rx_chan = NULL;
+static TaskHandle_t s_capture_task = NULL;
+static SemaphoreHandle_t s_http_lock = NULL;
+
+/* =========================
+ * Secrets / config helpers
+ * ========================= */
+
+static const char *stt_api_url(void)
+{
+ return (MIMI_SECRET_STT_URL[0] != '\0') ? MIMI_SECRET_STT_URL : "";
+}
+
+static const char *stt_api_key(void)
+{
+ return (MIMI_SECRET_STT_API_KEY[0] != '\0') ? MIMI_SECRET_STT_API_KEY :
+ (MIMI_SECRET_API_KEY[0] != '\0') ? MIMI_SECRET_API_KEY : "";
+}
+
+static const char *stt_model(void)
+{
+ return (MIMI_SECRET_STT_MODEL[0] != '\0') ? MIMI_SECRET_STT_MODEL : "qwen3-asr-flash";
+}
+
+static const char *tts_api_url(void)
+{
+ return (MIMI_SECRET_TTS_URL[0] != '\0') ? MIMI_SECRET_TTS_URL : "";
+}
+
+static const char *tts_api_key(void)
+{
+ return (MIMI_SECRET_TTS_API_KEY[0] != '\0') ? MIMI_SECRET_TTS_API_KEY :
+ (MIMI_SECRET_API_KEY[0] != '\0') ? MIMI_SECRET_API_KEY : "";
+}
+
+static const char *tts_model(void)
+{
+ return (MIMI_SECRET_TTS_MODEL[0] != '\0') ? MIMI_SECRET_TTS_MODEL : "qwen3-tts-flash";
+}
+
+static const char *tts_voice(void)
+{
+ return (MIMI_SECRET_TTS_VOICE[0] != '\0') ? MIMI_SECRET_TTS_VOICE : "Cherry";
+}
+
+static const char *tts_language(void)
+{
+ return (MIMI_SECRET_TTS_LANGUAGE[0] != '\0') ? MIMI_SECRET_TTS_LANGUAGE : "English";
+}
+
+/* =========================
+ * HTTP helpers
+ * ========================= */
+
+static esp_err_t http_event_handler(esp_http_client_event_t *evt)
+{
+ http_resp_t *resp = (http_resp_t *)evt->user_data;
+ if (evt->event_id != HTTP_EVENT_ON_DATA || !resp || !evt->data || evt->data_len <= 0) {
+ return ESP_OK;
+ }
+
+ size_t need = resp->len + (size_t)evt->data_len + 1;
+ if (need > resp->cap) {
+ size_t new_cap = resp->cap ? resp->cap * 2 : 1024;
+ while (new_cap < need) {
+ new_cap *= 2;
+ }
+ char *tmp = realloc(resp->buf, new_cap);
+ if (!tmp) {
+ return ESP_ERR_NO_MEM;
+ }
+ resp->buf = tmp;
+ resp->cap = new_cap;
+ }
+
+ memcpy(resp->buf + resp->len, evt->data, evt->data_len);
+ resp->len += (size_t)evt->data_len;
+ resp->buf[resp->len] = '\0';
+ return ESP_OK;
+}
+
+static esp_err_t http_post_json(const char *url,
+ const char *bearer_key,
+ const char *json_body,
+ bool enable_sse,
+ http_resp_t *resp,
+ int *http_status_out)
+{
+ if (!url || !url[0] || !json_body || !resp) {
+ return ESP_ERR_INVALID_ARG;
+ }
+
+ memset(resp, 0, sizeof(*resp));
+
+ esp_http_client_config_t cfg = {
+ .url = url,
+ .method = HTTP_METHOD_POST,
+ .event_handler = http_event_handler,
+ .user_data = resp,
+ .crt_bundle_attach = esp_crt_bundle_attach,
+ .timeout_ms = 30000,
+ .buffer_size = 2048,
+ .buffer_size_tx = 2048,
+ };
+
+ esp_http_client_handle_t client = esp_http_client_init(&cfg);
+ if (!client) {
+ return ESP_FAIL;
+ }
+
+ esp_http_client_set_header(client, "Content-Type", "application/json");
+ if (bearer_key && bearer_key[0]) {
+ char auth[320];
+ snprintf(auth, sizeof(auth), "Bearer %s", bearer_key);
+ esp_http_client_set_header(client, "Authorization", auth);
+ }
+ esp_http_client_set_header(client, "X-DashScope-SSE", enable_sse ? "enable" : "disable");
+ esp_http_client_set_post_field(client, json_body, (int)strlen(json_body));
+
+ esp_err_t err = esp_http_client_perform(client);
+ if (err == ESP_OK && http_status_out) {
+ *http_status_out = esp_http_client_get_status_code(client);
+ }
+
+ esp_http_client_cleanup(client);
+ return err;
+}
+
+static esp_err_t http_get_binary(const char *url, http_resp_t *resp, int *http_status_out)
+{
+ if (!url || !url[0] || !resp) {
+ return ESP_ERR_INVALID_ARG;
+ }
+
+ memset(resp, 0, sizeof(*resp));
+
+ esp_http_client_config_t cfg = {
+ .url = url,
+ .method = HTTP_METHOD_GET,
+ .event_handler = http_event_handler,
+ .user_data = resp,
+ .crt_bundle_attach = esp_crt_bundle_attach,
+ .timeout_ms = 30000,
+ .buffer_size = 2048,
+ .buffer_size_tx = 1024,
+ };
+
+ esp_http_client_handle_t client = esp_http_client_init(&cfg);
+ if (!client) {
+ return ESP_FAIL;
+ }
+
+ esp_err_t err = esp_http_client_perform(client);
+ if (err == ESP_OK && http_status_out) {
+ *http_status_out = esp_http_client_get_status_code(client);
+ }
+
+ esp_http_client_cleanup(client);
+ return err;
+}
+
+/* =========================
+ * Audio helpers
+ * ========================= */
+
+static void *malloc_prefer_spiram(size_t bytes)
+{
+ if (bytes == 0) {
+ return NULL;
+ }
+
+ void *p = heap_caps_malloc(bytes, MALLOC_CAP_SPIRAM);
+ if (p) {
+ return p;
+ }
+ return malloc(bytes);
+}
+
+static bool utf8_is_continuation_byte(uint8_t b)
+{
+ return (b & 0xC0U) == 0x80U;
+}
+
+static bool utf8_starts_with(const char *s, size_t i, size_t len, const char *lit)
+{
+ size_t lit_len = strlen(lit);
+ if (i + lit_len > len) {
+ return false;
+ }
+ return memcmp(s + i, lit, lit_len) == 0;
+}
+
+static bool is_speech_cut_punct(const char *s, size_t i, size_t len)
+{
+ const uint8_t b = (uint8_t)s[i];
+ if (b == '\n' || b == '\r') {
+ return true;
+ }
+ if (b == '.' || b == '!' || b == '?' || b == ',' || b == ';' || b == ':') {
+ return true;
+ }
+
+ /* Common CJK punctuation in UTF-8 */
+ if (utf8_starts_with(s, i, len, "。") ||
+ utf8_starts_with(s, i, len, "!") ||
+ utf8_starts_with(s, i, len, "?") ||
+ utf8_starts_with(s, i, len, ",") ||
+ utf8_starts_with(s, i, len, ";") ||
+ utf8_starts_with(s, i, len, ":") ||
+ utf8_starts_with(s, i, len, "、")) {
+ return true;
+ }
+
+ return false;
+}
+
+static size_t utf8_truncate_for_tts(const char *text, size_t max_chars, size_t *out_char_count, bool *out_truncated)
+{
+ if (!text || max_chars == 0) {
+ if (out_char_count) *out_char_count = 0;
+ if (out_truncated) *out_truncated = false;
+ return 0;
+ }
+
+ const size_t len = strlen(text);
+ size_t char_count = 0;
+ size_t last_punct_cut = 0;
+ size_t i = 0;
+
+ while (i < len) {
+ if (char_count >= max_chars) {
+ break;
+ }
+
+ if (!utf8_is_continuation_byte((uint8_t)text[i])) {
+ char_count++;
+ if (is_speech_cut_punct(text, i, len)) {
+ /* Cut after this codepoint (best-effort) */
+ size_t j = i + 1;
+ while (j < len && utf8_is_continuation_byte((uint8_t)text[j])) {
+ j++;
+ }
+ last_punct_cut = j;
+ }
+ }
+ i++;
+ }
+
+ if (out_char_count) {
+ *out_char_count = char_count;
+ }
+
+ if (i >= len) {
+ if (out_truncated) *out_truncated = false;
+ return len;
+ }
+
+ /* Prefer cutting at punctuation, but avoid cutting too early */
+ size_t cut = i;
+ if (last_punct_cut > 0) {
+ const size_t min_reasonable = (max_chars >= 20) ? (max_chars / 2) : 0;
+ if (last_punct_cut >= min_reasonable) {
+ cut = last_punct_cut;
+ }
+ }
+
+ while (cut > 0 && utf8_is_continuation_byte((uint8_t)text[cut])) {
+ cut--;
+ }
+
+ if (out_truncated) *out_truncated = true;
+ return cut;
+}
+
+static char *voice_build_tts_text(const char *text)
+{
+ if (!text) {
+ return NULL;
+ }
+
+ size_t char_count = 0;
+ bool truncated = false;
+ size_t cut_bytes = utf8_truncate_for_tts(text, MIMI_VOICE_TTS_MAX_CHARS, &char_count, &truncated);
+
+ if (!truncated) {
+ return NULL; /* caller can use original text */
+ }
+
+ char *out = (char *)malloc(cut_bytes + 1);
+ if (!out) {
+ return NULL;
+ }
+ memcpy(out, text, cut_bytes);
+ out[cut_bytes] = '\0';
+
+ ESP_LOGW(TAG, "TTS text truncated: max=%u chars, cut_bytes=%u", (unsigned)MIMI_VOICE_TTS_MAX_CHARS, (unsigned)cut_bytes);
+ return out;
+}
+
+static int16_t fir5_s16_at_clamped(const int16_t *src, size_t src_samples, size_t idx)
+{
+ if (!src || src_samples == 0) {
+ return 0;
+ }
+
+ size_t i0 = (idx >= 2) ? (idx - 2) : 0;
+ size_t i1 = (idx >= 1) ? (idx - 1) : 0;
+ size_t i2 = idx;
+ if (i2 >= src_samples) i2 = src_samples - 1;
+ size_t i3 = (idx + 1 < src_samples) ? (idx + 1) : (src_samples - 1);
+ size_t i4 = (idx + 2 < src_samples) ? (idx + 2) : (src_samples - 1);
+
+ int32_t acc =
+ (int32_t)src[i0] * 1 +
+ (int32_t)src[i1] * 4 +
+ (int32_t)src[i2] * 6 +
+ (int32_t)src[i3] * 4 +
+ (int32_t)src[i4] * 1;
+
+ acc = acc / 16;
+ if (acc > INT16_MAX) acc = INT16_MAX;
+ if (acc < INT16_MIN) acc = INT16_MIN;
+ return (int16_t)acc;
+}
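+
+/* Why the [1,4,6,4,1]/16 kernel used above: it is the 4th-order binomial
+ * filter (four cascaded two-tap [1,1]/2 averagers), so its magnitude
+ * response is cos^4(pi*f/fs) with unity DC gain. For a 24 kHz source,
+ * content at 8 kHz (the Nyquist of the 16 kHz target) is attenuated by
+ * cos^4(60 deg) = 1/16, roughly -24 dB, which is what suppresses the
+ * resampling aliasing handled further below.
+ */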
+
+static esp_err_t i2s_tx_write_silence_ms(uint32_t ms)
+{
+ if (!s_i2s_ready || !s_tx_chan || ms == 0) {
+ return ESP_OK;
+ }
+
+ uint64_t frames_total = ((uint64_t)MIMI_VOICE_SAMPLE_RATE * (uint64_t)ms) / 1000ULL;
+ while (frames_total > 0) {
+ const size_t frames_chunk = (frames_total > 256) ? 256 : (size_t)frames_total;
+ int32_t zeros[256 * 2] = {0};
+
+ const uint8_t *p = (const uint8_t *)zeros;
+ size_t bytes_total = frames_chunk * 2 * sizeof(int32_t);
+ size_t bytes_sent = 0;
+
+ while (bytes_sent < bytes_total) {
+ size_t written = 0;
+ esp_err_t err = i2s_channel_write(s_tx_chan,
+ p + bytes_sent,
+ bytes_total - bytes_sent,
+ &written,
+ pdMS_TO_TICKS(1000));
+ if (err != ESP_OK) {
+ return err;
+ }
+ if (written == 0) {
+ return ESP_FAIL;
+ }
+ bytes_sent += written;
+ }
+
+ frames_total -= frames_chunk;
+ }
+
+ return ESP_OK;
+}
+
+static esp_err_t i2s_tx_overwrite_dma_with_zeros(void)
+{
+ if (!s_i2s_ready || !s_tx_chan) {
+ return ESP_ERR_INVALID_STATE;
+ }
+
+ uint32_t remaining = MIMI_VOICE_TX_DMA_TOTAL_BYTES;
+ while (remaining > 0) {
+ int32_t zeros[256 * 2] = {0};
+ size_t chunk = sizeof(zeros);
+ if (chunk > remaining) {
+ chunk = remaining;
+ }
+
+ const uint8_t *p = (const uint8_t *)zeros;
+ size_t sent = 0;
+ while (sent < chunk) {
+ size_t written = 0;
+ esp_err_t err = i2s_channel_write(s_tx_chan,
+ p + sent,
+ chunk - sent,
+ &written,
+ pdMS_TO_TICKS(1000));
+ if (err != ESP_OK) {
+ return err;
+ }
+ if (written == 0) {
+ return ESP_FAIL;
+ }
+ sent += written;
+ }
+
+ remaining -= (uint32_t)chunk;
+ }
+
+ return ESP_OK;
+}
+
+static void pcm_s32_stereo_to_s16_mono(const uint8_t *src, size_t src_len, int16_t *dst, size_t *out_samples)
+{
+ size_t frames = src_len / VOICE_I2S_BYTES_PER_STEREO_FRAME;
+ const int32_t *p = (const int32_t *)src;
+
+ for (size_t i = 0; i < frames; i++) {
+ int32_t l = p[i * 2 + 0];
+ int32_t r = p[i * 2 + 1];
+
+ /* The XVF3800, like many I2S MEMS front-ends, delivers valid audio in the high 16 bits of each 32-bit slot. */
+ int16_t ls = (int16_t)(l >> 16);
+ int16_t rs = (int16_t)(r >> 16);
+ int32_t mono = ((int32_t)ls + (int32_t)rs) / 2;
+
+ if (mono > INT16_MAX) mono = INT16_MAX;
+ if (mono < INT16_MIN) mono = INT16_MIN;
+ dst[i] = (int16_t)mono;
+ }
+
+ if (out_samples) {
+ *out_samples = frames;
+ }
+}
+
+static uint32_t pcm_energy_absavg(const int16_t *pcm, size_t samples)
+{
+ if (!pcm || samples == 0) return 0;
+
+ uint64_t sum = 0;
+ for (size_t i = 0; i < samples; i++) {
+ int32_t v = pcm[i];
+ if (v < 0) v = -v;
+ sum += (uint32_t)v;
+ }
+ return (uint32_t)(sum / samples);
+}
+
+static size_t wav_build_from_pcm16(const int16_t *pcm,
+ size_t pcm_bytes,
+ uint32_t sample_rate,
+ uint16_t channels,
+ uint8_t **out_buf)
+{
+ if (!pcm || !out_buf || pcm_bytes == 0) {
+ return 0;
+ }
+
+ const size_t wav_size = 44 + pcm_bytes;
+ uint8_t *buf = (uint8_t *)malloc(wav_size);
+ if (!buf) {
+ return 0;
+ }
+
+ const uint32_t byte_rate = sample_rate * channels * 2;
+ const uint16_t block_align = channels * 2;
+ const uint32_t riff_size = (uint32_t)(wav_size - 8);
+ const uint32_t data_size = (uint32_t)pcm_bytes;
+
+ memcpy(buf + 0, "RIFF", 4);
+ memcpy(buf + 4, &riff_size, 4);
+ memcpy(buf + 8, "WAVE", 4);
+
+ memcpy(buf + 12, "fmt ", 4);
+ uint32_t fmt_size = 16;
+ uint16_t audio_format = 1;
+ uint16_t bits_per_sample = 16;
+ memcpy(buf + 16, &fmt_size, 4);
+ memcpy(buf + 20, &audio_format, 2);
+ memcpy(buf + 22, &channels, 2);
+ memcpy(buf + 24, &sample_rate, 4);
+ memcpy(buf + 28, &byte_rate, 4);
+ memcpy(buf + 32, &block_align, 2);
+ memcpy(buf + 34, &bits_per_sample, 2);
+
+ memcpy(buf + 36, "data", 4);
+ memcpy(buf + 40, &data_size, 4);
+ memcpy(buf + 44, pcm, pcm_bytes);
+
+ *out_buf = buf;
+ return wav_size;
+}
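+
+/* Layout written above: the canonical 44-byte PCM WAV header, all
+ * multi-byte fields little-endian (which matches the ESP32's native byte
+ * order, so the raw memcpy of integer fields is safe here):
+ *
+ *   0  "RIFF"             22 channels (u16)
+ *   4  riff_size (u32)    24 sample_rate (u32)
+ *   8  "WAVE"             28 byte_rate (u32)
+ *   12 "fmt "             32 block_align (u16)
+ *   16 fmt_size = 16      34 bits_per_sample (u16)
+ *   20 audio_format = 1   36 "data", 40 data_size (u32), 44 PCM samples
+ */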
+
+static esp_err_t wav_find_data_chunk(const uint8_t *wav,
+ size_t wav_len,
+ wav_fmt_t *fmt,
+ const uint8_t **data_out,
+ size_t *data_len_out)
+{
+ if (!wav || wav_len < 44 || !fmt || !data_out || !data_len_out) {
+ return ESP_ERR_INVALID_ARG;
+ }
+
+ if (memcmp(wav, "RIFF", 4) != 0 || memcmp(wav + 8, "WAVE", 4) != 0) {
+ return ESP_ERR_INVALID_RESPONSE;
+ }
+
+ memset(fmt, 0, sizeof(*fmt));
+ *data_out = NULL;
+ *data_len_out = 0;
+
+ size_t pos = 12;
+ bool got_fmt = false;
+
+ while (pos + 8 <= wav_len) {
+ const uint8_t *chunk = wav + pos;
+ uint32_t chunk_size = 0;
+ memcpy(&chunk_size, chunk + 4, 4);
+
+ size_t chunk_data_pos = pos + 8;
+ if (chunk_data_pos > wav_len) {
+ break;
+ }
+
+ size_t available = wav_len - chunk_data_pos;
+ size_t declared = (size_t)chunk_size;
+ size_t actual = declared <= available ? declared : available;
+
+ char id[5] = {0};
+ memcpy(id, chunk, 4);
+ ESP_LOGI(TAG, "WAV chunk id=%s declared=%u actual=%u pos=%u",
+ id, (unsigned)chunk_size, (unsigned)actual, (unsigned)pos);
+
+ if (memcmp(chunk, "fmt ", 4) == 0) {
+ if (actual < 16) {
+ ESP_LOGW(TAG, "WAV fmt chunk too short: %u", (unsigned)actual);
+ } else {
+ memcpy(&fmt->audio_format, wav + chunk_data_pos + 0, 2);
+ memcpy(&fmt->channels, wav + chunk_data_pos + 2, 2);
+ memcpy(&fmt->sample_rate, wav + chunk_data_pos + 4, 4);
+ memcpy(&fmt->bits_per_sample, wav + chunk_data_pos + 14, 2);
+ got_fmt = true;
+ }
+ } else if (memcmp(chunk, "data", 4) == 0) {
+ *data_out = wav + chunk_data_pos;
+ *data_len_out = actual;
+ break;
+ }
+
+ size_t step = 8 + actual;
+ if (declared <= available) {
+ step += (declared & 1U);
+ }
+
+ if (step == 0 || pos + step <= pos) {
+ break;
+ }
+ pos += step;
+ }
+
+ if (!*data_out || *data_len_out == 0) {
+ return ESP_ERR_NOT_FOUND;
+ }
+
+ if (!got_fmt) {
+ ESP_LOGW(TAG, "WAV fmt chunk not found, assuming 16-bit mono PCM at the default sample rate");
+ fmt->audio_format = 1;
+ fmt->channels = 1;
+ fmt->sample_rate = MIMI_VOICE_SAMPLE_RATE;
+ fmt->bits_per_sample = 16;
+ }
+
+ if (fmt->audio_format != 1) {
+ ESP_LOGE(TAG, "Unsupported WAV audio_format=%u", (unsigned)fmt->audio_format);
+ return ESP_ERR_NOT_SUPPORTED;
+ }
+
+ if (fmt->bits_per_sample != 16) {
+ ESP_LOGE(TAG, "Unsupported WAV bits_per_sample=%u", (unsigned)fmt->bits_per_sample);
+ return ESP_ERR_NOT_SUPPORTED;
+ }
+
+ return ESP_OK;
+}
+
+/* =========================
+ * STT / TTS JSON helpers
+ * ========================= */
+
+static char *build_data_url_from_wav(const uint8_t *wav, size_t wav_len)
+{
+ if (!wav || wav_len == 0) {
+ return NULL;
+ }
+
+ size_t b64_len = 0;
+ int rc = mbedtls_base64_encode(NULL, 0, &b64_len, wav, wav_len);
+ if (rc != MBEDTLS_ERR_BASE64_BUFFER_TOO_SMALL && rc != 0) {
+ return NULL;
+ }
+
+ const char *prefix = "data:audio/wav;base64,";
+ size_t prefix_len = strlen(prefix);
+ char *out = (char *)malloc(prefix_len + b64_len + 1);
+ if (!out) {
+ return NULL;
+ }
+
+ memcpy(out, prefix, prefix_len);
+
+ size_t actual = 0;
+ rc = mbedtls_base64_encode((unsigned char *)(out + prefix_len),
+ b64_len,
+ &actual,
+ wav,
+ wav_len);
+ if (rc != 0) {
+ free(out);
+ return NULL;
+ }
+
+ out[prefix_len + actual] = '\0';
+ return out;
+}
+
+static esp_err_t parse_stt_response_text(const char *json, char *out_text, size_t out_size)
+{
+ if (!json || !out_text || out_size == 0) {
+ return ESP_ERR_INVALID_ARG;
+ }
+
+ out_text[0] = '\0';
+
+ cJSON *root = cJSON_Parse(json);
+ if (!root) {
+ return ESP_ERR_INVALID_RESPONSE;
+ }
+
+ cJSON *choices = cJSON_GetObjectItem(root, "choices");
+ cJSON *choice0 = (choices && cJSON_IsArray(choices)) ? cJSON_GetArrayItem(choices, 0) : NULL;
+ cJSON *message = choice0 ? cJSON_GetObjectItem(choice0, "message") : NULL;
+ cJSON *content = message ? cJSON_GetObjectItem(message, "content") : NULL;
+
+ if (cJSON_IsString(content) && content->valuestring) {
+ strlcpy(out_text, content->valuestring, out_size);
+ cJSON_Delete(root);
+ return ESP_OK;
+ }
+
+ cJSON_Delete(root);
+ return ESP_ERR_NOT_FOUND;
+}
+
+static esp_err_t parse_tts_audio_url(const char *json, char *out_url, size_t out_size)
+{
+ if (!json || !out_url || out_size == 0) {
+ return ESP_ERR_INVALID_ARG;
+ }
+
+ out_url[0] = '\0';
+
+ cJSON *root = cJSON_Parse(json);
+ if (!root) {
+ return ESP_ERR_INVALID_RESPONSE;
+ }
+
+ cJSON *output = cJSON_GetObjectItem(root, "output");
+ cJSON *audio = output ? cJSON_GetObjectItem(output, "audio") : NULL;
+ cJSON *url = audio ? cJSON_GetObjectItem(audio, "url") : NULL;
+
+ if (cJSON_IsString(url) && url->valuestring && url->valuestring[0]) {
+ strlcpy(out_url, url->valuestring, out_size);
+ cJSON_Delete(root);
+ return ESP_OK;
+ }
+
+ cJSON_Delete(root);
+ return ESP_ERR_NOT_FOUND;
+}
+
+/* =========================
+ * Bus integration
+ * ========================= */
+
+static void push_voice_inbound(const char *text)
+{
+ if (!text || !text[0]) {
+ return;
+ }
+
+ mimi_msg_t msg = {0};
+ strlcpy(msg.channel, MIMI_CHAN_VOICE, sizeof(msg.channel));
+ strlcpy(msg.chat_id, MIMI_VOICE_CHAT_ID, sizeof(msg.chat_id));
+ msg.content = strdup(text);
+
+ if (!msg.content) {
+ ESP_LOGE(TAG, "No memory for voice inbound text");
+ return;
+ }
+
+ if (message_bus_push_inbound(&msg) != ESP_OK) {
+ ESP_LOGW(TAG, "Inbound queue full, drop voice transcript");
+ free(msg.content);
+ }
+}
+
+/* =========================
+ * I2S init / playback
+ * ========================= */
+
+static esp_err_t i2s_init_xvf3800(void)
+{
+ esp_err_t err;
+
+ i2s_chan_config_t chan_cfg = {
+ .id = (i2s_port_t)MIMI_VOICE_I2S_PORT,
+ .role = I2S_ROLE_MASTER,
+ .dma_desc_num = MIMI_VOICE_I2S_DMA_DESC_NUM,
+ .dma_frame_num = MIMI_VOICE_I2S_DMA_FRAME_NUM,
+ .auto_clear_after_cb = false,
+ .auto_clear_before_cb = false,
+ .allow_pd = false,
+ .intr_priority = 0,
+ };
+
+ err = i2s_new_channel(&chan_cfg, &s_tx_chan, &s_rx_chan);
+ if (err != ESP_OK) {
+ ESP_LOGE(TAG, "i2s_new_channel failed: %s", esp_err_to_name(err));
+ return err;
+ }
+
+ i2s_std_config_t rx_cfg = {
+ .clk_cfg = I2S_STD_CLK_DEFAULT_CONFIG(MIMI_VOICE_SAMPLE_RATE),
+ .slot_cfg = I2S_STD_PHILIPS_SLOT_DEFAULT_CONFIG(I2S_DATA_BIT_WIDTH_32BIT, I2S_SLOT_MODE_STEREO),
+ .gpio_cfg = {
+ .mclk = I2S_GPIO_UNUSED,
+ .bclk = MIMI_VOICE_I2S_BCLK,
+ .ws = MIMI_VOICE_I2S_WS,
+ .dout = I2S_GPIO_UNUSED,
+ .din = MIMI_VOICE_I2S_DIN,
+ .invert_flags = {
+ .mclk_inv = false,
+ .bclk_inv = false,
+ .ws_inv = false,
+ },
+ },
+ };
+
+ i2s_std_config_t tx_cfg = {
+ .clk_cfg = I2S_STD_CLK_DEFAULT_CONFIG(MIMI_VOICE_SAMPLE_RATE),
+ .slot_cfg = I2S_STD_MSB_SLOT_DEFAULT_CONFIG(I2S_DATA_BIT_WIDTH_32BIT, I2S_SLOT_MODE_STEREO),
+ .gpio_cfg = {
+ .mclk = I2S_GPIO_UNUSED,
+ .bclk = MIMI_VOICE_I2S_BCLK,
+ .ws = MIMI_VOICE_I2S_WS,
+ .dout = MIMI_VOICE_I2S_DOUT,
+ .din = I2S_GPIO_UNUSED,
+ .invert_flags = {
+ .mclk_inv = false,
+ .bclk_inv = false,
+ .ws_inv = false,
+ },
+ },
+ };
+
+ err = i2s_channel_init_std_mode(s_rx_chan, &rx_cfg);
+ if (err != ESP_OK) {
+ ESP_LOGE(TAG, "i2s rx init failed: %s", esp_err_to_name(err));
+ return err;
+ }
+
+ err = i2s_channel_init_std_mode(s_tx_chan, &tx_cfg);
+ if (err != ESP_OK) {
+ ESP_LOGE(TAG, "i2s tx init failed: %s", esp_err_to_name(err));
+ return err;
+ }
+
+ /* Seed the TX DMA buffers with silence before enabling the channel;
+ * otherwise some DAC/amp combinations emit thumping or static noises caused
+ * by undefined initial DMA content, or by the last buffer repeating while
+ * the line is idle.
+ *
+ * Preloading is only allowed before the channel is enabled.
+ */
+ {
+ int32_t zeros[128 * 2] = {0};
+ size_t loaded = 0;
+ (void)i2s_channel_preload_data(s_tx_chan, zeros, sizeof(zeros), &loaded);
+ }
+
+ err = i2s_channel_enable(s_rx_chan);
+ if (err != ESP_OK) {
+ ESP_LOGE(TAG, "i2s rx enable failed: %s", esp_err_to_name(err));
+ return err;
+ }
+
+ err = i2s_channel_enable(s_tx_chan);
+ if (err != ESP_OK) {
+ ESP_LOGE(TAG, "i2s tx enable failed: %s", esp_err_to_name(err));
+ return err;
+ }
+
+ s_i2s_ready = true;
+ ESP_LOGI(TAG, "I2S ready: %dHz stereo s32 in / stereo s32 out (%s timing)",
+ MIMI_VOICE_SAMPLE_RATE,
+ i2s_slot_style_str());
+ return ESP_OK;
+}
+
+static int16_t *resample_s16_mono_linear(const int16_t *src,
+ size_t src_samples,
+ uint32_t src_rate,
+ uint32_t dst_rate,
+ size_t *out_samples)
+{
+ if (!src || src_samples == 0 || !out_samples || src_rate == 0 || dst_rate == 0) {
+ return NULL;
+ }
+
+ if (src_rate == dst_rate) {
+ int16_t *copy = (int16_t *)malloc_prefer_spiram(src_samples * sizeof(int16_t));
+ if (!copy) {
+ return NULL;
+ }
+ memcpy(copy, src, src_samples * sizeof(int16_t));
+ *out_samples = src_samples;
+ return copy;
+ }
+
+ const bool is_downsampling = src_rate > dst_rate;
+
+ size_t dst_samples = (size_t)(((uint64_t)src_samples * dst_rate) / src_rate);
+ if (dst_samples == 0) {
+ return NULL;
+ }
+
+ int16_t *dst = (int16_t *)malloc_prefer_spiram(dst_samples * sizeof(int16_t));
+ if (!dst) {
+ return NULL;
+ }
+
+ /* When downsampling (e.g. 24k -> 16k), naive linear interpolation folds
+ * high-frequency content above the new Nyquist back into the audible band
+ * (aliasing), often heard as hiss on sibilants and background noise.
+ *
+ * Apply a tiny 5-tap low-pass FIR [1,4,6,4,1]/16 on the source indices we
+ * touch. This is cheap and noticeably improves subjective quality without
+ * pulling in DSP dependencies.
+ */
+ for (size_t i = 0; i < dst_samples; i++) {
+ float src_pos = ((float)i * (float)src_rate) / (float)dst_rate;
+ size_t idx = (size_t)src_pos;
+ float frac = src_pos - (float)idx;
+
+ if (idx >= src_samples - 1) {
+ dst[i] = src[src_samples - 1];
+ } else {
+ float a = (float)(is_downsampling ? fir5_s16_at_clamped(src, src_samples, idx) : src[idx]);
+ float b = (float)(is_downsampling ? fir5_s16_at_clamped(src, src_samples, idx + 1) : src[idx + 1]);
+ float v = a + (b - a) * frac;
+
+ if (v > 32767.0f) v = 32767.0f;
+ if (v < -32768.0f) v = -32768.0f;
+ dst[i] = (int16_t)v;
+ }
+ }
+
+ *out_samples = dst_samples;
+ return dst;
+}
+
+static esp_err_t i2s_play_wav_pcm16(const uint8_t *wav, size_t wav_len)
+{
+ if (!s_i2s_ready || !s_tx_chan || !wav || wav_len == 0) {
+ return ESP_ERR_INVALID_STATE;
+ }
+
+ wav_fmt_t fmt;
+ const uint8_t *pcm = NULL;
+ size_t pcm_len = 0;
+
+ esp_err_t err = wav_find_data_chunk(wav, wav_len, &fmt, &pcm, &pcm_len);
+ if (err != ESP_OK) {
+ ESP_LOGE(TAG, "wav_find_data_chunk failed: %s", esp_err_to_name(err));
+ return err;
+ }
+
+ ESP_LOGI(TAG, "WAV fmt: format=%u channels=%u sample_rate=%u bits=%u data_len=%u",
+ (unsigned)fmt.audio_format,
+ (unsigned)fmt.channels,
+ (unsigned)fmt.sample_rate,
+ (unsigned)fmt.bits_per_sample,
+ (unsigned)pcm_len);
+
+ if (fmt.audio_format != 1 || fmt.bits_per_sample != 16) {
+ return ESP_ERR_NOT_SUPPORTED;
+ }
+
+ const int16_t *src16 = (const int16_t *)pcm;
+ size_t src_samples_total = pcm_len / sizeof(int16_t);
+
+ const int16_t *mono_src = NULL;
+ int16_t *mono_owned = NULL;
+ size_t mono_samples = 0;
+
+ if (fmt.channels == 1) {
+ mono_src = src16;
+ mono_samples = src_samples_total;
+ } else if (fmt.channels == 2) {
+ mono_samples = src_samples_total / 2;
+ mono_owned = (int16_t *)malloc_prefer_spiram(mono_samples * sizeof(int16_t));
+ if (!mono_owned) {
+ return ESP_ERR_NO_MEM;
+ }
+
+ for (size_t i = 0, j = 0; j < mono_samples; i += 2, j++) {
+ int32_t v = ((int32_t)src16[i] + (int32_t)src16[i + 1]) / 2;
+ mono_owned[j] = (int16_t)v;
+ }
+ mono_src = mono_owned;
+ } else {
+ return ESP_ERR_NOT_SUPPORTED;
+ }
+
+ const int16_t *play_src = NULL;
+ int16_t *play_owned = NULL;
+ size_t play_samples = 0;
+
+ if (fmt.sample_rate == MIMI_VOICE_SAMPLE_RATE) {
+ play_src = mono_src;
+ play_samples = mono_samples;
+ } else {
+ play_owned = resample_s16_mono_linear(
+ mono_src,
+ mono_samples,
+ fmt.sample_rate,
+ MIMI_VOICE_SAMPLE_RATE,
+ &play_samples
+ );
+ free(mono_owned);
+ mono_owned = NULL;
+
+ if (!play_owned || play_samples == 0) {
+ return ESP_ERR_NO_MEM;
+ }
+ play_src = play_owned;
+ }
+
+ ESP_LOGI(TAG, "Playback PCM: %u samples @ %u Hz (~%u ms)",
+ (unsigned)play_samples,
+ (unsigned)MIMI_VOICE_SAMPLE_RATE,
+ (unsigned)((play_samples * 1000ULL) / MIMI_VOICE_SAMPLE_RATE));
+
+ s_is_playing = true;
+
+ size_t frames_total = play_samples;
+ size_t frames_sent = 0;
+
+ while (frames_sent < frames_total) {
+ const size_t frames_chunk = (frames_total - frames_sent > 256) ? 256 : (frames_total - frames_sent);
+
+ int32_t tx_buf[256 * 2];
+ for (size_t i = 0; i < frames_chunk; i++) {
+ int16_t s16 = play_src[frames_sent + i];
+ int32_t s32 = ((int32_t)s16) << 16;
+ tx_buf[i * 2 + 0] = s32;
+ tx_buf[i * 2 + 1] = s32;
+ }
+
+ const uint8_t *p = (const uint8_t *)tx_buf;
+ size_t bytes_total = frames_chunk * 2 * sizeof(int32_t);
+ size_t bytes_sent = 0;
+
+ while (bytes_sent < bytes_total) {
+ size_t written = 0;
+ err = i2s_channel_write(s_tx_chan,
+ p + bytes_sent,
+ bytes_total - bytes_sent,
+ &written,
+ pdMS_TO_TICKS(1000));
+ if (err != ESP_OK) {
+ ESP_LOGE(TAG, "i2s write failed: %s", esp_err_to_name(err));
+ free(play_owned);
+ free(mono_owned);
+ s_is_playing = false;
+ return err;
+ }
+ if (written == 0) {
+ ESP_LOGE(TAG, "i2s write returned 0 bytes");
+ free(play_owned);
+ free(mono_owned);
+ s_is_playing = false;
+ return ESP_FAIL;
+ }
+ bytes_sent += written;
+ }
+
+ frames_sent += frames_chunk;
+ }
+
+ /* Leave a short silence tail so the TX engine doesn't keep repeating the
+ * last non-zero DMA buffer (often heard as continuous thumping while idle).
+ */
+ (void)i2s_tx_write_silence_ms(MIMI_VOICE_TX_SILENCE_TAIL_MS);
+ (void)i2s_tx_overwrite_dma_with_zeros();
+
+ free(play_owned);
+ free(mono_owned);
+ s_is_playing = false;
+ return ESP_OK;
+}
+
+/* =========================
+ * STT / TTS core
+ * ========================= */
+
+static esp_err_t stt_transcribe_pcm(const int16_t *pcm,
+ size_t pcm_bytes,
+ char *out_text,
+ size_t out_text_size)
+{
+ if (!pcm || pcm_bytes == 0 || !out_text || out_text_size == 0) {
+ return ESP_ERR_INVALID_ARG;
+ }
+ if (!stt_api_url()[0] || !stt_api_key()[0]) {
+ return ESP_ERR_INVALID_STATE;
+ }
+
+ out_text[0] = '\0';
+
+ uint8_t *wav = NULL;
+ size_t wav_len = wav_build_from_pcm16(pcm, pcm_bytes, MIMI_VOICE_SAMPLE_RATE, 1, &wav);
+ if (!wav || wav_len == 0) {
+ return ESP_ERR_NO_MEM;
+ }
+
+ char *data_url = build_data_url_from_wav(wav, wav_len);
+ free(wav);
+ if (!data_url) {
+ return ESP_ERR_NO_MEM;
+ }
+
+ cJSON *root = cJSON_CreateObject();
+ cJSON_AddStringToObject(root, "model", stt_model());
+ cJSON_AddBoolToObject(root, "stream", false);
+
+ cJSON *messages = cJSON_CreateArray();
+ cJSON *msg = cJSON_CreateObject();
+ cJSON_AddStringToObject(msg, "role", "user");
+
+ cJSON *content = cJSON_CreateArray();
+ cJSON *audio_item = cJSON_CreateObject();
+ cJSON_AddStringToObject(audio_item, "type", "input_audio");
+
+ cJSON *input_audio = cJSON_CreateObject();
+ cJSON_AddStringToObject(input_audio, "data", data_url);
+ cJSON_AddItemToObject(audio_item, "input_audio", input_audio);
+ cJSON_AddItemToArray(content, audio_item);
+
+ cJSON_AddItemToObject(msg, "content", content);
+ cJSON_AddItemToArray(messages, msg);
+ cJSON_AddItemToObject(root, "messages", messages);
+
+ cJSON *asr_options = cJSON_CreateObject();
+ cJSON_AddBoolToObject(asr_options, "enable_itn", false);
+ cJSON_AddItemToObject(root, "asr_options", asr_options);
+
+ char *body = cJSON_PrintUnformatted(root);
+ cJSON_Delete(root);
+ free(data_url);
+
+ if (!body) {
+ return ESP_ERR_NO_MEM;
+ }
+
+ http_resp_t resp = {0};
+ int http_status = 0;
+ esp_err_t err = http_post_json(stt_api_url(), stt_api_key(), body, false, &resp, &http_status);
+ free(body);
+
+ if (err != ESP_OK) {
+ ESP_LOGE(TAG, "STT HTTP failed: %s", esp_err_to_name(err));
+ free(resp.buf);
+ return err;
+ }
+ if (http_status < 200 || http_status >= 300) {
+ ESP_LOGE(TAG, "STT HTTP status=%d body=%s", http_status, resp.buf ? resp.buf : "");
+ free(resp.buf);
+ return ESP_FAIL;
+ }
+
+ err = parse_stt_response_text(resp.buf ? resp.buf : "", out_text, out_text_size);
+ if (err != ESP_OK) {
+ ESP_LOGE(TAG, "STT parse failed, body=%s", resp.buf ? resp.buf : "");
+ } else {
+ ESP_LOGI(TAG, "STT transcript: %s", out_text);
+ }
+
+ free(resp.buf);
+ return err;
+}
+
+static esp_err_t tts_stream_play(const char *text)
+{
+ if (!text || !text[0]) {
+ return ESP_ERR_INVALID_ARG;
+ }
+ if (!tts_api_url()[0] || !tts_api_key()[0]) {
+ return ESP_ERR_INVALID_STATE;
+ }
+
+ cJSON *body = cJSON_CreateObject();
+ cJSON_AddStringToObject(body, "model", tts_model());
+
+ cJSON *input = cJSON_CreateObject();
+ cJSON_AddStringToObject(input, "text", text);
+ cJSON_AddStringToObject(input, "voice", tts_voice());
+ cJSON_AddStringToObject(input, "language_type", tts_language());
+ cJSON_AddItemToObject(body, "input", input);
+
+ char *json = cJSON_PrintUnformatted(body);
+ cJSON_Delete(body);
+
+ if (!json) {
+ return ESP_ERR_NO_MEM;
+ }
+
+ http_resp_t resp = {0};
+ int http_status = 0;
+ esp_err_t err = http_post_json(tts_api_url(), tts_api_key(), json, false, &resp, &http_status);
+ free(json);
+
+ if (err != ESP_OK) {
+ ESP_LOGE(TAG, "TTS HTTP failed: %s", esp_err_to_name(err));
+ free(resp.buf);
+ return err;
+ }
+ if (http_status < 200 || http_status >= 300) {
+ ESP_LOGE(TAG, "TTS HTTP status=%d body=%s", http_status, resp.buf ? resp.buf : "");
+ free(resp.buf);
+ return ESP_FAIL;
+ }
+
+ char wav_url[1024] = {0};
+ err = parse_tts_audio_url(resp.buf ? resp.buf : "", wav_url, sizeof(wav_url));
+ if (err != ESP_OK) {
+ ESP_LOGE(TAG, "TTS parse failed, body=%s", resp.buf ? resp.buf : "");
+ free(resp.buf);
+ return err;
+ }
+ free(resp.buf);
+
+ ESP_LOGI(TAG, "TTS audio url: %s", wav_url);
+
+ http_resp_t wav_resp = {0};
+ http_status = 0;
+ err = http_get_binary(wav_url, &wav_resp, &http_status);
+ if (err != ESP_OK) {
+ ESP_LOGE(TAG, "TTS wav download failed: %s", esp_err_to_name(err));
+ free(wav_resp.buf);
+ return err;
+ }
+ if (http_status < 200 || http_status >= 300) {
+ ESP_LOGE(TAG, "TTS wav status=%d", http_status);
+ free(wav_resp.buf);
+ return ESP_FAIL;
+ }
+ ESP_LOGI(TAG, "TTS wav http_status=%d len=%d", http_status, (int)wav_resp.len);
+
+ if (wav_resp.len >= 12) {
+ ESP_LOGI(TAG, "TTS wav magic: %.4s / %.4s",
+ wav_resp.buf,
+ wav_resp.buf + 8);
+ }
+
+ if (wav_resp.len >= 4 && memcmp(wav_resp.buf, "RIFF", 4) != 0) {
+ ESP_LOGE(TAG, "TTS response is not WAV, preview: %.120s", wav_resp.buf);
+ }
+
+ err = i2s_play_wav_pcm16((const uint8_t *)wav_resp.buf, wav_resp.len);
+ free(wav_resp.buf);
+ return err;
+}
+
+/* =========================
+ * Voice capture loop
+ * ========================= */
+
+static void voice_capture_task(void *arg)
+{
+ (void)arg;
+
+ const size_t frame_samples = (MIMI_VOICE_SAMPLE_RATE * MIMI_VOICE_FRAME_MS) / 1000;
+ const size_t stereo_frame_bytes = frame_samples * VOICE_I2S_BYTES_PER_STEREO_FRAME;
+ const size_t mono16_frame_bytes = frame_samples * sizeof(int16_t);
+ const size_t max_frames = MIMI_VOICE_MAX_UTTERANCE_MS / MIMI_VOICE_FRAME_MS;
+ const size_t silence_frames_end = MIMI_VOICE_SILENCE_END_MS / MIMI_VOICE_FRAME_MS;
+
+ uint8_t *rx_buf = (uint8_t *)heap_caps_malloc(stereo_frame_bytes, MALLOC_CAP_SPIRAM);
+ int16_t *mono_frame = (int16_t *)heap_caps_malloc(mono16_frame_bytes, MALLOC_CAP_SPIRAM);
+ int16_t *utterance = (int16_t *)heap_caps_malloc(max_frames * frame_samples * sizeof(int16_t), MALLOC_CAP_SPIRAM);
+
+ if (!rx_buf || !mono_frame || !utterance) {
+ ESP_LOGE(TAG, "voice_capture_task alloc failed");
+ free(rx_buf);
+ free(mono_frame);
+ free(utterance);
+ vTaskDelete(NULL);
+ return;
+ }
+
+ bool in_speech = false;
+ size_t total_frames = 0;
+ size_t silence_frames = 0;
+ size_t start_frames = 0;
+ TickType_t cooldown_until = 0;
+
+ /* Simple adaptive noise floor */
+ uint32_t noise_floor = MIMI_VOICE_VAD_THRESHOLD / 2;
+ if (noise_floor < 100) noise_floor = 100;
+
+ while (1) {
+ if (!s_i2s_ready || !s_rx_chan) {
+ vTaskDelay(pdMS_TO_TICKS(100));
+ continue;
+ }
+
+ if (s_is_playing) {
+ /* Avoid self-trigger during playback */
+ vTaskDelay(pdMS_TO_TICKS(MIMI_VOICE_FRAME_MS));
+ continue;
+ }
+
+ TickType_t now = xTaskGetTickCount();
+ if (cooldown_until != 0 && now < cooldown_until) {
+ vTaskDelay(cooldown_until - now);
+ continue;
+ }
+
+ size_t bytes_read = 0;
+ esp_err_t err = i2s_channel_read(s_rx_chan,
+ rx_buf,
+ stereo_frame_bytes,
+ &bytes_read,
+ pdMS_TO_TICKS(1000));
+ if (err != ESP_OK || bytes_read == 0) {
+ continue;
+ }
+
+ size_t mono_samples = 0;
+ pcm_s32_stereo_to_s16_mono(rx_buf, bytes_read, mono_frame, &mono_samples);
+ if (mono_samples == 0) {
+ continue;
+ }
+
+ uint32_t energy = pcm_energy_absavg(mono_frame, mono_samples);
+
+ /* Update noise floor only when not in speech */
+ if (!in_speech) {
+ noise_floor = (noise_floor * 15 + energy) / 16;
+ }
+
+ uint32_t dynamic_threshold = noise_floor + MIMI_VOICE_VAD_THRESHOLD;
+ bool speech_now = (energy > dynamic_threshold);
+
+ if (!in_speech) {
+ if (!speech_now) {
+ start_frames = 0;
+ continue;
+ }
+ start_frames++;
+ if (start_frames < MIMI_VOICE_VAD_START_FRAMES) {
+ continue;
+ }
+ in_speech = true;
+ total_frames = 0;
+ silence_frames = 0;
+ start_frames = 0;
+ }
+
+ if (total_frames < max_frames) {
+ memcpy(&utterance[total_frames * frame_samples], mono_frame, mono16_frame_bytes);
+ total_frames++;
+ }
+
+ if (speech_now) {
+ silence_frames = 0;
+ } else {
+ silence_frames++;
+ }
+
+ bool end_by_silence = (silence_frames >= silence_frames_end);
+ bool end_by_limit = (total_frames >= max_frames);
+
+ if (!end_by_silence && !end_by_limit) {
+ continue;
+ }
+
+ in_speech = false;
+
+ /* Ignore ultra-short bursts */
+ if (total_frames < MIMI_VOICE_VAD_MIN_FRAMES) {
+ total_frames = 0;
+ silence_frames = 0;
+ cooldown_until = xTaskGetTickCount() + pdMS_TO_TICKS(MIMI_VOICE_STT_COOLDOWN_MS);
+ continue;
+ }
+
+ size_t pcm_bytes = total_frames * frame_samples * sizeof(int16_t);
+ char text[512] = {0};
+
+ if (xSemaphoreTake(s_http_lock, pdMS_TO_TICKS(30000)) == pdTRUE) {
+ esp_err_t stt_err = stt_transcribe_pcm(utterance, pcm_bytes, text, sizeof(text));
+ xSemaphoreGive(s_http_lock);
+
+ if (stt_err == ESP_OK && text[0]) {
+ ESP_LOGI(TAG, "Voice STT: %s", text);
+ push_voice_inbound(text);
+ } else {
+ ESP_LOGW(TAG, "STT failed or empty transcript");
+ }
+ }
+
+ total_frames = 0;
+ silence_frames = 0;
+ cooldown_until = xTaskGetTickCount() + pdMS_TO_TICKS(MIMI_VOICE_STT_COOLDOWN_MS);
+ }
+}
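+
+/* Tuning note for the capture loop above: noise_floor is an exponential
+ * moving average with alpha = 1/16, i.e. about a 16-frame time constant
+ * (e.g. ~320 ms of ambient history at 20 ms frames), so brief pauses do not
+ * drag the floor down while slow changes in room noise still track within a
+ * second or two. The trigger threshold is floor-relative
+ * (noise_floor + MIMI_VOICE_VAD_THRESHOLD), which keeps sensitivity roughly
+ * stable across quiet and noisy rooms.
+ */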
+
+/* =========================
+ * Public API
+ * ========================= */
+
+esp_err_t voice_channel_init(void)
+{
+ s_enabled = (MIMI_VOICE_ENABLED_DEFAULT != 0) ||
+ (stt_api_key()[0] && tts_api_key()[0]);
+
+ if (!s_enabled) {
+ ESP_LOGI(TAG, "Voice channel disabled (set STT/TTS API key or enable default)");
+ return ESP_OK;
+ }
+
+ esp_err_t err = i2s_init_xvf3800();
+ if (err != ESP_OK) {
+ ESP_LOGW(TAG, "Voice channel disabled: I2S init failed (%s)", esp_err_to_name(err));
+ s_enabled = false;
+ return ESP_OK; /* degrade gracefully; the rest of the firmware still runs */
+ }
+
+ s_http_lock = xSemaphoreCreateMutex();
+ if (!s_http_lock) {
+ ESP_LOGE(TAG, "Voice init failed: cannot allocate mutex");
+ s_enabled = false;
+ return ESP_ERR_NO_MEM;
+ }
+
+ return ESP_OK;
+}
+
+esp_err_t voice_channel_start(void)
+{
+ if (!s_enabled || !s_i2s_ready) {
+ return ESP_OK;
+ }
+
+ if (!s_capture_task) {
+ if (xTaskCreatePinnedToCore(voice_capture_task,
+ "voice_cap",
+ MIMI_VOICE_CAPTURE_STACK,
+ NULL,
+ MIMI_VOICE_TASK_PRIO,
+ &s_capture_task,
+ MIMI_VOICE_CORE) != pdPASS) {
+ return ESP_FAIL;
+ }
+ }
+
+ ESP_LOGI(TAG, "Voice channel started");
+ return ESP_OK;
+}
+
+esp_err_t voice_channel_speak_text(const char *text)
+{
+ if (!s_enabled || !s_i2s_ready || !text || text[0] == '\0') {
+ return ESP_ERR_INVALID_STATE;
+ }
+
+ if (xSemaphoreTake(s_http_lock, pdMS_TO_TICKS(30000)) != pdTRUE) {
+ return ESP_ERR_TIMEOUT;
+ }
+
+ char *tts_text = voice_build_tts_text(text);
+ esp_err_t err = tts_stream_play(tts_text ? tts_text : text);
+ free(tts_text);
+
+ xSemaphoreGive(s_http_lock);
+ return err;
+}
+
+bool voice_channel_is_enabled(void)
+{
+ return s_enabled;
+}
+
+void voice_channel_get_status(voice_channel_status_t *status)
+{
+ if (!status) {
+ return;
+ }
+
+ status->enabled = s_enabled;
+ status->i2s_ready = s_i2s_ready;
+ status->is_playing = s_is_playing;
+ status->stt_configured = (stt_api_url()[0] != '\0' && stt_api_key()[0] != '\0');
+ status->tts_configured = (tts_api_url()[0] != '\0' && tts_api_key()[0] != '\0');
+}
diff --git a/main/voice/voice_channel.h b/main/voice/voice_channel.h
new file mode 100644
index 00000000..ffcc2504
--- /dev/null
+++ b/main/voice/voice_channel.h
@@ -0,0 +1,32 @@
+#pragma once
+
+#include <stdbool.h>
+#include "esp_err.h"
+
+typedef struct {
+ bool enabled;
+ bool i2s_ready;
+ bool is_playing;
+ bool stt_configured;
+ bool tts_configured;
+} voice_channel_status_t;
+
+/*
+ * Voice channel for ReSpeaker XVF3800 over I2S.
+ *
+ * Inbound path:
+ * Mic PCM -> VAD utterance -> STT -> message_bus inbound (channel=voice)
+ *
+ * Outbound path:
+ * Agent text (channel=voice) -> TTS -> speaker playback (I2S)
+ */
+esp_err_t voice_channel_init(void);
+esp_err_t voice_channel_start(void);
+
+/*
+ * Convert text to speech and enqueue for playback.
+ */
+esp_err_t voice_channel_speak_text(const char *text);
+
+bool voice_channel_is_enabled(void);
+void voice_channel_get_status(voice_channel_status_t *status);
diff --git a/partitions.csv b/partitions.csv
index 24c87784..017cf429 100644
--- a/partitions.csv
+++ b/partitions.csv
@@ -4,5 +4,5 @@ otadata, data, ota, 0xF000, 0x2000
phy_init, data, phy, 0x11000, 0x1000
ota_0, app, ota_0, 0x20000, 0x200000
ota_1, app, ota_1, 0x220000, 0x200000
-spiffs, data, spiffs, 0x420000, 0xBD0000
-coredump, data, coredump,0xFF0000, 0x10000
+spiffs, data, spiffs, 0x420000, 0x3D0000
+coredump, data, coredump,0x7F0000, 0x10000
diff --git a/sdkconfig.defaults.esp32s3 b/sdkconfig.defaults.esp32s3
index 4774cd93..eed91926 100644
--- a/sdkconfig.defaults.esp32s3
+++ b/sdkconfig.defaults.esp32s3
@@ -2,7 +2,7 @@
CONFIG_IDF_TARGET="esp32s3"
-# Flash 16MB + QIO
-CONFIG_ESPTOOLPY_FLASHSIZE_16MB=y
+# Flash 8MB + QIO
+CONFIG_ESPTOOLPY_FLASHSIZE_8MB=y
CONFIG_ESPTOOLPY_FLASHMODE_QIO=y
# CPU 240MHz