diff --git a/README.md b/README.md index b733fbf8..241477b5 100644 --- a/README.md +++ b/README.md @@ -1,318 +1,265 @@ -# MimiClaw: Pocket AI Assistant on a $5 Chip +# reSpeaker-claw: Voice AI Agent for ReSpeaker XVF3800

- MimiClaw -

- -

- License: MIT - DeepWiki - Discord - X + License: MIT + Language: C + Framework: ESP-IDF v5.5+ + Hardware: ReSpeaker XVF3800 + Architecture: Voice Agent

English | 中文 | 日本語

-**The world's first AI assistant(OpenClaw) on a $5 chip. No Linux. No Node.js. Just pure C** - -MimiClaw turns a tiny ESP32-S3 board into a personal AI assistant. Plug it into USB power, connect to WiFi, and talk to it through Telegram — it handles any task you throw at it and evolves over time with local memory — all on a chip the size of a thumb. +reSpeaker-claw turns a ReSpeaker XVF3800–based device into a voice-first AI agent. It captures audio over I2S, performs local VAD, sends utterances to STT, and processes them through an embedded agent loop. The system combines real-time speech interaction with local memory, tool calling, scheduling, heartbeat processes, OTA updates, and proxy support, and returns responses via TTS through the speaker. -## Meet MimiClaw +## Meet reSpeaker-claw - **Tiny** — No Linux, no Node.js, no bloat — just pure C -- **Handy** — Message it from Telegram, it handles the rest - **Loyal** — Learns from memory, remembers across reboots -- **Energetic** — USB power, 0.5 W, runs 24/7 -- **Lovable** — One ESP32-S3 board, $5, nothing else +- **Energetic** — USB powered, low power draw, runs 24/7 +- **Flexible** — ReSpeaker XVF3800's mic array + your choice of speaker amp/DAC +- **Handy** — Built-in voice channel, no extra hardware needed beyond the XVF3800 and a speaker path -## How It Works +## Highlights -![](assets/mimiclaw.png) +- Voice input: ReSpeaker XVF3800 microphone array over I2S +- Voice output: TTS audio download, WAV decode, resample, and speaker playback over I2S +- Multi-channel agent: voice, Telegram, Feishu, WebSocket +- Local persistence: SPIFFS stores memory, profiles, sessions, cron jobs, and daily notes +- Compatible LLM backends: official Anthropic/OpenAI APIs or third-party gateways that expose Anthropic-compatible or OpenAI-compatible endpoints +- Configurable STT/TTS: plug in your own service URL, API key, model, voice, and language +- Runtime overrides: change WiFi, provider, model, API base, proxy, and
tokens from the serial CLI without editing code -You send a message on Telegram. The ESP32-S3 picks it up over WiFi, feeds it into an agent loop — the LLM thinks, calls tools, reads memory — and sends the reply back. Supports both **Anthropic (Claude)** and **OpenAI (GPT)** as providers, switchable at runtime. Everything runs on a single $5 chip with all your data stored locally on flash. ## Quick Start -### What You Need +### Requirements -- An **ESP32-S3 dev board** with 16 MB flash and 8 MB PSRAM (e.g. Xiaozhi AI board, ~$10) -- A **USB Type-C cable** -- A **Telegram bot token** — talk to [@BotFather](https://t.me/BotFather) on Telegram to create one -- An **Anthropic API key** — from [console.anthropic.com](https://console.anthropic.com), or an **OpenAI API key** — from [platform.openai.com](https://platform.openai.com) +- A reSpeaker XVF3800 USB 4 Microphone Array with XIAO ESP32S3 board +- A speaker / DAC / amp path on I2S output +- A USB cable for flashing and serial monitoring +- WiFi access +- ESP-IDF v5.5+ +- Optional: Telegram bot token if you want Telegram +- Optional: Feishu app credentials if you want Feishu +- One LLM API key for an Anthropic-compatible or OpenAI-compatible endpoint +- One STT service and one TTS service for voice mode -### Install +### Clone and Build Environment -```bash -# You need ESP-IDF v5.5+ installed first: -# https://docs.espressif.com/projects/esp-idf/en/v5.5.2/esp32s3/get-started/ +Refer to the official guide to flash the I2S firmware: +[SeeedStudio wiki](https://wiki.seeedstudio.com/respeaker_xvf3800_introduction/#flash-firmware) + +Then clone this project and set the target: -git clone https://github.com/memovai/mimiclaw.git -cd mimiclaw +```bash +git clone https://github.com/Seeed-Projects/reSpeaker-claw +cd reSpeaker-claw idf.py set-target esp32s3 ``` -
-Ubuntu Install +Install ESP-IDF first: [ESP-IDF Install](https://docs.espressif.com/projects/esp-idf/en/v5.5.3/esp32s3/get-started/) -Recommended baseline: - -- Ubuntu 22.04/24.04 -- Python >= 3.10 -- CMake >= 3.16 -- Ninja >= 1.10 -- Git >= 2.34 -- flex >= 2.6 -- bison >= 3.8 -- gperf >= 3.1 -- dfu-util >= 0.11 -- `libusb-1.0-0`, `libffi-dev`, `libssl-dev` - -Install and build on Ubuntu: +Ubuntu helper scripts: ```bash -sudo apt-get update -sudo apt-get install -y git wget flex bison gperf python3 python3-pip python3-venv \ - cmake ninja-build ccache libffi-dev libssl-dev dfu-util libusb-1.0-0 - ./scripts/setup_idf_ubuntu.sh ./scripts/build_ubuntu.sh ``` -
- -
-macOS Install - -Recommended baseline: - -- macOS 12/13/14 -- Xcode Command Line Tools -- Homebrew -- Python >= 3.10 -- CMake >= 3.16 -- Ninja >= 1.10 -- Git >= 2.34 -- flex >= 2.6 -- bison >= 3.8 -- gperf >= 3.1 -- dfu-util >= 0.11 -- `libusb`, `libffi`, `openssl` - -Install and build on macOS: +macOS helper scripts: ```bash -xcode-select --install -/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)" - ./scripts/setup_idf_macos.sh ./scripts/build_macos.sh ``` -
- -### Configure +## Configure -MimiClaw uses a **two-layer config** system: build-time defaults in `mimi_secrets.h`, with runtime overrides via the serial CLI. CLI values are stored in NVS flash and take priority over build-time values. +Copy the example secrets file: ```bash -cp main/mimi_secrets.h.example main/mimi_secrets.h +cp "main/mimi_secrets.h.example" "main/mimi_secrets.h" ``` -Edit `main/mimi_secrets.h`: +Edit `main/mimi_secrets.h` and set the fields you actually use: ```c +/* WiFi */ #define MIMI_SECRET_WIFI_SSID "YourWiFiName" #define MIMI_SECRET_WIFI_PASS "YourWiFiPassword" -#define MIMI_SECRET_TG_TOKEN "123456:ABC-DEF1234ghIkl-zyx57W2v1u123ew11" -#define MIMI_SECRET_API_KEY "sk-ant-api03-xxxxx" -#define MIMI_SECRET_MODEL_PROVIDER "anthropic" // "anthropic" or "openai" -#define MIMI_SECRET_SEARCH_KEY "" // optional: Brave Search API key -#define MIMI_SECRET_TAVILY_KEY "" // optional: Tavily API key (preferred) -#define MIMI_SECRET_PROXY_HOST "" // optional: e.g. "10.0.0.1" -#define MIMI_SECRET_PROXY_PORT "" // optional: e.g. 
"7897" -``` - -Then build and flash: - -```bash -# Clean build (required after any mimi_secrets.h change) -idf.py fullclean && idf.py build - -# Find your serial port -ls /dev/cu.usb* # macOS -ls /dev/ttyACM* # Linux -# Flash and monitor (replace PORT with your port) -# USB adapter: likely /dev/cu.usbmodem11401 (macOS) or /dev/ttyACM0 (Linux) -idf.py -p PORT flash monitor +/* Optional text channels */ +#define MIMI_SECRET_TG_TOKEN "" +#define MIMI_SECRET_FEISHU_APP_ID "" +#define MIMI_SECRET_FEISHU_APP_SECRET "" + +/* LLM */ +#define MIMI_SECRET_API_KEY "your-llm-key" +#define MIMI_SECRET_MODEL "your-model" +#define MIMI_SECRET_MODEL_PROVIDER "openai" /* or "anthropic" */ + +/* Search and proxy */ +#define MIMI_SECRET_TAVILY_KEY "" +#define MIMI_SECRET_SEARCH_KEY "" +#define MIMI_SECRET_PROXY_HOST "" +#define MIMI_SECRET_PROXY_PORT "" +#define MIMI_SECRET_PROXY_TYPE "" /* "http" or "socks5" */ + +/* Voice STT / TTS */ +#define MIMI_SECRET_STT_URL "https://your-stt-endpoint" +#define MIMI_SECRET_STT_API_KEY "your-stt-key" +#define MIMI_SECRET_STT_MODEL "your-stt-model" +#define MIMI_SECRET_TTS_URL "https://your-tts-endpoint" +#define MIMI_SECRET_TTS_API_KEY "your-tts-key" +#define MIMI_SECRET_TTS_MODEL "your-tts-model" +#define MIMI_SECRET_TTS_VOICE "" +#define MIMI_SECRET_TTS_LANGUAGE "English" + +/* ReSpeaker XVF3800 I2S pin map */ +#define MIMI_VOICE_I2S_PORT 0 +#define MIMI_VOICE_I2S_BCLK GPIO_NUM_8 +#define MIMI_VOICE_I2S_WS GPIO_NUM_7 +#define MIMI_VOICE_I2S_DIN GPIO_NUM_43 +#define MIMI_VOICE_I2S_DOUT GPIO_NUM_44 ``` -> **Important: Plug into the correct USB port!** Most ESP32-S3 boards have two USB-C ports. You must use the one labeled **USB** (native USB Serial/JTAG), **not** the one labeled **COM** (external UART bridge). Plugging into the wrong port will cause flash/monitor failures. -> ->
-> Show reference photo -> -> Plug into the USB port, not COM -> ->
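The precedence between the two config layers can be pictured as a single lookup: a value set at runtime (written to NVS from the serial CLI) wins over the compile-time default from `mimi_secrets.h`. A minimal sketch of that rule — the function and parameter names here are illustrative, not the firmware's actual API:

```c
#include <stddef.h>

/* Illustrative two-layer config lookup: prefer the runtime (NVS) value,
 * fall back to the build-time default from mimi_secrets.h. */
const char *cfg_resolve(const char *nvs_value, const char *build_default)
{
    /* An empty or missing NVS entry means "not overridden". */
    if (nvs_value != NULL && nvs_value[0] != '\0')
        return nvs_value;
    return build_default;
}
```

Under this model, `config_reset` amounts to erasing the NVS entries so every lookup falls back to the build-time side.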
+Notes: -### CLI Commands (via UART/COM port) +- `MIMI_SECRET_MODEL_PROVIDER` selects the request protocol, not just the vendor name. +- Use `openai` for OpenAI-compatible gateways. +- Use `anthropic` for Anthropic-compatible gateways. +- Voice mode requires STT and TTS URL/key pairs to be configured. +- LLM API base can be changed at runtime with `set_api_base`. -Connect via serial to configure or debug. **Config commands** let you change settings without recompiling — just plug in a USB cable anywhere. +## Adding STT and TTS -**Runtime config** (saved to NVS, overrides build-time defaults): +This project no longer treats speech as an afterthought. To enable the full ReSpeaker experience: -``` -mimi> wifi_set MySSID MyPassword # change WiFi network -mimi> set_tg_token 123456:ABC... # change Telegram bot token -mimi> set_api_key sk-ant-api03-... # change API key (Anthropic or OpenAI) -mimi> set_model_provider openai # switch provider (anthropic|openai) -mimi> set_model gpt-4o # change LLM model -mimi> set_proxy 127.0.0.1 7897 # set HTTP proxy -mimi> clear_proxy # remove proxy -mimi> set_search_key BSA... # set Brave Search API key -mimi> set_tavily_key tvly-... # set Tavily API key (preferred) -mimi> config_show # show all config (masked) -mimi> config_reset # clear NVS, revert to build-time defaults -``` +1. Configure `MIMI_SECRET_STT_URL`, `MIMI_SECRET_STT_API_KEY`, and `MIMI_SECRET_STT_MODEL`. +2. Configure `MIMI_SECRET_TTS_URL`, `MIMI_SECRET_TTS_API_KEY`, `MIMI_SECRET_TTS_MODEL`, `MIMI_SECRET_TTS_VOICE`, and `MIMI_SECRET_TTS_LANGUAGE`. +3. Set the XVF3800 input pins and your speaker output pins in the I2S section. +4. If your DAC or amp sounds noisy, set `MIMI_VOICE_I2S_STD_SLOT_STYLE` to match the hardware timing style. +5. If your room causes false triggers, tune `MIMI_VOICE_VAD_START_FRAMES`, `MIMI_VOICE_VAD_MIN_FRAMES`, and `MIMI_VOICE_STT_COOLDOWN_MS`. +6. 
If your TTS audio is too long, tune `MIMI_VOICE_TTS_MAX_SECONDS`, `MIMI_VOICE_TTS_CHARS_PER_SEC`, and `MIMI_VOICE_TTS_MAX_CHARS`. -**Debug & maintenance:** +The current firmware already contains the full voice channel: -``` -mimi> wifi_status # am I connected? -mimi> memory_read # see what the bot remembers -mimi> memory_write "content" # write to MEMORY.md -mimi> heap_info # how much RAM is free? -mimi> session_list # list all chat sessions -mimi> session_clear 12345 # wipe a conversation -mimi> heartbeat_trigger # manually trigger a heartbeat check -mimi> cron_start # start cron scheduler now -mimi> restart # reboot -``` - -### USB (JTAG) vs UART: Which Port for What - -Most ESP32-S3 dev boards expose **two USB-C ports**: - -| Port | Use for | -|------|---------| -| **USB** (JTAG) | `idf.py flash`, JTAG debugging | -| **COM** (UART) | **REPL CLI**, serial console | - -> **REPL requires the UART (COM) port.** The USB (JTAG) port does not support interactive REPL input. +- inbound: mic PCM -> VAD -> STT -> message bus +- outbound: agent text -> TTS -> playback -
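The inbound half of that pipeline hinges on the VAD gate. A toy energy-based detector with start-frame hysteresis is sketched below — purely illustrative: the constants stand in for the `MIMI_VOICE_VAD_*` options and are not the firmware's real values or algorithm.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

#define VAD_START_FRAMES  3    /* stand-in for MIMI_VOICE_VAD_START_FRAMES */
#define VAD_ENERGY_THRESH 500  /* mean absolute amplitude, 16-bit PCM */

typedef struct {
    int  consecutive_active;   /* frames above threshold in a row */
    bool speaking;             /* true once an utterance has started */
} vad_state_t;

/* Mean absolute amplitude of one PCM frame. */
static int32_t frame_energy(const int16_t *pcm, size_t n)
{
    int64_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += pcm[i] < 0 ? -(int64_t)pcm[i] : pcm[i];
    return n ? (int32_t)(acc / (int64_t)n) : 0;
}

/* Feed one frame; returns true while the detector considers speech active. */
bool vad_process(vad_state_t *st, const int16_t *pcm, size_t n)
{
    if (frame_energy(pcm, n) >= VAD_ENERGY_THRESH) {
        if (++st->consecutive_active >= VAD_START_FRAMES)
            st->speaking = true;
    } else {
        st->consecutive_active = 0;
        st->speaking = false;  /* simplistic: a real detector adds hangover */
    }
    return st->speaking;
}
```

Raising the start-frame count makes the gate slower to open on short noises, which is the same lever `MIMI_VOICE_VAD_START_FRAMES` exposes.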
-Port details & recommended workflow +## Flash and Monitor -| Port | Label | Protocol | -|------|-------|----------| -| **USB** | USB / JTAG | Native USB Serial/JTAG | -| **COM** | UART / COM | External UART bridge (CP2102/CH340) | +After changing `main/mimi_secrets.h`, rebuild from a clean state: -The ESP-IDF console/REPL is configured to use UART by default (`CONFIG_ESP_CONSOLE_UART_DEFAULT=y`). - -**If you have both ports connected simultaneously:** - -- USB (JTAG) handles flash/download and provides secondary serial output -- UART (COM) provides the primary interactive console for the REPL -- macOS: both appear as `/dev/cu.usbmodem*` or `/dev/cu.usbserial-*` — run `ls /dev/cu.usb*` to identify -- Linux: USB (JTAG) → `/dev/ttyACM0`, UART → `/dev/ttyUSB0` +```bash +idf.py fullclean +idf.py build +``` -**Recommended workflow:** +Find your serial port: ```bash -# Flash via USB (JTAG) port -idf.py -p /dev/cu.usbmodem11401 flash - -# Open REPL via UART (COM) port -idf.py -p /dev/cu.usbserial-110 monitor -# or use any serial terminal: screen, minicom, PuTTY at 115200 baud +ls /dev/cu.usb* # macOS +ls /dev/ttyACM* # Linux ``` -
+Flash and monitor: -## Memory +```bash +idf.py -p PORT flash monitor +``` -MimiClaw stores everything as plain text files you can read and edit: +Replace `PORT` with your actual device path. -| File | What it is | -|------|------------| -| `SOUL.md` | The bot's personality — edit this to change how it behaves | -| `USER.md` | Info about you — name, preferences, language | -| `MEMORY.md` | Long-term memory — things the bot should always remember | -| `HEARTBEAT.md` | Task list the bot checks periodically and acts on autonomously | -| `cron.json` | Scheduled jobs — recurring or one-shot tasks created by the AI | -| `2026-02-05.md` | Daily notes — what happened today | -| `tg_12345.jsonl` | Chat history — your conversation with the bot | +## Serial CLI -## Tools +The serial CLI is the fastest way to change runtime settings stored in NVS: -MimiClaw supports tool calling for both Anthropic and OpenAI — the LLM can call tools during a conversation and loop until the task is done (ReAct pattern). +``` +mimi> wifi_set MySSID MyPassword +mimi> set_tg_token 123456:ABC... +mimi> set_api_key your-llm-key +mimi> set_api_base https://your-compatible-endpoint/v1 +mimi> set_model_provider openai +mimi> set_model gpt-5.2 +mimi> set_proxy 127.0.0.1 7897 +mimi> clear_proxy +mimi> set_search_key BSA... +mimi> set_tavily_key tvly-... 
+mimi> config_show +mimi> config_reset +``` -| Tool | Description | -|------|-------------| -| `web_search` | Search the web via Tavily (preferred) or Brave for current information | -| `get_current_time` | Fetch current date/time via HTTP and set the system clock | -| `cron_add` | Schedule a recurring or one-shot task (the LLM creates cron jobs on its own) | -| `cron_list` | List all scheduled cron jobs | -| `cron_remove` | Remove a cron job by ID | +Maintenance commands: + +```text +mimi> wifi_status +mimi> memory_read +mimi> memory_write "remember this" +mimi> heap_info +mimi> session_list +mimi> session_clear 12345 +mimi> heartbeat_trigger +mimi> cron_start +mimi> restart +``` -To enable web search, set a [Tavily API key](https://app.tavily.com/home) via `MIMI_SECRET_TAVILY_KEY` (preferred), or a [Brave Search API key](https://brave.com/search/api/) via `MIMI_SECRET_SEARCH_KEY` in `mimi_secrets.h`. +## Compatible Provider Model -## Cron Tasks +`reSpeaker-claw` is not limited to the official Anthropic and OpenAI endpoints. -MimiClaw has a built-in cron scheduler that lets the AI schedule its own tasks. The LLM can create recurring jobs ("every N seconds") or one-shot jobs ("at unix timestamp") via the `cron_add` tool. When a job fires, its message is injected into the agent loop — so the AI wakes up, processes the task, and responds. +It supports: -Jobs are persisted to SPIFFS (`cron.json`) and survive reboots. Example use cases: daily summaries, periodic reminders, scheduled check-ins. +- Anthropic protocol compatible services, selected with `set_model_provider anthropic` +- OpenAI protocol compatible services, selected with `set_model_provider openai` +- Custom API bases through `set_api_base` -## Heartbeat +This makes it practical to use local gateways, regional cloud vendors, or unified API platforms without changing the agent loop. -The heartbeat service periodically reads `HEARTBEAT.md` from SPIFFS and checks for actionable tasks. 
If uncompleted items are found (anything that isn't an empty line, a header, or a checked `- [x]` box), it sends a prompt to the agent loop so the AI can act on them autonomously. +## Memory and Automation -This turns MimiClaw into a proactive assistant — write tasks to `HEARTBEAT.md` and the bot will pick them up on the next heartbeat cycle (default: every 30 minutes). +The agent persists its state in plain files on SPIFFS: -## Also Included +| File | Purpose | +|------|---------| +| `SOUL.md` | Assistant persona | +| `USER.md` | User profile | +| `MEMORY.md` | Long-term memory | +| `HEARTBEAT.md` | Periodic autonomous task list | +| `cron.json` | Scheduled jobs | +| `tg_12345.jsonl` | Session history | -- **WebSocket gateway** on port 18789 — connect from your LAN with any WebSocket client -- **OTA updates** — flash new firmware over WiFi, no USB needed -- **Dual-core** — network I/O and AI processing run on separate CPU cores -- **HTTP proxy** — CONNECT tunnel support for restricted networks -- **Multi-provider** — supports both Anthropic (Claude) and OpenAI (GPT), switchable at runtime -- **Cron scheduler** — the AI can schedule its own recurring and one-shot tasks, persisted across reboots -- **Heartbeat** — periodically checks a task file and prompts the AI to act autonomously -- **Tool use** — ReAct agent loop with tool calling for both providers +Built-in automation features: -## For Developers +- `cron_add`, `cron_list`, `cron_remove` +- heartbeat-driven proactive task handling +- tool calling in the ReAct loop +- local storage that survives reboot -Technical details live in the `docs/` folder: +## Tooling -- **[docs/ARCHITECTURE.md](docs/ARCHITECTURE.md)** — system design, module map, task layout, memory budget, protocols, flash partitions -- **[docs/TODO.md](docs/TODO.md)** — feature gap tracker and roadmap -- **[docs/tool-setup/](docs/tool-setup/README.md)** — configuration guides for external service integrations (Tavily, etc.) 
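The heartbeat rule described earlier — a `HEARTBEAT.md` line counts as pending unless it is blank, a header, or a checked `- [x]` box — can be sketched as a single predicate. This is illustrative only, not the firmware's exact parser:

```c
#include <ctype.h>
#include <stdbool.h>
#include <string.h>

/* True if a HEARTBEAT.md line should trigger the agent loop:
 * not blank, not a markdown header, not an already-checked box. */
bool heartbeat_line_actionable(const char *line)
{
    while (*line && isspace((unsigned char)*line))
        line++;                          /* skip leading whitespace */
    if (*line == '\0') return false;     /* blank line */
    if (*line == '#')  return false;     /* markdown header */
    if (strncmp(line, "- [x]", 5) == 0 ||
        strncmp(line, "- [X]", 5) == 0)
        return false;                    /* completed checkbox */
    return true;
}
```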
+Built-in tools include: -## Contributing +- `web_search` +- `get_current_time` +- `cron_add` +- `cron_list` +- `cron_remove` +- SPIFFS file tools used by the agent runtime -Please read **[CONTRIBUTING.md](CONTRIBUTING.md)** before opening issues or pull requests. +For web search, configure either: -## Contributors +- `MIMI_SECRET_TAVILY_KEY` +- `MIMI_SECRET_SEARCH_KEY` -Thanks to everyone who has contributed to MimiClaw. +## Acknowledgments - - MimiClaw contributors - +This project builds on the original [mimiclaw](https://github.com/memovai/mimiclaw). reSpeaker-claw adapts that embedded agent foundation to ReSpeaker XVF3800 voice hardware, extends the STT / TTS pipeline, and continues the multi-channel agent architecture. ## License MIT - -## Acknowledgments - -Inspired by [OpenClaw](https://github.com/openclaw/openclaw) and [Nanobot](https://github.com/HKUDS/nanobot). MimiClaw reimplements the core AI agent architecture for embedded hardware — no Linux, no server, just a $5 chip. - -## Star History - -[![Star History Chart](https://api.star-history.com/svg?repos=memovai/mimiclaw&type=Date)](https://star-history.com/#memovai/mimiclaw&Date) diff --git a/README_CN.md b/README_CN.md index f1cefa51..4a36f65f 100644 --- a/README_CN.md +++ b/README_CN.md @@ -1,339 +1,264 @@ -# MimiClaw: $5 芯片上的口袋 AI 助理 +# reSpeaker-claw:面向 ReSpeaker XVF3800 的语音 AI Agent

- MimiClaw -

- -

- License: MIT - DeepWiki - Discord - X + License: MIT + Language: C + Framework: ESP-IDF v5.5+ + Hardware: ReSpeaker XVF3800 + Architecture: Voice Agent

English | 中文 | 日本語

-**$5 芯片上的 AI 助理(OpenClaw)。没有 Linux,没有 Node.js,纯 C。** +reSpeaker-claw 将基于 ReSpeaker XVF3800 的设备变成一个以语音为主入口的 AI Agent。它通过 I2S 采集音频,在本地执行 VAD,将话语送入 STT,并通过嵌入式 agent loop 处理。系统把实时语音交互、本地记忆、工具调用、调度、heartbeat、OTA 更新和代理支持整合在一起,最后通过 TTS 从扬声器返回响应。 -MimiClaw 把一块小小的 ESP32-S3 开发板变成你的私人 AI 助理。插上 USB 供电,连上 WiFi,通过 Telegram 跟它对话 — 它能处理你丢给它的任何任务,还会随时间积累本地记忆不断进化 — 全部跑在一颗拇指大小的芯片上。 +## 认识 reSpeaker-claw -## 认识 MimiClaw +- **小巧**:没有 Linux,没有 Node.js,没有臃肿依赖,只有纯 C +- **忠诚**:从记忆中学习,重启后依然保留上下文 +- **高效**:USB 供电,功耗更低,可 24/7 运行 +- **自由**:ReSpeaker XVF3800 麦克风阵列,配合你自己选择的功放或 DAC +- **顺手**:内置语音通道,除了 XVF3800 和扬声器链路,不需要额外硬件 -- **小巧** — 没有 Linux,没有 Node.js,没有臃肿依赖 — 纯 C -- **好用** — 在 Telegram 发消息,剩下的它来搞定 -- **忠诚** — 从记忆中学习,跨重启也不会忘 -- **能干** — USB 供电,0.5W,24/7 运行 -- **可爱** — 一块 ESP32-S3 开发板,$5,没了 +## 亮点 -## 工作原理 +- 语音输入:ReSpeaker XVF3800 麦克风阵列,通过 I2S 接入 +- 语音输出:TTS 音频下载、WAV 解码、重采样与 I2S 播放 +- 多通道 Agent:语音、Telegram、飞书、WebSocket +- 本地持久化:SPIFFS 保存记忆、配置、会话、cron 任务和每日笔记 +- 兼容 LLM 后端:支持官方 Anthropic / OpenAI API,也支持兼容 Anthropic 或 OpenAI 协议的第三方网关 +- 可配置 STT / TTS:可接入你自己的服务 URL、API Key、模型、音色和语言 +- 运行时覆盖:可通过串口 CLI 修改 WiFi、provider、model、API base、代理和 token,无需改代码 -![](assets/mimiclaw.png) +## 快速开始 -你在 Telegram 发一条消息,ESP32-S3 通过 WiFi 收到后送进 Agent 循环 — LLM 思考、调用工具、读取记忆 — 再把回复发回来。同时支持 **Anthropic (Claude)** 和 **OpenAI (GPT)** 两种提供商,运行时可切换。一切都跑在一颗 $5 的芯片上,所有数据存在本地 Flash。 +### 依赖条件 -## 快速开始 +- 一套 reSpeaker XVF3800 USB 4 Microphone Array 搭配 XIAO ESP32S3 开发板 +- 一路 I2S 输出到扬声器 / DAC / 功放 +- 一根用于烧录和串口监控的 USB 线 +- 可用的 WiFi +- ESP-IDF v5.5+ +- 可选:如果你要使用 Telegram,需要 Telegram Bot Token +- 可选:如果你要使用飞书,需要飞书应用凭证 +- 一个兼容 Anthropic 或 OpenAI 协议的 LLM API Key +- 一套用于语音模式的 STT 服务和 TTS 服务 -### 你需要 +### 克隆与构建环境 -- 一块 **ESP32-S3 开发板**,16MB Flash + 8MB PSRAM(如小智 AI 开发板,~¥30) -- 一根 **USB Type-C 数据线** -- 一个 **Telegram Bot Token** — 在 Telegram 找 [@BotFather](https://t.me/BotFather) 创建 -- 一个 **Anthropic API Key** — 从 [console.anthropic.com](https://console.anthropic.com) 获取,或一个 **OpenAI API Key** — 从 
[platform.openai.com](https://platform.openai.com) 获取 +先参考官方指南刷入 I2S 固件: +[SeeedStudio wiki](https://wiki.seeedstudio.com/respeaker_xvf3800_introduction/#flash-firmware) -### 安装 +然后克隆本项目并设置目标: ```bash -# 需要先安装 ESP-IDF v5.5+: -# https://docs.espressif.com/projects/esp-idf/en/v5.5.2/esp32s3/get-started/ - -git clone https://github.com/memovai/mimiclaw.git -cd mimiclaw +git clone https://github.com/Seeed-Projects/reSpeaker-claw +cd reSpeaker-claw idf.py set-target esp32s3 ``` -
-Ubuntu 安装 - -建议基线: - -- Ubuntu 22.04/24.04 -- Python >= 3.10 -- CMake >= 3.16 -- Ninja >= 1.10 -- Git >= 2.34 -- flex >= 2.6 -- bison >= 3.8 -- gperf >= 3.1 -- dfu-util >= 0.11 -- `libusb-1.0-0`、`libffi-dev`、`libssl-dev` +先安装 ESP-IDF:[ESP-IDF 安装](https://docs.espressif.com/projects/esp-idf/en/v5.5.3/esp32s3/get-started/) -Ubuntu 安装与构建: +Ubuntu 辅助脚本: ```bash -sudo apt-get update -sudo apt-get install -y git wget flex bison gperf python3 python3-pip python3-venv \ - cmake ninja-build ccache libffi-dev libssl-dev dfu-util libusb-1.0-0 - ./scripts/setup_idf_ubuntu.sh ./scripts/build_ubuntu.sh ``` -
- -
-macOS 安装 - -建议基线: - -- macOS 12/13/14 -- Xcode Command Line Tools -- Homebrew -- Python >= 3.10 -- CMake >= 3.16 -- Ninja >= 1.10 -- Git >= 2.34 -- flex >= 2.6 -- bison >= 3.8 -- gperf >= 3.1 -- dfu-util >= 0.11 -- `libusb`、`libffi`、`openssl` - -macOS 安装与构建: +macOS 辅助脚本: ```bash -xcode-select --install -/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)" - ./scripts/setup_idf_macos.sh ./scripts/build_macos.sh ``` -
- -### 配置 +## 配置 -MimiClaw 使用**两层配置**:`mimi_secrets.h` 提供编译时默认值,串口 CLI 可在运行时覆盖。CLI 设置的值存在 NVS Flash 中,优先级高于编译时值。 +复制示例 secrets 文件: ```bash -cp main/mimi_secrets.h.example main/mimi_secrets.h +cp "main/mimi_secrets.h.example" "main/mimi_secrets.h" ``` -编辑 `main/mimi_secrets.h`: +编辑 `main/mimi_secrets.h`,填写你实际需要的配置项: ```c -#define MIMI_SECRET_WIFI_SSID "你的WiFi名" -#define MIMI_SECRET_WIFI_PASS "你的WiFi密码" -#define MIMI_SECRET_TG_TOKEN "123456:ABC-DEF1234ghIkl-zyx57W2v1u123ew11" -#define MIMI_SECRET_API_KEY "sk-ant-api03-xxxxx" -#define MIMI_SECRET_MODEL_PROVIDER "anthropic" // "anthropic" 或 "openai" -#define MIMI_SECRET_SEARCH_KEY "" // 可选:Brave Search API key -#define MIMI_SECRET_TAVILY_KEY "" // 可选:Tavily API key(优先) -#define MIMI_SECRET_PROXY_HOST "10.0.0.1" // 可选:代理地址 -#define MIMI_SECRET_PROXY_PORT "7897" // 可选:代理端口 +/* WiFi */ +#define MIMI_SECRET_WIFI_SSID "YourWiFiName" +#define MIMI_SECRET_WIFI_PASS "YourWiFiPassword" + +/* Optional text channels */ +#define MIMI_SECRET_TG_TOKEN "" +#define MIMI_SECRET_FEISHU_APP_ID "" +#define MIMI_SECRET_FEISHU_APP_SECRET "" + +/* LLM */ +#define MIMI_SECRET_API_KEY "your-llm-key" +#define MIMI_SECRET_MODEL "your-model" +#define MIMI_SECRET_MODEL_PROVIDER "openai" /* or "anthropic" */ + +/* Search and proxy */ +#define MIMI_SECRET_TAVILY_KEY "" +#define MIMI_SECRET_SEARCH_KEY "" +#define MIMI_SECRET_PROXY_HOST "" +#define MIMI_SECRET_PROXY_PORT "" +#define MIMI_SECRET_PROXY_TYPE "" /* "http" or "socks5" */ + +/* Voice STT / TTS */ +#define MIMI_SECRET_STT_URL "https://your-stt-endpoint" +#define MIMI_SECRET_STT_API_KEY "your-stt-key" +#define MIMI_SECRET_STT_MODEL "your-stt-model" +#define MIMI_SECRET_TTS_URL "https://your-tts-endpoint" +#define MIMI_SECRET_TTS_API_KEY "your-tts-key" +#define MIMI_SECRET_TTS_MODEL "your-tts-model" +#define MIMI_SECRET_TTS_VOICE "" +#define MIMI_SECRET_TTS_LANGUAGE "English" + +/* ReSpeaker XVF3800 I2S pin map */ +#define MIMI_VOICE_I2S_PORT 0 +#define MIMI_VOICE_I2S_BCLK GPIO_NUM_8 +#define 
MIMI_VOICE_I2S_WS GPIO_NUM_7 +#define MIMI_VOICE_I2S_DIN GPIO_NUM_43 +#define MIMI_VOICE_I2S_DOUT GPIO_NUM_44 ``` -然后编译烧录: +说明: -```bash -# 完整编译(修改 mimi_secrets.h 后必须 fullclean) -idf.py fullclean && idf.py build - -# 查找串口 -ls /dev/cu.usb* # macOS -ls /dev/ttyACM* # Linux - -# 烧录并监控(将 PORT 替换为你的串口) -# USB 转接器:大概率是 /dev/cu.usbmodem11401(macOS)或 /dev/ttyACM0(Linux) -idf.py -p PORT flash monitor -``` - -> **注意:请插对 USB 口!** 大多数 ESP32-S3 开发板有两个 Type-C 接口,必须插标有 **USB** 的那个口(原生 USB Serial/JTAG),**不要**插标有 **COM** 的口(外部 UART 桥接)。插错口会导致烧录/监控失败。 -> ->
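两层配置的优先级可以用一个简单的查找函数来示意:运行时(CLI 写入 NVS)的值优先,否则回退到 `mimi_secrets.h` 的编译时默认值。以下仅为示意代码,函数名和参数名并非固件的真实 API:

```c
#include <stddef.h>

/* 示意:两层配置查找,NVS 运行时值优先,否则回退到编译时默认值 */
const char *cfg_resolve_cn(const char *nvs_value, const char *build_default)
{
    /* NVS 条目为空或不存在,表示"未被运行时覆盖" */
    if (nvs_value != NULL && nvs_value[0] != '\0')
        return nvs_value;
    return build_default;
}
```

在这个模型下,`config_reset` 相当于清除 NVS 条目,使所有查找都回退到编译时默认值。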
-> 查看参考图片 -> -> 请插 USB 口,不要插 COM 口 -> ->
+- `MIMI_SECRET_MODEL_PROVIDER` 选择的是请求协议,而不只是厂商名 +- 兼容 OpenAI 协议的网关使用 `openai` +- 兼容 Anthropic 协议的网关使用 `anthropic` +- 语音模式要求 STT 与 TTS 的 URL / Key 成对配置 +- LLM API base 可在运行时通过 `set_api_base` 修改 -### 代理配置(国内用户) +## 添加 STT 和 TTS -在国内需要代理才能访问 Telegram 和 Anthropic API。MimiClaw 内置 HTTP CONNECT 隧道支持。 +这个项目不再把语音当成附属功能。要启用完整的 ReSpeaker 体验: -**前提**:局域网内有一个支持 HTTP CONNECT 的代理(Clash Verge、V2Ray 等),并开启了「允许局域网连接」。 +1. 配置 `MIMI_SECRET_STT_URL`、`MIMI_SECRET_STT_API_KEY` 和 `MIMI_SECRET_STT_MODEL` +2. 配置 `MIMI_SECRET_TTS_URL`、`MIMI_SECRET_TTS_API_KEY`、`MIMI_SECRET_TTS_MODEL`、`MIMI_SECRET_TTS_VOICE` 和 `MIMI_SECRET_TTS_LANGUAGE` +3. 在 I2S 配置段中设置 XVF3800 的输入引脚和扬声器输出引脚 +4. 如果 DAC 或功放播放出来像噪音,设置 `MIMI_VOICE_I2S_STD_SLOT_STYLE` 以匹配硬件时序 +5. 如果房间环境导致误触发,调节 `MIMI_VOICE_VAD_START_FRAMES`、`MIMI_VOICE_VAD_MIN_FRAMES` 和 `MIMI_VOICE_STT_COOLDOWN_MS` +6. 如果 TTS 音频过长,调节 `MIMI_VOICE_TTS_MAX_SECONDS`、`MIMI_VOICE_TTS_CHARS_PER_SEC` 和 `MIMI_VOICE_TTS_MAX_CHARS` -可以在 `mimi_secrets.h` 中编译时设置,也可以通过串口 CLI 随时修改: +当前固件已经包含完整的语音通道: -``` -mimi> set_proxy 192.168.1.83 7897 # 设置代理 -mimi> clear_proxy # 清除代理 -``` +- 输入方向:mic PCM -> VAD -> STT -> message bus +- 输出方向:agent text -> TTS -> playback -> **提示**:确保 ESP32-S3 和代理机器在同一局域网。Clash Verge 在「设置 → 允许局域网」中开启。 +## 烧录与监控 -### CLI 命令(通过 UART/COM 口连接) +修改 `main/mimi_secrets.h` 后,建议从干净状态重新构建: -通过串口连接即可配置和调试。**配置命令**让你无需重新编译就能修改设置 — 随时随地插上 USB 线就能改。 - -**运行时配置**(存入 NVS,覆盖编译时默认值): - -``` -mimi> wifi_set MySSID MyPassword # 换 WiFi -mimi> set_tg_token 123456:ABC... # 换 Telegram Bot Token -mimi> set_api_key sk-ant-api03-... # 换 API Key(Anthropic 或 OpenAI) -mimi> set_model_provider openai # 切换提供商(anthropic|openai) -mimi> set_model gpt-4o # 换模型 -mimi> set_proxy 192.168.1.83 7897 # 设置代理 -mimi> clear_proxy # 清除代理 -mimi> set_search_key BSA... # 设置 Brave Search API Key -mimi> set_tavily_key tvly-... 
# 设置 Tavily API Key(优先) -mimi> config_show # 查看所有配置(脱敏显示) -mimi> config_reset # 清除 NVS,恢复编译时默认值 +```bash +idf.py fullclean +idf.py build ``` -**调试与运维:** +查找串口: +```bash +ls /dev/cu.usb* # macOS +ls /dev/ttyACM* # Linux ``` -mimi> wifi_status # 连上了吗? -mimi> memory_read # 看看它记住了什么 -mimi> memory_write "内容" # 写入 MEMORY.md -mimi> heap_info # 还剩多少内存? -mimi> session_list # 列出所有会话 -mimi> session_clear 12345 # 删除一个会话 -mimi> heartbeat_trigger # 手动触发一次心跳检查 -mimi> cron_start # 立即启动 cron 调度器 -mimi> restart # 重启 -``` - -### USB (JTAG) 与 UART:哪个口做什么 -大多数 ESP32-S3 开发板有 **两个 USB-C 口**: - -| 端口 | 用途 | -|------|------| -| **USB**(JTAG) | `idf.py flash`、JTAG 调试 | -| **COM**(UART) | **REPL 命令行**、串口控制台 | - -> **REPL 必须连接 UART(COM)口。** USB(JTAG)口不支持交互式 REPL 输入。 - -
-端口详情与推荐工作流 - -| 端口 | 标注 | 协议 | -|------|------|------| -| **USB** | USB / JTAG | 原生 USB Serial/JTAG | -| **COM** | UART / COM | 外置 UART 桥接芯片(CP2102/CH340) | - -ESP-IDF 控制台默认配置为 UART 输出(`CONFIG_ESP_CONSOLE_UART_DEFAULT=y`)。 - -**同时连接两个口时:** - -- USB(JTAG)口负责烧录/下载,并提供辅助串口输出 -- UART(COM)口提供主要的交互式控制台,用于 REPL -- macOS 下两个口都会显示为 `/dev/cu.usbmodem*` 或 `/dev/cu.usbserial-*`,用 `ls /dev/cu.usb*` 区分 -- Linux 下 USB(JTAG)通常是 `/dev/ttyACM0`,UART 通常是 `/dev/ttyUSB0` - -**推荐工作流:** +烧录并监控: ```bash -# 通过 USB(JTAG)口烧录 -idf.py -p /dev/cu.usbmodem11401 flash - -# 通过 UART(COM)口打开 REPL -idf.py -p /dev/cu.usbserial-110 monitor -# 或使用任意串口工具:screen、minicom、PuTTY,波特率 115200 +idf.py -p PORT flash monitor ``` -
- -## 记忆 - -MimiClaw 把所有数据存为纯文本文件,可以直接读取和编辑: - -| 文件 | 说明 | -|------|------| -| `SOUL.md` | 机器人的人设 — 编辑它来改变行为方式 | -| `USER.md` | 关于你的信息 — 姓名、偏好、语言 | -| `MEMORY.md` | 长期记忆 — 它应该一直记住的事 | -| `HEARTBEAT.md` | 待办清单 — 机器人定期检查并自主执行 | -| `cron.json` | 定时任务 — AI 创建的周期性或一次性任务 | -| `2026-02-05.md` | 每日笔记 — 今天发生了什么 | -| `tg_12345.jsonl` | 聊天记录 — 你和它的对话 | - -## 工具 - -MimiClaw 同时支持 Anthropic 和 OpenAI 的工具调用 — LLM 在对话中可以调用工具,循环执行直到任务完成(ReAct 模式)。 +将 `PORT` 替换为你的实际设备路径。 + +## 串口 CLI + +串口 CLI 是修改 NVS 运行时配置的最快方式: + +```text +mimi> wifi_set MySSID MyPassword +mimi> set_tg_token 123456:ABC... +mimi> set_api_key your-llm-key +mimi> set_api_base https://your-compatible-endpoint/v1 +mimi> set_model_provider openai +mimi> set_model gpt-5.2 +mimi> set_proxy 127.0.0.1 7897 +mimi> clear_proxy +mimi> set_search_key BSA... +mimi> set_tavily_key tvly-... +mimi> config_show +mimi> config_reset +``` -| 工具 | 说明 | -|------|------| -| `web_search` | 通过 Tavily(优先)或 Brave 搜索网页,获取实时信息 | -| `get_current_time` | 通过 HTTP 获取当前日期和时间,并设置系统时钟 | -| `cron_add` | 创建定时或一次性任务(LLM 自主创建 cron 任务) | -| `cron_list` | 列出所有已调度的 cron 任务 | -| `cron_remove` | 按 ID 删除 cron 任务 | +维护命令: + +```text +mimi> wifi_status +mimi> memory_read +mimi> memory_write "remember this" +mimi> heap_info +mimi> session_list +mimi> session_clear 12345 +mimi> heartbeat_trigger +mimi> cron_start +mimi> restart +``` -启用网页搜索可在 `mimi_secrets.h` 中设置 [Tavily API key](https://app.tavily.com/home)(优先,`MIMI_SECRET_TAVILY_KEY`),或 [Brave Search API key](https://brave.com/search/api/)(`MIMI_SECRET_SEARCH_KEY`)。 +## 兼容 Provider 模型 -## 定时任务(Cron) +`reSpeaker-claw` 不局限于官方 Anthropic 和 OpenAI 端点。 -MimiClaw 内置 cron 调度器,让 AI 可以自主安排任务。LLM 可以通过 `cron_add` 工具创建周期性任务("每 N 秒")或一次性任务("在某个时间戳")。任务触发时,消息会注入到 Agent 循环 — AI 自动醒来、处理任务并回复。 +它支持: -任务持久化存储在 SPIFFS(`cron.json`),重启后不会丢失。典型用途:每日总结、定时提醒、定期巡检。 +- 兼容 Anthropic 协议的服务,通过 `set_model_provider anthropic` 选择 +- 兼容 OpenAI 协议的服务,通过 `set_model_provider openai` 选择 +- 通过 `set_api_base` 指向任意兼容 API base -## 心跳(Heartbeat) 
+这让你可以在不修改 agent loop 的情况下,直接使用本地网关、区域云厂商或统一 API 平台。 -心跳服务会定期读取 SPIFFS 上的 `HEARTBEAT.md`,检查是否有待办事项。如果发现未完成的条目(非空行、非标题、非已勾选的 `- [x]`),就会向 Agent 循环发送提示,让 AI 自主处理。 +## 记忆与自动化 -这让 MimiClaw 变成一个主动型助理 — 把任务写入 `HEARTBEAT.md`,机器人会在下一次心跳周期自动拾取执行(默认每 30 分钟)。 +Agent 会将状态以纯文本文件形式持久化到 SPIFFS: -## 其他功能 +| 文件 | 用途 | +|------|------| +| `SOUL.md` | 助手人格 | +| `USER.md` | 用户资料 | +| `MEMORY.md` | 长期记忆 | +| `HEARTBEAT.md` | 周期性自主任务列表 | +| `cron.json` | 调度任务 | +| `tg_12345.jsonl` | 会话历史 | -- **WebSocket 网关** — 端口 18789,局域网内用任意 WebSocket 客户端连接 -- **OTA 更新** — WiFi 远程刷固件,无需 USB -- **双核** — 网络 I/O 和 AI 处理分别跑在不同 CPU 核心 -- **HTTP 代理** — CONNECT 隧道,适配受限网络 -- **多提供商** — 同时支持 Anthropic (Claude) 和 OpenAI (GPT),运行时可切换 -- **定时任务** — AI 可自主创建周期性和一次性任务,重启后持久保存 -- **心跳服务** — 定期检查任务文件,驱动 AI 自主执行 -- **工具调用** — ReAct Agent 循环,两种提供商均支持工具调用 +内置自动化能力: -## 开发者 +- `cron_add`、`cron_list`、`cron_remove` +- heartbeat 驱动的主动任务处理 +- ReAct loop 中的工具调用 +- 重启后仍可保留的本地状态 -技术细节在 `docs/` 文件夹: +## 工具 -- **[docs/ARCHITECTURE.md](docs/ARCHITECTURE.md)** — 系统设计、模块划分、任务布局、内存分配、协议、Flash 分区 -- **[docs/TODO.md](docs/TODO.md)** — 功能差距和路线图 -- **[docs/im-integration/](docs/im-integration/README.md)** — IM 通道集成指南(飞书等) +内置工具包括: -## 贡献 +- `web_search` +- `get_current_time` +- `cron_add` +- `cron_list` +- `cron_remove` +- Agent 运行时使用的 SPIFFS 文件工具 -提交 Issue 或 Pull Request 前,请先阅读 **[CONTRIBUTING.md](CONTRIBUTING.md)**。 +如需启用网页搜索,配置以下任一项: -## 贡献者 +- `MIMI_SECRET_TAVILY_KEY` +- `MIMI_SECRET_SEARCH_KEY` -感谢所有为 MimiClaw 做出贡献的开发者。 +## 致谢 - - MimiClaw contributors - +本项目基于原始的 [mimiclaw](https://github.com/memovai/mimiclaw)。reSpeaker-claw 将那套嵌入式 agent 基础适配到 ReSpeaker XVF3800 语音硬件之上,扩展了 STT / TTS 流程,并延续了多通道 agent 架构。 ## 许可证 MIT - -## 致谢 - -灵感来自 [OpenClaw](https://github.com/openclaw/openclaw) 和 [Nanobot](https://github.com/HKUDS/nanobot)。MimiClaw 为嵌入式硬件重新实现了核心 AI Agent 架构 — 没有 Linux,没有服务器,只有一颗 $5 的芯片。 - -## Star History - - - - - - Star History Chart - - diff --git a/README_JA.md b/README_JA.md index fe91a8a9..cfd2e2d7 100644 --- a/README_JA.md 
+++ b/README_JA.md @@ -1,324 +1,264 @@ -# MimiClaw: $5チップで動くポケットAIアシスタント +# reSpeaker-claw: ReSpeaker XVF3800 向け音声 AI Agent



- License: MIT - DeepWiki - Discord - X + License: MIT + Language: C + Framework: ESP-IDF v5.5+ + Hardware: ReSpeaker XVF3800 + Architecture: Voice Agent

English | 中文 | 日本語

-**$5チップ上の世界初のAIアシスタント(OpenClaw)。Linuxなし、Node.jsなし、純粋なCのみ。** - -MimiClawは小さなESP32-S3ボードをパーソナルAIアシスタントに変えます。USB電源に接続し、WiFiにつなげて、Telegramから話しかけるだけ — どんなタスクも処理し、ローカルメモリで時間とともに成長します — すべて親指サイズのチップ上で。 +reSpeaker-claw は、ReSpeaker XVF3800 ベースのデバイスを音声ファーストの AI Agent に変えるプロジェクトです。I2S で音声を取り込み、ローカル VAD を実行し、発話を STT に送って組み込みの agent loop で処理します。システムはリアルタイム音声対話に加えて、ローカルメモリ、ツール呼び出し、スケジューリング、heartbeat、OTA 更新、プロキシ対応を統合し、最終的に TTS でスピーカーから応答を返します。 -## MimiClawの特徴 +## reSpeaker-claw とは -- **超小型** — Linux不要、Node.js不要、無駄なし — 純粋なCのみ -- **便利** — Telegramでメッセージを送るだけ、あとはお任せ -- **忠実** — メモリから学習し、再起動しても忘れない -- **省エネ** — USB給電、0.5W、24時間365日稼働 -- **お手頃** — ESP32-S3ボード1枚、$5、それだけ +- **小さい**: Linux なし、Node.js なし、無駄な依存なし、純粋な C のみ +- **記憶する**: メモリから学習し、再起動後も文脈を保持 +- **省電力**: USB 給電、より低消費電力で 24/7 稼働可能 +- **自由度が高い**: ReSpeaker XVF3800 のマイクアレイに、好みのアンプや DAC を組み合わせ可能 +- **扱いやすい**: 音声チャネルを内蔵し、XVF3800 とスピーカー経路以外の追加ハードウェアをほぼ必要としない -## 仕組み +## 特長 -![](assets/mimiclaw.png) - -Telegramでメッセージを送ると、ESP32-S3がWiFi経由で受信し、エージェントループに送ります — LLMが思考し、ツールを呼び出し、メモリを読み取り — 返答を送り返します。**Anthropic (Claude)** と **OpenAI (GPT)** の両方をサポートし、実行時に切り替え可能です。すべてが$5のチップ上で動作し、データはすべてローカルのFlashに保存されます。 +- 音声入力: ReSpeaker XVF3800 マイクアレイを I2S で接続 +- 音声出力: TTS 音声のダウンロード、WAV デコード、リサンプル、I2S 再生 +- マルチチャネル Agent: 音声、Telegram、Feishu、WebSocket +- ローカル永続化: SPIFFS にメモリ、設定、セッション、cron ジョブ、日次メモを保存 +- 互換 LLM バックエンド: 公式 Anthropic / OpenAI API に加え、Anthropic 互換または OpenAI 互換エンドポイントも利用可能 +- STT / TTS を柔軟に設定可能: URL、API Key、モデル、音色、言語を自由に差し替え可能 +- 実行時オーバーライド: WiFi、provider、model、API base、proxy、token をシリアル CLI から変更可能 ## クイックスタート ### 必要なもの -- **ESP32-S3開発ボード**(16MB Flash + 8MB PSRAM搭載、例:小智AIボード、約$10) -- **USB Type-Cケーブル** -- **Telegram Botトークン** — Telegramで[@BotFather](https://t.me/BotFather)に話しかけて作成 -- **Anthropic APIキー** — [console.anthropic.com](https://console.anthropic.com)から取得、または **OpenAI APIキー** — [platform.openai.com](https://platform.openai.com)から取得 +- reSpeaker XVF3800 USB 4 Microphone Array と XIAO ESP32S3 ボード +- I2S 出力で接続するスピーカー / DAC / アンプ経路 +- 
書き込みとシリアルモニタ用の USB ケーブル +- WiFi 接続 +- ESP-IDF v5.5+ +- 任意: Telegram を使う場合は Telegram Bot Token +- 任意: Feishu を使う場合は Feishu アプリ認証情報 +- Anthropic 互換または OpenAI 互換エンドポイント向けの LLM API Key +- 音声モード用の STT サービスと TTS サービス -### インストール +### クローンとビルド環境 -```bash -# まずESP-IDF v5.5+をインストールしてください: -# https://docs.espressif.com/projects/esp-idf/en/v5.5.2/esp32s3/get-started/ +まず公式ガイドを参照して I2S ファームウェアを書き込んでください: +[SeeedStudio wiki](https://wiki.seeedstudio.com/respeaker_xvf3800_introduction/#flash-firmware) + +その後、このプロジェクトをクローンしてターゲットを設定します: -git clone https://github.com/memovai/mimiclaw.git -cd mimiclaw +```bash +git clone https://github.com/Seeed-Projects/reSpeaker-claw +cd reSpeaker-claw idf.py set-target esp32s3 ``` -
-Ubuntu インストール - -推奨ベースライン: +ESP-IDF は先にインストールしてください: [ESP-IDF Install](https://docs.espressif.com/projects/esp-idf/en/v5.5.3/esp32s3/get-started/) -- Ubuntu 22.04/24.04 -- Python >= 3.10 -- CMake >= 3.16 -- Ninja >= 1.10 -- Git >= 2.34 -- flex >= 2.6 -- bison >= 3.8 -- gperf >= 3.1 -- dfu-util >= 0.11 -- `libusb-1.0-0`, `libffi-dev`, `libssl-dev` - -Ubuntu でのインストールとビルド: +Ubuntu 用ヘルパースクリプト: ```bash -sudo apt-get update -sudo apt-get install -y git wget flex bison gperf python3 python3-pip python3-venv \ - cmake ninja-build ccache libffi-dev libssl-dev dfu-util libusb-1.0-0 - ./scripts/setup_idf_ubuntu.sh ./scripts/build_ubuntu.sh ``` -
-macOS インストール - -推奨ベースライン: - -- macOS 12/13/14 -- Xcode Command Line Tools -- Homebrew -- Python >= 3.10 -- CMake >= 3.16 -- Ninja >= 1.10 -- Git >= 2.34 -- flex >= 2.6 -- bison >= 3.8 -- gperf >= 3.1 -- dfu-util >= 0.11 -- `libusb`, `libffi`, `openssl` - -macOS でのインストールとビルド: +macOS 用ヘルパースクリプト: ```bash -xcode-select --install -/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)" - ./scripts/setup_idf_macos.sh ./scripts/build_macos.sh ``` -
+## 設定 -### 設定 - -MimiClawは**2層設定**を採用しています:`mimi_secrets.h`でビルド時のデフォルト値を設定し、シリアルCLIで実行時にオーバーライドできます。CLI設定値はNVS Flashに保存され、ビルド時の値より優先されます。 +まず secrets のサンプルファイルをコピーします: ```bash -cp main/mimi_secrets.h.example main/mimi_secrets.h +cp "main/mimi_secrets.h.example" "main/mimi_secrets.h" ``` -`main/mimi_secrets.h`を編集: +`main/mimi_secrets.h` を編集し、実際に使う項目を設定します: ```c -#define MIMI_SECRET_WIFI_SSID "WiFi名" -#define MIMI_SECRET_WIFI_PASS "WiFiパスワード" -#define MIMI_SECRET_TG_TOKEN "123456:ABC-DEF1234ghIkl-zyx57W2v1u123ew11" -#define MIMI_SECRET_API_KEY "sk-ant-api03-xxxxx" -#define MIMI_SECRET_MODEL_PROVIDER "anthropic" // "anthropic" または "openai" -#define MIMI_SECRET_SEARCH_KEY "" // 任意:Brave Search APIキー -#define MIMI_SECRET_TAVILY_KEY "" // 任意:Tavily APIキー(優先) -#define MIMI_SECRET_PROXY_HOST "" // 任意:例 "10.0.0.1" -#define MIMI_SECRET_PROXY_PORT "" // 任意:例 "7897" +/* WiFi */ +#define MIMI_SECRET_WIFI_SSID "YourWiFiName" +#define MIMI_SECRET_WIFI_PASS "YourWiFiPassword" + +/* Optional text channels */ +#define MIMI_SECRET_TG_TOKEN "" +#define MIMI_SECRET_FEISHU_APP_ID "" +#define MIMI_SECRET_FEISHU_APP_SECRET "" + +/* LLM */ +#define MIMI_SECRET_API_KEY "your-llm-key" +#define MIMI_SECRET_MODEL "your-model" +#define MIMI_SECRET_MODEL_PROVIDER "openai" /* or "anthropic" */ + +/* Search and proxy */ +#define MIMI_SECRET_TAVILY_KEY "" +#define MIMI_SECRET_SEARCH_KEY "" +#define MIMI_SECRET_PROXY_HOST "" +#define MIMI_SECRET_PROXY_PORT "" +#define MIMI_SECRET_PROXY_TYPE "" /* "http" or "socks5" */ + +/* Voice STT / TTS */ +#define MIMI_SECRET_STT_URL "https://your-stt-endpoint" +#define MIMI_SECRET_STT_API_KEY "your-stt-key" +#define MIMI_SECRET_STT_MODEL "your-stt-model" +#define MIMI_SECRET_TTS_URL "https://your-tts-endpoint" +#define MIMI_SECRET_TTS_API_KEY "your-tts-key" +#define MIMI_SECRET_TTS_MODEL "your-tts-model" +#define MIMI_SECRET_TTS_VOICE "" +#define MIMI_SECRET_TTS_LANGUAGE "English" + +/* ReSpeaker XVF3800 I2S pin map */ +#define MIMI_VOICE_I2S_PORT 0 +#define 
MIMI_VOICE_I2S_BCLK GPIO_NUM_8 +#define MIMI_VOICE_I2S_WS GPIO_NUM_7 +#define MIMI_VOICE_I2S_DIN GPIO_NUM_43 +#define MIMI_VOICE_I2S_DOUT GPIO_NUM_44 ``` -ビルドとフラッシュ: +補足: -```bash -# フルビルド(mimi_secrets.h変更後はfullclean必須) -idf.py fullclean && idf.py build +- `MIMI_SECRET_MODEL_PROVIDER` はベンダ名ではなく、リクエストプロトコルを選択します +- OpenAI 互換ゲートウェイには `openai` を使用します +- Anthropic 互換ゲートウェイには `anthropic` を使用します +- 音声モードでは STT と TTS の URL / Key を両方設定する必要があります +- LLM API base は実行時に `set_api_base` で変更できます -# シリアルポートを確認 -ls /dev/cu.usb* # macOS -ls /dev/ttyACM* # Linux +## STT と TTS の追加 -# フラッシュとモニター(PORTをあなたのポートに置き換え) -# USBアダプタ:おそらく /dev/cu.usbmodem11401(macOS)または /dev/ttyACM0(Linux) -idf.py -p PORT flash monitor -``` +このプロジェクトでは、音声を後付け機能として扱っていません。完全な ReSpeaker 体験を有効にするには: -> **重要:正しいUSBポートに接続してください!** ほとんどのESP32-S3ボードには2つのUSB-Cポートがあります。**USB**(ネイティブUSB Serial/JTAG)と書かれたポートを使用してください。**COM**(外部UARTブリッジ)と書かれたポートは使わないでください。間違ったポートに接続するとフラッシュ/モニターが失敗します。 -> ->
-> 参考画像を表示 -> -> USBポートに接続、COMポートではありません -> ->
+1. `MIMI_SECRET_STT_URL`、`MIMI_SECRET_STT_API_KEY`、`MIMI_SECRET_STT_MODEL` を設定します +2. `MIMI_SECRET_TTS_URL`、`MIMI_SECRET_TTS_API_KEY`、`MIMI_SECRET_TTS_MODEL`、`MIMI_SECRET_TTS_VOICE`、`MIMI_SECRET_TTS_LANGUAGE` を設定します +3. I2S セクションで XVF3800 の入力ピンとスピーカー側の出力ピンを設定します +4. DAC やアンプの音がノイズになる場合は、`MIMI_VOICE_I2S_STD_SLOT_STYLE` をハードウェアのタイミングに合わせて設定します +5. 室内環境で誤検知が多い場合は、`MIMI_VOICE_VAD_START_FRAMES`、`MIMI_VOICE_VAD_MIN_FRAMES`、`MIMI_VOICE_STT_COOLDOWN_MS` を調整します +6. TTS 音声が長すぎる場合は、`MIMI_VOICE_TTS_MAX_SECONDS`、`MIMI_VOICE_TTS_CHARS_PER_SEC`、`MIMI_VOICE_TTS_MAX_CHARS` を調整します -### CLIコマンド(UART/COMポート経由) +現在のファームウェアには、すでに完全な音声チャネルが含まれています: -シリアル接続で設定やデバッグができます。**設定コマンド**により再コンパイル不要で設定変更可能 — USBケーブルを挿すだけ。 +- 入力方向: mic PCM -> VAD -> STT -> message bus +- 出力方向: agent text -> TTS -> playback -**実行時設定**(NVSに保存、ビルド時のデフォルト値をオーバーライド): +## 書き込みとモニタ -``` -mimi> wifi_set MySSID MyPassword # WiFiネットワークを変更 -mimi> set_tg_token 123456:ABC... # Telegram Botトークンを変更 -mimi> set_api_key sk-ant-api03-... # APIキーを変更(AnthropicまたはOpenAI) -mimi> set_model_provider openai # プロバイダーを切替(anthropic|openai) -mimi> set_model gpt-4o # LLMモデルを変更 -mimi> set_proxy 127.0.0.1 7897 # HTTPプロキシを設定 -mimi> clear_proxy # プロキシを削除 -mimi> set_search_key BSA... # Brave Search APIキーを設定 -mimi> set_tavily_key tvly-... # Tavily APIキーを設定(優先) -mimi> config_show # 全設定を表示(マスク付き) -mimi> config_reset # NVSをクリア、ビルド時デフォルトに戻す -``` - -**デバッグ・メンテナンス:** +`main/mimi_secrets.h` を変更した後は、クリーンな状態から再ビルドしてください: +```bash +idf.py fullclean +idf.py build ``` -mimi> wifi_status # 接続されていますか? -mimi> memory_read # ボットが何を覚えているか確認 -mimi> memory_write "内容" # MEMORY.mdに書き込み -mimi> heap_info # 空きRAMはどれくらい? 
-mimi> session_list # 全チャットセッションを一覧 -mimi> session_clear 12345 # 会話を削除 -mimi> heartbeat_trigger # ハートビートチェックを手動トリガー -mimi> cron_start # cronスケジューラを今すぐ開始 -mimi> restart # 再起動 -``` - -### USB(JTAG)vs UART:どのポートで何をするか - -ほとんどの ESP32-S3 開発ボードには **2つの USB-C ポート**があります: -| ポート | 用途 | -|--------|------| -| **USB**(JTAG) | `idf.py flash`、JTAGデバッグ | -| **COM**(UART) | **REPL CLI**、シリアルコンソール | - -> **REPLにはUART(COM)ポートが必要です。** USB(JTAG)ポートは対話的なREPL入力をサポートしません。 - -
-ポート詳細と推奨ワークフロー - -| ポート | ラベル | プロトコル | -|--------|--------|------------| -| **USB** | USB / JTAG | ネイティブ USB Serial/JTAG | -| **COM** | UART / COM | 外部 UART ブリッジ(CP2102/CH340) | - -ESP-IDFコンソールはデフォルトでUART出力に設定されています(`CONFIG_ESP_CONSOLE_UART_DEFAULT=y`)。 - -**両方のポートを同時に接続している場合:** - -- USB(JTAG)ポートはフラッシュ/ダウンロードを処理し、補助シリアル出力を提供 -- UART(COM)ポートはREPL用のメインインタラクティブコンソールを提供 -- macOS では両ポートとも `/dev/cu.usbmodem*` または `/dev/cu.usbserial-*` として表示 — `ls /dev/cu.usb*` で確認 -- Linux では USB(JTAG)は通常 `/dev/ttyACM0`、UART は通常 `/dev/ttyUSB0` - -**推奨ワークフロー:** +シリアルポートを確認します: ```bash -# USB(JTAG)ポートでフラッシュ -idf.py -p /dev/cu.usbmodem11401 flash - -# UART(COM)ポートでREPLを開く -idf.py -p /dev/cu.usbserial-110 monitor -# または任意のシリアルターミナル:screen、minicom、PuTTY(ボーレート 115200) +ls /dev/cu.usb* # macOS +ls /dev/ttyACM* # Linux ``` -
- -## メモリ - -MimiClawはすべてのデータをプレーンテキストファイルとして保存します。直接読み取り・編集可能です: +書き込みとモニタ: -| ファイル | 説明 | -|----------|------| -| `SOUL.md` | ボットの性格 — 編集して振る舞いを変更 | -| `USER.md` | あなたの情報 — 名前、好み、言語 | -| `MEMORY.md` | 長期記憶 — ボットが常に覚えておくべきこと | -| `HEARTBEAT.md` | タスクリスト — ボットが定期的にチェックして自律的に実行 | -| `cron.json` | スケジュールジョブ — AIが作成した定期・単発タスク | -| `2026-02-05.md` | 日次メモ — 今日あったこと | -| `tg_12345.jsonl` | チャット履歴 — ボットとの会話 | - -## ツール +```bash +idf.py -p PORT flash monitor +``` -MimiClawはAnthropicとOpenAI両方のツール呼び出しをサポート — LLMは会話中にツールを呼び出し、タスクが完了するまでループします(ReActパターン)。 +`PORT` は実際のデバイスパスに置き換えてください。 + +## シリアル CLI + +シリアル CLI は、NVS に保存される実行時設定を最も素早く変更する方法です: + +```text +mimi> wifi_set MySSID MyPassword +mimi> set_tg_token 123456:ABC... +mimi> set_api_key your-llm-key +mimi> set_api_base https://your-compatible-endpoint/v1 +mimi> set_model_provider openai +mimi> set_model gpt-5.2 +mimi> set_proxy 127.0.0.1 7897 +mimi> clear_proxy +mimi> set_search_key BSA... +mimi> set_tavily_key tvly-... +mimi> config_show +mimi> config_reset +``` -| ツール | 説明 | -|--------|------| -| `web_search` | Tavily(優先)またはBraveでウェブ検索し、最新情報を取得 | -| `get_current_time` | HTTP経由で現在の日時を取得し、システムクロックを設定 | -| `cron_add` | 定期または単発タスクをスケジュール(LLMが自律的にcronジョブを作成) | -| `cron_list` | スケジュール済みのcronジョブを一覧表示 | -| `cron_remove` | IDでcronジョブを削除 | +メンテナンス用コマンド: + +```text +mimi> wifi_status +mimi> memory_read +mimi> memory_write "remember this" +mimi> heap_info +mimi> session_list +mimi> session_clear 12345 +mimi> heartbeat_trigger +mimi> cron_start +mimi> restart +``` -ウェブ検索を有効にするには、`mimi_secrets.h`で[Tavily APIキー](https://app.tavily.com/home)(優先、`MIMI_SECRET_TAVILY_KEY`)または[Brave Search APIキー](https://brave.com/search/api/)(`MIMI_SECRET_SEARCH_KEY`)を設定してください。 +## 互換 Provider モデル -## Cronタスク +`reSpeaker-claw` は公式の Anthropic と OpenAI のエンドポイントだけに限定されません。 -MimiClawにはcronスケジューラが内蔵されており、AIが自律的にタスクをスケジュールできます。LLMは`cron_add`ツールで定期ジョブ(「N秒ごと」)や単発ジョブ(「UNIXタイムスタンプで指定」)を作成できます。ジョブが発火すると、メッセージがエージェントループに注入され、AIが起動してタスクを処理・応答します。 +対応内容: 
-ジョブはSPIFFS(`cron.json`)に永続化され、再起動後も保持されます。活用例:日次サマリー、定期リマインダー、スケジュールチェック。 +- `set_model_provider anthropic` で選択する Anthropic 互換サービス +- `set_model_provider openai` で選択する OpenAI 互換サービス +- `set_api_base` で切り替える任意の API base -## ハートビート +これにより、agent loop を変更せずに、ローカルゲートウェイ、地域クラウド、統合 API プラットフォームを利用できます。 -ハートビートサービスはSPIFFS上の`HEARTBEAT.md`を定期的に読み取り、アクション可能なタスクがあるかチェックします。未完了の項目(空行、見出し、チェック済み`- [x]`以外)が見つかると、エージェントループにプロンプトを送信し、AIが自律的に処理します。 +## メモリと自動化 -これによりMimiClawはプロアクティブなアシスタントになります — `HEARTBEAT.md`にタスクを書き込めば、次のハートビートサイクルで自動的に拾い上げて実行します(デフォルト:30分ごと)。 +Agent は SPIFFS 上に状態をプレーンテキストファイルとして保存します: -## その他の機能 +| ファイル | 用途 | +|----------|------| +| `SOUL.md` | アシスタント人格 | +| `USER.md` | ユーザープロファイル | +| `MEMORY.md` | 長期記憶 | +| `HEARTBEAT.md` | 定期実行する自律タスクリスト | +| `cron.json` | スケジュールジョブ | +| `tg_12345.jsonl` | セッション履歴 | -- **WebSocketゲートウェイ** — ポート18789、LAN内から任意のWebSocketクライアントで接続 -- **OTAアップデート** — WiFi経由でファームウェア更新、USB不要 -- **デュアルコア** — ネットワークI/OとAI処理が別々のCPUコアで動作 -- **HTTPプロキシ** — CONNECTトンネル対応、制限付きネットワークに対応 -- **マルチプロバイダー** — Anthropic (Claude) と OpenAI (GPT) の両方をサポート、実行時に切り替え可能 -- **Cronスケジューラ** — AIが定期・単発タスクを自律的にスケジュール、再起動後も永続化 -- **ハートビート** — タスクファイルを定期チェックし、AIを自律的に駆動 -- **ツール呼び出し** — ReActエージェントループ、両プロバイダーでツール呼び出し対応 +組み込みの自動化機能: -## 開発者向け +- `cron_add`、`cron_list`、`cron_remove` +- heartbeat 駆動の能動的タスク処理 +- ReAct loop におけるツール呼び出し +- 再起動後も保持されるローカル状態 -技術的な詳細は`docs/`フォルダにあります: +## ツール -- **[docs/ARCHITECTURE.md](docs/ARCHITECTURE.md)** — システム設計、モジュール構成、タスクレイアウト、メモリバジェット、プロトコル、Flashパーティション -- **[docs/TODO.md](docs/TODO.md)** — 機能ギャップとロードマップ -- **[docs/im-integration/](docs/im-integration/README.md)** — IMチャネル統合ガイド(Feishuなど) +組み込みツール: -## 貢献 +- `web_search` +- `get_current_time` +- `cron_add` +- `cron_list` +- `cron_remove` +- Agent ランタイムが使う SPIFFS ファイル操作ツール -Issue や Pull Request を作成する前に、**[CONTRIBUTING.md](CONTRIBUTING.md)** をご確認ください。 +Web 検索を有効にするには、次のいずれかを設定します: -## コントリビューター +- `MIMI_SECRET_TAVILY_KEY` +- `MIMI_SECRET_SEARCH_KEY` -MimiClaw に貢献してくれた皆さんに感謝します。 +## 謝辞 - - 
MimiClaw contributors - +本プロジェクトは元の [mimiclaw](https://github.com/memovai/mimiclaw) を基盤としています。reSpeaker-claw は、その組み込み agent 基盤を ReSpeaker XVF3800 の音声ハードウェア向けに適応し、STT / TTS パイプラインを拡張しつつ、マルチチャネル agent アーキテクチャを継承しています。 ## ライセンス MIT - -## 謝辞 - -[OpenClaw](https://github.com/openclaw/openclaw)と[Nanobot](https://github.com/HKUDS/nanobot)にインスパイアされました。MimiClawはコアAIエージェントアーキテクチャを組み込みハードウェア向けに再実装しました — Linuxなし、サーバーなし、$5のチップだけ。 - -## Star History - - - - - - Star History Chart - - diff --git a/assets/banner.png b/assets/banner.png deleted file mode 100644 index c3cc4255..00000000 Binary files a/assets/banner.png and /dev/null differ diff --git a/assets/esp32s3-usb-port.jpg b/assets/esp32s3-usb-port.jpg deleted file mode 100644 index 706f6107..00000000 Binary files a/assets/esp32s3-usb-port.jpg and /dev/null differ diff --git a/assets/mimiclaw.png b/assets/mimiclaw.png deleted file mode 100644 index e22246e7..00000000 Binary files a/assets/mimiclaw.png and /dev/null differ diff --git a/main/CMakeLists.txt b/main/CMakeLists.txt index 5f3fe1ea..47afb9ec 100644 --- a/main/CMakeLists.txt +++ b/main/CMakeLists.txt @@ -21,10 +21,11 @@ idf_component_register( "tools/tool_get_time.c" "tools/tool_files.c" "skills/skill_loader.c" + "voice/voice_channel.c" INCLUDE_DIRS "." 
REQUIRES nvs_flash esp_wifi esp_netif esp_http_client esp_http_server esp_https_ota esp_event json spiffs console vfs app_update esp-tls - esp_timer esp_websocket_client + esp_timer esp_websocket_client driver ) diff --git a/main/agent/agent_loop.c b/main/agent/agent_loop.c index 7e5eae64..2a513078 100644 --- a/main/agent/agent_loop.c +++ b/main/agent/agent_loop.c @@ -86,6 +86,30 @@ static void append_turn_context_prompt(char *prompt, size_t size, const mimi_msg if (n < 0 || (size_t)n >= (size - off)) { prompt[size - 1] = '\0'; } + + if (msg->channel[0] && strcmp(msg->channel, MIMI_CHAN_VOICE) == 0) { + off = strnlen(prompt, size - 1); + if (off >= size - 1) { + return; + } + + n = snprintf( + prompt + off, size - off, + "\n## Voice Output Constraints\n" + "This reply will be converted to speech (TTS) and played on a small speaker.\n" + "- Use English, natural spoken style.\n" + "- Keep it short: keep playback within ~%d seconds.\n" + "- Structure: at most 2 sentences + 1 short follow-up question.\n" + "- Length: <= %d characters total.\n" + "- No markdown, no lists, no code blocks, no URLs.\n" + "- Avoid long explanations; if the answer is long, give a 1–2 sentence summary and ask if the user wants more.\n", + (int)MIMI_VOICE_TTS_MAX_SECONDS, + (int)MIMI_VOICE_LLM_MAX_CHARS); + + if (n < 0 || (size_t)n >= (size - off)) { + prompt[size - 1] = '\0'; + } + } } static char *patch_tool_input_with_context(const llm_tool_call_t *call, const mimi_msg_t *msg) @@ -218,7 +242,7 @@ static void agent_loop_task(void *arg) while (iteration < MIMI_AGENT_MAX_TOOL_ITER) { /* Send "working" indicator before each API call */ #if MIMI_AGENT_SEND_WORKING_STATUS - if (!sent_working_status && strcmp(msg.channel, MIMI_CHAN_SYSTEM) != 0) { + if (!sent_working_status && strcmp(msg.channel, MIMI_CHAN_SYSTEM) != 0 && strcmp(msg.channel, MIMI_CHAN_VOICE) != 0) { mimi_msg_t status = {0}; strncpy(status.channel, msg.channel, sizeof(status.channel) - 1); strncpy(status.chat_id, msg.chat_id, 
sizeof(status.chat_id) - 1); diff --git a/main/bus/message_bus.h b/main/bus/message_bus.h index 1fc2d31d..b0fe8edb 100644 --- a/main/bus/message_bus.h +++ b/main/bus/message_bus.h @@ -10,6 +10,7 @@ #define MIMI_CHAN_WEBSOCKET "websocket" #define MIMI_CHAN_CLI "cli" #define MIMI_CHAN_SYSTEM "system" +#define MIMI_CHAN_VOICE "voice" /* Message types on the bus */ typedef struct { diff --git a/main/cli/serial_cli.c b/main/cli/serial_cli.c index 4968ff7d..904d90d4 100644 --- a/main/cli/serial_cli.c +++ b/main/cli/serial_cli.c @@ -123,6 +123,12 @@ static struct { struct arg_end *end; } api_key_args; +/* --- set_api_base command --- */ +static struct { + struct arg_str *base; + struct arg_end *end; +} api_base_args; + static int cmd_set_api_key(int argc, char **argv) { int nerrors = arg_parse(argc, argv, (void **)&api_key_args); @@ -135,6 +141,18 @@ static int cmd_set_api_key(int argc, char **argv) return 0; } +static int cmd_set_api_base(int argc, char **argv) +{ + int nerrors = arg_parse(argc, argv, (void **)&api_base_args); + if (nerrors != 0) { + arg_print_errors(stderr, api_base_args.end, argv[0]); + return 1; + } + llm_set_api_base(api_base_args.base->sval[0]); + printf("API base set.\n"); + return 0; +} + /* --- set_model command --- */ static struct { struct arg_str *model; @@ -535,6 +553,7 @@ static int cmd_config_show(int argc, char **argv) print_config("WiFi Pass", MIMI_NVS_WIFI, MIMI_NVS_KEY_PASS, MIMI_SECRET_WIFI_PASS, true); print_config("TG Token", MIMI_NVS_TG, MIMI_NVS_KEY_TG_TOKEN, MIMI_SECRET_TG_TOKEN, true); print_config("API Key", MIMI_NVS_LLM, MIMI_NVS_KEY_API_KEY, MIMI_SECRET_API_KEY, true); + print_config("API Base", MIMI_NVS_LLM, MIMI_NVS_KEY_API_BASE, MIMI_SECRET_API_BASE, false); print_config("Model", MIMI_NVS_LLM, MIMI_NVS_KEY_MODEL, MIMI_SECRET_MODEL, false); print_config("Provider", MIMI_NVS_LLM, MIMI_NVS_KEY_PROVIDER, MIMI_SECRET_MODEL_PROVIDER, false); print_config("Proxy Host", MIMI_NVS_PROXY, MIMI_NVS_KEY_PROXY_HOST, 
MIMI_SECRET_PROXY_HOST, false); @@ -849,6 +868,17 @@ esp_err_t serial_cli_init(void) }; esp_console_cmd_register(&api_key_cmd); + /* set_api_base */ + api_base_args.base = arg_str1(NULL, NULL, "", "LLM API base (http(s)://host[:port][/path])"); + api_base_args.end = arg_end(1); + esp_console_cmd_t api_base_cmd = { + .command = "set_api_base", + .help = "Set LLM API base (e.g. https://api.anthropic.com/v1)", + .func = &cmd_set_api_base, + .argtable = &api_base_args, + }; + esp_console_cmd_register(&api_base_cmd); + /* set_model */ model_args.model = arg_str1(NULL, NULL, "", "Model identifier"); model_args.end = arg_end(1); @@ -1054,4 +1084,4 @@ esp_err_t serial_cli_init(void) ESP_LOGI(TAG, "Serial CLI started"); return ESP_OK; -} +} \ No newline at end of file diff --git a/main/llm/llm_proxy.c b/main/llm/llm_proxy.c index c6fa1b88..adcd74d5 100644 --- a/main/llm/llm_proxy.c +++ b/main/llm/llm_proxy.c @@ -15,12 +15,37 @@ static const char *TAG = "llm"; #define LLM_API_KEY_MAX_LEN 320 #define LLM_MODEL_MAX_LEN 64 +#define LLM_API_BASE_MAX_LEN 256 +#define LLM_HOST_MAX_LEN 128 +#define LLM_PATH_MAX_LEN 128 #define LLM_DUMP_MAX_BYTES (16 * 1024) #define LLM_DUMP_CHUNK_BYTES 320 static char s_api_key[LLM_API_KEY_MAX_LEN] = {0}; static char s_model[LLM_MODEL_MAX_LEN] = MIMI_LLM_DEFAULT_MODEL; +static char s_model_id[LLM_MODEL_MAX_LEN] = {0}; static char s_provider[16] = MIMI_LLM_PROVIDER_DEFAULT; +static char s_api_base[LLM_API_BASE_MAX_LEN] = {0}; + +typedef enum { + LLM_PROTOCOL_ANTHROPIC = 0, + LLM_PROTOCOL_OPENAI = 1, +} llm_protocol_t; + +static llm_protocol_t s_protocol = LLM_PROTOCOL_ANTHROPIC; +static bool s_api_tls = true; +static char s_api_host[LLM_HOST_MAX_LEN] = {0}; +static uint16_t s_api_port = 443; +static char s_api_base_path[LLM_PATH_MAX_LEN] = {0}; +static char s_api_req_path[LLM_PATH_MAX_LEN + 32] = {0}; +static char s_api_host_header[LLM_HOST_MAX_LEN + 8] = {0}; +static char s_api_url[LLM_API_BASE_MAX_LEN + 64] = {0}; +static bool 
s_logged_proxy_bypass_warning = false; + +static const char *llm_protocol_name(llm_protocol_t p) +{ + return (p == LLM_PROTOCOL_OPENAI) ? "openai" : "anthropic"; +} static void llm_log_payload(const char *label, const char *payload) { @@ -180,29 +205,157 @@ static esp_err_t http_event_handler(esp_http_client_event_t *evt) return ESP_OK; } -/* ── Provider helpers ──────────────────────────────────────────── */ +/* ── Protocol config ─────────────────────────────────────────── */ -static bool provider_is_openai(void) -{ - return strcmp(s_provider, "openai") == 0; +typedef struct { + llm_protocol_t protocol; + const char *label; /* "openai" */ + const char *prefix; /* "openai/" */ + const char *suffix; /* "/chat/completions" */ + const char *base; /* Default API base */ +} llm_proto_cfg_t; + +static const llm_proto_cfg_t PROTO_MAP[] = { + {LLM_PROTOCOL_OPENAI, "openai", "openai/", "/chat/completions", MIMI_LLM_API_BASE_OPENAI}, + {LLM_PROTOCOL_ANTHROPIC, "anthropic", "anthropic/", "/messages", MIMI_LLM_API_BASE_ANTHROPIC} +}; + +static const llm_proto_cfg_t* get_current_proto(void) { + return &PROTO_MAP[s_protocol == LLM_PROTOCOL_OPENAI ? 0 : 1]; } -static const char *llm_api_url(void) -{ - return provider_is_openai() ? MIMI_OPENAI_API_URL : MIMI_LLM_API_URL; +/* ── Helpers ─────────────────────────────────────────────────── */ + +static bool llm_protocol_is_openai(void) { + return s_protocol == LLM_PROTOCOL_OPENAI; } -static const char *llm_api_host(void) -{ - return provider_is_openai() ? 
"api.openai.com" : "api.anthropic.com"; +/* Validate api_base format without modifying global state */ +static esp_err_t llm_validate_api_base(const char *api_base) { + if (!api_base || api_base[0] == '\0') return ESP_ERR_INVALID_ARG; + + /* Check for valid scheme */ + const char *p; + if (strncmp(api_base, "https://", 8) == 0) { + p = api_base + 8; + } else if (strncmp(api_base, "http://", 7) == 0) { + p = api_base + 7; + } else { + return ESP_ERR_INVALID_ARG; + } + + /* Basic format validation - ensure there's content after the scheme */ + if (p[0] == '\0' || p[0] == '/' || p[0] == ':') { + return ESP_ERR_INVALID_ARG; + } + + /* Check for valid host part (before colon or slash) */ + const char *slash = strchr(p, '/'); + const char *colon = strchr(p, ':'); + if (colon && slash && colon > slash) colon = NULL; /* Colon is part of path */ + + const char *host_end = colon ? colon : (slash ? slash : p + strlen(p)); + if (host_end == p) return ESP_ERR_INVALID_ARG; /* Empty host */ + + /* Validate port if present */ + if (colon) { + char *endptr; + long port = strtol(colon + 1, &endptr, 10); + if (endptr == colon + 1 || (*endptr != '\0' && *endptr != '/') || + port < 1 || port > 65535) { + return ESP_ERR_INVALID_ARG; + } + } + + return ESP_OK; } -static const char *llm_api_path(void) -{ - return provider_is_openai() ? "/v1/chat/completions" : "/v1/messages"; +/* Parse api_base: scheme (http/https), host[:port], optional base path. 
*/ +static esp_err_t llm_parse_api_base(const char *api_base) { + if (!api_base || api_base[0] == '\0') return ESP_ERR_INVALID_ARG; + + const char *p; + if (strncmp(api_base, "https://", 8) == 0) { + s_api_tls = true; p = api_base + 8; s_api_port = 443; + } else if (strncmp(api_base, "http://", 7) == 0) { + s_api_tls = false; p = api_base + 7; s_api_port = 80; + } else return ESP_ERR_INVALID_ARG; + + const char *slash = strchr(p, '/'); + const char *colon = strchr(p, ':'); + if (colon && slash && colon > slash) colon = NULL; /* Colon is part of path */ + + const char *host_end = colon ? colon : (slash ? slash : p + strlen(p)); + snprintf(s_api_host, sizeof(s_api_host), "%.*s", (int)(host_end - p), p); + + if (colon) { + char *endptr; + long port = strtol(colon + 1, &endptr, 10); + if (endptr != colon + 1 && (*endptr == '\0' || *endptr == '/') && + port >= 1 && port <= 65535) { + s_api_port = (uint16_t)port; + } + /* If port parsing fails, keep the default port (443 for HTTPS, 80 for HTTP) */ + } + + s_api_base_path[0] = '\0'; + if (slash) { + safe_copy(s_api_base_path, sizeof(s_api_base_path), slash); + size_t len = strlen(s_api_base_path); + while (len > 0 && s_api_base_path[len - 1] == '/') s_api_base_path[--len] = '\0'; + } + return ESP_OK; +} + +/* Build derived request path, Host header, and full URL strings. */ +static void llm_build_request_targets(void) { + const llm_proto_cfg_t *cfg = get_current_proto(); + + snprintf(s_api_req_path, sizeof(s_api_req_path), "%s%s", s_api_base_path, cfg->suffix); + if (s_api_req_path[0] == '\0') strcpy(s_api_req_path, "/"); + + bool is_std = (s_api_tls && s_api_port == 443) || (!s_api_tls && s_api_port == 80); + if (is_std) { + snprintf(s_api_host_header, sizeof(s_api_host_header), "%s", s_api_host); + } else { + snprintf(s_api_host_header, sizeof(s_api_host_header), "%s:%u", s_api_host, s_api_port); + } + + snprintf(s_api_url, sizeof(s_api_url), "%s://%s%s", + s_api_tls ? 
"https" : "http", s_api_host_header, s_api_req_path); } -/* ── Init ─────────────────────────────────────────────────────── */ +/* ── Derived config ──────────────────────────────────────────── */ + +static void llm_recompute_effective_config(void) { + /* Determine protocol + model_id (prefix overrides provider), and update request targets. */ + s_logged_proxy_bypass_warning = false; /* Reset warning flag when config changes */ + s_protocol = (strcmp(s_provider, "openai") == 0) ? LLM_PROTOCOL_OPENAI : LLM_PROTOCOL_ANTHROPIC; + const char *model_id = s_model; + + for (int i = 0; i < 2; i++) { + size_t len = strlen(PROTO_MAP[i].prefix); + if (strncmp(s_model, PROTO_MAP[i].prefix, len) == 0 && s_model[len] != '\0') { + s_protocol = PROTO_MAP[i].protocol; + model_id = s_model + len; + break; + } + } + safe_copy(s_model_id, sizeof(s_model_id), model_id); + + const char *default_base = get_current_proto()->base; + const char *base = (s_api_base[0] != '\0') ? s_api_base : default_base; + + if (llm_parse_api_base(base) != ESP_OK) { + ESP_LOGE(TAG, "Failed to parse API base: %s. 
Using default.", base); + llm_parse_api_base(default_base); + } + + llm_build_request_targets(); + + ESP_LOGI(TAG, "Configured: Protocol=%s, Model=%s, URL=%s", + get_current_proto()->label, s_model_id, s_api_url); +} esp_err_t llm_proxy_init(void) { @@ -210,6 +363,9 @@ esp_err_t llm_proxy_init(void) if (MIMI_SECRET_API_KEY[0] != '\0') { safe_copy(s_api_key, sizeof(s_api_key), MIMI_SECRET_API_KEY); } + if (MIMI_SECRET_API_BASE[0] != '\0') { + safe_copy(s_api_base, sizeof(s_api_base), MIMI_SECRET_API_BASE); + } if (MIMI_SECRET_MODEL[0] != '\0') { safe_copy(s_model, sizeof(s_model), MIMI_SECRET_MODEL); } @@ -225,6 +381,11 @@ esp_err_t llm_proxy_init(void) if (nvs_get_str(nvs, MIMI_NVS_KEY_API_KEY, tmp, &len) == ESP_OK && tmp[0]) { safe_copy(s_api_key, sizeof(s_api_key), tmp); } + char base_tmp[LLM_API_BASE_MAX_LEN] = {0}; + len = sizeof(base_tmp); + if (nvs_get_str(nvs, MIMI_NVS_KEY_API_BASE, base_tmp, &len) == ESP_OK && base_tmp[0]) { + safe_copy(s_api_base, sizeof(s_api_base), base_tmp); + } char model_tmp[LLM_MODEL_MAX_LEN] = {0}; len = sizeof(model_tmp); if (nvs_get_str(nvs, MIMI_NVS_KEY_MODEL, model_tmp, &len) == ESP_OK && model_tmp[0]) { @@ -238,9 +399,9 @@ esp_err_t llm_proxy_init(void) nvs_close(nvs); } - if (s_api_key[0]) { - ESP_LOGI(TAG, "LLM proxy initialized (provider: %s, model: %s)", s_provider, s_model); - } else { + llm_recompute_effective_config(); + + if (s_api_key[0] == '\0') { ESP_LOGW(TAG, "No API key. 
Use CLI: set_api_key "); } return ESP_OK; @@ -251,7 +412,7 @@ esp_err_t llm_proxy_init(void) static esp_err_t llm_http_direct(const char *post_data, resp_buf_t *rb, int *out_status) { esp_http_client_config_t config = { - .url = llm_api_url(), + .url = s_api_url, .event_handler = http_event_handler, .user_data = rb, .timeout_ms = 120 * 1000, @@ -265,14 +426,16 @@ static esp_err_t llm_http_direct(const char *post_data, resp_buf_t *rb, int *out esp_http_client_set_method(client, HTTP_METHOD_POST); esp_http_client_set_header(client, "Content-Type", "application/json"); - if (provider_is_openai()) { + if (llm_protocol_is_openai()) { if (s_api_key[0]) { char auth[LLM_API_KEY_MAX_LEN + 16]; snprintf(auth, sizeof(auth), "Bearer %s", s_api_key); esp_http_client_set_header(client, "Authorization", auth); } } else { - esp_http_client_set_header(client, "x-api-key", s_api_key); + if (s_api_key[0] != '\0') { + esp_http_client_set_header(client, "x-api-key", s_api_key); + } esp_http_client_set_header(client, "anthropic-version", MIMI_LLM_API_VERSION); } esp_http_client_set_post_field(client, post_data, strlen(post_data)); @@ -287,80 +450,71 @@ static esp_err_t llm_http_direct(const char *post_data, resp_buf_t *rb, int *out static esp_err_t llm_http_via_proxy(const char *post_data, resp_buf_t *rb, int *out_status) { - proxy_conn_t *conn = proxy_conn_open(llm_api_host(), 443, 30000); + proxy_conn_t *conn = proxy_conn_open(s_api_host, s_api_port, 30000); if (!conn) return ESP_ERR_HTTP_CONNECT; - int body_len = strlen(post_data); - char header[1024]; - int hlen = 0; - if (provider_is_openai()) { - hlen = snprintf(header, sizeof(header), - "POST %s HTTP/1.1\r\n" - "Host: %s\r\n" - "Content-Type: application/json\r\n" - "Authorization: Bearer %s\r\n" - "Content-Length: %d\r\n" - "Connection: close\r\n\r\n", - llm_api_path(), llm_api_host(), s_api_key, body_len); + /* Build request headers */ + char h[1024]; + int off = snprintf(h, sizeof(h), "POST %s HTTP/1.1\r\nHost: 
%s\r\nContent-Type: application/json\r\n", + s_api_req_path, s_api_host_header); + + if (llm_protocol_is_openai()) { + if (s_api_key[0] != '\0') { + off += snprintf(h + off, sizeof(h) - off, "Authorization: Bearer %s\r\n", s_api_key); + } } else { - hlen = snprintf(header, sizeof(header), - "POST %s HTTP/1.1\r\n" - "Host: %s\r\n" - "Content-Type: application/json\r\n" - "x-api-key: %s\r\n" - "anthropic-version: %s\r\n" - "Content-Length: %d\r\n" - "Connection: close\r\n\r\n", - llm_api_path(), llm_api_host(), s_api_key, MIMI_LLM_API_VERSION, body_len); - } - - if (proxy_conn_write(conn, header, hlen) < 0 || - proxy_conn_write(conn, post_data, body_len) < 0) { + if (s_api_key[0] != '\0') { + off += snprintf(h + off, sizeof(h) - off, "x-api-key: %s\r\n", s_api_key); + } + off += snprintf(h + off, sizeof(h) - off, "anthropic-version: %s\r\n", MIMI_LLM_API_VERSION); + } + + off += snprintf(h + off, sizeof(h) - off, "Content-Length: %zu\r\nConnection: close\r\n\r\n", strlen(post_data)); + + /* Send */ + if (off >= sizeof(h) || proxy_conn_write(conn, h, off) < 0 || + proxy_conn_write(conn, post_data, strlen(post_data)) < 0) { proxy_conn_close(conn); return ESP_ERR_HTTP_WRITE_DATA; } - /* Read full response into buffer */ - char tmp[4096]; - while (1) { - int n = proxy_conn_read(conn, tmp, sizeof(tmp), 120000); - if (n <= 0) break; + /* Receive full response */ + char tmp[1024]; + int n; + while ((n = proxy_conn_read(conn, tmp, sizeof(tmp), 120000)) > 0) { if (resp_buf_append(rb, tmp, n) != ESP_OK) break; + vTaskDelay(pdMS_TO_TICKS(1)); } proxy_conn_close(conn); - /* Parse status line */ - *out_status = 0; - if (rb->len > 5 && strncmp(rb->data, "HTTP/", 5) == 0) { - const char *sp = strchr(rb->data, ' '); - if (sp) *out_status = atoi(sp + 1); - } + /* Parse status */ + *out_status = (rb->len > 12 && strncmp(rb->data, "HTTP/", 5) == 0) ? 
atoi(rb->data + 9) : 0; - /* Strip HTTP headers, keep body only */ + /* Strip headers */ char *body = strstr(rb->data, "\r\n\r\n"); if (body) { body += 4; - size_t blen = rb->len - (body - rb->data); - memmove(rb->data, body, blen); - rb->len = blen; + rb->len -= (body - rb->data); + memmove(rb->data, body, rb->len); rb->data[rb->len] = '\0'; } - /* Decode chunked transfer encoding if present */ resp_buf_decode_chunked(rb); - return ESP_OK; } -/* ── Shared HTTP dispatch ─────────────────────────────────────── */ - static esp_err_t llm_http_call(const char *post_data, resp_buf_t *rb, int *out_status) { if (http_proxy_is_enabled()) { - return llm_http_via_proxy(post_data, rb, out_status); - } else { - return llm_http_direct(post_data, rb, out_status); + if (s_api_tls) { + return llm_http_via_proxy(post_data, rb, out_status); + } + if (!s_logged_proxy_bypass_warning) { + ESP_LOGW(TAG, "Proxy configured but api_base is http; bypassing proxy"); + s_logged_proxy_bypass_warning = true; + } } + return llm_http_direct(post_data, rb, out_status); } static cJSON *convert_tools_openai(const char *tools_json) @@ -554,18 +708,16 @@ esp_err_t llm_chat_tools(const char *system_prompt, { memset(resp, 0, sizeof(*resp)); - if (s_api_key[0] == '\0') return ESP_ERR_INVALID_STATE; - /* Build request body (non-streaming) */ cJSON *body = cJSON_CreateObject(); - cJSON_AddStringToObject(body, "model", s_model); - if (provider_is_openai()) { + cJSON_AddStringToObject(body, "model", s_model_id); + if (strncasecmp(s_model_id, "gpt-5", 5) == 0 || strncasecmp(s_model_id, "o1", 2) == 0) { cJSON_AddNumberToObject(body, "max_completion_tokens", MIMI_LLM_MAX_TOKENS); } else { cJSON_AddNumberToObject(body, "max_tokens", MIMI_LLM_MAX_TOKENS); } - if (provider_is_openai()) { + if (llm_protocol_is_openai()) { cJSON *openai_msgs = convert_messages_openai(system_prompt, messages); cJSON_AddItemToObject(body, "messages", openai_msgs); @@ -596,8 +748,8 @@ esp_err_t llm_chat_tools(const char *system_prompt, 
cJSON_Delete(body); if (!post_data) return ESP_ERR_NO_MEM; - ESP_LOGI(TAG, "Calling LLM API with tools (provider: %s, model: %s, body: %d bytes)", - s_provider, s_model, (int)strlen(post_data)); + ESP_LOGI(TAG, "Calling LLM API with tools (protocol: %s, model: %s, body: %d bytes)", + llm_protocol_name(s_protocol), s_model_id, (int)strlen(post_data)); llm_log_payload("LLM tools request", post_data); /* HTTP call */ @@ -635,7 +787,7 @@ esp_err_t llm_chat_tools(const char *system_prompt, return ESP_FAIL; } - if (provider_is_openai()) { + if (llm_protocol_is_openai()) { cJSON *choices = cJSON_GetObjectItem(root, "choices"); cJSON *choice0 = choices && cJSON_IsArray(choices) ? cJSON_GetArrayItem(choices, 0) : NULL; if (choice0) { @@ -784,6 +936,27 @@ esp_err_t llm_set_api_key(const char *api_key) return ESP_OK; } +esp_err_t llm_set_api_base(const char *api_base) +{ + /* Validate before persisting - use validation-only function */ + esp_err_t err = llm_validate_api_base(api_base); + if (err != ESP_OK) { + ESP_LOGE(TAG, "Invalid API base format: %s", api_base ? 
api_base : ""); + return err; + } + + nvs_handle_t nvs; + ESP_ERROR_CHECK(nvs_open(MIMI_NVS_LLM, NVS_READWRITE, &nvs)); + ESP_ERROR_CHECK(nvs_set_str(nvs, MIMI_NVS_KEY_API_BASE, api_base)); + ESP_ERROR_CHECK(nvs_commit(nvs)); + nvs_close(nvs); + + safe_copy(s_api_base, sizeof(s_api_base), api_base); + llm_recompute_effective_config(); + ESP_LOGI(TAG, "API base set"); + return ESP_OK; +} + esp_err_t llm_set_model(const char *model) { nvs_handle_t nvs; @@ -793,6 +966,7 @@ esp_err_t llm_set_model(const char *model) nvs_close(nvs); safe_copy(s_model, sizeof(s_model), model); + llm_recompute_effective_config(); ESP_LOGI(TAG, "Model set to: %s", s_model); return ESP_OK; } @@ -806,6 +980,7 @@ esp_err_t llm_set_provider(const char *provider) nvs_close(nvs); safe_copy(s_provider, sizeof(s_provider), provider); + llm_recompute_effective_config(); ESP_LOGI(TAG, "Provider set to: %s", s_provider); return ESP_OK; -} +} \ No newline at end of file diff --git a/main/llm/llm_proxy.h b/main/llm/llm_proxy.h index b667f624..7d333b84 100644 --- a/main/llm/llm_proxy.h +++ b/main/llm/llm_proxy.h @@ -17,6 +17,19 @@ esp_err_t llm_proxy_init(void); */ esp_err_t llm_set_api_key(const char *api_key); +/** + * Save the LLM API base URL to NVS. + * + * Expected format: http(s)://host[:port][/path] + * Examples: + * - https://api.anthropic.com/v1 + * - https://api.openai.com/v1 + * - http://localhost:11434/v1 + * - https://api.minimaxi.com/anthropic/v1 + * - https://open.bigmodel.cn/api/paas/v4 + */ +esp_err_t llm_set_api_base(const char *api_base); + /** * Save the LLM provider to NVS. (e.g. 
"anthropic", "openai") */ @@ -58,4 +71,4 @@ void llm_response_free(llm_response_t *resp); esp_err_t llm_chat_tools(const char *system_prompt, cJSON *messages, const char *tools_json, - llm_response_t *resp); + llm_response_t *resp); \ No newline at end of file diff --git a/main/mimi.c b/main/mimi.c index 0e8e8fa7..430b9b88 100644 --- a/main/mimi.c +++ b/main/mimi.c @@ -25,6 +25,7 @@ #include "cron/cron_service.h" #include "heartbeat/heartbeat.h" #include "skills/skill_loader.h" +#include "voice/voice_channel.h" static const char *TAG = "mimi"; @@ -60,41 +61,73 @@ static esp_err_t init_spiffs(void) return ESP_OK; } - +static void voice_speak_task(void *arg) +{ + char *text = (char *)arg; + if (text) { + esp_err_t err = voice_channel_speak_text(text); + if (err != ESP_OK) { + ESP_LOGW(TAG, "Voice playback failed: %s", esp_err_to_name(err)); + } + free(text); + } + vTaskDelete(NULL); +} /* Outbound dispatch task: reads from outbound queue and routes to channels */ static void outbound_dispatch_task(void *arg) { - ESP_LOGI(TAG, "Outbound dispatch started"); + (void)arg; + ESP_LOGI(TAG, "Outbound dispatch started on core %d", xPortGetCoreID()); while (1) { - mimi_msg_t msg; - if (message_bus_pop_outbound(&msg, UINT32_MAX) != ESP_OK) continue; + mimi_msg_t msg = {0}; + if (message_bus_pop_outbound(&msg, UINT32_MAX) != ESP_OK) { + continue; + } + + ESP_LOGI(TAG, "Dispatching response to %s:%s", + msg.channel[0] ? msg.channel : "(unknown)", + msg.chat_id[0] ? 
msg.chat_id : "(empty)"); - ESP_LOGI(TAG, "Dispatching response to %s:%s", msg.channel, msg.chat_id); + if (!msg.content || !msg.content[0]) { + free(msg.content); + continue; + } if (strcmp(msg.channel, MIMI_CHAN_TELEGRAM) == 0) { - esp_err_t send_err = telegram_send_message(msg.chat_id, msg.content); - if (send_err != ESP_OK) { - ESP_LOGE(TAG, "Telegram send failed for %s: %s", msg.chat_id, esp_err_to_name(send_err)); - } else { - ESP_LOGI(TAG, "Telegram send success for %s (%d bytes)", msg.chat_id, (int)strlen(msg.content)); - } + telegram_send_message(msg.chat_id, msg.content); + } else if (strcmp(msg.channel, MIMI_CHAN_FEISHU) == 0) { - esp_err_t send_err = feishu_send_message(msg.chat_id, msg.content); - if (send_err != ESP_OK) { - ESP_LOGE(TAG, "Feishu send failed for %s: %s", msg.chat_id, esp_err_to_name(send_err)); - } else { - ESP_LOGI(TAG, "Feishu send success for %s (%d bytes)", msg.chat_id, (int)strlen(msg.content)); - } + feishu_send_message(msg.chat_id, msg.content); + } else if (strcmp(msg.channel, MIMI_CHAN_WEBSOCKET) == 0) { - esp_err_t ws_err = ws_server_send(msg.chat_id, msg.content); - if (ws_err != ESP_OK) { - ESP_LOGW(TAG, "WS send failed for %s: %s", msg.chat_id, esp_err_to_name(ws_err)); + ws_server_send(msg.chat_id, msg.content); + + } else if (strcmp(msg.channel, MIMI_CHAN_VOICE) == 0) { + char *copy = strdup(msg.content); + if (!copy) { + ESP_LOGW(TAG, "No memory for voice speak task"); + } else { + BaseType_t ok = xTaskCreatePinnedToCore( + voice_speak_task, + "voice_speak", + MIMI_VOICE_SPEAK_STACK, + copy, + MIMI_VOICE_SPEAK_PRIO, + NULL, + MIMI_VOICE_SPEAK_CORE + ); + if (ok != pdPASS) { + ESP_LOGW(TAG, "Failed to create voice_speak task"); + free(copy); + } } - } else if (strcmp(msg.channel, MIMI_CHAN_SYSTEM) == 0) { - ESP_LOGI(TAG, "System message [%s]: %.128s", msg.chat_id, msg.content); + + } else if (strcmp(msg.channel, MIMI_CHAN_CLI) == 0) { + printf("\n%s\n", msg.content); + } else { - ESP_LOGW(TAG, "Unknown channel: %s", 
msg.channel); + ESP_LOGW(TAG, "Unknown outbound channel: %s", msg.channel); } free(msg.content); @@ -134,6 +167,7 @@ void app_main(void) ESP_ERROR_CHECK(tool_registry_init()); ESP_ERROR_CHECK(cron_service_init()); ESP_ERROR_CHECK(heartbeat_init()); + ESP_ERROR_CHECK(voice_channel_init()); ESP_ERROR_CHECK(agent_loop_init()); /* Start Serial CLI first (works without WiFi) */ @@ -161,6 +195,7 @@ void app_main(void) ESP_ERROR_CHECK(feishu_bot_start()); cron_service_start(); heartbeat_start(); + voice_channel_start(); ESP_ERROR_CHECK(ws_server_start()); ESP_LOGI(TAG, "All services started!"); diff --git a/main/mimi_config.h b/main/mimi_config.h index 9be7c087..68f99369 100644 --- a/main/mimi_config.h +++ b/main/mimi_config.h @@ -19,6 +19,9 @@ #ifndef MIMI_SECRET_API_KEY #define MIMI_SECRET_API_KEY "" #endif +#ifndef MIMI_SECRET_LLM_API_URL +#define MIMI_SECRET_LLM_API_URL "" +#endif #ifndef MIMI_SECRET_MODEL #define MIMI_SECRET_MODEL "" #endif @@ -46,6 +49,36 @@ #ifndef MIMI_SECRET_TAVILY_KEY #define MIMI_SECRET_TAVILY_KEY "" #endif +#ifndef MIMI_SECRET_STT_URL +#define MIMI_SECRET_STT_URL "" +#endif +#ifndef MIMI_SECRET_STT_API_KEY +#define MIMI_SECRET_STT_API_KEY "" +#endif +#ifndef MIMI_SECRET_STT_MODEL +#define MIMI_SECRET_STT_MODEL "" +#endif +#ifndef MIMI_SECRET_TTS_URL +#define MIMI_SECRET_TTS_URL "" +#endif +#ifndef MIMI_SECRET_TTS_API_KEY +#define MIMI_SECRET_TTS_API_KEY "" +#endif +#ifndef MIMI_SECRET_TTS_VOICE +#define MIMI_SECRET_TTS_VOICE "Cherry" +#endif +#ifndef MIMI_SECRET_TTS_MODEL +#define MIMI_SECRET_TTS_MODEL "" +#endif +#ifndef MIMI_SECRET_TTS_LANGUAGE +#define MIMI_SECRET_TTS_LANGUAGE "English" +#endif + +/* Qwen voice API defaults (DashScope) */ +#define MIMI_QWEN_STT_URL "https://dashscope.aliyuncs.com/compatible-mode/v1/chat/completions" +#define MIMI_QWEN_STT_MODEL "qwen3-asr-flash" +#define MIMI_QWEN_TTS_URL "https://dashscope.aliyuncs.com/api/v1/services/aigc/multimodal-generation/generation" +#define MIMI_QWEN_TTS_MODEL "qwen3-tts-flash" /* 
WiFi */ #define MIMI_WIFI_MAX_RETRY 10 @@ -79,6 +112,40 @@ #define MIMI_MAX_TOOL_CALLS 4 #define MIMI_AGENT_SEND_WORKING_STATUS 1 +/* Voice UX (LLM -> TTS) */ +/* Rough speaking rate for Simplified Chinese TTS is often ~4–6 chars/sec depending on voice. + * Default limits aim to keep playback under ~20 seconds in typical conditions. + * Override these in mimi_secrets.h per your preferred voice/speed. + */ +#ifndef MIMI_VOICE_TTS_MAX_SECONDS +#define MIMI_VOICE_TTS_MAX_SECONDS 20 +#endif + +#ifndef MIMI_VOICE_TTS_CHARS_PER_SEC +#define MIMI_VOICE_TTS_CHARS_PER_SEC 7 +#endif + +#ifndef MIMI_VOICE_LLM_MAX_CHARS +#define MIMI_VOICE_LLM_MAX_CHARS (MIMI_VOICE_TTS_MAX_SECONDS * MIMI_VOICE_TTS_CHARS_PER_SEC) +#endif + +#ifndef MIMI_VOICE_TTS_MAX_CHARS +#define MIMI_VOICE_TTS_MAX_CHARS (MIMI_VOICE_LLM_MAX_CHARS + 10) +#endif + +/* Voice capture (VAD / STT trigger) */ +#ifndef MIMI_VOICE_VAD_START_FRAMES +#define MIMI_VOICE_VAD_START_FRAMES 4 /* consecutive frames above threshold to enter speech */ +#endif + +#ifndef MIMI_VOICE_VAD_MIN_FRAMES +#define MIMI_VOICE_VAD_MIN_FRAMES 50 /* minimum utterance frames before sending to STT */ +#endif + +#ifndef MIMI_VOICE_STT_COOLDOWN_MS +#define MIMI_VOICE_STT_COOLDOWN_MS 2000 /* cooldown after an STT attempt to reduce re-trigger */ +#endif + /* Timezone (POSIX TZ format) */ #define MIMI_TIMEZONE "PST8PDT,M3.2.0,M11.1.0" @@ -86,8 +153,8 @@ #define MIMI_LLM_DEFAULT_MODEL "claude-opus-4-5" #define MIMI_LLM_PROVIDER_DEFAULT "anthropic" #define MIMI_LLM_MAX_TOKENS 4096 -#define MIMI_LLM_API_URL "https://api.anthropic.com/v1/messages" -#define MIMI_OPENAI_API_URL "https://api.openai.com/v1/chat/completions" +#define MIMI_LLM_API_BASE_ANTHROPIC "https://api.anthropic.com/v1" +#define MIMI_LLM_API_BASE_OPENAI "https://api.openai.com/v1" #define MIMI_LLM_API_VERSION "2023-06-01" #define MIMI_LLM_STREAM_BUF_SIZE (32 * 1024) #define MIMI_LLM_LOG_VERBOSE_PAYLOAD 0 @@ -99,6 +166,22 @@ #define MIMI_OUTBOUND_PRIO 5 #define MIMI_OUTBOUND_CORE 0 +/* 
Voice speak task (TTS download + resample + playback) */ +#ifndef MIMI_VOICE_SPEAK_STACK +#define MIMI_VOICE_SPEAK_STACK (12 * 1024) +#endif +#ifndef MIMI_VOICE_SPEAK_PRIO +#define MIMI_VOICE_SPEAK_PRIO 5 +#endif +#ifndef MIMI_VOICE_SPEAK_CORE +#define MIMI_VOICE_SPEAK_CORE 1 +#endif + +/* WiFi reliability */ +#ifndef MIMI_WIFI_DISABLE_POWERSAVE +#define MIMI_WIFI_DISABLE_POWERSAVE 1 +#endif + /* Memory / SPIFFS */ #define MIMI_SPIFFS_BASE "/spiffs" #define MIMI_SPIFFS_CONFIG_DIR MIMI_SPIFFS_BASE "/config" @@ -144,6 +227,7 @@ #define MIMI_NVS_KEY_FEISHU_APP_ID "app_id" #define MIMI_NVS_KEY_FEISHU_APP_SECRET "app_secret" #define MIMI_NVS_KEY_API_KEY "api_key" +#define MIMI_NVS_KEY_API_BASE "api_base" #define MIMI_NVS_KEY_TAVILY_KEY "tavily_key" #define MIMI_NVS_KEY_MODEL "model" #define MIMI_NVS_KEY_PROVIDER "provider" diff --git a/main/mimi_secrets.h.example b/main/mimi_secrets.h.example index ecebf54e..1852f66c 100644 --- a/main/mimi_secrets.h.example +++ b/main/mimi_secrets.h.example @@ -21,8 +21,9 @@ #define MIMI_SECRET_FEISHU_APP_ID "" #define MIMI_SECRET_FEISHU_APP_SECRET "" -/* Anthropic API */ +/* LLM */ #define MIMI_SECRET_API_KEY "" +#define MIMI_SECRET_LLM_API_URL "" /* optional: full URL incl scheme/host/port/path */ #define MIMI_SECRET_MODEL "" #define MIMI_SECRET_MODEL_PROVIDER "anthropic" @@ -33,5 +34,53 @@ /* Brave Search API */ #define MIMI_SECRET_SEARCH_KEY "" + +/* Voice STT / TTS services */ +#define MIMI_SECRET_STT_URL "" +#define MIMI_SECRET_STT_API_KEY "" +#define MIMI_SECRET_STT_MODEL "" +#define MIMI_SECRET_TTS_URL "" +#define MIMI_SECRET_TTS_API_KEY "" +#define MIMI_SECRET_TTS_VOICE "Cherry" +#define MIMI_SECRET_TTS_MODEL "" +#define MIMI_SECRET_TTS_LANGUAGE "English" + +/* ReSpeaker XVF3800 I2S pin map (set per board) */ +#define MIMI_VOICE_I2S_PORT 0 +#define MIMI_VOICE_I2S_BCLK (-1) +#define MIMI_VOICE_I2S_WS (-1) +#define MIMI_VOICE_I2S_DIN (-1) +#define MIMI_VOICE_I2S_DOUT (-1) + +/* I2S slot/timing style (set per DAC/codec): + * 0: 
Philips (I2S) + * 1: MSB (left-justified) + * 2: PCM (short frame sync) + */ +/* #define MIMI_VOICE_I2S_STD_SLOT_STYLE 1 */ + +/* Optional: tune DMA and silence tail to suppress post-playback "thumping" on some DAC/amps */ +/* #define MIMI_VOICE_I2S_DMA_DESC_NUM 6 */ +/* #define MIMI_VOICE_I2S_DMA_FRAME_NUM 240 */ +/* #define MIMI_VOICE_TX_SILENCE_TAIL_MS 400 */ + +/* Optional: voice conversation pacing (LLM -> TTS) + * Target: <= 20s playback, <= 2 sentences + 1 follow-up question. + */ +/* #define MIMI_VOICE_TTS_MAX_SECONDS 20 */ +/* #define MIMI_VOICE_TTS_CHARS_PER_SEC 5 */ +/* #define MIMI_VOICE_LLM_MAX_CHARS 100 */ +/* #define MIMI_VOICE_TTS_MAX_CHARS 110 */ + +/* Optional: reduce STT false triggers (VAD tuning) */ +/* #define MIMI_VOICE_VAD_START_FRAMES 3 */ +/* #define MIMI_VOICE_VAD_MIN_FRAMES 15 */ +/* #define MIMI_VOICE_STT_COOLDOWN_MS 1200 */ + +/* Optional: WiFi reliability tuning (may increase power draw) */ +/* #define MIMI_WIFI_DISABLE_POWERSAVE 1 */ + +/* Optional: move TTS/resample/playback off WiFi core to reduce bcn_timeout under load */ +/* #define MIMI_VOICE_SPEAK_CORE 1 */ /* Tavily Search API */ #define MIMI_SECRET_TAVILY_KEY "" diff --git a/main/proxy/http_proxy.c b/main/proxy/http_proxy.c index fdb75541..3745144d 100644 --- a/main/proxy/http_proxy.c +++ b/main/proxy/http_proxy.c @@ -104,6 +104,61 @@ bool http_proxy_is_enabled(void) return s_proxy_host[0] != '\0' && s_proxy_port != 0; } +/* ── Raw tunnels (no TLS) ────────────────────────────────────── */ + +static int open_connect_tunnel(const char *host, int port, int timeout_ms); +static int open_socks5_tunnel(const char *host, int port, int timeout_ms); + +int proxy_tunnel_open(const char *host, int port, int timeout_ms) +{ + if (!http_proxy_is_enabled()) { + ESP_LOGE(TAG, "proxy_tunnel_open called but no proxy configured"); + return -1; + } + + if (!host || !host[0] || port <= 0 || port > 65535) { + ESP_LOGE(TAG, "proxy_tunnel_open invalid target"); + return -1; + } + + if 
(strcmp(s_proxy_type, "socks5") == 0) { + return open_socks5_tunnel(host, port, timeout_ms); + } + return open_connect_tunnel(host, port, timeout_ms); +} + +int proxy_tunnel_write(int sock, const char *data, int len) +{ + if (sock < 0 || !data || len <= 0) return -1; + + int written = 0; + while (written < len) { + int n = send(sock, data + written, len - written, 0); + if (n <= 0) return -1; + written += n; + } + return written; +} + +int proxy_tunnel_read(int sock, char *buf, int len, int timeout_ms) +{ + if (sock < 0 || !buf || len <= 0) return -1; + + struct timeval tv = { .tv_sec = timeout_ms / 1000, .tv_usec = (timeout_ms % 1000) * 1000 }; + setsockopt(sock, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)); + + int n = recv(sock, buf, len, 0); + if (n < 0) return -1; + return n; +} + +void proxy_tunnel_close(int sock) +{ + if (sock >= 0) { + close(sock); + } +} + /* ── Proxied TLS connection ───────────────────────────────────── */ struct proxy_conn { diff --git a/main/proxy/http_proxy.h b/main/proxy/http_proxy.h index f324700e..382dc742 100644 --- a/main/proxy/http_proxy.h +++ b/main/proxy/http_proxy.h @@ -24,6 +24,27 @@ esp_err_t http_proxy_set(const char *host, uint16_t port, const char *type); */ esp_err_t http_proxy_clear(void); +/* ── Proxy tunnels (no TLS) ──────────────────────────────────── */ + +/** + * Open a raw TCP tunnel to target host:port through the configured proxy. + * + * - If proxy type is "http": uses HTTP CONNECT + * - If proxy type is "socks5": uses SOCKS5 CONNECT + * + * Returns a socket fd on success, or -1 on failure. + */ +int proxy_tunnel_open(const char *host, int port, int timeout_ms); + +/** Write raw bytes through the tunnel. Returns bytes written or -1. */ +int proxy_tunnel_write(int sock, const char *data, int len); + +/** Read raw bytes from the tunnel. Returns bytes read or -1. */ +int proxy_tunnel_read(int sock, char *buf, int len, int timeout_ms); + +/** Close the tunnel socket. 
*/ +void proxy_tunnel_close(int sock); + /* ── Proxied HTTPS connection ─────────────────────────────────── */ typedef struct proxy_conn proxy_conn_t; diff --git a/main/tools/tool_cron.c b/main/tools/tool_cron.c index 048e8902..5670678f 100644 --- a/main/tools/tool_cron.c +++ b/main/tools/tool_cron.c @@ -66,16 +66,21 @@ esp_err_t tool_cron_add_execute(const char *input_json, char *output, size_t out job.delete_after_run = false; } else if (strcmp(schedule_type, "at") == 0) { job.kind = CRON_KIND_AT; - cJSON *at_epoch = cJSON_GetObjectItem(root, "at_epoch"); - if (!at_epoch || !cJSON_IsNumber(at_epoch)) { - snprintf(output, output_size, "Error: 'at' schedule requires 'at_epoch' (unix timestamp)"); - cJSON_Delete(root); - return ESP_ERR_INVALID_ARG; + time_t now = time(NULL); + cJSON *delay_s = cJSON_GetObjectItem(root, "delay_s"); + if (delay_s && cJSON_IsNumber(delay_s) && delay_s->valuedouble > 0) { + job.at_epoch = (int64_t)now + (int64_t)delay_s->valuedouble; + } else { + cJSON *at_epoch = cJSON_GetObjectItem(root, "at_epoch"); + if (!at_epoch || !cJSON_IsNumber(at_epoch)) { + snprintf(output, output_size, "Error: 'at' schedule requires 'at_epoch' (unix timestamp) or positive 'delay_s'"); + cJSON_Delete(root); + return ESP_ERR_INVALID_ARG; + } + job.at_epoch = (int64_t)at_epoch->valuedouble; } - job.at_epoch = (int64_t)at_epoch->valuedouble; /* Check if already in the past */ - time_t now = time(NULL); if (job.at_epoch <= now) { snprintf(output, output_size, "Error: at_epoch %lld is in the past (now=%lld)", (long long)job.at_epoch, (long long)now); diff --git a/main/tools/tool_registry.c b/main/tools/tool_registry.c index 6c82a3ef..e6251f8a 100644 --- a/main/tools/tool_registry.c +++ b/main/tools/tool_registry.c @@ -135,14 +135,15 @@ esp_err_t tool_registry_init(void) /* Register cron_add */ mimi_tool_t ca = { .name = "cron_add", - .description = "Schedule a recurring or one-shot task. 
The message will trigger an agent turn when the job fires.", + .description = "Schedule a recurring or one-shot task. For relative reminders (e.g. 'in 2 minutes'), prefer delay_s to avoid timestamp math. The message will trigger an agent turn when the job fires.", .input_schema_json = "{\"type\":\"object\"," "\"properties\":{" "\"name\":{\"type\":\"string\",\"description\":\"Short name for the job\"}," "\"schedule_type\":{\"type\":\"string\",\"description\":\"'every' for recurring interval or 'at' for one-shot at a unix timestamp\"}," "\"interval_s\":{\"type\":\"integer\",\"description\":\"Interval in seconds (required for 'every')\"}," - "\"at_epoch\":{\"type\":\"integer\",\"description\":\"Unix timestamp to fire at (required for 'at')\"}," + "\"at_epoch\":{\"type\":\"integer\",\"description\":\"Unix timestamp to fire at (for 'at'). Prefer delay_s for relative reminders.\"}," + "\"delay_s\":{\"type\":\"integer\",\"description\":\"Delay in seconds from now (preferred for 'at' when user says 'in N minutes')\"}," "\"message\":{\"type\":\"string\",\"description\":\"Message to inject when the job fires, triggering an agent turn\"}," "\"channel\":{\"type\":\"string\",\"description\":\"Optional reply channel (e.g. 'telegram'). If omitted, current turn channel is used when available\"}," "\"chat_id\":{\"type\":\"string\",\"description\":\"Optional reply chat_id. Required when channel='telegram'. 
If omitted during a Telegram turn, current chat_id is used\"}" diff --git a/main/voice/voice_channel.c b/main/voice/voice_channel.c new file mode 100644 index 00000000..039c6e26 --- /dev/null +++ b/main/voice/voice_channel.c @@ -0,0 +1,1613 @@ +#include "voice/voice_channel.h" + +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <stdint.h> +#include <stdbool.h> +#include <stddef.h> +#include <math.h> + +#include "mimi_config.h" +#include "bus/message_bus.h" +#include "proxy/http_proxy.h" + +#include "freertos/FreeRTOS.h" +#include "freertos/task.h" +#include "freertos/semphr.h" + +#include "esp_log.h" +#include "esp_err.h" +#include "esp_http_client.h" +#include "esp_crt_bundle.h" +#include "esp_heap_caps.h" +#include "driver/i2s_std.h" +#include "driver/i2s_common.h" + +#include "cJSON.h" +#include "mbedtls/base64.h" + +static const char *TAG = "voice"; + +/* + * I2S timing / slot style selection: + * 0: Philips (I2S, 1-bit delay after WS edge) + * 1: MSB (left-justified, no 1-bit delay) + * 2: PCM (short frame sync, ws_width=1, ws_pol=true) + * + * Many DAC/codec parts are sensitive to this. If your audio sounds like loud + * hissing/static noise but speech is partially recognizable, this is a prime suspect. 
+ */ +#ifndef MIMI_VOICE_I2S_STD_SLOT_STYLE +#define MIMI_VOICE_I2S_STD_SLOT_STYLE 0 +#endif + +#ifndef MIMI_VOICE_I2S_DMA_DESC_NUM +#define MIMI_VOICE_I2S_DMA_DESC_NUM 6 +#endif + +#ifndef MIMI_VOICE_I2S_DMA_FRAME_NUM +#define MIMI_VOICE_I2S_DMA_FRAME_NUM 240 +#endif + +#ifndef MIMI_VOICE_TX_SILENCE_TAIL_MS +#define MIMI_VOICE_TX_SILENCE_TAIL_MS 400 +#endif + +#define MIMI_VOICE_TX_BYTES_PER_FRAME (2U * sizeof(int32_t)) +#define MIMI_VOICE_TX_DMA_TOTAL_BYTES \ + ((uint32_t)MIMI_VOICE_I2S_DMA_DESC_NUM * (uint32_t)MIMI_VOICE_I2S_DMA_FRAME_NUM * (uint32_t)MIMI_VOICE_TX_BYTES_PER_FRAME) + +#if MIMI_VOICE_I2S_STD_SLOT_STYLE == 1 +#define MIMI_VOICE_I2S_SLOT_DEFAULT_CONFIG(bits, mono_or_stereo) \ + I2S_STD_MSB_SLOT_DEFAULT_CONFIG(bits, mono_or_stereo) +#elif MIMI_VOICE_I2S_STD_SLOT_STYLE == 2 +#define MIMI_VOICE_I2S_SLOT_DEFAULT_CONFIG(bits, mono_or_stereo) \ + I2S_STD_PCM_SLOT_DEFAULT_CONFIG(bits, mono_or_stereo) +#else +#define MIMI_VOICE_I2S_SLOT_DEFAULT_CONFIG(bits, mono_or_stereo) \ + I2S_STD_PHILIPS_SLOT_DEFAULT_CONFIG(bits, mono_or_stereo) +#endif + +static const char *i2s_slot_style_str(void) +{ +#if MIMI_VOICE_I2S_STD_SLOT_STYLE == 1 + return "MSB"; +#elif MIMI_VOICE_I2S_STD_SLOT_STYLE == 2 + return "PCM"; +#else + return "PHILIPS"; +#endif +} + +/* ========================= + * Fallback config defaults + * ========================= */ + +#ifndef MIMI_VOICE_ENABLED_DEFAULT +#define MIMI_VOICE_ENABLED_DEFAULT 0 +#endif + +#ifndef MIMI_VOICE_CHAT_ID +#define MIMI_VOICE_CHAT_ID "voice_local" +#endif + +#ifndef MIMI_VOICE_SAMPLE_RATE +#define MIMI_VOICE_SAMPLE_RATE 16000 +#endif + +#ifndef MIMI_VOICE_FRAME_MS +#define MIMI_VOICE_FRAME_MS 20 +#endif + +#ifndef MIMI_VOICE_MAX_UTTERANCE_MS +#define MIMI_VOICE_MAX_UTTERANCE_MS 10000 +#endif + +#ifndef MIMI_VOICE_SILENCE_END_MS +#define MIMI_VOICE_SILENCE_END_MS 600 +#endif + +#ifndef MIMI_VOICE_VAD_THRESHOLD +#define MIMI_VOICE_VAD_THRESHOLD 700 +#endif + +#ifndef MIMI_VOICE_CAPTURE_STACK +#define 
MIMI_VOICE_CAPTURE_STACK (8 * 1024) +#endif + +#ifndef MIMI_VOICE_TASK_PRIO +#define MIMI_VOICE_TASK_PRIO 5 +#endif + +#ifndef MIMI_VOICE_CORE +#define MIMI_VOICE_CORE 0 +#endif + +#ifndef MIMI_SECRET_STT_URL +#define MIMI_SECRET_STT_URL "" +#endif + +#ifndef MIMI_SECRET_STT_API_KEY +#define MIMI_SECRET_STT_API_KEY "" +#endif + +#ifndef MIMI_SECRET_STT_MODEL +#define MIMI_SECRET_STT_MODEL "qwen3-asr-flash" +#endif + +#ifndef MIMI_SECRET_TTS_URL +#define MIMI_SECRET_TTS_URL "" +#endif + +#ifndef MIMI_SECRET_TTS_API_KEY +#define MIMI_SECRET_TTS_API_KEY "" +#endif + +#ifndef MIMI_SECRET_TTS_MODEL +#define MIMI_SECRET_TTS_MODEL "qwen3-tts-flash" +#endif + +#ifndef MIMI_SECRET_TTS_VOICE +#define MIMI_SECRET_TTS_VOICE "Cherry" +#endif + +#ifndef MIMI_SECRET_TTS_LANGUAGE +#define MIMI_SECRET_TTS_LANGUAGE "English" +#endif + +#ifndef MIMI_SECRET_API_KEY +#define MIMI_SECRET_API_KEY "" +#endif + +/* TTS text constraints (can override in mimi_secrets.h) */ +#ifndef MIMI_VOICE_TTS_MAX_CHARS +#define MIMI_VOICE_TTS_MAX_CHARS 140 +#endif + +#ifndef MIMI_VOICE_I2S_PORT +#define MIMI_VOICE_I2S_PORT 0 +#endif + +#ifndef MIMI_VOICE_I2S_BCLK +#define MIMI_VOICE_I2S_BCLK 42 +#endif + +#ifndef MIMI_VOICE_I2S_WS +#define MIMI_VOICE_I2S_WS 41 +#endif + +#ifndef MIMI_VOICE_I2S_DIN +#define MIMI_VOICE_I2S_DIN 40 +#endif + +#ifndef MIMI_VOICE_I2S_DOUT +#define MIMI_VOICE_I2S_DOUT 39 +#endif + +/* XVF3800 fixed digital format in your current design: + * 16 kHz, stereo, 32-bit samples over I2S. 
+ */ +#define VOICE_I2S_CHANNELS 2 +#define VOICE_I2S_BYTES_PER_SAMPLE 4 +#define VOICE_I2S_BYTES_PER_STEREO_FRAME (VOICE_I2S_CHANNELS * VOICE_I2S_BYTES_PER_SAMPLE) +#define VOICE_PCM_BITS 16 + +typedef struct { + char *buf; + size_t len; + size_t cap; +} http_resp_t; + +typedef struct { + uint16_t audio_format; /* 1 = PCM */ + uint16_t channels; + uint32_t sample_rate; + uint16_t bits_per_sample; +} wav_fmt_t; + +static bool s_enabled = false; +static bool s_i2s_ready = false; +static volatile bool s_is_playing = false; + +static i2s_chan_handle_t s_tx_chan = NULL; +static i2s_chan_handle_t s_rx_chan = NULL; +static TaskHandle_t s_capture_task = NULL; +static SemaphoreHandle_t s_http_lock = NULL; + +/* ========================= + * Secrets / config helpers + * ========================= */ + +static const char *stt_api_url(void) +{ + return (MIMI_SECRET_STT_URL[0] != '\0') ? MIMI_SECRET_STT_URL : ""; +} + +static const char *stt_api_key(void) +{ + return (MIMI_SECRET_STT_API_KEY[0] != '\0') ? MIMI_SECRET_STT_API_KEY : + (MIMI_SECRET_API_KEY[0] != '\0') ? MIMI_SECRET_API_KEY : ""; +} + +static const char *stt_model(void) +{ + return (MIMI_SECRET_STT_MODEL[0] != '\0') ? MIMI_SECRET_STT_MODEL : "qwen3-asr-flash"; +} + +static const char *tts_api_url(void) +{ + return (MIMI_SECRET_TTS_URL[0] != '\0') ? MIMI_SECRET_TTS_URL : ""; +} + +static const char *tts_api_key(void) +{ + return (MIMI_SECRET_TTS_API_KEY[0] != '\0') ? MIMI_SECRET_TTS_API_KEY : + (MIMI_SECRET_API_KEY[0] != '\0') ? MIMI_SECRET_API_KEY : ""; +} + +static const char *tts_model(void) +{ + return (MIMI_SECRET_TTS_MODEL[0] != '\0') ? MIMI_SECRET_TTS_MODEL : "qwen3-tts-flash"; +} + +static const char *tts_voice(void) +{ + return (MIMI_SECRET_TTS_VOICE[0] != '\0') ? MIMI_SECRET_TTS_VOICE : "Cherry"; +} + +static const char *tts_language(void) +{ + return (MIMI_SECRET_TTS_LANGUAGE[0] != '\0') ? 
MIMI_SECRET_TTS_LANGUAGE : "English"; +} + +/* ========================= + * HTTP helpers + * ========================= */ + +static esp_err_t http_event_handler(esp_http_client_event_t *evt) +{ + http_resp_t *resp = (http_resp_t *)evt->user_data; + if (evt->event_id != HTTP_EVENT_ON_DATA || !resp || !evt->data || evt->data_len <= 0) { + return ESP_OK; + } + + size_t need = resp->len + (size_t)evt->data_len + 1; + if (need > resp->cap) { + size_t new_cap = resp->cap ? resp->cap * 2 : 1024; + while (new_cap < need) { + new_cap *= 2; + } + char *tmp = realloc(resp->buf, new_cap); + if (!tmp) { + return ESP_ERR_NO_MEM; + } + resp->buf = tmp; + resp->cap = new_cap; + } + + memcpy(resp->buf + resp->len, evt->data, evt->data_len); + resp->len += (size_t)evt->data_len; + resp->buf[resp->len] = '\0'; + return ESP_OK; +} + +static esp_err_t http_post_json(const char *url, + const char *bearer_key, + const char *json_body, + bool enable_sse, + http_resp_t *resp, + int *http_status_out) +{ + if (!url || !url[0] || !json_body || !resp) { + return ESP_ERR_INVALID_ARG; + } + + memset(resp, 0, sizeof(*resp)); + + esp_http_client_config_t cfg = { + .url = url, + .method = HTTP_METHOD_POST, + .event_handler = http_event_handler, + .user_data = resp, + .crt_bundle_attach = esp_crt_bundle_attach, + .timeout_ms = 30000, + .buffer_size = 2048, + .buffer_size_tx = 2048, + }; + + esp_http_client_handle_t client = esp_http_client_init(&cfg); + if (!client) { + return ESP_FAIL; + } + + esp_http_client_set_header(client, "Content-Type", "application/json"); + if (bearer_key && bearer_key[0]) { + char auth[320]; + snprintf(auth, sizeof(auth), "Bearer %s", bearer_key); + esp_http_client_set_header(client, "Authorization", auth); + } + esp_http_client_set_header(client, "X-DashScope-SSE", enable_sse ? 
"enable" : "disable"); + esp_http_client_set_post_field(client, json_body, (int)strlen(json_body)); + + esp_err_t err = esp_http_client_perform(client); + if (err == ESP_OK && http_status_out) { + *http_status_out = esp_http_client_get_status_code(client); + } + + esp_http_client_cleanup(client); + return err; +} + +static esp_err_t http_get_binary(const char *url, http_resp_t *resp, int *http_status_out) +{ + if (!url || !url[0] || !resp) { + return ESP_ERR_INVALID_ARG; + } + + memset(resp, 0, sizeof(*resp)); + + esp_http_client_config_t cfg = { + .url = url, + .method = HTTP_METHOD_GET, + .event_handler = http_event_handler, + .user_data = resp, + .crt_bundle_attach = esp_crt_bundle_attach, + .timeout_ms = 30000, + .buffer_size = 2048, + .buffer_size_tx = 1024, + }; + + esp_http_client_handle_t client = esp_http_client_init(&cfg); + if (!client) { + return ESP_FAIL; + } + + esp_err_t err = esp_http_client_perform(client); + if (err == ESP_OK && http_status_out) { + *http_status_out = esp_http_client_get_status_code(client); + } + + esp_http_client_cleanup(client); + return err; +} + +/* ========================= + * Audio helpers + * ========================= */ + +static void *malloc_prefer_spiram(size_t bytes) +{ + if (bytes == 0) { + return NULL; + } + + void *p = heap_caps_malloc(bytes, MALLOC_CAP_SPIRAM); + if (p) { + return p; + } + return malloc(bytes); +} + +static bool utf8_is_continuation_byte(uint8_t b) +{ + return (b & 0xC0U) == 0x80U; +} + +static bool utf8_starts_with(const char *s, size_t i, size_t len, const char *lit) +{ + size_t lit_len = strlen(lit); + if (i + lit_len > len) { + return false; + } + return memcmp(s + i, lit, lit_len) == 0; +} + +static bool is_speech_cut_punct(const char *s, size_t i, size_t len) +{ + const uint8_t b = (uint8_t)s[i]; + if (b == '\n' || b == '\r') { + return true; + } + if (b == '.' || b == '!' || b == '?' 
|| b == ',' || b == ';' || b == ':') { + return true; + } + + /* Common CJK punctuation in UTF-8 */ + if (utf8_starts_with(s, i, len, "。") || + utf8_starts_with(s, i, len, "!") || + utf8_starts_with(s, i, len, "?") || + utf8_starts_with(s, i, len, ",") || + utf8_starts_with(s, i, len, ";") || + utf8_starts_with(s, i, len, ":") || + utf8_starts_with(s, i, len, "、")) { + return true; + } + + return false; +} + +static size_t utf8_truncate_for_tts(const char *text, size_t max_chars, size_t *out_char_count, bool *out_truncated) +{ + if (!text || max_chars == 0) { + if (out_char_count) *out_char_count = 0; + if (out_truncated) *out_truncated = false; + return 0; + } + + const size_t len = strlen(text); + size_t char_count = 0; + size_t last_punct_cut = 0; + size_t i = 0; + + while (i < len) { + if (char_count >= max_chars) { + break; + } + + if (!utf8_is_continuation_byte((uint8_t)text[i])) { + char_count++; + if (is_speech_cut_punct(text, i, len)) { + /* Cut after this codepoint (best-effort) */ + size_t j = i + 1; + while (j < len && utf8_is_continuation_byte((uint8_t)text[j])) { + j++; + } + last_punct_cut = j; + } + } + i++; + } + + if (out_char_count) { + *out_char_count = char_count; + } + + if (i >= len) { + if (out_truncated) *out_truncated = false; + return len; + } + + /* Prefer cutting at punctuation, but avoid cutting too early */ + size_t cut = i; + if (last_punct_cut > 0) { + const size_t min_reasonable = (max_chars >= 20) ? 
(max_chars / 2) : 0; + if (last_punct_cut >= min_reasonable) { + cut = last_punct_cut; + } + } + + while (cut > 0 && utf8_is_continuation_byte((uint8_t)text[cut])) { + cut--; + } + + if (out_truncated) *out_truncated = true; + return cut; +} + +static char *voice_build_tts_text(const char *text) +{ + if (!text) { + return NULL; + } + + size_t char_count = 0; + bool truncated = false; + size_t cut_bytes = utf8_truncate_for_tts(text, MIMI_VOICE_TTS_MAX_CHARS, &char_count, &truncated); + + if (!truncated) { + return NULL; /* caller can use original text */ + } + + char *out = (char *)malloc(cut_bytes + 1); + if (!out) { + return NULL; + } + memcpy(out, text, cut_bytes); + out[cut_bytes] = '\0'; + + ESP_LOGW(TAG, "TTS text truncated: max=%u chars, cut_bytes=%u", (unsigned)MIMI_VOICE_TTS_MAX_CHARS, (unsigned)cut_bytes); + return out; +} + +static int16_t fir5_s16_at_clamped(const int16_t *src, size_t src_samples, size_t idx) +{ + if (!src || src_samples == 0) { + return 0; + } + + size_t i0 = (idx >= 2) ? (idx - 2) : 0; + size_t i1 = (idx >= 1) ? (idx - 1) : 0; + size_t i2 = idx; + if (i2 >= src_samples) i2 = src_samples - 1; + size_t i3 = (idx + 1 < src_samples) ? (idx + 1) : (src_samples - 1); + size_t i4 = (idx + 2 < src_samples) ? (idx + 2) : (src_samples - 1); + + int32_t acc = + (int32_t)src[i0] * 1 + + (int32_t)src[i1] * 4 + + (int32_t)src[i2] * 6 + + (int32_t)src[i3] * 4 + + (int32_t)src[i4] * 1; + + acc = acc / 16; + if (acc > INT16_MAX) acc = INT16_MAX; + if (acc < INT16_MIN) acc = INT16_MIN; + return (int16_t)acc; +} + +static esp_err_t i2s_tx_write_silence_ms(uint32_t ms) +{ + if (!s_i2s_ready || !s_tx_chan || ms == 0) { + return ESP_OK; + } + + uint64_t frames_total = ((uint64_t)MIMI_VOICE_SAMPLE_RATE * (uint64_t)ms) / 1000ULL; + while (frames_total > 0) { + const size_t frames_chunk = (frames_total > 256) ? 
256 : (size_t)frames_total; + int32_t zeros[256 * 2] = {0}; + + const uint8_t *p = (const uint8_t *)zeros; + size_t bytes_total = frames_chunk * 2 * sizeof(int32_t); + size_t bytes_sent = 0; + + while (bytes_sent < bytes_total) { + size_t written = 0; + esp_err_t err = i2s_channel_write(s_tx_chan, + p + bytes_sent, + bytes_total - bytes_sent, + &written, + pdMS_TO_TICKS(1000)); + if (err != ESP_OK) { + return err; + } + if (written == 0) { + return ESP_FAIL; + } + bytes_sent += written; + } + + frames_total -= frames_chunk; + } + + return ESP_OK; +} + +static esp_err_t i2s_tx_overwrite_dma_with_zeros(void) +{ + if (!s_i2s_ready || !s_tx_chan) { + return ESP_ERR_INVALID_STATE; + } + + uint32_t remaining = MIMI_VOICE_TX_DMA_TOTAL_BYTES; + while (remaining > 0) { + int32_t zeros[256 * 2] = {0}; + size_t chunk = sizeof(zeros); + if (chunk > remaining) { + chunk = remaining; + } + + const uint8_t *p = (const uint8_t *)zeros; + size_t sent = 0; + while (sent < chunk) { + size_t written = 0; + esp_err_t err = i2s_channel_write(s_tx_chan, + p + sent, + chunk - sent, + &written, + pdMS_TO_TICKS(1000)); + if (err != ESP_OK) { + return err; + } + if (written == 0) { + return ESP_FAIL; + } + sent += written; + } + + remaining -= (uint32_t)chunk; + } + + return ESP_OK; +} + +static void pcm_s32_stereo_to_s16_mono(const uint8_t *src, size_t src_len, int16_t *dst, size_t *out_samples) +{ + size_t frames = src_len / VOICE_I2S_BYTES_PER_STEREO_FRAME; + const int32_t *p = (const int32_t *)src; + + for (size_t i = 0; i < frames; i++) { + int32_t l = p[i * 2 + 0]; + int32_t r = p[i * 2 + 1]; + + /* XVF3800 / many I2S MEMS frontends deliver valid audio in high 16 bits of s32 slot. 
*/ + int16_t ls = (int16_t)(l >> 16); + int16_t rs = (int16_t)(r >> 16); + int32_t mono = ((int32_t)ls + (int32_t)rs) / 2; + + if (mono > INT16_MAX) mono = INT16_MAX; + if (mono < INT16_MIN) mono = INT16_MIN; + dst[i] = (int16_t)mono; + } + + if (out_samples) { + *out_samples = frames; + } +} + +static uint32_t pcm_energy_absavg(const int16_t *pcm, size_t samples) +{ + if (!pcm || samples == 0) return 0; + + uint64_t sum = 0; + for (size_t i = 0; i < samples; i++) { + int32_t v = pcm[i]; + if (v < 0) v = -v; + sum += (uint32_t)v; + } + return (uint32_t)(sum / samples); +} + +static size_t wav_build_from_pcm16(const int16_t *pcm, + size_t pcm_bytes, + uint32_t sample_rate, + uint16_t channels, + uint8_t **out_buf) +{ + if (!pcm || !out_buf || pcm_bytes == 0) { + return 0; + } + + const size_t wav_size = 44 + pcm_bytes; + uint8_t *buf = (uint8_t *)malloc(wav_size); + if (!buf) { + return 0; + } + + const uint32_t byte_rate = sample_rate * channels * 2; + const uint16_t block_align = channels * 2; + const uint32_t riff_size = (uint32_t)(wav_size - 8); + const uint32_t data_size = (uint32_t)pcm_bytes; + + memcpy(buf + 0, "RIFF", 4); + memcpy(buf + 4, &riff_size, 4); + memcpy(buf + 8, "WAVE", 4); + + memcpy(buf + 12, "fmt ", 4); + uint32_t fmt_size = 16; + uint16_t audio_format = 1; + uint16_t bits_per_sample = 16; + memcpy(buf + 16, &fmt_size, 4); + memcpy(buf + 20, &audio_format, 2); + memcpy(buf + 22, &channels, 2); + memcpy(buf + 24, &sample_rate, 4); + memcpy(buf + 28, &byte_rate, 4); + memcpy(buf + 32, &block_align, 2); + memcpy(buf + 34, &bits_per_sample, 2); + + memcpy(buf + 36, "data", 4); + memcpy(buf + 40, &data_size, 4); + memcpy(buf + 44, pcm, pcm_bytes); + + *out_buf = buf; + return wav_size; +} + +static esp_err_t wav_find_data_chunk(const uint8_t *wav, + size_t wav_len, + wav_fmt_t *fmt, + const uint8_t **data_out, + size_t *data_len_out) +{ + if (!wav || wav_len < 44 || !fmt || !data_out || !data_len_out) { + return ESP_ERR_INVALID_ARG; + } + + if 
(memcmp(wav, "RIFF", 4) != 0 || memcmp(wav + 8, "WAVE", 4) != 0) { + return ESP_ERR_INVALID_RESPONSE; + } + + memset(fmt, 0, sizeof(*fmt)); + *data_out = NULL; + *data_len_out = 0; + + size_t pos = 12; + bool got_fmt = false; + + while (pos + 8 <= wav_len) { + const uint8_t *chunk = wav + pos; + uint32_t chunk_size = 0; + memcpy(&chunk_size, chunk + 4, 4); + + size_t chunk_data_pos = pos + 8; + if (chunk_data_pos > wav_len) { + break; + } + + size_t available = wav_len - chunk_data_pos; + size_t declared = (size_t)chunk_size; + size_t actual = declared <= available ? declared : available; + + char id[5] = {0}; + memcpy(id, chunk, 4); + ESP_LOGI(TAG, "WAV chunk id=%s declared=%u actual=%u pos=%u", + id, (unsigned)chunk_size, (unsigned)actual, (unsigned)pos); + + if (memcmp(chunk, "fmt ", 4) == 0) { + if (actual < 16) { + ESP_LOGW(TAG, "WAV fmt chunk too short: %u", (unsigned)actual); + } else { + memcpy(&fmt->audio_format, wav + chunk_data_pos + 0, 2); + memcpy(&fmt->channels, wav + chunk_data_pos + 2, 2); + memcpy(&fmt->sample_rate, wav + chunk_data_pos + 4, 4); + memcpy(&fmt->bits_per_sample, wav + chunk_data_pos + 14, 2); + got_fmt = true; + } + } else if (memcmp(chunk, "data", 4) == 0) { + *data_out = wav + chunk_data_pos; + *data_len_out = actual; + break; + } + + size_t step = 8 + actual; + if (declared <= available) { + step += (declared & 1U); + } + + if (step == 0 || pos + step <= pos) { + break; + } + pos += step; + } + + if (!*data_out || *data_len_out == 0) { + return ESP_ERR_NOT_FOUND; + } + + if (!got_fmt) { + ESP_LOGW(TAG, "WAV fmt chunk not found, assume PCM16 mono/stereo fallback"); + fmt->audio_format = 1; + fmt->channels = 1; + fmt->sample_rate = MIMI_VOICE_SAMPLE_RATE; + fmt->bits_per_sample = 16; + } + + if (fmt->audio_format != 1) { + ESP_LOGE(TAG, "Unsupported WAV audio_format=%u", (unsigned)fmt->audio_format); + return ESP_ERR_NOT_SUPPORTED; + } + + if (fmt->bits_per_sample != 16) { + ESP_LOGE(TAG, "Unsupported WAV bits_per_sample=%u", 
(unsigned)fmt->bits_per_sample); + return ESP_ERR_NOT_SUPPORTED; + } + + return ESP_OK; +} + +/* ========================= + * STT / TTS JSON helpers + * ========================= */ + +static char *build_data_url_from_wav(const uint8_t *wav, size_t wav_len) +{ + if (!wav || wav_len == 0) { + return NULL; + } + + size_t b64_len = 0; + int rc = mbedtls_base64_encode(NULL, 0, &b64_len, wav, wav_len); + if (rc != MBEDTLS_ERR_BASE64_BUFFER_TOO_SMALL && rc != 0) { + return NULL; + } + + const char *prefix = "data:audio/wav;base64,"; + size_t prefix_len = strlen(prefix); + char *out = (char *)malloc(prefix_len + b64_len + 1); + if (!out) { + return NULL; + } + + memcpy(out, prefix, prefix_len); + + size_t actual = 0; + rc = mbedtls_base64_encode((unsigned char *)(out + prefix_len), + b64_len, + &actual, + wav, + wav_len); + if (rc != 0) { + free(out); + return NULL; + } + + out[prefix_len + actual] = '\0'; + return out; +} + +static esp_err_t parse_stt_response_text(const char *json, char *out_text, size_t out_size) +{ + if (!json || !out_text || out_size == 0) { + return ESP_ERR_INVALID_ARG; + } + + out_text[0] = '\0'; + + cJSON *root = cJSON_Parse(json); + if (!root) { + return ESP_ERR_INVALID_RESPONSE; + } + + cJSON *choices = cJSON_GetObjectItem(root, "choices"); + cJSON *choice0 = (choices && cJSON_IsArray(choices)) ? cJSON_GetArrayItem(choices, 0) : NULL; + cJSON *message = choice0 ? cJSON_GetObjectItem(choice0, "message") : NULL; + cJSON *content = message ? 
cJSON_GetObjectItem(message, "content") : NULL; + + if (cJSON_IsString(content) && content->valuestring) { + strlcpy(out_text, content->valuestring, out_size); + cJSON_Delete(root); + return ESP_OK; + } + + cJSON_Delete(root); + return ESP_ERR_NOT_FOUND; +} + +static esp_err_t parse_tts_audio_url(const char *json, char *out_url, size_t out_size) +{ + if (!json || !out_url || out_size == 0) { + return ESP_ERR_INVALID_ARG; + } + + out_url[0] = '\0'; + + cJSON *root = cJSON_Parse(json); + if (!root) { + return ESP_ERR_INVALID_RESPONSE; + } + + cJSON *output = cJSON_GetObjectItem(root, "output"); + cJSON *audio = output ? cJSON_GetObjectItem(output, "audio") : NULL; + cJSON *url = audio ? cJSON_GetObjectItem(audio, "url") : NULL; + + if (cJSON_IsString(url) && url->valuestring && url->valuestring[0]) { + strlcpy(out_url, url->valuestring, out_size); + cJSON_Delete(root); + return ESP_OK; + } + + cJSON_Delete(root); + return ESP_ERR_NOT_FOUND; +} + +/* ========================= + * Bus integration + * ========================= */ + +static void push_voice_inbound(const char *text) +{ + if (!text || !text[0]) { + return; + } + + mimi_msg_t msg = {0}; + strlcpy(msg.channel, MIMI_CHAN_VOICE, sizeof(msg.channel)); + strlcpy(msg.chat_id, MIMI_VOICE_CHAT_ID, sizeof(msg.chat_id)); + msg.content = strdup(text); + + if (!msg.content) { + ESP_LOGE(TAG, "No memory for voice inbound text"); + return; + } + + if (message_bus_push_inbound(&msg) != ESP_OK) { + ESP_LOGW(TAG, "Inbound queue full, drop voice transcript"); + free(msg.content); + } +} + +/* ========================= + * I2S init / playback + * ========================= */ + +static esp_err_t i2s_init_xvf3800(void) +{ + esp_err_t err; + + i2s_chan_config_t chan_cfg = { + .id = (i2s_port_t)MIMI_VOICE_I2S_PORT, + .role = I2S_ROLE_MASTER, + .dma_desc_num = MIMI_VOICE_I2S_DMA_DESC_NUM, + .dma_frame_num = MIMI_VOICE_I2S_DMA_FRAME_NUM, + .auto_clear_after_cb = false, + .auto_clear_before_cb = false, + .allow_pd = false, + 
.intr_priority = 0, + }; + + err = i2s_new_channel(&chan_cfg, &s_tx_chan, &s_rx_chan); + if (err != ESP_OK) { + ESP_LOGE(TAG, "i2s_new_channel failed: %s", esp_err_to_name(err)); + return err; + } + + i2s_std_config_t rx_cfg = { + .clk_cfg = I2S_STD_CLK_DEFAULT_CONFIG(MIMI_VOICE_SAMPLE_RATE), + .slot_cfg = I2S_STD_PHILIPS_SLOT_DEFAULT_CONFIG(I2S_DATA_BIT_WIDTH_32BIT, I2S_SLOT_MODE_STEREO), + .gpio_cfg = { + .mclk = I2S_GPIO_UNUSED, + .bclk = MIMI_VOICE_I2S_BCLK, + .ws = MIMI_VOICE_I2S_WS, + .dout = I2S_GPIO_UNUSED, + .din = MIMI_VOICE_I2S_DIN, + .invert_flags = { + .mclk_inv = false, + .bclk_inv = false, + .ws_inv = false, + }, + }, + }; + + i2s_std_config_t tx_cfg = { + .clk_cfg = I2S_STD_CLK_DEFAULT_CONFIG(MIMI_VOICE_SAMPLE_RATE), + .slot_cfg = I2S_STD_MSB_SLOT_DEFAULT_CONFIG(I2S_DATA_BIT_WIDTH_32BIT, I2S_SLOT_MODE_STEREO), + .gpio_cfg = { + .mclk = I2S_GPIO_UNUSED, + .bclk = MIMI_VOICE_I2S_BCLK, + .ws = MIMI_VOICE_I2S_WS, + .dout = MIMI_VOICE_I2S_DOUT, + .din = I2S_GPIO_UNUSED, + .invert_flags = { + .mclk_inv = false, + .bclk_inv = false, + .ws_inv = false, + }, + }, + }; + + err = i2s_channel_init_std_mode(s_rx_chan, &rx_cfg); + if (err != ESP_OK) { + ESP_LOGE(TAG, "i2s rx init failed: %s", esp_err_to_name(err)); + return err; + } + + err = i2s_channel_init_std_mode(s_tx_chan, &tx_cfg); + if (err != ESP_OK) { + ESP_LOGE(TAG, "i2s tx init failed: %s", esp_err_to_name(err)); + return err; + } + + /* Seed TX DMA with silence, otherwise some DAC/amps emit thumping/hissing artifacts due to + * undefined initial DMA content or repeating last buffer when idle. + * + * Only allowed before enabling the channel. 
+ */ + { + int32_t zeros[128 * 2] = {0}; + size_t loaded = 0; + (void)i2s_channel_preload_data(s_tx_chan, zeros, sizeof(zeros), &loaded); + } + + err = i2s_channel_enable(s_rx_chan); + if (err != ESP_OK) { + ESP_LOGE(TAG, "i2s rx enable failed: %s", esp_err_to_name(err)); + return err; + } + + err = i2s_channel_enable(s_tx_chan); + if (err != ESP_OK) { + ESP_LOGE(TAG, "i2s tx enable failed: %s", esp_err_to_name(err)); + return err; + } + + s_i2s_ready = true; + ESP_LOGI(TAG, "I2S ready: %dHz stereo s32 in / stereo s32 out (%s timing)", + MIMI_VOICE_SAMPLE_RATE, + i2s_slot_style_str()); + return ESP_OK; +} +static int16_t *resample_s16_mono_linear(const int16_t *src, + size_t src_samples, + uint32_t src_rate, + uint32_t dst_rate, + size_t *out_samples) +{ + if (!src || src_samples == 0 || !out_samples || src_rate == 0 || dst_rate == 0) { + return NULL; + } + + if (src_rate == dst_rate) { + int16_t *copy = (int16_t *)malloc_prefer_spiram(src_samples * sizeof(int16_t)); + if (!copy) { + return NULL; + } + memcpy(copy, src, src_samples * sizeof(int16_t)); + *out_samples = src_samples; + return copy; + } + + const bool is_downsampling = src_rate > dst_rate; + + size_t dst_samples = (size_t)(((uint64_t)src_samples * dst_rate) / src_rate); + if (dst_samples == 0) { + return NULL; + } + + int16_t *dst = (int16_t *)malloc_prefer_spiram(dst_samples * sizeof(int16_t)); + if (!dst) { + return NULL; + } + + /* When downsampling (e.g. 24k -> 16k), naive linear interpolation tends to fold + * high-frequency content above the new Nyquist into the audible band (aliasing), + * often perceived as hissing on sibilants/background. + * + * Apply a tiny 5-tap low-pass FIR [1,4,6,4,1]/16 on the source indices we touch. + * This is cheap and improves subjective quality significantly without pulling in DSP deps. 
+ */ + for (size_t i = 0; i < dst_samples; i++) { + float src_pos = ((float)i * (float)src_rate) / (float)dst_rate; + size_t idx = (size_t)src_pos; + float frac = src_pos - (float)idx; + + if (idx >= src_samples - 1) { + dst[i] = src[src_samples - 1]; + } else { + float a = (float)(is_downsampling ? fir5_s16_at_clamped(src, src_samples, idx) : src[idx]); + float b = (float)(is_downsampling ? fir5_s16_at_clamped(src, src_samples, idx + 1) : src[idx + 1]); + float v = a + (b - a) * frac; + + if (v > 32767.0f) v = 32767.0f; + if (v < -32768.0f) v = -32768.0f; + dst[i] = (int16_t)v; + } + } + + *out_samples = dst_samples; + return dst; +} +static esp_err_t i2s_play_wav_pcm16(const uint8_t *wav, size_t wav_len) +{ + if (!s_i2s_ready || !s_tx_chan || !wav || wav_len == 0) { + return ESP_ERR_INVALID_STATE; + } + + wav_fmt_t fmt; + const uint8_t *pcm = NULL; + size_t pcm_len = 0; + + esp_err_t err = wav_find_data_chunk(wav, wav_len, &fmt, &pcm, &pcm_len); + if (err != ESP_OK) { + ESP_LOGE(TAG, "wav_find_data_chunk failed: %s", esp_err_to_name(err)); + return err; + } + + ESP_LOGI(TAG, "WAV fmt: format=%u channels=%u sample_rate=%u bits=%u data_len=%u", + (unsigned)fmt.audio_format, + (unsigned)fmt.channels, + (unsigned)fmt.sample_rate, + (unsigned)fmt.bits_per_sample, + (unsigned)pcm_len); + + if (fmt.audio_format != 1 || fmt.bits_per_sample != 16) { + return ESP_ERR_NOT_SUPPORTED; + } + + const int16_t *src16 = (const int16_t *)pcm; + size_t src_samples_total = pcm_len / sizeof(int16_t); + + const int16_t *mono_src = NULL; + int16_t *mono_owned = NULL; + size_t mono_samples = 0; + + if (fmt.channels == 1) { + mono_src = src16; + mono_samples = src_samples_total; + } else if (fmt.channels == 2) { + mono_samples = src_samples_total / 2; + mono_owned = (int16_t *)malloc_prefer_spiram(mono_samples * sizeof(int16_t)); + if (!mono_owned) { + return ESP_ERR_NO_MEM; + } + + for (size_t i = 0, j = 0; j < mono_samples; i += 2, j++) { + int32_t v = ((int32_t)src16[i] + 
(int32_t)src16[i + 1]) / 2; + mono_owned[j] = (int16_t)v; + } + mono_src = mono_owned; + } else { + return ESP_ERR_NOT_SUPPORTED; + } + + const int16_t *play_src = NULL; + int16_t *play_owned = NULL; + size_t play_samples = 0; + + if (fmt.sample_rate == MIMI_VOICE_SAMPLE_RATE) { + play_src = mono_src; + play_samples = mono_samples; + } else { + play_owned = resample_s16_mono_linear( + mono_src, + mono_samples, + fmt.sample_rate, + MIMI_VOICE_SAMPLE_RATE, + &play_samples + ); + free(mono_owned); + mono_owned = NULL; + + if (!play_owned || play_samples == 0) { + return ESP_ERR_NO_MEM; + } + play_src = play_owned; + } + + ESP_LOGI(TAG, "Playback PCM: %u samples @ %u Hz (~%u ms)", + (unsigned)play_samples, + (unsigned)MIMI_VOICE_SAMPLE_RATE, + (unsigned)((play_samples * 1000ULL) / MIMI_VOICE_SAMPLE_RATE)); + + s_is_playing = true; + + size_t frames_total = play_samples; + size_t frames_sent = 0; + + while (frames_sent < frames_total) { + const size_t frames_chunk = (frames_total - frames_sent > 256) ? 
256 : (frames_total - frames_sent); + + int32_t tx_buf[256 * 2]; + for (size_t i = 0; i < frames_chunk; i++) { + int16_t s16 = play_src[frames_sent + i]; + int32_t s32 = ((int32_t)s16) << 16; + tx_buf[i * 2 + 0] = s32; + tx_buf[i * 2 + 1] = s32; + } + + const uint8_t *p = (const uint8_t *)tx_buf; + size_t bytes_total = frames_chunk * 2 * sizeof(int32_t); + size_t bytes_sent = 0; + + while (bytes_sent < bytes_total) { + size_t written = 0; + err = i2s_channel_write(s_tx_chan, + p + bytes_sent, + bytes_total - bytes_sent, + &written, + pdMS_TO_TICKS(1000)); + if (err != ESP_OK) { + ESP_LOGE(TAG, "i2s write failed: %s", esp_err_to_name(err)); + free(play_owned); + free(mono_owned); + s_is_playing = false; + return err; + } + if (written == 0) { + ESP_LOGE(TAG, "i2s write returned 0 bytes"); + free(play_owned); + free(mono_owned); + s_is_playing = false; + return ESP_FAIL; + } + bytes_sent += written; + } + + frames_sent += frames_chunk; + } + + /* Leave a short silence tail so the TX engine doesn't keep repeating the last + * non-zero DMA buffer (often perceived as continuous thumping when idle). 
+ */ + (void)i2s_tx_write_silence_ms(MIMI_VOICE_TX_SILENCE_TAIL_MS); + (void)i2s_tx_overwrite_dma_with_zeros(); + + free(play_owned); + free(mono_owned); + s_is_playing = false; + return ESP_OK; +} + +/* ========================= + * STT / TTS core + * ========================= */ + +static esp_err_t stt_transcribe_pcm(const int16_t *pcm, + size_t pcm_bytes, + char *out_text, + size_t out_text_size) +{ + if (!pcm || pcm_bytes == 0 || !out_text || out_text_size == 0) { + return ESP_ERR_INVALID_ARG; + } + if (!stt_api_url()[0] || !stt_api_key()[0]) { + return ESP_ERR_INVALID_STATE; + } + + out_text[0] = '\0'; + + uint8_t *wav = NULL; + size_t wav_len = wav_build_from_pcm16(pcm, pcm_bytes, MIMI_VOICE_SAMPLE_RATE, 1, &wav); + if (!wav || wav_len == 0) { + return ESP_ERR_NO_MEM; + } + + char *data_url = build_data_url_from_wav(wav, wav_len); + free(wav); + if (!data_url) { + return ESP_ERR_NO_MEM; + } + + cJSON *root = cJSON_CreateObject(); + cJSON_AddStringToObject(root, "model", stt_model()); + cJSON_AddBoolToObject(root, "stream", false); + + cJSON *messages = cJSON_CreateArray(); + cJSON *msg = cJSON_CreateObject(); + cJSON_AddStringToObject(msg, "role", "user"); + + cJSON *content = cJSON_CreateArray(); + cJSON *audio_item = cJSON_CreateObject(); + cJSON_AddStringToObject(audio_item, "type", "input_audio"); + + cJSON *input_audio = cJSON_CreateObject(); + cJSON_AddStringToObject(input_audio, "data", data_url); + cJSON_AddItemToObject(audio_item, "input_audio", input_audio); + cJSON_AddItemToArray(content, audio_item); + + cJSON_AddItemToObject(msg, "content", content); + cJSON_AddItemToArray(messages, msg); + cJSON_AddItemToObject(root, "messages", messages); + + cJSON *asr_options = cJSON_CreateObject(); + cJSON_AddBoolToObject(asr_options, "enable_itn", false); + cJSON_AddItemToObject(root, "asr_options", asr_options); + + char *body = cJSON_PrintUnformatted(root); + cJSON_Delete(root); + free(data_url); + + if (!body) { + return ESP_ERR_NO_MEM; + } + + 
http_resp_t resp = {0}; + int http_status = 0; + esp_err_t err = http_post_json(stt_api_url(), stt_api_key(), body, false, &resp, &http_status); + free(body); + + if (err != ESP_OK) { + ESP_LOGE(TAG, "STT HTTP failed: %s", esp_err_to_name(err)); + free(resp.buf); + return err; + } + if (http_status < 200 || http_status >= 300) { + ESP_LOGE(TAG, "STT HTTP status=%d body=%s", http_status, resp.buf ? resp.buf : ""); + free(resp.buf); + return ESP_FAIL; + } + + err = parse_stt_response_text(resp.buf ? resp.buf : "", out_text, out_text_size); + if (err != ESP_OK) { + ESP_LOGE(TAG, "STT parse failed, body=%s", resp.buf ? resp.buf : ""); + } else { + ESP_LOGI(TAG, "STT transcript: %s", out_text); + } + + free(resp.buf); + return err; +} + +static esp_err_t tts_stream_play(const char *text) +{ + if (!text || !text[0]) { + return ESP_ERR_INVALID_ARG; + } + if (!tts_api_url()[0] || !tts_api_key()[0]) { + return ESP_ERR_INVALID_STATE; + } + + cJSON *body = cJSON_CreateObject(); + cJSON_AddStringToObject(body, "model", tts_model()); + + cJSON *input = cJSON_CreateObject(); + cJSON_AddStringToObject(input, "text", text); + cJSON_AddStringToObject(input, "voice", tts_voice()); + cJSON_AddStringToObject(input, "language_type", tts_language()); + cJSON_AddItemToObject(body, "input", input); + + char *json = cJSON_PrintUnformatted(body); + cJSON_Delete(body); + + if (!json) { + return ESP_ERR_NO_MEM; + } + + http_resp_t resp = {0}; + int http_status = 0; + esp_err_t err = http_post_json(tts_api_url(), tts_api_key(), json, false, &resp, &http_status); + free(json); + + if (err != ESP_OK) { + ESP_LOGE(TAG, "TTS HTTP failed: %s", esp_err_to_name(err)); + free(resp.buf); + return err; + } + if (http_status < 200 || http_status >= 300) { + ESP_LOGE(TAG, "TTS HTTP status=%d body=%s", http_status, resp.buf ? resp.buf : ""); + free(resp.buf); + return ESP_FAIL; + } + + char wav_url[1024] = {0}; + err = parse_tts_audio_url(resp.buf ? 
resp.buf : "", wav_url, sizeof(wav_url)); + if (err != ESP_OK) { + ESP_LOGE(TAG, "TTS parse failed, body=%s", resp.buf ? resp.buf : ""); + free(resp.buf); + return err; + } + free(resp.buf); + + ESP_LOGI(TAG, "TTS audio url: %s", wav_url); + + http_resp_t wav_resp = {0}; + http_status = 0; + err = http_get_binary(wav_url, &wav_resp, &http_status); + if (err != ESP_OK) { + ESP_LOGE(TAG, "TTS wav download failed: %s", esp_err_to_name(err)); + free(wav_resp.buf); + return err; + } + if (http_status < 200 || http_status >= 300) { + ESP_LOGE(TAG, "TTS wav status=%d", http_status); + free(wav_resp.buf); + return ESP_FAIL; + } + ESP_LOGI(TAG, "TTS wav http_status=%d len=%d", http_status, (int)wav_resp.len); + + if (wav_resp.len >= 12) { + ESP_LOGI(TAG, "TTS wav magic: %.4s / %.4s", + wav_resp.buf, + wav_resp.buf + 8); + } + + if (wav_resp.len >= 4 && memcmp(wav_resp.buf, "RIFF", 4) != 0) { + ESP_LOGE(TAG, "TTS response is not WAV, preview: %.120s", wav_resp.buf); + } + + err = i2s_play_wav_pcm16((const uint8_t *)wav_resp.buf, wav_resp.len); + free(wav_resp.buf); + return err; +} + +/* ========================= + * Voice capture loop + * ========================= */ + +static void voice_capture_task(void *arg) +{ + (void)arg; + + const size_t frame_samples = (MIMI_VOICE_SAMPLE_RATE * MIMI_VOICE_FRAME_MS) / 1000; + const size_t stereo_frame_bytes = frame_samples * VOICE_I2S_BYTES_PER_STEREO_FRAME; + const size_t mono16_frame_bytes = frame_samples * sizeof(int16_t); + const size_t max_frames = MIMI_VOICE_MAX_UTTERANCE_MS / MIMI_VOICE_FRAME_MS; + const size_t silence_frames_end = MIMI_VOICE_SILENCE_END_MS / MIMI_VOICE_FRAME_MS; + + uint8_t *rx_buf = (uint8_t *)heap_caps_malloc(stereo_frame_bytes, MALLOC_CAP_SPIRAM); + int16_t *mono_frame = (int16_t *)heap_caps_malloc(mono16_frame_bytes, MALLOC_CAP_SPIRAM); + int16_t *utterance = (int16_t *)heap_caps_malloc(max_frames * frame_samples * sizeof(int16_t), MALLOC_CAP_SPIRAM); + + if (!rx_buf || !mono_frame || !utterance) { + 
ESP_LOGE(TAG, "voice_capture_task alloc failed"); + free(rx_buf); + free(mono_frame); + free(utterance); + vTaskDelete(NULL); + return; + } + + bool in_speech = false; + size_t total_frames = 0; + size_t silence_frames = 0; + size_t start_frames = 0; + TickType_t cooldown_until = 0; + + /* Simple adaptive noise floor */ + uint32_t noise_floor = MIMI_VOICE_VAD_THRESHOLD / 2; + if (noise_floor < 100) noise_floor = 100; + + while (1) { + if (!s_i2s_ready || !s_rx_chan) { + vTaskDelay(pdMS_TO_TICKS(100)); + continue; + } + + if (s_is_playing) { + /* Avoid self-trigger during playback */ + vTaskDelay(pdMS_TO_TICKS(MIMI_VOICE_FRAME_MS)); + continue; + } + + TickType_t now = xTaskGetTickCount(); + if (cooldown_until != 0 && now < cooldown_until) { + vTaskDelay(cooldown_until - now); + continue; + } + + size_t bytes_read = 0; + esp_err_t err = i2s_channel_read(s_rx_chan, + rx_buf, + stereo_frame_bytes, + &bytes_read, + pdMS_TO_TICKS(1000)); + if (err != ESP_OK || bytes_read == 0) { + continue; + } + + size_t mono_samples = 0; + pcm_s32_stereo_to_s16_mono(rx_buf, bytes_read, mono_frame, &mono_samples); + if (mono_samples == 0) { + continue; + } + + uint32_t energy = pcm_energy_absavg(mono_frame, mono_samples); + + /* Update noise floor only when not in speech */ + if (!in_speech) { + noise_floor = (noise_floor * 15 + energy) / 16; + } + + uint32_t dynamic_threshold = noise_floor + MIMI_VOICE_VAD_THRESHOLD; + bool speech_now = (energy > dynamic_threshold); + + if (!in_speech) { + if (!speech_now) { + start_frames = 0; + continue; + } + start_frames++; + if (start_frames < MIMI_VOICE_VAD_START_FRAMES) { + continue; + } + in_speech = true; + total_frames = 0; + silence_frames = 0; + start_frames = 0; + } + + if (total_frames < max_frames) { + memcpy(&utterance[total_frames * frame_samples], mono_frame, mono16_frame_bytes); + total_frames++; + } + + if (speech_now) { + silence_frames = 0; + } else { + silence_frames++; + } + + bool end_by_silence = (silence_frames >= 
silence_frames_end); + bool end_by_limit = (total_frames >= max_frames); + + if (!end_by_silence && !end_by_limit) { + continue; + } + + in_speech = false; + + /* Ignore ultra-short bursts */ + if (total_frames < MIMI_VOICE_VAD_MIN_FRAMES) { + total_frames = 0; + silence_frames = 0; + cooldown_until = xTaskGetTickCount() + pdMS_TO_TICKS(MIMI_VOICE_STT_COOLDOWN_MS); + continue; + } + + size_t pcm_bytes = total_frames * frame_samples * sizeof(int16_t); + char text[512] = {0}; + + if (xSemaphoreTake(s_http_lock, pdMS_TO_TICKS(30000)) == pdTRUE) { + esp_err_t stt_err = stt_transcribe_pcm(utterance, pcm_bytes, text, sizeof(text)); + xSemaphoreGive(s_http_lock); + + if (stt_err == ESP_OK && text[0]) { + ESP_LOGI(TAG, "Voice STT: %s", text); + push_voice_inbound(text); + } else { + ESP_LOGW(TAG, "STT failed or empty transcript"); + } + } + + total_frames = 0; + silence_frames = 0; + cooldown_until = xTaskGetTickCount() + pdMS_TO_TICKS(MIMI_VOICE_STT_COOLDOWN_MS); + } +} + +/* ========================= + * Public API + * ========================= */ + +esp_err_t voice_channel_init(void) +{ + s_enabled = (MIMI_VOICE_ENABLED_DEFAULT != 0) || + (stt_api_key()[0] && tts_api_key()[0]); + + if (!s_enabled) { + ESP_LOGI(TAG, "Voice channel disabled (set STT/TTS API key or enable default)"); + return ESP_OK; + } + + esp_err_t err = i2s_init_xvf3800(); + if (err != ESP_OK) { + s_enabled = false; + return ESP_OK; + } + + s_http_lock = xSemaphoreCreateMutex(); + if (!s_http_lock) { + ESP_LOGE(TAG, "Voice init failed: cannot allocate mutex"); + s_enabled = false; + return ESP_ERR_NO_MEM; + } + + return ESP_OK; +} + +esp_err_t voice_channel_start(void) +{ + if (!s_enabled || !s_i2s_ready) { + return ESP_OK; + } + + if (!s_capture_task) { + if (xTaskCreatePinnedToCore(voice_capture_task, + "voice_cap", + MIMI_VOICE_CAPTURE_STACK, + NULL, + MIMI_VOICE_TASK_PRIO, + &s_capture_task, + MIMI_VOICE_CORE) != pdPASS) { + return ESP_FAIL; + } + } + + ESP_LOGI(TAG, "Voice channel started"); + 
return ESP_OK; +} + +esp_err_t voice_channel_speak_text(const char *text) +{ + if (!s_enabled || !s_i2s_ready || !text || text[0] == '\0') { + return ESP_ERR_INVALID_STATE; + } + + if (xSemaphoreTake(s_http_lock, pdMS_TO_TICKS(30000)) != pdTRUE) { + return ESP_ERR_TIMEOUT; + } + + char *tts_text = voice_build_tts_text(text); + esp_err_t err = tts_stream_play(tts_text ? tts_text : text); + free(tts_text); + + xSemaphoreGive(s_http_lock); + return err; +} + +bool voice_channel_is_enabled(void) +{ + return s_enabled; +} + +void voice_channel_get_status(voice_channel_status_t *status) +{ + if (!status) { + return; + } + + status->enabled = s_enabled; + status->i2s_ready = s_i2s_ready; + status->is_playing = s_is_playing; + status->stt_configured = (stt_api_url()[0] != '\0' && stt_api_key()[0] != '\0'); + status->tts_configured = (tts_api_url()[0] != '\0' && tts_api_key()[0] != '\0'); +} diff --git a/main/voice/voice_channel.h b/main/voice/voice_channel.h new file mode 100644 index 00000000..ffcc2504 --- /dev/null +++ b/main/voice/voice_channel.h @@ -0,0 +1,32 @@ +#pragma once + +#include +#include "esp_err.h" + +typedef struct { + bool enabled; + bool i2s_ready; + bool is_playing; + bool stt_configured; + bool tts_configured; +} voice_channel_status_t; + +/* + * Voice channel for ReSpeaker XVF3800 over I2S. + * + * Inbound path: + * Mic PCM -> VAD utterance -> STT -> message_bus inbound (channel=voice) + * + * Outbound path: + * Agent text (channel=voice) -> TTS -> speaker playback (I2S) + */ +esp_err_t voice_channel_init(void); +esp_err_t voice_channel_start(void); + +/* + * Convert text to speech and enqueue for playback. 
+ */ +esp_err_t voice_channel_speak_text(const char *text); + +bool voice_channel_is_enabled(void); +void voice_channel_get_status(voice_channel_status_t *status); diff --git a/partitions.csv b/partitions.csv index 24c87784..017cf429 100644 --- a/partitions.csv +++ b/partitions.csv @@ -4,5 +4,5 @@ otadata, data, ota, 0xF000, 0x2000 phy_init, data, phy, 0x11000, 0x1000 ota_0, app, ota_0, 0x20000, 0x200000 ota_1, app, ota_1, 0x220000, 0x200000 -spiffs, data, spiffs, 0x420000, 0xBD0000 -coredump, data, coredump,0xFF0000, 0x10000 +spiffs, data, spiffs, 0x420000, 0x3D0000 +coredump, data, coredump,0x7F0000, 0x10000 diff --git a/sdkconfig.defaults.esp32s3 b/sdkconfig.defaults.esp32s3 index 4774cd93..eed91926 100644 --- a/sdkconfig.defaults.esp32s3 +++ b/sdkconfig.defaults.esp32s3 @@ -2,7 +2,7 @@ CONFIG_IDF_TARGET="esp32s3" # Flash 16MB + QIO -CONFIG_ESPTOOLPY_FLASHSIZE_16MB=y +CONFIG_ESPTOOLPY_FLASHSIZE_8MB=y CONFIG_ESPTOOLPY_FLASHMODE_QIO=y # CPU 240MHz