AI-powered ASL-to-text wearable system for hackathon delivery.
Build a software pipeline that:
- Captures hand-sign video from a Seeed Studio XIAO ESP32S3 Sense camera module mounted on glasses.
- Extracts structured hand movement features in real time.
- Converts sign sequences into English text with low latency.
- Streams translated text to a live web dashboard.
Primary demo target: continuous ASL-to-text captioning that feels near real time in a live conversation setting.
The system uses a distributed setup so compute-heavy work stays off the microcontroller.
- XIAO ESP32S3 Sense (on glasses): camera capture + lightweight Wi-Fi frame serving.
- MacBook Air (M4): frame ingestion, MediaPipe processing, buffering, Gemini translation orchestration.
- FastAPI backend (on laptop): APIs, ingest adapter, WebSocket broadcast, service coordination.
- Next.js frontend: live transcript UI and system status display.
```mermaid
flowchart LR
A["XIAO ESP32S3 Sense Camera Node"] -->|JPEG frames / MJPEG stream over Wi-Fi| B["FastAPI Ingest Adapter (Laptop)"]
B --> C["MediaPipe Hands Tracker"]
C -->|21x3 landmarks/frame| D["Temporal Buffer (1-2s windows)"]
D --> E["Gemini 1.5 Flash Translation Service"]
E -->|Text hypotheses + final text| F["FastAPI WebSocket Hub"]
F --> G["Next.js Live Caption Dashboard"]
```
- XIAO ESP32S3 Sense is ideal for low-power capture/streaming, not full CV + LLM inference.
- Laptop offloading protects real-time performance for MediaPipe and Gemini calls.
- Sending landmark JSON (instead of raw video) to Gemini reduces bandwidth, latency, and cost.
- FastAPI + WebSocket gives low-latency push updates to the UI.
Responsibilities:
- Capture camera frames continuously.
- Encode as JPEG and expose a stream endpoint over Wi-Fi.
- Support reconnect behavior after transient network loss.
- Emit minimal device health data (uptime, reconnect count, frame cadence).
Hackathon stream profile (starting baseline):
- Resolution: 320x240 (primary), 640x480 optional for clarity testing.
- Target camera output: 10-15 FPS sustained on stable Wi-Fi.
- Priority: low latency and consistency over image quality.
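To sanity-check that this profile fits comfortably on local Wi-Fi, a back-of-envelope bandwidth estimate helps. The JPEG compression ratio below is an illustrative assumption (not measured on the actual device) and should be tuned against real captures:

```python
# Rough Wi-Fi bandwidth check for the stream profile above.
# bytes_per_pixel is an assumed JPEG compression factor, not a measurement.

def stream_bandwidth_kbps(width: int, height: int, fps: int,
                          bytes_per_pixel: float = 0.15) -> float:
    """Estimate sustained bandwidth in kilobits/s for a JPEG stream.

    bytes_per_pixel ~0.1-0.2 is typical for moderate-quality JPEG;
    calibrate against real frames from the device.
    """
    frame_bytes = width * height * bytes_per_pixel
    return frame_bytes * fps * 8 / 1000

low = stream_bandwidth_kbps(320, 240, 12)   # primary profile
high = stream_bandwidth_kbps(640, 480, 15)  # clarity-testing profile
```

Under these assumptions the primary profile needs on the order of ~1.1 Mbps and the 640x480 profile ~5.5 Mbps, both well within a stable local Wi-Fi link; jitter, not raw throughput, is the expected bottleneck.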
Device constraints to plan around:
- Limited RAM/CPU compared with SBCs, so no on-device MediaPipe or LLM inference.
- Wi-Fi jitter can be the dominant bottleneck.
- Thermal/power conditions affect sustained frame cadence.
Responsibilities:
- Pull or subscribe to ESP32 camera stream (MJPEG/JPEG feed).
- Decode frames and timestamp them on ingest.
- Run MediaPipe Hands on incoming frames.
- Normalize and serialize landmarks.
- Maintain short rolling temporal buffer.
- Route buffered sequences to translation service.
- Publish incremental/final text updates to frontend.
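The per-frame bookkeeping behind these responsibilities (laptop-side timestamping, frame indexing, optional frame skipping) can be sketched as below. The `Frame` and `ingest` names are illustrative, not from the codebase:

```python
# Sketch of ingest bookkeeping: timestamp frames on arrival on the
# laptop clock and route every eligible frame downstream.
import time
from dataclasses import dataclass

@dataclass
class Frame:
    index: int        # monotonically increasing ingest counter
    recv_ts: float    # laptop clock at ingest (authoritative timestamp)
    jpeg: bytes       # raw JPEG payload from the ESP32 stream

def ingest(jpeg_frames, stride: int = 1):
    """Yield timestamped Frames, processing every `stride`-th frame.

    stride=2 or 3 implements the frame-skipping fallback used when
    compute load spikes.
    """
    for i, payload in enumerate(jpeg_frames):
        if i % stride:
            continue  # skipped frame: cheap load shedding
        yield Frame(index=i, recv_ts=time.time(), jpeg=payload)
```

Timestamping on ingest (rather than trusting device clocks) keeps the temporal buffer consistent even when Wi-Fi jitter reorders delivery latency.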
Output per detected hand per frame:
- 21 landmarks.
- 3D coordinates (x, y, z).
- Confidence metadata for filtering.
Normalization strategy:
- Normalize to wrist or palm reference to reduce camera-position dependence.
- Apply scale-invariant transforms to improve signer robustness.
- Apply temporal smoothing to reduce jitter in noisy frames.
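A minimal sketch of this strategy, assuming the MediaPipe Hands landmark layout (index 0 = wrist, index 9 = middle-finger MCP) and an exponential moving average for smoothing:

```python
# Wrist-relative translation, scale normalization by hand span, and
# exponential smoothing across frames. Landmark indices follow
# MediaPipe Hands (0 = wrist, 9 = middle-finger MCP).
import math

def normalize(landmarks):
    """landmarks: list of 21 (x, y, z) tuples.
    Returns wrist-relative, scale-normalized coordinates."""
    wx, wy, wz = landmarks[0]
    rel = [(x - wx, y - wy, z - wz) for x, y, z in landmarks]
    # Wrist -> middle-finger-MCP distance is a common hand-size
    # reference; guard against a degenerate zero-length hand.
    scale = math.dist((0, 0, 0), rel[9]) or 1.0
    return [(x / scale, y / scale, z / scale) for x, y, z in rel]

def smooth(prev, curr, alpha: float = 0.5):
    """Exponential moving average between consecutive frames."""
    if prev is None:
        return curr
    return [tuple(alpha * c + (1 - alpha) * p for p, c in zip(pp, cc))
            for pp, cc in zip(prev, curr)]
```

The smoothing factor `alpha` trades latency against jitter suppression and is worth tuning against real signing footage.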
Input:
- Time-ordered landmark sequence over short windows (1-2 seconds).
- Optional context from previous partial translation.
Output:
- Partial hypotheses for live-caption feel.
- Stabilized final text for transcript.
Prompting strategy:
- Constrain the model to translation behavior only.
- Prefer concise, plain English.
- Require uncertainty handling ([unclear]) instead of hallucinated words.
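One way to encode these constraints is a fixed instruction block plus a compact landmark-window payload. The wording and payload schema here are illustrative starting points, not the tuned production prompt:

```python
# Prompt construction sketch: fixed translation-only rules plus a
# JSON landmark window. Schema and wording are assumptions to tune.
import json

SYSTEM_RULES = (
    "You translate ASL hand-landmark sequences into English text. "
    "Respond with the translation only, in concise plain English. "
    "If a segment is ambiguous, output [unclear] for that segment; "
    "never invent words."
)

def build_prompt(window, prev_partial: str = "") -> str:
    payload = {
        "frames": window,              # list of per-frame landmark arrays
        "prev_partial": prev_partial,  # optional rolling context
    }
    return SYSTEM_RULES + "\n\nINPUT:\n" + json.dumps(payload)
```

Carrying `prev_partial` forward gives the model the short-range context needed to stabilize partial hypotheses into final text.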
Features:
- Live caption text area (streaming updates).
- Transcript history panel.
- System health indicators (camera connected, ingest FPS, inference latency).
- Manual controls (start/stop session, clear transcript).
- XIAO ESP32S3 Sense publishes camera frames over local Wi-Fi.
- FastAPI ingest adapter receives/decodes frames.
- MediaPipe Hands processes each eligible frame.
- Landmark records are appended to a rolling buffer.
- When a window threshold is met, buffered sequence is sent to Gemini.
- Gemini returns translated text (partial/final).
- FastAPI broadcasts updates via WebSocket to Next.js dashboard.
- UI displays live captions and archives transcript entries.
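The rolling-buffer and window-threshold steps in this flow can be sketched with a timestamp-keyed deque. The class name and flush policy (non-overlapping windows) are assumptions; an overlapping-window variant would keep a tail instead of clearing:

```python
# Rolling temporal buffer: accumulate (timestamp, landmarks) records
# and flush a window once it spans window_s seconds (1-2 s per design).
from collections import deque

class TemporalBuffer:
    def __init__(self, window_s: float = 1.5):
        self.window_s = window_s
        self.records = deque()

    def push(self, ts: float, landmarks):
        """Append a record; return a full window when the threshold
        is met, else None."""
        self.records.append((ts, landmarks))
        if self.records[-1][0] - self.records[0][0] >= self.window_s:
            window = list(self.records)
            self.records.clear()   # non-overlapping windows; keep a
            return window          # tail here for overlap instead
        return None                # window not yet full
```

Each flushed window is what gets serialized and routed to the Gemini translation service.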
- Camera ingest FPS: >= 12 FPS sustained at 320x240.
- Landmark extraction latency: <= 35 ms/frame average on laptop.
- Translation update cadence: every 1-2 seconds.
- End-to-end caption latency (gesture to text): <= 3.0 seconds (stretch goal <= 2.0 seconds).
- Demo stability: > 95% uptime during a 10-minute live test.
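A quick budget check shows the 3.0 s end-to-end target is plausible. Every component number below is a planning assumption, not a measurement; replace them with real profiling data as phases complete:

```python
# Rough end-to-end latency budget vs. the <= 3.0 s target.
# All values are planning assumptions, not measurements.
budget_ms = {
    "capture_and_wifi": 150,   # frame leaves device, reaches laptop
    "mediapipe": 35,           # per-frame landmark extraction target
    "window_wait": 1500,       # worst case waiting on a 1.5 s window
    "gemini_call": 900,        # assumed cloud round trip
    "websocket_render": 100,   # broadcast + UI paint
}
total_s = sum(budget_ms.values()) / 1000
```

Under these assumptions the worst-case path sums to ~2.7 s, which leaves little headroom: hitting the 2.0 s stretch goal would require shrinking the window wait or overlapping windows.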
Goal: stable XIAO ESP32S3 Sense camera feed to laptop.
Deliverables:
- Firmware configured for reliable Wi-Fi camera streaming.
- Verified stream endpoint(s) consumable by FastAPI ingest adapter.
- Diagnostics for connection state, frame cadence, and reconnect behavior.
Exit criteria:
- Continuous 10-minute camera stream without unrecoverable crash.
- Automatic reconnect after temporary Wi-Fi interruption.
Goal: robust landmark extraction from ESP32 camera stream.
Deliverables:
- MediaPipe integration on laptop ingest stream.
- Serialized landmark output per frame.
- Visual/log validation for landmark consistency.
Exit criteria:
- Landmarks produced on most frames during signing.
- Confidence filtering removes obvious false detections.
Goal: convert landmark windows into readable English text.
Deliverables:
- Temporal buffer manager (1-2 second windows).
- Gemini request/response orchestration.
- Partial and final text update states.
Exit criteria:
- Recognizable translations for a curated demo sign set.
- Graceful handling of uncertain segments.
Goal: real-time user-facing caption UI.
Deliverables:
- Next.js dashboard with live transcript.
- WebSocket integration from FastAPI.
- Session controls and health panel.
Exit criteria:
- Full pipeline demo from hand signing to on-screen text.
- Dashboard remains responsive under live stream load.
These interface definitions guide implementation and team alignment.
- Purpose: provide JPEG frame stream over LAN.
- Required metadata at ingest: receive timestamp, frame index, source id.
- Behavior: best-effort stream with dropped-frame tolerance and backpressure handling.
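ESP32 camera streams of this kind are typically `multipart/x-mixed-replace` MJPEG. A tolerant parser can recover complete JPEGs from the raw byte stream by the JPEG start/end markers, skipping corrupt fragments; this is a sketch only, and a production adapter should also honor the multipart boundary headers:

```python
# Tolerant MJPEG splitter: extract complete JPEGs from a raw byte
# buffer, keeping any trailing partial frame for the next read.
JPEG_SOI = b"\xff\xd8"  # start-of-image marker
JPEG_EOI = b"\xff\xd9"  # end-of-image marker

def extract_jpegs(buffer: bytes):
    """Return (complete_jpegs, remaining_bytes)."""
    frames = []
    while True:
        start = buffer.find(JPEG_SOI)
        if start < 0:
            return frames, b""                 # no frame start: drop junk
        end = buffer.find(JPEG_EOI, start)
        if end < 0:
            return frames, buffer[start:]      # partial frame: keep it
        frames.append(buffer[start:end + 2])
        buffer = buffer[end + 2:]
```

Dropping bytes before the first start-of-image marker is what gives the dropped-frame tolerance: a corrupted frame costs one frame, not the session.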
- Purpose: submit buffered landmarks for inference.
- Input: ordered landmark sequence + optional context id.
- Output: partial/final text + confidence/status.
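These request/response shapes can be pinned down with stdlib dataclasses; the field names below are assumptions to refine during implementation (a FastAPI build would likely use Pydantic models with the same shape):

```python
# Hedged sketch of the translation endpoint's wire shapes.
import json
from dataclasses import dataclass, asdict

@dataclass
class TranslateRequest:
    sequence: list          # ordered landmark frames (the window)
    context_id: str = ""    # optional session/context id

@dataclass
class TranslateResponse:
    text: str
    final: bool             # False => partial hypothesis
    confidence: float = 0.0
    status: str = "ok"      # "ok" | "uncertain" | "error"

def to_wire(obj) -> str:
    """Serialize either dataclass to a JSON string."""
    return json.dumps(asdict(obj))
```

Fixing the schema early lets the AI and backend engineers work against the same contract before the Gemini orchestration exists.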
Event types:
- `caption.partial`
- `caption.final`
- `system.metrics`
- `system.alert`
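A uniform envelope keeps the WebSocket hub and the dashboard decoupled from individual event payloads. The envelope fields below (`type`, `ts`, `payload`) are an assumption, not a finalized contract:

```python
# Event envelope sketch: one shape for all four event types, with
# type validation before broadcast. Field names are assumptions.
import json
import time

EVENT_TYPES = {"caption.partial", "caption.final",
               "system.metrics", "system.alert"}

def make_event(event_type: str, payload: dict) -> str:
    """Serialize a typed, timestamped event for WebSocket broadcast."""
    if event_type not in EVENT_TYPES:
        raise ValueError(f"unknown event type: {event_type}")
    return json.dumps({"type": event_type,
                       "ts": time.time(),
                       "payload": payload})
```

The dashboard can then switch purely on `type`, treating unknown types as forward-compatible no-ops.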
- If network quality drops, lower resolution and/or requested frame rate.
- If stream decode fails intermittently, skip bad frames and keep session alive.
- If translation service delays, continue buffering and show a processing state.
- If hands are not detected, show an explicit "No hands detected" status.
- If Gemini is unavailable, preserve landmark logs for replay-based translation.
- Frame skipping: process every 2nd or 3rd frame when compute load spikes.
- Adaptive ingest: dynamically tune frame size/quality to maintain steady latency.
- Landmark-only cloud payloads: minimize request size and token overhead.
- Optional single-hand mode during unstable network periods.
- Queue limits and timeouts to prevent cascading lag.
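The queue-limit strategy above can be a bounded deque that sheds the oldest item under load, so backlog never cascades into lag. Stdlib-only sketch; the drop counter feeds the `system.metrics` events:

```python
# Bounded drop-oldest queue: a full queue evicts the stalest frame
# rather than blocking the producer or growing without bound.
from collections import deque

class DropOldestQueue:
    def __init__(self, maxlen: int):
        self.q = deque(maxlen=maxlen)  # deque drops from the left when full
        self.dropped = 0               # shed-load counter for metrics

    def put(self, item):
        if len(self.q) == self.q.maxlen:
            self.dropped += 1          # oldest item is about to be evicted
        self.q.append(item)

    def get(self):
        return self.q.popleft() if self.q else None
```

Dropping the oldest (not the newest) frame is the right policy for live captioning: stale frames describe gestures the translation window has already moved past.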
- ESP32 camera stream starts/stops cleanly.
- Camera reconnect behavior works after Wi-Fi interruption.
- Landmarks generate from known hand poses.
- Translation responses match expected schema.
- WebSocket events render correctly in dashboard.
- End-to-end signing session from XIAO ESP32S3 Sense to UI.
- Recovery after temporary camera or network interruption.
- Stability over repeated start/stop cycles.
- Prepare a fixed list of ASL signs/phrases for repeatable checks.
- Measure caption latency and readability per phrase.
- Record known failure cases for transparent demo communication.
- Keep camera-to-laptop traffic on trusted local network.
- Do not persist raw video unless explicitly needed for debugging.
- Prefer landmark logging over image logging.
- Redact sensitive identifiers from logs before sharing.
Suggested division of responsibilities:
- Embedded engineer: XIAO ESP32S3 Sense firmware + stream reliability.
- CV engineer: MediaPipe tracking + feature normalization.
- Backend engineer: FastAPI orchestration + WebSocket transport.
- AI engineer: Gemini prompting + translation quality tuning.
- Frontend engineer: Next.js live dashboard + UX polish.
Before demo:
- Validate Wi-Fi path between XIAO ESP32S3 Sense and laptop.
- Confirm API keys and environment configuration.
- Run a 10-minute soak test with live camera feed.
- Warm up all services before judges arrive.
- Keep a fallback replay path for resilience.
During demo:
- Show live hand signing and real-time captions.
- Briefly show ingest FPS + latency metrics to prove real pipeline.
- Explain camera-frame to landmark-to-text efficiency strategy.
After demo:
- Save logs and notes on translation misses.
- Prioritize top 3 failure modes for next iteration.
- Add spoken-language translation pipeline (speech -> translated text/audio).
- Add multilingual output mode.
- Add edge-device model options for reduced cloud dependency.
- Personalize signer profiles for improved translation quality.
- Integrate audio output for accessibility scenarios.
Project is considered successful when:
- Live ASL signing is captured through the XIAO ESP32S3 Sense camera node.
- Pipeline produces readable text with acceptable lag in real time.
- Text appears continuously in Next.js dashboard via WebSocket.
- Team can run the full demo reliably on command.
The detailed website/UX plan for the live translation reader is documented in:
UI_SPEC.md
Use that file to drive frontend implementation before writing production UI code.
The backend phased execution plan and phase-completion tests are documented in:
BACKEND_README.md
Use that file to implement backend work phase by phase with objective completion checks.
This README defines the software implementation strategy and execution plan for the ESP32S3 Sense-based build. Code implementation will follow this document phase by phase.