From be733ebe9bc514a32ab5830bfd5d52af5aec2ab6 Mon Sep 17 00:00:00 2001
From: Boy Steven Benaya Aritonang
Date: Sun, 22 Mar 2026 16:36:58 +0700
Subject: [PATCH] docs: add sign language and tooling guides

---
 docs/sign-language-template.md | 202 +++++++++++++++++++++++++
 docs/tooling.md                | 260 +++++++++++++++++++++++++++++++++
 2 files changed, 462 insertions(+)
 create mode 100644 docs/sign-language-template.md
 create mode 100644 docs/tooling.md

diff --git a/docs/sign-language-template.md b/docs/sign-language-template.md
new file mode 100644
index 0000000..2292f4b
--- /dev/null
+++ b/docs/sign-language-template.md
@@ -0,0 +1,202 @@
# Sign Language Stack For This Template

This repo is a good fit for a sign-language project, but the best stack depends on what you mean by "sign language."

## Start With The Problem Shape

There are three common versions of this project:

1. `Static hand signs`
   Example: alphabet letters or a small fixed set of hand poses.
2. `Dynamic signs`
   Example: signs that depend on motion over time, not a single frame.
3. `Full sign-language understanding`
   Example: larger vocabularies where hand shape, motion, body pose, and face cues matter together.

The further you move from static poses toward real sign language, the less a simple object detector can do on its own.
## Best Recommendation For This Repo

For this template, the strongest path is:

- `Frontend`: keep using the existing Next.js webcam or upload flow
- `Feature extraction`: use `MediaPipe` hand landmarks first
- `Model training`: use `PyTorch`
- `Inference runtime`: export to `ONNX` and run with `ONNX Runtime` in the backend
- `Backend API`: keep FastAPI as the contract boundary

That gives you a practical stack that is:

- fast enough for demos and hackathons
- easier to train than raw image-to-label models
- more stable than trying to force YOLO into a gesture problem
- compatible with this repo's existing "analyze image or frame and return typed results" shape

## What To Use By Project Type

### 1. Static Sign Demo

Use this when you want:

- alphabet recognition
- a small vocabulary
- one signer in front of a webcam
- a fast MVP

Recommended stack:

- `MediaPipe Hand Landmarker`
- a small classifier on top of hand landmarks
- `PyTorch` for training
- `ONNX Runtime` for backend inference

Why:

- landmarks reduce the amount of visual noise
- you do not need a heavy detector for a single webcam user
- training on landmarks is usually easier than training on raw images

### 2. Dynamic Sign Recognition

Use this when the sign depends on motion across multiple frames.

Recommended stack:

- `MediaPipe Holistic` or at least `hands + pose`
- a sequence model such as an `LSTM`, `GRU`, or a small `Transformer`
- `PyTorch` for training
- `ONNX Runtime` for serving

Why:

- many signs are not defined by one frame
- temporal context matters
- body and face cues can matter, not only the hand outline

### 3. Larger Or More Realistic Sign-Language Systems

Use this when you want more than a demo and need better linguistic coverage.
Recommended stack:

- `MediaPipe Holistic`
- a sequence model over landmarks and possibly cropped image features
- optional dataset tooling for alignment and labeling
- `ONNX Runtime` or another production runtime

Important note:

If the goal is actual sign language rather than "gesture control," a hands-only pipeline will likely cap out early.

## Where It Fits In This Repo

### Frontend

Use the existing webcam and upload experience as the input layer:

- `frontend/src/components/webcam-console.tsx`
- `frontend/src/components/inference-console.tsx`

That means you can keep the product flow the repo already teaches:

1. capture or upload an image or frame
2. send it to the backend
3. receive typed results
4. render overlays, labels, and metrics

### Backend

The backend is where the actual CV or ML logic should live:

- `backend/app/vision/service.py`
- `backend/app/vision/pipelines.py`
- `backend/app/api/routes/inference.py`

The cleanest extension is to add a new pipeline entry such as:

- `sign-static`
- `sign-sequence`

That keeps the repo's pipeline registry pattern intact.

### Contract

If you change the shape of the response, also update:

- `docs/openapi.yaml`
- `frontend/src/generated/openapi.ts`

If you can keep the response close to the existing typed contract, integration stays easier.

## Recommended Output Shape

For a sign-language MVP in this template, I would return:

- top predicted sign label
- confidence score
- optional hand boxes or landmark-derived regions
- metrics such as handedness, frame count, or latency

For dynamic signs, consider adding:

- sequence window size
- temporal confidence
- optional "still collecting frames" status

Try to avoid coupling the frontend to raw model internals. Keep the backend responsible for translating model output into product-friendly fields.
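The dynamic-sign fields above can be made concrete with a small response builder in the backend. This is only a sketch, not repo code: `SEQ_WINDOW`, `build_sign_response`, and the field names are illustrative assumptions, not part of this template's contract.

```python
from collections import deque

# Illustrative window size; tune per sign vocabulary (assumption, not a repo setting).
SEQ_WINDOW = 30

frames: deque = deque(maxlen=SEQ_WINDOW)

def build_sign_response(frame_features, label=None, confidence=None):
    """Translate raw model output into product-friendly fields.

    Returns a "still collecting frames" status until the window is full,
    so the frontend never has to reason about model internals.
    """
    frames.append(frame_features)
    if len(frames) < SEQ_WINDOW:
        return {
            "status": "collecting",
            "frames_collected": len(frames),
            "window_size": SEQ_WINDOW,
        }
    return {
        "status": "ready",
        "label": label,
        "confidence": confidence,
        "window_size": SEQ_WINDOW,
    }
```

The frontend then only branches on `status`, which keeps the translation from model output to product fields on the backend side of the contract.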
## When To Use YOLO

`YOLO` is useful when you need detection, such as:

- multiple people in frame
- signer localization in a wide camera view
- hand or person detection before a second-stage recognizer

It is usually not my first recommendation for a single-user webcam sign demo because:

- you still need recognition after detection
- landmarks are often a better representation for sign tasks
- it adds training and inference complexity early

## When To Use A Hosted Model

A hosted model can be useful for:

- quick experiments
- low-ops prototypes
- testing ideas before local deployment

But for sign-language interaction, local inference is often better because of:

- lower latency
- lower recurring cost
- better privacy
- fewer network dependencies during demos

## Suggested Build Order

1. `MVP`
   Add a `sign-static` backend pipeline using hand landmarks and a small classifier.
2. `Webcam loop`
   Reuse the current webcam page and submit captured frames to the same inference endpoint.
3. `Temporal model`
   Add a second pipeline for dynamic signs using short frame sequences.
4. `Contract refinement`
   Expand the API only when the frontend truly needs more than label, confidence, and review metadata.
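Step 1 of the build order — a `sign-static` pipeline over hand landmarks — usually begins by making the landmark vector translation- and scale-invariant before any classifier sees it. A minimal sketch, assuming 21 MediaPipe-style `(x, y)` points; the function name is illustrative, not existing repo code:

```python
import numpy as np

def normalize_hand_landmarks(landmarks):
    """Turn 21 MediaPipe-style (x, y) hand landmarks into a classifier feature vector.

    Translate so the wrist (landmark 0) sits at the origin, then scale by the
    largest wrist-to-landmark distance so hand size and camera distance cancel out.
    """
    pts = np.asarray(landmarks, dtype=np.float32)  # shape (21, 2)
    pts = pts - pts[0]                             # wrist at the origin
    scale = np.linalg.norm(pts, axis=1).max()
    if scale > 0:
        pts = pts / scale
    return pts.flatten()                           # shape (42,), ready for a small MLP
```

A small classifier trained on these vectors in PyTorch, then exported to ONNX for backend inference, is typically enough for an alphabet-scale vocabulary.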
## Simple Decision Guide

- If you want a fast hackathon demo: `MediaPipe Hand Landmarker + small classifier`
- If you want real-time local inference: `PyTorch -> ONNX -> ONNX Runtime`
- If you want broader sign understanding: `MediaPipe Holistic + sequence model`
- If you need person or hand detection in messy scenes: add `YOLO` as a helper, not the whole solution

## Official References

- MediaPipe Hand Landmarker:
- MediaPipe Gesture Recognizer:
- MediaPipe Gesture customization:
- MediaPipe Holistic Landmarker:
- ONNX Runtime docs:
- Ultralytics YOLO docs:

diff --git a/docs/tooling.md b/docs/tooling.md
new file mode 100644
index 0000000..b06b2ee
--- /dev/null
+++ b/docs/tooling.md
@@ -0,0 +1,260 @@
# Tooling For This Template

This template is built around one stable idea:

1. the frontend captures an image or frame
2. the backend runs inference
3. the API returns typed results
4. the frontend renders those results without knowing model internals

That means the best tools are the ones that fit this contract cleanly.

## Good Tool Categories For This Repo

### 1. CV And Inference Libraries

#### OpenCV

Best for:

- image preprocessing
- thresholding
- contour extraction
- box and polygon generation
- lightweight CPU-first pipelines

Fit in this repo:

- already used in `backend/app/vision/service.py`
- ideal for starter logic and quick preprocessing before a real model

#### MediaPipe

Best for:

- hand landmarks
- pose landmarks
- face landmarks
- gesture-style interaction
- sign-language prototypes

Fit in this repo:

- strong option when the project moves from generic detection into human motion or hand understanding
- especially useful for webcam-driven experiences

#### ONNX Runtime

Best for:

- serving trained models locally
- CPU or GPU inference without shipping a full training stack to production
- stable deployment after training elsewhere

Fit in this repo:

- one of the best upgrades from the current OpenCV sample pipelines
- works well behind the existing FastAPI service boundary

#### PyTorch

Best for:

- training custom models
- experimenting with sequence models
- research-friendly development

Fit in this repo:

- strongest choice for training
- often paired with export to ONNX for serving

#### Ultralytics YOLO

Best for:

- object detection
- hand or person localization
- segmentation variants when the task is detection-heavy

Fit in this repo:

- good when the product really is detection-first
- useful as a first stage before a second model
- not always the best first tool for sign-language recognition

#### TensorRT

Best for:

- high-performance NVIDIA deployment
- lower latency once the model path is already stable

Fit in this repo:

- better as a later optimization than an early template choice

### 2. API And Contract Tooling

#### OpenAPI

Best for:

- keeping frontend and backend aligned
- documenting request and response shapes
- making model swaps safer

Fit in this repo:

- central to the current architecture
- the source of truth is `docs/openapi.yaml`

#### openapi-typescript

Best for:

- generating frontend types from the backend contract

Fit in this repo:

- already used through `frontend/src/generated/openapi.ts`
- should be rerun whenever the contract changes

### 3. Frontend Product Layer

#### Next.js

Best for:

- app shell
- file upload flows
- webcam UI
- review and QA interfaces

Fit in this repo:

- already provides the user-facing product layer
- should stay decoupled from model-specific logic

#### React

Best for:

- interactive result rendering
- overlays
- metrics panels
- webcam state and upload state

### 4. Backend Serving Layer

#### FastAPI

Best for:

- inference endpoints
- validation
- typed response models
- keeping model code behind a stable HTTP boundary

Fit in this repo:

- already the core backend
- the right place for model loading and inference orchestration

## Recommended Tool Combinations

### A. Detection Product

Use:

- `OpenCV` for simple CPU starter logic
- `YOLO` when you need real object detection
- `ONNX Runtime` for serving exported models
- `FastAPI + OpenAPI` for the contract

### B. Sign-Language MVP

Use:

- `MediaPipe Hand Landmarker`
- `PyTorch` for training
- `ONNX Runtime` for inference
- `FastAPI + OpenAPI`
- existing `Next.js` webcam flow

### C. Dynamic Sign Recognition

Use:

- `MediaPipe Holistic`
- `PyTorch` sequence model
- `ONNX Runtime`
- `FastAPI + OpenAPI`

### D. Analytics Or Quality Pipelines

Use:

- `OpenCV`
- `NumPy`
- existing metrics-oriented response shape

## Tool Choices By Question

- `Do I need a heavy ML model yet?`
  Use `OpenCV` first if the task is simple and deterministic.
- `Do I need detection boxes?`
  Use `YOLO` if classical CV is no longer enough.
- `Do I need landmarks, pose, or hand structure?`
  Use `MediaPipe`.
- `Do I need custom training?`
  Use `PyTorch`.
- `Do I want local serving after training?`
  Use `ONNX Runtime`.
- `Do I want the frontend to stay stable while models change?`
  Keep using `OpenAPI` and generated types.

## What To Avoid Early

- pushing raw model logic into the frontend
- tightly coupling UI components to one specific model output
- changing the response contract without updating `docs/openapi.yaml`
- adding a hosted API dependency for real-time features before you understand the latency tradeoff
- using YOLO for every vision problem just because it is popular

## Where These Tools Plug Into The Repo

- `frontend/`
  UI, upload, webcam capture, results rendering
- `backend/app/api/routes/`
  HTTP entrypoints
- `backend/app/vision/`
  model and pipeline logic
- `docs/openapi.yaml`
  contract source of truth
- `frontend/src/generated/openapi.ts`
  generated frontend types

## Practical Recommendation

If you are extending this template today:

- keep `Next.js + FastAPI + OpenAPI` as-is
- use `OpenCV` for preprocessing and utility steps
- choose `MediaPipe` for landmarks and gesture-like tasks
- choose `YOLO` for detection-heavy tasks
- train in `PyTorch`
- deploy inference with `ONNX Runtime`

That combination keeps the repo teachable, modular, and close to production patterns without making a hackathon project too heavy too early.

## Official References

- OpenCV:
- MediaPipe:
- ONNX Runtime:
- PyTorch:
- Ultralytics YOLO:
- FastAPI:
- Next.js:
- OpenAPI: