From be733ebe9bc514a32ab5830bfd5d52af5aec2ab6 Mon Sep 17 00:00:00 2001
From: Boy Steven Benaya Aritonang
Date: Sun, 22 Mar 2026 16:36:58 +0700
Subject: [PATCH] docs: add sign language and tooling guides

---
 docs/sign-language-template.md | 202 +++++++++++++++++++++++++
 docs/tooling.md                | 260 +++++++++++++++++++++++++++++++++
 2 files changed, 462 insertions(+)
 create mode 100644 docs/sign-language-template.md
 create mode 100644 docs/tooling.md

diff --git a/docs/sign-language-template.md b/docs/sign-language-template.md
new file mode 100644
index 0000000..2292f4b
--- /dev/null
+++ b/docs/sign-language-template.md
@@ -0,0 +1,202 @@
# Sign Language Stack For This Template

This repo is a good fit for a sign-language project, but the best stack depends on what you mean by "sign language."

## Start With The Problem Shape

There are three common versions of this project:

1. `Static hand signs`
   Example: alphabet letters or a small fixed set of hand poses.
2. `Dynamic signs`
   Example: signs that depend on motion over time, not a single frame.
3. `Full sign-language understanding`
   Example: larger vocabularies where hand shape, motion, body pose, and face cues matter together.

The further you move from static poses toward real sign language, the less a simple object detector can do on its own.
## Best Recommendation For This Repo

For this template, the strongest path is:

- `Frontend`: keep using the existing Next.js webcam or upload flow
- `Feature extraction`: use `MediaPipe` hand landmarks first
- `Model training`: use `PyTorch`
- `Inference runtime`: export to `ONNX` and run with `ONNX Runtime` in the backend
- `Backend API`: keep FastAPI as the contract boundary

That gives you a practical stack that is:

- fast enough for demos and hackathons
- easier to train than raw image-to-label models
- more stable than trying to force YOLO into a gesture problem
- compatible with this repo's existing "analyze image or frame and return typed results" shape

## What To Use By Project Type

### 1. Static Sign Demo

Use this when you want:

- alphabet recognition
- a small vocabulary
- one signer in front of a webcam
- a fast MVP

Recommended stack:

- `MediaPipe Hand Landmarker`
- a small classifier on top of hand landmarks
- `PyTorch` for training
- `ONNX Runtime` for backend inference

Why:

- landmarks reduce the amount of visual noise
- you do not need a heavy detector for a single webcam user
- training on landmarks is usually easier than training on raw images

### 2. Dynamic Sign Recognition

Use this when the sign depends on motion across multiple frames.

Recommended stack:

- `MediaPipe Holistic` or at least `hands + pose`
- a sequence model such as an `LSTM`, `GRU`, or a small `Transformer`
- `PyTorch` for training
- `ONNX Runtime` for serving

Why:

- many signs are not defined by one frame
- temporal context matters
- body and face cues can matter, not only the hand outline

### 3. Larger Or More Realistic Sign-Language Systems

Use this when you want more than a demo and need better linguistic coverage.
Recommended stack:

- `MediaPipe Holistic`
- a sequence model over landmarks and possibly cropped image features
- optional dataset tooling for alignment and labeling
- `ONNX Runtime` or another production runtime

Important note:

If the goal is actual sign language rather than "gesture control," a hands-only pipeline will likely cap out early.

## Where It Fits In This Repo

### Frontend

Use the existing webcam and upload experience as the input layer:

- `frontend/src/components/webcam-console.tsx`
- `frontend/src/components/inference-console.tsx`

That means you can keep the product flow the repo already teaches:

1. capture or upload an image or frame
2. send it to the backend
3. receive typed results
4. render overlays, labels, and metrics

### Backend

The backend is where the actual CV or ML logic should live:

- `backend/app/vision/service.py`
- `backend/app/vision/pipelines.py`
- `backend/app/api/routes/inference.py`

The cleanest extension is to add a new pipeline entry such as:

- `sign-static`
- `sign-sequence`

That keeps the repo's pipeline registry pattern intact.

### Contract

If you change the shape of the response, also update:

- `docs/openapi.yaml`
- `frontend/src/generated/openapi.ts`

If you can keep the response close to the existing typed contract, integration stays easier.

## Recommended Output Shape

For a sign-language MVP in this template, I would return:

- top predicted sign label
- confidence score
- optional hand boxes or landmark-derived regions
- metrics such as handedness, frame count, or latency

For dynamic signs, consider adding:

- sequence window size
- temporal confidence
- optional "still collecting frames" status

Try to avoid coupling the frontend to raw model internals. Keep the backend responsible for translating model output into product-friendly fields.
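The dynamic-sign fields above can be made concrete with a small response builder in the backend. This is only a sketch, not repo code: `SEQ_WINDOW`, `build_sign_response`, and the field names are illustrative assumptions, not part of this template's contract.

```python
from collections import deque

# Illustrative window size; tune per sign vocabulary (assumption, not a repo setting).
SEQ_WINDOW = 30

frames: deque = deque(maxlen=SEQ_WINDOW)

def build_sign_response(frame_features, label=None, confidence=None):
    """Translate raw model output into product-friendly fields.

    Returns a "still collecting frames" status until the window is full,
    so the frontend never has to reason about model internals.
    """
    frames.append(frame_features)
    if len(frames) < SEQ_WINDOW:
        return {
            "status": "collecting",
            "frames_collected": len(frames),
            "window_size": SEQ_WINDOW,
        }
    return {
        "status": "ready",
        "label": label,
        "confidence": confidence,
        "window_size": SEQ_WINDOW,
    }
```

The frontend then only branches on `status`, which keeps the translation from model output to product fields on the backend side of the contract.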
## When To Use YOLO

`YOLO` is useful when you need detection, such as:

- multiple people in frame
- signer localization in a wide camera view
- hand or person detection before a second-stage recognizer

It is usually not my first recommendation for a single-user webcam sign demo because:

- you still need recognition after detection
- landmarks are often a better representation for sign tasks
- it adds training and inference complexity early

## When To Use A Hosted Model

A hosted model can be useful for:

- quick experiments
- low-ops prototypes
- testing ideas before local deployment

But for sign-language interaction, local inference is often better because of:

- lower latency
- lower recurring cost
- better privacy
- fewer network dependencies during demos

## Suggested Build Order

1. `MVP`
   Add a `sign-static` backend pipeline using hand landmarks and a small classifier.
2. `Webcam loop`
   Reuse the current webcam page and submit captured frames to the same inference endpoint.
3. `Temporal model`
   Add a second pipeline for dynamic signs using short frame sequences.
4. `Contract refinement`
   Expand the API only when the frontend truly needs more than label, confidence, and review metadata.
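Step 1 of the build order — a `sign-static` pipeline over hand landmarks — usually begins by making the landmark vector translation- and scale-invariant before any classifier sees it. A minimal sketch, assuming 21 MediaPipe-style `(x, y)` points; the function name is illustrative, not existing repo code:

```python
import numpy as np

def normalize_hand_landmarks(landmarks):
    """Turn 21 MediaPipe-style (x, y) hand landmarks into a classifier feature vector.

    Translate so the wrist (landmark 0) sits at the origin, then scale by the
    largest wrist-to-landmark distance so hand size and camera distance cancel out.
    """
    pts = np.asarray(landmarks, dtype=np.float32)  # shape (21, 2)
    pts = pts - pts[0]                             # wrist at the origin
    scale = np.linalg.norm(pts, axis=1).max()
    if scale > 0:
        pts = pts / scale
    return pts.flatten()                           # shape (42,), ready for a small MLP
```

A small classifier trained on these vectors in PyTorch, then exported to ONNX for backend inference, is typically enough for an alphabet-scale vocabulary.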
## Simple Decision Guide

- If you want a fast hackathon demo: `MediaPipe Hand Landmarker + small classifier`
- If you want real-time local inference: `PyTorch -> ONNX -> ONNX Runtime`
- If you want broader sign understanding: `MediaPipe Holistic + sequence model`
- If you need person or hand detection in messy scenes: add `YOLO` as a helper, not the whole solution

## Official References

- MediaPipe Hand Landmarker:
- MediaPipe Gesture Recognizer:
- MediaPipe Gesture customization:
- MediaPipe Holistic Landmarker:
- ONNX Runtime docs:
- Ultralytics YOLO docs:

diff --git a/docs/tooling.md b/docs/tooling.md
new file mode 100644
index 0000000..b06b2ee
--- /dev/null
+++ b/docs/tooling.md
@@ -0,0 +1,260 @@
# Tooling For This Template

This template is built around one stable idea:

1. the frontend captures an image or frame
2. the backend runs inference
3. the API returns typed results
4. the frontend renders those results without knowing model internals

That means the best tools are the ones that fit this contract cleanly.

## Good Tool Categories For This Repo

### 1. CV And Inference Libraries

#### OpenCV

Best for:

- image preprocessing
- thresholding
- contour extraction
- box and polygon generation
- lightweight CPU-first pipelines

Fit in this repo:

- already used in `backend/app/vision/service.py`
- ideal for starter logic and quick preprocessing before a real model

#### MediaPipe

Best for:

- hand landmarks
- pose landmarks
- face landmarks
- gesture-style interaction
- sign-language prototypes

Fit in this repo:

- strong option when the project moves from generic detection into human motion or hand understanding
- especially useful for webcam-driven experiences

#### ONNX Runtime

Best for:

- serving trained models locally
- CPU or GPU inference without shipping a full training stack to production
- stable deployment after training elsewhere

Fit in this repo:

- one of the best upgrades from the current OpenCV sample pipelines
- works well behind the existing FastAPI service boundary

#### PyTorch

Best for:

- training custom models
- experimenting with sequence models
- research-friendly development

Fit in this repo:

- strongest choice for training
- often paired with export to ONNX for serving

#### Ultralytics YOLO

Best for:

- object detection
- hand or person localization
- segmentation variants when the task is detection-heavy

Fit in this repo:

- good when the product really is detection-first
- useful as a first stage before a second model
- not always the best first tool for sign-language recognition

#### TensorRT

Best for:

- high-performance NVIDIA deployment
- lower latency once the model path is already stable

Fit in this repo:

- better as a later optimization than an early template choice

### 2. API And Contract Tooling

#### OpenAPI

Best for:

- keeping frontend and backend aligned
- documenting request and response shapes
- making model swaps safer

Fit in this repo:

- central to the current architecture
- the source of truth is `docs/openapi.yaml`

#### openapi-typescript

Best for:

- generating frontend types from the backend contract

Fit in this repo:

- already used through `frontend/src/generated/openapi.ts`
- should be rerun whenever the contract changes

### 3. Frontend Product Layer

#### Next.js

Best for:

- app shell
- file upload flows
- webcam UI
- review and QA interfaces

Fit in this repo:

- already provides the user-facing product layer
- should stay decoupled from model-specific logic

#### React

Best for:

- interactive result rendering
- overlays
- metrics panels
- webcam state and upload state

### 4. Backend Serving Layer

#### FastAPI

Best for:

- inference endpoints
- validation
- typed response models
- keeping model code behind a stable HTTP boundary

Fit in this repo:

- already the core backend
- the right place for model loading and inference orchestration

## Recommended Tool Combinations

### A. Detection Product

Use:

- `OpenCV` for simple CPU starter logic
- `YOLO` when you need real object detection
- `ONNX Runtime` for serving exported models
- `FastAPI + OpenAPI` for the contract

### B. Sign-Language MVP

Use:

- `MediaPipe Hand Landmarker`
- `PyTorch` for training
- `ONNX Runtime` for inference
- `FastAPI + OpenAPI`
- existing `Next.js` webcam flow

### C. Dynamic Sign Recognition

Use:

- `MediaPipe Holistic`
- `PyTorch` sequence model
- `ONNX Runtime`
- `FastAPI + OpenAPI`

### D. Analytics Or Quality Pipelines

Use:

- `OpenCV`
- `NumPy`
- existing metrics-oriented response shape

## Tool Choices By Question

- `Do I need a heavy ML model yet?`
  Use `OpenCV` first if the task is simple and deterministic.
- `Do I need detection boxes?`
  Use `YOLO` if classical CV is no longer enough.
- `Do I need landmarks, pose, or hand structure?`
  Use `MediaPipe`.
- `Do I need custom training?`
  Use `PyTorch`.
- `Do I want local serving after training?`
  Use `ONNX Runtime`.
- `Do I want the frontend to stay stable while models change?`
  Keep using `OpenAPI` and generated types.

## What To Avoid Early

- pushing raw model logic into the frontend
- tightly coupling UI components to one specific model output
- changing the response contract without updating `docs/openapi.yaml`
- adding a hosted API dependency for real-time features before you understand the latency tradeoff
- using YOLO for every vision problem just because it is popular

## Where These Tools Plug Into The Repo

- `frontend/`
  UI, upload, webcam capture, results rendering
- `backend/app/api/routes/`
  HTTP entrypoints
- `backend/app/vision/`
  model and pipeline logic
- `docs/openapi.yaml`
  contract source of truth
- `frontend/src/generated/openapi.ts`
  generated frontend types

## Practical Recommendation

If you are extending this template today:

- keep `Next.js + FastAPI + OpenAPI` as-is
- use `OpenCV` for preprocessing and utility steps
- choose `MediaPipe` for landmarks and gesture-like tasks
- choose `YOLO` for detection-heavy tasks
- train in `PyTorch`
- deploy inference with `ONNX Runtime`

That combination keeps the repo teachable, modular, and close to production patterns without making a hackathon project too heavy too early.

## Official References

- OpenCV:
- MediaPipe:
- ONNX Runtime:
- PyTorch:
- Ultralytics YOLO:
- FastAPI:
- Next.js:
- OpenAPI: