diff --git a/README.md b/README.md
index 8062283..b9a2583 100644
--- a/README.md
+++ b/README.md
@@ -155,6 +155,7 @@ An SBOM workflow also publishes SPDX artifacts for the repository source plus th
 5. Split training and experimentation into a separate workspace later.
 
 The short public roadmap lives in [soon.md](./soon.md).
+A sign-language adaptation roadmap for this template lives in [roadmap.md](./roadmap.md).
 
 ## Repository Standards
 
diff --git a/roadmap.md b/roadmap.md
new file mode 100644
index 0000000..42c3e05
--- /dev/null
+++ b/roadmap.md
@@ -0,0 +1,371 @@
+# Sign-Language Roadmap For This Template
+
+This roadmap answers a specific question:
+
+What is the best way to turn this `Next.js + FastAPI` computer-vision template into a sign-language project without fighting the repo shape?
+
+## Short Answer
+
+For this template, the optimal path is:
+
+1. prototype in `Colab` or a local notebook
+2. train a small model on landmarks, not raw images
+3. export the model to `ONNX`
+4. run inference in the FastAPI backend
+5. reuse the existing webcam and upload flows in the frontend
+6. keep the API contract stable while the model improves
+
+That is the best fit for this repo when the goal is a usable MVP, especially for:
+
+- a sign alphabet demo
+- a small vocabulary of static signs
+- a single-user webcam experience
+
+It is not automatically the best path for:
+
+- full sign-language translation
+- multi-person scenes
+- long video understanding
+- mobile-first deployment
+
+## Scope Assumption
+
+This roadmap assumes the first release is:
+
+- one signer
+- webcam-first
+- real-time or near-real-time
+- a limited sign set
+- product demo quality before research-grade accuracy
+
+If the target is full language understanding from day one, this roadmap should still be used as the starting path, but you should expect an additional sequence-model and dataset phase later.
+
+## Core Principles
+
+- keep the repo detection-first and inference-first
+- do training outside the runtime path
+- keep the backend responsible for model loading and output shaping
+- keep the frontend focused on capture, review, and feedback
+- preserve the API contract as long as possible
+- add complexity only when the current phase is clearly limiting you
+
+## Why This Is The Optimal Path Here
+
+This repo already gives you:
+
+- webcam capture
+- image upload
+- a backend inference service
+- a typed API contract
+- a review-oriented frontend
+
+The fastest way to make that useful for sign language is not to rebuild the whole stack. It is to swap the starter backend pipeline for a sign-focused pipeline and keep the rest of the product flow intact.
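+
+As a concrete illustration of the "landmarks, not raw images" idea, the sketch below shows roughly what per-frame feature extraction can look like with the `MediaPipe Hand Landmarker` recommended in the stack below. It is a minimal sketch, not code from this repo: the `hand_landmarker.task` asset path, the sample file name, and the 63-value flattening are illustrative assumptions.
+
+```python
+import mediapipe as mp
+from mediapipe.tasks import python as mp_tasks
+from mediapipe.tasks.python import vision
+
+# Assumes the hand_landmarker.task model asset has been downloaded locally.
+options = vision.HandLandmarkerOptions(
+    base_options=mp_tasks.BaseOptions(model_asset_path="hand_landmarker.task"),
+    num_hands=1,
+)
+landmarker = vision.HandLandmarker.create_from_options(options)
+
+image = mp.Image.create_from_file("sample_frame.jpg")
+result = landmarker.detect(image)
+
+if result.hand_landmarks:
+    # 21 landmarks per hand, each with normalized x, y, z coordinates.
+    hand = result.hand_landmarks[0]
+    features = [value for lm in hand for value in (lm.x, lm.y, lm.z)]
+    # A 63-value vector like this is the training input, not the raw frame.
+```
+
+A small `PyTorch` classifier over vectors like this is often enough for a 10 to 30 class static-sign baseline.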
+
+## Recommended Stack
+
+- `MediaPipe Hand Landmarker` for the MVP
+- `PyTorch` for training
+- `ONNX` as the exported model format
+- `ONNX Runtime` for backend serving
+- `FastAPI` as the inference boundary
+- existing `Next.js` webcam and upload UI for the product layer
+
+Why:
+
+- landmarks are easier to learn from than full frames for a small sign set
+- webcam latency is better with local inference than with a hosted API
+- `ONNX Runtime` is a strong deployment path from training into production
+- this fits the current repo without turning it into a research notebook dump
+
+## What Not To Do First
+
+- do not start with `YOLO` as the main recognizer for a single-person webcam demo
+- do not start by changing the frontend to run the whole model client-side
+- do not jump to full sentence-level sign translation before a static-sign baseline works
+- do not mix training notebooks and runtime inference code into the same backend module
+- do not add hosted model dependencies unless you are comfortable with latency and cost
+
+## Phase 0: Define The Product Slice
+
+Goal:
+
+- pick a first version of the problem that this template can actually ship
+
+Recommended choice:
+
+- `ASL alphabet` or a `small sign set` of 10 to 30 classes
+
+Deliverables:
+
+- sign list
+- class naming convention
+- target frame size
+- camera assumptions
+- simple success metric such as top-1 accuracy plus prediction latency
+
+Exit criteria:
+
+- the team agrees on whether this is `static signs` or `dynamic signs`
+- the project has a clear demo target
+
+## Phase 1: Prototype In Colab Or A Notebook
+
+Goal:
+
+- prove that the signs can be separated with a lightweight pipeline
+
+Use:
+
+- `Colab` if you want quick setup and easy sharing
+- local notebook if you want tighter control and local files
+
+Tasks:
+
+- collect or import a small labeled dataset
+- run `MediaPipe Hand Landmarker`
+- extract hand landmarks
+- build a baseline classifier in `PyTorch`
+- measure accuracy, confusion, and latency
+
+Deliverables:
+
+- one notebook that can reproduce baseline results
+- sample confusion matrix
+- saved training artifacts
+
+Exit criteria:
+
+- the model is clearly better than guessing
+- you know which labels are confused
+- you can export the trained model or reproduce the training run
+
+## Phase 2: Separate Training From Runtime
+
+Goal:
+
+- stop treating the notebook as the product
+
+Recommended repo shape:
+
+- `notebooks/` for experiments
+- `training/` later if training becomes a real workspace
+- backend stays focused on inference only
+
+Tasks:
+
+- document dataset assumptions
+- save model version metadata
+- define reproducible preprocessing steps
+- export the best baseline to `ONNX`
+
+Deliverables:
+
+- `ONNX` model artifact
+- preprocessing notes
+- label map
+
+Exit criteria:
+
+- the model can be loaded outside the notebook
+- preprocessing is stable and documented
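+
+A minimal sketch of the Phase 2 export step, assuming a small landmark classifier from the Phase 1 baseline. The `SignClassifier` module, the 63-feature input size, and the file names are illustrative assumptions, not existing repo code:
+
+```python
+import torch
+import torch.nn as nn
+
+class SignClassifier(nn.Module):
+    """Placeholder landmark classifier: 63 landmark values in, one logit per sign out."""
+    def __init__(self, num_classes: int = 26):
+        super().__init__()
+        self.net = nn.Sequential(
+            nn.Linear(63, 128), nn.ReLU(),
+            nn.Linear(128, num_classes),
+        )
+
+    def forward(self, x: torch.Tensor) -> torch.Tensor:
+        return self.net(x)
+
+model = SignClassifier()
+model.load_state_dict(torch.load("best_baseline.pt"))  # assumed Phase 1 training artifact
+model.eval()
+
+# Export with named inputs and outputs so the backend does not depend on positional details.
+dummy = torch.zeros(1, 63)
+torch.onnx.export(
+    model,
+    dummy,
+    "sign_static.onnx",
+    input_names=["landmarks"],
+    output_names=["logits"],
+    dynamic_axes={"landmarks": {0: "batch"}, "logits": {0: "batch"}},
+)
+```
+
+Keeping the label map next to the exported file (for example as a `sign_static.labels.json`) makes the Phase 3 integration much less error-prone.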
+
+## Phase 3: Add A Sign Pipeline To The Backend
+
+Goal:
+
+- make the trained model available through the template's inference service
+
+Best fit in this repo:
+
+- add a new pipeline in `backend/app/vision/service.py`
+- keep model-specific loading behind the vision service boundary
+- reuse `backend/app/api/routes/inference.py`
+
+Recommended first pipeline:
+
+- `sign-static`
+
+Tasks:
+
+- load the `ONNX` model in the backend
+- run landmark extraction
+- run classification
+- return typed results
+- add tests for the pipeline behavior
+
+Contract guidance:
+
+- preserve the existing response shape where possible
+- use detections for hand boxes if available
+- use metrics for latency or handedness
+- if classification needs first-class output, add a clean typed field in `docs/openapi.yaml` instead of model-specific ad hoc fields
+
+Deliverables:
+
+- working backend sign pipeline
+- tests for known fixtures
+- updated API contract if needed
+
+Exit criteria:
+
+- the frontend can call the pipeline through the existing endpoint
+- the output is typed and documented
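+
+As a minimal sketch of the load-and-classify tasks above, this is roughly what the `ONNX Runtime` call can look like inside the vision service. The model path, the `SIGN_LABELS` placeholder, and the function name are illustrative assumptions; only the `onnxruntime` calls themselves are standard:
+
+```python
+import numpy as np
+import onnxruntime as ort
+
+# Load once at startup, behind the vision service boundary.
+_session = ort.InferenceSession("models/sign_static.onnx", providers=["CPUExecutionProvider"])
+_input_name = _session.get_inputs()[0].name
+SIGN_LABELS = ["a", "b", "c"]  # placeholder; load the real label map from the Phase 2 artifact
+
+def classify_landmarks(landmarks: np.ndarray) -> dict:
+    """Run the exported classifier on a (63,) landmark feature vector."""
+    logits = _session.run(None, {_input_name: landmarks.reshape(1, -1).astype(np.float32)})[0][0]
+    probs = np.exp(logits - logits.max())
+    probs /= probs.sum()
+    best = int(probs.argmax())
+    return {"label": SIGN_LABELS[best], "confidence": float(probs[best])}
+```
+
+How that result is surfaced should follow the contract guidance above: keep it typed, and add any new field through `docs/openapi.yaml` rather than ad hoc.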
+
+## Phase 4: Reuse The Existing Frontend
+
+Goal:
+
+- get value from the template instead of rewriting the UI
+
+Use:
+
+- `frontend/src/components/webcam-console.tsx`
+- `frontend/src/components/inference-console.tsx`
+
+Tasks:
+
+- add the new pipeline to the pipeline list
+- show the predicted sign prominently
+- show confidence and relevant metrics
+- optionally render hand boxes or landmarks
+- keep the review surface simple
+
+Recommended UX for the first version:
+
+- live prediction
+- confidence score
+- top alternative prediction
+- capture frame button
+- clear visual state when confidence is low
+
+Exit criteria:
+
+- a user can open the webcam page and get understandable predictions
+- the result panel feels product-shaped, not notebook-shaped
+
+## Phase 5: Add Evaluation And Regression Checks
+
+Goal:
+
+- make the sign pipeline safe to change
+
+Tasks:
+
+- add fixture images or short frame sets
+- add snapshot-backed API responses when practical
+- measure latency in the backend
+- track per-class accuracy outside the runtime path
+
+Deliverables:
+
+- backend tests
+- sample evaluation report
+- performance notes
+
+Exit criteria:
+
+- you can change the model without guessing whether the app regressed
+
+## Phase 6: Move From Static Signs To Dynamic Signs
+
+Goal:
+
+- support signs that depend on motion over time
+
+When to do this:
+
+- only after the static-sign path is stable
+
+Recommended stack:
+
+- `MediaPipe Holistic` or `hands + pose`
+- a sequence model such as `LSTM`, `GRU`, or a small `Transformer`
+
+Tasks:
+
+- collect short sign sequences
+- train a temporal model
+- decide whether the backend needs a frame window or short clip input
+- extend the API carefully if the current single-frame shape is no longer enough
+
+Deliverables:
+
+- `sign-sequence` pipeline
+- temporal confidence output
+- updated contract if frame windows are introduced
+
+Exit criteria:
+
+- the dynamic model beats the static baseline on motion-dependent signs
+
+## Phase 7: Production Hardening
+
+Goal:
+
+- make the project reliable enough for real demos or deployment
+
+Tasks:
+
+- add model versioning
+- improve error handling for camera and input failures
+- benchmark CPU and memory usage
+- consider GPU or TensorRT only if latency actually requires it
+- add observability for inference timing and failure rates
+
+Deliverables:
+
+- versioned model loading
+- release notes for model changes
+- deployment checklist
+
+Exit criteria:
+
+- the app is repeatable, testable, and stable across environments
+
+## Suggested Milestone Order
+
+1. static-sign scope
+2. notebook baseline
+3. `ONNX` export
+4. backend `sign-static` pipeline
+5. webcam UI integration
+6. tests and evaluation
+7. dynamic-sign extension
+8. production hardening
+
+## Decision Rules
+
+- if one webcam user is the target, prefer landmarks before object detection
+- if you need full-body or facial context, move from hands-only to holistic features
+- if the notebook cannot reproduce results, do not integrate the model yet
+- if the frontend needs model-specific fields, add them through OpenAPI, not hidden assumptions
+- if latency is good enough on CPU, do not optimize infrastructure early
+
+## Where To Put Things
+
+- experiments: `notebooks/`
+- future repeatable training workspace: `training/`
+- inference integration: `backend/app/vision/`
+- contract updates: `docs/openapi.yaml`
+- generated frontend types: `frontend/src/generated/openapi.ts`
+- user-facing capture and review UI: `frontend/src/components/`
+
+## Recommended First Release
+
+The best first release for a sign-language adaptation of this template is:
+
+- static signs only
+- webcam-first
+- one signer
+- local inference
+- typed backend contract
+- visible confidence score
+- clear fallback when confidence is low
+
+That is realistic, demonstrable, and aligned with the template's strengths.
+
+## Related Docs
+
+- `docs/sign-language-template.md`
+- `docs/tooling.md`
+- `soon.md`
diff --git a/soon.md b/soon.md
index bd179c1..48ded21 100644
--- a/soon.md
+++ b/soon.md
@@ -37,6 +37,24 @@ Training should stay adjacent to the app, not mixed into the runtime path.
 3. Add evaluation and regression scripts for sample predictions.
 4. Add experiment tracking hooks for metrics, artifacts, and model versions.
 
+## Deployment Status
+
+The template itself is close to deploy-ready today:
+
+1. production Dockerfiles already exist for the frontend and backend
+2. release tags already publish images and a GitHub Release
+3. release smoke checks already validate the published images
+
+The sign-language adaptation path is not deploy-ready yet.
+
+Before treating that version as deployable, the next gaps to close are:
+
+1. add the actual sign-language inference pipeline in the backend
+2. define model artifact packaging and versioning
+3. set production CORS and environment values for the deployed frontend domain
+4. add a production-oriented deployment target or guide for a real host
+5. add regression checks for the sign-language model outputs
+
 ## Recommended Sequence
 
 If you are extending the repo from here, the cleanest order is:
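
As a closing note on gap 3 in the soon.md changes above (production CORS and environment values), this is the kind of FastAPI setting that gap implies. It is a minimal sketch only: the `FRONTEND_ORIGIN` variable name and the placement in the backend app module are assumptions, not existing repo conventions.

```python
import os

from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()

# Restrict browser access to the deployed frontend domain instead of a wildcard.
app.add_middleware(
    CORSMiddleware,
    allow_origins=[os.environ.get("FRONTEND_ORIGIN", "http://localhost:3000")],
    allow_methods=["GET", "POST"],
    allow_headers=["*"],
)
```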