Merged
1 change: 1 addition & 0 deletions README.md
@@ -155,6 +155,7 @@ An SBOM workflow also publishes SPDX artifacts for the repository source plus th
5. Split training and experimentation into a separate workspace later.

The short public roadmap lives in [soon.md](./soon.md).
A sign-language adaptation roadmap for this template lives in [roadmap.md](./roadmap.md).

## Repository Standards

371 changes: 371 additions & 0 deletions roadmap.md
@@ -0,0 +1,371 @@
# Sign-Language Roadmap For This Template

This roadmap answers a specific question:

What is the best way to turn this `Next.js + FastAPI` computer-vision template into a sign-language project without fighting the repo shape?

## Short Answer

For this template, the optimal path is:

1. prototype in `Colab` or a local notebook
2. train a small model on landmarks, not raw images
3. export the model to `ONNX`
4. run inference in the FastAPI backend
5. reuse the existing webcam and upload flows in the frontend
6. keep the API contract stable while the model improves

That is the best fit for this repo when the goal is a usable MVP, especially for:

- a sign alphabet demo
- a small vocabulary of static signs
- a single-user webcam experience

It is not automatically the best path for:

- full sign-language translation
- multi-person scenes
- long video understanding
- mobile-first deployment

## Scope Assumption

This roadmap assumes the first release is:

- one signer
- webcam-first
- real-time or near-real-time
- a limited sign set
- product demo quality before research-grade accuracy

If the target is full language understanding from day one, start with this roadmap anyway, but plan for an additional sequence-model and dataset phase later.

## Core Principles

- keep the repo detection-first and inference-first
- do training outside the runtime path
- keep the backend responsible for model loading and output shaping
- keep the frontend focused on capture, review, and feedback
- preserve the API contract as long as possible
- add complexity only when the current phase is clearly limiting you

## Why This Is The Optimal Path Here

This repo already gives you:

- webcam capture
- image upload
- a backend inference service
- a typed API contract
- a review-oriented frontend

The fastest way to make that useful for sign language is not to rebuild the whole stack. It is to swap the starter backend pipeline for a sign-focused pipeline and keep the rest of the product flow intact.

## Recommended Stack

- `MediaPipe Hand Landmarker` for the MVP
- `PyTorch` for training
- `ONNX` as the exported model format
- `ONNX Runtime` for backend serving
- `FastAPI` as the inference boundary
- existing `Next.js` webcam and upload UI for the product layer

Why:

- landmarks are easier to learn from than full frames for a small sign set
- webcam latency is better with local inference than a hosted API
- `ONNX Runtime` is a strong deployment path from training into production
- this fits the current repo without turning it into a research notebook dump

## What Not To Do First

- do not start with `YOLO` as the main recognizer for a single-person webcam demo
- do not start by changing the frontend to run the whole model client-side
- do not jump to full sentence-level sign translation before a static-sign baseline works
- do not mix training notebooks and runtime inference code into the same backend module
- do not add hosted model dependencies unless you are comfortable with latency and cost

## Phase 0: Define The Product Slice

Goal:

- pick a first version of the problem that this template can actually ship

Recommended choice:

- `ASL alphabet` or a `small sign set` of 10 to 30 classes

Deliverables:

- sign list
- class naming convention
- target frame size
- camera assumptions
- simple success metric such as top-1 accuracy plus prediction latency

Exit criteria:

- the team agrees on whether this is `static signs` or `dynamic signs`
- the project has a clear demo target

## Phase 1: Prototype In Colab Or A Notebook

Goal:

- prove that the signs can be separated with a lightweight pipeline

Use:

- `Colab` if you want quick setup and easy sharing
- local notebook if you want tighter control and local files

Tasks:

- collect or import a small labeled dataset
- run `MediaPipe Hand Landmarker`
- extract hand landmarks
- build a baseline classifier in `PyTorch`
- measure accuracy, confusion, and latency
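
The extraction and normalization step can be sketched without the full pipeline. Below is a minimal sketch of the landmark normalization that typically sits between `MediaPipe Hand Landmarker` output and the classifier, assuming 21 `(x, y, z)` hand landmarks with the wrist first (the MediaPipe call itself is elided):

```python
import math

def normalize_landmarks(landmarks):
    """Translate landmarks so the wrist is the origin, then scale so the
    farthest landmark sits at distance 1. This makes the features invariant
    to where the hand is in the frame and how large it appears.

    landmarks: list of 21 (x, y, z) tuples, wrist first (MediaPipe ordering).
    Returns a flat list of 63 floats suitable as classifier input.
    """
    wx, wy, wz = landmarks[0]
    centered = [(x - wx, y - wy, z - wz) for x, y, z in landmarks]
    # Guard against a degenerate all-equal frame with `or 1.0`.
    scale = max(math.sqrt(x * x + y * y + z * z) for x, y, z in centered) or 1.0
    return [v / scale for point in centered for v in point]
```

Feeding these 63 floats to the baseline classifier, rather than raw pixels, is what keeps the Phase 1 model small.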

Deliverables:

- one notebook that can reproduce baseline results
- sample confusion matrix
- saved training artifacts
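
Measuring which labels are confused needs no framework. A minimal sketch that counts `(true, predicted)` pairs; the label names are illustrative:

```python
from collections import Counter

def confusion_counts(pairs):
    """Count (true_label, predicted_label) occurrences.

    pairs: iterable of (true, predicted) label pairs from an eval run.
    Returns a Counter mapping (true, predicted) -> count, from which
    per-class accuracy and the most-confused pairs fall out directly.
    """
    return Counter(pairs)

def most_confused(counts, n=3):
    """Return the n largest off-diagonal cells, i.e. the worst confusions."""
    errors = {k: v for k, v in counts.items() if k[0] != k[1]}
    return sorted(errors.items(), key=lambda kv: -kv[1])[:n]
```

The output of `most_confused` is exactly the "you know which labels are confused" exit criterion below.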

Exit criteria:

- the model is clearly better than guessing
- you know which labels are confused
- you can export the trained model or reproduce the training run

## Phase 2: Separate Training From Runtime

Goal:

- stop treating the notebook as the product

Recommended repo shape:

- `notebooks/` for experiments
- `training/` later if training becomes a real workspace
- backend stays focused on inference only

Tasks:

- document dataset assumptions
- save model version metadata
- define reproducible preprocessing steps
- export the best baseline to `ONNX`
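
The export itself is a single `torch.onnx.export` call; the artifacts worth sketching are the label map and version metadata that travel with it. The file layout and field names below are assumptions for illustration, not repo conventions:

```python
import json

# Hypothetical metadata written next to the exported ONNX file so the
# backend can load the model without importing any training code.
metadata = {
    "model": "sign-static",
    "version": "0.1.0",
    "input_shape": [1, 63],     # 21 landmarks x (x, y, z)
    "preprocessing": "wrist-centered, max-distance scaled",
    "labels": ["A", "B", "C"],  # index in this list == class index
}

def save_metadata(path, meta):
    with open(path, "w") as f:
        json.dump(meta, f, indent=2)

def load_label(meta, class_index):
    """Map a classifier output index back to a sign label."""
    return meta["labels"][class_index]
```

Keeping the label map in a sidecar file like this is what lets the Phase 3 backend stay free of training dependencies.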

Deliverables:

- `ONNX` model artifact
- preprocessing notes
- label map

Exit criteria:

- the model can be loaded outside the notebook
- preprocessing is stable and documented

## Phase 3: Add A Sign Pipeline To The Backend

Goal:

- make the trained model available through the template's inference service

Best fit in this repo:

- add a new pipeline in `backend/app/vision/service.py`
- keep model-specific loading behind the vision service boundary
- reuse `backend/app/api/routes/inference.py`

Recommended first pipeline:

- `sign-static`

Tasks:

- load the `ONNX` model in the backend
- run landmark extraction
- run classification
- return typed results
- add tests for the pipeline behavior
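
Loading the model reduces to one `onnxruntime.InferenceSession` call, so the sketch below focuses on shaping raw logits into a typed result instead. The label map is assumed to come from the Phase 2 metadata; the session call is elided:

```python
import math

def shape_prediction(logits, labels):
    """Turn raw classifier logits into (label, confidence, alternatives).

    logits: list of floats, one per class, as returned by the ONNX session.
    labels: class-index -> sign label list from the label map artifact.
    """
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: -probs[i])
    top = ranked[0]
    return {
        "label": labels[top],
        "confidence": probs[top],
        "alternatives": [
            {"label": labels[i], "confidence": probs[i]} for i in ranked[1:3]
        ],
    }
```

This is the piece the fixture tests should pin down, since it defines what the frontend ultimately renders.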

Contract guidance:

- preserve the existing response shape where possible
- use detections for hand boxes if available
- use metrics for latency or handedness
- if classification needs first-class output, add a clean typed field in `docs/openapi.yaml` instead of model-specific ad hoc fields
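
If classification does become first-class, the backend-side shape might look like the following. This is a stdlib-dataclass sketch only; the real repo would likely use Pydantic models, and the exact field names belong in `docs/openapi.yaml`, so treat every name here as an assumption:

```python
from dataclasses import dataclass, field, asdict

@dataclass
class SignPrediction:
    """Illustrative typed shape for a first-class classification field."""
    label: str
    confidence: float
    alternatives: list = field(default_factory=list)

pred = SignPrediction(label="A", confidence=0.91)
payload = asdict(pred)  # the dict that would be serialized into the response
```

A small dedicated type like this keeps the model-specific output out of the generic detection fields.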

Deliverables:

- working backend sign pipeline
- tests for known fixtures
- updated API contract if needed

Exit criteria:

- the frontend can call the pipeline through the existing endpoint
- the output is typed and documented

## Phase 4: Reuse The Existing Frontend

Goal:

- get value from the template instead of rewriting the UI

Use:

- `frontend/src/components/webcam-console.tsx`
- `frontend/src/components/inference-console.tsx`

Tasks:

- add the new pipeline to the pipeline list
- show the predicted sign prominently
- show confidence and relevant metrics
- optionally render hand boxes or landmarks
- keep the review surface simple

Recommended UX for the first version:

- live prediction
- confidence score
- top alternative prediction
- capture frame button
- clear visual state when confidence is low
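
The low-confidence visual state is just a threshold decision. Here is a sketch of that decision, shown in Python for consistency with the rest of this roadmap even though it would live in the TypeScript frontend; both thresholds are illustrative starting points, not tuned values:

```python
def prediction_state(confidence, runner_up_confidence,
                     accept=0.8, ambiguous_margin=0.15):
    """Decide the visual state for a prediction.

    Returns "low" when the top prediction is weak, "ambiguous" when the
    runner-up is too close to trust the top label, else "confident".
    """
    if confidence < accept:
        return "low"
    if confidence - runner_up_confidence < ambiguous_margin:
        return "ambiguous"
    return "confident"
```

Surfacing the "ambiguous" state explicitly is what makes the top-alternative prediction above useful rather than confusing.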

Exit criteria:

- a user can open the webcam page and get understandable predictions
- the result panel feels product-shaped, not notebook-shaped

## Phase 5: Add Evaluation And Regression Checks

Goal:

- make the sign pipeline safe to change

Tasks:

- add fixture images or short frame sets
- add snapshot-backed API responses when practical
- measure latency in the backend
- track per-class accuracy outside the runtime path
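
Latency measurement in the backend can start as a small stdlib wrapper. A sketch, using nearest-rank percentiles:

```python
import time

class LatencyTracker:
    """Record per-call latency and report simple percentiles."""

    def __init__(self):
        self.samples_ms = []

    def record(self, fn, *args, **kwargs):
        """Run fn, append its wall-clock duration in ms, return its result."""
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        self.samples_ms.append((time.perf_counter() - start) * 1000.0)
        return result

    def percentile(self, p):
        """Nearest-rank percentile over recorded samples, p in [0, 100]."""
        if not self.samples_ms:
            return None
        ordered = sorted(self.samples_ms)
        k = min(len(ordered) - 1, round(p / 100.0 * (len(ordered) - 1)))
        return ordered[k]
```

Reporting p50 and p95 rather than a mean keeps a single slow frame from hiding in the average.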

Deliverables:

- backend tests
- sample evaluation report
- performance notes

Exit criteria:

- you can change the model without guessing whether the app regressed

## Phase 6: Move From Static Signs To Dynamic Signs

Goal:

- support signs that depend on motion over time

When to do this:

- only after the static-sign path is stable

Recommended stack:

- `MediaPipe Holistic` or `hands + pose`
- a sequence model such as `LSTM`, `GRU`, or a small `Transformer`

Tasks:

- collect short sign sequences
- train a temporal model
- decide whether the backend needs a frame window or short clip input
- extend the API carefully if the current single-frame shape is no longer enough
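
The frame-window decision above can be kept small and explicit. This sketch covers the buffering side only; the temporal model itself (for example a small `GRU`) is assumed to consume the full window, and the size of 30 is an illustrative choice, roughly one second at 30 fps:

```python
from collections import deque

class FrameWindow:
    """Fixed-size sliding window of per-frame landmark feature vectors."""

    def __init__(self, size=30):
        self.frames = deque(maxlen=size)  # old frames fall off automatically

    def push(self, features):
        self.frames.append(features)

    @property
    def ready(self):
        """True once enough frames have arrived to run the sequence model."""
        return len(self.frames) == self.frames.maxlen

    def as_sequence(self):
        return list(self.frames)
```

Whether this buffer lives in the frontend (sending short clips) or the backend (accumulating frames per session) is exactly the contract question to settle before extending the API.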

Deliverables:

- `sign-sequence` pipeline
- temporal confidence output
- updated contract if frame windows are introduced

Exit criteria:

- the dynamic model beats the static baseline on motion-dependent signs

## Phase 7: Production Hardening

Goal:

- make the project reliable enough for real demos or deployment

Tasks:

- add model versioning
- improve error handling for camera and input failures
- benchmark CPU and memory usage
- consider GPU or TensorRT only if latency actually requires it
- add observability for inference timing and failure rates
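
Versioned model loading can start as filename resolution. A sketch, assuming a `name-MAJOR.MINOR.PATCH.onnx` naming scheme that this repo does not itself prescribe:

```python
import re

def latest_model(filenames, prefix="sign-static"):
    """Pick the newest model artifact by semantic version from names
    like 'sign-static-1.2.0.onnx'. Compares versions numerically, so
    1.10.0 correctly beats 1.2.0 (a plain string sort would not).
    """
    pattern = re.compile(rf"{re.escape(prefix)}-(\d+)\.(\d+)\.(\d+)\.onnx$")
    best = None
    for name in filenames:
        m = pattern.search(name)
        if m:
            version = tuple(int(g) for g in m.groups())
            if best is None or version > best[0]:
                best = (version, name)
    return best[1] if best else None
```

Logging the resolved version at startup is the cheapest observability win on this list.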

Deliverables:

- versioned model loading
- release notes for model changes
- deployment checklist

Exit criteria:

- the app is repeatable, testable, and stable across environments

## Suggested Milestone Order

1. static-sign scope
2. notebook baseline
3. `ONNX` export
4. backend `sign-static` pipeline
5. webcam UI integration
6. tests and evaluation
7. dynamic-sign extension
8. production hardening

## Decision Rules

- if one webcam user is the target, prefer landmarks before object detection
- if you need full-body or facial context, move from hands-only to holistic features
- if the notebook cannot reproduce results, do not integrate the model yet
- if the frontend needs model-specific fields, add them through OpenAPI, not hidden assumptions
- if latency is good enough on CPU, do not optimize infrastructure early

## Where To Put Things

- experiments: `notebooks/`
- future repeatable training workspace: `training/`
- inference integration: `backend/app/vision/`
- contract updates: `docs/openapi.yaml`
- generated frontend types: `frontend/src/generated/openapi.ts`
- user-facing capture and review UI: `frontend/src/components/`

## Recommended First Release

The best first release for a sign-language adaptation of this template is:

- static signs only
- webcam-first
- one signer
- local inference
- typed backend contract
- visible confidence score
- clear fallback when confidence is low

That is realistic, demonstrable, and aligned with the template's strengths.

## Related Docs

- `docs/sign-language-template.md`
- `docs/tooling.md`
- `soon.md`