This is an end-to-end ASL sign language translation model designed to be deployable and production-ready for inference. The model is a finetune of VideoMAE, starting from weights pretrained on Something-Something-V2 (SSv2), trained on the ASL-Citizen dataset.
The current state of the model is isolated-sign (single-word) translation: it generates a sequence of words that may or may not form a grammatically correct sentence (will update this after testing). Future goals include temporally-aware, sentence-level translation (possibly via large language models?). I'll be looking for more research that explores this field to find better foundation models for gesture translation.
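For reference, single-sign inference with the classification head looks roughly like the sketch below. It builds a small randomly-initialized VideoMAE config so it runs anywhere without downloading weights; in practice you would load the finetuned checkpoint with `from_pretrained`. The gloss count (2,731, roughly the number of distinct signs in ASL-Citizen) and the shrunken layer sizes are illustrative, not the project's actual values.

```python
import torch
from transformers import VideoMAEConfig, VideoMAEForVideoClassification

# Illustrative gloss vocabulary size (ASL-Citizen covers ~2,731 signs).
NUM_GLOSSES = 2731

# Small randomly-initialized config for the sketch; the real model would be
# loaded with VideoMAEForVideoClassification.from_pretrained(<checkpoint>).
config = VideoMAEConfig(
    image_size=224,
    num_frames=16,
    hidden_size=128,          # shrunk for the sketch
    num_hidden_layers=2,
    num_attention_heads=4,
    intermediate_size=256,
    num_labels=NUM_GLOSSES,
)
model = VideoMAEForVideoClassification(config).eval()

# Dummy clip: (batch, frames, channels, height, width).
clip = torch.randn(1, 16, 3, 224, 224)
with torch.no_grad():
    logits = model(pixel_values=clip).logits  # shape (1, NUM_GLOSSES)
predicted_gloss_id = logits.argmax(-1).item()
```

Running the model per-clip like this yields one gloss per segment, which is why the current output reads as a word sequence rather than a sentence.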
Currently, the project is using:
- PyTorch, PyTorch Lightning, Hugging Face Transformers
  - model architecture, training loop, initial model weights
  - distributed training (multi-node and multi-GPU)
- Kubernetes/Kubeflow
  - distributed multi-node training
- MLflow (Databricks-hosted)
  - experiment analytics and tracking
  - artifact store
- ONNX and TensorRT
  - deployment and production-ready inference
- GitHub Actions, Docker
  - CI/CD triggered on new production weights (marked in MLflow)
  - building the inference container
TODO:
- secrets setup instructions
- github actions workflows/secrets
- mlflow instructions
- dataset directory setup instructions
- dockerhub setup instructions
- dockerfile (inference)
There are two options for training:
- Kubernetes (using the Kubeflow operator)
- Local training
TODO:
- kubernetes manifests
- training
- pvc
First, you will need to install uv if you haven't already. Run the following command to install it:

```sh
wget -qO- https://astral.sh/uv/install.sh | sh
```

Then, run

```sh
uv sync
```

to install the packages required to train the model.