Production ML platform for 7-class emotion classification (anger, disgust, fear, joy, neutral, sadness, surprise) deployed on AWS EKS.
```
┌─────────────────────────────────────────────────────────────────┐
│ Data Pipeline        Training Pipeline         Inference        │
│ ─────────────        ─────────────────         ─────────        │
│ S3 raw data     ──►  Kubeflow PyTorchJob  ──►  FastAPI (EKS)    │
│ preprocess.py        RoBERTa fine-tune         /predict endpoint│
│ ingest.py            MLflow tracking           HPA autoscaling  │
│                      S3 model artefacts        MongoDB logging  │
└─────────────────────────────────────────────────────────────────┘
```
- Model: cardiffnlp/twitter-roberta-base-emotion fine-tuned for 7 classes
- Infrastructure: AWS EKS 1.29, Terraform, Helm, GitHub Actions
- Observability: Prometheus + Grafana, DCGM GPU metrics, custom alerting rules
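The inference service turns the model's raw logits over the 7 classes into probabilities and a predicted label. A minimal sketch of that post-processing, assuming the alphabetical label order from the class list above (the fine-tuned model's actual label order may differ):

```python
import math

# Assumed label order; verify against the model's id2label mapping.
LABELS = ["anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise"]

def softmax(logits):
    """Numerically stable softmax: shift by the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(logits):
    """Return (label, probabilities) for one example's 7 logits."""
    probs = softmax(logits)
    return LABELS[probs.index(max(probs))], probs

# Highest logit is at index 3 -> "joy"
label, probs = predict_label([0.1, -1.2, 0.3, 2.5, 0.0, -0.4, 0.2])
```

The max-shift in `softmax` avoids overflow for large logits without changing the result.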
- Python 3.11+
- Docker
- kubectl, helm, terraform, AWS CLI
- AWS credentials with EKS / ECR / S3 access
```bash
git clone https://github.com/your-org/emox && cd emox
make install
source .venv/bin/activate
make test
make lint
```

```bash
# 1. Provision infrastructure
make tf-apply-prod

# 2. Bootstrap cluster components (Kubeflow, Prometheus, IRSA, etc.)
export CLUSTER_NAME=ml-platform-prod
export TRAINING_ROLE_ARN=$(terraform -chdir=infra/terraform/environments/prod output -raw training_role_arn)
export INFERENCE_ROLE_ARN=$(terraform -chdir=infra/terraform/environments/prod output -raw inference_role_arn)
bash scripts/bootstrap.sh

# 3. Deploy inference service
IMAGE_TAG=latest make deploy-prod
```

```bash
export RUN_NAME=roberta-v2-$(date +%Y%m%d)
export MODEL_S3_BUCKET=emox-models-prod
export DATASET_S3_KEY=datasets/emotion-v3.jsonl
export ECR_TRAINING_REPO=$(terraform -chdir=infra/terraform/environments/prod output -raw ecr_training_url)
export IMAGE_TAG=abc1234
envsubst < k8s/training/pytorch-job.yaml | kubectl apply -f -
kubectl logs -n training -l run=${RUN_NAME} -f
```

```bash
export MODEL_S3_BUCKET=emox-models-prod
export MODEL_S3_KEY=models/roberta-v2-20260313/model.tar.gz
export CLUSTER=ml-platform-prod
```
```bash
bash scripts/deploy_model.sh
```

```
emox/
├── src/
│   ├── training/              # RoBERTa fine-tuning (Hugging Face Trainer)
│   │   ├── model.py           # EmotionClassifier, predict_proba(), predict()
│   │   ├── train.py           # Training entrypoint, MLflow integration
│   │   ├── dataset.py         # EmotionDataset, compute_metrics
│   │   ├── callbacks.py       # EarlyStoppingWithLogging
│   │   └── config.py          # TrainingConfig (pydantic)
│   ├── inference/             # FastAPI prediction service
│   │   ├── main.py            # /predict, /health, /metrics endpoints
│   │   ├── predictor.py       # Model loading, batch inference
│   │   └── schemas.py         # Request / response Pydantic models
│   ├── data_pipeline/         # Data ingestion and preprocessing
│   │   ├── ingest.py
│   │   └── preprocess.py
│   └── shared/                # Cross-component utilities
│       ├── s3_utils.py
│       ├── mongo_client.py
│       └── logging_config.py
├── tests/
│   ├── unit/                  # Fast, no external deps
│   └── integration/           # Require real AWS / MongoDB
├── docker/
│   ├── inference/Dockerfile
│   └── training/Dockerfile
├── k8s/
│   ├── inference/             # Deployment, Service, Ingress
│   ├── training/              # PyTorchJob manifest (envsubst)
│   └── monitoring/            # PrometheusRule, Grafana configmap
├── infra/
│   ├── helm/
│   │   ├── inference/         # Helm chart for inference service
│   │   ├── mlflow/            # MLflow tracking server
│   │   ├── mongodb/           # MongoDB replica set (Bitnami)
│   │   └── monitoring/        # Prometheus rules + Grafana dashboards
│   └── terraform/
│       ├── modules/           # vpc, eks, iam, s3, ecr
│       └── environments/
│           ├── prod/
│           └── staging/
├── cicd/
│   └── .github/workflows/deploy.yaml
├── scripts/
│   ├── bootstrap.sh           # One-time cluster setup
│   ├── deploy_model.sh        # Hot-swap model in running inference pods
│   └── rotate_keys.sh         # Rotate AWS / MongoDB credentials
├── Makefile                   # Developer shortcuts
└── README.md
```
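The training manifest under `k8s/training/` is a plain `envsubst` template rendered from the exported environment variables. The same `${VAR}` substitution can be sketched with Python's `string.Template`; the manifest snippet below is invented for illustration, not the repo's actual file:

```python
from string import Template

# Minimal stand-in for k8s/training/pytorch-job.yaml (fields illustrative only)
manifest = Template("""\
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: ${RUN_NAME}
  namespace: training
  labels:
    run: ${RUN_NAME}
""")

rendered = manifest.substitute(RUN_NAME="roberta-v2-20260313")
```

One difference worth knowing: `substitute()` raises `KeyError` on an unset variable, while `envsubst` silently replaces it with an empty string; `safe_substitute()` is the lenient equivalent.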
| # | File | Severity | Description |
|---|------|----------|-------------|
| 1 | `src/training/model.py` | Critical | Added missing `predict_proba()` and `predict()` methods to `EmotionClassifier`; both are called by `test_model.py` |
| 2 | `src/training/train.py` | Critical | Fixed `trainer.tokenizer.save_pretrained()`; `Trainer` has no `.tokenizer` attribute, so the tokenizer is now passed explicitly to `_save_and_register()` |
| 3 | `infra/helm/inference/templates/deployment.yaml` | Medium | Fixed `$(VAR)` → `${VAR}` in the initContainer shell command; K8s interprets `$(VAR)` as variable substitution from `env:`, not `envFrom:` |
| 4 | `scripts/bootstrap.sh` | Medium | Replaced the broken `kubectl create sa \| kubectl annotate --local -f -` pipe with an idempotent `kubectl apply -f - <<EOF` heredoc |
| 5 | `scripts/deploy_model.sh` | Medium | Port-forward failure is now detectable; stderr is logged to a temp file and `kill -0 $PF_PID` is polled to confirm the process is alive before `curl` |
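Fix #5 works because `kill -0` delivers no signal at all; it only checks that the PID still exists and is signalable. The same liveness probe, sketched in Python rather than the script's actual shell:

```python
import os
import subprocess

def is_alive(pid: int) -> bool:
    """True if a process with this PID exists; signal 0 sends nothing."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False           # no such process
    except PermissionError:
        return True            # exists, but owned by another user
    return True

# A short-lived child stands in for the port-forward process
proc = subprocess.Popen(["sleep", "0.2"])
assert is_alive(proc.pid)
proc.wait()                    # once reaped, the PID no longer exists
```

On POSIX systems this is the standard PID-liveness idiom; like the shell version, it is subject to PID reuse on long timescales.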
- Prometheus scrapes inference metrics at `:8000/metrics`
- Grafana dashboards in `k8s/monitoring/grafana-dashboard-configmap.yaml`
- Alerts: `InferenceLatencyHigh`, `InferenceErrorRateHigh`, `GPUMemoryHighPressure`, `TrainingJobStalled`
- Logs: structured JSON via `shared/logging_config.py`, shipped to CloudWatch via Fluent Bit
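The alert names above live in the `k8s/monitoring/` PrometheusRule manifest. A hypothetical rule-group fragment for one of them, with the metric name, threshold, and labels invented for illustration (the repo's actual rule may differ); in a `PrometheusRule` custom resource this block sits under `spec:`:

```yaml
groups:
  - name: inference.rules
    rules:
      - alert: InferenceLatencyHigh
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket{job="inference"}[5m])) by (le)
          ) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: p95 /predict latency above 500ms for 10 minutes
```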
- All workloads use IRSA (IAM Roles for Service Accounts) — no static AWS keys in-cluster
- Secrets stored in AWS Secrets Manager, mounted via External Secrets Operator
- ECR image scanning enabled on push
- S3 buckets are private with server-side encryption (AES256)
- Terraform remote state in S3 with DynamoDB locking
MIT © 2026 Shubham — see LICENSE