ssupshub/Emox

emox — Emotion Classifier ML Platform

Production ML platform for 7-class emotion classification (anger, disgust, fear, joy, neutral, sadness, surprise) deployed on AWS EKS.

Architecture

┌─────────────────────────────────────────────────────────────────────┐
│  Data Pipeline          Training Pipeline         Inference          │
│  ─────────────         ─────────────────         ─────────          │
│  S3 raw data    ──►    Kubeflow PyTorchJob  ──►  FastAPI (EKS)      │
│  preprocess.py         RoBERTa fine-tune         /predict endpoint   │
│  ingest.py             MLflow tracking           HPA autoscaling     │
│                        S3 model artefacts        MongoDB logging     │
└─────────────────────────────────────────────────────────────────────┘

Model: cardiffnlp/twitter-roberta-base-emotion fine-tuned for 7 classes
Infrastructure: AWS EKS 1.29, Terraform, Helm, GitHub Actions
Observability: Prometheus + Grafana, DCGM GPU metrics, custom alerting rules


Quick Start

Prerequisites

  • Python 3.11+
  • Docker
  • kubectl, helm, terraform, aws CLI
  • AWS credentials with EKS / ECR / S3 access

Local development

git clone https://github.com/ssupshub/Emox emox && cd emox
make install
source .venv/bin/activate
make test
make lint

First-time cluster bootstrap

# 1. Provision infrastructure
make tf-apply-prod

# 2. Bootstrap cluster components (Kubeflow, Prometheus, IRSA, etc.)
export CLUSTER_NAME=ml-platform-prod
export TRAINING_ROLE_ARN=$(terraform -chdir=infra/terraform/environments/prod output -raw training_role_arn)
export INFERENCE_ROLE_ARN=$(terraform -chdir=infra/terraform/environments/prod output -raw inference_role_arn)
bash scripts/bootstrap.sh

# 3. Deploy inference service
IMAGE_TAG=latest make deploy-prod

Launch a training job

export RUN_NAME=roberta-v2-$(date +%Y%m%d)
export MODEL_S3_BUCKET=emox-models-prod
export DATASET_S3_KEY=datasets/emotion-v3.jsonl
export ECR_TRAINING_REPO=$(terraform -chdir=infra/terraform/environments/prod output -raw ecr_training_url)
export IMAGE_TAG=abc1234

envsubst < k8s/training/pytorch-job.yaml | kubectl apply -f -
kubectl logs -n training -l run=${RUN_NAME} -f
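
envsubst replaces any unset variable with an empty string, so a forgotten export can silently produce a broken manifest. The following is a minimal preflight sketch, not part of the repository: the helper name missing_variables and the sample manifest fragment are hypothetical, and the real k8s/training/pytorch-job.yaml may reference other variables.

```python
import re
from typing import Mapping

# Matches the two reference forms that envsubst expands: ${VAR} and $VAR.
# Kubernetes-style $(VAR) references are deliberately not matched.
_PLACEHOLDER = re.compile(
    r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}|\$([A-Za-z_][A-Za-z0-9_]*)"
)


def missing_variables(manifest_text: str, env: Mapping[str, str]) -> list[str]:
    """Return sorted variable names referenced in the text but unset or empty in env."""
    names = {braced or bare for braced, bare in _PLACEHOLDER.findall(manifest_text)}
    return sorted(name for name in names if not env.get(name))


# Hypothetical manifest fragment, for illustration only.
example = "image: ${ECR_TRAINING_REPO}:${IMAGE_TAG}\nname: ${RUN_NAME}\n"
print(missing_variables(example, {"RUN_NAME": "roberta-v2", "IMAGE_TAG": "abc1234"}))
# prints ['ECR_TRAINING_REPO']
```

Running a check like this against os.environ before the envsubst step would stop the launch when a required export is missing instead of submitting a manifest with empty fields.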

Deploy a trained model

export MODEL_S3_BUCKET=emox-models-prod
export MODEL_S3_KEY=models/roberta-v2-20260313/model.tar.gz
export CLUSTER=ml-platform-prod
bash scripts/deploy_model.sh
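
According to the Bugs Fixed notes in this README, scripts/deploy_model.sh captures the port-forward's stderr in a temp file and confirms the process is still alive before calling curl. The same pattern can be sketched in Python as follows. This is an illustration under assumptions, not the script's actual code: the function name start_and_confirm, the settle window, and the kubectl arguments in the comment are all hypothetical.

```python
import os
import subprocess
import tempfile
import time
from pathlib import Path


def start_and_confirm(cmd: list[str], settle_seconds: float = 2.0,
                      interval: float = 0.2) -> tuple[subprocess.Popen, str]:
    """Start cmd in the background and confirm it survives a settle window.

    stderr is written to a temp file so a failure can be reported.
    Returns (process, stderr_path); raises RuntimeError if the process exits early.
    """
    fd, stderr_path = tempfile.mkstemp(suffix=".log")
    with os.fdopen(fd, "w") as err:
        # The child keeps its own copy of the file descriptor after this block closes ours.
        proc = subprocess.Popen(cmd, stdout=subprocess.DEVNULL, stderr=err)

    deadline = time.monotonic() + settle_seconds
    while True:
        # poll() returning a code means the process has exited,
        # the situation the shell script detects with kill -0 $PF_PID.
        if proc.poll() is not None:
            details = Path(stderr_path).read_text().strip()
            raise RuntimeError(
                f"background process exited with code {proc.returncode}: {details}"
            )
        if time.monotonic() >= deadline:
            return proc, stderr_path
        time.sleep(interval)


# Hypothetical usage; the namespace, service name, and ports are assumptions:
# proc, log_path = start_and_confirm(
#     ["kubectl", "port-forward", "-n", "inference", "svc/emox-inference", "8000:8000"]
# )
```

A liveness check only shows that the process has not died; it does not prove the forwarded port is accepting connections yet, so a caller may still want to retry the first request.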

Repository Structure

emox/
├── src/
│   ├── training/          # RoBERTa fine-tuning (Hugging Face Trainer)
│   │   ├── model.py       # EmotionClassifier, predict_proba(), predict()
│   │   ├── train.py       # Training entrypoint, MLflow integration
│   │   ├── dataset.py     # EmotionDataset, compute_metrics
│   │   ├── callbacks.py   # EarlyStoppingWithLogging
│   │   └── config.py      # TrainingConfig (pydantic)
│   ├── inference/         # FastAPI prediction service
│   │   ├── main.py        # /predict, /health, /metrics endpoints
│   │   ├── predictor.py   # Model loading, batch inference
│   │   └── schemas.py     # Request / response Pydantic models
│   ├── data_pipeline/     # Data ingestion and preprocessing
│   │   ├── ingest.py
│   │   └── preprocess.py
│   └── shared/            # Cross-component utilities
│       ├── s3_utils.py
│       ├── mongo_client.py
│       └── logging_config.py
├── tests/
│   ├── unit/              # Fast, no external deps
│   └── integration/       # Require real AWS / MongoDB
├── docker/
│   ├── inference/Dockerfile
│   └── training/Dockerfile
├── k8s/
│   ├── inference/         # Deployment, Service, Ingress
│   ├── training/          # PyTorchJob manifest (envsubst)
│   └── monitoring/        # PrometheusRule, Grafana configmap
├── infra/
│   ├── helm/
│   │   ├── inference/     # Helm chart for inference service
│   │   ├── mlflow/        # MLflow tracking server
│   │   ├── mongodb/       # MongoDB replica set (Bitnami)
│   │   └── monitoring/    # Prometheus rules + Grafana dashboards
│   └── terraform/
│       ├── modules/       # vpc, eks, iam, s3, ecr
│       └── environments/
│           ├── prod/
│           └── staging/
├── cicd/
│   └── .github/workflows/deploy.yaml
├── scripts/
│   ├── bootstrap.sh       # One-time cluster setup
│   ├── deploy_model.sh    # Hot-swap model in running inference pods
│   └── rotate_keys.sh     # Rotate AWS / MongoDB credentials
├── Makefile               # Developer shortcuts
└── README.md

Bugs Fixed

#  File                                            Severity  Description
1  src/training/model.py                           Critical  Added the missing predict_proba() and predict() methods to EmotionClassifier; both are called by test_model.py
2  src/training/train.py                           Critical  Fixed the trainer.tokenizer.save_pretrained() call (Trainer has no .tokenizer attribute); the tokenizer is now passed explicitly to _save_and_register()
3  infra/helm/inference/templates/deployment.yaml  Medium    Changed $(VAR) to ${VAR} in the initContainer shell command; K8s interprets $(VAR) as variable substitution from env:, not envFrom:
4  scripts/bootstrap.sh                            Medium    Replaced the broken kubectl create sa | kubectl annotate --local -f - pipe with an idempotent kubectl apply -f - <<EOF heredoc
5  scripts/deploy_model.sh                         Medium    Port-forward failure is now detectable: the script logs stderr to a temp file and polls kill -0 $PF_PID to confirm the process is alive before running curl
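
To make the first fix concrete, here is a minimal sketch of what predict_proba() and predict() typically compute for a 7-class classifier. It is an assumption-laden illustration, not the code in src/training/model.py: the real methods presumably take text and run the RoBERTa model, while this version starts from raw logits, and the label order shown is hypothetical (the model's own id-to-label mapping defines the real one).

```python
import math
from typing import Sequence

# Assumed label order, for illustration only.
LABELS = ("anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise")


def softmax(logits: Sequence[float]) -> list[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_proba(logits: Sequence[float]) -> dict[str, float]:
    """Map one row of 7 logits to a label -> probability dictionary."""
    if len(logits) != len(LABELS):
        raise ValueError(f"expected {len(LABELS)} logits, got {len(logits)}")
    return dict(zip(LABELS, softmax(logits)))


def predict(logits: Sequence[float]) -> str:
    """Return the label with the highest probability."""
    probs = predict_proba(logits)
    return max(probs, key=probs.get)


print(predict([0.1, 0.2, 0.3, 5.0, 0.0, -1.0, 0.4]))
# prints joy
```

Defining predict() in terms of predict_proba() keeps the two methods consistent, which is convenient when unit tests call both.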

Observability

  • Prometheus scrapes inference metrics at :8000/metrics
  • Grafana dashboards in k8s/monitoring/grafana-dashboard-configmap.yaml
  • Alerts: InferenceLatencyHigh, InferenceErrorRateHigh, GPUMemoryHighPressure, TrainingJobStalled
  • Logs: structured JSON via shared/logging_config.py, shipped to CloudWatch via Fluent Bit
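
Structured JSON logging of the kind described in the last bullet can be sketched with the standard library alone. The names JsonFormatter and configure_logging and the chosen field names are assumptions for illustration; the actual contents of shared/logging_config.py are not shown in this README.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # Include the formatted traceback when the record carries an exception.
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def configure_logging(level: int = logging.INFO) -> None:
    """Replace the root logger's handlers with one JSON handler on stdout."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    root = logging.getLogger()
    root.handlers = [handler]
    root.setLevel(level)
```

Emitting one JSON object per line to stdout keeps each record self-contained, which a log shipper can parse without multi-line handling.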

Security

  • All workloads use IRSA (IAM Roles for Service Accounts) — no static AWS keys in-cluster
  • Secrets stored in AWS Secrets Manager and synced into the cluster via External Secrets Operator
  • ECR image scanning enabled on push
  • S3 buckets are private with server-side encryption (AES256)
  • Terraform remote state in S3 with DynamoDB locking

License

MIT © 2026 Shubham — see LICENSE
