Production ML platform for 7-class emotion classification (anger, disgust, fear, joy, neutral, sadness, surprise) deployed on AWS EKS.
```
┌─────────────────────────────────────────────────────────────────┐
│ Data Pipeline        Training Pipeline         Inference        │
│ ─────────────        ─────────────────         ─────────        │
│ S3 raw data     ──►  Kubeflow PyTorchJob  ──►  FastAPI (EKS)    │
│ preprocess.py        RoBERTa fine-tune         /predict endpoint│
│ ingest.py            MLflow tracking           HPA autoscaling  │
│                      S3 model artefacts        MongoDB logging  │
└─────────────────────────────────────────────────────────────────┘
```
- Model: cardiffnlp/twitter-roberta-base-emotion fine-tuned for 7 classes
- Infrastructure: AWS EKS 1.29, Terraform, Helm, GitHub Actions
- Observability: Prometheus + Grafana, DCGM GPU metrics, custom alerting rules
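The inference service turns the model's raw logits over the 7 classes into probabilities and a predicted label. A minimal sketch of that post-processing, assuming the alphabetical label order from the class list above (the fine-tuned model's actual label order may differ):

```python
import math

# Assumed label order; verify against the model's id2label mapping.
LABELS = ["anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise"]

def softmax(logits):
    """Numerically stable softmax: shift by the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_label(logits):
    """Return (label, probabilities) for one example's 7 logits."""
    probs = softmax(logits)
    return LABELS[probs.index(max(probs))], probs

# Highest logit is at index 3 -> "joy"
label, probs = predict_label([0.1, -1.2, 0.3, 2.5, 0.0, -0.4, 0.2])
```

The max-shift in `softmax` avoids overflow for large logits without changing the result.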
- Python 3.11+
- Docker
- kubectl, helm, terraform, AWS CLI
- AWS credentials with EKS / ECR / S3 access
```bash
git clone https://github.com/your-org/emox && cd emox
make install
source .venv/bin/activate
make test
make lint
```

```bash
# 1. Provision infrastructure
make tf-apply-prod

# 2. Bootstrap cluster components (Kubeflow, Prometheus, IRSA, etc.)
export CLUSTER_NAME=ml-platform-prod
export TRAINING_ROLE_ARN=$(terraform -chdir=infra/terraform/environments/prod output -raw training_role_arn)
export INFERENCE_ROLE_ARN=$(terraform -chdir=infra/terraform/environments/prod output -raw inference_role_arn)
bash scripts/bootstrap.sh

# 3. Deploy inference service
IMAGE_TAG=latest make deploy-prod
```

```bash
export RUN_NAME=roberta-v2-$(date +%Y%m%d)
export MODEL_S3_BUCKET=emox-models-prod
export DATASET_S3_KEY=datasets/emotion-v3.jsonl
export ECR_TRAINING_REPO=$(terraform -chdir=infra/terraform/environments/prod output -raw ecr_training_url)
export IMAGE_TAG=abc1234
envsubst < k8s/training/pytorch-job.yaml | kubectl apply -f -
kubectl logs -n training -l run=${RUN_NAME} -f
```

```bash
export MODEL_S3_BUCKET=emox-models-prod
export MODEL_S3_KEY=models/roberta-v2-20260313/model.tar.gz
export CLUSTER=ml-platform-prod
```
```bash
bash scripts/deploy_model.sh
```

```
emox/
├── src/
│   ├── training/              # RoBERTa fine-tuning (Hugging Face Trainer)
│   │   ├── model.py           # EmotionClassifier, predict_proba(), predict()
│   │   ├── train.py           # Training entrypoint, MLflow integration
│   │   ├── dataset.py         # EmotionDataset, compute_metrics
│   │   ├── callbacks.py       # EarlyStoppingWithLogging
│   │   └── config.py          # TrainingConfig (pydantic)
│   ├── inference/             # FastAPI prediction service
│   │   ├── main.py            # /predict, /health, /metrics endpoints
│   │   ├── predictor.py       # Model loading, batch inference
│   │   └── schemas.py         # Request / response Pydantic models
│   ├── data_pipeline/         # Data ingestion and preprocessing
│   │   ├── ingest.py
│   │   └── preprocess.py
│   └── shared/                # Cross-component utilities
│       ├── s3_utils.py
│       ├── mongo_client.py
│       └── logging_config.py
├── tests/
│   ├── unit/                  # Fast, no external deps
│   └── integration/           # Require real AWS / MongoDB
├── docker/
│   ├── inference/Dockerfile
│   └── training/Dockerfile
├── k8s/
│   ├── inference/             # Deployment, Service, Ingress
│   ├── training/              # PyTorchJob manifest (envsubst)
│   └── monitoring/            # PrometheusRule, Grafana configmap
├── infra/
│   ├── helm/
│   │   ├── inference/         # Helm chart for inference service
│   │   ├── mlflow/            # MLflow tracking server
│   │   ├── mongodb/           # MongoDB replica set (Bitnami)
│   │   └── monitoring/        # Prometheus rules + Grafana dashboards
│   └── terraform/
│       ├── modules/           # vpc, eks, iam, s3, ecr
│       └── environments/
│           ├── prod/
│           └── staging/
├── cicd/
│   └── .github/workflows/deploy.yaml
├── scripts/
│   ├── bootstrap.sh           # One-time cluster setup
│   ├── deploy_model.sh        # Hot-swap model in running inference pods
│   └── rotate_keys.sh         # Rotate AWS / MongoDB credentials
├── Makefile                   # Developer shortcuts
└── README.md
```
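The training manifest under `k8s/training/` is a plain `envsubst` template rendered from the exported environment variables. The same `${VAR}` substitution can be sketched with Python's `string.Template`; the manifest snippet below is invented for illustration, not the repo's actual file:

```python
from string import Template

# Minimal stand-in for k8s/training/pytorch-job.yaml (fields illustrative only)
manifest = Template("""\
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: ${RUN_NAME}
  namespace: training
  labels:
    run: ${RUN_NAME}
""")

rendered = manifest.substitute(RUN_NAME="roberta-v2-20260313")
```

One difference worth knowing: `substitute()` raises `KeyError` on an unset variable, while `envsubst` silently replaces it with an empty string; `safe_substitute()` is the lenient equivalent.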
| # | File | Severity | Description |
|---|------|----------|-------------|
| 1 | `src/training/model.py` | Critical | Added missing `predict_proba()` and `predict()` methods to `EmotionClassifier`; both are called by `test_model.py` |
| 2 | `src/training/train.py` | Critical | Fixed `trainer.tokenizer.save_pretrained()`; `Trainer` has no `.tokenizer` attribute, so the tokenizer is now passed explicitly to `_save_and_register()` |
| 3 | `infra/helm/inference/templates/deployment.yaml` | Medium | Fixed `$(VAR)` → `${VAR}` in the initContainer shell command; K8s interprets `$(VAR)` as variable substitution from `env:`, not `envFrom:` |
| 4 | `scripts/bootstrap.sh` | Medium | Replaced the broken `kubectl create sa \| kubectl annotate --local -f -` pipe with an idempotent `kubectl apply -f - <<EOF` heredoc |
| 5 | `scripts/deploy_model.sh` | Medium | Port-forward failure is now detectable; stderr is logged to a temp file and `kill -0 $PF_PID` is polled to confirm the process is alive before `curl` |
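Fix #5 works because `kill -0` delivers no signal at all; it only checks that the PID still exists and is signalable. The same liveness probe, sketched in Python rather than the script's actual shell:

```python
import os
import subprocess

def is_alive(pid: int) -> bool:
    """True if a process with this PID exists; signal 0 sends nothing."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False           # no such process
    except PermissionError:
        return True            # exists, but owned by another user
    return True

# A short-lived child stands in for the port-forward process
proc = subprocess.Popen(["sleep", "0.2"])
assert is_alive(proc.pid)
proc.wait()                    # once reaped, the PID no longer exists
```

On POSIX systems this is the standard PID-liveness idiom; like the shell version, it is subject to PID reuse on long timescales.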
- Prometheus scrapes inference metrics at `:8000/metrics`
- Grafana dashboards in `k8s/monitoring/grafana-dashboard-configmap.yaml`
- Alerts: `InferenceLatencyHigh`, `InferenceErrorRateHigh`, `GPUMemoryHighPressure`, `TrainingJobStalled`
- Logs: structured JSON via `shared/logging_config.py`, shipped to CloudWatch via Fluent Bit
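The alert names above live in the `k8s/monitoring/` PrometheusRule manifest. A hypothetical rule-group fragment for one of them, with the metric name, threshold, and labels invented for illustration (the repo's actual rule may differ); in a `PrometheusRule` custom resource this block sits under `spec:`:

```yaml
groups:
  - name: inference.rules
    rules:
      - alert: InferenceLatencyHigh
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket{job="inference"}[5m])) by (le)
          ) > 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: p95 /predict latency above 500ms for 10 minutes
```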
- All workloads use IRSA (IAM Roles for Service Accounts) — no static AWS keys in-cluster
- Secrets stored in AWS Secrets Manager, mounted via External Secrets Operator
- ECR image scanning enabled on push
- S3 buckets are private with server-side encryption (AES256)
- Terraform remote state in S3 with DynamoDB locking
MIT © 2026 Shubham — see LICENSE