This repository contains resources for creating production-grade ML inference processors. Models are expected to be hosted on the EOTDL as Q2+ models (ONNX models + STAC metadata with the MLM extension). The following features are included:
- CPU/GPU inference
- Docker
- Kubernetes
- Auto-scaling
- Load testing
- Batch & Online processing
- Monitoring & Alerting
- Data drift detection
- Security & Safety
- Testing
To run the default API with Docker, you can use the following command:
# cpu
docker run -p 8000:80 -e EOTDL_API_KEY=<eotdl_api_key> earthpulseit/ml-inference
# gpu
docker run --gpus all -p 8000:80 -e EOTDL_API_KEY=<eotdl_api_key> earthpulseit/ml-inference-gpu
You can get your EOTDL API key for free by signing up at EOTDL and creating a new token in your profile.
You can also use the sample k8s manifests to deploy the API to a Kubernetes cluster.
kubectl apply -f k8s/deployment.yaml
By default, requests to the API are processed sequentially. You can change this behavior by setting the BATCH_SIZE and BATCH_TIMEOUT environment variables:
- BATCH_SIZE: Maximum number of requests to process in a single batch.
- BATCH_TIMEOUT: Maximum time (in seconds) to wait before processing an incomplete batch.
# cpu
docker run -p 8000:80 -e EOTDL_API_KEY=<eotdl_api_key> -e BATCH_SIZE=<batch_size> -e BATCH_TIMEOUT=<batch_timeout> earthpulseit/ml-inference
# gpu
docker run --gpus all -p 8000:80 -e EOTDL_API_KEY=<eotdl_api_key> -e BATCH_SIZE=<batch_size> -e BATCH_TIMEOUT=<batch_timeout> earthpulseit/ml-inference-gpu
Batching is particularly useful when concurrent requests arrive faster than your hardware can process them one at a time.
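As a rough illustration of how such batching typically works (a sketch of the general pattern, not the repository's actual implementation; run_model_batch is a hypothetical function), a worker collects requests in a queue and flushes them when BATCH_SIZE is reached or BATCH_TIMEOUT expires:

```python
import asyncio
import os

# Read the batching configuration from the environment, as the API does.
BATCH_SIZE = int(os.getenv("BATCH_SIZE", "1"))
BATCH_TIMEOUT = float(os.getenv("BATCH_TIMEOUT", "0.1"))

queue: asyncio.Queue = asyncio.Queue()

async def batch_worker(run_model_batch):
    """Collect requests until the batch is full or the timeout expires, then run inference once."""
    loop = asyncio.get_event_loop()
    while True:
        inputs, futures = [], []
        # Block until the first request of the next batch arrives.
        item, fut = await queue.get()
        inputs.append(item)
        futures.append(fut)
        deadline = loop.time() + BATCH_TIMEOUT
        while len(inputs) < BATCH_SIZE:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:
                item, fut = await asyncio.wait_for(queue.get(), timeout=remaining)
            except asyncio.TimeoutError:
                break
            inputs.append(item)
            futures.append(fut)
        # One forward pass for the whole batch; each request gets its own result back.
        results = run_model_batch(inputs)
        for fut, result in zip(futures, results):
            fut.set_result(result)
```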
In order to monitor the API, you can use Prometheus and Grafana. For this we recommend using the docker-compose.monitoring.yaml file or the corresponding k8s deployments.
docker compose -f docker-compose.monitoring.yaml -f docker-compose.cpu.yaml up
In Grafana:
- Add Prometheus as a data source (URL: http://prometheus:9090)
- Import dashboard for FastAPI monitoring (you can start with dashboard ID 18739 from Grafana's dashboard marketplace)
This setup will give you:
- Basic metrics like request count, latency, and status codes
- System metrics like CPU and memory usage
- Custom metrics that you can add later
- Visualization and alerting capabilities through Grafana
You can further customize the monitoring by:
- Adding custom metrics in your FastAPI code
- Creating custom Grafana dashboards
- Setting up alerts in Grafana
- Adding more Prometheus exporters for system metrics
Some custom metrics included are:
- model_counter: Number of models requested
- model_error_counter: Number of errors in model inference
- model_inference_duration: Time spent processing inference requests
- model_inference_batch_size: Number of images in the batch
- model_inference_timeout: Number of inference requests that timed out
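The metric names above are the ones exported by the API. As a sketch of what declaring them with the prometheus_client library can look like (the label names and exact metric types below are assumptions, not the repository's code):

```python
from prometheus_client import Counter, Histogram

# Illustrative declarations; label names and metric types are assumptions.
model_counter = Counter("model_counter", "Number of models requested", ["model"])
model_error_counter = Counter("model_error_counter", "Number of errors in model inference", ["model"])
model_inference_duration = Histogram("model_inference_duration", "Time spent processing inference requests", ["model"])
model_inference_batch_size = Histogram("model_inference_batch_size", "Number of images in the batch")
model_inference_timeout = Counter("model_inference_timeout", "Number of inference requests that timed out")

# Example usage inside an inference handler:
# model_counter.labels(model=model_name).inc()
# with model_inference_duration.labels(model=model_name).time():
#     outputs = run_inference(batch)
```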
You can set alerts in Grafana for these metrics by going to Alerting > Alert Rules in the Grafana dashboard. Some examples are:
- rate(model_inference_errors_total[1m]) > 10: Alert if there are more than 10 errors in the last minute
- histogram_quantile(0.95, rate(model_inference_duration_seconds_bucket[1m])) > 10: Alert if the 95th percentile of request processing time exceeds 10 seconds
- You can add more alerts using PromQL queries.
TODO: It may be interesting to create a custom dashboard with the included metrics and share it through Grafana's dashboard marketplace.
Alternatively, you can use AlertManager to send notifications via email, Slack, etc. (out of the scope of this repository).
Data drift detection is an important monitoring practice in ML systems that helps identify when the statistical properties of your production data differ significantly from the training data. This difference can lead to model performance degradation over time. Common types of drift include:
- Feature drift: Changes in the input data distribution (e.g., image pixel values, color distributions)
- Label drift: Changes in the target variable distribution
- Concept drift: Changes in the relationship between features and target
Common causes are seasonal changes, changes in data collection methods, population shifts, hardware/sensor changes, and data quality issues.
You can use the DRIFT_DETECTION environment variable to enable drift detection. This will add a DriftDetector for each model. By default, the drift detector will monitor the input size for a given number of requests and report mean values with Prometheus (which can be visualized in Grafana and used to set alerts). Feel free to modify the src/drift.py file to monitor other metrics or to implement a different drift detection algorithm.
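For reference, a minimal detector along these lines could look like the sketch below. This illustrates the approach only; it is not the contents of src/drift.py, and the gauge name and window size are assumptions:

```python
from collections import deque
from prometheus_client import Gauge

# Illustrative gauge; the repository may export a different metric name.
drift_input_size_mean = Gauge(
    "drift_input_size_mean", "Mean input size over the last N requests", ["model"]
)

class DriftDetector:
    """Tracks the mean input size over a sliding window of requests."""

    def __init__(self, model_name: str, window: int = 100):
        self.model_name = model_name
        self.sizes = deque(maxlen=window)

    def update(self, input_size: int) -> None:
        # Record the size of the incoming input and report the window mean to Prometheus,
        # where it can be plotted in Grafana or used in alert rules.
        self.sizes.append(input_size)
        mean = sum(self.sizes) / len(self.sizes)
        drift_input_size_mean.labels(model=self.model_name).set(mean)
```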
The API contains several security features:
- API Key authentication
- Rate limiting
In order to require an API key, set the API_KEY environment variable to the desired key.
In order to enable rate limiting, set the RATE_LIMIT environment variable to the desired rate limit (e.g., 10/minute).
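For orientation, the sketch below shows how these two features are commonly wired into a FastAPI app. It assumes rate limiting with the slowapi package and an X-API-Key header; the repository's actual implementation may differ, and the /health endpoint is just a placeholder:

```python
import os
from fastapi import Depends, FastAPI, HTTPException, Request
from fastapi.security import APIKeyHeader
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.errors import RateLimitExceeded
from slowapi.util import get_remote_address

app = FastAPI()
limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

api_key_header = APIKeyHeader(name="X-API-Key", auto_error=False)

def check_api_key(api_key: str = Depends(api_key_header)):
    # Only enforce authentication if API_KEY is set in the environment.
    expected = os.getenv("API_KEY")
    if expected and api_key != expected:
        raise HTTPException(status_code=403, detail="Invalid API key")

@app.get("/health", dependencies=[Depends(check_api_key)])
@limiter.limit(os.getenv("RATE_LIMIT", "10/minute"))
def health(request: Request):
    # Placeholder endpoint protected by both the API key check and the rate limit.
    return {"status": "ok"}
```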
This repository offers functionality for creating production-grade APIs to perform inference on ML models.
To develop the API, run:
# cpu support
docker-compose -f docker-compose.cpu.yaml up
# gpu support
docker-compose -f docker-compose.gpu.yaml up
You can try the API with the interactive documentation at http://localhost:8000/docs.
Build the docker image:
# cpu
docker build -t <username>/<image-name>:<tag> api
# gpu
docker build -t <username>/<image-name>:<tag> -f api/Dockerfile.gpu api
Use your Docker Hub username and a tag for the image.
Push to Docker Hub:
docker push <username>/<image-name>:<tag>
You will need to log in to Docker Hub with your credentials before pushing the image.
You can run the image with:
# cpu
docker run -p 8000:8000 <username>/<image-name>:<tag>
# gpu
docker run --gpus all -p 8000:8000 <username>/<image-name>:<tag>
Start minikube:
minikube start
Add the metrics server if you want to use autoscaling:
minikube addons enable metrics-server
Create a config map from your environment file:
kubectl create configmap ml-inference-config --from-env-file=.env
Deploy api to cluster:
kubectl apply -f k8s/deployment.yaml
Change the image name in the deployment file to the one you created.
Port forward to access the api:
kubectl port-forward service/ml-inference-service 8000:80
Get the API logs:
kubectl logs -f deployment/ml-inference
You can autoscale your API with the following command:
kubectl apply -f k8s/hpa.yaml
Modify the manifest to adjust it to your needs.
You can test the autoscaling with a load test using Locust.
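A minimal locustfile for such a test might look like the sketch below; the endpoint and payload are placeholders to adapt to the actual API routes:

```python
from locust import HttpUser, between, task

class InferenceUser(HttpUser):
    # Each simulated user waits 1-2 seconds between requests.
    wait_time = between(1, 2)

    @task
    def predict(self):
        # Placeholder endpoint and payload; replace with the real inference route and inputs.
        self.client.post("/predict", json={"data": [0.0] * 10})
```

Run it against the port-forwarded service, e.g. locust -f locustfile.py --host http://localhost:8000, and watch the HPA scale the deployment as the load increases.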
You can monitor the API with Prometheus and Grafana.
kubectl apply -f k8s/prometheus-config.yaml
kubectl apply -f k8s/monitoring.yaml
Port forward to access Prometheus and Grafana:
kubectl port-forward service/prometheus-service 9090:9090
kubectl port-forward service/grafana-service 3000:3000
Connect Prometheus to Grafana using http://prometheus-service:9090 as the Prometheus URL.
Minikube GPU support is limited. Following https://minikube.sigs.k8s.io/docs/tutorials/nvidia/ does not seem to work.
- Deployment: install the NVIDIA device plugin on the k8s nodes and add resources: limits: nvidia.com/gpu: 1 to the deployment manifest. No more GPUs than are available on the nodes can be used.
- Autoscaling: expose GPU usage as a custom metric (Prometheus + node exporter + Prometheus adapter) and use it in the HPA manifest.
To run the API in a cloud Kubernetes cluster, follow the same steps as for minikube, taking the following into account:
- Change the service type to LoadBalancer or ClusterIP in the service section of the deployment manifests.
- Use a cloud provider that supports GPU nodes (if you want to use the GPU version).
- Use ingress.yaml to expose the API to the internet instead of port forwarding.
You can run the tests with:
docker-compose -f docker-compose.test.yaml up
You can add more tests to the api/tests folder.
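If you want to add a test, a new file in api/tests can follow the usual FastAPI/pytest pattern; the import path below is an assumption to adapt to the project layout:

```python
from fastapi.testclient import TestClient

# Assumed import path; adjust to wherever the FastAPI app is defined.
from src.main import app

client = TestClient(app)

def test_docs_are_served():
    # The interactive documentation should be reachable once the app starts.
    response = client.get("/docs")
    assert response.status_code == 200
```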