[0.1.6] Deploying argoflow-aws #227
Description
We're setting up Kubeflow (argoflow-aws) from scratch, including the infrastructure, and hit some stumbling blocks along the way. I wanted to document them all here (for now) and address them as needed with PRs etc.
I realize that #84 exists and I'm happy to merge into there, but I'm not sure that issue deals with the specific 0.1.6 tag. That might be part of my problem as well, since some things are more up-to-date on the master branch.
Current issues (can be triaged and split into separate issues or merged into existing issues)
❌ OPEN ISSUES
These are mainly based on broken functionality or application statuses in ArgoCD
knative
- Impact: Low
- ArgoCD resources out of sync (MutatingWebhookConfiguration and ValidatingWebhookConfiguration)
- Auto Sync currently turned off to debug
- Details: [0.1.6] Deploying argoflow-aws #227 (comment)
mpi-operator (https://github.com/kubeflow/mpi-operator)
- Impact: Low (not used by our org)
- Related to
- What `mpi-operator` does:
  The MPI Operator makes it easy to run allreduce-style distributed training on Kubernetes.
- Crashes (see the check after this list)
- Logs:
  flag provided but not defined: -kubectl-delivery-image
  Usage of /opt/mpi-operator:
    -add_dir_header
        If true, adds the file directory to the header ...
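A quick way to confirm the mismatch is to compare the flags the deployed manifest passes against what the binary reports in its usage output. A minimal sketch, assuming the Deployment and namespace are both named `mpi-operator` (adjust to whatever names argoflow-aws actually uses):

```sh
# Print the args the manifest passes to the operator container.
# Deployment and namespace names here are assumptions.
kubectl -n mpi-operator get deployment mpi-operator \
  -o jsonpath='{.spec.template.spec.containers[0].args}'
```

If `-kubectl-delivery-image` shows up here but not in the usage output from the crash logs, the manifest and image tag are out of step; newer `mpi-operator` releases dropped the kubectl-delivery mechanism along with its flag.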
aws-eks-resources
- Impact: Low
- ArgoCD resources out of sync (probably needs `ignoreDifferences`; see the sketch below)
- Auto Sync currently turned off to debug
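If the drift is in server-managed fields (for the knative webhooks above, the usual suspect is the injected `caBundle`), an `ignoreDifferences` entry on the ArgoCD Application silences it. A minimal sketch, not taken from this repo: the Application name and JSON pointers are assumptions, and in argoflow-aws this would normally live in the Application manifest itself rather than a live patch.

```sh
# Sketch: tell ArgoCD to ignore the caBundle field that CA injection rewrites
# on the webhook configs. Application name and pointer paths are assumptions.
kubectl -n argocd patch application aws-eks-resources --type merge -p '
{
  "spec": {
    "ignoreDifferences": [
      {
        "group": "admissionregistration.k8s.io",
        "kind": "MutatingWebhookConfiguration",
        "jsonPointers": ["/webhooks/0/clientConfig/caBundle"]
      },
      {
        "group": "admissionregistration.k8s.io",
        "kind": "ValidatingWebhookConfiguration",
        "jsonPointers": ["/webhooks/0/clientConfig/caBundle"]
      }
    ]
  }
}'
```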
✅ SOLVED ISSUES
[✅ SOLVED] oauth2-proxy
- Impact: Unknown
- Problem - CreateContainerConfigError: secret "oauth2-proxy" not found
- Solution - Secrets need to be manually updated in AWS Secrets Manager for oauth2-proxy (see the sketch below):
  - client-id and client-secret (GCP link)
  - cookie-secret (generated by Terraform - see the `kubeflow_oidc_cookie_secret` output variable)
- Problem - cannot contact the Redis cluster
- Solution - Redis cluster needs to be in the same VPC Security Group as the EKS cluster (see https://github.com/honestbank/argoflow-aws-infrastructure/issues/2)
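For reference, updating the secret from the CLI looks roughly like this. A sketch only: the secret id and the JSON key names are assumptions based on the fields listed above, so match them to your setup.

```sh
# Pull the cookie secret Terraform generated (output name from above).
COOKIE_SECRET=$(terraform output -raw kubeflow_oidc_cookie_secret)

# Push all three values into the Secrets Manager secret the pod reads.
# Secret id and key names are assumptions.
aws secretsmanager put-secret-value \
  --secret-id oauth2-proxy \
  --secret-string "{\"client-id\":\"<your-client-id>\",\"client-secret\":\"<your-client-secret>\",\"cookie-secret\":\"${COOKIE_SECRET}\"}"
```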
[✅ SOLVED] pipelines
- Impact: High
- Crash Loop
- Logs:
  F0914 02:03:01.977497 7 main.go:240] Fatal error config file: While parsing config: invalid character 't' after object key:value pair
- Solution - values in `setup.conf` must NOT be quoted (see the illustration below)
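Why quoting breaks things: presumably the setup script substitutes `setup.conf` values verbatim into JSON config templates, so a quoted value ends up double-quoted and triggers the parse error above. A hypothetical illustration; the key name here is made up and the real keys live in argoflow-aws's `setup.conf`:

```sh
# Hypothetical key, for illustration only.
# Wrong - after substitution into a JSON template this yields
#   "host": ""mysql.example.com""
# which fails with "invalid character ... after object key:value pair":
database_host="mysql.example.com"

# Right - the template already supplies the surrounding quotes:
database_host=mysql.example.com
```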
[✅ SOLVED] aws-load-balancer-controller
- Impact: High
- Blocks accessing UI/dashboard
- Load Balancer isn't being created, logs:
  2021/09/14 09:46:15 http: TLS handshake error from 172.31.39.152:54030: remote error: tls: bad certificate
  {"level":"error","ts":1631613104.4709718,"logger":"controller","msg":"Reconciler error","controller":"service","name":"istio-ingressgateway","namespace":"istio-system","error":"Internal error occurred: failed calling webhook \"mtargetgroupbinding.elbv2.k8s.aws\": Post \"https://aws-load-balancer-webhook-service.kube-system.svc:443/mutate-elbv2-k8s-aws-v1beta1-targetgroupbinding?timeout=10s\": x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"aws-load-balancer-controller-ca\")"}
- Solution:
  - The EKS subnets need to be tagged correctly (see https://github.com/honestbank/argoflow-aws-infrastructure/issues/3 and the tag sketch below)
  - Deleted and re-synced the ArgoCD Application. There was most likely some kind of race condition/invalid value from the first installation of argoflow-aws.
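The tags aws-load-balancer-controller uses for subnet discovery are well documented; applying them from the CLI looks roughly like this. Subnet IDs and the cluster name are placeholders, and whether tagging alone resolves it depends on the rest of the setup.

```sh
# Public subnets: used for internet-facing load balancers.
aws ec2 create-tags --resources subnet-aaaa1111 \
  --tags Key=kubernetes.io/role/elb,Value=1

# Private subnets: used for internal load balancers.
aws ec2 create-tags --resources subnet-bbbb2222 \
  --tags Key=kubernetes.io/role/internal-elb,Value=1

# All subnets the cluster uses: associate them with the cluster.
aws ec2 create-tags --resources subnet-aaaa1111 subnet-bbbb2222 \
  --tags "Key=kubernetes.io/cluster/<cluster-name>,Value=shared"
```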
[✅ SOLVED] Central Dashboard
- Impact: High
- Can't access any applications - dashboard 404s for all sub-apps
- Related:
- Logs:
  > kubeflow-centraldashboard@0.0.2 start /app
  > npm run serve
  > kubeflow-centraldashboard@0.0.2 serve /app
  > node dist/server.js
  Initializing Kubernetes configuration
  Unable to fetch Application information: 404 page not found
  "aws" is not a supported platform for Metrics
  Using Profiles service at http://profiles-kfam.kubeflow:8081/kfam
  Server listening on port http://localhost:8082 (in production mode)
  Unable to fetch Application information: 404 page not found
  2021-09-14T02:39:12.655692792Z
- Update - it seems we shouldn't port-forward into the dashboard; however, `aws-load-balancer-controller` had an issue (see above)
- Solution: the dashboard cannot be accessed using `kubectl port-forward` but rather needs to be accessed through the proper URL of `<<__subdomain_dashboard__>>.<<__domain__>>` (see the check below)
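To double-check where that proper URL should resolve, the load balancer hostname can be read straight off the ingress gateway Service (the service and namespace names appear in the aws-load-balancer-controller logs above):

```sh
# Print the ELB hostname behind the Istio ingress gateway; the dashboard's
# DNS record (<<__subdomain_dashboard__>>.<<__domain__>>) should point here.
kubectl -n istio-system get svc istio-ingressgateway \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
```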
[✅ SOLVED] kube-prometheus-stack
- Impact: Low
- `kube-prometheus-stack-grafana` ConfigMap and Secret are going out of sync (in ArgoCD), which causes checksums in the Deployment to go out of sync as well
- Was an issue on v0.1.6, resolved by deploying `master` (b90cb8a)