This repository was archived by the owner on Jul 11, 2023. It is now read-only.

[0.1.6] Deploying argoflow-aws #227

@jai

We're setting up Kubeflow (argoflow-aws) from scratch, including the infrastructure, and hit some stumbling blocks along the way. I want to document them all here for now and address them as needed with PRs.

I realize that #84 exists and am happy to merge this into it, but I'm not sure that issue deals with the 0.1.6 tag specifically. Using that tag may be part of my problem, since some things are more up to date on the master branch.

Current issues (can be triaged and split into separate issues or merged into existing issues)

❌ OPEN ISSUES

These are mainly based on broken functionality or application statuses in ArgoCD.

knative

mpi-operator (https://github.com/kubeflow/mpi-operator)

The MPI Operator makes it easy to run allreduce-style distributed training on Kubernetes.

  • Crashes

  • Logs

    flag provided but not defined: -kubectl-delivery-image
    Usage of /opt/mpi-operator:
      -add_dir_header
        	If true, adds the file directory to the header
    ...

aws-eks-resources

  • Impact: Low
  • ArgoCD resources out of sync (probably needs ignoreDifferences)
  • Auto Sync currently turned off to debug

✅ SOLVED ISSUES

[✅ SOLVED] oauth2-proxy

  • Impact: Unknown
  • Problem - CreateContainerConfigError: secret "oauth2-proxy" not found
    • Solution - Secrets need to be manually updated in AWS Secrets Manager for oauth2-proxy:
      • client-id and client-secret (GCP link)
      • cookie-secret (generated by Terraform - see the kubeflow_oidc_cookie_secret output variable)
  • Problem - cannot contact the Redis cluster

[✅ SOLVED] pipelines

  • Impact: High
  • Crash Loop
  • Logs:
F0914 02:03:01.977497       7 main.go:240] Fatal error config file: While parsing config: invalid character 't' after object key:value pair
  • Solution - values in setup.conf must NOT be quoted
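A quick way to catch this before deploying is to scan setup.conf for quoted values. The following is a minimal sketch, assuming the file holds one `key=value` pair per line with `#` comments; the helper name `find_quoted_values` and the sample keys `domain` and `region` are hypothetical and not taken from the repository.

```python
def find_quoted_values(text):
    """Return (line_number, key) for every value wrapped in matching quotes.

    Assumes one key=value pair per line; blank lines, '#' comments and
    lines without '=' are skipped.
    """
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#") or "=" not in stripped:
            continue
        key, value = stripped.split("=", 1)
        value = value.strip()
        # A value counts as quoted when it starts and ends with the same quote.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
            hits.append((number, key.strip()))
    return hits


# Hypothetical sample content; real keys in setup.conf will differ.
sample = 'domain="example.com"\nregion=eu-west-1\n'
for number, key in find_quoted_values(sample):
    print(f"line {number}: value for {key} is quoted")
```

Running this against the real file (for example `find_quoted_values(open("setup.conf").read())`) should return an empty list once all quotes are removed.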

[✅ SOLVED] aws-load-balancer-controller

  • Impact: High
  • Blocks accessing UI/dashboard
  • Load Balancer isn't being created, logs:
2021/09/14 09:46:15 http: TLS handshake error from 172.31.39.152:54030: remote error: tls: bad certificate

{"level":"error","ts":1631613104.4709718,"logger":"controller","msg":"Reconciler error","controller":"service","name":"istio-ingressgateway","namespace":"istio-system","error":"Internal error occurred: failed calling webhook \"mtargetgroupbinding.elbv2.k8s.aws\": Post \"https://aws-load-balancer-webhook-service.kube-system.svc:443/mutate-elbv2-k8s-aws-v1beta1-targetgroupbinding?timeout=10s\": x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"aws-load-balancer-controller-ca\")"}

[✅ SOLVED] Central Dashboard

> kubeflow-centraldashboard@0.0.2 start /app
> npm run serve


> kubeflow-centraldashboard@0.0.2 serve /app
> node dist/server.js

Initializing Kubernetes configuration
Unable to fetch Application information: 404 page not found

"aws" is not a supported platform for Metrics
Using Profiles service at http://profiles-kfam.kubeflow:8081/kfam
Server listening on port http://localhost:8082 (in production mode)
Unable to fetch Application information: 404 page not found
2021-09-14T02:39:12.655692792Z
  • Update - it seems we shouldn't port-forward into the dashboard. However, aws-load-balancer-controller has an issue (see its section above)
  • Solution - the dashboard cannot be accessed using kubectl port-forward; it needs to be accessed through its proper URL, <<__subdomain_dashboard__>>.<<__domain__>>
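To illustrate how that URL resolves, here is a small sketch that substitutes the two placeholders. The `render` helper, the `https://` scheme, and the sample values `kubeflow` and `example.com` are assumptions for illustration only; the actual values come from your own configuration.

```python
def render(template, values):
    """Replace each <<__name__>> placeholder with its configured value."""
    for name, value in values.items():
        template = template.replace(f"<<__{name}__>>", value)
    return template


# Hypothetical values; substitute the ones from your own setup.
dashboard_url = render(
    "https://<<__subdomain_dashboard__>>.<<__domain__>>",
    {"subdomain_dashboard": "kubeflow", "domain": "example.com"},
)
print(dashboard_url)
```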

[✅ SOLVED] kube-prometheus-stack

  • Impact: Low
  • kube-prometheus-stack-grafana ConfigMap and Secret are going out of sync (in ArgoCD), which causes checksums in the Deployment to go out of sync as well
  • Was an issue on v0.1.6, resolved by deploying master (b90cb8a)
