[0.1.6] Deploying argoflow-aws #227
Description
We're setting up Kubeflow (argoflow-aws) from scratch, including the infrastructure, and hit some stumbling blocks along the way. I wanted to document them all here (for now) and address them as needed with PRs etc.
I realize that #84 exists and I'm happy to merge into there, but I'm not sure that issue deals with the specific 0.1.6 tag. That might be part of my problem as well, since some things are more up-to-date on the master branch.
Current issues (can be triaged and split into separate issues or merged into existing issues)
❌ OPEN ISSUES
These are mainly based on broken functionality or application statuses in ArgoCD
knative
- Impact: Low
- ArgoCD resources out of sync (MutatingWebhookConfiguration and ValidatingWebhookConfiguration)
- Auto Sync currently turned off to debug
- Details: [0.1.6] Deploying argoflow-aws #227 (comment)
mpi-operator (https://github.com/kubeflow/mpi-operator)
- Impact: Low (not used by our org)
- Related to
- What `mpi-operator` does:
  The MPI Operator makes it easy to run allreduce-style distributed training on Kubernetes.
- Crashes (see the check after this list)
- Logs:
  flag provided but not defined: -kubectl-delivery-image
  Usage of /opt/mpi-operator:
    -add_dir_header
        If true, adds the file directory to the header ...
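A quick way to confirm the mismatch is to compare the flags the deployed manifest passes against what the binary reports in its usage output. A minimal sketch, assuming the Deployment and namespace are both named `mpi-operator` (adjust to whatever names argoflow-aws actually uses):

```sh
# Print the args the manifest passes to the operator container.
# Deployment and namespace names here are assumptions.
kubectl -n mpi-operator get deployment mpi-operator \
  -o jsonpath='{.spec.template.spec.containers[0].args}'
```

If `-kubectl-delivery-image` shows up here but not in the usage output from the crash logs, the manifest and image tag are out of step; newer `mpi-operator` releases dropped the kubectl-delivery mechanism along with its flag.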
aws-eks-resources
- Impact: Low
- ArgoCD resources out of sync (probably needs `ignoreDifferences`; see the sketch below)
- Auto Sync currently turned off to debug
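If the drift is in server-managed fields (for the knative webhooks above, the usual suspect is the injected `caBundle`), an `ignoreDifferences` entry on the ArgoCD Application silences it. A minimal sketch, not taken from this repo: the Application name and JSON pointers are assumptions, and in argoflow-aws this would normally live in the Application manifest itself rather than a live patch.

```sh
# Sketch: tell ArgoCD to ignore the caBundle field that CA injection rewrites
# on the webhook configs. Application name and pointer paths are assumptions.
kubectl -n argocd patch application aws-eks-resources --type merge -p '
{
  "spec": {
    "ignoreDifferences": [
      {
        "group": "admissionregistration.k8s.io",
        "kind": "MutatingWebhookConfiguration",
        "jsonPointers": ["/webhooks/0/clientConfig/caBundle"]
      },
      {
        "group": "admissionregistration.k8s.io",
        "kind": "ValidatingWebhookConfiguration",
        "jsonPointers": ["/webhooks/0/clientConfig/caBundle"]
      }
    ]
  }
}'
```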
✅ SOLVED ISSUES
[✅ SOLVED] oauth2-proxy
- Impact: Unknown
- Problem - CreateContainerConfigError: secret "oauth2-proxy" not found
- Solution - Secrets need to be manually updated in AWS Secrets Manager for oauth2-proxy (see the sketch below):
  - client-id and client-secret (GCP link)
  - cookie-secret (generated by Terraform - see the `kubeflow_oidc_cookie_secret` output variable)
- Problem - cannot contact the Redis cluster
- Solution - Redis cluster needs to be in the same VPC Security Group as the EKS cluster (see https://github.com/honestbank/argoflow-aws-infrastructure/issues/2)
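For reference, updating the secret from the CLI looks roughly like this. A sketch only: the secret id and the JSON key names are assumptions based on the fields listed above, so match them to your setup.

```sh
# Pull the cookie secret Terraform generated (output name from above).
COOKIE_SECRET=$(terraform output -raw kubeflow_oidc_cookie_secret)

# Push all three values into the Secrets Manager secret the pod reads.
# Secret id and key names are assumptions.
aws secretsmanager put-secret-value \
  --secret-id oauth2-proxy \
  --secret-string "{\"client-id\":\"<your-client-id>\",\"client-secret\":\"<your-client-secret>\",\"cookie-secret\":\"${COOKIE_SECRET}\"}"
```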
[✅ SOLVED] pipelines
- Impact: High
- Crash Loop
- Logs:
  F0914 02:03:01.977497 7 main.go:240] Fatal error config file: While parsing config: invalid character 't' after object key:value pair
- Solution - values in `setup.conf` must NOT be quoted (see the illustration below)
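Why quoting breaks things: presumably the setup script substitutes `setup.conf` values verbatim into JSON config templates, so a quoted value ends up double-quoted and triggers the parse error above. A hypothetical illustration; the key name here is made up and the real keys live in argoflow-aws's `setup.conf`:

```sh
# Hypothetical key, for illustration only.
# Wrong - after substitution into a JSON template this yields
#   "host": ""mysql.example.com""
# which fails with "invalid character ... after object key:value pair":
database_host="mysql.example.com"

# Right - the template already supplies the surrounding quotes:
database_host=mysql.example.com
```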
[✅ SOLVED] aws-load-balancer-controller
- Impact: High
- Blocks accessing UI/dashboard
- Load Balancer isn't being created, logs:
  2021/09/14 09:46:15 http: TLS handshake error from 172.31.39.152:54030: remote error: tls: bad certificate
  {"level":"error","ts":1631613104.4709718,"logger":"controller","msg":"Reconciler error","controller":"service","name":"istio-ingressgateway","namespace":"istio-system","error":"Internal error occurred: failed calling webhook \"mtargetgroupbinding.elbv2.k8s.aws\": Post \"https://aws-load-balancer-webhook-service.kube-system.svc:443/mutate-elbv2-k8s-aws-v1beta1-targetgroupbinding?timeout=10s\": x509: certificate signed by unknown authority (possibly because of \"crypto/rsa: verification error\" while trying to verify candidate authority certificate \"aws-load-balancer-controller-ca\")"}
- Solution:
  - The EKS subnets need to be tagged correctly (see https://github.com/honestbank/argoflow-aws-infrastructure/issues/3 and the tag sketch below)
  - Deleted and re-synced the ArgoCD Application. There was most likely some kind of race condition/invalid value from the first installation of argoflow-aws.
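The tags aws-load-balancer-controller uses for subnet discovery are well documented; applying them from the CLI looks roughly like this. Subnet IDs and the cluster name are placeholders, and whether tagging alone resolves it depends on the rest of the setup.

```sh
# Public subnets: used for internet-facing load balancers.
aws ec2 create-tags --resources subnet-aaaa1111 \
  --tags Key=kubernetes.io/role/elb,Value=1

# Private subnets: used for internal load balancers.
aws ec2 create-tags --resources subnet-bbbb2222 \
  --tags Key=kubernetes.io/role/internal-elb,Value=1

# All subnets the cluster uses: associate them with the cluster.
aws ec2 create-tags --resources subnet-aaaa1111 subnet-bbbb2222 \
  --tags "Key=kubernetes.io/cluster/<cluster-name>,Value=shared"
```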
[✅ SOLVED] Central Dashboard
- Impact: High
- Can't access any applications - dashboard 404s for all sub-apps
- Related:
- Logs:
  > kubeflow-centraldashboard@0.0.2 start /app
  > npm run serve
  > kubeflow-centraldashboard@0.0.2 serve /app
  > node dist/server.js
  Initializing Kubernetes configuration
  Unable to fetch Application information: 404 page not found
  "aws" is not a supported platform for Metrics
  Using Profiles service at http://profiles-kfam.kubeflow:8081/kfam
  Server listening on port http://localhost:8082 (in production mode)
  Unable to fetch Application information: 404 page not found
  2021-09-14T02:39:12.655692792Z
- Update - it seems we shouldn't port-forward into the dashboard; however, `aws-load-balancer-controller` had an issue (see above)
- Solution: the dashboard cannot be accessed using `kubectl port-forward` but rather needs to be accessed through the proper URL of `<<__subdomain_dashboard__>>.<<__domain__>>` (see the check below)
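To double-check where that proper URL should resolve, the load balancer hostname can be read straight off the ingress gateway Service (the service and namespace names appear in the aws-load-balancer-controller logs above):

```sh
# Print the ELB hostname behind the Istio ingress gateway; the dashboard's
# DNS record (<<__subdomain_dashboard__>>.<<__domain__>>) should point here.
kubectl -n istio-system get svc istio-ingressgateway \
  -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'
```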
[✅ SOLVED] kube-prometheus-stack
- Impact: Low
- `kube-prometheus-stack-grafana` ConfigMap and Secret are going out of sync (in ArgoCD), which causes checksums in the Deployment to go out of sync as well
- Was an issue on v0.1.6, resolved by deploying `master` (b90cb8a)