9 changes: 8 additions & 1 deletion cmd/argoexec/executor/init.go
@@ -10,7 +10,8 @@

"k8s.io/apimachinery/pkg/types"
"k8s.io/client-go/kubernetes"
"k8s.io/client-go/rest"

Check failure on line 13 in cmd/argoexec/executor/init.go

View workflow job for this annotation

GitHub Actions / Lint

ST1019: package "k8s.io/client-go/rest" is being imported more than once (staticcheck)

Check failure on line 13 in cmd/argoexec/executor/init.go

View workflow job for this annotation

GitHub Actions / Codegen

ST1019: package "k8s.io/client-go/rest" is being imported more than once (staticcheck)
restclient "k8s.io/client-go/rest"

Check failure on line 14 in cmd/argoexec/executor/init.go

View workflow job for this annotation

GitHub Actions / Lint

ST1019(related information): other import of "k8s.io/client-go/rest" (staticcheck)

Check failure on line 14 in cmd/argoexec/executor/init.go

View workflow job for this annotation

GitHub Actions / Codegen

ST1019(related information): other import of "k8s.io/client-go/rest" (staticcheck)
"k8s.io/client-go/tools/clientcmd"

"github.com/argoproj/argo-workflows/v3"
@@ -79,7 +80,13 @@
	wfExecutor := executor.NewExecutor(
		ctx,
		clientset,
		versioned.NewForConfigOrDie(config).ArgoprojV1alpha1().WorkflowTaskResults(namespace),
		versioned.NewForConfigOrDie(&rest.Config{
			Host: "https://argo-wtr-apiserver.argo.svc.cluster.local:6443",
			BearerToken: "mytoken",
			TLSClientConfig: rest.TLSClientConfig{
				Insecure: true,
			},
		}).ArgoprojV1alpha1().WorkflowTaskResults(namespace),
		restClient,
		podName,
		types.UID(os.Getenv(common.EnvVarPodUID)),
11 changes: 11 additions & 0 deletions config/config.go
@@ -123,6 +123,9 @@ type Config struct {

	// ArtifactDrivers lists artifact driver plugins we can use
	ArtifactDrivers []ArtifactDriver `json:"artifactDrivers,omitempty"`

	// OffloadTaskResults holds the config for offloading task results to a separate store
	OffloadTaskResults *OffloadTaskResultsConfig `json:"offloadTaskResults,omitempty"`
}

// ArtifactDriver is a plugin for an artifact driver
@@ -452,3 +455,11 @@ func (req *WorkflowRestrictions) MustNotChangeSpec() bool {
	}
	return req.TemplateReferencing == TemplateReferencingSecure
}

type OffloadTaskResultsConfig struct {
	// Enabled controls offloading. Default false.
	Enabled bool `json:"enabled,omitempty"`

	// APIServer is the Kube API endpoint to write WorkflowTaskResults to.
	APIServer string `json:"APIServer,omitempty"`
}
2 changes: 2 additions & 0 deletions docs/running-at-massive-scale.md
@@ -27,6 +27,8 @@ Where Argo has a lot of work to do, the Kubernetes API can be overwhelmed. There
* Limit the number of concurrent workflows using parallelism.
* Rate-limit pod creation [configuration](workflow-controller-configmap.yaml) (>= v3.1).
* Set [`DEFAULT_REQUEUE_TIME=1m`](environment-variables.md)
* Offload Workflow Task Results to an external Kubernetes API Server, configured via `OffloadTaskResultsConfig` in the Workflow Controller ConfigMap (>= 4.0 [TBD]).
  Read more in [Vertically Scaling](./scaling.md#offloading-workflow-task-results-to-a-secondary-kubernetes-api-server).

## Overwhelmed Database

230 changes: 230 additions & 0 deletions docs/scaling.md
@@ -100,6 +100,236 @@ It is not possible to provide a one-size-fits-all recommendation for these value
!!! Note
    Despite the name, this rate limit only applies to the creation of Pods and not the creation of other Kubernetes resources (for example, ConfigMaps or PersistentVolumeClaims).

### Offloading Workflow Task Results to a Secondary Kubernetes API Server

Workflow Task Results are how Argo Workflows tracks the outputs of pods and passes them between tasks (in a DAG) and steps.
They are defined as a Custom Resource Definition (CRD), `WorkflowTaskResults`, that ships with the Argo Workflows installation; the Argo executor creates and updates them, and the Workflow Controller reads them.
With many workflows running, the creation and deletion of large numbers of `WorkflowTaskResults` can overwhelm the Kubernetes API.
To solve this, the Workflow Controller ConfigMap can specify an `OffloadTaskResultsConfig`, which offloads `WorkflowTaskResults` to a secondary Kubernetes API server.
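
As a reference for what that looks like, here is a minimal sketch of the ConfigMap entry, using the `enabled` and `APIServer` fields of `OffloadTaskResultsConfig` (the server URL is a placeholder; the POC below uses an in-cluster address like this one):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
data:
  offloadTaskResults: |
    enabled: true
    APIServer: https://argo-wtr-apiserver.argo.svc.cluster.local:6443
```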

#### POC Setup (Not for upstream docs)

The goal is to have a fully functional Kubernetes API endpoint that stores Argo's `WorkflowTaskResults` in its own data store.
Conceptually, we will be running a lightweight sub-cluster within the main cluster, in a similar way to tools like `vcluster`.
For this, we will run a Kubernetes API Server (Service and Deployment) and point it to an `etcd` Service/Deployment for its backend storage.
This is loosely how the Kubernetes Control Plane itself runs -- for more information, take a look at the [Kubernetes Components](https://kubernetes.io/docs/concepts/overview/components/) documentation.

##### Running a Kubernetes API Server

###### Run an `etcd` instance

We deploy a single-node `etcd`. The API server uses it exactly like the real Kubernetes control plane would.

| Flag | Why we need it |
| ------------------------------------------------------ | ------------------------------------------------------- |
| `--data-dir=/var/lib/etcd` | Local storage. We use an `emptyDir` volume for this ephemeral POC. |
| `--advertise-client-urls` / `--listen-client-urls` | Expose the client API on port 2379. |
| `--listen-peer-urls` / `--initial-advertise-peer-urls` | Required even for a single-member “cluster”. |
| `--initial-cluster` | Defines the cluster membership. Required syntactically. |

The Service simply exposes port 2379 inside the namespace so the API server can reach it at `http://argo-wtr-etcd.argo.svc:2379`.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: argo-wtr-etcd
  namespace: argo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: argo-wtr-etcd
  template:
    metadata:
      labels:
        app: argo-wtr-etcd
    spec:
      containers:
        - name: etcd
          image: quay.io/coreos/etcd:v3.6.6
          command:
            - etcd
            - --name=argo-wtr-etcd
            - --data-dir=/var/lib/etcd
            - --advertise-client-urls=http://0.0.0.0:2379
            - --listen-client-urls=http://0.0.0.0:2379
            - --listen-peer-urls=http://0.0.0.0:2380
            - --initial-advertise-peer-urls=http://0.0.0.0:2380
            - --initial-cluster=argo-wtr-etcd=http://0.0.0.0:2380
          ports:
            - containerPort: 2379
            - containerPort: 2380
          volumeMounts:
            - name: data
              mountPath: /var/lib/etcd
      volumes:
        - name: data
          emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: argo-wtr-etcd
  namespace: argo
spec:
  selector:
    app: argo-wtr-etcd
  ports:
    - port: 2379
      targetPort: 2379
      name: client
```

###### Set Up Certs & API Server Security

A full kube-apiserver normally requires multiple certificates, CA bundles, front-proxy certs, and authentication plugins.

For this POC we run with the absolute minimum we can get away with:

| File | Purpose |
| -------------------- | ------------------------------------------------------------------------------------------ |
| `tls.crt`, `tls.key` | Server certificate & private key for HTTPS endpoint (`--secure-port=6443`). |
| `serviceaccount.key` | Used as both the *private* (signing) key and the *public* (verification) key for service account tokens. |
| `tokens.csv` | Static token authentication. Used so kubectl can authenticate without bootstrap machinery. |

We create `tls.crt` and `tls.key` using:

```console
openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout tls.key -out tls.crt -subj "/CN=argo-wtr-apiserver"
```

We create `serviceaccount.key` using:

```console
openssl genrsa -out serviceaccount.key 2048
```

`tokens.csv` contains static token authentication entries, where the file format is `<token>,<user>,<uid>,<group1>,<group2>,...` (for example, `mytoken,admin,1,"system:masters"` as used below).

Copy these values to the ConfigMap:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: certs-and-keys
  namespace: argo
data:
  serviceaccount.key: |
    -----BEGIN PRIVATE KEY-----
    <snipped>
    -----END PRIVATE KEY-----
  tls.crt: |
    -----BEGIN CERTIFICATE-----
    <snipped>
    -----END CERTIFICATE-----
  tls.key: |
    -----BEGIN PRIVATE KEY-----
    <snipped>
    -----END PRIVATE KEY-----
  tokens.csv: |
    mytoken,admin,1,"system:masters"
```

###### Run the kube-apiserver

| Flag | Why |
| --------------------------------------------------- | --------------------------------------------------- |
| `--etcd-servers=http://argo-wtr-etcd.argo.svc:2379` | Backend database. |
| `--secure-port=6443` | Only expose HTTPS; insecure port removed in >=1.31. |
| `--tls-cert-file`, `--tls-private-key-file` | Required since insecure-port is gone. |
| `--token-auth-file=/var/run/kubernetes/tokens.csv` | Simplest auth flow for kubectl. |
| `--service-account-key-file` | Needed even if we don’t actually use SA tokens. |
| `--service-account-signing-key-file` | Required in 1.20+ to serve the SA issuer. |
| `--service-account-issuer` | Must match what your workloads use when validating. |
| `--authorization-mode=AlwaysAllow` | Disables authorization checks entirely, so no RBAC setup is needed. |
| `--enable-admission-plugins=NamespaceLifecycle` | Admission plugin required for namespace-scoped CRDs; it is enabled by default upstream. |
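
The manifest for the API server itself is not included above, so the following is a minimal sketch of what the Deployment and Service could look like, built from the flags in the table. The image tag and the `--service-account-issuer` URL are assumptions; the `certs-and-keys` ConfigMap is mounted at `/var/run/kubernetes` (the path `--token-auth-file` uses above), and the Service name `argo-wtr-apiserver` on port 6443 matches the port-forward command and the executor's `Host` used elsewhere in this POC.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: argo-wtr-apiserver
  namespace: argo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: argo-wtr-apiserver
  template:
    metadata:
      labels:
        app: argo-wtr-apiserver
    spec:
      containers:
        - name: kube-apiserver
          # Assumed tag; any reasonably recent kube-apiserver release should work for the POC.
          image: registry.k8s.io/kube-apiserver:v1.31.0
          command:
            - kube-apiserver
            - --etcd-servers=http://argo-wtr-etcd.argo.svc:2379
            - --secure-port=6443
            - --tls-cert-file=/var/run/kubernetes/tls.crt
            - --tls-private-key-file=/var/run/kubernetes/tls.key
            - --token-auth-file=/var/run/kubernetes/tokens.csv
            - --service-account-key-file=/var/run/kubernetes/serviceaccount.key
            - --service-account-signing-key-file=/var/run/kubernetes/serviceaccount.key
            # Assumed issuer value; it only needs to be consistent if workloads validate SA tokens.
            - --service-account-issuer=https://argo-wtr-apiserver.argo.svc.cluster.local
            - --authorization-mode=AlwaysAllow
            - --enable-admission-plugins=NamespaceLifecycle
          ports:
            - containerPort: 6443
          volumeMounts:
            - name: certs-and-keys
              mountPath: /var/run/kubernetes
      volumes:
        - name: certs-and-keys
          configMap:
            name: certs-and-keys
---
apiVersion: v1
kind: Service
metadata:
  name: argo-wtr-apiserver
  namespace: argo
spec:
  selector:
    app: argo-wtr-apiserver
  ports:
    - port: 6443
      targetPort: 6443
      name: https
```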

###### Apply the `WorkflowTaskResults` CRD

<!-- Doesn't seem to be needed? `kit` tasks forwards everything for you? -->
Once the API server is running, we can port-forward to it and apply the CRD directly.

Port forward in a separate terminal:

```console
kubectl -n argo port-forward service/argo-wtr-apiserver 6443:6443
```

And then run the `apply`:

```console
kubectl \
  --server=https://localhost:6443 \
  --token=mytoken \
  --insecure-skip-tls-verify=true \
  apply -f manifests/base/crds/minimal/argoproj.io_workflowtaskresults.yaml
```

Also create the `argo` namespace in the API server:

```console
kubectl \
  --server=https://localhost:6443 \
  --token=mytoken \
  --insecure-skip-tls-verify=true \
  create ns argo
```


###### Optional convenience: use a Config for `kubectl` and `k9s`

To save writing out the arguments to `kubectl` and `k9s` every time, you can save this `Config` as `api-server-kubeconfig.yaml`:

```yaml
apiVersion: v1
kind: Config
clusters:
  - cluster:
      server: https://localhost:6443
      insecure-skip-tls-verify: true
    name: argo-wtr-cluster
users:
  - name: argo-wtr-user
    user:
      token: mytoken
contexts:
  - context:
      cluster: argo-wtr-cluster
      user: argo-wtr-user
    name: argo-wtr-context
current-context: argo-wtr-context
```

And run commands like:

```console
KUBECONFIG=api-server-kubeconfig.yaml kubectl get ns
KUBECONFIG=api-server-kubeconfig.yaml ./k9s
```

(Download `k9s` to the container if using Dev Containers.)

##### Set Up the Controller Config

The final step is to tell the Workflow Controller about the `offloadTaskResults` config.
Because the controller runs locally via `make start` and the API server is port-forwarded to `https://localhost:6443`, we can use this `ConfigMap` as `manifests/base/workflow-controller/workflow-controller-configmap.yaml`:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
data:
  offloadTaskResults: |
    enabled: true
    APIServer: https://localhost:6443
```

And finally (with the API server still port-forwarded), run `make start` to run the Workflow Controller with `WorkflowTaskResult` offloading!
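
Putting the last steps together, a sketch of the local dev loop; after submitting a workflow, the offloaded `WorkflowTaskResults` should show up on the secondary API server rather than the main cluster:

```console
# Terminal 1: keep the secondary API server reachable at localhost:6443
kubectl -n argo port-forward service/argo-wtr-apiserver 6443:6443

# Terminal 2: run the Workflow Controller with the ConfigMap above
make start

# After submitting a workflow, inspect the offloaded task results
KUBECONFIG=api-server-kubeconfig.yaml kubectl -n argo get workflowtaskresults
```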

## Sharding

### One Install Per Namespace