
Conversation

@JoelSpeed (Contributor) commented Jul 21, 2021

- What I did

Created a new test suite using a controller-runtime tool, envtest. Note that the new dependency is only there so we can use the envtest tool for these tests; no other packages are being imported.

With envtest, the new test suite in test/e2e-bootstrap/bootstrap_test.go sets up the controllers from the MCC against a running kubeapi/etcd (provided by envtest).
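For anyone new to envtest, here is a minimal sketch of the pattern it enables (not the exact code in bootstrap_test.go; the helper name and CRD path are assumptions): start a local etcd and kube-apiserver, get a rest.Config back, and build a client against it.

```go
package e2ebootstrap

import (
	"testing"

	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/envtest"
)

// startTestEnv boots a local etcd and kube-apiserver via envtest and returns a
// client pointed at them, plus a teardown function. The CRD path is illustrative.
func startTestEnv(t *testing.T) (client.Client, func()) {
	testEnv := &envtest.Environment{
		// Install the CRDs the MCC controllers watch (path is an assumption).
		CRDDirectoryPaths: []string{"../../install"},
	}

	cfg, err := testEnv.Start() // starts the local control plane
	if err != nil {
		t.Fatalf("failed to start envtest control plane: %v", err)
	}

	k8sClient, err := client.New(cfg, client.Options{})
	if err != nil {
		t.Fatalf("failed to build client: %v", err)
	}

	return k8sClient, func() {
		if err := testEnv.Stop(); err != nil {
			t.Errorf("failed to stop envtest control plane: %v", err)
		}
	}
}
```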

It then performs the following steps:

  • Verifies that the environment is clean (no resources left over from previous tests)
  • Creates basic MCPs et al based on data from the pkg/controller/bootstrap/testdata/bootstrap folder
  • Waits for the controllers to render the config, captures the name for both worker and master pools
  • Copies the bootstrap data + additional test-specific manifests into a temp dir
  • Runs the bootstrap logic against this temp dir
  • Captures the rendered output of the bootstrap tool and compares the names
  • If the names of the rendered configs don't match, it prints out both configs for analysis
  • Cleanup: Stops the controllers and then deletes all resources created during the test, ready for the next test

The above is captured in the first couple of commits; the remaining commits:

  • Modify one of the test helpers to allow it to wait for multiple machine config names in the MCP
  • Add tests for KubeletConfigs during bootstrap (note: these currently fail)
  • Add a couple of scripts that do the relevant pre-test setup (downloading kubeapi/etcd binaries to a temp directory)

- Why I did it

The aim of this test suite is to ensure that the day 1 and day 2 configurations that the MCC generates are consistent, no matter the inputs. This isn't the case today, so the full suite won't pass, but we can aspire to fixing that. It should also be relatively easy to add new test cases in the future if we identify other ways day 1/day 2 can differ.

- How to verify it

You should be able to run make bootstrap-e2e on any machine and observe that the first test currently passes, while the remaining three fail. The second should be fixed by #2668 (I have tested this already), and the remaining two should be fixed by #2547.

If you already have the "kubebuilder" binaries installed, you can run make bootstrap-e2e-local to prevent the test from fetching new binaries.

If this is an acceptable test suite (I appreciate the envtest concept is new to this team), then I will set up a CI check for this test explicitly, so that you have a clear signal if there's drift or regressions in this bootstrapping behaviour in the future.

Am also happy to jump on a call to do a demo/talk through what this does and how it works.

Also happy to go on an OWNERS file for this if that would make the team feel more comfortable.

@kikisdeliveryservice (Contributor) commented

My main question would be, given:

The aim of this test suite is to ensure that the day 1 and day 2 configurations that the MCC generates are consistent, no matter the inputs. This isn't the case today, so the full suite won't pass, but we can aspire to fixing that. It should also be relatively easy to add new test cases in the future if we identify other ways day 1/day 2 can differ.

Are we already saying this is going to fail? (Maybe I'm misreading - please correct me)

Generally, I'm really concerned about the number of e2e tests being added specifically to the MCO repo that don't really pass or give a good signal or actionable information to the MCO team. Do you plan on monitoring the e2e? Would this be better suited to a periodic?

We currently have 19 things running on every PR in the repo and this generally seems like it's too much at this point (generally not specific to this test).

@JoelSpeed (Contributor, Author) commented

I completely understand your concern!

As to whether this fails today: yes. But I've written this test with 2668 and 2547 in mind. The idea, based on my conversation with Ryan and Jerry, is that it should pass as soon as both of those PRs are merged.

If these ever fail in the future, either the test needs updating for some legitimate reason, or, MCO broke its day 1 to day 2 compatibility.

There are several options for this:

  • Merge it only once it passes (ie 2547+2668 are merged)
  • Decide we don't need it and close without merging (I'm ok with this, just trying to make the team as comfortable as possible by offering the new tests that would catch future regressions)
  • Comment out the failing tests and uncomment them as the PRs that fix them are merged
  • Merge these tests into an existing status check? (On my machine the entire thing takes about 1m30s to run; could it be added to the unit tests?)

I'll set up a branch tomorrow that includes this, 2547 and 2668, and verify that the tests do indeed all pass :)

Contributor commented:

It's a bit unclear to me: are these individual test cases (applying manifests) running during bootstrap, and are we later comparing the rendered config from the bootstrap version with the rendered config available in a real cluster? How are we ensuring that each manifest gets applied with a fresh bootstrap process?

@JoelSpeed (Contributor, Author) replied:

This might be better explained via a call; does your team have office hours I could attend to demo this test and explain properly how it works?

The envtest tool I've set up in this PR runs an etcd and kubeapiserver locally on your machine (or in CI), so there isn't actually a real cluster per se.

https://github.com/openshift/machine-config-operator/blob/f4aad781788b1a864e7bfe6cbd0572bffa590562/test/e2e-bootstrap/bootstrap_test.go#L82-L86

This test suite runs the controllers from the MCC against this local kubeapiserver. It then creates all the resources that would normally be created during bootstrap, which the MCC controllers react to, and render the config as they would do in a real cluster.

https://github.com/openshift/machine-config-operator/blob/f4aad781788b1a864e7bfe6cbd0572bffa590562/test/e2e-bootstrap/bootstrap_test.go#L262-L280

Alongside this, we take the same resources and put them into a temp folder, which we then run the bootstrap against.

https://github.com/openshift/machine-config-operator/blob/f4aad781788b1a864e7bfe6cbd0572bffa590562/test/e2e-bootstrap/bootstrap_test.go#L198-L221

Because this isn't a real cluster, and the only controllers running are the MCC controllers, we then shut down the controllers between each test and restart them. And we clean the environment by deleting all of the resources that we created.

https://github.com/openshift/machine-config-operator/blob/f4aad781788b1a864e7bfe6cbd0572bffa590562/test/e2e-bootstrap/bootstrap_test.go#L186-L187

One of the really cool things about this style of testing is you can see how your controller reacts to real API resources, without fakes, without a real cluster :)
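To make the per-test reset described above concrete, here is a hedged sketch of the idea (the helper name and the exact set of deleted types are assumptions, not the code in the PR): stop the controllers started for the case, then delete what the case created so the next case starts clean.

```go
package e2ebootstrap

import (
	"context"
	"testing"

	mcfgv1 "github.com/openshift/machine-config-operator/pkg/apis/machineconfiguration.openshift.io/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// cleanEnvironment resets the shared envtest control plane between test cases:
// the controllers started for the previous case are stopped, then the objects
// the case created are removed so the next case sees a clean environment.
func cleanEnvironment(ctx context.Context, t *testing.T, k8sClient client.Client, stopControllers context.CancelFunc) {
	// Shut down the MCC controllers started for this test case.
	stopControllers()

	// Delete the resources created during the test; only two types are shown here.
	if err := k8sClient.DeleteAllOf(ctx, &mcfgv1.MachineConfigPool{}); err != nil {
		t.Fatalf("failed to delete MachineConfigPools: %v", err)
	}
	if err := k8sClient.DeleteAllOf(ctx, &mcfgv1.MachineConfig{}); err != nil {
		t.Fatalf("failed to delete MachineConfigs: %v", err)
	}
}
```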

Contributor replied:

Thanks for explaining. We don't have office hours scheduled so far; we can plan a meeting, but the earliest that can happen is next week.

@sinnykumari (Contributor) commented

If these ever fail in the future, either the test needs updating for some legitimate reason, or, MCO broke its day 1 to day 2 compatibility.

There are several options for this:

* Merge it only once it passes (ie 2547+2668 are merged)

Not sure; I think merging the actual code-flow PRs with a passing test is better, as it would catch regressions early.

* Decide we don't need it and close without merging (I'm ok with this, just trying to make the team as comfortable as possible by offering the new tests that would catch future regressions)

If I understand correctly, this test adds e2e coverage during the bootstrap process for issues that we have seen on several occasions, i.e. rendered-config mismatches between bootstrap and the actual cluster. If my understanding is correct, I find this test useful in general, and more coverage (from encountered bugs) can be added with additional test cases.

* Comment out the failing tests and uncomment them as the PRs that fix them are merged

I'll set up a branch tomorrow that includes this, 2547 and 2668, and verify that the tests do indeed all pass :)

+1, that will provide some real data.

I am wondering: if we add this new test, do we really need the e2e-aws-techpreview-featuregate e2e test running in the MCO repo? e2e-aws-techpreview-featuregate could be added as a periodic, and for day 2 coverage we can add a sub-test in the MCO repo in e2e-gcp-op (hopefully we will split that test in the future between two {node,mco} tests)

@JoelSpeed (Contributor, Author) commented

If I understand correctly, this test adds e2e coverage during the bootstrap process for issues that we have seen on several occasions, i.e. rendered-config mismatches between bootstrap and the actual cluster. If my understanding is correct, I find this test useful in general, and more coverage (from encountered bugs) can be added with additional test cases.

I think we are on the same page here based on this comment :)

+1, that will provide some real data.

Here we go: https://github.com/JoelSpeed/machine-config-operator/tree/bootstrap-e2e%2B2547%2B2668. This branch is 2668, then 2547, then this PR, plus 2 extra commits. For the first extra commit: I noticed that in other controllers we default the feature gate if it's nil, but that's not the case in bootstrap, so to avoid a panic I allowed the feature map generation to handle a nil. This could come as part of 2547 IMO.
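As a rough illustration of that nil handling (a hedged sketch with a made-up function name; the real change is in the linked branch), the idea is simply to fall back to the default feature set before building the feature map:

```go
package e2ebootstrap

import (
	configv1 "github.com/openshift/api/config/v1"
)

// defaultFeatureGate falls back to the default feature set when bootstrap does
// not provide a FeatureGate, so feature-map generation never dereferences nil.
func defaultFeatureGate(fg *configv1.FeatureGate) *configv1.FeatureGate {
	if fg != nil {
		return fg
	}
	return &configv1.FeatureGate{
		Spec: configv1.FeatureGateSpec{
			FeatureGateSelection: configv1.FeatureGateSelection{
				FeatureSet: configv1.Default,
			},
		},
	}
}
```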

The second extra commit integrates the refactoring done in 2668 into 2547 and allows it to pass the feature gate through, fixing the compatibility with the tests from the feature gate perspective.

I've uploaded the output of a test run to a gist. From this, we can see that the first two tests pass (i.e. no extra manifests, and a feature gate only), but the 3rd and 4th tests fail. In the 3rd test, we expect the kubelet config to attach only to the masters, but it is attaching to both the master and the worker during bootstrap, hence the worker configs differ; this is a known issue that @yuqi-zhang pointed out with 2547. The same thing happens in the 4th test, except we expect it only in the worker config.

Once the label matching element of 2547 is fixed, I expect these tests should pass without issue.

@JoelSpeed (Contributor, Author) commented

I am wondering: if we add this new test, do we really need the e2e-aws-techpreview-featuregate e2e test running in the MCO repo? e2e-aws-techpreview-featuregate could be added as a periodic, and for day 2 coverage we can add a sub-test in the MCO repo in e2e-gcp-op (hopefully we will split that test in the future between two {node,mco} tests)

That is completely up to your team. I added that one first as it was an easy add to prove 2668 fixed cluster bootstrap. I think this test is better in the long term (you can run it locally, it gives you full diffs, and it doesn't need a cluster to run against), but I appreciate this is new and takes more time to review and understand, so I left it as the second part of the testing :)

@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 25, 2021
@openshift-ci openshift-ci bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 26, 2021
@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 16, 2021
@openshift-ci openshift-ci bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 7, 2021
@JoelSpeed (Contributor, Author) commented

@sinnykumari @kikisdeliveryservice Now that 4.10 is open for business, I've just rebased this, hoping we can merge soon if you both think this is a valid addition to the test suite. I can't remember where we left it after the call at the end of July.

@JoelSpeed (Contributor, Author) commented

@sinnykumari What would you like to do with this PR? Is this useful still?

@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 18, 2021
@sinnykumari (Contributor) commented

@sinnykumari What would you like to do with this PR? Is this useful still?

Sorry, it got missed. Yup, it will be nice to have this test in the MCO.
Can you rebase the PR, and then I will take a final look before merging. Thanks!

@openshift-ci openshift-ci bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 18, 2021
@JoelSpeed (Contributor, Author) commented

Thanks @sinnykumari, rebased

Contributor commented:

Looks like this is an unused variable in the test.

@sinnykumari (Contributor) left a comment:

Added a few comments, rest LGTM.

@JoelSpeed (Contributor, Author) left a comment:

Thanks for the review @sinnykumari, I think I've addressed the comments. I've added a filter to pick out the not-found machine config names in the helper, in an additional commit on top; happy to squash that into another commit if you think that's appropriate.

@sinnykumari (Contributor) left a comment:

Thanks Joel for getting this test added and helping the team understand envtest.
LGTM.
@kikisdeliveryservice @yuqi-zhang any final thoughts before we merge this?

@yuqi-zhang (Contributor) commented

Also the CI job appears to be failing with go install github.com/jstemmer/go-junit-report@latest: cannot query module due to -mod=vendor

openshift-merge-robot pushed a commit to openshift/release that referenced this pull request Oct 21, 2021
While we work on the new e2e tests in:
openshift/machine-config-operator#2687

set always_run to false to avoid having extra noise in the repo.
@JoelSpeed (Contributor, Author) commented

/test bootstrap-e2e

@JoelSpeed (Contributor, Author) commented

/test bootstrap-e2e

@JoelSpeed (Contributor, Author) commented

@yuqi-zhang I found this little gem

# Use the containers_image_openpgp flag to avoid the default CGO implementation of signatures dragged in by
# containers/image/signature, which we use only to edit the /etc/containers/policy.json file without doing any cryptography
CGO_ENABLED=0
if [[ $WHAT == "machine-config-controller" ]]; then
GOTAGS="containers_image_openpgp exclude_graphdriver_devicemapper exclude_graphdriver_btrfs containers_image_ostree_stub"
fi
which explains what the issue is here. It seems this has already been handled elsewhere during the build process, so I have just included the CGO_ENABLED=0 and GOTAGS in the make task for bootstrap-e2e, which seems to have resolved the issue with extra C dependencies.

When the HOME directory is unset or is set to /, kubebuilder tries to 
create a cache directory but fails. Make sure in these cases that we set 
it to a temp folder so that the test can create its cache
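Whether the actual fix from that commit lives in the Go test setup or in the scripts, the idea can be sketched roughly like this (the helper name is made up, a hedged illustration only):

```go
package e2ebootstrap

import (
	"os"
	"testing"
)

// ensureWritableHome points HOME at a temp directory when it is unset or "/",
// so the kubebuilder/envtest tooling can create its cache directory.
func ensureWritableHome(t *testing.T) {
	if home := os.Getenv("HOME"); home != "" && home != "/" {
		return
	}
	tmpHome, err := os.MkdirTemp("", "bootstrap-e2e-home")
	if err != nil {
		t.Fatalf("failed to create temp HOME: %v", err)
	}
	if err := os.Setenv("HOME", tmpHome); err != nil {
		t.Fatalf("failed to set HOME: %v", err)
	}
}
```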
@JoelSpeed (Contributor, Author) commented

/test bootstrap-e2e

@JoelSpeed (Contributor, Author) commented

Had to do a couple of fixups to the scripts (stuff we had already encountered but I hadn't copied over), but the tests are now working, https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2687/pull-ci-openshift-machine-config-operator-master-bootstrap-e2e/1452580424227229696

openshift-ci bot commented Oct 25, 2021

@JoelSpeed: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-techpreview-featuregate 17cb6296fb4f47018566ec656f715d7b8f672c76 link false /test e2e-aws-techpreview-featuregate
ci/prow/okd-e2e-aws 66ec3c9 link false /test okd-e2e-aws
ci/prow/e2e-aws-workers-rhel8 66ec3c9 link false /test e2e-aws-workers-rhel8
ci/prow/e2e-aws-workers-rhel7 66ec3c9 link false /test e2e-aws-workers-rhel7
ci/prow/e2e-aws-upgrade-single-node 66ec3c9 link false /test e2e-aws-upgrade-single-node
ci/prow/e2e-vsphere-upgrade 66ec3c9 link false /test e2e-vsphere-upgrade
ci/prow/e2e-aws-disruptive 66ec3c9 link false /test e2e-aws-disruptive


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@yuqi-zhang (Contributor) commented

@yuqi-zhang I found this little gem

# Use the containers_image_openpgp flag to avoid the default CGO implementation of signatures dragged in by
# containers/image/signature, which we use only to edit the /etc/containers/policy.json file without doing any cryptography
CGO_ENABLED=0
if [[ $WHAT == "machine-config-controller" ]]; then
GOTAGS="containers_image_openpgp exclude_graphdriver_devicemapper exclude_graphdriver_btrfs containers_image_ostree_stub"
fi

which explains what the issue is here. It seems this has already been handled elsewhere during the build process, so I have just included the CGO_ENABLED=0 and GOTAGS in the make task for bootstrap-e2e, which seems to have resolved the issue with extra C dependencies.

Ah ok I completely forgot about that snippet, good catch

Just to make sure, I tested one more time locally. make bootstrap-e2e works fine, but make bootstrap-e2e-local does not, failing with

=== RUN   TestE2EBootstrap
    bootstrap_test.go:83: 
        	Error Trace:	bootstrap_test.go:83
        	Error:      	Received unexpected error:
        	            	unable to start control plane itself: failed to start the controlplane. retried 5 times: fork/exec /usr/local/kubebuilder/bin/etcd: no such file or directory
        	Test:       	TestE2EBootstrap
--- FAIL: TestE2EBootstrap (0.01s)

I presume that is expected because it's lacking the setup_envs function and I shouldn't be running it directly in the first place. So make bootstrap-e2e is the canonical way to run this, and the -local is just an alias for the actual test run (that does not work independently)? If that's the case I don't feel strongly either way, just wanted to make sure.

Overall lgtm, will also let Sinny take a final look

@sinnykumari (Contributor) commented

LGTM, the log from the bootstrap-e2e test looks good. Let's get this in, and we can discuss with Joel offline if help is needed with running it locally.
Thanks Jerry and Joel!
/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 26, 2021
openshift-ci bot commented Oct 26, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JoelSpeed, sinnykumari, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [sinnykumari,yuqi-zhang]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot (Contributor) commented

/retest-required

Please review the full test history for this PR and help us cut down flakes.

3 similar comments
@openshift-bot (Contributor) commented

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot (Contributor) commented

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot (Contributor) commented

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@JoelSpeed (Contributor, Author) commented

I presume that is expected because it's lacking the setup_envs function and I shouldn't be running it directly in the first place. So make bootstrap-e2e is the canonical way to run this, and the -local is just an alias for the actual test run (that does not work independently)? If that's the case I don't feel strongly either way, just wanted to make sure.

I have the binaries required for this (etcd, kube-apiserver) installed locally on my machine, as I use this style of testing in a lot of places. With bootstrap-e2e-local it defaults to the locally installed files and works for me; you could get this set up too just by placing the binaries in the right place. The bootstrap-e2e task extends this by downloading the binaries for you into a temp directory, so yes, it can be used self-contained, but if you use it a lot it might be good to install the binaries locally. Totally up to the end user here.
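For background (my understanding of controller-runtime's envtest defaults, not something this PR changes): envtest locates the control plane binaries via the KUBEBUILDER_ASSETS environment variable and otherwise falls back to /usr/local/kubebuilder/bin, which matches the path in the error above. A minimal sketch of pointing a run at pre-downloaded binaries, with a hypothetical directory:

```go
package e2ebootstrap

import (
	"os"
	"testing"

	"sigs.k8s.io/controller-runtime/pkg/envtest"
)

// startWithLocalAssets points envtest at a pre-downloaded set of etcd and
// kube-apiserver binaries (the directory here is hypothetical) and starts the
// control plane from them instead of the default /usr/local/kubebuilder/bin.
func startWithLocalAssets(t *testing.T) *envtest.Environment {
	if err := os.Setenv("KUBEBUILDER_ASSETS", "/tmp/bootstrap-e2e/kubebuilder/bin"); err != nil {
		t.Fatalf("failed to set KUBEBUILDER_ASSETS: %v", err)
	}

	testEnv := &envtest.Environment{}
	if _, err := testEnv.Start(); err != nil {
		t.Fatalf("failed to start envtest control plane: %v", err)
	}
	return testEnv
}
```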

@openshift-bot (Contributor) commented

/retest-required

Please review the full test history for this PR and help us cut down flakes.

