
Conversation

@JoelSpeed (Contributor) commented Jul 21, 2021

- What I did

Created a new test suite using a controller-runtime tool, envtest. Note that the new dependency is only there so we can use the envtest tool for these tests; no other packages are being imported.

With envtest, the new test suite in test/e2e-bootstrap/bootstrap_test.go sets up the controllers from the MCC against a running kubeapi/etcd (provided by envtest).
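For anyone new to envtest, here is a minimal sketch of the pattern it enables (not the exact code in bootstrap_test.go; the helper name and CRD path are assumptions): start a local etcd and kube-apiserver, get a rest.Config back, and build a client against it.

```go
package e2ebootstrap

import (
	"testing"

	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/envtest"
)

// startTestEnv boots a local etcd and kube-apiserver via envtest and returns a
// client pointed at them, plus a teardown function. The CRD path is illustrative.
func startTestEnv(t *testing.T) (client.Client, func()) {
	testEnv := &envtest.Environment{
		// Install the CRDs the MCC controllers watch (path is an assumption).
		CRDDirectoryPaths: []string{"../../install"},
	}

	cfg, err := testEnv.Start() // starts the local control plane
	if err != nil {
		t.Fatalf("failed to start envtest control plane: %v", err)
	}

	k8sClient, err := client.New(cfg, client.Options{})
	if err != nil {
		t.Fatalf("failed to build client: %v", err)
	}

	return k8sClient, func() {
		if err := testEnv.Stop(); err != nil {
			t.Errorf("failed to stop envtest control plane: %v", err)
		}
	}
}
```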

It then performs the following steps:

  • Verifies that the environment is clean (no resources left over from previous tests)
  • Creates basic MCPs et al based on data from the pkg/controller/bootstrap/testdata/bootstrap folder
  • Waits for the controllers to render the config, captures the name for both worker and master pools
  • Copies the bootstrap data + additional test-specific manifests into a temp dir
  • Runs the bootstrap logic against this temp dir
  • Captures the rendered output of the bootstrap tool and compares the names
  • If the names of the rendered configs don't match, it prints out both configs for analysis
  • Cleanup: Stops the controllers and then deletes all resources created during the test, ready for the next test

The above is captured in the first couple of commits; the remaining commits:

  • Modify one of the test helpers to allow it to wait for multiple machine config names in the MCP
  • Add tests for KubeletConfigs during bootstrap (note: these currently fail)
  • Add a couple of scripts that do the relevant pre-test setup (downloading kubeapi/etcd binaries to a temp directory)

- Why I did it

The aim of this test suite is to ensure that the day 1 and day 2 configurations that the MCC generates are consistent, no matter the inputs. This isn't the case today, so the full suite won't pass, but we can aspire to fixing that. It should also be relatively easy to add new test cases in the future if we identify other ways day 1/day 2 can differ.

- How to verify it

You should be able to run make bootstrap-e2e on any machine and observe that the first test currently passes, while the remaining three fail. The second should be fixed by #2668 (I have tested this already), and the remaining two should be fixed by #2547.

If you already have the "kubebuilder" binaries installed, you can run make bootstrap-e2e-local to prevent the test from fetching new binaries.

If this is an acceptable test suite (I appreciate the envtest concept is new to this team), then I will set up a CI check for this test explicitly, so that you have a clear signal if there's drift or regressions in this bootstrapping behaviour in the future.

Am also happy to jump on a call to do a demo/talk through what this does and how it works.

Also happy to go on an OWNERS file for this if that would make the team feel more comfortable.

@kikisdeliveryservice (Contributor) commented

My main question would be, given:

The aim of this test suite is to ensure that the day 1 and day 2 configurations that the MCC generates are consistent, no matter the inputs. This isn't the case today, so the full suite won't pass, but we can aspire to fixing that. It should also be relatively easy to add new test cases in the future if we identify other ways day 1/day 2 can differ.

Are we already saying this is going to fail? (Maybe I'm misreading - please correct me)

Generally, I'm really concerned about the number of e2e tests being added specifically to the MCO repo that don't really pass or give a good signal or actionable information to the MCO team. Do you plan on monitoring the e2e? Would this be better suited to a periodic?

We currently have 19 things running on every PR in the repo and this generally seems like it's too much at this point (generally not specific to this test).

@JoelSpeed (Contributor, Author) commented

I completely understand your concern!

As to whether this fails today: yes. But I've written this test with 2668 and 2547 in mind. The idea, based on my conversation with Ryan and Jerry, is that it should pass as soon as both of those PRs are merged.

If these ever fail in the future, either the test needs updating for some legitimate reason, or, MCO broke its day 1 to day 2 compatibility.

There are several options for this:

  • Merge it only once it passes (ie 2547+2668 are merged)
  • Decide we don't need it and close without merging (I'm ok with this, just trying to make the team as comfortable as possible by offering the new tests that would catch future regressions)
  • Comment out the failing tests and uncomment them as the PRs that fix them are merged
  • Merge these tests into an existing status check? (On my machine the entire thing takes about 1m30s to run; could it be added to the unit tests?)

I'll set up a branch tomorrow that includes this, 2547 and 2668, and verify that the tests do indeed all pass :)

Contributor commented:

It's a bit unclear to me: are these individual test cases (applying manifests) running during bootstrap, and are we later comparing the rendered config from the bootstrap version with the rendered config available in a real cluster? How are we ensuring that each manifest gets applied with a fresh bootstrap process?

@JoelSpeed (Contributor, Author) replied:

This might be better explained via a call; does your team have office hours I could attend to demo this test and explain properly how it works?

The envtest tool I've set up in this PR runs an etcd and kubeapiserver locally on your machine (or in CI), so there isn't actually a real cluster per se.

https://github.com/openshift/machine-config-operator/blob/f4aad781788b1a864e7bfe6cbd0572bffa590562/test/e2e-bootstrap/bootstrap_test.go#L82-L86

This test suite runs the controllers from the MCC against this local kubeapiserver. It then creates all the resources that would normally be created during bootstrap, which the MCC controllers react to, and render the config as they would do in a real cluster.

https://github.com/openshift/machine-config-operator/blob/f4aad781788b1a864e7bfe6cbd0572bffa590562/test/e2e-bootstrap/bootstrap_test.go#L262-L280

Alongside this, we take the same resources and put them into a temp folder, which we then run the bootstrap against.

https://github.com/openshift/machine-config-operator/blob/f4aad781788b1a864e7bfe6cbd0572bffa590562/test/e2e-bootstrap/bootstrap_test.go#L198-L221

Because this isn't a real cluster, and the only controllers running are the MCC controllers, we then shut down the controllers between each test and restart them. And we clean the environment by deleting all of the resources that we created.

https://github.com/openshift/machine-config-operator/blob/f4aad781788b1a864e7bfe6cbd0572bffa590562/test/e2e-bootstrap/bootstrap_test.go#L186-L187

One of the really cool things about this style of testing is you can see how your controller reacts to real API resources, without fakes, without a real cluster :)
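To make the per-test reset described above concrete, here is a hedged sketch of the idea (the helper name and the exact set of deleted types are assumptions, not the code in the PR): stop the controllers started for the case, then delete what the case created so the next case starts clean.

```go
package e2ebootstrap

import (
	"context"
	"testing"

	mcfgv1 "github.com/openshift/machine-config-operator/pkg/apis/machineconfiguration.openshift.io/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// cleanEnvironment resets the shared envtest control plane between test cases:
// the controllers started for the previous case are stopped, then the objects
// the case created are removed so the next case sees a clean environment.
func cleanEnvironment(ctx context.Context, t *testing.T, k8sClient client.Client, stopControllers context.CancelFunc) {
	// Shut down the MCC controllers started for this test case.
	stopControllers()

	// Delete the resources created during the test; only two types are shown here.
	if err := k8sClient.DeleteAllOf(ctx, &mcfgv1.MachineConfigPool{}); err != nil {
		t.Fatalf("failed to delete MachineConfigPools: %v", err)
	}
	if err := k8sClient.DeleteAllOf(ctx, &mcfgv1.MachineConfig{}); err != nil {
		t.Fatalf("failed to delete MachineConfigs: %v", err)
	}
}
```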

Contributor replied:

Thanks for explaining. We don't have office hours scheduled so far; we can plan a meeting, but the earliest that can happen is next week.

@sinnykumari (Contributor) commented

If these ever fail in the future, either the test needs updating for some legitimate reason, or, MCO broke its day 1 to day 2 compatibility.

There are several options for this:

* Merge it only once it passes (ie 2547+2668 are merged)

Not sure; I think merging the actual code-flow PRs with a passing test is better, as it would catch regressions early.

* Decide we don't need it and close without merging (I'm ok with this, just trying to make the team as comfortable as possible by offering the new tests that would catch future regressions)

If I understand correctly, this test adds e2e coverage during the bootstrap process for issues that we have seen on several occasions, i.e. rendered-config mismatches between bootstrap and the actual cluster. If my understanding is correct, I find this test useful in general, and more coverage (from encountered bugs) can be added with additional test cases.

* Comment out the failing tests and uncomment them as the PRs that fix them are merged

I'll set up a branch tomorrow that includes this, 2547 and 2668, and verify that the tests do indeed all pass :)

+1, that will provide some real data.

I am wondering: if we add this new test, do we really need the e2e-aws-techpreview-featuregate e2e test running in the MCO repo? e2e-aws-techpreview-featuregate could be added as a periodic, and for day 2 coverage we can add a sub-test in the MCO repo in e2e-gcp-op (hopefully we will split that test in the future between two {node,mco} tests)

@JoelSpeed (Contributor, Author) commented

If I understand correctly, this test adds e2e coverage during the bootstrap process for issues that we have seen on several occasions, i.e. rendered-config mismatches between bootstrap and the actual cluster. If my understanding is correct, I find this test useful in general, and more coverage (from encountered bugs) can be added with additional test cases.

I think we are on the same page here based on this comment :)

+1, that will provide some real data.

Here we go: https://github.com/JoelSpeed/machine-config-operator/tree/bootstrap-e2e%2B2547%2B2668. This branch is 2668, then 2547, then this PR, plus 2 extra commits. For the first extra commit: I noticed that in other controllers we default the feature gate if it's nil, but that's not the case in bootstrap, so to avoid a panic I allowed the feature map generation to handle a nil. This could come as part of 2547 IMO.
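As a rough illustration of that nil handling (a hedged sketch with a made-up function name; the real change is in the linked branch), the idea is simply to fall back to the default feature set before building the feature map:

```go
package e2ebootstrap

import (
	configv1 "github.com/openshift/api/config/v1"
)

// defaultFeatureGate falls back to the default feature set when bootstrap does
// not provide a FeatureGate, so feature-map generation never dereferences nil.
func defaultFeatureGate(fg *configv1.FeatureGate) *configv1.FeatureGate {
	if fg != nil {
		return fg
	}
	return &configv1.FeatureGate{
		Spec: configv1.FeatureGateSpec{
			FeatureGateSelection: configv1.FeatureGateSelection{
				FeatureSet: configv1.Default,
			},
		},
	}
}
```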

The second extra commit integrates the refactoring done in 2668 into 2547 and allows it to pass the feature gate through, fixing the compatibility with the tests from the feature gate perspective.

I've uploaded the output of a test run to a gist. From this, we can see that the first two tests pass (i.e. no extra manifests, and a feature gate only), but the 3rd and 4th tests fail. In the 3rd test, we expect the kubelet config to attach only to the masters, but it is attaching to both the master and the worker during bootstrap, hence the worker configs differ; this is a known issue that @yuqi-zhang pointed out with 2547. The same thing happens in the 4th test, except we expect it only in the worker config.

Once the label matching element of 2547 is fixed, I expect these tests should pass without issue.

@JoelSpeed (Contributor, Author) commented

I am wondering: if we add this new test, do we really need the e2e-aws-techpreview-featuregate e2e test running in the MCO repo? e2e-aws-techpreview-featuregate could be added as a periodic, and for day 2 coverage we can add a sub-test in the MCO repo in e2e-gcp-op (hopefully we will split that test in the future between two {node,mco} tests)

That is completely up to your team. I added that one first as it was an easy add to prove 2668 fixed cluster bootstrap. I think this test is better in the long term (you can run it locally, it gives you full diffs, and it doesn't need a cluster to run against), but I appreciate this is new and takes more time to review and understand, so I left it as the second part of the testing :)

@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 25, 2021
@openshift-ci openshift-ci bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 26, 2021
@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Aug 16, 2021
@openshift-ci openshift-ci bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 7, 2021
@JoelSpeed (Contributor, Author) commented

@sinnykumari @kikisdeliveryservice Now that 4.10 is open for business, I've just rebased this, hoping we can merge soon if you both think this is a valid addition to the test suite. I can't remember where we left it after the call at the end of July.

@JoelSpeed (Contributor, Author) commented

@sinnykumari What would you like to do with this PR? Is this useful still?

@openshift-ci openshift-ci bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 18, 2021
@sinnykumari (Contributor) commented

@sinnykumari What would you like to do with this PR? Is this useful still?

Sorry, it got missed. Yup, it will be nice to have this test in the MCO.
Can you rebase the PR, and then I will take a final look before merging. Thanks!

@openshift-ci openshift-ci bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 18, 2021
@JoelSpeed (Contributor, Author) commented

Thanks @sinnykumari, rebased

Contributor commented:

Looks like this is an unused variable in the test.

@sinnykumari (Contributor) left a comment:

Added a few comments, rest LGTM.

@JoelSpeed (Contributor, Author) left a comment:

Thanks for the review @sinnykumari, I think I've addressed the comments. I've added a filter to pick out the not-found machine config names in the helper, in an additional commit on top; happy to squash that into another commit if you think that's appropriate.

@sinnykumari (Contributor) left a comment:

Thanks Joel for getting this test added and helping the team understand envtest.
LGTM.
@kikisdeliveryservice @yuqi-zhang any final thoughts before we merge this?

@yuqi-zhang (Contributor) commented

Also the CI job appears to be failing with go install github.com/jstemmer/go-junit-report@latest: cannot query module due to -mod=vendor

openshift-merge-robot pushed a commit to openshift/release that referenced this pull request Oct 21, 2021
While we work on the new e2e tests in:
openshift/machine-config-operator#2687

set always_run to false to avoid having extra noise in the repo.
@JoelSpeed (Contributor, Author) commented

/test bootstrap-e2e

@JoelSpeed (Contributor, Author) commented

/test bootstrap-e2e

@JoelSpeed (Contributor, Author) commented

@yuqi-zhang I found this little gem

# Use the containers_image_openpgp flag to avoid the default CGO implementation of signatures dragged in by
# containers/image/signature, which we use only to edit the /etc/containers/policy.json file without doing any cryptography
CGO_ENABLED=0
if [[ $WHAT == "machine-config-controller" ]]; then
GOTAGS="containers_image_openpgp exclude_graphdriver_devicemapper exclude_graphdriver_btrfs containers_image_ostree_stub"
fi
which explains what the issue is here. It seems this has already been handled elsewhere during the build process, so I have just included the CGO_ENABLED=0 and GOTAGS in the make task for bootstrap-e2e, which seems to have resolved the issue with extra C dependencies.

When the HOME directory is unset or is set to /, kubebuilder tries to 
create a cache directory but fails. Make sure in these cases that we set 
it to a temp folder so that the test can create its cache
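Whether the actual fix from that commit lives in the Go test setup or in the scripts, the idea can be sketched roughly like this (the helper name is made up, a hedged illustration only):

```go
package e2ebootstrap

import (
	"os"
	"testing"
)

// ensureWritableHome points HOME at a temp directory when it is unset or "/",
// so the kubebuilder/envtest tooling can create its cache directory.
func ensureWritableHome(t *testing.T) {
	if home := os.Getenv("HOME"); home != "" && home != "/" {
		return
	}
	tmpHome, err := os.MkdirTemp("", "bootstrap-e2e-home")
	if err != nil {
		t.Fatalf("failed to create temp HOME: %v", err)
	}
	if err := os.Setenv("HOME", tmpHome); err != nil {
		t.Fatalf("failed to set HOME: %v", err)
	}
}
```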
@JoelSpeed (Contributor, Author) commented

/test bootstrap-e2e

@JoelSpeed (Contributor, Author) commented

Had to do a couple of fixups to the scripts (stuff we had already encountered but I hadn't copied over), but the tests are now working, https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2687/pull-ci-openshift-machine-config-operator-master-bootstrap-e2e/1452580424227229696

openshift-ci bot commented Oct 25, 2021

@JoelSpeed: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-techpreview-featuregate 17cb6296fb4f47018566ec656f715d7b8f672c76 link false /test e2e-aws-techpreview-featuregate
ci/prow/okd-e2e-aws 66ec3c9 link false /test okd-e2e-aws
ci/prow/e2e-aws-workers-rhel8 66ec3c9 link false /test e2e-aws-workers-rhel8
ci/prow/e2e-aws-workers-rhel7 66ec3c9 link false /test e2e-aws-workers-rhel7
ci/prow/e2e-aws-upgrade-single-node 66ec3c9 link false /test e2e-aws-upgrade-single-node
ci/prow/e2e-vsphere-upgrade 66ec3c9 link false /test e2e-vsphere-upgrade
ci/prow/e2e-aws-disruptive 66ec3c9 link false /test e2e-aws-disruptive


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@yuqi-zhang (Contributor) commented

@yuqi-zhang I found this little gem

# Use the containers_image_openpgp flag to avoid the default CGO implementation of signatures dragged in by
# containers/image/signature, which we use only to edit the /etc/containers/policy.json file without doing any cryptography
CGO_ENABLED=0
if [[ $WHAT == "machine-config-controller" ]]; then
GOTAGS="containers_image_openpgp exclude_graphdriver_devicemapper exclude_graphdriver_btrfs containers_image_ostree_stub"
fi

which explains what the issue is here. It seems this has already been handled elsewhere during the build process, so I have just included the CGO_ENABLED=0 and GOTAGS in the make task for bootstrap-e2e, which seems to have resolved the issue with extra C dependencies.

Ah ok I completely forgot about that snippet, good catch

Just to make sure, I tested one more time locally. make bootstrap-e2e works fine, but make bootstrap-e2e-local does not, failing with

=== RUN   TestE2EBootstrap
    bootstrap_test.go:83: 
        	Error Trace:	bootstrap_test.go:83
        	Error:      	Received unexpected error:
        	            	unable to start control plane itself: failed to start the controlplane. retried 5 times: fork/exec /usr/local/kubebuilder/bin/etcd: no such file or directory
        	Test:       	TestE2EBootstrap
--- FAIL: TestE2EBootstrap (0.01s)

I presume that is expected because it's lacking the setup_envs function and I shouldn't be running it directly in the first place. So make bootstrap-e2e is the canonical way to run this, and the -local is just an alias for the actual test run (that does not work independently)? If that's the case I don't feel strongly either way, just wanted to make sure.

Overall lgtm, will also let Sinny take a final look

@sinnykumari (Contributor) commented

LGTM, the log from the bootstrap-e2e test looks good. Let's get this in, and we can discuss with Joel offline if help is needed with running it locally.
Thanks Jerry and Joel!
/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 26, 2021
openshift-ci bot commented Oct 26, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JoelSpeed, sinnykumari, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [sinnykumari,yuqi-zhang]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot (Contributor) commented

/retest-required

Please review the full test history for this PR and help us cut down flakes.

3 similar comments
@openshift-bot (Contributor) commented

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot (Contributor) commented

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-bot (Contributor) commented

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@JoelSpeed (Contributor, Author) commented

I presume that is expected because it's lacking the setup_envs function and I shouldn't be running it directly in the first place. So make bootstrap-e2e is the canonical way to run this, and the -local is just an alias for the actual test run (that does not work independently)? If that's the case I don't feel strongly either way, just wanted to make sure.

I have the binaries required for this (etcd, kube-apiserver) installed locally on my machine, as I use this style of testing in a lot of places. With bootstrap-e2e-local it defaults to the locally installed files and works for me; you could get this set up too just by placing the binaries in the right place. The bootstrap-e2e task extends this by downloading the binaries for you into a temp directory, so yes, it can be used self-contained, but if you use it a lot it might be good to install the binaries locally. Totally up to the end user here.
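For background (my understanding of controller-runtime's envtest defaults, not something this PR changes): envtest locates the control plane binaries via the KUBEBUILDER_ASSETS environment variable and otherwise falls back to /usr/local/kubebuilder/bin, which matches the path in the error above. A minimal sketch of pointing a run at pre-downloaded binaries, with a hypothetical directory:

```go
package e2ebootstrap

import (
	"os"
	"testing"

	"sigs.k8s.io/controller-runtime/pkg/envtest"
)

// startWithLocalAssets points envtest at a pre-downloaded set of etcd and
// kube-apiserver binaries (the directory here is hypothetical) and starts the
// control plane from them instead of the default /usr/local/kubebuilder/bin.
func startWithLocalAssets(t *testing.T) *envtest.Environment {
	if err := os.Setenv("KUBEBUILDER_ASSETS", "/tmp/bootstrap-e2e/kubebuilder/bin"); err != nil {
		t.Fatalf("failed to set KUBEBUILDER_ASSETS: %v", err)
	}

	testEnv := &envtest.Environment{}
	if _, err := testEnv.Start(); err != nil {
		t.Fatalf("failed to start envtest control plane: %v", err)
	}
	return testEnv
}
```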

@openshift-bot (Contributor) commented

/retest-required

Please review the full test history for this PR and help us cut down flakes.

