forked from RamenDR/ramen
Fix data loss when using cg workload for cephfs v2 #30
Open
BenamarMk wants to merge 60 commits into main from fix-data-loss-when-using-cg-workload-for-cephfs-v2
Conversation
…licationDest Signed-off-by: pruthvitd <prd@redhat.com>
…letion When disabling DR, the UI adds the do-not-delete-pvc annotation and then deletes the DRPC in a second step. However, by the time Ramen reconciles the DRPC resource, it may already be in a deleted state. In that case, the annotation wouldn’t propagate to the VRG, which could result in unintended deletion of application PVCs. This change updates the controller to verify that the annotation is present on both the DRPC and the VRG before permitting VRG deletion. This ensures the annotation is properly propagated and prevents accidental PVC removal. Signed-off-by: Benamar Mekhissi <bmekhiss@ibm.com>
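A minimal sketch of the check described above, assuming an illustrative annotation key and helper name (the actual Ramen identifiers may differ):

```go
import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// doNotDeletePVCAnnotation is an assumed key; Ramen's real constant may differ.
const doNotDeletePVCAnnotation = "drplacementcontrol.ramendr.openshift.io/do-not-delete-pvc"

// allowVRGDeletion sketches the described guard: when the DRPC carries the
// do-not-delete-pvc annotation, VRG deletion is allowed only after the
// annotation has also propagated to the VRG, preventing accidental PVC removal.
func allowVRGDeletion(drpc, vrg metav1.Object) bool {
	_, onDRPC := drpc.GetAnnotations()[doNotDeletePVCAnnotation]
	if !onDRPC {
		return true // annotation not requested; normal deletion flow
	}
	_, onVRG := vrg.GetAnnotations()[doNotDeletePVCAnnotation]
	return onVRG // defer deletion until the annotation reaches the VRG
}
```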
…a ManagedClusterView Signed-off-by: Benamar Mekhissi <bmekhiss@ibm.com>
Signed-off-by: pruthvitd <prd@redhat.com>
With discovered apps + CephFS, the recipe is used for selecting the PVCs that are not excluded by it. When VolSync is in use, during failover/relocate before the final sync, a temporary PVC is created but is not labelled as created by Ramen, due to which the temp PVC was getting protected, halting the cleanup on the former primary. Signed-off-by: Annaraya Narasagond <annarayanarasagond@gmail.com>
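A sketch of the kind of filter implied here, with an assumed label key for PVCs created by Ramen (the real label may differ):

```go
import corev1 "k8s.io/api/core/v1"

// createdByRamenLabel is an assumed key marking PVCs that Ramen/VolSync
// created itself, such as the temporary final-sync PVC.
const createdByRamenLabel = "ramendr.openshift.io/created-by-ramen"

// shouldProtectPVC sketches the fix: PVCs carrying the created-by-Ramen label
// are skipped by the recipe-based selection, so the temporary final-sync PVC
// is no longer protected and cleanup on the former primary can proceed.
func shouldProtectPVC(pvc *corev1.PersistentVolumeClaim) bool {
	return pvc.Labels[createdByRamenLabel] != "true"
}
```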
Minikube 1.37.0 was released but it is broken on Fedora[1]. Update the download URL to use minikube 1.36.0. [1] kubernetes/minikube#21548 Signed-off-by: Nir Soffer <nsoffer@redhat.com>
When a ReplicationSource is created, the controller was immediately checking for sync completion. If etcd had the new resource but the cache was still stale, the check could incorrectly use the old sync completion state, causing the new manual trigger to be ignored. The fix is to exit the reconciler immediately after creating the ReplicationSource. Signed-off-by: Benamar Mekhissi <bmekhiss@ibm.com>
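A minimal controller-runtime sketch of the described fix; the helper name and requeue interval are assumptions, not Ramen's actual code:

```go
import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// ensureReplicationSource creates the ReplicationSource if it is missing and
// returns right away, so sync completion is never evaluated against a cache
// that may not contain the newly created object yet.
func ensureReplicationSource(ctx context.Context, c client.Client, rs client.Object) (ctrl.Result, error) {
	err := c.Get(ctx, client.ObjectKeyFromObject(rs), rs)
	if apierrors.IsNotFound(err) {
		if err := c.Create(ctx, rs); err != nil {
			return ctrl.Result{}, err
		}
		// Exit immediately; the next reconcile observes the fresh object.
		return ctrl.Result{RequeueAfter: 10 * time.Second}, nil
	}
	if err != nil {
		return ctrl.Result{}, err
	}
	// ... sync-completion checks run only for an object already in the cache.
	return ctrl.Result{}, nil
}
```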
Add velero exclude labels to: - Submariner resources: ServiceExport, Endpoints and EndpointSlices - Copied secrets in application namespaces - VolSync resources Fixes: RamenDR#1889 Signed-off-by: Abhijeet Shakya <abhijeetshakya21@gmail.com>
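For context, a small sketch of how such labels are typically applied; velero.io/exclude-from-backup is Velero's documented exclusion label, while the helper name is illustrative:

```go
import "sigs.k8s.io/controller-runtime/pkg/client"

// veleroExcludeLabel makes Velero backups skip the labeled object.
const veleroExcludeLabel = "velero.io/exclude-from-backup"

// addVeleroExcludeLabel marks an object (e.g. a copied secret, Submariner
// ServiceExport, or VolSync resource) so it is excluded from Velero backups.
func addVeleroExcludeLabel(obj client.Object) {
	labels := obj.GetLabels()
	if labels == nil {
		labels = map[string]string{}
	}
	labels[veleroExcludeLabel] = "true"
	obj.SetLabels(labels)
}
```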
Until now the supported resource types were pod, deployment, and statefulset. Additional changes add support for any Custom Resources (CRs), for which the supported resource format is <apiGroup>/<apiVersion>/<resourceNamePlural>. It is the user's responsibility to add the required Role and RoleBinding combination on the cluster. Also adding support for other corev1 resources, e.g. serviceaccounts, configmaps, etc. Signed-off-by: Annaraya Narasagond <annarayanarasagond@gmail.com>
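A sketch of parsing that resource format into a GroupVersionResource for use with a dynamic client; the handling of core ("" group) resources shown here is an assumption about how serviceaccounts, configmaps, etc. might be expressed:

```go
import (
	"fmt"
	"strings"

	"k8s.io/apimachinery/pkg/runtime/schema"
)

// parseHookResource parses "<apiGroup>/<apiVersion>/<resourceNamePlural>"
// into a GroupVersionResource. Core resources could be written with an empty
// group, e.g. "/v1/serviceaccounts" (illustrative; the real recipe syntax
// may differ).
func parseHookResource(s string) (schema.GroupVersionResource, error) {
	parts := strings.Split(s, "/")
	if len(parts) != 3 {
		return schema.GroupVersionResource{},
			fmt.Errorf("expected <apiGroup>/<apiVersion>/<resourceNamePlural>, got %q", s)
	}
	return schema.GroupVersionResource{Group: parts[0], Version: parts[1], Resource: parts[2]}, nil
}
```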
With the additional changes for supporting CRs for check hooks in the recipe, add unit tests and correct failing ones. With fakeClient, a fieldSelector used in ListOptions returns no results, so the results are filtered manually to make the unit tests pass. Signed-off-by: Annaraya Narasagond <annarayanarasagond@gmail.com>
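A sketch of the manual filtering pattern used in such unit tests; the concrete field being filtered here (pod phase) is only an example:

```go
import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// listRunningPods lists pods and filters them in code instead of passing a
// field selector, because with the fake client a field selector in
// ListOptions returns no results (as noted above).
func listRunningPods(ctx context.Context, c client.Client, ns string) ([]corev1.Pod, error) {
	var pods corev1.PodList
	if err := c.List(ctx, &pods, client.InNamespace(ns)); err != nil {
		return nil, err
	}
	var running []corev1.Pod
	for _, pod := range pods.Items {
		if pod.Status.Phase == corev1.PodRunning {
			running = append(running, pod)
		}
	}
	return running, nil
}
```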
Previously debug logs were dropped by default, so we never had enough information to debug drenv failures. We had to run again with --verbose mode. To keep the verbose log file we had to redirect stderr to a file, which makes it harder to use. Now we always log all messages to a log file (default "drenv.log"), and always log non-verbose messages to the console. If the run fails we can inspect the complete debug log. This is the same way we log in e2e. The new log is included in the e2e build artifacts, so we have enough information to debug build failures in the CI. New options: - `--logfile` Removed options: - `-v, --verbose` Signed-off-by: Nir Soffer <nsoffer@redhat.com>
We used the ocm-ramen-samples rbd deployment, which introduces a dependency
on rook-pool. Since we run the tests in parallel, it fails randomly if the
rook block pool is not ready when the test runs. Change to a simpler
application from the ramen source, using hostPath storage so it works on both
minikube and lima clusters.
Example manual run:
% time addons/argocd/test hub dr1 dr2
Deploying application busybox-dr1 in namespace argocd-test on cluster dr1
application 'busybox-dr1' created
Deploying application busybox-dr2 in namespace argocd-test on cluster dr2
application 'busybox-dr2' created
Waiting application busybox-dr1 to be healthy
application.argoproj.io/busybox-dr1 condition met
Waiting application busybox-dr2 to be healthy
application.argoproj.io/busybox-dr2 condition met
Deleting application busybox-dr1
application 'busybox-dr1' deleted
Deleting namespace argocd-test in cluster dr1
namespace "argocd-test" deleted
Deleting application busybox-dr2
application 'busybox-dr2' deleted
Deleting namespace argocd-test in cluster dr2
namespace "argocd-test" deleted
Waiting until application busybox-dr1 is deleted
application.argoproj.io/busybox-dr1 condition met
Waiting until namespace argocd-test is deleted in cluster dr1
namespace/argocd-test condition met
Waiting until application busybox-dr2 is deleted
Waiting until namespace argocd-test is deleted in cluster dr2
namespace/argocd-test condition met
addons/argocd/test hub dr1 dr2 1.33s user 0.63s system 17% cpu 11.430 total
Signed-off-by: Nir Soffer <nsoffer@redhat.com>
For minikube clusters we upgrade to minikube 1.37.0, using Kubernetes 1.34 by default. For lima clusters we update the Kubernetes version in our lima k8s.yaml template. Signed-off-by: Nir Soffer <nsoffer@redhat.com>
We used "e2e-" prefix, but ramenctl is using "test-". This creates a conflict when ramenctl try to create an application when the application was already created by e2e. We want to keep the ramenctl prefix, so lets simplify by using the same prefix. To make it easy to consume in ramenctl, move the value to the config. Ramenctl will automatically get the value from the config. The new config is intentionally not documented since users do not have to change it normally. The value will be visible in in ramenctl report.config. This change also remove duplicate "e2e-" prefix hiding in defaultChannelNamespace. Previously we had to change both constants at the same time, now we have single constant. Fixes: RamenDR#2255 Signed-off-by: Nir Soffer <nsoffer@redhat.com>
Adding "failed to create env: " in env.New() does not help and duplicate
the same message added by the caller, as can be seen in ramenctl.
In e2e we did not have duplicate message since the message was stale,
leftover from previous code.
Example error with this change:
% ./run.sh -test.run TestDR
2025-09-21T21:40:02.387+0300 INFO Using config file "config.yaml"
2025-09-21T21:40:02.388+0300 INFO Using log file "ramen-e2e.log"
2025-09-21T21:40:02.389+0300 ERROR Failed to create env: failed to create cluster
"hub": failed to build config from kubeconfig (/Users/nir/.config/drenv/rdr/kubeconfigs/hub):
stat /Users/nir/.config/drenv/rdr/kubeconfigs/hub: no such file or directory
Signed-off-by: Nir Soffer <nsoffer@redhat.com>
Updating recipe package so that SkipHookIfNotPresent can be implemented for check hooks. Signed-off-by: Annaraya Narasagond <annarayanarasagond@gmail.com>
When the user sets SkipHookIfNotPresent to true, then for check hooks we wait until the timeout; if the resource is still not found, we continue with the next steps in the workflow. Signed-off-by: Annaraya Narasagond <annarayanarasagond@gmail.com>
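A sketch of that wait-then-skip behavior, assuming a hypothetical existence callback and helper name:

```go
import (
	"context"
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForCheckHookTarget polls for the hook's target resource until the hook
// timeout. If the resource never appears and skipIfNotPresent is true, the
// hook is skipped and the workflow continues; otherwise the error is returned.
func waitForCheckHookTarget(ctx context.Context, exists func(context.Context) (bool, error),
	timeout time.Duration, skipIfNotPresent bool) error {
	err := wait.PollUntilContextTimeout(ctx, time.Second, timeout, true, exists)
	if err == nil {
		return nil // resource found; the check hook can be evaluated
	}
	if skipIfNotPresent {
		return nil // SkipHookIfNotPresent: continue with the next workflow step
	}
	return fmt.Errorf("check hook target not found within %s: %w", timeout, err)
}
```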
Signed-off-by: Annaraya Narasagond <annarayanarasagond@gmail.com>
This is now macos-15, and soon it will switch to macos-26. There seems to be an issue with macos-14 now, and our job times out since no runner picks it up. Signed-off-by: Nir Soffer <nsoffer@redhat.com>
Signed-off-by: Elena Gershkovich <elenage@il.ibm.com>
Signed-off-by: Elena Gershkovich <elenage@il.ibm.com>
Using pyproject.toml we can keep all tool configurations in a single file for easier maintenance. Signed-off-by: Nir Soffer <nsoffer@redhat.com>
Basedpyright is a fork of pyright with various type checking improvements, pylance features and more. It is the default python language server now in my editor and its default configuration is too annoying, adding too many warnings for valid code. Standard mode does not have any warnings with current code. Signed-off-by: Nir Soffer <nsoffer@redhat.com>
Instead of every 3 seconds. This speeds up starting a cluster by 3 seconds. Signed-off-by: Nir Soffer <nsoffer@redhat.com>
So we can run arm64 image on macOS. Signed-off-by: Nir Soffer <nsoffer@redhat.com>
Signed-off-by: Nir Soffer <nsoffer@redhat.com>
Wait for the BackupStorageLocation to become Available before creating a backup, to prevent FailedValidation. Fixes: DFBUGS-4220, DFBUGS-4224 Signed-off-by: Abhijeet Shakya <abhijeetshakya21@gmail.com>
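A sketch of the wait using an unstructured read of the BackupStorageLocation status (helper name and polling interval are assumptions; the actual implementation may use the typed Velero API):

```go
import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/util/wait"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// waitForBSLAvailable waits until the BackupStorageLocation reports phase
// "Available" before a Backup is created, avoiding FailedValidation.
func waitForBSLAvailable(ctx context.Context, c client.Client, key client.ObjectKey, timeout time.Duration) error {
	return wait.PollUntilContextTimeout(ctx, 5*time.Second, timeout, true, func(ctx context.Context) (bool, error) {
		bsl := &unstructured.Unstructured{}
		bsl.SetGroupVersionKind(schema.GroupVersionKind{
			Group: "velero.io", Version: "v1", Kind: "BackupStorageLocation",
		})
		if err := c.Get(ctx, key, bsl); err != nil {
			return false, client.IgnoreNotFound(err)
		}
		phase, _, _ := unstructured.NestedString(bsl.Object, "status", "phase")
		return phase == "Available", nil
	})
}
```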
We forgot to include this when adding codeowners. Signed-off-by: Nir Soffer <nsoffer@redhat.com>
Signed-off-by: pruthvitd <prd@redhat.com>
…d PVCs Signed-off-by: Benamar Mekhissi <bmekhiss@ibm.com>
When starting a stopped VM we have timeouts in webhooks, so we added a
15 second sleep before using the cluster in the lima provider start().
The implementation was wrong, always waiting 15 seconds after the
cluster is ready before returning. Fix it to wait only when starting a
stopped VM.
Testing shows that removing the sleep shortens the time until the
cluster is started, but the first deployment takes more time, since k8s
is not completely ready when the cluster is started.
Example run vm - before:
% drenv start envs/vm.yaml --local-registry
2025-10-08 21:25:45,505 INFO [vm] Starting environment
2025-10-08 21:25:45,528 INFO [cluster] Starting lima cluster
2025-10-08 21:27:02,028 INFO [cluster] Cluster started in 76.50 seconds
2025-10-08 21:27:02,030 INFO [cluster/0] Running addons/example/start
2025-10-08 21:27:21,405 INFO [cluster/0] addons/example/start completed in 19.37 seconds
2025-10-08 21:27:21,405 INFO [cluster/0] Running addons/example/test
2025-10-08 21:27:21,542 INFO [cluster/0] addons/example/test completed in 0.14 seconds
2025-10-08 21:27:21,543 INFO [vm] Environment started in 96.04 seconds
Example run vm - after:
% drenv start envs/vm.yaml --local-registry
2025-10-08 21:27:44,382 INFO [vm] Starting environment
2025-10-08 21:27:44,405 INFO [cluster] Starting lima cluster
2025-10-08 21:28:46,500 INFO [cluster] Cluster started in 62.10 seconds
2025-10-08 21:28:46,501 INFO [cluster/0] Running addons/example/start
2025-10-08 21:29:21,167 INFO [cluster/0] addons/example/start completed in 34.67 seconds
2025-10-08 21:29:21,168 INFO [cluster/0] Running addons/example/test
2025-10-08 21:29:21,309 INFO [cluster/0] addons/example/test completed in 0.14 seconds
2025-10-08 21:29:21,310 INFO [vm] Environment started in 96.93 seconds
The total time to start a cluster and install the example deployment
remains the same.
Example run ocm - before:
% drenv start envs/ocm.yaml --local-registry
2025-10-08 22:04:31,997 INFO [ocm] Starting environment
2025-10-08 22:04:32,024 INFO [dr1] Starting lima cluster
2025-10-08 22:04:32,024 INFO [hub] Starting lima cluster
2025-10-08 22:04:32,024 INFO [dr2] Starting lima cluster
2025-10-08 22:05:55,983 INFO [hub] Cluster started in 83.96 seconds
2025-10-08 22:05:55,985 INFO [hub/0] Running addons/ocm-hub/start
2025-10-08 22:05:56,032 INFO [dr1] Cluster started in 84.01 seconds
2025-10-08 22:05:56,034 INFO [dr1/0] Running addons/ocm-cluster/start
2025-10-08 22:05:57,063 INFO [dr2] Cluster started in 85.04 seconds
2025-10-08 22:05:57,063 INFO [dr2/0] Running addons/ocm-cluster/start
2025-10-08 22:06:37,142 INFO [hub/0] addons/ocm-hub/start completed in 41.16 seconds
2025-10-08 22:06:37,142 INFO [hub/0] Running addons/ocm-controller/start
2025-10-08 22:06:48,347 INFO [hub/0] addons/ocm-controller/start completed in 11.21 seconds
2025-10-08 22:08:08,546 INFO [dr1/0] addons/ocm-cluster/start completed in 132.52 seconds
2025-10-08 22:08:08,546 INFO [dr1/0] Running addons/ocm-cluster/test
2025-10-08 22:08:08,993 INFO [dr2/0] addons/ocm-cluster/start completed in 131.94 seconds
2025-10-08 22:08:08,993 INFO [dr2/0] Running addons/ocm-cluster/test
2025-10-08 22:08:13,091 INFO [dr2/0] addons/ocm-cluster/test completed in 4.10 seconds
2025-10-08 22:08:14,582 INFO [dr1/0] addons/ocm-cluster/test completed in 6.04 seconds
2025-10-08 22:08:14,583 INFO [ocm] Environment started in 222.59 seconds
Example run ocm - after:
% rm drenv.log; drenv start envs/ocm.yaml --local-registry
2025-10-08 22:52:42,566 INFO [ocm] Starting environment
2025-10-08 22:52:42,592 INFO [dr1] Starting lima cluster
2025-10-08 22:52:42,592 INFO [dr2] Starting lima cluster
2025-10-08 22:52:42,593 INFO [hub] Starting lima cluster
2025-10-08 22:53:49,340 INFO [hub] Cluster started in 66.75 seconds
2025-10-08 22:53:49,341 INFO [hub/0] Running addons/ocm-hub/start
2025-10-08 22:53:49,390 INFO [dr1] Cluster started in 66.80 seconds
2025-10-08 22:53:49,390 INFO [dr1/0] Running addons/ocm-cluster/start
2025-10-08 22:53:50,483 INFO [dr2] Cluster started in 67.89 seconds
2025-10-08 22:53:50,483 INFO [dr2/0] Running addons/ocm-cluster/start
2025-10-08 22:54:43,553 INFO [hub/0] addons/ocm-hub/start completed in 54.21 seconds
2025-10-08 22:54:43,553 INFO [hub/0] Running addons/ocm-controller/start
2025-10-08 22:54:52,894 INFO [hub/0] addons/ocm-controller/start completed in 9.34 seconds
2025-10-08 22:56:02,998 INFO [dr2/0] addons/ocm-cluster/start completed in 132.51 seconds
2025-10-08 22:56:02,998 INFO [dr2/0] Running addons/ocm-cluster/test
2025-10-08 22:56:03,740 INFO [dr1/0] addons/ocm-cluster/start completed in 134.35 seconds
2025-10-08 22:56:03,740 INFO [dr1/0] Running addons/ocm-cluster/test
2025-10-08 22:56:07,002 INFO [dr2/0] addons/ocm-cluster/test completed in 4.00 seconds
2025-10-08 22:56:07,809 INFO [dr1/0] addons/ocm-cluster/test completed in 4.07 seconds
2025-10-08 22:56:07,809 INFO [ocm] Environment started in 205.24 seconds
The total time was slightly lower but starting clusters is very noisy so
more runs are needed to tell if this is a real improvement.
Signed-off-by: Nir Soffer <nsoffer@redhat.com>
Signed-off-by: rakeshgm <rakeshgm@redhat.com>
We always want to test with latest versions. Hopefully this version can fix the random failure to delete volsync snapshot that we see in the CI. Signed-off-by: Nir Soffer <nsoffer@redhat.com>
Like e2e, create our standard per-test venv and use it to run drenv. Previously we installed drenv globally, which modifies the CI hosts and requires the python3-pip package. With this change we don't modify the host and there is no need to install the python3-pip package. Signed-off-by: Nir Soffer <nsoffer@redhat.com>
The current job runs on a random runner every day. Fix it to run on all runners every day, so we refresh the cache and prune images every day. Signed-off-by: Nir Soffer <nsoffer@redhat.com>
It is necessary because the deleteVGR function not only deletes the VGR, but also verifies that the VGR was actually deleted from the API server. Signed-off-by: Elena Gershkovich <elenage@il.ibm.com>
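A sketch of that delete-and-verify pattern (the helper name is illustrative; the real deleteVGR operates on VolumeGroupReplication objects):

```go
import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/util/wait"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteAndConfirmGone deletes an object and then polls until the API server
// no longer returns it, so callers know the deletion actually completed.
func deleteAndConfirmGone(ctx context.Context, c client.Client, obj client.Object, timeout time.Duration) error {
	if err := c.Delete(ctx, obj); err != nil && !apierrors.IsNotFound(err) {
		return err
	}
	key := client.ObjectKeyFromObject(obj)
	return wait.PollUntilContextTimeout(ctx, time.Second, timeout, true, func(ctx context.Context) (bool, error) {
		err := c.Get(ctx, key, obj)
		if apierrors.IsNotFound(err) {
			return true, nil // gone from the API server
		}
		return false, err // still present (err == nil) or a real error
	})
}
```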
Signed-off-by: Elena Gershkovich <elenage@il.ibm.com>
Signed-off-by: Elena Gershkovich <elenage@il.ibm.com>
- Add support for creating multiple DRPolicies for easier testing with different scheduling intervals. Both 1m and 5m DRPolicies are now created automatically using a single template. - Create vrc with 5m scheduling interval. - Rename vr/vgr/vgrc/vrc sample start-date/test-data files. - Update e2e configs to use updated 1m policy name. Signed-off-by: Parikshith <parikshithb@gmail.com>
Signed-off-by: rakeshgm <rakeshgm@redhat.com>
update groupReplicationID from VGRC while processing peerClasses Signed-off-by: rakeshgm <rakeshgm@redhat.com>
Signed-off-by: Elena Gershkovich <elenage@il.ibm.com>
Signed-off-by: Elena Gershkovich <elenage@il.ibm.com>
Signed-off-by: Elena Gershkovich <elenage@il.ibm.com>
Signed-off-by: Elena Gershkovich <elenage@il.ibm.com>
Signed-off-by: Benamar Mekhissi <bmekhiss@ibm.com>
Signed-off-by: Benamar Mekhissi <bmekhiss@ibm.com>
Signed-off-by: Benamar Mekhissi <bmekhiss@ibm.com>
Introduce a Paused flag in the ReplicationGroupDestination to mark it as non-reconcilable. This state is primarily used during a transition from primary to secondary, while the new primary is still in its initial restore phase. Signed-off-by: Benamar Mekhissi <bmekhiss@ibm.com>
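A sketch of how such a Paused flag would short-circuit the reconciler; the reconciler type, import path, and field access are assumptions based on the description above:

```go
import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	ramendrv1alpha1 "github.com/ramendr/ramen/api/v1alpha1" // assumed import path
)

// RGDReconciler is a stand-in for the real ReplicationGroupDestination reconciler.
type RGDReconciler struct {
	Client client.Client
}

// Reconcile skips all work while the ReplicationGroupDestination is paused,
// e.g. during a primary-to-secondary transition while the new primary is
// still in its initial restore phase.
func (r *RGDReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	rgd := &ramendrv1alpha1.ReplicationGroupDestination{}
	if err := r.Client.Get(ctx, req.NamespacedName, rgd); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	if rgd.Spec.Paused { // the flag introduced by this change (assumed field name)
		return ctrl.Result{}, nil // marked non-reconcilable; wait for the flag to clear
	}
	// ... normal ReplicationGroupDestination reconciliation continues here.
	return ctrl.Result{}, nil
}
```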
Signed-off-by: Benamar Mekhissi <bmekhiss@ibm.com>
Signed-off-by: Benamar Mekhissi <bmekhiss@ibm.com>
Signed-off-by: Benamar Mekhissi <bmekhiss@ibm.com>
Signed-off-by: Benamar Mekhissi <bmekhiss@ibm.com>