
Conversation

@BenamarMk
Owner

No description provided.

@BenamarMk BenamarMk force-pushed the fix-data-loss-when-using-cg-workload-for-cephfs-v2 branch 2 times, most recently from cd998cb to 5f17ff9 on September 3, 2025 07:07
pruthvitd and others added 3 commits September 5, 2025 07:55
…licationDest

Signed-off-by: pruthvitd <prd@redhat.com>
…letion

When disabling DR, the UI adds the do-not-delete-pvc annotation and then deletes
the DRPC in a second step. However, by the time Ramen reconciles the DRPC resource,
it may already be in a deleted state. In that case, the annotation wouldn’t propagate
to the VRG, which could result in unintended deletion of application PVCs.

This change updates the controller to verify that the annotation is present on both
the DRPC and the VRG before permitting VRG deletion. This ensures the annotation is
properly propagated and prevents accidental PVC removal.
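
The guard can be sketched as follows; this is only an illustration of the check described above, and the annotation key and helper name are placeholders rather than the exact Ramen identifiers:

    // Hypothetical annotation key; the real key lives in the Ramen API package.
    const doNotDeletePVCAnnotation = "drplacementcontrol.ramendr.openshift.io/do-not-delete-pvc"

    // retainPVCs reports whether the do-not-delete-pvc annotation has propagated,
    // i.e. it is present on both the DRPC and the VRG. Only then may VRG deletion
    // proceed with the application PVCs retained.
    func retainPVCs(drpcAnnotations, vrgAnnotations map[string]string) bool {
    	_, onDRPC := drpcAnnotations[doNotDeletePVCAnnotation]
    	_, onVRG := vrgAnnotations[doNotDeletePVCAnnotation]
    	return onDRPC && onVRG
    }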

Signed-off-by: Benamar Mekhissi <bmekhiss@ibm.com>
…a ManagedClusterView

Signed-off-by: Benamar Mekhissi <bmekhiss@ibm.com>
@BenamarMk BenamarMk force-pushed the fix-data-loss-when-using-cg-workload-for-cephfs-v2 branch from 5f17ff9 to b972759 on September 6, 2025 07:11
pruthvitd and others added 2 commits September 12, 2025 09:07
Signed-off-by: pruthvitd <prd@redhat.com>
With discovered apps + CephFS, a recipe is used to select the PVCs that
are not excluded by the recipe. When VolSync is in use, a temporary PVC
is created during failover/relocate before the final sync, but it is not
labelled as created by Ramen. As a result, the temporary PVC was getting
protected, halting the cleanup on the former primary.

Signed-off-by: Annaraya Narasagond <annarayanarasagond@gmail.com>
@BenamarMk BenamarMk force-pushed the fix-data-loss-when-using-cg-workload-for-cephfs-v2 branch from 7f172c6 to 3c755bc on September 15, 2025 16:34
nirs and others added 21 commits September 16, 2025 18:43
Minikube 1.37.0 was released but it is broken on Fedora [1]. Update the
download URL to use minikube 1.36.0.

[1] kubernetes/minikube#21548

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
When a ReplicationSource is created, the controller was immediately checking
for sync completion. If etcd had the new resource but the cache was still
stale, the check could incorrectly use the old sync completion state,
causing the new manual trigger to be ignored.

The fix is to exit the reconciler immediately after creating the
ReplicationSource.
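
A minimal sketch of the flow, assuming a controller-runtime style reconciler; the variable names are illustrative, not Ramen's actual code:

    // Create the ReplicationSource and stop; do not check sync completion in
    // the same pass, since the cache may still return the old object and its
    // stale sync completion status.
    if err := r.Client.Create(ctx, replicationSource); err != nil {
    	return ctrl.Result{}, err
    }

    // A later reconcile re-reads the ReplicationSource and evaluates the new
    // manual trigger against fresh data.
    return ctrl.Result{}, nil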

Signed-off-by: Benamar Mekhissi <bmekhiss@ibm.com>
Add velero exclude labels to:
- Submariner resources: ServiceExport, Endpoints and EndpointSlices
- Copied secrets in application namespaces
- VolSync resources
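
A minimal sketch of applying the exclusion to the resources listed above, using the standard velero.io/exclude-from-backup label; the helper name is illustrative:

    import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

    // excludeFromBackup labels any object so that Velero skips it during backup.
    func excludeFromBackup(obj metav1.Object) {
    	labels := obj.GetLabels()
    	if labels == nil {
    		labels = map[string]string{}
    	}
    	labels["velero.io/exclude-from-backup"] = "true"
    	obj.SetLabels(labels)
    }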

Fixes: RamenDR#1889
Signed-off-by: Abhijeet Shakya <abhijeetshakya21@gmail.com>
Until now, the supported resource types were pod, deployment, and
statefulset. This change adds support for any Custom Resource (CR),
specified in the form <apiGroup>/<apiVersion>/<resourceNamePlural>. It is
the user's responsibility to add the required Role and RoleBinding
combination on the cluster.

Support is also added for other corev1 resources, e.g. serviceaccounts
and configmaps.
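
A sketch of how the <apiGroup>/<apiVersion>/<resourceNamePlural> form can be turned into a GroupVersionResource for a dynamic client lookup; the helper name is illustrative, not the actual recipe code:

    import (
    	"fmt"
    	"strings"

    	"k8s.io/apimachinery/pkg/runtime/schema"
    )

    func parseHookResource(s string) (schema.GroupVersionResource, error) {
    	parts := strings.Split(s, "/")
    	if len(parts) != 3 {
    		return schema.GroupVersionResource{},
    			fmt.Errorf("expected <apiGroup>/<apiVersion>/<resourceNamePlural>, got %q", s)
    	}

    	return schema.GroupVersionResource{
    		Group:    parts[0],
    		Version:  parts[1],
    		Resource: parts[2],
    	}, nil
    }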

Signed-off-by: Annaraya Narasagond <annarayanarasagond@gmail.com>
With the additional changes that support CRs for check hooks in a
recipe, add unit tests.

Also correct failing unit tests: with the fakeClient, a fieldSelector
used in ListOptions returns no results, so the results are filtered
manually to make the unit tests pass.
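
The manual filtering amounts to something like the following sketch; the function name is illustrative:

    import corev1 "k8s.io/api/core/v1"

    // filterPodsByName mimics a metadata.name field selector, which the fake
    // client does not evaluate when listing.
    func filterPodsByName(pods []corev1.Pod, name string) []corev1.Pod {
    	var out []corev1.Pod
    	for _, p := range pods {
    		if p.Name == name {
    			out = append(out, p)
    		}
    	}
    	return out
    }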

Signed-off-by: Annaraya Narasagond <annarayanarasagond@gmail.com>
Previously debug logs were dropped by default, so we never had enough
information to debug drenv failures; we had to run again with --verbose.
To keep the verbose log file we had to redirect stderr to a file, which
makes it harder to use.

Now we always write all messages to the log file (default "drenv.log"),
and always log non-verbose messages to the console. If the run fails we
can inspect the complete debug log. This is the same way we log in e2e.

The new log is included in the e2e build artifacts, so we have enough
information to debug build failures in the CI.

New options:
- `--logfile`

Removed options:
- `-v, --verbose`

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
We used the ocm-ramen-samples rbd deployment, which introduces a
dependency on rook-pool. Since we run the tests in parallel, it fails
randomly if the rook block pool is not ready when the test runs. Change
to a simpler application from the ramen source, using hostPath storage
so it works on both minikube and lima clusters.

Example manual run:

    % time addons/argocd/test hub dr1 dr2
    Deploying application busybox-dr1 in namespace argocd-test on cluster dr1
    application 'busybox-dr1' created
    Deploying application busybox-dr2 in namespace argocd-test on cluster dr2
    application 'busybox-dr2' created
    Waiting application busybox-dr1 to be healthy
    application.argoproj.io/busybox-dr1 condition met
    Waiting application busybox-dr2 to be healthy
    application.argoproj.io/busybox-dr2 condition met
    Deleting application busybox-dr1
    application 'busybox-dr1' deleted
    Deleting namespace argocd-test in cluster dr1
    namespace "argocd-test" deleted
    Deleting application busybox-dr2
    application 'busybox-dr2' deleted
    Deleting namespace argocd-test in cluster dr2
    namespace "argocd-test" deleted
    Waiting until application busybox-dr1 is deleted
    application.argoproj.io/busybox-dr1 condition met
    Waiting until namespace argocd-test is deleted in cluster dr1
    namespace/argocd-test condition met
    Waiting until application busybox-dr2 is deleted
    Waiting until namespace argocd-test is deleted in cluster dr2
    namespace/argocd-test condition met
    addons/argocd/test hub dr1 dr2  1.33s user 0.63s system 17% cpu 11.430 total

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
For minikube clusters we upgrade to minikube 1.37.0, using Kubernetes
1.34 by default.

For lima clusters we update the Kubernetes version in our lima k8s.yaml
template.

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
We used "e2e-" prefix, but ramenctl is using "test-". This creates a
conflict when ramenctl try to create an application when the application
was already created by e2e. We want to keep the ramenctl prefix, so lets
simplify by using the same prefix.

To make it easy to consume in ramenctl, move the value to the config.
Ramenctl will automatically get the value from the config.

The new config is intentionally not documented since users do not have
to change it normally. The value will be visible in in ramenctl
report.config.

This change also remove duplicate "e2e-" prefix hiding in
defaultChannelNamespace. Previously we had to change both constants at
the same time, now we have single constant.

Fixes: RamenDR#2255
Signed-off-by: Nir Soffer <nsoffer@redhat.com>
Adding "failed to create env: " in env.New() does not help and duplicate
the same message added by the caller, as can be seen in ramenctl.

In e2e we did not have duplicate message since the message was stale,
leftover from previous code.

Example error with this change:

    % ./run.sh -test.run TestDR
    2025-09-21T21:40:02.387+0300	INFO	Using config file "config.yaml"
    2025-09-21T21:40:02.388+0300	INFO	Using log file "ramen-e2e.log"
    2025-09-21T21:40:02.389+0300	ERROR	Failed to create env: failed to create cluster
    "hub": failed to build config from kubeconfig (/Users/nir/.config/drenv/rdr/kubeconfigs/hub):
    stat /Users/nir/.config/drenv/rdr/kubeconfigs/hub: no such file or directory

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
Updating recipe package so that SkipHookIfNotPresent can be implemented
for check hooks.

Signed-off-by: Annaraya Narasagond <annarayanarasagond@gmail.com>
When the user sets SkipHookIfNotPresent to true for check hooks, we wait
until the timeout; if the resource is still not found, we continue with
the next steps in the workflow.
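
A sketch of this behaviour, assuming a poll-based wait; the function and parameter names are illustrative, not the actual hook executor code:

    import (
    	"context"
    	"time"

    	"k8s.io/apimachinery/pkg/util/wait"
    )

    // waitForCheckHookResource polls until the hook's target resource exists or
    // the timeout expires. If the resource never appears and skipIfNotPresent is
    // set, the hook is skipped and the workflow continues; otherwise the timeout
    // is reported as an error.
    func waitForCheckHookResource(ctx context.Context, exists func(context.Context) (bool, error),
    	timeout time.Duration, skipIfNotPresent bool) (skipped bool, err error) {
    	err = wait.PollUntilContextTimeout(ctx, time.Second, timeout, true, exists)
    	if err == nil {
    		return false, nil
    	}
    	if skipIfNotPresent {
    		return true, nil
    	}
    	return false, err
    }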

Signed-off-by: Annaraya Narasagond <annarayanarasagond@gmail.com>
Signed-off-by: Annaraya Narasagond <annarayanarasagond@gmail.com>
This is now macos-15, and soon it will switch to macos-26. There seems
to be an issue with macos-14 now, and our job times out since no runner
picks it up.

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
Signed-off-by: Elena Gershkovich <elenage@il.ibm.com>
Signed-off-by: Elena Gershkovich <elenage@il.ibm.com>
Using pyproject.toml we can keep all tool configurations in a single
file for easier maintenance.

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
Basedpyright is a fork of pyright with various type checking
improvements, pylance features, and more. It is now the default Python
language server in my editor, and its default configuration is too
noisy, adding too many warnings for valid code. Standard mode does not
report any warnings with the current code.

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
Instead of every 3 seconds. This speeds up starting a cluster by 3
seconds.

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
So we can run arm64 image on macOS.

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
Signed-off-by: Nir Soffer <nsoffer@redhat.com>
abhijeet219 and others added 13 commits October 7, 2025 09:20
Wait for the BackupStorageLocation to become Available before creating a
backup, to prevent FailedValidation.
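
A sketch of the wait, assuming the Velero v1 API types and a controller-runtime client; the helper name and its parameters are illustrative:

    import (
    	"context"
    	"time"

    	velerov1 "github.com/vmware-tanzu/velero/pkg/apis/velero/v1"
    	"k8s.io/apimachinery/pkg/types"
    	"k8s.io/apimachinery/pkg/util/wait"
    	"sigs.k8s.io/controller-runtime/pkg/client"
    )

    func waitForBSLAvailable(ctx context.Context, c client.Client, key types.NamespacedName,
    	timeout time.Duration) error {
    	return wait.PollUntilContextTimeout(ctx, 5*time.Second, timeout, true,
    		func(ctx context.Context) (bool, error) {
    			bsl := &velerov1.BackupStorageLocation{}
    			if err := c.Get(ctx, key, bsl); err != nil {
    				return false, nil // keep polling on transient errors
    			}
    			// Create the Backup only after the location reports Available,
    			// avoiding the FailedValidation described above.
    			return bsl.Status.Phase == velerov1.BackupStorageLocationPhaseAvailable, nil
    		})
    }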

Fixes: DFBUGS-4220, DFBUGS-4224
Signed-off-by: Abhijeet Shakya <abhijeetshakya21@gmail.com>
We forgot to include this when adding codeowners.

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
Signed-off-by: pruthvitd <prd@redhat.com>
…d PVCs

Signed-off-by: Benamar Mekhissi <bmekhiss@ibm.com>
When starting a stopped vm we have timeouts in webhooks, so we added a
15 second sleep before using the cluster in the lima provider start().
The implementation was wrong, always waiting 15 seconds after the
cluster is ready before returning. Fix it to wait only when starting a
stopped vm.

Testing shows that removing the sleep shortens the time until the
cluster is started, but the first deployment takes more time, since k8s
is not completely ready when the cluster is started.

Example run vm - before:

    % drenv start envs/vm.yaml --local-registry
    2025-10-08 21:25:45,505 INFO    [vm] Starting environment
    2025-10-08 21:25:45,528 INFO    [cluster] Starting lima cluster
    2025-10-08 21:27:02,028 INFO    [cluster] Cluster started in 76.50 seconds
    2025-10-08 21:27:02,030 INFO    [cluster/0] Running addons/example/start
    2025-10-08 21:27:21,405 INFO    [cluster/0] addons/example/start completed in 19.37 seconds
    2025-10-08 21:27:21,405 INFO    [cluster/0] Running addons/example/test
    2025-10-08 21:27:21,542 INFO    [cluster/0] addons/example/test completed in 0.14 seconds
    2025-10-08 21:27:21,543 INFO    [vm] Environment started in 96.04 seconds

Example run vm - after:

    % drenv start envs/vm.yaml --local-registry
    2025-10-08 21:27:44,382 INFO    [vm] Starting environment
    2025-10-08 21:27:44,405 INFO    [cluster] Starting lima cluster
    2025-10-08 21:28:46,500 INFO    [cluster] Cluster started in 62.10 seconds
    2025-10-08 21:28:46,501 INFO    [cluster/0] Running addons/example/start
    2025-10-08 21:29:21,167 INFO    [cluster/0] addons/example/start completed in 34.67 seconds
    2025-10-08 21:29:21,168 INFO    [cluster/0] Running addons/example/test
    2025-10-08 21:29:21,309 INFO    [cluster/0] addons/example/test completed in 0.14 seconds
    2025-10-08 21:29:21,310 INFO    [vm] Environment started in 96.93 seconds

The total time to start a cluster and install the example deployment
remains the same.

Example run ocm - before:

    % drenv start envs/ocm.yaml --local-registry
    2025-10-08 22:04:31,997 INFO    [ocm] Starting environment
    2025-10-08 22:04:32,024 INFO    [dr1] Starting lima cluster
    2025-10-08 22:04:32,024 INFO    [hub] Starting lima cluster
    2025-10-08 22:04:32,024 INFO    [dr2] Starting lima cluster
    2025-10-08 22:05:55,983 INFO    [hub] Cluster started in 83.96 seconds
    2025-10-08 22:05:55,985 INFO    [hub/0] Running addons/ocm-hub/start
    2025-10-08 22:05:56,032 INFO    [dr1] Cluster started in 84.01 seconds
    2025-10-08 22:05:56,034 INFO    [dr1/0] Running addons/ocm-cluster/start
    2025-10-08 22:05:57,063 INFO    [dr2] Cluster started in 85.04 seconds
    2025-10-08 22:05:57,063 INFO    [dr2/0] Running addons/ocm-cluster/start
    2025-10-08 22:06:37,142 INFO    [hub/0] addons/ocm-hub/start completed in 41.16 seconds
    2025-10-08 22:06:37,142 INFO    [hub/0] Running addons/ocm-controller/start
    2025-10-08 22:06:48,347 INFO    [hub/0] addons/ocm-controller/start completed in 11.21 seconds
    2025-10-08 22:08:08,546 INFO    [dr1/0] addons/ocm-cluster/start completed in 132.52 seconds
    2025-10-08 22:08:08,546 INFO    [dr1/0] Running addons/ocm-cluster/test
    2025-10-08 22:08:08,993 INFO    [dr2/0] addons/ocm-cluster/start completed in 131.94 seconds
    2025-10-08 22:08:08,993 INFO    [dr2/0] Running addons/ocm-cluster/test
    2025-10-08 22:08:13,091 INFO    [dr2/0] addons/ocm-cluster/test completed in 4.10 seconds
    2025-10-08 22:08:14,582 INFO    [dr1/0] addons/ocm-cluster/test completed in 6.04 seconds
    2025-10-08 22:08:14,583 INFO    [ocm] Environment started in 222.59 seconds

Example run ocm - after:

    % rm drenv.log; drenv start envs/ocm.yaml --local-registry
    2025-10-08 22:52:42,566 INFO    [ocm] Starting environment
    2025-10-08 22:52:42,592 INFO    [dr1] Starting lima cluster
    2025-10-08 22:52:42,592 INFO    [dr2] Starting lima cluster
    2025-10-08 22:52:42,593 INFO    [hub] Starting lima cluster
    2025-10-08 22:53:49,340 INFO    [hub] Cluster started in 66.75 seconds
    2025-10-08 22:53:49,341 INFO    [hub/0] Running addons/ocm-hub/start
    2025-10-08 22:53:49,390 INFO    [dr1] Cluster started in 66.80 seconds
    2025-10-08 22:53:49,390 INFO    [dr1/0] Running addons/ocm-cluster/start
    2025-10-08 22:53:50,483 INFO    [dr2] Cluster started in 67.89 seconds
    2025-10-08 22:53:50,483 INFO    [dr2/0] Running addons/ocm-cluster/start
    2025-10-08 22:54:43,553 INFO    [hub/0] addons/ocm-hub/start completed in 54.21 seconds
    2025-10-08 22:54:43,553 INFO    [hub/0] Running addons/ocm-controller/start
    2025-10-08 22:54:52,894 INFO    [hub/0] addons/ocm-controller/start completed in 9.34 seconds
    2025-10-08 22:56:02,998 INFO    [dr2/0] addons/ocm-cluster/start completed in 132.51 seconds
    2025-10-08 22:56:02,998 INFO    [dr2/0] Running addons/ocm-cluster/test
    2025-10-08 22:56:03,740 INFO    [dr1/0] addons/ocm-cluster/start completed in 134.35 seconds
    2025-10-08 22:56:03,740 INFO    [dr1/0] Running addons/ocm-cluster/test
    2025-10-08 22:56:07,002 INFO    [dr2/0] addons/ocm-cluster/test completed in 4.00 seconds
    2025-10-08 22:56:07,809 INFO    [dr1/0] addons/ocm-cluster/test completed in 4.07 seconds
    2025-10-08 22:56:07,809 INFO    [ocm] Environment started in 205.24 seconds

The total time was slightly lower but starting clusters is very noisy so
more runs are needed to tell if this is a real improvement.

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
Signed-off-by: rakeshgm <rakeshgm@redhat.com>
We always want to test with the latest versions. Hopefully this version
can fix the random failure to delete the volsync snapshot that we see in
the CI.

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
Like e2e, create our standard per-test venv and use it to run drenv.

Previously we installed drenv globally, which modifies the CI hosts and
requires the python3-pip package. With this change we don't modify the
host and there is no need to install the python3-pip package.

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
The current job runs on a random runner every day. Fix it to run on all
runners every day, so we refresh the cache and prune images every day.

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
It is necessary because the deleteVGR function not only deletes the VGR
but also verifies that the VGR was actually deleted from the API server.
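
A sketch of what delete-and-verify looks like with a controller-runtime client; the helper name is illustrative, not the actual test code:

    import (
    	"context"
    	"time"

    	apierrors "k8s.io/apimachinery/pkg/api/errors"
    	"k8s.io/apimachinery/pkg/util/wait"
    	"sigs.k8s.io/controller-runtime/pkg/client"
    )

    func deleteAndVerifyGone(ctx context.Context, c client.Client, obj client.Object) error {
    	if err := c.Delete(ctx, obj); err != nil && !apierrors.IsNotFound(err) {
    		return err
    	}

    	// Deletion is asynchronous (finalizers may delay it), so poll the API
    	// server until the object is actually gone.
    	return wait.PollUntilContextTimeout(ctx, time.Second, 30*time.Second, true,
    		func(ctx context.Context) (bool, error) {
    			err := c.Get(ctx, client.ObjectKeyFromObject(obj), obj)
    			return apierrors.IsNotFound(err), nil
    		})
    }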

Signed-off-by: Elena Gershkovich <elenage@il.ibm.com>
Signed-off-by: Elena Gershkovich <elenage@il.ibm.com>
Signed-off-by: Elena Gershkovich <elenage@il.ibm.com>
- Add support for creating multiple DRPolicies for easier testing
  with different scheduling intervals. Both 1m and 5m DRPolicies
  are now created automatically using a single template.
- Create vrc with 5m scheduling interval.
- Rename vr/vgr/vgrc/vrc sample start-date/test-data files.
- Update e2e configs to use updated 1m policy name.

Signed-off-by: Parikshith <parikshithb@gmail.com>
@BenamarMk BenamarMk force-pushed the fix-data-loss-when-using-cg-workload-for-cephfs-v2 branch from 3c755bc to 7a9d939 on October 13, 2025 12:14
rakeshgm and others added 15 commits October 13, 2025 10:00
Signed-off-by: rakeshgm <rakeshgm@redhat.com>
update groupReplicationID from VGRC while processing
peerClasses

Signed-off-by: rakeshgm <rakeshgm@redhat.com>
add GroupReplicationId to existing tests
add new tests

Signed-off-by: rakeshgm <rakeshgm@redhat.com>

Signed-off-by: rakeshgm <rakeshgm@redhat.com>
Signed-off-by: Elena Gershkovich <elenage@il.ibm.com>
Signed-off-by: Elena Gershkovich <elenage@il.ibm.com>
Signed-off-by: Elena Gershkovich <elenage@il.ibm.com>
Signed-off-by: Elena Gershkovich <elenage@il.ibm.com>
Signed-off-by: Benamar Mekhissi <bmekhiss@ibm.com>
Signed-off-by: Benamar Mekhissi <bmekhiss@ibm.com>
Signed-off-by: Benamar Mekhissi <bmekhiss@ibm.com>
Introduce a Paused flag in the ReplicationGroupDestination to mark it as
non-reconcilable. This state is primarily used during a transition from
primary to secondary, while the new primary is still in its initial
restore phase.
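
In the reconciler this amounts to an early return; a minimal sketch, assuming the flag lives in the RGD spec (field and variable names are illustrative):

    // While the RGD is paused, skip reconciliation entirely; the new primary is
    // still performing its initial restore and the destination must not change.
    if rgd.Spec.Paused {
    	return ctrl.Result{}, nil
    }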

Signed-off-by: Benamar Mekhissi <bmekhiss@ibm.com>
Signed-off-by: Benamar Mekhissi <bmekhiss@ibm.com>
Signed-off-by: Benamar Mekhissi <bmekhiss@ibm.com>
Signed-off-by: Benamar Mekhissi <bmekhiss@ibm.com>
Signed-off-by: Benamar Mekhissi <bmekhiss@ibm.com>
@BenamarMk BenamarMk force-pushed the fix-data-loss-when-using-cg-workload-for-cephfs-v2 branch from 7a9d939 to 1feda0a on October 13, 2025 14:15