Leverage unset to avoid ingesting tombstones at the target cluster by default #419

Merged: msmygit merged 14 commits into main from bugfix/honor_unset on Feb 26, 2026

@msmygit msmygit commented Feb 20, 2026

What this PR does: Makes CDM leave a bind value unset at the target cluster when the corresponding column value in the origin table is null or empty, so that migrating such values does not write tombstones at the target.

Which issue(s) this PR fixes:
Fixes #418

Checklist:

  • Automated Tests added/updated
  • Documentation added/updated
  • CLA Signed: DataStax CLA

Test Strategy

How was this tested?

Origin Cluster - HCD 1.2.4

hcd-1.2.4 % bin/cqlsh localhost -u cassandra -p cassandra

Warning: Using a password on the command line interface can be insecure.
Recommendation: use the credentials file to securely provide the password.

Connected to Test Cluster at localhost:9042
[cqlsh 6.0.0 | Cassandra 4.0.11.0-d86e224aa19c | CQL spec 3.4.5 | Native protocol v5]
Use HELP for help.
cassandra@cqlsh> desc keyspaces;

system       system_distributed  system_traces  system_virtual_schema
system_auth  system_schema       system_views 

cassandra@cqlsh> create keyspace cdm WITH replication = {'class': 'NetworkTopologyStrategy', 'datacenter1':1};
cassandra@cqlsh> use cdm;
cassandra@cqlsh:cdm> create table if not exists test(pk int primary key,cset set<int>,ceset set<text>,clist list<int>,celist list<float>,cmap map<int,int>,cemap map<text,int>);
cassandra@cqlsh:cdm> insert into test(pk,cset,ceset,clist,celist,cmap,cemap) values (0,{0,1,2},null,[0,1,2],null,{0:0,1:1},null);

Warnings :
Replica disk usage exceeds warn threshold

cassandra@cqlsh:cdm> select * from cdm.test;

 pk | celist | cemap | ceset | clist     | cmap         | cset
----+--------+-------+-------+-----------+--------------+-----------
  0 |   null |  null |  null | [0, 1, 2] | {0: 0, 1: 1} | {0, 1, 2}

(1 rows)
cassandra@cqlsh:cdm>

CDM Migrate Job Run

cdmt.properties file structure

spark.cdm.connect.origin.username mysuperuser
spark.cdm.connect.origin.password SuperSecurePWD

spark.cdm.connect.target.password AstraCS:SuperSecurePWD
spark.cdm.connect.target.username token
spark.cdm.connect.target.astra.database.id 0e0ab676-914f-4560-b15d-1a712a8fc47e
spark.cdm.connect.target.astra.scb.region westus3

spark.cdm.schema.origin.keyspaceTable cdm.test
spark.cdm.schema.target.keyspaceTable default_keyspace.test

Migrate Step

./bin/spark-submit --properties-file ./cdmt.properties --master "local[*]" --driver-memory 10G --executor-memory 10G --conf spark.driver.extraJavaOptions='-Dlog4j2.level=DEBUG -Dlog4j2.rootLogger.level=DEBUG' --conf spark.executor.extraJavaOptions='-Dlog4j2.level=DEBUG -Dlog4j2.rootLogger.level=DEBUG' --class com.datastax.cdm.job.Migrate /path/to/cassandra-data-migrator/target/cassandra-data-migrator-5.7.3-SNAPSHOT.jar

26/02/22 09:55:25 INFO Migrate$: ################################################################################################
26/02/22 09:55:25 INFO Migrate$: ###                                  Migrate Job - Starting                                  ###
26/02/22 09:55:25 INFO Migrate$: ################################################################################################
...
26/02/22 09:55:26 INFO ConnectionFetcher: Connecting to ORIGIN at localhost:9042
26/02/22 09:55:26 INFO ConnectionFetcher: Auto-downloading secure connect bundle for TARGET 0e0ab676-914f-4560-b15d-1a712a8fc47e westus3
26/02/22 09:55:26 INFO AstraDevOpsClient: Auto-downloading secure connect bundle for TARGET database ID: 0e0ab676-914f-4560-b15d-1a712a8fc47e, type: default, region: westus3
26/02/22 09:55:27 INFO AstraDevOpsClient: Downloading secure bundle from URL: https://datastax-cluster-config-prod.s3.us-east-2.amazonaws.com/0e0ab676-914f-4560-b15d-1a712a8fc47e-1/secure-connect-azure.zip?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIA2AIQRQ76WSWBKT4M%2F20260222%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20260222T145527Z&X-Amz-Expires=300&X-Amz-SignedHeaders=host&X-Amz-Signature=18323c65b66cb444dd2644b0914eabd2aa3426d34f0dc896d09e2689ef8199c5
26/02/22 09:55:27 INFO AstraDevOpsClient: Secure bundle downloaded successfully to: /var/folders/5_/v9cyt_kn68sg2nnv58x9nwrh0000gp/T/cdm-scb-9e4c84a3-bceb-42ee-86ad-a60160b55e9910830699588580089820/target-secure-bundle.zip
26/02/22 09:55:27 INFO ConnectionFetcher: Successfully auto-downloaded secure bundle for TARGET: file:///var/folders/5_/v9cyt_kn68sg2nnv58x9nwrh0000gp/T/cdm-scb-9e4c84a3-bceb-42ee-86ad-a60160b55e9910830699588580089820/target-secure-bundle.zip
26/02/22 09:55:27 INFO ConnectionFetcher: PARAM --  SSL Enabled: false
26/02/22 09:55:27 INFO ConnectionFetcher: Connecting to TARGET using SCB file:///var/folders/5_/v9cyt_kn68sg2nnv58x9nwrh0000gp/T/cdm-scb-9e4c84a3-bceb-42ee-86ad-a60160b55e9910830699588580089820/target-secure-bundle.zip
26/02/22 09:55:27 INFO ContactPoints: Contact point localhost:9042 resolves to multiple addresses, will use them all ([localhost/127.0.0.1:9042, localhost/[0:0:0:0:0:0:0:1]:9042])
...
26/02/22 09:55:43 INFO DAGScheduler: Job 0 finished: foreach at Migrate.scala:46, took 12.768252 s
26/02/22 09:55:43 INFO JobCounter: ################################################################################################
26/02/22 09:55:43 INFO JobCounter: Final Read Record Count: 1
26/02/22 09:55:43 INFO JobCounter: Final Write Record Count: 1
26/02/22 09:55:43 INFO JobCounter: Final Skipped Record Count: 0
26/02/22 09:55:43 INFO JobCounter: Final Error Record Count: 0
26/02/22 09:55:43 INFO JobCounter: Final Partitions Passed: 5000
26/02/22 09:55:43 INFO JobCounter: Final Partitions Failed: 0
26/02/22 09:55:43 INFO JobCounter: ################################################################################################
26/02/22 09:55:43 INFO SparkContext: Successfully stopped SparkContext
26/02/22 09:55:43 INFO Migrate$: ################################################################################################
26/02/22 09:55:43 INFO Migrate$: ###                                  Migrate Job - Stopped                                   ###
26/02/22 09:55:43 INFO Migrate$: ################################################################################################

Target Cluster - Astra DB Serverless content after the migration

token@cqlsh:default_keyspace> select * from test;

 pk | celist | cemap | ceset | clist     | cmap         | cset
----+--------+-------+-------+-----------+--------------+-----------
  0 |   null |  null |  null | [0, 1, 2] | {0: 0, 1: 1} | {0, 1, 2}

(1 rows)

This was also validated by the following DEBUG log lines:

21:15:01.090 [Executor task launch worker for task 1748.0 in stage 0.0 (TID 1748)] DEBUG com.datastax.cdm.cql.statement.TargetInsertStatement - Unsetting column celist at bind index 3 to avoid tombstone
21:15:01.090 [Executor task launch worker for task 1748.0 in stage 0.0 (TID 1748)] DEBUG com.datastax.cdm.cql.statement.TargetInsertStatement - Unsetting column cemap at bind index 4 to avoid tombstone
21:15:01.090 [Executor task launch worker for task 1748.0 in stage 0.0 (TID 1748)] DEBUG com.datastax.cdm.cql.statement.TargetInsertStatement - Unsetting column ceset at bind index 5 to avoid tombstone
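
The DEBUG lines above show individual bind markers being skipped for one row. As a rough, driver-free sketch of that decision (not CDM's actual code), the Java below picks the columns to leave unset. The class `UnsetDecisionSketch`, its methods, and the rule that an empty collection counts as "empty" are assumptions made for illustration. With the DataStax Java driver, the selected columns would be left unset (for example via `unset(int)` on the bound statement) instead of being bound to null.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: decides which columns of a row should stay unset.
final class UnsetDecisionSketch {

    // Assumption: null values and empty collections/maps would only produce
    // a tombstone at the target, so they are candidates for "unset".
    static boolean shouldUnset(Object value) {
        if (value == null) {
            return true;
        }
        if (value instanceof Collection) {
            return ((Collection<?>) value).isEmpty();
        }
        if (value instanceof Map) {
            return ((Map<?, ?>) value).isEmpty();
        }
        return false;
    }

    // Returns the column names to leave unset, in the row's iteration order.
    static List<String> columnsToUnset(Map<String, Object> row) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Object> column : row.entrySet()) {
            if (shouldUnset(column.getValue())) {
                result.add(column.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Mirrors the pk=0 row inserted in the test above.
        Map<String, Object> row = new LinkedHashMap<>();
        row.put("pk", 0);
        row.put("celist", null);
        row.put("cemap", null);
        row.put("ceset", null);
        row.put("clist", List.of(0, 1, 2));
        row.put("cmap", Map.of(0, 0, 1, 1));
        row.put("cset", Set.of(0, 1, 2));
        System.out.println(columnsToUnset(row)); // prints [celist, cemap, ceset]
    }
}
```

The sketch reports column names rather than bind indices, because the bind order is decided by CDM and is not reproduced here.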

@msmygit msmygit self-assigned this Feb 20, 2026
@msmygit msmygit added the bug and enhancement labels Feb 20, 2026
@msmygit msmygit marked this pull request as ready for review February 20, 2026 19:59
@msmygit msmygit requested a review from a team as a code owner February 20, 2026 19:59
}

- TypeCodec<Object> fromCodec = (TypeCodec<Object>) codecRegistry.codecFor(toDataType, fromClass);
+ TypeCodec<Object> fromCodec = (TypeCodec<Object>) codecRegistry.codecFor(fromDataType, fromClass);

msmygit (Member, Author) commented:

Tip: 💡 Code reviewer's assist. This has been an oversight for a long time, and it is addressed here.

Collaborator replied:

Good find!!

pravinbhat (Collaborator) left a comment:

Looks good. I tested it locally with two rows:
insert into test(pk,cset,ceset,clist,celist,cmap,cemap) values (0,{0,1,2},null,[0,1,2],null,{0:0,1:1},null);
insert into test(pk,cset,ceset,clist,celist,cmap,cemap) values (1,{},null,[],null,{},null);

I then ran the migration, which copied both rows correctly. To verify that no tombstones were written, I compared the table dumps; as shown below, the target table has no tombstones for the null or empty columns.

Origin table dump (has tombstones)

[
  {
    "partition" : {
      "key" : [ "1" ],
      "position" : 18
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 18,
        "liveness_info" : { "tstamp" : "2026-02-24T03:50:04.964017Z" },
        "cells" : [
          { "name" : "celist", "deletion_info" : { "marked_deleted" : "2026-02-24T03:50:04.964016Z", "local_delete_time" : "2026-02-24T03:50:04Z" } },
          { "name" : "cemap", "deletion_info" : { "marked_deleted" : "2026-02-24T03:50:04.964016Z", "local_delete_time" : "2026-02-24T03:50:04Z" } },
          { "name" : "ceset", "deletion_info" : { "marked_deleted" : "2026-02-24T03:50:04.964016Z", "local_delete_time" : "2026-02-24T03:50:04Z" } },
          { "name" : "clist", "deletion_info" : { "marked_deleted" : "2026-02-24T03:50:04.964016Z", "local_delete_time" : "2026-02-24T03:50:04Z" } },
          { "name" : "cmap", "deletion_info" : { "marked_deleted" : "2026-02-24T03:50:04.964016Z", "local_delete_time" : "2026-02-24T03:50:04Z" } },
          { "name" : "cset", "deletion_info" : { "marked_deleted" : "2026-02-24T03:50:04.964016Z", "local_delete_time" : "2026-02-24T03:50:04Z" } }
        ]
      }
    ]
  },
  {
    "partition" : {
      "key" : [ "0" ],
      "position" : 80
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 80,
        "liveness_info" : { "tstamp" : "2026-02-24T03:49:36.078759Z" },
        "cells" : [
          { "name" : "celist", "deletion_info" : { "marked_deleted" : "2026-02-24T03:49:36.078758Z", "local_delete_time" : "2026-02-24T03:49:36Z" } },
          { "name" : "cemap", "deletion_info" : { "marked_deleted" : "2026-02-24T03:49:36.078758Z", "local_delete_time" : "2026-02-24T03:49:36Z" } },
          { "name" : "ceset", "deletion_info" : { "marked_deleted" : "2026-02-24T03:49:36.078758Z", "local_delete_time" : "2026-02-24T03:49:36Z" } },
          { "name" : "clist", "deletion_info" : { "marked_deleted" : "2026-02-24T03:49:36.078758Z", "local_delete_time" : "2026-02-24T03:49:36Z" } },
          { "name" : "clist", "path" : [ "d6420050-1133-11f1-abcf-e78dfc7c9cff" ], "value" : 0 },
          { "name" : "clist", "path" : [ "d6420051-1133-11f1-abcf-e78dfc7c9cff" ], "value" : 1 },
          { "name" : "clist", "path" : [ "d6420052-1133-11f1-abcf-e78dfc7c9cff" ], "value" : 2 },
          { "name" : "cmap", "deletion_info" : { "marked_deleted" : "2026-02-24T03:49:36.078758Z", "local_delete_time" : "2026-02-24T03:49:36Z" } },
          { "name" : "cmap", "path" : [ "0" ], "value" : 0 },
          { "name" : "cmap", "path" : [ "1" ], "value" : 1 },
          { "name" : "cset", "deletion_info" : { "marked_deleted" : "2026-02-24T03:49:36.078758Z", "local_delete_time" : "2026-02-24T03:49:36Z" } },
          { "name" : "cset", "path" : [ "0" ], "value" : "" },
          { "name" : "cset", "path" : [ "1" ], "value" : "" },
          { "name" : "cset", "path" : [ "2" ], "value" : "" }
        ]
      }
    ]
  }
]

Target table dump (has NO tombstones for the null or empty columns)

[
  {
    "partition" : {
      "key" : [ "1" ],
      "position" : 18
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 18,
        "liveness_info" : { "tstamp" : "2025-11-04T15:00:07Z" },
        "cells" : [ ]
      }
    ]
  },
  {
    "partition" : {
      "key" : [ "0" ],
      "position" : 42
    },
    "rows" : [
      {
        "type" : "row",
        "position" : 42,
        "liveness_info" : { "tstamp" : "2025-11-04T15:00:07Z" },
        "cells" : [
          { "name" : "clist", "deletion_info" : { "marked_deleted" : "2025-11-04T15:00:06.999999Z", "local_delete_time" : "2026-02-24T03:54:13Z" } },
          { "name" : "clist", "path" : [ "7bb83f40-1134-11f1-abcf-e78dfc7c9cff" ], "value" : 0 },
          { "name" : "clist", "path" : [ "7bb83f41-1134-11f1-abcf-e78dfc7c9cff" ], "value" : 1 },
          { "name" : "clist", "path" : [ "7bb83f42-1134-11f1-abcf-e78dfc7c9cff" ], "value" : 2 },
          { "name" : "cmap", "deletion_info" : { "marked_deleted" : "2025-11-04T15:00:06.999999Z", "local_delete_time" : "2026-02-24T03:54:13Z" } },
          { "name" : "cmap", "path" : [ "0" ], "value" : 0 },
          { "name" : "cmap", "path" : [ "1" ], "value" : 1 },
          { "name" : "cset", "deletion_info" : { "marked_deleted" : "2025-11-04T15:00:06.999999Z", "local_delete_time" : "2026-02-24T03:54:13Z" } },
          { "name" : "cset", "path" : [ "0" ], "value" : "" },
          { "name" : "cset", "path" : [ "1" ], "value" : "" },
          { "name" : "cset", "path" : [ "2" ], "value" : "" }
        ]
      }
    ]
  }
]
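
The comparison of the two dumps can also be done mechanically. The sketch below, which is not part of CDM or of any Cassandra tool, scans the cells of one dumped row and reports the columns that carry a `deletion_info` marker but no element cells, which is how the null columns appear in the origin dump. The class and method names are hypothetical, and the regular expression assumes the exact cell layout shown in the dumps above; a real check would use a proper JSON parser.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: finds columns whose only cell in a row is a deletion marker.
final class TombstoneOnlyColumns {

    // Matches the start of a cell such as { "name" : "clist", "deletion_info" ...
    // or { "name" : "clist", "path" ... (layout assumed from the dumps above).
    private static final Pattern CELL = Pattern.compile(
            "\\{\\s*\"name\"\\s*:\\s*\"([^\"]+)\"\\s*,\\s*\"(deletion_info|path)\"");

    static Set<String> tombstoneOnlyColumns(String rowCellsJson) {
        Set<String> deleted = new TreeSet<>(); // columns with a deletion marker
        Set<String> live = new HashSet<>();    // columns with at least one element cell
        Matcher m = CELL.matcher(rowCellsJson);
        while (m.find()) {
            if ("deletion_info".equals(m.group(2))) {
                deleted.add(m.group(1));
            } else {
                live.add(m.group(1));
            }
        }
        deleted.removeAll(live); // keep only columns that have no live elements
        return deleted;
    }

    public static void main(String[] args) {
        // Three cells copied from the pk=0 row of the origin dump.
        String cells =
              "{ \"name\" : \"celist\", \"deletion_info\" : { \"marked_deleted\" : \"2026-02-24T03:49:36.078758Z\", \"local_delete_time\" : \"2026-02-24T03:49:36Z\" } },\n"
            + "{ \"name\" : \"clist\", \"deletion_info\" : { \"marked_deleted\" : \"2026-02-24T03:49:36.078758Z\", \"local_delete_time\" : \"2026-02-24T03:49:36Z\" } },\n"
            + "{ \"name\" : \"clist\", \"path\" : [ \"d6420050-1133-11f1-abcf-e78dfc7c9cff\" ], \"value\" : 0 }";
        System.out.println(tombstoneOnlyColumns(cells)); // prints [celist]
    }
}
```

Applied to the full pk=0 row, this approach would flag `celist`, `cemap`, and `ceset` in the origin dump and nothing in the target dump, since the remaining `deletion_info` entries there belong to columns that also have element cells.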


@msmygit msmygit enabled auto-merge (squash) February 26, 2026 20:25
@msmygit msmygit merged commit 22835a3 into main Feb 26, 2026
10 of 11 checks passed
@msmygit msmygit deleted the bugfix/honor_unset branch February 26, 2026 20:28

Labels: bug, enhancement

Linked issue: Null fields from origin to unset at target cluster does not work in all cases

3 participants