
Incremental Data Migration #18

@tomijaga

Description


Problem

We have a schema-based design that strictly adheres to Candid and its migration policy. This means users have to use either variants or optional fields to add new fields and types to their schema. Neither option provides a good user experience in Motoko, since handling them requires a series of switch statements and extractor functions just to access the newly added field or type. The alternative, adding the field directly to a record, produces an entirely new type that is incompatible with the old one. That forces a migration function to be run over all the stored values to convert them to the new type, at the cost of a considerable amount of cycles depending on how large the dataset is.
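For illustration, here is a minimal sketch of the optional-field workaround and the extractor boilerplate it forces on every caller (the Post type and getComments helper are hypothetical, not part of ZenDB):

    // Sketch of the optional-field workaround: the new `comments` field has to be
    // optional so the record type stays Candid-compatible with documents that were
    // stored before the field existed.
    type Post = {
        id : Nat;
        likes : Nat;
        comments : ?[Text];
    };

    // Every read of the new field needs an extractor like this one.
    func getComments(post : Post) : [Text] {
        switch (post.comments) {
            case (?comments) comments;
            case (null) [];
        };
    };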

Solution

Our schema-based approach can be an advantage here, because the user tells us their intended type definitions. With this information the db can format documents so that they match the intended type. Provided with a set of instructions from the user on how to upgrade the data between schemas, the db can retrieve previous documents and present them in the new type on demand, so the user does not have to spend cycles migrating all the data at once. Instead we unlock a new paradigm, incremental updates, where we migrate documents to the new type only when their content is being updated. For example, adding a new field comments to an existing type Post with two fields, id and likes, would look something like this.

    type PostV1 = {
        id : Nat;
        likes : Nat;
    };

    posts.insert({ id = 1; likes = 328_832 });
    posts.insert({ id = 2; likes = 232 });
    posts.insert({ id = 3; likes = 91 });
         

These three documents are stored internally in the db as PostV1.

    // Candid-style schema value for the new version
    let PostV2Schema = #Record([
        ("id", #Nat), ("likes", #Nat), ("comments", #Array(#Text))
    ]);

    // Corresponding Motoko record type
    type PostV2 = {
        id : Nat;
        likes : Nat;
        comments : [Text];
    };

    // Default value for the newly added field
    let migration_instructions = [("comments", #Array([]))];

    posts.update_schema(PostV2Schema, migration_instructions);
    
    assert posts.search(ZenDB.QueryBuilder()) == #ok([
        { id = 1; likes = 328_832; comments = [] },
        { id = 2; likes = 232; comments = [] },
        { id = 3; likes = 91; comments = [] }
    ]);

The schema is updated, but none of the contents are; they are still stored as PostV1.
However, when the data is retrieved, the migration_instructions are applied to the data stored as PostV1 to transform it to PostV2.
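Conceptually, the read path only has to merge the stored fields with the defaults from migration_instructions. The sketch below assumes a candid-style Value representation and an applyInstructions helper; both names are illustrative, not the actual ZenDB internals:

    import Array "mo:base/Array";
    import Option "mo:base/Option";

    // Hypothetical internal value representation, mirroring the candid-style
    // schema values used above (#Nat, #Text, #Array, #Record).
    type Value = {
        #Nat : Nat;
        #Text : Text;
        #Array : [Value];
        #Record : [(Text, Value)];
    };

    // Fill in any instructed field that the stored document is missing,
    // leaving fields that already exist untouched.
    func applyInstructions(doc : [(Text, Value)], instructions : [(Text, Value)]) : [(Text, Value)] {
        let missing = Array.filter<(Text, Value)>(
            instructions,
            func(entry : (Text, Value)) : Bool {
                Option.isNull(
                    Array.find<(Text, Value)>(doc, func(field : (Text, Value)) : Bool { field.0 == entry.0 })
                )
            }
        );
        Array.append<(Text, Value)>(doc, missing)
    };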

Internal Representation after transform and get

    id   content (stored as PostV1)
    1    { id = 1; likes = 328_832 }
    2    { id = 2; likes = 232 }
    3    { id = 3; likes = 91 }

Because this is most likely a query call, the transformed data will not be saved, so the contents are still stored as type PostV1. However, when a field is updated via an update call, the document is migrated and the result is stored as PostV2:

    let #ok(ids) = posts.update(
        ZenDB.QueryBuilder().Where("id", #eq(#Nat(1))),
        [
            ("likes", #add(#currValue, #Nat(100))),
            ("comments", #Array([#Text("migration")]))
        ]
    );

    assert posts.get(ids[0]) == #ok({ id = 1; likes = 328_932; comments = ["migration"] });

    let #ok(ids2) = posts.update(
        ZenDB.QueryBuilder().Where("id", #eq(#Nat(3))),
        [
            ("likes", #mul(#currValue, #Nat(100)))
        ]
    );

    assert posts.get(ids2[0]) == #ok({ id = 3; likes = 9_100; comments = [] });

Internal Representation after update

    id   content (stored as PostV1)
    2    { id = 2; likes = 232 }

    id   content (stored as PostV2)
    1    { id = 1; likes = 328_932; comments = ["migration"] }
    3    { id = 3; likes = 9_100; comments = [] }

Proposed Features

  • Support Schema Versioning, so users still have access to all the previously saved schemas after a schema is updated
  • Add instructions to migrate data when updating a schema
  • Incrementally migrate documents to the new version of the schema whenever their contents are updated
  • An internal index to keep track of the documents and their schema versions (see the sketch after this list)
    • This will allow users to view the number of documents that have been migrated to each schema version
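A minimal sketch of what that version index could look like, assuming a plain TrieMap keyed by document id (all names here are hypothetical, not the actual ZenDB internals):

    import TrieMap "mo:base/TrieMap";
    import Nat "mo:base/Nat";
    import Hash "mo:base/Hash";

    // Hypothetical version index: document id -> schema version it is stored as.
    let versionIndex = TrieMap.TrieMap<Nat, Nat>(Nat.equal, Hash.hash);

    // Record the schema version a document was last written with.
    func setVersion(docId : Nat, version : Nat) {
        versionIndex.put(docId, version);
    };

    // Count how many documents are currently stored at a given schema version.
    func countAtVersion(version : Nat) : Nat {
        var count = 0;
        for ((_, v) in versionIndex.entries()) {
            if (v == version) { count += 1 };
        };
        count
    };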

Incremental get() migrations

Initially, I thought it would be best to just apply the migration function to the results of any query that doesn't update the data, to accommodate canister query calls that can't persist state. After taking a look at the new-base Motoko library, I noticed an isReplicated() method that essentially checks whether the code is currently executing in an update call or a query call: true means execution is replicated (an update call), false means it is not (a query call). With this we can decide whether to store the migrated data: if execution is replicated, we persist the document under the new schema; if not, we just return the migrated data without saving it.
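Putting that together, the read path could branch on the replication check before deciding whether to persist. This is only a sketch: it assumes isReplicated() behaves as described above, reuses the hypothetical applyInstructions, setVersion, and migration_instructions names from the earlier sketches, and uses storedDocs as a stand-in for the db's internal document store.

    // Sketch of a migration-aware read path. isReplicated(), storedDocs,
    // applyInstructions, setVersion, and migration_instructions are assumed
    // helpers from the sketches above, not the actual ZenDB API.
    func getWithMigration(docId : Nat) : ?[(Text, Value)] {
        switch (storedDocs.get(docId)) {
            case (null) null;
            case (?doc) {
                let migrated = applyInstructions(doc, migration_instructions);
                if (isReplicated()) {
                    // Replicated (update) execution: safe to persist the migrated
                    // document and bump its version in the index.
                    storedDocs.put(docId, migrated);
                    setVersion(docId, 2);
                };
                // Non-replicated (query) execution: just return the migrated view;
                // any state change would be discarded anyway.
                ?migrated
            };
        };
    };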
