Repository Migration

Tools & ideas for migrating data from a MODS-based EQUELLA repository to Datacite InvenioRDM.

Semantics: EQUELLA objects are items with attachments. Invenio objects are records with files. EQUELLA has taxonomies; Invenio has vocabularies. We use these terms consistently so it's clear what format an object is in (e.g. python migrate/record.py item.json > record.json converts an item into a record).

Setup & Tests

uv sync # get dependencies, takes a while due to spacy's en_core_web_lg model
uv run pytest # run tests

Vocabularies

Invenio uses vocabularies to represent a number of fixtures beyond just subject headings, like names, description types, and creator roles. They're stored under the app_data directory and loaded when an instance is initialized. Many of our controlled lists in contribution wizards and EQUELLA taxonomies will be mapped to vocabularies.

The taxos dir contains exported EQUELLA taxonomies and tools for working with them. The vocab dir contains YAML files for Invenio vocabularies.

Notable scripts that create Invenio vocabularies:

taxos/users.py creates the names.yaml and users.yaml fixtures
taxos/roles.py creates the Invenio creatorsroles and contributorsroles vocabularies (relator terms) in a file named roles.yaml

Subjects

We create a few subject vocabularies for different types of terms: "name" for person/org names, "place" for geographic locations, "form" for genre or form terms, and "topic" for topical subjects. We attempt to match terms to URIs from Getty Vocabs or Wikidata, but some local terms use generated UUIDs for identifiers.
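
As a rough illustration of the identifier rule above, the sketch below builds one vocabulary entry and falls back to a generated UUID when no external URI was matched. The function name, the example.org URI, and the scheme value are hypothetical; the id/scheme/subject keys follow the usual shape of Invenio subject fixtures, but the real mk_subjects.py output may differ.

```python
import uuid


def subject_entry(term, scheme, uri=None):
    """Build one subject vocabulary entry (hypothetical helper).

    Uses the matched URI as the id when there is one, otherwise a
    generated UUID, as described for local terms.
    """
    return {
        "id": uri or str(uuid.uuid4()),
        "scheme": scheme,
        "subject": term,
    }


# A matched term keeps its URI; a local term gets a random UUID.
matched = subject_entry("Ceramics", "topic", uri="https://example.org/terms/ceramics")
local = subject_entry("Campus Gallery", "place")
```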

Download the subjects sheet and run python migrate/mk_subjects.py data/subjects.csv. This creates the YAML vocabularies in the vocab dir (lc.yaml and cca_local.yaml) as well as migrate/subjects_map.json, which Record's find_subjects uses to convert the text of VAULT subject terms into Invenio identifiers or keyword subjects without an id.
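
The lookup described above might look something like the following sketch. It assumes subjects_map.json is a flat mapping from term text to identifier, which is a guess at the file's shape; the function name, the case-insensitive matching, and the example.org URI are likewise illustrative rather than the actual find_subjects implementation.

```python
def find_subject(text, subjects_map):
    """Convert one VAULT subject term into an Invenio subject dict.

    Assumes subjects_map maps lowercased term text to an identifier.
    Unmatched terms become free-text keyword subjects with no id.
    """
    term = text.strip()
    subject_id = subjects_map.get(term.lower())
    if subject_id:
        return {"id": subject_id}
    return {"subject": term}


# Hypothetical map contents, standing in for migrate/subjects_map.json
subjects_map = {"ceramics": "https://example.org/terms/ceramics"}
subjects = [find_subject(t, subjects_map) for t in ["Ceramics", "zine fest"]]
```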

If an INVENIO_REPO env var is set, vocabs are copied to the Invenio instance. We should be able to update existing vocabs with invenio rdm add-to-fixture. If not, the site can be rebuilt by running invenio-cli services destroy and then invenio-cli services setup.
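
A minimal sketch of the copy step, assuming INVENIO_REPO points at the root of the Invenio instance and that vocabularies live in app_data/vocabularies beneath it. The destination subdirectory and the function name are assumptions; the project's scripts may place files differently.

```python
import os
import shutil
from pathlib import Path


def copy_vocabs(vocab_dir="vocab", env=os.environ):
    """Copy YAML vocabularies into the Invenio instance, if one is configured.

    Returns the list of destination paths, or an empty list when
    INVENIO_REPO is not set.
    """
    repo = env.get("INVENIO_REPO")
    if not repo:
        return []
    # Assumed destination; adjust to wherever the instance loads fixtures from
    dest = Path(repo) / "app_data" / "vocabularies"
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(Path(vocab_dir).glob("*.yaml")):
        shutil.copy2(src, dest / src.name)
        copied.append(dest / src.name)
    return copied
```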

Creating Records in Invenio

We need to load the necessary fixtures in Invenio before creating records. Anywhere an identifier is used, whether in a subject, resource type, or relation, it must exist prior to being referenced in a record. If we attempt to create a record with an id that doesn't exist, we get a 500 error.
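
One way to catch a missing identifier before Invenio responds with a 500 is to check the record locally first. The sketch below only covers subjects and assumes the set of known ids has already been collected from the vocab YAML files; the function name is hypothetical.

```python
def missing_subject_ids(record, known_ids):
    """List subject ids in a record that are not in the loaded vocabularies.

    Keyword subjects (no "id" key) are ignored because they do not
    reference a fixture.
    """
    subjects = record.get("metadata", {}).get("subjects", [])
    return [s["id"] for s in subjects if "id" in s and s["id"] not in known_ids]


record = {
    "metadata": {
        "subjects": [
            {"id": "https://example.org/terms/ceramics"},
            {"id": "https://example.org/terms/unknown"},
            {"subject": "free text keyword"},
        ]
    }
}
known = {"https://example.org/terms/ceramics"}
problems = missing_subject_ids(record, known)
```

The same pattern could be extended to resource types and relation types, which are subject to the same rule.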

migrate/record.py: converts EQUELLA item(s) into Invenio record JSON
migrate/api.py: converts an item and POSTs it to Invenio to create a metadata-only record
migrate/import.py: imports an item directory (created by our export tool) with its attachments to Invenio

The scripts rely on a personal access token for an administrator account in Invenio:

Sign in as an admin
Go to Applications > Personal access tokens
Create one—its name and the user:email scope (as of v12) do not matter
Copy it to clipboard and Save
Paste in .env and/or set it as an env var, e.g. set -x INVENIO_TOKEN xyz in fish
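
For reference, a script can turn the token from these steps into request headers roughly as follows. This is a sketch, not the code in migrate/api.py: the function name is hypothetical, and it assumes the token is sent as a standard Bearer token, which is how InvenioRDM personal access tokens are normally used.

```python
import os


def auth_headers(env=os.environ):
    """Build API request headers from the personal access token.

    Looks for INVENIO_TOKEN first, then TOKEN, and fails early with a
    clear message instead of letting the API reject the request.
    """
    token = env.get("INVENIO_TOKEN") or env.get("TOKEN")
    if not token:
        raise RuntimeError("Set INVENIO_TOKEN (or TOKEN) in .env or the environment")
    return {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
```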

# fish shell brief example
set -x INVENIO_TOKEN abc123; set -x HOST 127.0.0.1:5000 # better: edit into .env
python migrate/api.py items/item.json
HTTP 201 https://127.0.0.1:5000/api/records/k7qk8-fqq15/draft
HTTP 202 https://127.0.0.1:5000/records/k7qk8-fqq15
...
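
The two responses in the example above correspond to creating a draft and then publishing it. The sketch below builds those two requests with the standard library without sending them. The endpoint paths follow the InvenioRDM REST API; the function names are hypothetical and this is not necessarily how migrate/api.py is structured.

```python
import json
import urllib.request


def draft_request(host, token, record):
    """POST the record JSON to create a draft (expects HTTP 201)."""
    return urllib.request.Request(
        f"https://{host}/api/records",
        data=json.dumps(record).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def publish_request(host, token, record_id):
    """POST to the draft's publish action (expects HTTP 202)."""
    return urllib.request.Request(
        f"https://{host}/api/records/{record_id}/draft/actions/publish",
        headers={"Authorization": f"Bearer {token}"},
        method="POST",
    )


create = draft_request("127.0.0.1:5000", "abc123", {"metadata": {"title": "Test"}})
publish = publish_request("127.0.0.1:5000", "abc123", "k7qk8-fqq15")
```

Passing either object to urllib.request.urlopen would send it; a local instance with a self-signed certificate would also need an SSL context that trusts it.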

Invenio API calls can fail if the .env file in the project root is loaded and contains an outdated personal access token. If API calls fail with 403 errors, check that the TOKEN / INVENIO_TOKEN and HOST environment variables are set correctly.

Rerunning a "migrate" script with the same input creates a new record; it does not update the existing one.

Post Migration Steps

After records are created, they are added to their respective communities, but there are a few more steps that cannot be performed at creation time. We track the created records in an id-map.json file (updated by migrate/import.py) so we know which Invenio record corresponds to which EQUELLA item and what steps remain.
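
To make the tracking idea concrete, here is a sketch of how an entry could be added to the map file. The JSON layout (EQUELLA item UUID as key, with the record id and a list of remaining steps) is purely an assumption for illustration; the actual format written by migrate/import.py may be different.

```python
import json
from pathlib import Path


def track_record(map_file, item_uuid, record_id, steps=("set-owner", "add-editor")):
    """Record which Invenio record an EQUELLA item became (hypothetical layout)."""
    path = Path(map_file)
    # Start from the existing map so earlier entries are preserved
    id_map = json.loads(path.read_text()) if path.exists() else {}
    id_map[item_uuid] = {"record": record_id, "remaining": list(steps)}
    path.write_text(json.dumps(id_map, indent=2))
    return id_map
```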

Change the record owner: records are created by the migration user rather than the account that owned the EQUELLA item. Run uv run invenio cca set-owner --map-file id-map.json
Add collaborators: this matters especially for the Syllabus Collection, where faculty are collaborators on their syllabi and not owners. Run uv run invenio cca add-editor --map-file id-map.json
(TBD) Share with specific users or groups: to emulate EQUELLA's granular ACLs, we may need to share records with specific users or groups
(TBD) Update internal record references: references to other EQUELLA items in metadata must be updated to point to the other items' corresponding Invenio record

There is no order to these steps or interdependencies between them. Code does not exist for the final two steps yet.

The set-owner and add-editor commands skip internal (UUID) EQUELLA users. We do not plan to migrate those accounts.
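
A simple way to detect such accounts is to test whether the username parses as a UUID. This is a sketch of the idea, not the check the commands actually use, and the function name is hypothetical.

```python
import uuid


def is_internal_user(username):
    """Return True when an EQUELLA username looks like a UUID."""
    try:
        uuid.UUID(username)
    except (ValueError, AttributeError, TypeError):
        return False
    return True


owners = ["jdoe", "123e4567-e89b-12d3-a456-426614174000"]
to_migrate = [name for name in owners if not is_internal_user(name)]
```

Note that uuid.UUID also accepts 32 hex digits without hyphens, so a username in that form would be treated as internal too.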

Items

We can download metadata for all items using equella-cli and a script like this:

#!/usr/bin/env fish
set total (eq search -l 1 | jq '.available')
set length 50 # can only download up to 50 at a time
set pages (math floor $total / $length)
for i in (seq 0 $pages)
  set start (math $i x $length)
  echo "Downloading items $start to" (math $start + $length)
  # NOTE: no attachment info, use "--info all" for both attachments & metadata
  eq search -l $length --info metadata --start $start > json/$i.json
end
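
The paging arithmetic in the script can also be expressed in Python, which may be handy if the download is ever folded into the migration code. The sketch below only computes the start offsets and the matching eq search arguments (copied from the script above); the function names are hypothetical and nothing is executed.

```python
def page_starts(total, length=50):
    """Start offsets needed to cover `total` items at `length` per request."""
    return list(range(0, total, length))


def search_command(start, length=50):
    """Argument list mirroring the eq search call in the fish script."""
    return ["eq", "search", "-l", str(length), "--info", "metadata", "--start", str(start)]


commands = [search_command(start) for start in page_starts(120)]
```

Using range avoids requesting one extra, empty page when the total is an exact multiple of the page length, which the floor-based loop above would do.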

Metadata Crosswalk

We can use the item.metadata XML of existing VAULT items for testing. Generally, run python migrate/record.py items/item.json | jq to see the JSON Invenio record. See our crosswalk diagrams.

Schemas:

It's likely our schema is outdated/inaccurate in places.

How to map a field:

Add a brief description to the mermaid diagram in docs/crosswalk.html
Write a test in tests.py with your input XML and expected record output
Write a Record method in migrate.py & use it in the Record::get() dict
Run tests, optionally run a record migration as described above
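
The test-then-method steps above could look roughly like this for a title field. Everything here is a simplified stand-in: the real Record class likely takes an item dict rather than raw XML, and the XPath, the "Untitled" fallback, and the sample XML are assumptions made for the sake of a self-contained example.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: EQUELLA item metadata with a MODS title
MODS_XML = "<xml><mods><titleInfo><title>Sample Syllabus</title></titleInfo></mods></xml>"


class Record:
    """Simplified stand-in for the migration's Record class."""

    def __init__(self, xml_text):
        self.xml = ET.fromstring(xml_text)

    def title(self):
        # Field method: read the MODS title, with an assumed fallback
        el = self.xml.find("./mods/titleInfo/title")
        return el.text.strip() if el is not None and el.text else "Untitled"

    def get(self):
        # Each mapped field is wired into the dict returned by get()
        return {"metadata": {"title": self.title()}}


def test_title():
    """Test pairing input XML with the expected record output."""
    assert Record(MODS_XML).get()["metadata"]["title"] == "Sample Syllabus"
    assert Record("<xml><mods/></xml>").get()["metadata"]["title"] == "Untitled"
```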

LICENSE

ECL Version 2.0

