Resource identifier mapping

The purpose of this page is to flesh out how we (will) manage unique identifiers for various resources.

Overview

First, at the DB level, all tables have a serial integer primary key following the naming convention <tablename>_id. In the original VegBank, these are implicitly exposed in various URLs and in accession codes, but we neither expect nor want users to think about and use these table-level primary keys.

Second, at the application level, most resources have an accession code stored in an accessionCode field in relevant database tables. See table at the bottom of this page identifying which tables have this field. Note: This field does not have a not-null constraint in the DB tables, and although it should always be populated per application logic, in practice all tables have missing records as of the start of this project (late 2024).

Finally, for external linkages and identification, we could also support annotation of VegBank resources with relevant stable identifiers provided by users and/or minted using external services. These could notably include DOIs for references and records, ORCIDs for people, and RORs for research organizations, along with other custom user-supplied identifiers that link VegBank records to an external system of record (e.g., an agency's internal database).

Requirements (proposed)

For historical fidelity, we need to retain the relationship between existing accession codes and their corresponding resources. More specifically, we need to ensure that VegBank's original citation URLs continue to work in the future, because they may by published and reported in various enduring artifacts.
In order to use accession codes as the primary resource identifier in our API layer, we need to ensure this field exists in all relevant tables, is unique across the entire database, and is populated (not null) for all records.
We should also have a way to store and maintain the relationship between resources and various external IDs (e.g. DOIs, RORs, ORCIDs, and other custom identifiers) that are either supplied by users or assigned by us for external linkage purposes, where such IDs may be optional and/or revokable depending on the use case, and are distinct from our authoritative internal resource identifiers.

Design (WIP)

To meet these requirements, we probably need a new table to manage identifiers and their mapping to various use cases. Here's a simple proposed identifier table that links external identifiers to internal accession codes, with a simple scheme for classifying each identifier by its type, and enforcing uniqueness of each identifier by type. We may want to add other fields to indicate the creator/etc, the date created/modified/etc, the status (active or not), etc, but these are left out for now.

Option 1

CREATE TABLE identifier (
    identifier_id SERIAL,
    resource_accession_code VARCHAR NOT NULL,  -- Reference to accession_code
    identifier_type VARCHAR NOT NULL,          -- e.g., LSID, DOI, ORCID, ROR, LOCAL
    identifier_value VARCHAR NOT NULL,         -- e.g., "10.1234/abcd", "0000-0002-1825-0097"
    PRIMARY KEY (identifier_id)
    UNIQUE (identifier_type, identifier_value) -- Prevent duplicate mappings
);

Option 2

CREATE TABLE identifier (
    identifier_id SERIAL,
    table VARCHAR NOT NULL,                    -- Name of table holding the identified resource
    resource_id INTEGER NOT NULL,              -- FK holding the PK of the resource?
    identifier_type VARCHAR NOT NULL,          -- e.g., DOI, ORCID, ROR, LOCAL
    identifier_value VARCHAR NOT NULL,         -- e.g., "10.1234/abcd", "0000-0002-1825-0097"
    PRIMARY KEY (identifier_id)
    UNIQUE (identifier_type, identifier_value) -- Prevent duplicate mappings
);

Appendix: Tables with `accessionCode`

(This is WIP -- need to complete/verify)

Proposal

New Accession Code Format

We want to change the canonical accession code format to the following:

{2-character table code}.{primary key}

This allows us to infer the accession code of foreign keys without doing a join. For example, I query taxon observations and get back an observation ID. I can then just return to the user ob.<observation_id> and save the extra join to the observation table. See here for the original VegBank mapping of tables to two-character codes, noting that we will use lower-cased characters across the board.

Create new Identifiers Table

Table Code	Primary Key	Identifier Type	Identifier
ob	9999	LSID	urn:lsid:cvs.bio.unc.edu:observation:7297-{21013588-2F2E-47FA-947A-FBD3C1B376AB}
ob	9999	AccessionCode	ob.9999
pl	1234	AccessionCode	VB.PL.48373.VZ17QEZ6PVLCDPY

The identifier table will be available via the resolution service: api.vegbank.org/cite/<accession_code>. Old vegbank.org/cite urls will also need to be set to redirect here. This table will need to be indexed on the identifier field. If we need a service that relates a resource back to its identifiers in this table, we'll also need to index the other two fields.

Under this model, we will not be backfilling the existing null accession code fields. In fact we essentially are no longer using the existing accession code fields on the individual tables as everything will be accessible via its table code and primary key. This also allows for accessing tables that do not currently have accession codes should we want to do that.

The only situation where the above table is queried from an endpoint is during a resolution service. In this case, a user has some old format accession code, they will hit the above service url with their old code and either be redirected immediately to the appropriate resource or receive the appropriate accession code to do it themselves.

Join concerns

The only case we would have to do large scale joins to this new table would be if we decided we wanted legacy accession codes returned on bulk requests, which I don't believe is necessary. Our new queries will return current format accession codes for all records that have foreign keys, and a user could take those keys and resolve them to legacy codes through the /cite endpoint. We could also return legacy codes on single entity requests (like /plot-observation/<accession_code> with little to no overhead if necessary.

Use Cases

User has old accession code they need to resolve to an existing record. They pass that accession code to the resolution service and get the current info back.
User has a new identifier (like a DOI) that they need to add to an existing record

User provides an existing accession code and their new identifier, we do a lookup in the identifier table and create a new row associated with that primary key and table code. This will happen either as a post request to /cite or on a separate endpoint altogether.

User has a new format accession code (ex. ob.9999)

This user can either go directly to api.vegbank.org/plot-observations/ob.9999 to receive the data, or they can also hit api.vegbank.org/cite/ob.9999 which will give them a redirect them to the appropriate resource URL.

Creating the Identifier Table

Take every existing accession code and add a row to the identifier table with the code, primary key and table code. Add a second row with the table code, primary key, and the joined new accession code format (ex. ob.9999) This will result with every record with an existing accession code having 2 records in the table.
Take every record with a null accession code and add a new row with the new accession code format ({table_code}.{PK}). This will result in every record with a null accession code having one row in the table.

Adding new records

When a new record is inserted into a table that is accessible, also insert a new record into the identifier table with the Table Code.PK format.

Questions

Do we need a column in the identifier table for URLs?
For existing accession codes for plot, we don't currently have an endpoint in the design that resolves these. Only Observation accession codes are resolvable. Do we need a plot endpoint for this reason?
Do we need an identifier type field? If a hypothetical user wanted just the lsid identifiers for a given record, would we need an ID type field, or do we just want the resolution service to return all the different identifiers for a record along with its correct resource resolution? ANSWERED: We do, it's been added to the above example table.
Do we want different identifier types for the old accession code format and the new one? (VB.Ob.9999.EXAMPLE vs ob.9999) ANSWERED: We do want different names, as it will allow us to potentially create a "preferred" identifier type, which could be our accession code, or DOIs, or anything else in the future should the need arise. We could display these differently on the front end.
Do we need to have the full human readable name of the table in a separate field?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Resource identifier mapping

Overview

Requirements (proposed)

Design (WIP)

Option 1

Option 2

Appendix: Tables with `accessionCode`

Proposal

New Accession Code Format

Create new Identifiers Table

Join concerns

Use Cases

Creating the Identifier Table

Adding new records

Questions

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally

Resource identifier mapping

Overview

Requirements (proposed)

Design (WIP)

Option 1

Option 2

Appendix: Tables with accessionCode

Proposal

New Accession Code Format

Create new Identifiers Table

Join concerns

Use Cases

Creating the Identifier Table

Adding new records

Questions

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally

Appendix: Tables with `accessionCode`