Add and update resource descriptors using bioregistry conventions #335

Open
sjm41 wants to merge 1 commit into master from add-bioregistry-resource-descriptors
Conversation

Contributor

sjm41 commented Apr 16, 2026

Summary

  • Add new resource descriptor entries for biocyc, cazy, kegg.pathway, merops.family, and wikipathways
  • Update existing TCDB entry to use bioregistry prefix (tcdb), identifier pattern, and resolution URL
  • All new/updated entries use bioregistry.io resolution URLs and identifier patterns sourced from https://bioregistry.io

Test plan

  • Verify YAML syntax is valid
  • Confirm each new gid_pattern matches its corresponding example_gid
  • Confirm each default_url resolves correctly when the example identifier is substituted
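
The second check in the test plan could be automated along these lines. This is a minimal sketch, not part of the PR: the `check_entries` helper and the `DEMO` entry are hypothetical, the entries are inlined as dictionaries rather than loaded from the YAML file, and either `example_gid` or `example_id` is accepted as the example key because both names appear in this thread.

```python
import re

# Entries inlined for illustration. The KEGG entry is copied from this thread;
# the DEMO entry is hypothetical and deliberately broken.
ENTRIES = [
    {
        "db_prefix": "KEGG",
        "example_id": "KEGG:05220",
        "gid_pattern": r"^KEGG:\w*:*\d+$",
    },
    {
        "db_prefix": "DEMO",
        "example_id": "DEMO:abc",
        "gid_pattern": r"^DEMO:\d+$",
    },
]


def check_entries(entries):
    """Return (db_prefix, example, pattern) for every entry whose example
    identifier does not match its own gid_pattern."""
    failures = []
    for entry in entries:
        # Assumption: the example may be stored under either key name.
        example = entry.get("example_gid", entry.get("example_id"))
        pattern = entry.get("gid_pattern")
        if example is None or pattern is None:
            continue  # nothing to compare for this entry
        if re.search(pattern, example) is None:
            failures.append((entry["db_prefix"], example, pattern))
    return failures


failures = check_entries(ENTRIES)
for prefix, example, pattern in failures:
    print(f"{prefix}: example {example!r} does not match {pattern!r}")
```

In a real run the list of entries would come from parsing the YAML file, and an empty `failures` list would mean every example matches its pattern.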

🤖 Generated with Claude Code

Add new entries for biocyc, cazy, kegg.pathway, merops.family, and
wikipathways. Update existing TCDB entry to use bioregistry prefix,
pattern, and URL. All new entries use bioregistry.io resolution URLs
and identifier patterns from https://bioregistry.io.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Contributor Author

sjm41 commented Apr 16, 2026

Note: the existing TCDB entry was not based on bioregistry, and its gid_pattern only allowed full 4-character IDs. The new pattern comes from bioregistry and allows xrefs to 1-, 2-, or 3-character IDs as well as full 4-character IDs.

Contributor Author

sjm41 commented Apr 16, 2026

Another note:
I added a new 'kegg.pathway' entry based on https://bioregistry.io/registry/kegg.pathway.

There is one existing KEGG entry in the yaml file that looks like this:

  - db_prefix: KEGG
    name: Kyoto Encyclopedia of Genes and Genomes
    example_id: KEGG:05220
    gid_pattern: "^KEGG:\\w*:*\\d+$"
    default_url: http://www.genome.jp/dbget-bin/www_bget?map[%s]
    pages:
      - name: ontology_provided_cross_reference
        url: http://www.genome.jp/dbget-bin/www_bget?map[%s]

That entry doesn't look right, so I asked Claude:

Can you tell what the pre-existing KEGG entry was meant to represent? The details don't look good to me.

Looking at the existing entry:

It looks like it was meant to represent KEGG pathways, based on:

  • The URL uses map[%s] — the map prefix is KEGG's convention for pathway maps
  • 05220 is a KEGG pathway number (likely hsa05220 = "Chronic myeloid leukemia")

But there are several problems:

  1. Incomplete example ID — KEGG:05220 is missing the organism prefix (e.g. hsa05220)
  2. Sloppy gid_pattern — ^KEGG:\w*:*\d+$ is very loose, allowing optional characters and colons in odd places
  3. Outdated URL — uses http://www.genome.jp rather than the current https://www.kegg.jp
  4. Only used for ontology_provided_cross_reference — likely added to resolve cross-references from disease ontologies (like DOID) pointing to
    KEGG pathways

It seems like the new kegg.pathway entry we just added is a proper replacement. Would you like to remove this old KEGG entry?


This PR doesn't delete the existing KEGG entry, because I wasn't sure exactly what might be using it via "ontology_provided_cross_reference". But I suspect Claude is right, and it should be deleted (with my new kegg.pathway entry used instead).
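
To make the looseness of the old gid_pattern concrete, here is a small sketch that runs the pattern quoted above against a few strings. The candidate strings are made up for illustration and are not real cross-references from any ontology.

```python
import re

# The pattern from the existing KEGG entry, after YAML unescaping.
OLD_KEGG_PATTERN = re.compile(r"^KEGG:\w*:*\d+$")

# Hypothetical candidates, chosen to probe what the pattern lets through.
candidates = [
    "KEGG:05220",        # the entry's own example
    "KEGG:hsa05220",     # organism-prefixed pathway ID
    "KEGG:::5",          # extra colons, no pathway-like ID
    "KEGG:foo_bar:::1",  # arbitrary word characters plus colons
    "KEGG:hsa",          # no trailing digits
]

accepted = [c for c in candidates if OLD_KEGG_PATTERN.search(c)]
rejected = [c for c in candidates if c not in accepted]

print("accepted:", accepted)
print("rejected:", rejected)
```

Only the string without trailing digits is rejected; the pattern accepts everything else, including identifiers with stray colons.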
