Database introduction
fajieyuan edited this page Sep 6, 2025 · 8 revisions
Note: We have published embeddings of all these databases; download them from here. The Swiss-Prot database is on the HuggingFace page here.
| Database | Introduction | Number of proteins | Reference link |
|---|---|---|---|
| Swiss-Prot | Human-reviewed protein sequence database | 500K | https://www.uniprot.org/uniprotkb?query=reviewed:true |
| UniRef50 | Generated by clustering UniProt proteins at 50% sequence identity | 45M | https://www.uniprot.org/help/uniref |
| Uncharacterized | All proteins labeled as "Uncharacterized" on the UniProt website | 30M | https://www.uniprot.org/uniprotkb?query=Uncharacterized |
| OMG_prot50 | Created by clustering the Open MetaGenomic dataset (OMG) at 50% sequence identity | 200M | https://huggingface.co/datasets/tattabio/OMG_prot50 |
| PDB | A database for the three-dimensional structural data of proteins | 700K (every chain in a structure was extracted and counted as one protein) | https://www.rcsb.org |
| GOPC | Global ocean microbiome protein catalog sequences | 2B | https://db.cngb.org/maya/datasets/MDB0000002 |
| NCBI | NCBI protein database | 700M | https://www.ncbi.nlm.nih.gov/protein |
| OMG | The Open MetaGenomic dataset (OMG) | 3.1B | https://huggingface.co/datasets/tattabio/OMG |
| MGnify | MGnify is a free-to-use resource for the analysis, visualisation, and discovery of metagenomic, metatranscriptomic, amplicon, and assembly datasets. We filtered out protein segments and kept only the full-length proteins. | 470M | https://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/2024_04 |
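
The MGnify row mentions keeping only full-length proteins. The exact procedure is not described on this page, so the following is only a minimal sketch of one way such a filter could look. It assumes that each FASTA header carries a whitespace-separated tag marking full-length entries; the default tag `PL=00` and the helper names `read_fasta` and `keep_full_length` are illustrative assumptions, so check the README of the release you download for the actual header fields.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def keep_full_length(
    records: Iterable[Tuple[str, str]],
    full_length_tag: str = "PL=00",  # ASSUMPTION: tag marking full-length entries
) -> Iterator[Tuple[str, str]]:
    """Keep only records whose header contains the full-length tag."""
    for header, seq in records:
        if full_length_tag in header.split():
            yield header, seq


if __name__ == "__main__":
    # Tiny in-memory example; real input would be a file opened line by line.
    demo = [">p1 PL=00", "MKV", "LLA", ">p2 PL=10", "MK", ">p3 PL=00", "MAA"]
    for header, seq in keep_full_length(read_fasta(demo)):
        print(header, len(seq))
```

Because both functions are generators, a file with hundreds of millions of records can be streamed without loading it into memory.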