
Database introduction

fajieyuan edited this page Sep 6, 2025 · 8 revisions

Note: We have published embeddings of all these databases; download them from here. The Swiss-Prot database is on the HuggingFace page here.

| Database | Introduction | Number of proteins | Reference link |
| --- | --- | --- | --- |
| Swiss-Prot | Human-reviewed protein sequence database | 500K | https://www.uniprot.org/uniprotkb?query=reviewed:true |
| UniRef50 | Generated by clustering UniProt proteins at 50% sequence identity | 45M | https://www.uniprot.org/help/uniref |
| Uncharacterized | All proteins labeled as "Uncharacterized" on the UniProt website | 30M | https://www.uniprot.org/uniprotkb?query=Uncharacterized |
| OMG_prot50 | Created by clustering the Open MetaGenomic dataset (OMG) at 50% sequence identity | 200M | https://huggingface.co/datasets/tattabio/OMG_prot50 |
| PDB | Database of three-dimensional structural data for proteins | 700K (every chain in a structure was extracted and counted as one protein) | https://www.rcsb.org |
| GOPC | Global ocean microbiome protein catalog sequences | 2B | https://db.cngb.org/maya/datasets/MDB0000002 |
| NCBI | NCBI protein database | 700M | https://www.ncbi.nlm.nih.gov/protein |
| OMG | The Open MetaGenomic dataset (OMG) | 3.1B | https://huggingface.co/datasets/tattabio/OMG |
| MGnify | A free-to-use resource for analysis, visualisation, and discovery of metagenomic, metatranscriptomic, amplicon, and assembly datasets. We filtered out protein segments and kept only the full-length proteins. | 470M | https://ftp.ebi.ac.uk/pub/databases/metagenomics/peptide_database/2024_04 |
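
The MGnify entry keeps only full-length proteins. The exact procedure used is not described on this page, so the following is only a minimal sketch of such a filter. It assumes that each FASTA header carries a whitespace-separated full-length flag, written here as `FL=1`. That tag, the helper names `read_fasta` and `keep_full_length`, and the demo records are all illustrative assumptions; check the header format documented for the release you download.

```python
from typing import Iterable, Iterator, Tuple

# ASSUMPTION: full-length proteins are marked with this header token.
# Verify against the README of the MGnify release you actually use.
FULL_LENGTH_TAG = "FL=1"


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def keep_full_length(
    records: Iterable[Tuple[str, str]], tag: str = FULL_LENGTH_TAG
) -> Iterator[Tuple[str, str]]:
    """Keep only records whose header contains the full-length tag as a token."""
    for header, seq in records:
        if tag in header.split():
            yield header, seq


# Hypothetical demo records (not real MGnify entries).
example = """>demo_protein_1 FL=1
MKTAYIAK
QRQISFVK
>demo_protein_2 FL=0
SHFSRQ
""".splitlines()

kept = list(keep_full_length(read_fasta(example)))
print(kept)
```

Both functions are generators, so a file can be streamed line by line (for example, by passing an open file handle to `read_fasta`), which matters at the scale of hundreds of millions of sequences.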
