BioGeoFormer is a protein language model designed to predict and classify microbial proteins involved in key biogeochemical cycles, including methane, sulfur, nitrogen, and phosphorus transformations. Built on the ESM-2 transformer architecture, fine-tuned on curated metabolic pathway databases (MCycDB, NCycDB, PCycDB, SCycDB), and equipped with a calibrated confidence function, BioGeoFormer extends sequence-function inference beyond traditional homology-based tools.
Built on four databases, BioGeoFormer leverages 610 unique gene families to cover 37 metabolic pathways. It complements classic approaches to metagenome and genome mining, helping to uncover hypothetical gene functions related to biogeochemical cycling. 'BioGeoFormer' is a blanket term for the 8 fine-tuned models defined by their clustered identity splits, in which the training, validation, and test sets are n% dissimilar, with n running from 20% to 90% at 10% intervals. The nuance is described in our manuscript, but we found that the 70% split model is the most effective at precisely identifying remote homologues, and we recommend its use in most circumstances.
The tool runs on CPU-based infrastructure, but we strongly recommend using a GPU to annotate sequences in the shortest time, especially for large datasets. If you do not have a GPU of your own or access to one through your institution, Google Colab is a user-friendly option for running a notebook with a GPU.
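Before starting a large annotation job, it can help to confirm whether a GPU is actually visible to Python. The sketch below assumes the model runs on PyTorch, which this README does not state explicitly; `pick_device` is a hypothetical helper and not part of BioGeoFormer.

```python
import importlib.util


def pick_device() -> str:
    """Return "cuda" if PyTorch reports a usable GPU, otherwise "cpu"."""
    # Assumption: inference runs on PyTorch. If torch is not installed,
    # report "cpu" instead of raising an ImportError.
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(f"Inference would run on: {pick_device()}")
```

If this reports "cpu" on a machine that has a GPU, the installed PyTorch build may lack CUDA support.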
Version 1.0.0
Clone the repository to the location where you intend to run the tool:
git clone https://github.com/nimuh/biogeoformer.git
Navigate to the repository folder and make sure that you are in 'biogeoformer' itself and not within any subdirectory:
cd /path/to/biogeoformer/folder
Run the setup.py script through pip by entering the following command:
pip install -e .

Input data must be in FASTA format, with an identifiable sequence ID followed by a biological sequence in amino acid format. File names must end with .fasta, not .faa, for BioGeoFormer to correctly identify the input.
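The input requirements above can be checked before running the tool. The following is a minimal sketch, not part of BioGeoFormer: `check_fasta` is a hypothetical helper, and the set of accepted amino acid symbols is an assumption, since this README does not list them.

```python
from pathlib import Path

# The 20 standard amino acids plus common ambiguity codes and the stop symbol.
# Assumption: the README does not say which symbols BioGeoFormer accepts.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWYBXZUO*")


def check_fasta(path):
    """Return a list of problems found in a protein FASTA file (empty if none)."""
    path = Path(path)
    problems = []
    if path.suffix != ".fasta":
        problems.append(f"{path.name}: extension must be .fasta, got '{path.suffix}'")

    seq_id = None          # ID of the record currently being read
    has_sequence = False   # whether that record has any sequence lines yet
    n_records = 0
    for line_no, raw in enumerate(path.read_text().splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seq_id is not None and not has_sequence:
                problems.append(f"record '{seq_id}' has no sequence")
            header = line[1:].split()
            seq_id = header[0] if header else ""
            if not seq_id:
                problems.append(f"line {line_no}: header has no sequence ID")
            n_records += 1
            has_sequence = False
        elif seq_id is None:
            problems.append(f"line {line_no}: sequence data before any header")
        else:
            has_sequence = True
            bad = set(line.upper()) - AMINO_ACIDS
            if bad:
                problems.append(f"line {line_no}: unexpected characters {sorted(bad)}")

    if n_records == 0:
        problems.append("no FASTA records found")
    elif not has_sequence:
        problems.append(f"record '{seq_id}' has no sequence")
    return problems
```

Note that this check cannot tell a nucleotide file from a protein file, because A, C, G, and T are also valid amino acid letters.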
To run inference on sequences (functional annotation), run the command below, specifying the path to the inference.py script in the cyc folder within the BioGeoFormer directory. Then specify which model split to use (e.g., --sim 70) and the path to the input FASTA file with the --fasta_file option. Lastly, specify the path of the output .csv file with the --annot_file option.
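When many FASTA files need annotating, the options described above can be assembled in a short script. The sketch below uses only the `bgf` entry point and the three options that appear in this README; the helper names, the `fasta_in` and `annotations` directory names, and the one-CSV-per-input naming scheme are hypothetical. It prints the commands rather than running them.

```python
import shlex
from pathlib import Path

# Model splits named in this README: 20% to 90% in 10% steps (8 models).
VALID_SPLITS = tuple(range(20, 100, 10))


def build_bgf_command(fasta_file, annot_file, sim=70):
    """Return the argument list for one bgf run, after basic sanity checks."""
    if sim not in VALID_SPLITS:
        raise ValueError(f"sim must be one of {VALID_SPLITS}, got {sim}")
    fasta_file = Path(fasta_file)
    if fasta_file.suffix != ".fasta":
        raise ValueError(f"input must end with .fasta: {fasta_file}")
    return [
        "bgf",
        "--sim", str(sim),
        "--fasta_file", str(fasta_file),
        "--annot_file", str(annot_file),
    ]


def commands_for_directory(input_dir, output_dir, sim=70):
    """Build one command per .fasta file in input_dir, sorted by file name."""
    output_dir = Path(output_dir)
    return [
        build_bgf_command(f, output_dir / f"{f.stem}.csv", sim)
        for f in sorted(Path(input_dir).glob("*.fasta"))
    ]


if __name__ == "__main__":
    # Hypothetical directory names. To execute instead of print, pass each
    # list to subprocess.run(cmd, check=True) once bgf is installed.
    for cmd in commands_for_directory("fasta_in", "annotations"):
        print(shlex.join(cmd))
```

The default of 70 follows the recommendation given earlier in this README; any other split from 20 to 90 can be passed through `sim`.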
Example command:
bgf --sim 70 --fasta_file ./path/to/input/fasta --annot_file ./path/to/output/file

https://www.biorxiv.org/content/10.64898/2025.12.17.695047v1
Nima Azbijari: azbijarn@oregonstate.edu
Jacob Wynne: jacobwynne@ucsb.edu
BioGeoFormer is under the MIT license.