nimuh/biogeoformer

BioGeoFormer is a protein language model designed to predict and classify microbial proteins involved in key biogeochemical cycles, including methane, sulfur, nitrogen, and phosphorus transformations. Built on the ESM-2 transformer architecture, fine-tuned on curated metabolic pathway databases (MCycDB, NCycDB, PCycDB, SCycDB), and paired with a calibrated confidence function, BioGeoFormer extends sequence-function inference beyond traditional homology-based tools.

Built on four databases, BioGeoFormer covers 610 unique gene families across 37 metabolic pathways. It complements classic approaches to metagenome and genome mining and helps uncover hypothetical gene functions related to biogeochemical cycling. 'BioGeoFormer' is a blanket term for 8 fine-tuned models, each defined by its clustered identity split: the training, validation, and test sets are n% dissimilar, with n at 10% intervals from 20% to 90%. The nuances are described in our manuscript, but we found that the 70% split model is the most effective at precisely identifying remote homologues, and we recommend it in most circumstances.

Although the tool runs on CPU-based infrastructure, we strongly recommend using a GPU to annotate sequences, especially for large datasets, to ensure the fastest completion time. If you do not have a GPU of your own or access to one through your institution, Google Colab is a user-friendly option for running a notebook with a GPU.
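Before starting a large annotation job, it can help to confirm that a GPU is actually visible to your Python environment. The sketch below is not part of BioGeoFormer. It assumes PyTorch is the underlying framework (an assumption; check the package's dependencies), and `pick_device` is a hypothetical helper name used only for illustration.

```python
import importlib.util


def pick_device() -> str:
    """Return "cuda" if PyTorch can see a GPU, otherwise "cpu"."""
    # Assumption: PyTorch is the backend. If it is not installed,
    # report "cpu" rather than raising an ImportError.
    if importlib.util.find_spec("torch") is None:
        return "cpu"
    import torch

    return "cuda" if torch.cuda.is_available() else "cpu"


if __name__ == "__main__":
    print(f"Device visible to PyTorch: {pick_device()}")
```

If this reports `cpu` on a machine that has a GPU, the usual causes are a CPU-only PyTorch build or missing drivers. In Google Colab, the runtime type must be set to a GPU runtime.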

Current version

Version 1.0.0

To download BioGeoFormer

Clone the repository into the location where you intend to run the tool:

git clone https://github.com/nimuh/biogeoformer.git

Navigate to the repository folder, making sure that you are in the top-level 'biogeoformer' directory and not in any of its subdirectories:

cd /path/to/biogeoformer/folder

Install the package in editable mode (this uses the repository's setup.py) by entering the following command:

pip install -e .

Formatting input data

Input data must be in FASTA format: each record has a header line with an identifiable sequence ID, followed by the biological sequence in amino acid format. File names must end with .fasta, not .faa, for BioGeoFormer to correctly identify the input.
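A quick pre-flight check can catch formatting problems before a long run. The following is a minimal sketch, not BioGeoFormer's own validation. The function name `check_fasta` and the accepted residue alphabet are assumptions made for illustration, since the exact alphabet the model accepts is not stated here.

```python
from pathlib import Path

# The 20 standard amino acids plus common ambiguity codes and the stop symbol.
# Assumption: the alphabet actually accepted by the model may differ.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWYBXZUO*")


def check_fasta(path):
    """Return a list of problems found in a protein FASTA file (empty if none)."""
    path = Path(path)
    problems = []
    if path.suffix != ".fasta":
        problems.append(f"extension is '{path.suffix}', expected '.fasta'")

    current_id = None      # ID of the record being read, None before the first header
    seen_sequence = False  # whether the current record has any sequence lines
    with path.open() as handle:
        for line_no, raw in enumerate(handle, start=1):
            line = raw.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Close out the previous record before starting a new one.
                if current_id is not None and not seen_sequence:
                    problems.append(f"record '{current_id}' has no sequence")
                fields = line[1:].split()
                current_id = fields[0] if fields else ""
                if not current_id:
                    problems.append(f"line {line_no}: header has no sequence ID")
                seen_sequence = False
            else:
                if current_id is None:
                    problems.append(f"line {line_no}: sequence before any header")
                bad = set(line.upper()) - VALID_RESIDUES
                if bad:
                    problems.append(
                        f"line {line_no}: unexpected characters {sorted(bad)}"
                    )
                seen_sequence = True
    if current_id is not None and not seen_sequence:
        problems.append(f"record '{current_id}' has no sequence")
    return problems
```

If the only problem reported is the extension, renaming the file is enough, for example `Path("proteins.faa").rename("proteins.fasta")` (file names here are placeholders).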

Inference

To run inference on sequences (that is, to functionally annotate them), run the command below, which calls the inference.py script in the cyc folder within the BioGeoFormer directory. Specify which model split to use with the --sim option (e.g., --sim 70), the path to the input FASTA file with the --fasta_file option, and the path of the output .csv file with the --annot_file option.

Example command:

bgf --sim 70 --fasta_file ./path/to/input/fasta --annot_file ./path/to/output/file
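When annotating many files, it can be convenient to assemble this command from Python. The sketch below uses only the options shown in the example command (`--sim`, `--fasta_file`, `--annot_file`). The helper `build_bgf_command` and the file names are hypothetical, and the checks reflect the rules stated in this README rather than anything enforced by `bgf` itself.

```python
import shlex
import subprocess

# Model splits described in this README: 20% to 90% at 10% intervals (8 models).
VALID_SPLITS = tuple(range(20, 100, 10))


def build_bgf_command(fasta_file, annot_file, sim=70):
    """Build the argument list for one bgf run, checking inputs first."""
    if sim not in VALID_SPLITS:
        raise ValueError(f"sim must be one of {VALID_SPLITS}, got {sim}")
    if not str(fasta_file).endswith(".fasta"):
        raise ValueError("input file name must end with .fasta")
    return [
        "bgf",
        "--sim", str(sim),
        "--fasta_file", str(fasta_file),
        "--annot_file", str(annot_file),
    ]


if __name__ == "__main__":
    # Placeholder file names, for illustration only.
    cmd = build_bgf_command("proteins.fasta", "annotations.csv")
    print(shlex.join(cmd))
    # Uncomment to run the command (requires BioGeoFormer to be installed):
    # subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when paths contain spaces.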

Preprint:

https://www.biorxiv.org/content/10.64898/2025.12.17.695047v1

Contact

Nima Azbijari: azbijarn@oregonstate.edu

Jacob Wynne: jacobwynne@ucsb.edu

License

BioGeoFormer is under the MIT license.
