GitHub - nimuh/biogeoformer

BioGeoFormer is a protein language model designed to predict and classify microbial proteins involved in key biogeochemical cycles, including methane, sulfur, nitrogen, and phosphorus transformations. Built on the ESM-2 transformer architecture, fine-tuned on curated metabolic pathway databases (MCycDB, NCycDB, PCycDB, SCycDB), and paired with a calibrated confidence function, BioGeoFormer extends sequence-function inference beyond traditional homology-based tools.

Trained on four databases, BioGeoFormer draws on 610 unique gene families covering 37 metabolic pathways. It complements classic approaches to metagenome and genome mining and helps uncover hypothetical gene functions related to biogeochemical cycling. 'BioGeoFormer' is a blanket term for 8 fine-tuned models defined by their clustered identity splits, in which the training, validation, and test sets are n% dissimilar, at 10% intervals from 20% to 90%. While the nuances are described in our manuscript, we found that the 70% split model is the most effective at precisely identifying remote homologues, and we recommend it in most circumstances.

The tool runs on CPU-based infrastructure, but we strongly recommend a GPU for annotating sequences, especially for large datasets, to minimize run time. If you do not have a GPU of your own or access to one through your institution, Google Colab is a user-friendly way to run a notebook with a GPU.
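Before starting a large annotation run, it can help to confirm that a GPU is actually visible from Python. The sketch below assumes the models run on PyTorch, which this README does not state explicitly; `gpu_available` is a hypothetical helper name, not part of BioGeoFormer.

```python
def gpu_available():
    """Return True if PyTorch is installed and reports a usable CUDA device.

    Assumption: BioGeoFormer runs its models through PyTorch.
    """
    try:
        import torch  # imported lazily so the check also works without PyTorch
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


if __name__ == "__main__":
    if gpu_available():
        print("GPU detected")
    else:
        print("No GPU detected; expect slower CPU inference")
```

If this reports no GPU on a machine that has one, the installed PyTorch build may be CPU-only.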

Current version

Version 1.0.0

To download BioGeoFormer

Clone the repository to the location where you intend to run the tool:

git clone https://github.com/nimuh/biogeoformer.git

Navigate to the repository folder, making sure you are in the top-level 'biogeoformer' directory and not in a subdirectory:

cd /path/to/biogeoformer/folder

Install the package, which runs the setup.py script, by entering the following command:

pip install -e .

Formatting input data

Input data must be in FASTA format: each record has a header line with an identifiable sequence ID, followed by the amino acid sequence. File names must end with .fasta, not .faa, for BioGeoFormer to correctly identify the input.
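The requirements above can be checked before running inference. The following is a minimal sketch using only the Python standard library; `check_fasta`, `as_fasta_copy`, and the accepted character set are illustrative assumptions and not part of BioGeoFormer.

```python
import shutil
from pathlib import Path

# Assumed alphabet: the 20 standard amino acids plus common
# ambiguity codes (B, X, Z), U, O, and the stop symbol '*'.
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWYBXZUO*")


def check_fasta(path):
    """Return a list of problems found in a protein FASTA file (empty if none)."""
    path = Path(path)
    problems = []
    if path.suffix != ".fasta":
        problems.append(f"extension is '{path.suffix}', expected '.fasta'")
    seen_header = False
    current_id = None
    with path.open() as handle:
        for line_no, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            if line.startswith(">"):
                seen_header = True
                fields = line[1:].split()
                current_id = fields[0] if fields else None
                if current_id is None:
                    problems.append(f"line {line_no}: header has no sequence ID")
            elif not seen_header:
                problems.append(f"line {line_no}: sequence data before any '>' header")
            elif not set(line.upper()) <= AMINO_ACIDS:
                problems.append(
                    f"line {line_no}: non-amino-acid characters in {current_id}"
                )
    if not seen_header:
        problems.append("no '>' header lines found")
    return problems


def as_fasta_copy(path):
    """Copy e.g. proteins.faa to proteins.fasta and return the new path."""
    path = Path(path)
    target = path.with_suffix(".fasta")
    if target != path:
        shutil.copyfile(path, target)
    return target
```

Note that nucleotide letters (A, C, G, T) are also valid amino acid codes, so this check cannot tell a nucleotide file from a protein file.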

Inference

To run inference on sequences (that is, to functionally annotate them), run the command below; the inference code itself is the inference.py script in the cyc folder within the BioGeoFormer directory. Specify which model split to use with --sim (e.g., --sim 70), the path to the input FASTA file with --fasta_file, and the path of the output .csv file with --annot_file.

Example command:

bgf --sim 70 --fasta_file ./path/to/input/fasta --annot_file ./path/to/output/file
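To annotate several FASTA files in one go, the example command can be wrapped in a short script. This is a sketch only: it assumes the `bgf` command is on your PATH after installation and that `--sim` accepts the eight split values from 20 to 90 described earlier. `build_bgf_command` and `annotate_directory` are hypothetical helper names, not part of BioGeoFormer.

```python
import subprocess
from pathlib import Path

# The eight identity splits: 20, 30, ..., 90.
VALID_SPLITS = range(20, 100, 10)


def build_bgf_command(fasta_file, annot_file, sim=70):
    """Build the argument list for one bgf run without executing it."""
    if sim not in VALID_SPLITS:
        raise ValueError(f"sim must be one of {list(VALID_SPLITS)}, got {sim}")
    return [
        "bgf",
        "--sim", str(int(sim)),
        "--fasta_file", str(fasta_file),
        "--annot_file", str(annot_file),
    ]


def annotate_directory(fasta_dir, out_dir, sim=70, dry_run=False):
    """Run bgf on every .fasta file in fasta_dir, writing one .csv per input.

    With dry_run=True the commands are returned but not executed.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    commands = []
    for fasta in sorted(Path(fasta_dir).glob("*.fasta")):
        cmd = build_bgf_command(fasta, out_dir / f"{fasta.stem}.csv", sim)
        commands.append(cmd)
        if not dry_run:
            # check=True raises CalledProcessError if bgf exits with an error
            subprocess.run(cmd, check=True)
    return commands
```

Calling `annotate_directory("inputs", "annotations", dry_run=True)` returns the commands that would be run, which is a convenient way to review them before starting a long job.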

Preprint:

https://www.biorxiv.org/content/10.64898/2025.12.17.695047v1

Contact

Nima Azbijari: azbijarn@oregonstate.edu

Jacob Wynne: jacobwynne@ucsb.edu

License

BioGeoFormer is under the MIT license.

