EXGEP: a framework for predicting genotype-by-environment interactions using ensembles of explainable machine-learning models

EXGEP

License: GPL v3 | Python | Platform | Model: EXGEP | Published in Briefings in Bioinformatics

Related Software and Tools

  • DNNGP – Deep neural network for genomic prediction.
  • AutoGS – A framework for predicting genotype-by-environment interactions using ensembles of explainable machine-learning models.
  • GxEtoolkit – An automated and explainable machine-learning framework for genomic prediction.

Getting started

Requirements

  • Python 3.9
  • pip

Installation

Set up the environment and install the dependencies:

  1. Create a Python environment.
conda create -n exgep python=3.9
conda activate exgep
  2. Clone this repository and cd into it.
git clone https://github.com/AIBreeding/EXGEP.git
cd ./EXGEP
  3. Install the required packages.
pip install -r requirements.txt

Basic Usage

import os
import time
import argparse
import pandas as pd
from datetime import datetime
from exgep.data import datautils
from exgep.model import RegEXGEP
from exgep.data.reg_metrics import (mae_score as mae, 
                           mse_score as mse, 
                           rmse_score as rmse, 
                           r2_score as r2,
                           rmsle_score as rmsle, 
                           mape_score as mape, 
                           medae_score as medae, 
                           pcc_score as pcc)


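# Paths to the four input tables
geno = './data/genotype.csv'
phen = './data/pheno.csv'
soil = './data/soil.csv'
weather = './data/weather.csv'

# Merge genotype, phenotype, soil and weather data into one DataFrame
data = datautils.merge_data(geno, phen, soil, weather)
# Predictors are taken from the fourth column onward; the target is 'Yield'
X = pd.DataFrame(data.iloc[:, 3:])
y = pd.Series(data['Yield'])
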
regression = RegEXGEP(
    y=y,
    X=X,
    test_frac=0.1,
    n_splits=10,
    n_trial=5,
    reload_study=True,
    reload_trial_cap=True,
    write_folder=os.getcwd()+'/result/',
    metric_optimise=r2,
    metric_assess=[mae, mse, rmse, pcc, rmsle, mape, medae],
    optimisation_direction='maximize',
    models_to_optimize=['LightGBM'],
    models_to_assess=['LightGBM'],
    boosted_early_stopping_rounds=5,
    random_state=2024
)

# Train the model and print the elapsed wall-clock time in seconds
start = time.time()
regression.train()
end = time.time()
print(end - start)
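
The example above optimises on r2 and assesses with several other metrics imported from exgep.data.reg_metrics. As a rough illustration of what three of those metrics measure, the sketch below computes RMSE, R² and the Pearson correlation coefficient in plain Python. The function names (sketch_rmse, sketch_r2, sketch_pcc) and the sample values are hypothetical and are not taken from the EXGEP code base, whose implementations may differ.

```python
import math


def sketch_rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def sketch_r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def sketch_pcc(y_true, y_pred):
    """Pearson correlation coefficient between observed and predicted values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


# Made-up observed and predicted yields, for illustration only
observed = [5.1, 6.3, 7.2, 8.0]
predicted = [5.0, 6.5, 7.0, 8.3]
print(sketch_rmse(observed, predicted))
print(sketch_r2(observed, predicted))
print(sketch_pcc(observed, predicted))
```

R² and the correlation coefficient improve as they increase, which is why the example pairs metric_optimise=r2 with optimisation_direction='maximize'; an error metric such as RMSE would presumably call for the opposite direction.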

🌲 Optional Tree Models

  • DTR – Decision Tree Regressor
  • ETR – Extra Trees Regressor
  • LightGBM – Light Gradient Boosting Machine
  • XGBoost – Extreme Gradient Boosting
  • CatBoost – Categorical Boosting
  • AdaBoost – Adaptive Boosting
  • GBDT – Gradient Boosting Decision Tree
  • Bagging – Bagging Regressor
  • RF – Random Forest Regressor
  • HistGradientBoosting – Histogram-based Gradient Boosting
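
A small guard can catch misspelled model names before a long optimisation run starts. The sketch below checks a list of names against the abbreviations listed above; check_model_names is a hypothetical helper, and it is only an assumption that RegEXGEP expects exactly these strings in models_to_optimize and models_to_assess.

```python
# Abbreviations copied from the list above. Whether EXGEP accepts exactly
# these strings (including their capitalisation) is an assumption.
TREE_MODELS = {
    "DTR": "Decision Tree Regressor",
    "ETR": "Extra Trees Regressor",
    "LightGBM": "Light Gradient Boosting Machine",
    "XGBoost": "Extreme Gradient Boosting",
    "CatBoost": "Categorical Boosting",
    "AdaBoost": "Adaptive Boosting",
    "GBDT": "Gradient Boosting Decision Tree",
    "Bagging": "Bagging Regressor",
    "RF": "Random Forest Regressor",
    "HistGradientBoosting": "Histogram-based Gradient Boosting",
}


def check_model_names(names):
    """Return the names as a list, or raise ValueError naming the unknown ones."""
    unknown = [name for name in names if name not in TREE_MODELS]
    if unknown:
        raise ValueError(
            f"Unknown model name(s) {unknown}; choose from {sorted(TREE_MODELS)}"
        )
    return list(names)


models = check_model_names(["LightGBM", "XGBoost"])
print(models)
```

The validated list could then be passed as models_to_optimize and models_to_assess.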

Training Example (Parameter Configuration)

python train_exgep.py \
--geno ./data/genotype.csv \
--phen ./data/pheno.csv \
--soil ./data/soil.csv \
--weather ./data/weather.csv \
--target Yield \
--test_size 0.1 \
--n_splits 10 \
--n_trial 5 \
--models_optimize XGBoost LightGBM \
--models_assess XGBoost LightGBM

Training Parameter Details

| Parameter | Description | Default | Type |
|---|---|---|---|
| --geno | Path to genotype CSV file | ./data/genotype.csv | string |
| --phen | Path to phenotype CSV file | ./data/pheno.csv | string |
| --soil | Path to soil CSV file | ./data/soil.csv | string |
| --weather | Path to weather CSV file | ./data/weather.csv | string |
| --target | Target column name in the dataset | Yield | string |
| --test_size | Fraction of data to be used for testing | 0.1 | float |
| --n_splits | Number of splits for cross-validation | 10 | integer |
| --n_trial | Number of optimization trials | 5 | integer |
| --models_optimize | Models to use for hyperparameter optimization | ['XGBoost'] | list |
| --models_assess | Models to use for performance assessment | ['XGBoost'] | list |
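
To make the defaults and types above concrete, here is a minimal argparse sketch that mirrors the table. It is not the actual content of train_exgep.py, which may define its options differently; build_train_parser is a hypothetical name, and using nargs="+" for the two list options is an assumption based on the space-separated model names in the training example.

```python
import argparse


def build_train_parser():
    """Build a parser whose options mirror the documented training parameters."""
    parser = argparse.ArgumentParser(
        description="Sketch of the documented training options (not the real script)"
    )
    parser.add_argument("--geno", type=str, default="./data/genotype.csv",
                        help="Path to genotype CSV file")
    parser.add_argument("--phen", type=str, default="./data/pheno.csv",
                        help="Path to phenotype CSV file")
    parser.add_argument("--soil", type=str, default="./data/soil.csv",
                        help="Path to soil CSV file")
    parser.add_argument("--weather", type=str, default="./data/weather.csv",
                        help="Path to weather CSV file")
    parser.add_argument("--target", type=str, default="Yield",
                        help="Target column name in the dataset")
    parser.add_argument("--test_size", type=float, default=0.1,
                        help="Fraction of data to be used for testing")
    parser.add_argument("--n_splits", type=int, default=10,
                        help="Number of splits for cross-validation")
    parser.add_argument("--n_trial", type=int, default=5,
                        help="Number of optimization trials")
    # nargs="+" collects one or more space-separated names into a list
    parser.add_argument("--models_optimize", nargs="+", default=["XGBoost"],
                        help="Models to use for hyperparameter optimization")
    parser.add_argument("--models_assess", nargs="+", default=["XGBoost"],
                        help="Models to use for performance assessment")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere
args = build_train_parser().parse_args(
    ["--n_trial", "20", "--models_optimize", "XGBoost", "LightGBM"]
)
print(args.n_trial, args.models_optimize, args.test_size)
```

Options left out of the command line fall back to the defaults in the table, so only the values that differ need to be passed.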

Model Explanation Example

python test_explain.py \
--geno ./data/genotype.csv \
--phen ./data/pheno.csv \
--soil ./data/soil.csv \
--weather ./data/weather.csv \
--target Yield \
--model EXGEP \
--job_id 20240813103950 \
--sample 2 \
--feature_i pc2 \
--feature_j RH2M \
--top_features 10 \
--top_interactions 20 \
--test_size 0.1 \
--random_state 2024
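
The --job_id value in the command above, 20240813103950, looks like a timestamp in YYYYMMDDHHMMSS form. That reading is an assumption; the README does not state how job IDs are generated. Under that assumption, the sketch below converts between such an ID and a datetime, which can help when locating the result folder of a particular run. make_job_id and parse_job_id are hypothetical helper names.

```python
from datetime import datetime

# Assumed layout of a job ID: year, month, day, hour, minute, second
JOB_ID_FORMAT = "%Y%m%d%H%M%S"


def make_job_id(moment):
    """Format a datetime as a 14-digit job ID string."""
    return moment.strftime(JOB_ID_FORMAT)


def parse_job_id(job_id):
    """Convert a 14-digit job ID string back into a datetime."""
    return datetime.strptime(job_id, JOB_ID_FORMAT)


started = parse_job_id("20240813103950")
print(started.isoformat())
print(make_job_id(datetime.now()))
```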

Model Explanation Parameter Details

| Parameter | Description | Default | Type |
|---|---|---|---|
| --model | Model to explain | - | string |
| --job_id | Job directory containing trained models | - | string |
| --geno | Path to genotype CSV file | ./data/genotype.csv | string |
| --phen | Path to phenotype CSV file | ./data/pheno.csv | string |
| --soil | Path to soil CSV file | ./data/soil.csv | string |
| --weather | Path to weather CSV file | ./data/weather.csv | string |
| --target | Target column name | Yield | string |
| --test_size | Test set fraction | 0.2 | float |
| --random_state | Random seed for reproducibility | 2024 | integer |
| --sample | Sample index for waterfall plot | 2 | integer |
| --feature_i | Primary feature for dependence plots | pc2 | string |
| --feature_j | Secondary feature for interaction plots | RH2M | string |
| --top_features | Number of top features to display | 20 | integer |
| --top_interactions | Number of top interactions for network | 20 | integer |
| --cluster | Use KMeans clustering for background data | False | flag |
| --n_train_points | Training points for background | 200 | integer |
| --n_test_points | Test samples to explain | None | integer |
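
The last three parameters control how much data the explanation step works with. The sketch below illustrates one plausible reading of --n_train_points and --n_test_points in plain Python: a seeded random subsample of training rows serves as background data, and either all test rows or only the first few are explained. This is an assumption about the behaviour, not code from test_explain.py; pick_background and pick_rows_to_explain are hypothetical names, and the KMeans variant enabled by --cluster is not shown.

```python
import random


def pick_background(train_rows, n_train_points=200, seed=2024):
    """Return at most n_train_points rows, drawn without replacement."""
    if len(train_rows) <= n_train_points:
        return list(train_rows)
    # A dedicated Random instance keeps the draw reproducible for a given seed
    return random.Random(seed).sample(list(train_rows), n_train_points)


def pick_rows_to_explain(test_rows, n_test_points=None):
    """Return all test rows when n_test_points is None, else the first n."""
    if n_test_points is None:
        return list(test_rows)
    return list(test_rows[:n_test_points])


train_rows = list(range(1000))   # stand-in for training samples
test_rows = list(range(100))     # stand-in for test samples
background = pick_background(train_rows)
to_explain = pick_rows_to_explain(test_rows, n_test_points=10)
print(len(background), len(to_explain))
```

Capping the background set keeps explanation time manageable on large training sets, at the cost of a coarser reference distribution.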

📚 Citation

You can read our paper explaining EXGEP.

Yu T, Zhang H, Chen S, et al. EXGEP: a framework for predicting genotype-by-environment interactions using ensembles of explainable machine-learning models. Brief Bioinform, 2025. https://doi.org/10.1093/bib/bbaf414

📜 Copyright and License

This project is free to use for non-commercial purposes - see the LICENSE file for details.

👥 Contacts

For more information, please contact Huihui Li (lihuihui@caas.cn).
