A framework for predicting genotype-by-environment interactions using ensembles of explainable machine-learning models
- DNNGP – Deep neural network for genomic prediction.
- AutoGS – A framework for predicting genotype-by-environment interactions using ensembles of explainable machine-learning models.
- GxEtoolkit – An automated and explainable machine learning framework for Genome Prediction.
- Python 3.9
- pip
Install packages:
- Create a Python environment.
conda create -n exgep python=3.9
conda activate exgep
- Clone this repository and cd into it.
git clone https://github.com/AIBreeding/EXGEP.git
cd ./EXGEP
pip install -r requirements.txt

import os
import time
import argparse
import pandas as pd
from datetime import datetime
from exgep.data import datautils
from exgep.model import RegEXGEP
from exgep.data.reg_metrics import (mae_score as mae,
mse_score as mse,
rmse_score as rmse,
r2_score as r2,
rmsle_score as rmsle,
mape_score as mape,
medae_score as medae,
pcc_score as pcc)
geno = './data/genotype.csv'
phen = './data/pheno.csv'
soil = './data/soil.csv'
weather = './data/weather.csv'
data = datautils.merge_data(geno, phen, soil, weather)
X = pd.DataFrame(data.iloc[:, 3:])
y = data['Yield']
y = pd.Series(y)
regression = RegEXGEP(
y=y,
X=X,
test_frac=0.1,
n_splits=10,
n_trial=5,
reload_study=True,
reload_trial_cap=True,
write_folder=os.getcwd()+'/result/',
metric_optimise=r2,
metric_assess=[mae, mse, rmse, pcc, rmsle, mape, medae],
optimisation_direction='maximize',
models_to_optimize=['LightGBM'],
models_to_assess=['LightGBM'],
boosted_early_stopping_rounds=5,
random_state=2024
)
start = time.time()
regression.train()
end = time.time()
print(end - start)

- DTR – Decision Tree Regressor
- ETR – Extra Trees Regressor
- LightGBM – Light Gradient Boosting Machine
- XGBoost – Extreme Gradient Boosting
- CatBoost – Categorical Boosting
- AdaBoost – Adaptive Boosting
- GBDT – Gradient Boosting Decision Tree
- Bagging – Bagging Regressor
- RF – Random Forest Regressor
- HistGradientBoosting – Histogram-based Gradient Boosting
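The training example above passes `r2` as `metric_optimise` and several error metrics as `metric_assess`. As a rough illustration of what three of these metrics measure, here is a minimal sketch using textbook definitions in plain Python. The `toy_*` function names and the sample values are hypothetical; this is not the code in `exgep.data.reg_metrics`, whose implementation may differ.

```python
import math

def toy_rmse(y_true, y_pred):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def toy_r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def toy_pcc(y_true, y_pred):
    """Pearson correlation between observed and predicted values."""
    mean_true = sum(y_true) / len(y_true)
    mean_pred = sum(y_pred) / len(y_pred)
    cov = sum((t - mean_true) * (p - mean_pred) for t, p in zip(y_true, y_pred))
    var_true = sum((t - mean_true) ** 2 for t in y_true)
    var_pred = sum((p - mean_pred) ** 2 for p in y_pred)
    return cov / math.sqrt(var_true * var_pred)

observed = [3.0, 5.0, 7.0, 9.0]    # made-up yields, for illustration only
predicted = [2.5, 5.5, 6.5, 9.5]   # made-up predictions

print(toy_rmse(observed, predicted))
print(toy_r2(observed, predicted))
print(toy_pcc(observed, predicted))
```

R² and PCC improve as they increase, which matches `optimisation_direction='maximize'` in the example. An error metric such as RMSE would presumably call for minimization instead.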
python train_exgep.py \
--geno ./data/geno.csv \
--phen ./data/pheno.csv \
--soil ./data/soil.csv \
--weather ./data/weather.csv \
--target Yield \
--test_size 0.1 \
--n_splits 10 \
--n_trial 5 \
--models_optimize XGBoost LightGBM \
--models_assess XGBoost LightGBM

| Parameter | Description | Default | Type |
|---|---|---|---|
| `--geno` | Path to genotype CSV file | ./data/genotype.csv | string |
| `--phen` | Path to phenotype CSV file | ./data/pheno.csv | string |
| `--soil` | Path to soil CSV file | ./data/soil.csv | string |
| `--weather` | Path to weather CSV file | ./data/weather.csv | string |
| `--target` | Target column name in the dataset | Yield | string |
| `--test_size` | Fraction of data to be used for testing | 0.1 | float |
| `--n_splits` | Number of splits for cross-validation | 10 | integer |
| `--n_trial` | Number of optimization trials | 5 | integer |
| `--models_optimize` | Models to use for hyperparameter optimization | ['XGBoost'] | list |
| `--models_assess` | Models to use for performance assessment | ['XGBoost'] | list |
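To make the types and defaults in the table concrete, the following sketch declares the same options with Python's standard `argparse` module. It is a hypothetical stand-in written from the table alone, not the actual contents of `train_exgep.py`; in particular, using `nargs="+"` for the list-valued options is an assumption based on the space-separated model names in the command above.

```python
import argparse

def build_train_parser():
    """Hypothetical parser mirroring the documented training options."""
    parser = argparse.ArgumentParser(description="Sketch of the training CLI")
    parser.add_argument("--geno", type=str, default="./data/genotype.csv")
    parser.add_argument("--phen", type=str, default="./data/pheno.csv")
    parser.add_argument("--soil", type=str, default="./data/soil.csv")
    parser.add_argument("--weather", type=str, default="./data/weather.csv")
    parser.add_argument("--target", type=str, default="Yield")
    parser.add_argument("--test_size", type=float, default=0.1)
    parser.add_argument("--n_splits", type=int, default=10)
    parser.add_argument("--n_trial", type=int, default=5)
    # nargs="+" collects one or more space-separated names into a list
    parser.add_argument("--models_optimize", nargs="+", default=["XGBoost"])
    parser.add_argument("--models_assess", nargs="+", default=["XGBoost"])
    return parser

# Parse an explicit argument list so the sketch runs without a shell.
args = build_train_parser().parse_args(
    ["--n_trial", "5", "--models_optimize", "XGBoost", "LightGBM"]
)
print(args.models_optimize)
print(args.models_assess)
print(args.test_size)
```

Options left off the command line fall back to the defaults from the table, so here `--models_assess` stays at `['XGBoost']` while `--models_optimize` becomes a two-element list.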
python test_explain.py \
--geno ./data/genotype.csv \
--phen ./data/pheno.csv \
--soil ./data/soil.csv \
--weather ./data/weather.csv \
--target Yield \
--model EXGEP \
--job_id 20240813103950 \
--sample 2 \
--feature_i pc2 \
--feature_j RH2M \
--top_features 10 \
--top_interactions 20 \
--test_size 0.1 \
--random_state 2024

| Parameter | Description | Default | Type |
|---|---|---|---|
| `--model` | Model to explain | - | string |
| `--job_id` | Job directory containing trained models | - | string |
| `--geno` | Path to genotype CSV file | ./data/genotype.csv | string |
| `--phen` | Path to phenotype CSV file | ./data/pheno.csv | string |
| `--soil` | Path to soil CSV file | ./data/soil.csv | string |
| `--weather` | Path to weather CSV file | ./data/weather.csv | string |
| `--target` | Target column name | Yield | string |
| `--test_size` | Test set fraction | 0.2 | float |
| `--random_state` | Random seed for reproducibility | 2024 | integer |
| `--sample` | Sample index for waterfall plot | 2 | integer |
| `--feature_i` | Primary feature for dependence plots | pc2 | string |
| `--feature_j` | Secondary feature for interaction plots | RH2M | string |
| `--top_features` | Number of top features to display | 20 | integer |
| `--top_interactions` | Number of top interactions for network | 20 | integer |
| `--cluster` | Use KMeans clustering for background data | False | flag |
| `--n_train_points` | Training points for background | 200 | integer |
| `--n_test_points` | Test samples to explain | None | integer |
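The `--job_id` value in the example command, `20240813103950`, looks like a timestamp in year-month-day-hour-minute-second order. That reading is an inference from the example value, not something the documentation states. Under that assumption, the sketch below shows how such an identifier could be produced and read back with the standard library; `JOB_ID_FORMAT`, `new_job_id`, and `parse_job_id` are hypothetical names that do not come from the EXGEP code.

```python
from datetime import datetime

# Assumed layout, inferred from the example id 20240813103950
JOB_ID_FORMAT = "%Y%m%d%H%M%S"

def new_job_id(now=None):
    """Format a datetime (default: the current time) as a job id string."""
    return (now or datetime.now()).strftime(JOB_ID_FORMAT)

def parse_job_id(job_id):
    """Recover the datetime encoded in a job id string."""
    return datetime.strptime(job_id, JOB_ID_FORMAT)

started = parse_job_id("20240813103950")
print(started.isoformat())
```

If the assumption holds, the most recent training run is simply the job directory whose name sorts last, which can help when choosing a value for `--job_id`.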
You can read our paper explaining EXGEP.
Yu T, Zhang H, Chen S, et al. EXGEP: a framework for predicting genotype-by-environment interactions using ensembles of explainable machine-learning models. Brief Bioinform, 2025. https://doi.org/10.1093/bib/bbaf414
This project is free to use for non-commercial purposes - see the LICENSE file for details.
For more information, please contact Huihui Li (lihuihui@caas.cn).
