Leveraging Automated Machine Learning for Environmental Data-Driven Genetic Analysis and Genomic Prediction in Maize Hybrids
Automated Machine Learning for Environmental Data-Driven Genome Prediction.

An automated machine learning framework integrating environmental and genomic data enhances genetic analysis and genomic prediction in maize. By leveraging dimension-reduced environmental parameters, it reveals trait-environment relationships and identifies genetic markers that govern phenotypic plasticity and genotype-by-environment interactions. The combined use of markers and environmental features improves genomic prediction accuracy, offering a scalable solution for developing climate-resilient maize varieties.
- DNNGP – Deep neural network for genomic prediction.
- EXGEP – A framework for predicting genotype-by-environment interactions using ensembles of explainable machine-learning models.
- GxEtoolkit – An automated and explainable machine learning framework for Genome Prediction.
- Windows
- Linux
- Python 3.9
- pip
Install packages:
- Create a python environment.
conda create -n autogs python=3.9
conda activate autogs
- Clone this repository and cd into it.
git clone https://github.com/AIBreeding/AutoGS.git
cd ./AutoGS
pip install -r requirements.txt
- Test the AutoGS installation.
python test_autogs.py

import os
import sys
import pprint
import sklearn
import pandas as pd
from scipy.stats import pearsonr
from matplotlib import pyplot as plt
from sklearn.preprocessing import StandardScaler
from autogs.data.tools.reg_metrics import (
    mae_score as mae,
    mse_score as mse,
    rmse_score as rmse,
    r2_score as r2,
    rmsle_score as rmsle,
    mape_score as mape,
    medae_score as medae,
    pcc_score as pcc,
)
from autogs.model import RegAutoGS
from autogs.data import datautils
# read data
phen_file_path = "./dataset/trainset/Pheno/"
env_file_path = "./dataset/trainset/Env/"
geno_file_path = "./dataset/trainset/Geno/YI_All.vcf"
ref_path = "./docs/maizeRef(ALL).csv"
file_names = ["DEH1_2020", "DEH1_2021", "IAH2_2021", "IAH3_2021", "IAH4_2021", "WIH2_2020", "WIH2_2021"]
com_phen_data, com_env_data, dynamic_window_avg, env_transformed_data, \
gendata, PGE = datautils.process_data(phen_file_path, env_file_path, geno_file_path, ref_path, file_names)
# Access phenotype (Yield_Mg_ha) and feature data (Geno and Env)
columns_to_extract = [0, 1, 8] # Get Columns Env, Hybrid, and Yield_Mg_ha
columns_from_11_to_end = list(range(11, PGE.shape[1]))
columns_indices = columns_to_extract + columns_from_11_to_end
extracted_columns = PGE.iloc[:, columns_indices]
extracted_columns = pd.DataFrame(extracted_columns.dropna().reset_index(drop=True))
snp = pd.DataFrame(extracted_columns.iloc[:,3:])
scaler = StandardScaler()
scaled_snp = scaler.fit_transform(snp)
X = pd.DataFrame(scaled_snp,columns=snp.columns)
y = extracted_columns['Yield_Mg_ha']
y = pd.Series(y)
# train AutoGS model for reg prediction
reg = RegAutoGS(
    y=y,
    X=X,
    test_size=0.2,
    n_splits=5,
    n_trial=5,
    reload_study=True,
    reload_trial=True,
    write_folder=os.getcwd()+'/results/',
    metric_optimise=r2,
    metric_assess=[mae, mse, rmse, pcc, r2, rmsle, mape, medae],
    optimization_objective='maximize',
    models_optimize=['LightGBM','XGBoost','CatBoost','BayesianRidge'],
    models_assess=['LightGBM','XGBoost','CatBoost','BayesianRidge'],
    early_stopping_rounds=5,
    random_state=2024
)
reg.train() # train model
reg.CalSHAP(n_train_points=200, n_test_points=200, cluster=False)  # AutoGS SHAP interaction

Evaluation metrics imported above:
- mae – Mean Absolute Error (MAE)
- mse – Mean Squared Error (MSE)
- rmse – Root Mean Squared Error (RMSE)
- r2 – R² Score (Coefficient of Determination)
- rmsle – Root Mean Squared Logarithmic Error (RMSLE)
- mape – Mean Absolute Percentage Error (MAPE)
- medae – Median Absolute Error (MEDAE)
- pcc – Pearson Correlation Coefficient (PCC)
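
For readers who want to see what the eight metrics above measure, here is a minimal sketch that computes them with scikit-learn and SciPy. It does not use `autogs.data.tools.reg_metrics`, whose implementation may differ in details. The helper name `regression_report` and the yield values are hypothetical and exist only for illustration.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
    mean_squared_log_error,
    median_absolute_error,
    r2_score,
)


def regression_report(y_true, y_pred):
    """Return the eight metrics as a dict (illustrative helper, not part of AutoGS)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mse = mean_squared_error(y_true, y_pred)
    return {
        "mae": mean_absolute_error(y_true, y_pred),
        "mse": mse,
        "rmse": np.sqrt(mse),
        "r2": r2_score(y_true, y_pred),
        # RMSLE is only defined for non-negative targets and predictions
        "rmsle": np.sqrt(mean_squared_log_error(y_true, y_pred)),
        "mape": mean_absolute_percentage_error(y_true, y_pred),
        "medae": median_absolute_error(y_true, y_pred),
        "pcc": pearsonr(y_true, y_pred)[0],
    }


# Made-up yields (Mg/ha), for illustration only
observed = [8.0, 10.0, 12.0, 14.0]
predicted = [9.0, 10.0, 11.0, 14.0]

report = regression_report(observed, predicted)
for name, value in report.items():
    print(f"{name}: {value:.4f}")
```

Note that r2 and pcc are maximized while the error metrics are minimized, which is why the example above pairs `metric_optimise=r2` with `optimization_objective='maximize'`.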
The AutoGS framework currently supports 28 base regression models, including ensemble models, linear models, and other widely used regressors. Each model can be optionally selected via the selected_regressors list.
- DTR – Decision Tree Regressor
- ETR – Extra Trees Regressor
- LightGBM – Light Gradient Boosting Machine
- XGBoost – Extreme Gradient Boosting
- CatBoost – Categorical Boosting
- AdaBoost – Adaptive Boosting
- GBDT – Gradient Boosting Decision Tree
- Bagging – Bagging Regressor
- RF – Random Forest Regressor
- HistGradientBoosting – Histogram-based Gradient Boosting

- BayesianRidge – Bayesian Ridge Regression
- LassoLARS – Lasso Least Angle Regression
- ElasticNet – Elastic Net Regression
- SGD – Stochastic Gradient Descent Regressor
- Linear – Ordinary Least Squares Linear Regression
- Lasso – Lasso Regression
- Ridge – Ridge Regression
- OMP – Orthogonal Matching Pursuit
- ARD – Automatic Relevance Determination Regression
- PAR – Passive Aggressive Regressor
- TheilSen – Theil-Sen Estimator
- Huber – Huber Regressor
- KernelRidge – Kernel Ridge Regression
- RANSAC – RANSAC (RANdom SAmple Consensus) Regressor

- KNN – k-Nearest Neighbors Regressor
- SVR – Support Vector Regressor
- Dummy – Dummy Regressor (Baseline)
- MLP – Multi-Layer Perceptron (Deep Neural Network)
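
To give a feel for what selecting among several base regressors involves, the sketch below compares a handful of them by 5-fold cross-validated R² using scikit-learn alone. It is not AutoGS's internal model registry or search procedure. The `CANDIDATES` mapping from abbreviations to scikit-learn estimators is an assumption made for illustration, and the data are synthetic.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import BayesianRidge, ElasticNet, Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical mapping from a few of the abbreviations above to
# scikit-learn estimators (an assumption, not taken from AutoGS).
CANDIDATES = {
    "RF": RandomForestRegressor(n_estimators=50, random_state=2024),
    "BayesianRidge": BayesianRidge(),
    "Ridge": Ridge(alpha=1.0),
    "ElasticNet": ElasticNet(alpha=0.1),
    "KNN": KNeighborsRegressor(n_neighbors=5),
}

# Synthetic stand-in for a scaled marker matrix X and a yield vector y
X, y = make_regression(
    n_samples=120, n_features=30, n_informative=10, noise=5.0, random_state=2024
)

# 5-fold cross-validation, scoring each candidate by mean R²
cv = KFold(n_splits=5, shuffle=True, random_state=2024)
scores = {
    name: cross_val_score(model, X, y, cv=cv, scoring="r2").mean()
    for name, model in CANDIDATES.items()
}

best = max(scores, key=scores.get)
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: mean R2 = {score:.3f}")
print("best:", best)
```

The fixed `random_state` keeps the fold assignment reproducible, so differences in score come from the models and not from the split.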
⚠️ When using Jupyter Notebook/Lab, please ensure your Python environment is properly configured and added to Jupyter Notebook/Lab.
conda create -n autogs python=3.9
conda activate autogs
conda install ipykernel
python -m ipykernel install --user --name autogs --display-name "autogs"

The paper describing AutoGS:
He K, Yu T, Gao S, et al. Leveraging Automated Machine Learning for Environmental Data-Driven Genetic Analysis and Genomic Prediction in Maize Hybrids. Adv Sci (Weinh), e2412423, 2025. https://doi.org/10.1002/advs.202412423
This project is free to use for non-commercial purposes - see the LICENSE file for details.
For more information, please contact Huihui Li (lihuihui@caas.cn).