EMUlator2ML

EMUlator2ML is a machine learning-enabled computational framework for 13C metabolic flux analysis. The training and test dataset is generated through metabolite labeling pattern simulation by means of random flux sampling and metabolic network decomposition, in which metabolic flux ratios and metabolite labeling patterns (mass isotopomer distribution vector, MDVs) are used as training targets and features, respectively. Prior to training, flux ratios are screened for solvability and metabolite MDVs are selected according to feature impotrance. A serial of ML and DL methods such as linear SVM, KNN, decision tree, random forest, gradient tree boosting and DNN will be tested and tuned to find the best estimator for each ratio. Combined with measurable fluxes, the predicted flux ratios determine/overdetermine the mass balance system of a metabolic network, and global metabolic fluxes will be solved. The uncertainty of estimation is also provided with a Monte Carlo method.

Dependencies

EMUlator2ML was developed and tested using Python 3.6+ with the following packages:
      COBRApy    v0.16.0
      Matplotlib    v3.1.1
      NumPy    v1.16.4
      Pandas    v0.25.1
      SciPy    v1.3.1
      Scikit-learn    v0.21.2
      SymPy    v1.4
      TensorFlow with GPU support    v2.0.0

Usage

main1.py generates dataset of flux ratios at metabilic nodes and simulated metabolite MDVs at random flux distributions.

Arguments:

-o, --outDir: output directory
-s, --simEMUs: EMUs (elementary metabolite units) to simulate, spe by ","
-rf, --rxnsFile: .tsv file with reactions of a metabolic network. Lines starts with "#" will be ignored. See below as an example

#reaction_ID	substrate_IDs(atom)	product_IDs(atom)	reversibility
v1	G6P(abcdef)	F6P(abcdef)	1
v2	F6P(abcdef)+ATP	FBP(abcdef)	0
...	...	...	...

Notes.

In Col 2 and 3, letters in parenthesis denotes atom mapping in the reaction. For metabolites with equivalents (e.g. chiral and prochiral metabolites), the atom mapping should be written as "0.5abcd,0.5dcba" which means a four-carbon metabolite with two equivalents.
In Col 4, 0 denotes irreversible and 1 denotes reversible.

-cf, --consFile: .tsv file with mass balance constraints including assignmet of fluxes, flux ratios and flux ranges. Lines starts with "#" will be ignored. "#flux value", "#ratio range" and "#flux range" tell the program the subsequent assignment of fluxes, flux ratios and flux ranges, respectively. See below as an example

#constraints	lower_bound	upper_bound
#flux value
v66	100
...	...	...
#ratio range
v2/v1	0.1	0.4
...	...	...
#flux range
allIR	0	200
allR	-200	200
...	...	...

Notes.

Assignmet of flux value is mandatory with a single value. It's usually used to standardize a flux distribution by setting the uptake flux to 100.
Assignmet of ratio range is optional, and symbolic expression is acceptable.
Assignmet of flux range is mandatory. Use "allR" to assign all reverible reactions, and "allIR" to assign all irreversible reactions.

-sf, --subsFiles: file(s) with substrate labeling info in the format of "tracer::path", sep by "," for parallel labeling. Lines starts with "#" will be ignored. See below as an example

#substrate	percentage	purity	labeling_pattern
GlcEX	0.754	0.997	1,0,0,0,0,0
GlcEX	0.246	0.994	1,1,1,1,1,1
...	...	...	...

Notes.

"percentage" denotes the molar percentage of corresponding labeling pattern; "purity" denotes the isotopic purity of corresponding labeling pattern; "labeling_pattern" indicates how the substrate is labeled with each bit denoting whether corresponding carbon is labeled.
User don't need to record natural substrates.
For each labeled substrate, sum of percentage should be 1 or less than 1. In the second case, the remainder is considered as fully unlabeled (natural substrate).
Different substrates are allowed.

-exm, --exMetabs: metabolites excluded from mass balance, sep by ","
-exn, --exNodes: node metabolites excluded for ratio selection, sep by ","
-ns, --nsims: the number of random flux distributions to generate
-q, --quantile: generated flux values in the quantile interval (i.e. [0.5 - quantile/2, 0.5 + quantile/2]) are preserved for each reaction, intersection of each flux sets are preserved as final dataset
-nj, --njobs: the number of jobs to run in parallel

Example:

python path/to/main1.py -o path/to/generated_data -s Ala23,Ala123,Thr1234,Phe123456789 -rf path/to/Reactions.tsv -cf path/to/Simulation_constraints.tsv -sf 1-U-Glc::path/to/Labeled_substrates_1-U-Glc.tsv -exn CD -ns 10000 -q 0.99 -nj 30

Note：

It is highly recommended to run this script in a HPC cluster.

main2.py selects MDV features according to importance and select the best estimator for each flux ratio.

Arguments:

-o, --outDir: output directory
-df, --dataFile: dataset file generated by main1.py, including target flux ratios and MDV features. See below as an example

r1	r2	...	1-U-Glc_Ala23_m0	1-U-Glc_Ala23_m1	1-U-Glc_Ala23_m2	...
0.093221201	0.906778799	...	0.639773192	0.148613755	0.211613052	...
...	...	...	...	...	...	...

Notes.

ID of MDV feature is presented as "labeling strategy_EMU ID_m*".

-r, --ratios: flux ratios to predict, sep by ",". Use "all" to select all ratios
-n, --noise: SD of normally distributed noise added to MDV features, default 0.01
-c, --criteria: how features are selected. If "mean" or "median" or float, features with importance above mean or median or criteria will be selected; if "**%", the top criteria percent of total features will be selected; if int, the top criteria of total features will be selected
-m, --methods: ML methods to test, sep by ",". Specifically, "pr" for polynomial regression, "lsvm" for linear support vector machine, "knn" for k-nearest neighbors, "dt" for decision tree, "rf" for random forst, "gtb" for gradient tree boosting, "mlp" for multilayer perceptron and "dnn" for deep neural network
-w, --runWhich: "1" only run MDV feature selection; "2" only run ML method tuning and selection; "12" (default) run both
-e, --ifError: whether to train a error model, "yes" or "no". Valid only if runWhich == "2" or "12"
-nj, --njobs: the number of jobs to run in parallel

Example:

python path/to/main2.py -o path/to/selected_models -df path/to/generated_data/ratios_MDVs.tsv -r all -n 0.01 -c 5% -m lsvm,knn,dt,rf,gtb,dnn -w 12 -e no -nj 30

Note：

It is highly recommended to run this script in a HPC cluster with GPU support.

main3.py predicts flux ratios from measured metabolite MDVs using trained ML models.

Arguments:

-o, --outDir: output directory
-cf, --configFile: config file for flux ratio prediction, fields include ratio_ID, model_path, errormodel_path (optional) and feature_path. Lines starts with "#" will be ignored. See below as an example

#ratio_ID	model_path	errormodel_path	feature_path
r3	path/to/selected_models/r3/dnn.h5		path/to/selected_models/r3/selected_features.tsv
...	...	...	...

Notes.

Model path, errormodel path and feature path are paths to corresponding files generated by main2.py.
Empty string for errormodel_path if no error model is trained.
feature_path can be selected features, sep by ",". In this case, the order is important.

-mf, --MDVsFile: file of measured MDVs, fields include mean and sd. Set sd to 0 if no replicates when error models are required. Lines starts with "#" will be ignored. See below as an example

#feature	mean	SD
1-U-Glc_Ala23_m0	0.639773192	0.01
1-U-Glc_Ala23_m1	0.148613755	0.01
1-U-Glc_Ala23_m2	0.211613052	0.01
...	...	...

-nr, --nruns: the number of runs to estimate uncentainties using Monte Carlo method, valid when sd is not 0, default 1000

Example:

python path/to/main3.py -o path/to/predicted_ratios -cf path/to/Prediction_configs.tsv -mf path/to/Measured_MDVs.tsv -nr 1000

main4.py estimated global metabolic fluxes based on mass balance constrainted by predicted flux ratios and measured fluxes.

Arguments:

-o, --outDir: output directory
-rf, --rxnsFile: the same as in main1.py
-cf, --consFile: file with mass balance constraints including assignmet of fluxes, flux ratios and flux ranges. Lines starts with "#" will be ignored. See below as an example

#constraints	mean(lower_bound)	SD(upper_bound)
#flux value
v66	100	1
...	...	...
#ratio value
1.0v3_b/(1.0v2+1.0*v3_b)	0.572433352	0.031303227
...	...	...
#flux range
all	0	200
...	...	...

Notes.

Assignmet of flux value is mandatory with mean and SD. Set SD to 0 if no replicate.
Assignmet of ratio value is mandatory, and symbolic expression is acceptable. Note "v*_f" denotes forward flux and "v*_b" denotes backward flux.
Assignmet of flux range is optional. Use "all" to assign all (nonnegative) reactions.

-exm, --exMetabs: metabolites excluded from mass balance, sep by ","
-nr, --nruns: the number of runs to estimate uncentainties using Monte Carlo method, default 1000

Example:

python path/to/main4.py -o path/to/estimateted_fluxes -rf path/to/Reactions.tsv -cf path/to/Measured_constraints.tsv -nr 1000

Citation

Wu et al. (2022) Computational Framework for Machine-Learning-Enabled 13C Fluxomics, ACS Synthetic Biology 11(1): 103–115. doi:10.1021/acssynbio.1c00189

Name		Name	Last commit message	Last commit date
Latest commit History 54 Commits
data		data
models		models
scripts		scripts
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

EMUlator2ML

Dependencies

Usage

Citation

About

Uh oh!

Releases

Packages

Languages

License

Chaowu88/EMUlator2ML

Folders and files

Latest commit

History

Repository files navigation

EMUlator2ML

Dependencies

Usage

Citation

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages