Antimicrobial resistance is driven not only by single genes but by coordinated gene sets that co-occur and move on mobile elements, yet researchers lack a unified, reproducible way to quantify these patterns across genomic populations. The Resistance Gene Association and Inference Network (ReGAIN) is an open-source platform that utilizes Bayesian network structure learning to infer probabilistic co-occurrence among resistance and virulence genes in clinically important bacterial pathogens. ReGAIN delivers interpretable metrics (conditional probability, relative risk, absolute risk difference, confidence intervals) and post-hoc bidirectional probability scores, enabling detection of synergistic and mutually exclusive gene relationships. By standardizing analyses across species and studies, ReGAIN supports surveillance, hypothesis generation, and stewardship by highlighting gene constellations linked to multidrug resistance and potential co-selection of resistance determinants.
- Conditional Probability
- Conditional probability is described as the probability of observing Variable A given the presence of Variable B within a given dataset, P(A|B)
- This value is reported on a scale of 0–1 (conditional probability of 0.5 = 50%)
- Absolute Risk Difference
- Absolute risk difference is defined as the conditional probability of observing Variable A given the presence of Variable B minus the conditional probability of observing Variable A in the absence of Variable B, P(A|B) - P(A|¬B)
- This value is reported on a scale of -1–1; negative values occur when P(A|¬B) > P(A|B), and values near -1 indicate a strong negative probabilistic relationship
- Relative Risk
- Relative risk is a ratio of the conditional probability of observing Variable A given the presence of Variable B to the conditional probability of observing Variable A given the absence of Variable B, P(A|B) / P(A|¬B)
- Relative risk < 1, negative relationship
- Relative risk > 1, positive relationship
- Relative risk = 1, neutral relationship (variable independence)
- Baseline risk (bnS pipeline only)
- Baseline risk of outcome: the % occurrence of Variable B in the dataset
- This value will be identical for all "Variable B's" within a given dataset
- Example: Gene 1 has a baseline risk of 0.12; this indicates that 12% of the genomic population encodes Gene 1.
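The definitions above can be illustrated with a minimal sketch that computes each metric from raw presence/absence calls. This is a simplification under stated assumptions: ReGAIN obtains its values by querying a bootstrapped Bayesian network, whereas this example uses plain empirical frequencies. The function name `association_metrics` and the toy data are hypothetical and not part of ReGAIN.

```python
def association_metrics(a, b):
    """Empirical association metrics for two binary variables.

    a, b: equal-length sequences of 0/1 values, one entry per genome
    (a = Variable A, the outcome; b = Variable B, the condition).
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("a and b must be non-empty and of equal length")
    n_b = sum(b)                 # genomes where B is present
    n_not_b = len(b) - n_b       # genomes where B is absent
    if n_b == 0 or n_not_b == 0:
        raise ValueError("B must be present in some genomes and absent in others")

    a_with_b = sum(1 for x, y in zip(a, b) if x and y)
    a_without_b = sum(1 for x, y in zip(a, b) if x and not y)

    p_a_given_b = a_with_b / n_b
    p_a_given_not_b = a_without_b / n_not_b
    return {
        "conditional_probability": p_a_given_b,
        "absolute_risk_difference": p_a_given_b - p_a_given_not_b,
        # Relative risk is undefined when P(A|not B) is 0; None marks that case.
        "relative_risk": (p_a_given_b / p_a_given_not_b
                          if p_a_given_not_b > 0 else None),
        # Baseline risk as defined above: the fraction of genomes carrying B.
        "baseline_risk": n_b / len(b),
    }


# Toy data for 10 genomes (hypothetical values).
gene_b = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
gene_a = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = association_metrics(gene_a, gene_b)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this toy dataset A is present in 3 of the 4 genomes carrying B and in 1 of the 6 genomes lacking B, so the relative risk is well above 1 and the absolute risk difference is positive.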
Post-Hoc Analyses
A relationship between two variables may be directionally asymmetric: P(A|B) does not necessarily equal P(B|A), because the conditional dependencies that determine A given B may differ from those that determine B given A. For example, knowing that it is raining (B) changes the probability that the grass is wet (A), but knowing that the grass is wet does not change the probability of rain in the same way. To better inform the direction of the relationship, ReGAIN calculates two post-hoc scores.
- Bidirectional Probability Score (BDPS)
- The ratio of the conditional probability P(A|B) to the conditional probability P(B|A)
- BDPS > 1, P(A|B) > P(B|A)
- BDPS < 1, P(A|B) < P(B|A)
- BDPS = 1, equal bidirectional strength
- Fold Change (FC)
- The ratio of the relative risk P(A|B) / P(A|¬B) to the relative risk P(B|A) / P(B|¬A), divided by 2: [(P(A|B) / P(A|¬B)) / (P(B|A) / P(B|¬A))] / 2
- Interpreted similarly to BDPS, with equal bidirectional strength being 0.5
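As a rough sketch of the two post-hoc scores, the function below takes four conditional probabilities and returns BDPS and FC. The fold change is computed as the ratio of the two relative risks divided by 2, which is one reading of the definition above that agrees with equal bidirectional strength giving 0.5. The function name `post_hoc_scores` and the input values are hypothetical.

```python
def post_hoc_scores(p_a_given_b, p_a_given_not_b, p_b_given_a, p_b_given_not_a):
    """Return (BDPS, FC) from four conditional probabilities, all assumed > 0."""
    # Bidirectional probability score: P(A|B) / P(B|A)
    bdps = p_a_given_b / p_b_given_a
    # Relative risk in each direction
    rr_a_given_b = p_a_given_b / p_a_given_not_b
    rr_b_given_a = p_b_given_a / p_b_given_not_a
    # Fold change: ratio of the two relative risks, divided by 2
    fold_change = (rr_a_given_b / rr_b_given_a) / 2
    return bdps, fold_change


# Symmetric relationship: both directions have identical probabilities.
symmetric = post_hoc_scores(0.6, 0.2, 0.6, 0.2)
# Asymmetric relationship: A given B is stronger than B given A.
asymmetric = post_hoc_scores(0.8, 0.2, 0.4, 0.2)
print("symmetric:", symmetric)
print("asymmetric:", asymmetric)
```

With identical probabilities in both directions, BDPS is 1 and FC is 0.5; the asymmetric case pushes both scores above those reference values.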
- Installation
- Programs and Example Usage
- Module 1: Gene Identification
- Module 1.1: Dataset Creation
- Module 2: Bayesian Network Structure Learning
- ReGAIN Curate
- Multivariate Analysis
- Formatting External Data
- Citations
Due to current limitations in the Conda R / gRain stack on macOS, the Bioconda recipe for ReGAIN v1.7.1 may fail to solve dependencies on osx-64. Until this is resolved upstream, we recommend this workaround using a minimal Python+R Conda environment and installing ReGAIN from source.
- If not done already, download miniforge
conda create -y -n regain-conda python=3.10 r-base=4.4
source activate regain-conda
conda install -y -c conda-forge -c bioconda regain-cli
- Install AMRFinderPlus database:
amrfinder -u
Test ReGAIN installation:
regain -h
regain --module-health
If these commands execute without error, the installation is successful.
For regain --module-health, you should see:
ReGAIN Module Status Report
- AMR.run: Available
- matrix.run: Available
- curate.run: Available
- extract.run: Available
- combine.run: Available
- bnS: Available
- bnL: Available
- MVA: Available
- network: Available
For the most cutting-edge ReGAIN releases, we suggest cloning from source
Create Conda environment and install NCBI AMRFinderPlus and BLAST+
conda create -n regain python=3.10 r-base=4.4
source activate regain
- Install AMRFinderPlus
conda install -y -c conda-forge -c bioconda ncbi-amrfinderplus
- Check AMRFinder installation
amrfinder -h
- Download AMRFinderPlus database
amrfinder -u
- BLAST+ is installed as an AMRFinderPlus dependency
One-liner: conda create -y -n regain -c conda-forge -c bioconda python=3.10 r-base=4.4 ncbi-amrfinderplus
- Download ReGAIN to preferred directory
git clone https://github.com/ERBringHorvath/regain_CLI
- Install Python dependencies
cd /path/to/regain_CLI
pip install .
Note: R dependencies should be installed automatically when running ReGAIN for the first time
- Add ReGAIN to your PATH
- File might be .bash_profile, .bashrc, or .zshrc depending on OS and shell
- Add this line to the end of the file:
export PATH="$PATH:/path/to/regain_CLI/src/regain"
- Replace /path/to/regain_CLI/src/regain with the actual path to the directory containing the executable
- Whatever the initial directory, this path should end with /regain_CLI/src/regain
- Save the file and restart your terminal or run source ~/.bashrc or source ~/.zshrc
- Verify installation:
regain --version
regain --help
- Run regain --module-health to check status of ReGAIN modules
- You should see:
ReGAIN Module Status Report
- AMR.run: Available
- matrix.run: Available
- curate.run: Available
- extract.run: Available
- combine.run: Available
- bnS: Available
- bnL: Available
- MVA: Available
- network: Available
NOTE: ReGAIN utilizes shell scripts to execute some modules. You may need to modify your permissions
to execute these scripts. If you run regain --version and see permission denied: regain, navigate to regain/src/regain, run both chmod +x regain and chmod +x *.sh, then rerun regain --module-health.
ReGAIN standalone installation comes with a subdirectory called demo_dataset/
For Bioconda installations, demo_dataset/ files can be downloaded from this repository
- Input files:
- Enterococcus_Demo_Dataset.csv
- Enterococcus_Demo_Dataset_Metadata.csv
Basic regain bnS workflow:
regain bnS -i Enterococcus_Demo_Dataset.csv \
-M Enterococcus_Demo_Dataset_Metadata.csv \
-o demo_network -n 500 -r 100 -T 4
This will generate a Bayesian network using 500 bootstraps and 100 data resamples. Using 4 dedicated cores (-T 4), this analysis should take less than 30 minutes to run. If more cores are available, setting -T to 8 will decrease runtime.
- Output files:
- demo_network.rds
- Query_Results.csv
- post_hoc_analysis.csv
- Bayesian_Network.html and Bayesian_Network.pdf
The resulting network will have both red and black edges connecting nodes. By default, ReGAIN colors negatively associated variable edges red (relative risk < 1). Positively associated edges are colored black (relative risk ≥ 1).
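The default coloring rule described above can be written as a one-line check. This is only an illustration of the rule as stated; the function name `edge_color` is hypothetical and does not reflect how ReGAIN implements its visualization.

```python
def edge_color(relative_risk, threshold=1.0):
    """Red for relative risk below the threshold, black otherwise."""
    return "red" if relative_risk < threshold else "black"


# Hypothetical relative risk values for three edges.
for rr in (0.4, 1.0, 2.5):
    print(rr, edge_color(rr))
```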
If specific network visualization parameters are desired, see Module 2 Standalone Network Visualization for available options.
-f, --fasta-directory, path to directory containing genome FASTA files to analyze
--mode, input FASTA file format (nucleotide, protein, combined (requires protein FASTA + GFF file); default = nucleotide)
-O, --organism, specify organism for AMRFinderPlus pipeline
-o, --output-dir, output directory path
-T, --threads, specify number of cores to dedicate
Additional options:
--gff, input GFF file for combined mode
-D, --database, alternate AMRFinderPlus database directory
--no-plus, run basic AMRFinder analysis (organism non-specific; cannot use with --organism)
--name, prepend a sample/run identifier to AMRFinderPlus output CSV file
--quiet, suppress AMRFinderPlus status messages
--ident-min, override minimum percent identity (0-1). Use with caution
--print-node, add hierarchy node column
--mutation-all, generate a per-sample point-mutation audit file
--nucleotide-output, write FASTA file of detected nucleotide regions
--nucleotide-flank5-output, write FASTA file of detected nucleotide regions + 5' flanking base pairs
--nucleotide-flank5-size, number of additional 5' flanking base pairs for flank output
--protein-output, write FASTA of detected proteins from input protein FASTA
--organism-list, print the built-in list of supported --organism options
Currently supported organisms and how they should be called:
Acinetobacter_baumannii
Bordetella_pertussis
Burkholderia_cepacia
Burkholderia_pseudomallei
Burkholderia_mallei
Citrobacter_freundii
Corynebacterium_diphtheriae
Campylobacter
Clostridioides_difficile
Enterobacter_cloacae
Enterobacter_asburiae
Enterococcus_faecalis
Enterococcus_faecium
Escherichia
Haemophilus_influenzae
Klebsiella_oxytoca
Klebsiella_pneumoniae
Neisseria_gonorrhoeae
Neisseria_meningitidis
Pseudomonas_aeruginosa
Salmonella
Serratia_marcescens
Staphylococcus_aureus
Staphylococcus_pseudintermedius
Streptococcus_agalactiae
Streptococcus_pneumoniae
Streptococcus_pyogenes
Vibrio_cholerae
Vibrio_vulnificus
Vibrio_parahaemolyticus
Module 1 example usage:
Organism specific:
regain AMR -d path/to/FASTA/files -O Pseudomonas_aeruginosa -T 8 -o path/to/output/directory
Organism non-specific:
regain AMR -d path/to/FASTA/files -T 8 -o path/to/output/directory
Output files:
One results file per submitted genome
NOTE: variable names cannot contain special characters; this transformation is automated during dataset creation
Examples of special characters: '"/,().[]
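To give a sense of what such a name transformation might look like, the sketch below deletes the special characters listed above from a gene name. This is an assumption for illustration only: the exact rule ReGAIN applies (deletion versus substitution) is not stated here, and the name `sanitize_name` is hypothetical.

```python
# The special characters listed above: ' " / , ( ) . [ ]
SPECIAL_CHARACTERS = "'\"/,().[]"
_DELETE_TABLE = str.maketrans("", "", SPECIAL_CHARACTERS)


def sanitize_name(name):
    """Return the name with every listed special character removed."""
    return name.translate(_DELETE_TABLE)


for gene in ("aac(6')-Ib", "tet(M)", "blaTEM-1"):
    print(gene, "->", sanitize_name(gene))
```

One consequence of any scheme like this is that two distinct names can collapse to the same variable name, so cleaned names are worth checking for duplicates.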
-d, --directory, path to AMRfinder results in CSV format
--gene-type, searches for resistance, virulence, or all genes
--min, minimum gene occurrence cutoff
--max, maximum gene occurrence cutoff (should be less than number of genomes, see NOTE below)
--report-all, optional; reports all genes identified, regardless of --min/--max threshold
--keep-gene-names, optional; maintains special characters in variable names. Should not be used if proceeding to Module 2
Module 1.1 example usage
NOTE: Discrete Bayesian network analyses require all variables to exist in at least two states. For ReGAIN, these two states are 'present' and 'absent'. Ubiquitously occurring genes will break the analysis.
As a best practice, for N genomes --max should be no greater than N - 1. Keep in mind that removing very low and very high abundance genes can reduce noise in the network.
regain matrix -d path/to/AMRfinder/results --gene-type resistance --min 5 --max 475
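The effect of the occurrence cutoffs can be sketched as follows. The example assumes --min and --max are inclusive counts of genomes carrying a gene, which is an assumption rather than documented behavior; the function `filter_by_occurrence` and the toy matrix are hypothetical.

```python
def filter_by_occurrence(matrix, min_count, max_count):
    """Keep genes whose number of carrier genomes lies within [min_count, max_count].

    matrix: dict mapping gene name -> list of 0/1 values, one per genome.
    """
    return {
        gene: calls
        for gene, calls in matrix.items()
        if min_count <= sum(calls) <= max_count
    }


# Toy presence/absence data for 5 genomes (hypothetical values).
toy = {
    "geneA": [1, 1, 1, 1, 1],   # ubiquitous: only one state, must be removed
    "geneB": [1, 0, 0, 0, 0],   # rare: falls below the minimum cutoff
    "geneC": [1, 1, 0, 1, 0],   # retained
}
n_genomes = 5
kept = filter_by_occurrence(toy, min_count=2, max_count=n_genomes - 1)
print(sorted(kept))
```

Capping the maximum at N - 1 guarantees that every retained gene is absent from at least one genome, which is what keeps each variable in two states.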
NOTE: all results are saved in the 'ReGAIN_Dataset' folder, which will be generated within the directory
defined by -d/--directory
Output files:
filtered_matrix.csv: presence/absence matrix of genes
metadata.csv: file containing genes identified in AMRfinderPlus analysis
combined_AMR_results_unfiltered.csv: concatenated file of all AMRfinder/Plus results; this file contains contig and nucleotide location of all identified genes
If --report-all is used:
unfiltered_matrix.csv: presence/absence matrix of all genes identified, regardless of --min/--max thresholds
-i, --input, input file in CSV format
-M, --metadata, file containing gene names and descriptions
-o, --output-boot, output bootstrap file
-T, --threads, number of cores to dedicate for parallel processing
-n, --bootstraps, how many bootstraps to run (suggested minimum of 300-500)
-r, --resamples, how many data resamples you want to use (suggested minimum of 100)
--blacklist, optional blacklist CSV (no header); 2 columns for variable 1 and variable 2
--iss, imaginary sample size for BDe score (default = 10)
--no-viz, Skip HTML/PDF visualization (for use if specific options are wanted, see regain network)
bnL only:
--cp-samples, Monte Carlo samples for cpquery (default: 10000)
Module 2 example usage:
NOTE: We suggest a minimum of 300 to 500 bootstraps and at least 100 resamples when possible
bnS, Bayesian network structure learning analysis for fewer than 100 genes
bnL, Bayesian network structure learning analysis for 100 or more genes
For fewer than 100 genes:
regain bnS -i matrix_filtered.csv -M metadata.csv -o bootstrapped_network -T 8 -n 500 -r 500 --blacklist list.csv
For 100 or more genes:
regain bnL -i matrix_filtered.csv -M metadata.csv -o bootstrapped_network -T 8 -n 500 -r 500 --blacklist list.csv
Output files:
<output>.rds, bootstrapped network
Query_Results.csv, results file of all conditional probability and relative risk values
post_hoc_analysis.csv, results file of all bidirectional probability and fold change scores
Bayesian_Network.html, interactive Bayesian network
Bayesian_Network.pdf, static PDF of network using Fruchterman-Reingold force-directed layout algorithm
NOTE:
If using a blacklist, the CSV file should have no header and exactly two columns: column 1 holds the first variable of the pair to blacklist and column 2 holds the second. Blacklisting must be explicitly bidirectional, so if 'geneA' and 'geneB' are blacklisted, both orderings must be listed (col1, col2 below only labels the columns and is not part of the file):
col1, col2
geneA, geneB
geneB, geneA
Additionally, blacklists should be used with caution; ideally, only 'impossible' variable pairs should be blacklisted. An example of this would be a gene with 2 different mutations at the same site (e.g., gyrA_S83D and gyrA_S83E).
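A blacklist in the layout described above can be generated programmatically. The sketch below writes each pair in both directions with no header row; the helper name `write_blacklist` is hypothetical, and the example writes to an in-memory buffer rather than a real file.

```python
import csv
import io


def write_blacklist(pairs, handle):
    """Write each variable pair in both directions, with no header row."""
    writer = csv.writer(handle, lineterminator="\n")
    seen = set()
    for first, second in pairs:
        for row in ((first, second), (second, first)):
            if row not in seen:      # avoid writing the same direction twice
                seen.add(row)
                writer.writerow(row)


# Two mutations at the same site cannot co-occur in one gene copy.
buffer = io.StringIO()
write_blacklist([("gyrA_S83D", "gyrA_S83E")], buffer)
print(buffer.getvalue(), end="")
```

To produce a file for --blacklist, the same function could be called with a handle from `open("list.csv", "w", newline="")`.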
For use if --no-viz is passed or specific network parameters are wanted
regain network
-i, --input, input RDS file generated from bnS/bnL analysis
-d, --data, input filtered data matrix file
-M, --metadata, input metadata file
-s, --statistics-results, input 'Results.csv' file from bnS/bnL analysis
--threshold, averaged network threshold (default: 0.5)
--seed, layout seed for Fruchterman-Reingold force-directed layout algorithm (pdf output)
--html-out, HTML output file name (default: Bayesian_Network.html)
--pdf-out, PDF output file name (default: Bayesian_Network.pdf)
-b, --blacklist, optional blacklist CSV (no header): from,to
--width-metric, edge-width metric selection (auto, abs_mean, abs_ci, cp_ci, cp_mean) (default: auto (abs_ci))
--rr-metric, relative risk edge color threshold (default: 1.0, <1: red, >=1: black)
Example usage:
regain network -i network.rds -d matrix_filtered.csv -M metadata.csv -s Results.csv
This workflow is an integrated part of the standard bnS/bnL pipeline, but serves as a redundant measure in the event network visualization needs to be re-performed or specific parameters beyond bnS/bnL defaults are wanted.
ReGAIN Curate is designed to allow users to generate a dataset for Bayesian network structure learning using
a custom set of gene queries, independent of ReGAIN Module 1
regain curate
-d, --directory, path to genome FASTA files
-q, --query, path to query files containing amino acid sequences in FASTA format
-T, --threads, number of cores to dedicate for parallel processing
--min, minimum gene occurrence threshold
--max, maximum gene occurrence threshold (should be less than number of genomes, see NOTE below)
--nucleotide-query, optional; use this to query nucleotide FASTA files
--report-all, optional; use this to return all BLAST hits, regardless of internal identity thresholds
--evalue, optional; set a custom maximum e-value threshold. Default = 1e-5
--perc, optional; set a custom minimum percent identity threshold. Default = 90%
--cov, optional; set a custom minimum query coverage threshold. Default = 75%
--min-seq-length, optional; designate minimum allowed query sequence length. Use with caution
--keep-gene-names, optional; maintains special characters in variable names. Should not be used if proceeding to Module 2
ReGAIN Curate example usage:
regain curate -d /path/to/genome/files -q /path/to/query/files -T 8 --min 5 --max 475
ReGAIN Curate output files:
filtered_results.csv, all BLAST results meeting identity thresholds
curate_matrix.csv, filtered data matrix
curate_metadata.csv, metadata file for use in ReGAIN statistical modules
If --report-all is used:
all_results.csv, all BLAST results, regardless of identity thresholds
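The default identity thresholds listed in the options above (e-value 1e-5, percent identity 90%, query coverage 75%) can be pictured as a simple filter over BLAST hits. This is a sketch under assumptions: the dictionary keys, the inclusive comparisons, and the function name `passes_thresholds` are hypothetical and do not describe ReGAIN's internal data structures.

```python
def passes_thresholds(hit, max_evalue=1e-5, min_identity=90.0, min_coverage=75.0):
    """Return True if a hit satisfies all three thresholds."""
    return (
        hit["evalue"] <= max_evalue
        and hit["percent_identity"] >= min_identity
        and hit["query_coverage"] >= min_coverage
    )


# Hypothetical BLAST hits.
hits = [
    {"gene": "geneA", "evalue": 1e-50, "percent_identity": 98.2, "query_coverage": 100.0},
    {"gene": "geneB", "evalue": 1e-8, "percent_identity": 85.0, "query_coverage": 92.0},
    {"gene": "geneC", "evalue": 1e-3, "percent_identity": 95.0, "query_coverage": 80.0},
]
filtered = [hit["gene"] for hit in hits if passes_thresholds(hit)]
print(filtered)
```

Here geneB fails on percent identity and geneC fails on e-value, so only geneA would remain in a filtered result set.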
NOTE: Discrete Bayesian network analyses require all variables to exist in at least two states. For ReGAIN, these two states are 'present' and 'absent'. Ubiquitously occurring genes will break the analysis.
As a best practice, for N genomes --max should be no greater than N - 1. Keep in mind that removing very low and very high abundance genes can reduce noise in the network.
ReGAIN Extract is an optional module for use with ReGAIN Curate. This module extracts aligned sequences
identified from regain curate. Offered as an additional quality control step for gene identification.
Nucleotide sequences are extracted to a multi-FASTA file
regain extract
-c, --csv-path, path to ReGAIN Curate results file, such as filtered_results.csv
-f, --fasta-directory, path to genome FASTA files used in ReGAIN curate
-T, --threads, number of cores to dedicate for parallel processing
-o, --output-fasta, multi-FASTA file output (.fa, .fas, .fasta, .fna, .faa)
--evalue, optional; for use when --report-all flag is used. Sets maximum evalue threshold for sequence extraction
--perc, optional; same guidelines as --evalue
--cov, optional; same guidelines as --evalue and --perc
--translate, optional; translates extracted nucleotide sequences (see NOTE below)
ReGAIN Extract example usage:
regain extract -c /path/to/results/csv -f /path/to/genome/FASTA/files -T 8 -o sequences.fa
NOTE: the --translate flag should be used with care. In the event an alignment returns an incomplete CDS,
ReGAIN Extract will trim the sequence to the nearest length divisible by 3 for codon prediction, which can result
in frameshifts. --translate is only suggested for use if returned alignments represent full coding sequences, or
if manual validation of gene calls is performed.
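The frameshift risk described in this note can be seen in a small sketch. It assumes trimming removes bases from the 3' end until the length is divisible by 3; how ReGAIN Extract trims in practice is not specified here, and the function names and sequences are hypothetical.

```python
def trim_to_codons(sequence):
    """Drop trailing bases so the length is divisible by 3."""
    usable = len(sequence) - (len(sequence) % 3)
    return sequence[:usable]


def codons(sequence):
    """Split a trimmed sequence into consecutive 3-base codons."""
    trimmed = trim_to_codons(sequence)
    return [trimmed[i:i + 3] for i in range(0, len(trimmed), 3)]


full_cds = "ATGGCCAAATAA"   # complete toy coding sequence
partial = full_cds[1:]      # alignment missing the first base

print(codons(full_cds))
print(codons(partial))
```

The partial alignment is still split into codons after trimming, but every codon is read out of frame relative to the full sequence, so the translation would not match the true protein.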
ReGAIN Combine is an optional module for use in combination with the ReGAIN Curate and ReGAIN AMR modules.
In the event users want to supplement the regain AMR results with a custom set of genes queried through
regain curate, regain combine will merge both datasets into a single dataset for use in ReGAIN statistical modules
regain combine
--matrix1, path to ReGAIN AMR data matrix, filtered_matrix.csv
--matrix2, path to ReGAIN Curate data matrix, curate_matrix.csv
--metadata1, path to ReGAIN AMR metadata file, metadata.csv
--metadata2, path to ReGAIN Curate metadata file, curate_metadata.csv
--delete-duplicates, optional; automatically delete duplicate values from dataset
ReGAIN Combine example usage:
regain combine --matrix1 /path/to/AMR/matrix/csv --matrix2 /path/to/curate/matrix/csv \
--metadata1 /path/to/AMR/metadata/csv --metadata2 /path/to/curate/metadata/csv
ReGAIN Combine output files:
combined_matrix.csv, combined presence/absence matrix
combined_metadata.csv, combined metadata file
NOTE: in order for regain combine to function properly, do not modify values in column 1 (file) of the data matrix files
Currently supported measures:
manhattan, euclidean, canberra, clark, bray, kulczynski,
jaccard, gower, altGower, morisita, horn, mountford, raup,
binomial, chao, cao, chord, hellinger, aitchison, mahalanobis
regain MVA
-i, --input, input file in CSV format
-m, --method, options: any of the currently supported measures listed above
--k, k for k-means; 0 = auto (2..10)
-C, --confidence, Ellipse confidence (Default = 0.95)
--seed, set seed for reproducibility (default = 42)
--label, apply labels to data points [none, auto, all] (default = auto)
--point-size, size of data points (default = 3.5)
--alpha, opacity of data points (default = 0.75)
--no-ellipses, do not layer confidence ellipses
--psuedocount, for Aitchison method (default = 1e-6)
--save-dist, save distances to CSV
--dist-out, manually name output distance file
--png-out, manually name output PNG
--pdf-out, manually name output PDF
--coords-out, manually name output coordinate file (default = MVA_coordinates.csv)
--pcoa-correction, apply PCoA correction [auto, none, lingoes, cailliez] (default = auto)
Example usage:
regain MVA -i matrix.csv -m jaccard --k 0 --pcoa-correction auto
NOTE: the MVA analysis will generate 2 files: a PNG and a PDF of the plot
Bayesian network analysis requires both data matrix and metadata files. MVA analysis requires only a data matrix file.
The metadata file MUST have two column headers, ideally 'Gene' and 'GeneClass'. The second column may contain empty rows.
The data matrix MUST have headers for all columns.
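Before passing external data to ReGAIN, the requirements above can be checked with a short script. The sketch below is hypothetical and makes two assumptions beyond the text: that the first matrix column identifies the genome file, and that every remaining column is a variable that needs at least two distinct values. The function names `check_metadata` and `check_matrix` are not part of ReGAIN.

```python
import csv
import io


def check_metadata(text):
    """Return a list of problems found in the metadata header."""
    header = next(csv.reader(io.StringIO(text)), [])
    if len(header) < 2 or any(not name.strip() for name in header[:2]):
        return ["metadata needs two named columns"]
    return []


def check_matrix(text):
    """Return a list of problems found in a presence/absence matrix."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["matrix is empty"]
    header, body = rows[0], rows[1:]
    problems = []
    if any(not name.strip() for name in header):
        problems.append("every column needs a header")
    if any(len(row) != len(header) for row in body):
        problems.append("every row needs one value per column")
        return problems
    # Column 0 is assumed to hold the genome file name; the rest are variables.
    for index, name in enumerate(header[1:], start=1):
        if len({row[index] for row in body}) < 2:
            problems.append(f"{name} has a single state")
    return problems


good_matrix = "file,geneA,geneB\ng1.fa,1,0\ng2.fa,0,1\n"
bad_matrix = "file,geneA,\ng1.fa,1,1\ng2.fa,1,0\n"
metadata = "Gene,GeneClass\ngeneA,beta-lactam\ngeneB,\n"

print(check_matrix(good_matrix))
print(check_matrix(bad_matrix))
print(check_metadata(metadata))
```

To check real files, the text could be read with `open(path, newline="").read()` and passed to the same functions.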
ReGAIN
Bring Horvath E, Stein M, Mulvey MA, Hernandez EJ, Winter JM.
Resistance Gene Association and Inference Network (ReGAIN): A Bioinformatics Pipeline for Assessing Probabilistic Co-Occurrence Between Resistance Genes in Bacterial Pathogens.
bioRxiv 2024.02.26.582197; doi: https://doi.org/10.1101/2024.02.26.582197
AMRFinder
Feldgarden M, Brover V, Haft DH, Prasad AB, Slotta DJ, Tolstoy I, Tyson GH, Zhao S, Hsu CH, McDermott PF, Tadesse DA, Morales C, Simmons M, Tillman G, Wasilenko J, Folster JP, Klimke W.
Validating the AMRFinder Tool and Resistance Gene Database by Using Antimicrobial Resistance Genotype-Phenotype Correlations in a Collection of Isolates.
Antimicrob Agents Chemother. 2019 Oct 22;63(11):e00483-19. doi: 10.1128/AAC.00483-19
AMRFinderPlus
Feldgarden M, Brover V, Gonzalez-Escalona N, Frye JG, Haendiges J, Haft DH, Hoffmann M, Pettengill JB, Prasad AB, Tillman GE, Tyson GH, Klimke W.
AMRFinderPlus and the Reference Gene Catalog facilitate examination of the genomic links among antimicrobial resistance, stress response, and virulence.
Sci Rep. 2021 Jun 16;11(1):12728. doi: 10.1038/s41598-021-91456-0