g4hunterpy3

Python 3 implementation of G4Hunter as a full Python package. G4Hunter predicts G-quadruplex (G4) propensity in DNA sequences using a sliding window approach based on the algorithm described by Bedrat et al. (2016).

We created this package because we wanted to use G4Hunter locally, yet the original Python implementation is only compatible with Python 2.7. Based on the original script, we converted the code into a fully fledged installable package with tests. It should reproduce and extend the original functionality, notably with the --complex-plot option, which is useful for large-scale predictions.

For any questions or concerns, contact Alex.

About

G4Hunter is a bioinformatic tool designed to predict the formation propensity of G-quadruplexes (G4), which are four-stranded nucleic acid structures stabilized by guanine quartets. Unlike previous pattern-matching algorithms that rely on rigid consensus sequences, G4Hunter uses a scoring system based on G-richness and G-skewness.

Core Scoring Principles

The algorithm assigns a score to every nucleotide in a sequence to reflect its contribution to (or competition with) G4 formation:

Neutral Bases: Adenine (A) and Thymine (T) are assigned a score of 0.

Guanine (G): Assigned positive scores based on the length of the G-run:
A single G = 1
Each G in a GG run = 2
Each G in a GGG run = 3
Each G in a run of 4 or more Gs = 4

Cytosine (C): Assigned the same values as G, but negative (e.g., a CCC run scores -3 per C). This penalizes regions where high C-content would favor stable duplex formation over G4 structures.
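
As an illustration of the rules above, here is a minimal sketch of the per-base scoring. It is not the package's own code: the function name `base_scores` is hypothetical, and treating any character other than G or C (for example N) as neutral is an assumption.

```python
from itertools import groupby


def base_scores(seq):
    """Return one score per nucleotide, following the run-length rules above."""
    scores = []
    for base, run in groupby(seq.upper()):
        run_len = len(list(run))
        if base == "G":
            value = min(run_len, 4)    # G runs score +1 to +4 per base, capped at 4
        elif base == "C":
            value = -min(run_len, 4)   # C runs mirror G runs with a negative sign
        else:
            value = 0                  # A and T are neutral (other symbols too: assumption)
        scores.extend([value] * run_len)
    return scores


print(base_scores("AGGGTCCA"))  # [0, 3, 3, 3, 0, -2, -2, 0]
```

Grouping the sequence into runs first means every base in a run receives the same value, which is what the list above describes.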

Calculation Method

Sliding Window: The algorithm typically uses a sliding window (default: 25 nt) to compute the arithmetic mean of the scores within that window.

G4Hscore: The resulting mean value is the G4Hscore.

Thresholding: A score threshold is applied to extract G-quadruplex Forming Sequences (G4FS). Typical threshold values range between 1.0 and 2.0.
- Threshold of 1.2: This is a recommended compromise for identifying many true G4 motifs while maintaining reasonable precision.
- Threshold of 1.5: This is recommended for high-confidence predictions (precision >90%) where stable G4 formation is likely.

Finally, scores can be plotted either in a strand-dependent way (C-rich regions score negative, indicating potential G4 formation on the reverse strand) or in a strand-agnostic way (absolute scores, so both strands are treated equally).
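
The calculation can be sketched end to end as follows. This is a hedged illustration rather than the package's implementation: `base_scores`, `window_means`, and `call_hits` are hypothetical names, returning no windows for sequences shorter than the window is an assumption, and hits are kept when the absolute score is at or above the threshold.

```python
from itertools import groupby


def base_scores(seq):
    """Per-base scores: G runs positive, C runs negative, capped at 4."""
    scores = []
    for base, run in groupby(seq.upper()):
        n = len(list(run))
        value = {"G": min(n, 4), "C": -min(n, 4)}.get(base, 0)
        scores.extend([value] * n)
    return scores


def window_means(seq, window=25):
    """Arithmetic mean of the per-base scores in every full window."""
    scores = base_scores(seq)
    if len(scores) < window:
        return []                              # assumption: no partial windows
    total = sum(scores[:window])
    means = [total / window]
    for i in range(window, len(scores)):
        total += scores[i] - scores[i - window]  # slide the window by one base
        means.append(total / window)
    return means


def call_hits(means, threshold=1.2):
    """Keep (1-based window start, score) where |score| >= threshold."""
    return [(i + 1, s) for i, s in enumerate(means) if abs(s) >= threshold]


means = window_means("GGGAACCCC", window=5)
print(means)             # [1.8, 0.4, -1.0, -2.4, -3.2]
print(call_hits(means))  # [(1, 1.8), (4, -2.4), (5, -3.2)]
```

The toy window of 5 is only there to keep the output readable. Comparing on the absolute value is what lets C-rich windows (potential G4 on the reverse strand) be reported alongside G-rich ones.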

For more information, see the original manuscript [1].

Installation

Install from source:

git clone https://github.com/holehouse-lab/g4hunterpy3.git
cd g4hunterpy3
pip install .

Or install in development mode:

pip install -e .

Dependencies

Python >= 3.8
numpy
matplotlib
protfasta

Command-Line Interface (CLI)

After installation, the g4hunterpy3 command becomes available in your terminal. This CLI tool reads FASTA files containing DNA sequences, calculates G4Hunter scores using a sliding window approach, and outputs hit regions and merged regions to text files.

Basic Usage

g4hunterpy3 -i <input.fasta> -o <output_directory> [options]

Options

Option	Short	Required	Default	Description
`--input`	`-i`	Yes	—	Path to the input FASTA file containing DNA sequence(s)
`--output`	`-o`	Yes	—	Path to the output directory where results will be saved
`--window`	`-w`	No	`25`	Window size (k) for the sliding window analysis
`--score`	`-s`	No	`1.2`	Absolute score threshold for calling hits. Windows scoring below this threshold (in absolute value) are not reported
`--info`	—	No	`false`	Print information about the sequences being analyzed (length, number of hits, number of merged regions)
`--simple-plot`	—	No	`false`	Generate a simple PDF plot of the sliding-window scores for each sequence
`--complex-plot`	—	No	`false`	Generate a complex PDF plot with binned visualization of sliding-window scores
`--complex-plot-nbins`	—	No	`1000`	Number of bins for the complex plot (appropriate for large genomes)
`--complex-plot-percentile`	—	No	`95`	Percentile to use for y-axis limit in complex plot
`--strand-agnostic`	—	No	`false`	Use absolute G4 scores for complex plot (treats both strands equally). When enabled, uses red colormap; when disabled, uses diverging blue-red colormap to show C-rich (negative) vs G-rich (positive) regions
`--highlight-regions`	—	No	—	Regions to highlight on complex plot. Specify as `START:END` pairs (1-based coordinates). Multiple regions can be specified

Examples

Basic analysis with default parameters

g4hunterpy3 -i sequences.fasta -o results/

This runs G4Hunter with:

Window size: 25
Score threshold: 1.2

Custom window size and threshold

g4hunterpy3 -i genome.fasta -o output/ -w 30 -s 1.5

This uses:

Window size: 30 nt
Score threshold: 1.5 (more stringent, fewer hits)

Get sequence information

g4hunterpy3 -i sequences.fasta -o results/ --info

This prints details about each sequence including:

Sequence length
Number of window hits
Number of merged regions
Output file paths

Generate visualization plots

# Simple score plot
g4hunterpy3 -i sequences.fasta -o results/ --simple-plot

# Complex binned plot (useful for large sequences/genomes)
g4hunterpy3 -i genome.fasta -o results/ --complex-plot --complex-plot-nbins 500

# Both plots with custom percentile for y-axis
g4hunterpy3 -i genome.fasta -o results/ --simple-plot --complex-plot --complex-plot-percentile 99
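
To give a rough idea of what binning and a percentile-based y-axis limit mean for a long sequence, here is a small pure-Python sketch. It is not the package's plotting code: the function names are hypothetical, summarizing each bin by its mean is an assumption, and the nearest-rank percentile used here is a simplification (numpy's default percentile interpolates linearly).

```python
import math


def bin_means(scores, nbins=1000):
    """Split scores into at most nbins contiguous bins and average each bin."""
    n = len(scores)
    nbins = min(nbins, n)          # never create more bins than data points
    if nbins == 0:
        return []
    # Integer bin edges; each bin is non-empty because nbins <= n.
    edges = [i * n // nbins for i in range(nbins + 1)]
    return [sum(scores[a:b]) / (b - a) for a, b in zip(edges, edges[1:])]


def percentile(values, q):
    """Nearest-rank percentile (q between 0 and 100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


binned = bin_means([1, 1, 3, 3, 5, 5], nbins=3)
print(binned)                                  # [1.0, 3.0, 5.0]
print(percentile(list(range(1, 101)), 95))     # 95
```

Capping the y-axis at a high percentile rather than the maximum keeps a few extreme bins from flattening the rest of the plot.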

Highlight specific regions on complex plot

You can highlight specific genomic regions on the complex plot using the --highlight-regions option. This is useful for marking regions of interest such as promoters, genes, or other annotations.

# Highlight a single region (e.g., a promoter from position 1000 to 2000)
g4hunterpy3 -i genome.fasta -o results/ --complex-plot --highlight-regions 1000:2000

# Highlight multiple regions
g4hunterpy3 -i genome.fasta -o results/ --complex-plot --highlight-regions 1000:2000 5000:6000 8000:9000

# Combine with other complex plot options
g4hunterpy3 -i genome.fasta -o results/ --complex-plot --complex-plot-nbins 500 --highlight-regions 1000:2000 5000:6000

Highlighted regions appear as yellow vertical spans (alpha=0.5) on the heatmap.
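
If you generate region specifications from your own annotations, the `START:END` format can be handled as in the sketch below. The helper `parse_region` is hypothetical and not part of the package, and treating END as inclusive is an assumption (the option only states that coordinates are 1-based).

```python
def parse_region(spec):
    """Parse 'START:END' (1-based) into a 0-based, half-open (start, end) pair."""
    start_text, end_text = spec.split(":")
    start, end = int(start_text), int(end_text)
    if start < 1 or end < start:
        raise ValueError(f"invalid region: {spec!r}")
    return start - 1, end   # assumes END is inclusive, so it becomes the slice end


print(parse_region("1000:2000"))  # (999, 2000)
```

The returned pair can be used directly as a Python slice, for example `sequence[start:end]`.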

Strand-agnostic vs strand-specific plotting

By default, the complex plot shows strand-specific scores using a diverging blue-red colormap:

Blue regions indicate C-rich sequences (negative scores, potential G4 on reverse strand)
Red regions indicate G-rich sequences (positive scores, potential G4 on forward strand)

Use --strand-agnostic to show absolute scores (ignoring strand), which is useful for double-stranded DNA where G4s can form on either strand:

# Strand-specific (default): blue for C-rich, red for G-rich
g4hunterpy3 -i genome.fasta -o results/ --complex-plot

# Strand-agnostic: all G4-forming regions shown in red (absolute values)
g4hunterpy3 -i genome.fasta -o results/ --complex-plot --strand-agnostic

Full example with all options

g4hunterpy3 \
    -i my_sequences.fasta \
    -o ./g4_results/ \
    -w 25 \
    -s 1.2 \
    --info \
    --simple-plot \
    --complex-plot \
    --complex-plot-nbins 1000 \
    --complex-plot-percentile 95 \
    --strand-agnostic \
    --highlight-regions 1000:2000 5000:6000

Output Files

For each sequence record in the input FASTA file, the CLI generates the following output files:

1. Per-Window Hit File

Filename: <sequence_header>-W<window_size>-S<threshold>.txt

Contains all windows that pass the score threshold:

Column	Description
Start	1-based start position of the window
End	1-based end position of the window
Sequence	The nucleotide sequence of the window
Length	Length of the window (equals window size)
Score	G4Hunter score for the window

2. Merged Region File

Filename: <sequence_header>-Merged.txt

Contains merged regions formed by overlapping or adjacent window hits:

Column	Description
Start	1-based start position of the merged region
End	1-based end position of the merged region
Sequence	The nucleotide sequence of the merged region
Length	Total length of the merged region
Score	Mean G4Hunter score across the region
NBR	Region number (sequential identifier)
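
The idea of merging overlapping or adjacent window hits can be sketched as below. This is an illustration under stated assumptions, not the package's code: `merge_windows` is a hypothetical name, windows are (start, end) pairs with 1-based inclusive coordinates, and the sketch ignores score sign and does not recompute the region score.

```python
def merge_windows(hits):
    """Merge (start, end) windows that overlap or touch into larger regions."""
    merged = []
    for start, end in sorted(hits):
        if merged and start <= merged[-1][1] + 1:     # overlaps or is adjacent
            merged[-1][1] = max(merged[-1][1], end)   # extend the current region
        else:
            merged.append([start, end])               # start a new region
    return [tuple(region) for region in merged]


print(merge_windows([(1, 25), (2, 26), (27, 51), (100, 124)]))
# [(1, 51), (100, 124)]
```

Numbering the returned regions in order would give a sequential identifier like the NBR column.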

3. Plot Files (optional)

Simple Plot: <sequence_header>-ScorePlot.pdf — A straightforward visualization of G4Hunter scores across the sequence
Complex Plot: <sequence_header>-ComplexScorePlot.pdf — A binned visualization suitable for large sequences/genomes

Understanding G4Hunter Scores

Positive scores indicate G-rich regions (potential G4-forming on the forward strand)
Negative scores indicate C-rich regions (potential G4-forming on the reverse strand)
Score magnitude reflects the G4-forming propensity:
- |score| ≥ 1.2: Moderate propensity
- |score| ≥ 1.5: High propensity
- |score| ≥ 2.0: Very high propensity
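
The interpretation above can be written as a small helper. The function name `describe_score` and the label strings are hypothetical; only the cutoffs (1.2, 1.5, 2.0) and the sign convention come from the text.

```python
def describe_score(score):
    """Return (strand, propensity label) for a single G4Hunter score."""
    if score > 0:
        strand = "forward"      # G-rich
    elif score < 0:
        strand = "reverse"      # C-rich
    else:
        strand = "none"
    magnitude = abs(score)
    if magnitude >= 2.0:
        level = "very high"
    elif magnitude >= 1.5:
        level = "high"
    elif magnitude >= 1.2:
        level = "moderate"
    else:
        level = "below threshold"
    return strand, level


print(describe_score(-1.6))  # ('reverse', 'high')
print(describe_score(2.3))   # ('forward', 'very high')
```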

How to Cite

Please cite the original G4Hunter paper [1], and link to this repository so folks can reproduce analysis with this implementation.

[1] Bedrat, A., Lacroix, L. & Mergny, J.-L. Re-evaluation of G-quadruplex propensity with G4Hunter. Nucleic Acids Res. 44, 1746–1759 (2016).

Copyright

Acknowledgements

Project based on the Computational Molecular Science Python Cookiecutter version 1.11.
