g4hunterpy3

Python 3 implementation of G4Hunter as a full Python package. G4Hunter predicts G-quadruplex (G4) propensity in DNA sequences using a sliding window approach based on the algorithm described by Bedrat et al. (2016).

We created this package because we wanted to use G4Hunter locally, but the original Python implementation is only compatible with Python 2.7. Starting from the original script, we converted the code into a fully fledged, installable package with tests. It should reproduce and extend the original functionality, notably with the --complex-plot option, which is useful for large-scale predictions.

For any questions or concerns, contact Alex.



About

G4Hunter is a bioinformatic tool designed to predict the formation propensity of G-quadruplexes (G4), which are four-stranded nucleic acid structures stabilized by guanine quartets. Unlike previous pattern-matching algorithms that rely on rigid consensus sequences, G4Hunter uses a scoring system based on G-richness and G-skewness.


Core Scoring Principles

The algorithm assigns a score to every nucleotide in a sequence to reflect its contribution to (or competition with) G4 formation:

  • Neutral bases: Adenine (A) and thymine (T) are assigned a score of 0.

  • Guanine (G): Assigned a positive score based on the length of the G-run:

    • A single G = 1
    • Each G in a GG run = 2
    • Each G in a GGG run = 3
    • Each G in a run of 4 or more Gs = 4

  • Cytosine (C): Assigned the same values as G, but negative (e.g., a CCC run scores -3 per C). This penalizes regions where high C content would favor stable duplex formation over G4 structures.
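The scoring rules above can be sketched in a few lines of Python. This is a minimal illustration, not the package's own code: the function name `base_scores` and the `cap` parameter are hypothetical, and treating every base other than G and C (for example N) as neutral is an assumption.

```python
from itertools import groupby


def base_scores(seq, cap=4):
    """Return one score per nucleotide (hypothetical helper, not the package API)."""
    scores = []
    # groupby yields each run of identical consecutive bases
    for base, run in groupby(seq.upper()):
        n = len(list(run))
        if base == "G":
            value = min(n, cap)      # G-runs score +1..+4 per base
        elif base == "C":
            value = -min(n, cap)     # C-runs score -1..-4 per base
        else:
            value = 0                # assumption: anything else is neutral
        scores.extend([value] * n)
    return scores


print(base_scores("AGGGTCCA"))  # [0, 3, 3, 3, 0, -2, -2, 0]
```

Every base in a run receives the same value, so a GGG run contributes 9 in total and a run of five Gs contributes 20.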

Calculation Method

  1. Sliding window: The algorithm typically uses a sliding window (default: 25 nt) and computes the arithmetic mean of the per-nucleotide scores within each window.

  2. G4Hscore: The resulting mean value is the G4Hscore of that window.

  3. Thresholding: A threshold is applied to the absolute G4Hscore to extract G-quadruplex Forming Sequences (G4FS). Typical threshold values range between 1.0 and 2.0.

    • Threshold of 1.2: A recommended compromise that identifies many true G4 motifs while maintaining reasonable precision.
    • Threshold of 1.5: Recommended for high-confidence predictions (precision >90%) where stable G4 formation is likely.

  4. Plotting: Scores can be plotted either in a strand-specific way (C-rich runs score negative and point to potential G4 formation on the reverse strand) or in a strand-agnostic way (absolute scores, so both strands are treated equally).

For more information, see the original manuscript (Bedrat et al., 2016).

Installation

Install from source:

git clone https://github.com/holehouse-lab/g4hunterpy3.git
cd g4hunterpy3
pip install .

Or install in development mode:

pip install -e .

Dependencies

  • Python >= 3.8
  • numpy
  • matplotlib
  • protfasta

Command-Line Interface (CLI)

After installation, the g4hunterpy3 command becomes available in your terminal. This CLI tool reads FASTA files containing DNA sequences, calculates G4Hunter scores using a sliding window approach, and outputs hit regions and merged regions to text files.

Basic Usage

g4hunterpy3 -i <input.fasta> -o <output_directory> [options]

Options

| Option | Short | Required | Default | Description |
|---|---|---|---|---|
| --input | -i | Yes | | Path to the input FASTA file containing DNA sequence(s) |
| --output | -o | Yes | | Path to the output directory where results will be saved |
| --window | -w | No | 25 | Window size (k) for the sliding window analysis |
| --score | -s | No | 1.2 | Absolute score threshold for calling hits. Windows scoring below this threshold (in absolute value) are not reported |
| --info | | No | false | Print information about the sequences being analyzed (length, number of hits, number of merged regions) |
| --simple-plot | | No | false | Generate a simple PDF plot of the sliding-window scores for each sequence |
| --complex-plot | | No | false | Generate a complex PDF plot with binned visualization of sliding-window scores |
| --complex-plot-nbins | | No | 1000 | Number of bins for the complex plot (appropriate for large genomes) |
| --complex-plot-percentile | | No | 95 | Percentile to use for y-axis limit in complex plot |
| --strand-agnostic | | No | false | Use absolute G4 scores for complex plot (treats both strands equally). When enabled, uses red colormap; when disabled, uses diverging blue-red colormap to show C-rich (negative) vs G-rich (positive) regions |
| --highlight-regions | | No | | Regions to highlight on complex plot. Specify as START:END pairs (1-based coordinates). Multiple regions can be specified |

Examples

Basic analysis with default parameters

g4hunterpy3 -i sequences.fasta -o results/

This runs G4Hunter with:

  • Window size: 25
  • Score threshold: 1.2

Custom window size and threshold

g4hunterpy3 -i genome.fasta -o output/ -w 30 -s 1.5

This uses:

  • Window size: 30 nt
  • Score threshold: 1.5 (more stringent, fewer hits)

Get sequence information

g4hunterpy3 -i sequences.fasta -o results/ --info

This prints details about each sequence including:

  • Sequence length
  • Number of window hits
  • Number of merged regions
  • Output file paths

Generate visualization plots

# Simple score plot
g4hunterpy3 -i sequences.fasta -o results/ --simple-plot

# Complex binned plot (useful for large sequences/genomes)
g4hunterpy3 -i genome.fasta -o results/ --complex-plot --complex-plot-nbins 500

# Both plots with custom percentile for y-axis
g4hunterpy3 -i genome.fasta -o results/ --simple-plot --complex-plot --complex-plot-percentile 99

Highlight specific regions on complex plot

You can highlight specific genomic regions on the complex plot using the --highlight-regions option. This is useful for marking regions of interest such as promoters, genes, or other annotations.

# Highlight a single region (e.g., a promoter from position 1000 to 2000)
g4hunterpy3 -i genome.fasta -o results/ --complex-plot --highlight-regions 1000:2000

# Highlight multiple regions
g4hunterpy3 -i genome.fasta -o results/ --complex-plot --highlight-regions 1000:2000 5000:6000 8000:9000

# Combine with other complex plot options
g4hunterpy3 -i genome.fasta -o results/ --complex-plot --complex-plot-nbins 500 --highlight-regions 1000:2000 5000:6000

Highlighted regions appear as yellow vertical spans (alpha=0.5) on the heatmap.
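When preparing region strings from existing annotations, it can help to check how a 1-based START:END pair maps onto Python's 0-based slicing. The sketch below is only an illustration: `parse_region` is a hypothetical helper, not part of the package, and it assumes END is inclusive, which the option description does not state explicitly.

```python
def parse_region(text):
    """Convert a 1-based 'START:END' string to a 0-based half-open (start, end) pair.

    Assumption: END is inclusive.
    """
    start_text, end_text = text.split(":")
    start, end = int(start_text), int(end_text)
    if start < 1 or end < start:
        raise ValueError(f"invalid region: {text!r}")
    return start - 1, end


start, end = parse_region("1000:2000")
print(start, end)  # 999 2000, so sequence[start:end] covers 1001 bases
```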

Strand-agnostic vs strand-specific plotting

By default, the complex plot shows strand-specific scores using a diverging blue-red colormap:

  • Blue regions indicate C-rich sequences (negative scores, potential G4 on reverse strand)
  • Red regions indicate G-rich sequences (positive scores, potential G4 on forward strand)

Use --strand-agnostic to show absolute scores (ignoring strand), which is useful for double-stranded DNA where G4s can form on either strand:

# Strand-specific (default): blue for C-rich, red for G-rich
g4hunterpy3 -i genome.fasta -o results/ --complex-plot

# Strand-agnostic: all G4-forming regions shown in red (absolute values)
g4hunterpy3 -i genome.fasta -o results/ --complex-plot --strand-agnostic

Full example with all options

g4hunterpy3 \
    -i my_sequences.fasta \
    -o ./g4_results/ \
    -w 25 \
    -s 1.2 \
    --info \
    --simple-plot \
    --complex-plot \
    --complex-plot-nbins 1000 \
    --complex-plot-percentile 95 \
    --strand-agnostic \
    --highlight-regions 1000:2000 5000:6000

Output Files

For each sequence record in the input FASTA file, the CLI generates the following output files:

1. Per-Window Hit File

Filename: <sequence_header>-W<window_size>-S<threshold>.txt

Contains all windows that pass the score threshold:

| Column | Description |
|---|---|
| Start | 1-based start position of the window |
| End | 1-based end position of the window |
| Sequence | The nucleotide sequence of the window |
| Length | Length of the window (equals window size) |
| Score | G4Hunter score for the window |

2. Merged Region File

Filename: <sequence_header>-Merged.txt

Contains merged regions formed by overlapping or adjacent window hits:

| Column | Description |
|---|---|
| Start | 1-based start position of the merged region |
| End | 1-based end position of the merged region |
| Sequence | The nucleotide sequence of the merged region |
| Length | Total length of the merged region |
| Score | Mean G4Hunter score across the region |
| NBR | Region number (sequential identifier) |
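Merging overlapping or adjacent window hits is a standard interval-merging step. The sketch below shows one plausible way to do it, assuming hits are given as 0-based window start positions and that regions are reported as 1-based inclusive coordinates. `merge_windows` is a hypothetical name, and the package may refine region boundaries or compute region scores in ways not shown here.

```python
def merge_windows(starts, window=25):
    """Merge window hits into 1-based inclusive (start, end) regions."""
    regions = []
    for s in sorted(starts):
        begin, end = s + 1, s + window          # 1-based inclusive window
        # Overlapping or directly adjacent to the previous region: extend it
        if regions and begin <= regions[-1][1] + 1:
            regions[-1][1] = max(regions[-1][1], end)
        else:
            regions.append([begin, end])
    return [tuple(r) for r in regions]


print(merge_windows([0, 1, 2, 40], window=25))  # [(1, 27), (41, 65)]
print(merge_windows([0, 25], window=25))        # [(1, 50)]
```

In the first call, the three windows starting at positions 0 to 2 overlap and collapse into one region, while the window starting at 40 stays separate. In the second, the two windows are adjacent (25 is followed directly by 26) and are merged.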

3. Plot Files (optional)

  • Simple Plot: <sequence_header>-ScorePlot.pdf — A straightforward visualization of G4Hunter scores across the sequence
  • Complex Plot: <sequence_header>-ComplexScorePlot.pdf — A binned visualization suitable for large sequences/genomes

Understanding G4Hunter Scores

  • Positive scores indicate G-rich regions (potential G4-forming on the forward strand)
  • Negative scores indicate C-rich regions (potential G4-forming on the reverse strand)
  • Score magnitude reflects the G4-forming propensity:
    • |score| ≥ 1.2: Moderate propensity
    • |score| ≥ 1.5: High propensity
    • |score| ≥ 2.0: Very high propensity
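As a quick reading aid, the sign and magnitude rules above can be expressed as a small function. This is a sketch only: `describe_score` is a hypothetical helper, and the label "below threshold" for magnitudes under 1.2 is introduced here for illustration.

```python
def describe_score(score):
    """Return (strand, propensity) for a single window score."""
    magnitude = abs(score)
    if magnitude >= 2.0:
        level = "very high"
    elif magnitude >= 1.5:
        level = "high"
    elif magnitude >= 1.2:
        level = "moderate"
    else:
        level = "below threshold"   # illustrative label, not from the package
    if score > 0:
        strand = "forward"          # G-rich on the given strand
    elif score < 0:
        strand = "reverse"          # C-rich here, so G-rich on the reverse strand
    else:
        strand = "none"
    return strand, level


print(describe_score(1.75))   # ('forward', 'high')
print(describe_score(-2.1))   # ('reverse', 'very high')
```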

How to Cite

Please cite the original G4Hunter paper [1], and link to this repository so folks can reproduce analysis with this implementation.

[1] Bedrat, A., Lacroix, L. & Mergny, J.-L. Re-evaluation of G-quadruplex propensity with G4Hunter. Nucleic Acids Res. 44, 1746–1759 (2016).


Copyright

Copyright (c) 2025-2026, Alex Holehouse


Acknowledgements

Project based on the Computational Molecular Science Python Cookiecutter version 1.11.
