Identifying genetic variants in the pangenome using a reference tree

Introduction

pantree converts a pangenome graph .gfa file into a .vcf file containing variants identified in the graph. It creates a reference tree and defines variants as edges that are not in the reference tree. For more information, please see our preprint:

Nowbandegani PS, Zhang S, Hu H, Li H, O'Connor LJ. Defining and cataloging variants in pangenome graphs. bioRxiv. 2025. doi: 10.1101/2025.08.04.668502
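As a toy illustration of the idea (not pantree's implementation), the sketch below builds a spanning tree of a small directed graph by depth-first search and reports every edge left out of the tree as a "variant edge". The graph, the node names, and the function split_tree_and_variant_edges are hypothetical; real pangenome graphs are bidirected and the reference tree is chosen more carefully.

```python
from collections import defaultdict


def split_tree_and_variant_edges(edges, root):
    """Return (tree_edges, variant_edges) for a directed graph.

    Toy illustration only: a plain DFS from `root` selects the tree edges,
    and every remaining edge is reported as a variant edge.
    """
    adjacency = defaultdict(list)
    for source, target in edges:
        adjacency[source].append(target)

    visited = {root}
    tree_edges = []
    stack = [root]
    while stack:
        node = stack[-1]
        # Descend into the first neighbour that is not yet in the tree
        for neighbour in adjacency[node]:
            if neighbour not in visited:
                visited.add(neighbour)
                tree_edges.append((node, neighbour))
                stack.append(neighbour)
                break
        else:
            # No unvisited neighbour left: backtrack
            stack.pop()

    tree_set = set(tree_edges)
    variant_edges = [edge for edge in edges if edge not in tree_set]
    return tree_edges, variant_edges


# A simple "bubble": two paths from s to t, through a or through b
edges = [("s", "a"), ("s", "b"), ("a", "t"), ("b", "t")]
tree_edges, variant_edges = split_tree_and_variant_edges(edges, "s")
print("tree edges:   ", tree_edges)
print("variant edges:", variant_edges)
```

In this bubble, three edges are enough to reach every node, so exactly one edge remains outside the tree and stands for the alternative path.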

Installation

You can install pantree using uv:

git clone https://github.com/oclb/pantree.git
cd pantree
uv venv
uv sync

Command Line Interface

uv run pantree <gfa_file> <vcf_file> [options]

Required Arguments

gfa_file: Path to the input GFA file containing the pangenome graph
vcf_file: Output path for VCF file

Optional Arguments

--chr-id TEXT: Chromosome ID for VCF output (default: "chr0")
--ref-name TEXT: Reference sample name (default: "GRCh38")
--no-genotypes: Skip genotype computation
--log-path TEXT: Path to log file for tracking progress and memory usage
--verbose, -v: Enable verbose logging to console
--dfs-method [max_weight|contiguous]: DFS method for reference tree construction (default: "max_weight")
--priority-samples TEXT: Comma-separated list of sample names to prioritize, in the order given, when using --dfs-method=contiguous (see below)
--no-missingness: Skip missingness computation for genotypes (see below)
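When pantree is called from a script, the options listed above can be assembled into an argument list. The following is a minimal sketch that uses only the documented flags; the helper build_pantree_command and the file names are hypothetical and not part of pantree.

```python
import shlex


def build_pantree_command(
    gfa_file,
    vcf_file,
    chr_id=None,
    ref_name=None,
    dfs_method=None,
    priority_samples=None,
    log_path=None,
    genotypes=True,
    missingness=True,
    verbose=False,
):
    """Assemble an argument list for `uv run pantree` (hypothetical helper)."""
    command = ["uv", "run", "pantree", gfa_file, vcf_file]
    if chr_id is not None:
        command += ["--chr-id", chr_id]
    if ref_name is not None:
        command += ["--ref-name", ref_name]
    if dfs_method is not None:
        command.append(f"--dfs-method={dfs_method}")
    if priority_samples:
        # The option takes one comma-separated string, in priority order
        command += ["--priority-samples", ",".join(priority_samples)]
    if log_path is not None:
        command += ["--log-path", log_path]
    if not genotypes:
        command.append("--no-genotypes")
    if not missingness:
        command.append("--no-missingness")
    if verbose:
        command.append("--verbose")
    return command


command = build_pantree_command(
    "graph.gfa",      # placeholder input path
    "variants.vcf",   # placeholder output path
    chr_id="chr1",
    dfs_method="contiguous",
    priority_samples=["GRCh38", "CHM13"],
    missingness=False,
)
print(shlex.join(command))
# To execute it: subprocess.run(command, check=True)
```

Passing the command as a list rather than a single string avoids shell quoting problems with sample names or paths.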

To output a bgzipped VCF, use a .vcf.gz extension for the output file.
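To read the output back regardless of which extension was chosen, a small helper along these lines may be convenient. bgzip output is gzip-compatible, so Python's standard gzip module can read it. The names open_vcf and count_records are illustrative, and the demo file below is written with ordinary gzip rather than bgzip.

```python
import gzip
import os
import tempfile


def open_vcf(path):
    """Open a VCF for reading as text, whether plain or gzip/bgzip compressed."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def count_records(path):
    """Count variant records, skipping header lines and blank lines."""
    with open_vcf(path) as handle:
        return sum(1 for line in handle if line.strip() and not line.startswith("#"))


# Tiny made-up VCF used only to exercise the helpers
demo_vcf = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr0\t101\t.\tA\tG\t.\t.\t.\n"
    "chr0\t250\t.\tC\tCTTG\t.\t.\t.\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    plain_path = os.path.join(tmp_dir, "demo.vcf")
    gz_path = os.path.join(tmp_dir, "demo.vcf.gz")
    with open(plain_path, "wt") as handle:
        handle.write(demo_vcf)
    with gzip.open(gz_path, "wt") as handle:
        handle.write(demo_vcf)
    plain_count = count_records(plain_path)
    gz_count = count_records(gz_path)

print(plain_count, gz_count)
```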

The --dfs-method=contiguous option creates a reference tree whose branches follow individual haplotypes for as long as possible. A branch switches to a new haplotype when the current haplotype ends, or when the next node on the current haplotype is already in the reference tree. This behavior applies only to haplotypes belonging to samples in the --priority-samples list. When switching to a new haplotype, these same samples are prioritized in the order in which they are specified.

Additionally, 'haplotype positions' are computed for haplotypes belonging to samples in this list: if such a haplotype visits the branch point of a variant edge, then the position of that branch point is used to compute the haplotype position of that variant edge. This follows the same rule as the ordinary POS field: the position of a variant is the position of the first base of its REF and ALT alleles. Ordinarily this is the first base after the end of the branch point node. For on-reference indels, however, one base is prepended to both alleles to make them non-empty, and the position is accordingly decremented by one.
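The position rule can be sketched as a small function. This is an illustration under stated assumptions rather than pantree's code: the function name and arguments are hypothetical, branch_end is taken to be the 1-based coordinate of the last base of the branch point node, the prepended base is assumed to be that last base (the usual VCF convention), and any variant with an empty allele is treated as an indel that needs padding.

```python
def vcf_position_and_alleles(branch_end, branch_last_base, ref_allele, alt_allele):
    """Return (POS, REF, ALT) for a variant edge leaving a branch point node.

    branch_end:       1-based coordinate of the last base of the branch point node
    branch_last_base: that base, used as padding (an assumption of this sketch)
    ref_allele, alt_allele: allele sequences; either may be empty for an indel
    """
    if ref_allele and alt_allele:
        # Ordinary case: alleles start at the first base after the branch point
        return branch_end + 1, ref_allele, alt_allele
    # Indel with an empty allele: prepend one base to both alleles and
    # move the position back by one so that it points at the padding base
    return branch_end, branch_last_base + ref_allele, branch_last_base + alt_allele


snp = vcf_position_and_alleles(100, "C", "A", "G")
insertion = vcf_position_and_alleles(100, "C", "", "TTG")
deletion = vcf_position_and_alleles(100, "C", "AT", "")
print(snp)
print(insertion)
print(deletion)
```

The same function applies to a haplotype position if branch_end is measured along that haplotype instead of along the reference.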

Graphs generated by `vg`

The --no-missingness flag is useful for graphs generated using vg (e.g., per-chromosome .vg graphs from minigraph-cactus converted to .gfa using vg convert). Pantree annotates variants as 'missing' based on their positions along the reference genome, comparing these positions with the coordinates of contiguous walks. This logic can fail for vg-generated graphs because of the way that vg handles the linear reference genome. If you encounter errors related to missing genotypes having non-zero allele counts, use the --no-missingness flag to disable this logic.
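The kind of comparison described above can be pictured with the following sketch. It is only a guess at the general shape of the logic, not pantree's implementation: the interval representation, the function names, and the numbers are all hypothetical. A variant is treated as missing for a haplotype when no contiguous walk covers its reference position, and a call is flagged as inconsistent when it is missing but still carries a non-zero allele count.

```python
def is_missing(variant_pos, walk_intervals):
    """True if no walk interval (start, end), inclusive, covers the position."""
    return not any(start <= variant_pos <= end for start, end in walk_intervals)


def inconsistent_calls(calls, walk_intervals):
    """Return the (position, allele_count) calls that are missing yet non-zero."""
    return [
        (pos, count)
        for pos, count in calls
        if is_missing(pos, walk_intervals) and count > 0
    ]


# Reference intervals covered by the contiguous walks of one haplotype (made up)
walk_intervals = [(1, 1000), (5000, 9000)]

# (reference position, allele count carried by this haplotype) (made up)
calls = [(500, 1), (3000, 0), (4000, 1), (6000, 0)]

problems = inconsistent_calls(calls, walk_intervals)
print(problems)
```

If the reference coordinates of the walks are unreliable, as may happen with vg-generated graphs, checks of this kind can report spurious inconsistencies, which is the situation --no-missingness is meant to sidestep.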

`consolidate` subcommand

The consolidate subcommand takes a .vcf file produced by pantree and writes a single-haplotype .vcf file containing the pairwise differences between the chosen sample haplotype and the linear reference genome. Any nested variation is collapsed: for example, if the haplotype has an insertion and then a SNP on that insertion, the two are combined.

pantree consolidate <vcf_file> <sample_name> <haplotype_number> <output_path>
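To make the collapsing of nested variation concrete, here is a minimal sketch of the insertion-plus-SNP example. It is not the consolidate implementation: the function collapse_nested_snps and the (offset, ref_base, alt_base) layout are hypothetical, with offsets counted from 0 within the inserted sequence.

```python
def collapse_nested_snps(inserted_sequence, nested_snps):
    """Apply SNPs that lie on an insertion to the inserted sequence.

    nested_snps: iterable of (offset, ref_base, alt_base), where offset is
    0-based within the inserted sequence (a convention of this sketch).
    """
    bases = list(inserted_sequence)
    for offset, ref_base, alt_base in nested_snps:
        if bases[offset] != ref_base:
            raise ValueError(
                f"expected {ref_base!r} at offset {offset}, found {bases[offset]!r}"
            )
        bases[offset] = alt_base
    return "".join(bases)


# The haplotype carries the insertion ACGT, plus a G>T SNP on its third base
combined = collapse_nested_snps("ACGT", [(2, "G", "T")])
print(combined)
```

After collapsing, the haplotype is described by a single insertion of the combined sequence relative to the linear reference.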

Python API Usage

from pantree import PangenomeGraph, Genotype

gfa_path = "/path/to/graph.gfa"

G: PangenomeGraph = PangenomeGraph.from_gfa(
    gfa_path,
    ref_name="GRCh38",
)

# Also load the walks; this increases memory requirements.
# The walks are then available as G.walks, as used below.
G = PangenomeGraph.from_gfa(gfa_path, return_walks=True)

# Get the genotype of some walk
genotype: Genotype = Genotype.genotype(G, G.walks['CHM13'], exclude_terminus=True)

# Generate VCF file with genotypes
vcf_path = "/path/to/output.vcf"
chr_id = "chr1"
G.write_vcf(gfa_path, vcf_path, chr_id)

# Generate VCF without genotypes
G.write_vcf(None, vcf_path, chr_id)
