oclb/pantree

Identifying genetic variants in the pangenome using a reference tree

Introduction

pantree converts a pangenome graph .gfa file into a .vcf file containing variants identified in the graph. It creates a reference tree and defines variants as edges that are not in the reference tree. For more information, please see our preprint:

Nowbandegani PS, Zhang S, Hu H, Li H, O'Connor LJ. Defining and cataloging variants in pangenome graphs. bioRxiv. 2025. doi: 10.1101/2025.08.04.668502
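To illustrate the idea of defining variants as non-tree edges, here is a minimal sketch on a toy directed graph. It is not pantree's implementation: the function name, the edge representation, and the plain depth-first search are assumptions made for illustration, and a real pangenome graph is bidirected and carries sequence on its nodes.

```python
from collections import defaultdict


def split_tree_and_variant_edges(edges, root):
    """Split the edges of a directed graph into tree edges and variant edges.

    Tree edges are those used by a depth-first search from `root` to reach a
    node for the first time. Every other edge is reported as a variant edge.
    Recursion keeps the sketch short; it is only suitable for small graphs.
    """
    children = defaultdict(list)
    for u, v in edges:
        children[u].append(v)

    visited = {root}
    tree_edges = []

    def dfs(u):
        for v in children[u]:
            if v not in visited:
                visited.add(v)
                tree_edges.append((u, v))
                dfs(v)

    dfs(root)
    tree_set = set(tree_edges)
    variant_edges = [edge for edge in edges if edge not in tree_set]
    return tree_edges, variant_edges


# Toy "bubble" with two paths from s to t: s -> a -> t and s -> b -> t.
edges = [("s", "a"), ("s", "b"), ("a", "t"), ("b", "t")]
tree_edges, variant_edges = split_tree_and_variant_edges(edges, "s")
print(tree_edges)
print(variant_edges)
```

In this toy bubble, one of the two paths ends up entirely in the tree, and the single edge that closes the other path is the one reported as a variant.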

Installation

You can install pantree using uv:

git clone https://github.com/oclb/pantree.git
cd pantree
uv venv
uv sync

Command Line Interface

uv run pantree <gfa_file> <vcf_file> [options]

Required Arguments

  • gfa_file: Path to the input GFA file containing the pangenome graph
  • vcf_file: Output path for VCF file

Optional Arguments

  • --chr-id TEXT: Chromosome ID for VCF output (default: "chr0")
  • --ref-name TEXT: Reference sample name (default: "GRCh38")
  • --no-genotypes: Skip genotype computation
  • --log-path TEXT: Path to log file for tracking progress and memory usage
  • --verbose, -v: Enable verbose logging to console
  • --dfs-method [max_weight|contiguous]: DFS method for reference tree construction (default: "max_weight")
  • --priority-samples TEXT: Comma-separated list of sample names to prioritize, in the order given, when using --dfs-method contiguous (see below)
  • --no-missingness: Skip missingness computation for genotypes (see below)

To output a bgzipped VCF, use a .vcf.gz extension for the output file.
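Because bgzip output is also valid gzip, a compressed VCF can be read back with Python's standard gzip module. The sketch below is a generic reader, not part of pantree: the helper names open_vcf_text and count_records are hypothetical, and the demo file is written with plain gzip purely to keep the example self-contained.

```python
import gzip
import tempfile
from pathlib import Path


def open_vcf_text(path):
    """Open a plain or gzip/bgzip-compressed VCF for reading as text."""
    path = Path(path)
    if path.suffix == ".gz":
        return gzip.open(path, "rt")
    return open(path, "rt")


def count_records(path):
    """Count data lines, i.e. non-empty lines that do not start with '#'."""
    with open_vcf_text(path) as handle:
        return sum(1 for line in handle if line.strip() and not line.startswith("#"))


# Demo on a tiny, made-up VCF written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.vcf.gz"
    with gzip.open(demo_path, "wt") as out:
        out.write("##fileformat=VCFv4.2\n")
        out.write("#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n")
        out.write("chr0\t5\t.\tA\tG\t.\t.\t.\n")
    n_records = count_records(demo_path)

print(n_records)
```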

The --dfs-method=contiguous option creates a reference tree whose branches follow individual haplotypes for as long as possible. A branch switches to a new haplotype when the current haplotype ends, or when the next node on the current haplotype is already in the reference tree. This behavior applies only to haplotypes belonging to samples in the --priority-samples list. When switching to a new haplotype, these same samples are prioritized in the order in which they are specified.

Additionally, 'haplotype positions' are computed for haplotypes belonging to samples in this list: if such a haplotype visits the branch point of a variant edge, then the position of that branch point is used to compute the haplotype position of that variant edge. This follows the same rule as the ordinary POS field: the position of a variant is the position of the first base of its REF and ALT alleles. Ordinarily this is the first base after the end of the branch point node, but for on-reference indels, one base is prepended to both alleles to make them non-empty, and the position is accordingly decremented by one.
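The position rule can be sketched as a small function. This is an illustration under stated assumptions rather than pantree's code: the function name and arguments are hypothetical, positions are taken to be 1-based, and the sketch pads whenever either allele is empty, whereas the text above describes padding specifically for on-reference indels.

```python
def position_and_alleles(branch_end_pos, branch_end_base, ref_allele, alt_allele):
    """Return (POS, REF, ALT) for a variant leaving a branch point node.

    branch_end_pos:  1-based position of the last base of the branch point node
                     (hypothetical input for this sketch).
    branch_end_base: the base at that position, used as padding for indels.
    """
    # Ordinarily the variant starts at the first base after the branch point.
    pos = branch_end_pos + 1

    # An empty allele is not allowed, so prepend one base to both alleles
    # and shift the position back by one to keep POS on the first base.
    if ref_allele == "" or alt_allele == "":
        ref_allele = branch_end_base + ref_allele
        alt_allele = branch_end_base + alt_allele
        pos -= 1

    return pos, ref_allele, alt_allele


snp = position_and_alleles(100, "C", "A", "G")
insertion = position_and_alleles(100, "C", "", "TT")
deletion = position_and_alleles(100, "C", "AG", "")
print(snp, insertion, deletion)
```

For the SNP the position is one past the branch point (101), while the insertion and deletion are both reported at position 100 with the padding base C leading each allele.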

Graphs generated by vg

The --no-missingness flag is useful for graphs generated using vg (e.g., per-chromosome .vg graphs from minigraph-cactus converted to .gfa using vg convert). Pantree annotates variants as 'missing' based on their positions along the reference genome, comparing these positions with the coordinates of contiguous walks. This logic can fail for vg-generated graphs because of the way that vg handles the linear reference genome. If you encounter errors related to missing genotypes having non-zero allele counts, use the --no-missingness flag to disable this logic.
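As a rough illustration of position-based missingness, the sketch below treats a variant as missing for a haplotype when its reference position is not covered by any of that haplotype's contiguous walks. The function names, the half-open interval convention, and the consistency check are all assumptions made for illustration; pantree's actual logic may differ.

```python
def is_missing(variant_pos, walk_intervals):
    """Return True if variant_pos lies outside every [start, end) interval."""
    return not any(start <= variant_pos < end for start, end in walk_intervals)


def check_genotype(variant_pos, walk_intervals, allele_count):
    """Flag the inconsistent case: a 'missing' genotype that carries the allele."""
    missing = is_missing(variant_pos, walk_intervals)
    if missing and allele_count > 0:
        raise ValueError(
            f"variant at {variant_pos} is marked missing but has allele count {allele_count}"
        )
    return "." if missing else str(allele_count)


# Hypothetical haplotype assembled as two contiguous walks with a gap between them.
intervals = [(0, 1000), (5000, 9000)]
covered = check_genotype(6000, intervals, 1)
gap = check_genotype(3000, intervals, 0)
print(covered, gap)
```

If the walk coordinates do not line up with the reference positions, a carried allele can fall in an apparent gap and trigger the error branch, which is the kind of failure the --no-missingness flag is meant to sidestep.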

consolidate subcommand

The consolidate subcommand takes a .vcf produced by pantree and produces a single-haplotype .vcf file containing the pairwise differences between one haplotype of a sample and the linear reference genome. Any nested variation is collapsed: for example, if the haplotype has an insertion and a SNP on that insertion, the two are combined into a single variant.

pantree consolidate <vcf_file> <sample_name> <haplotype_number> <output_path>
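The insertion-plus-SNP example can be worked through in a few lines. This is only a sketch of the idea of collapsing nested variation, with hypothetical function names and made-up sequences; it does not reproduce how the consolidate subcommand is implemented.

```python
def apply_snp(allele, offset, new_base):
    """Return `allele` with the base at 0-based `offset` replaced by `new_base`."""
    return allele[:offset] + new_base + allele[offset + 1:]


def collapse_insertion_with_snp(pos, anchor_base, inserted, snp_offset, snp_base):
    """Combine an insertion and a SNP nested on it into one (POS, REF, ALT) record.

    anchor_base is the reference base preceding the insertion, which pads
    both alleles so that REF is non-empty.
    """
    collapsed = apply_snp(inserted, snp_offset, snp_base)
    return pos, anchor_base, anchor_base + collapsed


# The haplotype carries the insertion "GGT" after reference base "A" at position 100,
# and additionally a G>C SNP at the second base of the inserted sequence.
record = collapse_insertion_with_snp(100, "A", "GGT", 1, "C")
print(record)
```

The two nested records become one pairwise difference against the reference: REF "A" and ALT "AGCT" at position 100.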

Python API Usage

from pantree import PangenomeGraph, Genotype

gfa_path = "/path/to/graph.gfa"

G: PangenomeGraph = PangenomeGraph.from_gfa(
    gfa_path, 
    ref_name="GRCh38",
)

# Also return walks; causes increased memory requirements
walks: list[list[str]]
G = PangenomeGraph.from_gfa(gfa_path, return_walks=True)

# Get the genotype of some walk
genotype: Genotype = Genotype.genotype(G, G.walks['CHM13'], exclude_terminus=True)

# Generate VCF file with genotypes
vcf_path = "/path/to/output.vcf"
chr_id = "chr1"
G.write_vcf(gfa_path, vcf_path, chr_id)

# Generate VCF without genotypes
G.write_vcf(None, vcf_path, chr_id)
