pantree converts a pangenome graph .gfa file into a .vcf file containing variants identified in the graph. It creates a reference tree and defines variants as edges that are not in the reference tree. For more information, please see our preprint:
Nowbandegani PS, Zhang S, Hu H, Li H, O'Connor LJ. Defining and cataloging variants in pangenome graphs. bioRxiv. 2025. doi: 10.1101/2025.08.04.668502
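To illustrate the idea of a reference tree, here is a minimal Python sketch on a toy directed graph. It is not pantree's implementation: it ignores edge weights, node orientation, and the DFS methods described under the options, and the function name split_tree_and_variant_edges is hypothetical. It builds a spanning tree by depth-first search and reports the remaining edges, which play the role of variants.

```python
from collections import defaultdict


def split_tree_and_variant_edges(edges, root):
    """Split directed edges into DFS tree edges and the remaining (variant) edges.

    Hypothetical helper for illustration only; not part of pantree.
    """
    adjacency = defaultdict(list)
    for u, v in edges:
        adjacency[u].append(v)

    visited = {root}
    tree_edges = []
    stack = [root]
    while stack:
        u = stack[-1]
        for v in adjacency[u]:
            if v not in visited:
                # First time we reach v: the edge u -> v joins the tree.
                visited.add(v)
                tree_edges.append((u, v))
                stack.append(v)
                break
        else:
            # No unvisited successors left; backtrack.
            stack.pop()

    tree_set = set(tree_edges)
    variant_edges = [e for e in edges if e not in tree_set]
    return tree_edges, variant_edges


# Toy bubble: two routes from s to t, one through a and one through b.
edges = [("s", "a"), ("a", "t"), ("s", "b"), ("b", "t")]
tree_edges, variant_edges = split_tree_and_variant_edges(edges, "s")
print(variant_edges)  # [('b', 't')]
```

In this toy bubble, exactly one edge is left out of the tree, so the bubble yields one variant.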
You can install pantree using uv:
git clone https://github.com/oclb/pantree.git
cd pantree
uv venv
uv sync
To run pantree:
uv run pantree <gfa_file> <vcf_file> [options]
Arguments:
gfa_file: Path to the input GFA file containing the pangenome graph
vcf_file: Output path for VCF file
Options:
--chr-id TEXT: Chromosome ID for VCF output (default: "chr0")
--ref-name TEXT: Reference sample name (default: "GRCh38")
--no-genotypes: Skip genotype computation
--log-path TEXT: Path to log file for tracking progress and memory usage
--verbose, -v: Enable verbose logging to console
--dfs-method [max_weight|contiguous]: DFS method for reference tree construction (default: "max_weight")
--priority-samples TEXT: Comma-separated list of sample names
--no-missingness: Skip missingness computation for genotypes (see below)
To output a bgzipped VCF, use a .vcf.gz extension for the output file.
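If you drive pantree from a script, the options above can be assembled programmatically. The helper below is a sketch and not part of pantree: its name and the file paths are hypothetical, and it uses only the flags listed above.

```python
def build_pantree_command(gfa_file, vcf_file, chr_id=None, ref_name=None,
                          dfs_method=None, priority_samples=None,
                          no_genotypes=False, no_missingness=False):
    """Return the argument list for a pantree run (hypothetical helper)."""
    if dfs_method not in (None, "max_weight", "contiguous"):
        raise ValueError(f"unknown DFS method: {dfs_method!r}")

    cmd = ["uv", "run", "pantree", gfa_file, vcf_file]
    if chr_id is not None:
        cmd += ["--chr-id", chr_id]
    if ref_name is not None:
        cmd += ["--ref-name", ref_name]
    if dfs_method is not None:
        cmd += ["--dfs-method", dfs_method]
    if priority_samples:
        # The CLI expects a single comma-separated string.
        cmd += ["--priority-samples", ",".join(priority_samples)]
    if no_genotypes:
        cmd.append("--no-genotypes")
    if no_missingness:
        cmd.append("--no-missingness")
    return cmd


# A .vcf.gz output path requests bgzipped output.
cmd = build_pantree_command(
    "graph.gfa", "out.vcf.gz",
    chr_id="chr1",
    dfs_method="contiguous",
    priority_samples=["GRCh38", "CHM13"],
)
print(" ".join(cmd))
```

To execute the command, pass the list to subprocess.run(cmd, check=True) from the cloned pantree directory.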
The --dfs-method=contiguous option creates a reference tree whose branches follow individual haplotypes for as long as they can. A branch switches to a new haplotype when the current haplotype ends, or when the next node on the current haplotype is already in the reference tree. This behavior applies only to haplotypes belonging to samples in the --priority-samples list. When switching to a new haplotype, these same samples are prioritized in the order in which they are specified.
Additionally, 'haplotype positions' are computed for haplotypes belonging to samples in this list: if such a haplotype visits the branch point of some variant edge, then the position of that branch point is used to compute the haplotype position of that variant edge. This follows the same rule as the ordinary POS field: the position of a variant is the position of the first base of the REF and ALT alleles. Ordinarily this is the first base after the end of the branch point node, but for on-reference indels, one base is prepended to both alleles to make them non-empty, and the position is decremented by one accordingly.
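The position rule can be made concrete with a small Python sketch. This is an illustration of the convention described above, not pantree's code; the function name and its arguments are hypothetical, and branch_end is assumed to be the 1-based position of the last base of the branch point node.

```python
def vcf_pos_and_alleles(branch_end, preceding_base, ref, alt):
    """Return (POS, REF, ALT) following the rule described above.

    branch_end: 1-based position of the last base of the branch point node
        (an assumption of this sketch).
    preceding_base: the base at branch_end, used to pad empty alleles.
    """
    if ref and alt:
        # Both alleles are non-empty: POS is the first base after the branch point.
        return branch_end + 1, ref, alt
    # Indel with an empty allele: prepend one base to both alleles
    # and move POS back by one.
    return branch_end, preceding_base + ref, preceding_base + alt


snp = vcf_pos_and_alleles(100, "A", "C", "T")
deletion = vcf_pos_and_alleles(100, "A", "CG", "")
insertion = vcf_pos_and_alleles(100, "A", "", "TT")
print(snp)        # (101, 'C', 'T')
print(deletion)   # (100, 'ACG', 'A')
print(insertion)  # (100, 'A', 'ATT')
```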
The --no-missingness flag is useful for graphs generated using vg (e.g., per-chromosome .vg graphs from minigraph-cactus converted to .gfa using vg convert). Pantree annotates variants as 'missing' based on their positions along the reference genome, comparing these positions with the coordinates of contiguous walks. This logic can fail for vg-generated graphs because of the way that vg handles the linear reference genome. If you encounter errors related to missing genotypes having non-zero allele counts, use the --no-missingness flag to disable this logic.
You can take a .vcf produced by pantree and produce a single-haplotype .vcf file containing the pairwise differences between that sample and the linear reference genome. Any nested variation will be collapsed: for example, if the haplotype has an insertion and then a SNP on that insertion, these will be combined.
pantree consolidate <vcf_file> <sample_name> <haplotype_number> <output_path>
pantree can also be used as a Python library:
from pantree import PangenomeGraph, Genotype
gfa_path = "/path/to/graph.gfa"
G: PangenomeGraph = PangenomeGraph.from_gfa(
    gfa_path,
    ref_name="GRCh38",
)
# Also return walks; causes increased memory requirements
walks: list[list[str]]
G = PangenomeGraph.from_gfa(gfa_path, return_walks=True)
# Get the genotype of some walk
genotype: Genotype = Genotype.genotype(G, G.walks['CHM13'], exclude_terminus=True)
# Generate VCF file with genotypes
vcf_path = "/path/to/output.vcf"
chr_id = "chr1"
G.write_vcf(gfa_path, vcf_path, chr_id)
# Generate VCF without genotypes
G.write_vcf(None, vcf_path, chr_id)