Output format

Darren J. Lin edited this page Mar 18, 2022 · 16 revisions

SVision outputs

SVision saves detected structural variants in the standard VCF format. It also writes an additional rGFA-formatted file that stores the graph representation of each detected complex structural variant (CSV).

VCF

SVision adopts the standard VCF format with extra INFO fields. The most important columns and fields are listed below.

The SV ID column has the form a_b, where a identifies the site and the suffix b indicates that site a also contains SVs of other types.

The FILTER column takes one of the following values:

Covered: The entire SV is spanned by long-reads, producing the most confident calls.

Uncovered: The SV is only partially spanned by long-reads, i.e. reads span just one of the breakpoints.

Clustered: The SV is only partially spanned by individual long-reads, but can be spanned by clusters of reads.

For complex structural variants detected by SVision, we add extra attributes to the INFO column of the VCF:

BRPKS: The internal structure of the CSV as recognized by the CNN through tMOR (this field appears as BKPS in the example call below).

GraphID: The index of the graph structure, obtained by grouping isomorphic graphs. It requires --graph. Simple SVs have a GraphID of -1.

GFA_FILE_PREFIX: File name prefix of the GFA file corresponding to the CSV.

GFA_S: Nodes of the CSV graph, in GFA notation.

GFA_L: Links of the CSV graph, in GFA notation.

Example of a SVision CSV call from the demo data

chr9	74283222	4	N	<CSV>	0	Covered END=74283473;SVLEN=251;SVTYPE=INS+INV;SUPPORT=12;BKPS=INS:1803-74283224-74283473,INV:157-74283222-74283379;READS=m54329U_190827_173812/67436872/ccs,m54329U_190615_010947/36833505/ccs,m54329U_190701_222759/20252964/ccs,m54329U_190617_231905/171966566/ccs,m54329U_190629_180018/105841132/ccs,m54329U_190701_222759/158597755/ccs,m54329U_190827_173812/141232071/ccs,m54329U_190617_231905/155256326/ccs,m54329U_190629_180018/77990008/ccs,m54329U_190617_231905/67109556/ccs,m54329U_190701_222759/118031725/ccs,m54329U_190701_222759/126223937/ccs;GraphID=0;GFA_FILE_PREFIX=chr9-74283222-74283473-4-INS+INV;GFA_S=S0,S1,S2,S3,I0,I1;GFA_L=S0+I0,I0+S1,S1-I1,I1+S3
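The INFO fields of a call like the one above can be unpacked with a few lines of Python. This is a minimal sketch, not part of SVision: the helper names `parse_info` and `parse_bkps` are hypothetical, the `TYPE:LENGTH-START-END` layout of each BKPS entry is inferred from the example (the INV entry spans 157 bp between its two coordinates), and the READS field is left out of the sample string only to keep it short.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; keys without '=' map to True."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def parse_bkps(bkps):
    """Parse BKPS entries, assuming the layout TYPE:LENGTH-START-END."""
    parts = []
    for entry in bkps.split(","):
        sv_type, _, coords = entry.partition(":")
        length, start, end = (int(x) for x in coords.split("-"))
        parts.append({"type": sv_type, "length": length,
                      "start": start, "end": end})
    return parts


# INFO column of the chr9 example call, with the READS field omitted.
example_info = (
    "END=74283473;SVLEN=251;SVTYPE=INS+INV;SUPPORT=12;"
    "BKPS=INS:1803-74283224-74283473,INV:157-74283222-74283379;"
    "GraphID=0;GFA_FILE_PREFIX=chr9-74283222-74283473-4-INS+INV;"
    "GFA_S=S0,S1,S2,S3,I0,I1;GFA_L=S0+I0,I0+S1,S1-I1,I1+S3"
)

info = parse_info(example_info)
bkps = parse_bkps(info["BKPS"])
graph_nodes = info["GFA_S"].split(",")
graph_links = info["GFA_L"].split(",")

print(info["SVTYPE"], info["GraphID"])
for part in bkps:
    print(part)
print(graph_nodes, graph_links)
```

All values come back as strings unless converted explicitly, so numeric fields such as SUPPORT or GraphID need an `int()` call before comparison.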

CSV graph

Graph structure and biological description

The table below lists the frequent CSV types detected by SVision. SVision also identifies complex insertions with other structures that contain more than two insertion nodes. These complex insertion events are not included in the table because they are difficult to describe biologically.

| Nodes | Links | Biological description |
|---|---|---|
| S:2,I:1,D:1 | S0+I0-,I0-S1+ | Inverted duplicate of a genomic segment, represented by the insertion node |
| S:4 | S0+S2-,S2-S3+ | Deletion associated with 3' or 5' inversion |
| S:4,I:1 | S0+S2-,S2-I0+,I0+S3+ | Deletion associated with 5' inversion and insertion |
| S:5 | S0+S2-,S2-S4+ | Two deletions with inverted or non-inverted spacer segment |
| S:3,I:1,D:1 | S0+I0-,I0-S2+ | Deletion associated with insertion, where the inserted sequence is a distal inverted duplicated genomic segment |
| S:3,I:1,D:1 | S0+I0+,I0+S2+ | Deletion associated with insertion, where the inserted sequence is a distal duplicated genomic segment |
| S:2,I:2,D:2 | S0+I0-,I0-I1+,I1+S1+ | A complex insertion consisting of an inverted duplication and a dispersed duplication |
| S:2,I:2,D:1 | S0+I0+,I0+I1-,I1-S1+ | A complex insertion containing a tandem inverted duplication at the 3' end |
| S:2,I:2,D:1 | S0+I0-,I0-I1+,I1+S1+ | A complex insertion containing a tandem inverted duplication at the 5' end |
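One way to read a row of the table is to walk the links as a single path and note which reference nodes are skipped and which nodes are traversed in reverse. The sketch below illustrates that reading only and is not SVision's classifier. It assumes each link is written as source node, source orientation, target node, target orientation (for example `S0+S2-`), and that reference nodes are numbered S0 to S(n-1); `path_from_links` and `describe` are hypothetical helper names.

```python
import re

# One link: <node><orientation><node><orientation>, e.g. "S0+S2-".
LINK_RE = re.compile(r"([SI]\d+)([+-])([SI]\d+)([+-])")


def path_from_links(links):
    """Turn 'S0+S2-,S2-S3+' into an ordered list of (node, orientation)."""
    path = []
    for link in links.split(","):
        match = LINK_RE.fullmatch(link)
        if match is None:
            raise ValueError(f"unrecognized link: {link!r}")
        src, src_o, dst, dst_o = match.groups()
        if not path:
            path.append((src, src_o))
        elif path[-1] != (src, src_o):
            raise ValueError("links do not form a single chain")
        path.append((dst, dst_o))
    return path


def describe(n_ref_nodes, links):
    """Report skipped reference nodes and nodes traversed in reverse."""
    path = path_from_links(links)
    visited = {node for node, _ in path}
    skipped = [f"S{i}" for i in range(n_ref_nodes) if f"S{i}" not in visited]
    inverted = [node for node, orient in path if orient == "-"]
    return {"skipped": skipped, "inverted": inverted}


# Row "S:4 | S0+S2-,S2-S3+": S1 is skipped and S2 is inverted.
row_deletion_inversion = describe(4, "S0+S2-,S2-S3+")
# Row "S:5 | S0+S2-,S2-S4+": S1 and S3 are skipped around spacer S2.
row_two_deletions = describe(5, "S0+S2-,S2-S4+")
print(row_deletion_inversion)
print(row_two_deletions)
```

Under this reading, a skipped S node corresponds to a deletion and a node visited with `-` corresponds to an inverted segment, which agrees with the descriptions in the two rows used here.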

CSV graph comparison

SVision classifies the graph of each CSV instance by comparing graph topologies. This requires both the --graph and --qname parameters. It creates two text (.txt) files alongside the VCF output.

  1. sample.graph_exactly_match.txt: Unique graphs across all CSV instances, with isomorphic graphs grouped together.
  2. sample.graph_symmetry_match.txt: Graphs with symmetric topologies, which are classified as isomorphic.

Examples of two isomorphic graphs, representing different CSV events of the same type.

Graph format

The example below is a CSV in rGFA format (node sequences are omitted for display purposes), detected by SVision at chr11:99,819,283-99,820,576 in HG00733. The graph output is saved in a separate file for each CSV event.

S	S1	SN:Z:chr11	SO:i:99819338	SR:i:0	LN:i:2990
S	I0	SN:Z:m54329U_190827_173812/140708091/ccs	SO:i:15813	SR:i:0	LN:i:1113
S	I1	SN:Z:m54329U_190827_173812/140708091/ccs	SO:i:16927	SR:i:0	LN:i:466
S	I2	SN:Z:m54329U_190827_173812/140708091/ccs	SO:i:17400	SR:i:0	LN:i:377	DP:S:S1:99820198
S	I3	SN:Z:m54329U_190827_173812/140708091/ccs	SO:i:17778	SR:i:0	LN:i:838
S	I4	SN:Z:m54329U_190827_173812/140708091/ccs	SO:i:18617	SR:i:0	LN:i:61	DP:S:S0:99819276
L	S0	+	I0	+	0M	SR:i:0
L	I0	+	I1	+	0M	SR:i:0
L	I1	+	I2	-	0M	SR:i:0
L	I2	-	I3	+	0M	SR:i:0
L	I3	+	I4	+	0M	SR:i:0
L	I4	+	S1	+	0M	SR:i:0

In addition to the information in the standard rGFA format, we add a DP:S column that records the origin of a sequence when one is detected via local realignment. For example, node I2 is duplicated from node S1.
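A graph file in this layout can be read with the standard library alone. The following is a rough sketch under stated assumptions: `parse_rgfa` and `duplication_origin` are hypothetical helpers, optional columns are recognized by their `TAG:TYPE:VALUE` shape (so a sequence column, if present, is skipped), and the DP value is assumed to have the form `node:position` as in the example above. Only a few lines of the example are repeated here.

```python
import re

# Optional rGFA columns look like TAG:TYPE:VALUE, e.g. "LN:i:377".
TAG_RE = re.compile(r"^([A-Za-z][A-Za-z0-9]):([A-Za-z]):(.*)$")


def parse_tags(columns):
    """Collect TAG:TYPE:VALUE columns into a dict; 'i' values become ints."""
    tags = {}
    for col in columns:
        match = TAG_RE.match(col)
        if match:
            name, kind, value = match.groups()
            tags[name] = int(value) if kind == "i" else value
    return tags


def parse_rgfa(lines):
    """Return (nodes, links) from tab-separated S and L lines."""
    nodes, links = {}, []
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if cols[0] == "S":
            nodes[cols[1]] = parse_tags(cols[2:])
        elif cols[0] == "L":
            # (source, source orientation, target, target orientation)
            links.append((cols[1], cols[2], cols[3], cols[4]))
    return nodes, links


def duplication_origin(tags):
    """Split a DP value such as 'S1:99820198' into (node, position)."""
    if "DP" not in tags:
        return None
    node, _, pos = tags["DP"].partition(":")
    return node, int(pos)


read = "m54329U_190827_173812/140708091/ccs"
example = [
    f"S\tI1\tSN:Z:{read}\tSO:i:16927\tSR:i:0\tLN:i:466",
    f"S\tI2\tSN:Z:{read}\tSO:i:17400\tSR:i:0\tLN:i:377\tDP:S:S1:99820198",
    "L\tI1\t+\tI2\t-\t0M\tSR:i:0",
]

nodes, links = parse_rgfa(example)
for name, tags in nodes.items():
    print(name, tags["LN"], duplication_origin(tags))
print(links)
```

Nodes without a DP column return `None` from `duplication_origin`, which makes it easy to filter for the duplicated segments of a CSV.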

Graph genotyping

Note: This is a post-processing step that tries to validate the detected CSVs.

Step 1: Extract HiFi raw reads

samtools view -b HG00733.ngmlr.sorted.bam chr11:99810000-99830000 > tmp.bam
samtools fasta tmp.bam > tmp.fasta

Step 2: Align with GraphAligner

See the GraphAligner documentation for detailed usage.

GraphAligner -g chr11-99819283-99820576.gfa -f tmp.fasta -a aln.gaf -x vg

Example of CSV path supporting reads

m54329U_190827_173812/140708091/ccs     21668   0       21668   +       >S0>I0>I1<I2>I3>I4>S1
m54329U_190617_231905/88145984/ccs      13612   0       13612   +       >S0>I0>I1<I2>I3>I4>S1
m54329U_190617_231905/88145984/ccs      13612   0       13612   +       >S0>I0>I1<I2>I3>I4>S1
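To count how many distinct reads follow the CSV path, the GAF output can be filtered on its path column. This is a minimal sketch rather than an official SVision step: `parse_gaf_path`, `flip` and `supporting_reads` are hypothetical helpers, the path is taken from the sixth column as in the lines above, and a read is counted only when its whole path equals the expected CSV path in either direction.

```python
import re

# One path step: '>' (forward) or '<' (reverse) followed by a node name.
STEP_RE = re.compile(r"([<>])([^<>]+)")


def parse_gaf_path(path):
    """Turn '>S0<I2' into [('S0', '+'), ('I2', '-')]."""
    return [(node, "+" if sign == ">" else "-")
            for sign, node in STEP_RE.findall(path)]


def flip(path):
    """Return the same walk traversed from the other end."""
    return [(node, "-" if orient == "+" else "+")
            for node, orient in reversed(path)]


def supporting_reads(gaf_lines, expected_path):
    """Names of reads whose alignment path equals the expected CSV path."""
    reads = set()
    for line in gaf_lines:
        cols = line.split()  # tolerates tabs or runs of spaces
        if len(cols) < 6:
            continue
        path = parse_gaf_path(cols[5])
        if path == expected_path or path == flip(expected_path):
            reads.add(cols[0])
    return reads


# Expected path taken from the L lines of the rGFA example.
csv_path = parse_gaf_path(">S0>I0>I1<I2>I3>I4>S1")

gaf_lines = [
    "m54329U_190827_173812/140708091/ccs 21668 0 21668 + >S0>I0>I1<I2>I3>I4>S1",
    "m54329U_190617_231905/88145984/ccs 13612 0 13612 + >S0>I0>I1<I2>I3>I4>S1",
    "m54329U_190617_231905/88145984/ccs 13612 0 13612 + >S0>I0>I1<I2>I3>I4>S1",
]

reads = supporting_reads(gaf_lines, csv_path)
print(len(reads), sorted(reads))
```

Collecting names in a set means a read that appears on several lines, as in the example above, is counted once.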
