highlineR/CONTRIBUTING.Rmd at master · PoonLab/highlineR · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
---
  output: github_document
---

  <!-- CONTRIBUTING.md is generated from CONTRIBUTING.Rmd. Please edit that file -->

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/CONTRIBUTING-",
  out.width = "100%",
  dpi = 300,
  fig.align = "center"
)
```

## Import

HighlineR accepts aligned nucleotide or amino acid NGS data in the FASTA format as input. Aligned sequences stored in FASTQ format can also be converted and read by the program. In theory, data generated using any NGS instrument and methodology can be input to highlineR as long as the sequence alignment is supplied in an accepted file format. There is no limit on the number of sequences that a file may contain nor on the number of files that may be input.

The import function accepts three types of input:

  1. A path to a single NGS file,

2. A path to a directory containing NGS file(s), or

3. A vector of paths to NGS file(s).

For example, let's import a FASTA file to analyze using highlineR. (Data adapted from [Ferrandiz-Rovira et. al.](https://www.nature.com/articles/hdy201513#data-archiving) available [here](http://evobiolab.biol.amu.edu.pl/amplisat/index.php?examples))
```{r import}
library(highlineR)
highlineR.session <- init_session() # create session
base <- system.file('extdata', package = "highlineR")
fn <- paste0(base, "/sample.aligned.fa")

# Default import
import_raw_seq(fn)
```
highlineR recognizes standard file extensions (.fasta, .fa, .fas, .fastq, .fq). If a nonstandard file extension is being used (eg .txt) then the file format can be specified as follows:

```{r eval=FALSE}
# Import NGS files with non-standard file extensions
import_raw_seq("/sample.txt", datatype = "fasta")
```

The file we are using contains nucleotide sequences. If the file of interest contains amino acid sequences instead then run:
```{r eval=FALSE}
# Import amino acid data
import_raw_seq("/sample.aligned.fa", seqtype = "amino acid")
```

The `import_raw_seq()` function creates an S3 object called a `Data` object with the following structure:
```{r}
# Classes of Data objects
str(highlineR.session[[fn]])

# Attributes of Data objects
ls(highlineR.session[[fn]])
```
This means that the classes of a `Data` object provide information about the data type of the NGS file and the sequence type of the sequences contained in the file.

`Data` objects are comprised of the following attributes:

* `path` is the absolute path to the NGS file represented by the object,

* `raw_seq` is a list that will contain the parsed sequences of the NGS file,

* `compressed` is an S3 object that inherits from R's base `environment` class and will contain the compressed sequences of the NGS file,

* `sample` is a compressed environment that will contain randomly sampled sequences from the the compressed structure,

* `master` is a character string representing the master sequence to which other sequences will be compared during plotting, and

* `seq_diff` is a character matrix that will contain sequence differences in variant sequences compared to the master sequence.

The `import_raw_seq()` function also creates another S3 object called a `session` object that acts as a parent container to hold multiple `Data` objects for later plotting. This functionality allows multiple NGS files to be added to a single `session` by unique function calls for later batch processing.

In the above example, we first created an empty `session` using the `init_session()` function. This function returns the created `session` object which must be stored by the user for further use. In the above example, the `session` was named `highlineR.session`. This is the default name expected by the program. Following this, if the session argument is not provided, `Data` objects are added to the `highlineR.session` object by default.

To override this default behaviour, the `session` created using `init_session()` can be stored using a different variable name `Data` objects can be imported into that `session`, using the session argument. Alternatively, the result returned by the `import_raw_seq()` function can  be stored without a direct call to `init_session()`:

```{r eval = FALSE}
# Import into initialized session
my_session <- init_session()
import_raw_seq(fn, session = "my_session")

# Import into uninitialized session
my_session <- import_raw_seq(fn) # To create session using only one import call, the session argument is not needed
# To add more Data objects to the custom session, you must specify the session argument
import_raw_seq(paste0(base, "/sample.fa"), session = my_session)
```


## Parsing

Following data import, single `Data` objects or complete `session` objects can be parsed as follows:
```{r parse}
# Parse single Data object
# parse_raw_seq(highlineR.session$`./data/sample.aligned.fa`)

# Parse entire session
parse_raw_seq(highlineR.session)
```
The `parse_raw_seq()` function stream reads NGS files and populates the `raw_seq` lists of Data objects with `(header, sequence)` tuples.
```{r}
# Strucure of the parsed file
head(highlineR.session[[fn]]$raw_seq, n=3)
```

## Compression
After parsing, `Data` or `session` objects will be compressed into a format reminiscent of a dictionary structure of key:value pairs. In this structure, unique sequences serve as keys with read counts as values. `Data` or `session` objects can be compressed as follows:

```{r compress}
# Compression
compress(highlineR.session, unique = TRUE)
```

The `unique` flag specifies that the NGS file that we are working with has been processed so that it only contains unique sequences with read counts stored at the end of the sequence labels. If read counts should be inferred from the number of distinct instances of sequences in the file, then this flag should be `FALSE`.

The `compress()` function populates the `compressed` environments of `Data` objects.

```{r}
# Number of unique sequences in compressed structure
c_len <- length(highlineR.session[[fn]]$compressed)
c_len

# Most abundant sequences in compressed structure
sort(highlineR.session[[fn]]$compressed)[c_len:(c_len-2), , drop = F]
```

The `N` and `M` arguments to the `compress()` function are used to determine the behaviour of the sampling functions described below which populate the `sample` environments of `Data` objects.


### Sampling

In an NGS experiment, if the number of input sequences is much greater than the number of templates (nucleic acids) available during the sequencing reaction, then the frequency distribution of variants produced may be skewed. In an attempt to accommodate for this phenomenon, highlineR utilizes a read_sample() function which randomly samples N reads with replacement, M times. The average of the resulting frequencies of variants present in at least 85% of the replicates is then used as the expected variant frequencies.

Sampling is performed automatically during compression. A dataset can be resampled without re-compressing using the `resample()` function.
```{r}
# Sample created during compression
# Number of sequences in sample environment
s_len <- length(highlineR.session[[fn]]$sample)
s_len

# Most abundance sequences in sample environment
sort(highlineR.session[[fn]]$sample)[s_len:(s_len-2), , drop = F]

# Resampling
resample(highlineR.session[[fn]],
         N = c_len * 0.75,
         M = 10)

# New number of sequences in sample environment
new_s_len <- length(highlineR.session[[fn]]$sample)
new_s_len

# Most abundance sequences in new sample environment
sort(highlineR.session[[fn]]$sample)[new_s_len:(new_s_len-2), , drop = F]
```


## Plotting

The highlineR `plot()` function has several parameters:

  *  `x`: A `Data` object or `session` object containing `Data` objects to plot.
* `mode`: The desired mutation annotation for plotting. Options: "mismatch" (default), "svn" (Synonymous versus Non-Synonymous), "tvt" (Transition versus Transversion).
* `master`: The sequence to which other sequences should be compared. By default, the most abundant sequence is selected.
* `sort_by`: How should the sequences be ordered in a plot? Options: "similarity" (default), "frequency".
* `rf`: Which reading frame should be used when determining Synonymous vs Non-Synonymous mutations? Options: 1 (default), 2, 3.
* `use_sample`: Which environment should be plotted? If `TRUE`, then the `sample` environment is plotted. If `FALSE`, the complete `compressed` environment is plotted.

Using the default parameters, the most abundant sequence is selected as the master sequences. Variant sequences in the sample environment are compared to this master to identify their compositional differences. These differences are plotted such that mismatches are highlighted as coloured lines and the variants are ordered according to their similarity (number of mismatches) to the master.
```{r plot_default, out.width="60%", fig.width=6, fig.height=8}
# Plot session with variants sorted by similarity to master
plot(highlineR.session)
```

Variants can also be sorted by frequency rather than similarity:
```{r plot_freq, out.width="60%", fig.width=6, fig.height=8}
# Plot session with variants sorted by relative abundance
plot(highlineR.session, sort_by = "frequency")
```

Finally, rather than simply plotting mismatches, mutations can be annotated as synonymous and non-synonymous mutations or as transitions and transversions as follows:
```{r plot_annot, out.width="60%", fig.width=6, fig.height=8}
# Plot synonymous and non-synonymous mutations
plot(highlineR.session, mode = "svn")

# Plot transitions and transversions
plot(highlineR.session, mode = "tvt")
```