highlineR/README.Rmd at master · PoonLab/highlineR · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
---
output: github_document
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%",
  dpi = 300,
  fig.align = "center"
)
```
# highlineR: a tool for visualizing NGS datasets in R

[Highlighter](https://www.hiv.lanl.gov/content/sequence/HIGHLIGHT/highlighter_top.html) is a web-based tool maintained by the Los Alamos National Laboratory (LANL) HIV Sequence Database that makes it easy for investigators to visually scan their sequence alignments for compositional differences.
However, it is inconvenient for users to use Highlighter to process a large number of alignment files as the tool only accepts a single file as input.
In addition, the LANL website limits users to a maximum of 500 sequences per file, which can be problematic for users working with alignments derived from next-generation sequencing platforms.
Furthermore, the tool does not visualize the frequencies of different sequence variants in a dataset.

*highlineR* is a free, open-source R package that provides users with accessible functions for batch-processing of sequence alignments, including NGS data, to generate plots similar to Highlighter.
In addition, *highlineR* extends the Highlighter plot by varying line widths to represent variant frequencies in the data (see *Usage*).


## Installation

The simplest method to install *highlineR* is to use the R *devtools* package:
```R
# install.packages("devtools")  # if not already installed
require(devtools)
devtools::install_github("PoonLab/highlineR")
```


## Usage

As a basic example of using `highlineR`, we're going to load a number of anonymized [data sets](https://www.hiv.lanl.gov/content/sequence/HIV/USER_ALIGNMENTS/Li/) sets from [a published study](https://journals.plos.org/plospathogens/article?id=10.1371/journal.ppat.1000890) of HIV-1 diversity within patients:

```{r fig.width=9, fig.height=6}
require(highlineR)
# use glob to retrieve paths to FASTA files
files <- Sys.glob('~/git/highlineR/inst/extdata/*.fasta')
highline(files[1:2])  # render the first two alignments
```

Each horizontal grey band represents a sequence variant.
The *area* of the band is proportional to the number of times that variant occurs in the alignment.
(It may be more intuitive to make *height* proportional to variant counts, but this causes problems when a variant occurs 1000 times or more!)
By default, `highlineR` selects the most common variant to be the master (reference) sequence and places it at the top of the plot &mdash; it is also labeled with an `(m)`.

Each band is decorated with coloured tick marks to indicate the locations of nucleotide (or amino acid) differences relative to the master sequence.
A colour reference key (legend) is provided at the top of the plot.

Note that for the above example, the files were loaded from a developer directory.
To load these same files from your installed package as a user, you'd have to replace the following line:
```r
# files <- Sys.glob('~/git/highlineR/inst/extdata/*.fasta')
files <- Sys.glob(paste0(system.file(package='highlineR'), '/extdata/*.fasta'))
```


## Background

A *multiple sequence alignment* (MSA) is a hypothesis about how residues (nucleotides or amino acids) in two or more sequences were derived from residues in a common ancestral sequence.
By convention, each sequence is arranged horizontally in rows to be read from left to right, and evolutionarily-related (homologous) residues are arranged vertically in a column.
It is possible for two or more sequences to be exactly identical.
When this occurs, we refer to the shared sequence as the "sequence variant", and the number of times this variant appears in the data as its "count".
(Note there is no established terminology for these features.)
It is common practice to reduce a sequence alignment to the unique variants, especially when certain variants predominate the sample population.
By doing so, however, we lose information on the relative abundance of variants.

If an MSA comprises a large number of sequences, it may become difficult to visualize the composition of the alignment.
*Highlighter* plots were developed to reduce the information in the MSA by marking the location of residues that are different from the reference sequence, which has been selected by the user or the program.
We refer to this reference sequence the *master sequence*.
By default, the `highline` function selects the most common sequence variant to be the master.

The `highline` function is actually a high-level wrapper for several functions in `highlineR`.
These lower level functions are exposed to the user, so you have the option of doing more extensive customization than `highline` will allow.
Detailed instructions for using (and potentially modifying) these functions are provided in the [CONTRIBUTING.md](CONTRIBUTING.md) document.

In summary, the pipeline for generating a plot in `highlineR` are:

1. Initialize a `highlineR` session.  This creates an R [environment](http://adv-r.had.co.nz/Environments.html) for storing the sequence alignment data sets.

2. Import one or more sequence alignments from the respective files.  `highlineR` supports FASTA, FASTQ and CSV file formats.

3. Parse the raw sequence data from each file.

4. Compress the sequence data so that identical sequences are represented by a single copy annotated by the number of instances in the alignment.

5. Calculate the sequence differences from the "master sequence" and generate the *Highlighter* style plot.


## Further examples

* Batch process a set of FASTA files and save the results as PDF files.
```R
require(highlineR)
files <- Sys.glob('~/git/highlineR/inst/extdata/*.fasta')
for (f in files) {
  # prepare output file path
  of <- sub(pattern=".fasta$", replacement=".pdf", x=f)
  of <- sub("inst/extdata", "working", of)  # switch dir
  # start plot device to PDF file
  pdf(of, width=6, height=6, onefile=F)
  print(highline(f))
  dev.off()  # turn off plot device
}
```

* Annotate the Highlighter plot by transitions and transversions.
```{r fig.width=6, fig.height=9, out.width=600}
highline(files[3], mode='tvt')
```

* Annotate the Highlighter plot by non-synonymous and synonymous differences (note that the reading frame can be modified with the `rf` argument.)
```{r fig.width=6, fig.height=7, out.width=600}
highline(files[2], mode='svn')
```


## Feature requests and issues

Please use our [issue tracker](https://github.com/PoonLab/highlineR/issues) to report problems that you encountered while using this package.
Provide information on your operating system, R version and (ideally) a minimum reproducible example.
If there's some change or additional feature that you'd like to see in `highlineR`, you can also post a request on the issue tracker.