lab-book/07-sharedCode.Rmd at master · tiroshlab/lab-book · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
# Shared Code {#sharedCode}

The aim of the shared code is to implement the lab’s core ideas on analysis of scRNA-seq data. The code is written in R and is publicly available via GitHub (see below). It was designed with a modular approach and hence is separated into several well-defined packages. The packages and functions are well documented and examples are provided within each package.

## About the R packages

### `scalop`
[Website] (https://rdrr.io/github/jlaffy/scalop/)

_**A toolbox for single-cell analysis.**_

To install in R:
```{r, eval=FALSE}
install.packages("remotes")
remotes::install_github("jlaffy/scalop")
```

### `scandal`
[Source code](https://github.com/dravishays/scandal) | [Report bugs](https://github.com/dravishays/scandal/issues)

_**A framework that enables defining a single-cell experiment.**_

The package provides methods for loading the data, preprocessing and quality control, maintaining the data with a low memory footprint (using sparse matrices), various plotting methods, linking meta-data with expression data and more.

The package extends the `SingleCellExperiment` class, adapting it for use in our lab. See [this tutorial](https://www.bioconductor.org/packages/release/bioc/vignettes/SingleCellExperiment/inst/doc/intro.html) for an introduction to the `SingleCellExperiment` class.

To install in R:
```{r, eval=FALSE}
devtools::install_github("dravishays/scandal")
```

### `infercna`
[Website & Tutorials](https://jlaffy.github.io/infercna) | [Functions index](https://jlaffy.github.io/infercna/reference/index.html) | [Source code](https://github.com/jlaffy/infercna) | [Report bugs](https://github.com/jlaffy/infercna/issues)

_**Infer copy-number alterations from (single-cell) RNA-sequencing data.**_

The methodology implemented here was first formulated by Itay and colleagues during his postdoc [Tirosh et al., 2014](https://science.sciencemag.org/content/344/6190/1396.long) and has been tried and tested in several publications since ([Filbin et al., 2018](https://science.sciencemag.org/content/360/6386/331.long); [Neftel et al., 2019](https://www.ncbi.nlm.nih.gov/pubmed/31327527); [Puram et al., 2017](https://www.ncbi.nlm.nih.gov/pubmed/29198524); Tirosh et al., [2016a](https://science.sciencemag.org/content/352/6282/189.long), [2016b](https://www.nature.com/articles/nature20123); [Venteicher et al., 2017](https://science.sciencemag.org/content/355/6332/eaai8478.long)).

To install in R:
```{r, eval=FALSE}
devtools::install_github("jlaffy/infercna")
```

[Tutorial 1: Set your genome](https://jlaffy.github.io/infercna/articles/useGenome.html)
[Tutorial 2: Example with a scRNA-seq dataset](https://jlaffy.github.io/infercna/articles/infercna_tutorial.html)

### `scrabble`
[Website & Tutorials](https://jlaffy.github.io/scrabble) | [Functions index](https://jlaffy.github.io/scrabble/reference/index.html) | [Source code](https://github.com/jlaffy/scrabble) | [Report bugs](https://github.com/jlaffy/scrabble/issues)

_**Perform exploratory computational analyses on processed scRNAseq gene expression data.**_

The package focuses on unbiased methods in unsupervised clustering and dimensionality reduction to identify and characterize the transcriptionally-distinct subpopulations of cancer cells residing within tumours.

In its current implementation scrabble most closely reflects the methods implemented in [Neftel et al., 2019](https://www.ncbi.nlm.nih.gov/pubmed/31327527) though any of the lab’s papers should be useful as reference.

To install in R:
```{r, eval=FALSE}
devtools::install_github("jlaffy/scrabble")
```

## R-code best practices

### General

_**Reproducibility**_ - multiple runs of the same code should produce the same results. use `set.seed(0)` before commands with random elements.

_**Project**_ - R project, that contains multiple files, has the advantages of better code orgainzation, GIT support, ect...

_***.rmd files**_ - prefer using R markdown (*.rmd) files instead of *.R files, as the former enable to separate code into subsets (called chunks), provide each chunk with a title, which enable better navigation between analysis sections

Understand the code you are writing. Prefer writing your own code over using someone else’s code / packages


### Data

Always save the raw data and document any comment about the data.

Document the commands used for pre-processing the data (i.e cellranger commands).


### Code

Variables should have meaningful names (i.e myeloid_differential_expression_results)

Use a balanced approach for code comments (a readable code needs no comments)

Prefer to write R commands using tidyr approach (i.e %>%)

Don’t save an environment, but rather the important analysis objects (i.e expression matrix, metadata file). Save objects generated after a time-consuming process

Use the same name for similar ID fields (i.e Patient_ID) from different data sources in order to easily join data from different sources

Factors are a good method to store categorical data. In addition, it enable us to set the order in which the values will be displayed in the ggplot. Be careful when comparing factors, as the values are stored as numbers, and as.character() is needed to get the string values

Prefer using the apply group of function (i.e, lapply) over for loop

Run time-consuming analysis in a different job, either a Wexac job (https://www.weizmann.ac.il/WIT/wexac-quick-guide) or an RStudio job (found in the bottom-window panel in the RStudio)


### Visualization

ggplot is a great package for plotting, with vast possibilities. Try to use it as much as possible.

patchwork is a good package for placing multiple plots (i.e a heatmap and a panel) next to each other

Stick to the same color palate as you can (i.e, green-shade colors to represent T-cells). It is helpful to gather all color codes in a separate file (*.R) and load at the beginning of each analysis file (*.Rmd)

Create a theme(s) and customize it according to the desired plotting (i.e set base_size)

Save the final plot using ggsave (providing height and weight variables)

tidyr::pivot_longer() / reshape2::melt() are the ways to lengthening the data, usually done before plotting a heatmap from a data.frame data

Add the statistical test results into the plot (i.e ggpubr::stat_pvalue_manual())


### References:
https://github.com/UofABioinformaticsHub/BestPractices
R for Data Science (https://r4ds.hadley.nz/)


## Contributing
Members of the lab to whom the shared code is relevant and of use are encouraged to contribute to it. This will both help the code to grow and develop but also allow you to fine-tune it for your own analyses. To make individual contributions, you can [fork the project from GitHub](https://help.github.com/en/github/getting-started-with-github/fork-a-repo). To contribute in a more long-term way, ask Julie or Avishay to add you as an official contributor to the GitHub package(s) in question.