Bio720/assignments/BIO720_Assignment2.Rmd at master · DworkinLab/Bio720 · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
---
title: "BIO720 Assignment 2"
author: "Ian Dworkin"
date: "`r Sys.Date()`"
output:
  html_document:
    toc: true
    number_sections: true
    keep_md: true
editor_options:
  chunk_output_type: console
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

**Due Date:** Tuesday, October 15, 9AM. Send the rendered document of your assignment as a PDF, and email it to Dr. Golding, as per instructions that Dr. Golding discussed in class and available on the class github README.

**Please code independently. Additionally, there is no use of large language models or other generative AI for this or any other assignments, unless specifically stated otherwise**. This class provides the opportunity to learn how to think computationally, and to learn the practical skills of scientific computing. To develop the fundamentals of the skills requires you to work through the thought processes, problem-solving and practice of computation. Any evidence of group work OR use of generative AIs is cheating on assignments.


## Background

In genomics and bioinformatics, we are often dealing with many different file types, and part of the work we need to do is "wrangle" everything to be able to work together. This assignment is going to focus on both the "wrangling" and some

One of the common file formats is called a "gene transfer file" or GTF. The GTF file is tab-delimited file, with some useful features. You can read about this and the related GFF (general feature format)  [file format here](https://genome.ucsc.edu/FAQ/FAQformat.html#format3).

We will be using the GTF file for the most current release (v6.59) of the *Drosophila melanogaster* genome available on the database [flybase](https://flybase.org/). More information on the specifics of the Drosophila melanogaster GTF is [here](https://wiki.flybase.org/wiki/FlyBase:Downloads_Overview#GTF_files), but in short, it follows the standard GTF v2.2 specifications as found [here](http://mblab.wustl.edu/GTF22.html).

The file we need can be found [here](https://ftp.flybase.net/releases/FB2024_04/dmel_r6.59/gtf/dmel-all-r6.59.gtf.gz), but it needs to be downloaded using *ftp* (file transfer protocol). While you can just double click on the file, you can also use the UNIX terminal or the R console (see below) to do so. I recommend downloading it and working with a *local* copy of the file (a copy of the file on your computer. The command below will save it to your current working directory (so you may wish to change this) but keep it zipped. Then we can read it into R from the zipped file format.

**you need to do this once, then you should have a local copy**. It will be saved where ever you set your working directory for R.

Don't forget to remove the `#`

```{r, eval = FALSE}
#download.file(url = "https://ftp.flybase.net/releases/FB2024_04/dmel_r6.59/gtf/dmel-all-r6.59.gtf.gz", destfile = "dmel_6_59.gtf.gz" )
```

and we can read the file into R (even compressed), by letting it know we have a tab-delimited file. The file does not have a *header* line (so no variable names), so we need to let R know this when we import the file.

Note depending on where you downloaded your file, whether you kept it zipped (like I did), what you named it, etc... Your way of reading it into R may be slightly different (and it is your responsibility to figure it out).

```{r}
dmel_gtf <- read.delim("./dmel_6_59.gtf.gz",
                       sep = "\t",
                       header = FALSE)
```


```{r}
dim(dmel_gtf)
```

You will know it has imported successfully if your call to `dim` above tells you there are 549128 rows and 9 columns.

```{r}
head(dmel_gtf)
```

```{r}
str(dmel_gtf)
```

What are the 9 columns (variables) in our GTF file?

1. `seqname`, which corresponds to the chromosome
2. `source`, which generated this file
3. `feature`, which is the genomic features (in this case)
4. `start`, the starting position of the feature in the sequence (1-based)
5. `end`, the end position of the genomic feature (inclusive).
6. `score`, not used here (for a score like a quality measure or degree of conservation to a specific organism)
7. `strand`, DNA strand of feature ("+", "-" or "." for unknown)
8. `frame`, when the feature is a coding exon, this will be [0,2], representing the reading frame of the first base. If it is not a coding exon it will be "."
9. `group`, all the other information, and from this we will need to extract lots of things.

Q1. Please name the columns of the dmel_gtf object, with the 9 variable names listed above. Check that it worked.


Q1A. We will be mostly working with gene length and transcript (mRNA) length data. So we only need columns 1,3,4,5 and the final column ("group"). Make a file ("dmel_gtf_subset1") with only those columns. Confirm that you new object, `dmel_gtf_subset1` is as expected.


Q1B. We will only need a subset of rows as well. In this case we only need the information about genes and mRNA. Let's see what all the features are first. Generate a variable ("features") containing a `table` with unique names of each of the feature types and counts of each.


Using this information, now extract only the rows  (in `feature`) that are either "gene" or "mRNA".

Check to make sure that it is the right variable type, and the information you have subsetted corresponds to your expectations based on the original table. Confirm that this operation works as intended. Confirm that the number of genes and mRNA you finished with is the same as in the original data.

Q1C. While we will work with both genes and mRNA, it may be easier to make two different objects (for the moment), to work with so create one data frame ("dmel_genes") and a second data frame ("dmel_mrna") containing just genes or mRNA respectively. Again confirm that your subsetting has worked correctly before proceeding to the next question.


Q2. Now that we have the information for each gene and transcript, but our information on gene identifiers (what starts with "FBgn") or the mRNA transcripts (what starts with "FBtr") is embedded in kind of a complex string in the "group" column. So we need to extract that information from the string, and make columns out of it.

Q2A. Using the information from the "group" variable in the dmel_genes object you made, please make two new columns, one containing the gene_id (FBgn...) and the other the information for the gene_symbol (Nep3, CG9570, Or19b, etc.). Note what delimiter is used to seperate the information in the strings contained in the "group" variable. Depending on your strategy, this can be useful (there are multiple ways to accomplish this task). Once you have confirmed you have done this correctly (show this), you can then remove the group column, or make a new version of the data frame without it.


Q2B.
We can look at the lengths of genes (very approximate) by the difference between their start and end values. Keep in mind these lengths include introns, exons, UTRs etc...


Which is the longest gene based on this approach? Show your work.


Q2B. Using the information from the "group" variable in the dmel_mrna object you made, please make four new columns:

- gene_id (FBgn...)
- gene_symbol
- transcript_id (FBtr)
- transcript_symbol

Once you have confirmed you have done this correctly, you can then remove the group column, or make a new version of the data frame without it. Then compute the length of the transcripts in the same way as we did for genes.


Q3. We now have two files, one with the position and length of genes, and one with the length of transcripts. Except, our length of transcripts are wrong! They currently include introns as well. We could go back to the GTF, file and do lots of fiddling, but instead, I have made a new file for you (I will show you how when I introduce bioconductor) with the actual transcript lengths. While I am reading it in here, you may wish to download it from the github repo directly if you plan to work offline.

```{r}
dmel_transcript_lengths_UCSC <- read.csv("https://raw.githubusercontent.com/DworkinLab/Bio720/refs/heads/master/data/dmel_transcriptLengths_UCSC.csv", header = TRUE)
```

```{r}
head(dmel_transcript_lengths_UCSC)
```

This object has 6 columns:

`tx_name` - The transcript name
`gene_id` - a gene identifier
`nexon`  - Number of exons contributing to the transcript
`transcript length` in base pairs
`5' UTR length` in base pairs
`3' UTR length` in base pairs

You will notice that the transcript identifiers do not begin with FBtr, but NM or NR. This is because these annotations are [*refseq* identifiers](https://www.ncbi.nlm.nih.gov/books/NBK50679/#RefSeqFAQ.what_is_the_difference_between). While flybase is the source of transcript information (see section 9.12 in this [FAQ](https://wiki.flybase.org/wiki/FlyBase:FAQ)), other databases use them and have their own identifiers. unfortunately, they don't provide the identifiers from the other database we used (flybase)! This issue is a very common problem from database to database.  If you really hate yourself, I guess you could go to the [refseq page](https://www.ncbi.nlm.nih.gov/refseq/) and copy and paste all of those identifiers... But of course, computers can make our lives easier...

So we need to find a file that has the matching identifiers. Thankfully one is available on [this page](https://flybase.org/downloads/bulkdata). Most model organisms will also have a similar database page for such purposes.

So we need to find a file that has matching identifiers. Thankfully one is available courtesy of flybase [this page](https://flybase.org/downloads/bulkdata). Most model organisms will also have a similar database page for such purposes.

This is a *tab seperated values* flat file.


```{r}
Dm_identifiers <- read.delim(file = "https://raw.githubusercontent.com/DworkinLab/Bio720/refs/heads/master/data/fbtr_refseq.tsv",
                             sep = "\t",
                             header = TRUE)
```


Q3A - Rename the variables in `Dm_identifiers` so that they are consistent with your variable names in your other files (i.e. the same variable names for the variables with Fbtr identifiers, etc...).

Q3B. We finally have all of the information we want, but across multiple files.

 Merge the files together based on matching identifiers (i.e. so that FBtr match the appropriate refseq id, and that each of these matches the appropriate FBgn), so that we have a final file (one row per transcript) that contains

1. The chromosome (SeqName)
2. FBgn
3. Gene symbol
4. Fbtr
5. Transcript symbol (like "Nep3-RA")
6. Refseq identifier
7. Start position of gene (not transcript)
8. End position of gene (not transcript)
9. Start position of transcript
10. End position of transcript
11. Number of exons in transcript
12. Length of transcript
13. 5' UTR length
14. 3' UTR length

Confirm that the merging of these files has worked as expected. i.e. that you have properly matched gene and transcript identifiers!

Q4. Now that we have all of this information, we can do some plotting.

Q4A. Plot the relationship between gene length (which you will need to calculate) and the *mean* transcript length of all transcripts within the gene (you may wish to first do this computation then plot)

Q4B. Pick one chromosome arm of a major chromosome (X, 2L, 2R, 3L or 3R) and for all genes along this chromosome arm, plot gene length and average transcript length on (varying along the Y axis) VS the starting position of the gene along the X axis.


Make sure your plots are as informative and clear as possible, with proper labeling of axes etc..