-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathATAC-seq_task4.txt
More file actions
258 lines (170 loc) · 12.1 KB
/
ATAC-seq_task4.txt
File metadata and controls
258 lines (170 loc) · 12.1 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
1) Move to folder ATAC-seq, and create folders to store bigBed data files and peaks analyses files. Make sure the files are organized in a consistent way as done for ChIP-seq.
Step 1: To make sure i'm in the correct folder:
cd ~/epigenomics_uvic/ATAC-seq
Step 2: create necessary directories
mkdir -p ATAC-seq/analysis ATAC-seq/data/bigBed.files ATAC-seq/data/bed.files ATAC-seq/annotation ATAC-seq/analyses/peaks.analysis
Step 3: verify the structure of the directories:
mmatitos@Magdalena:~/epigenomics_uvic/ATAC-seq$ ls -R
.:
analyses cmnd.txt data
./analyses:
peaks.analysis
./analyses/peaks.analysis:
./data:
bigBed.files
./data/bigBed.files:
2) Retrieve from a newly generated metadata file ATAC-seq peaks (bigBed narrow, pseudoreplicated peaks, assembly GRCh38) for stomach and sigmoid_colon for the same donor used in the previous sections. Hint: have a look at what we did here. Make sure your md5sum values coincide with the ones provided by ENCODE.
Step 1: Filter the Metadata File
First, we need to filter the ATACmetadata.tsv file to retrieve only the ATAC-seq peaks that match the criteria: bigBed narrow, pseudoreplicated peaks, assembly GRCh38 for the tissues stomach and sigmoid_colon.
We are going to navigate through ENCODE portal to search for the donor ENCDO451RUA and filter the data to match the criteria (bigBed narrow, pseudoreplicated peaks, assembly GRCh38 for the tissues stomach and sigmoid_colon), then we are going to download the files in a file.txt. The URL of the metadata file is provided on the first line of the files.txt:
These are the filters taken:
- Assay type: DNA_accessibility (2)
- Assay title: ATAC-seq (2)
- Hide control experiments: no
- Biosample
- organism: HOMO SAPIENS (2)
- biosample: sigmoid colon (1) and stomach (1)
These are the results (2) after filtering out 347 results:
ATAC-seq in stomach
Homo sapiens stomach tissue male adult (54 years)
Lab: Michael Snyder, Stanford
Project: ENCODE
Reference Epigenome: ENCSR135MTK
ATAC-seq in sigmoid colon
Homo sapiens sigmoid colon tissue male adult (54 years)
Lab: Michael Snyder, Stanford
Project: ENCODE
Reference Epigenome: ENCSR517BJQ
I will download in a file called files.txt to obtain the link to download the metadata file in a file called ATACmetadata.tsv:
"https://www.encodeproject.org/metadata/?replicates.library.biosample.donor.uuid=d370683e-81e7-473f-8475-7716d027849b&status=released&status=submitted&status=in+progress&assay_title=ATAC-seq&replicates.library.biosample.donor.organism.scientific_name=Homo+sapiens&biosample_ontology.classification=tissue&biosample_ontology.term_name=sigmoid+colon&biosample_ontology.term_name=stomach&type=Experiment"
# Download the metadata file
wget -O ATACmetadata.tsv "https://www.encodeproject.org/metadata/?replicates.library.biosample.donor.uuid=d370683e-81e7-473f-8475-7716d027849b&status=released&status=submitted&status=in+progress&assay_title=ATAC-seq&replicates.library.biosample.donor.organism.scientific_name=Homo+sapiens&biosample_ontology.classification=tissue&biosample_ontology.term_name=sigmoid+colon&biosample_ontology.term_name=stomach&type=Experiment"
THIS ARE THE COLNAMES
File accession File format File type File format type Output type File assembly Experiment accession Assay Donor(s)Biosample term id Biosample term name Biosample type Biosample organism Biosample treatments Biosample treatments amount Biosample treatments duration Biosample genetic modifications methods Biosample genetic modifications categories Biosample genetic modifications targets Biosample genetic modifications gene targets Biosample genetic modifications site coordinates Biosample genetic modifications zygosity Experiment target Library made from Library depleted in Library extraction method Library lysis method Library crosslinking method Library strand specific Experiment date released Project RBNS protein concentration Library fragmentation method Library size range Biological replicate(s) Technical replicate(s) Read length Mapped read length Run typePaired end Paired with Index of Derived from Size Lab md5sum dbxrefs File download URL Genome annotation Platform Controlled by File Status s3_uri Azure URL File analysis title File analysis status Audit WARNING Audit NOT_COMPLIANT Audit ERROR
Step 2: Download and Verify bigBed Files
# Filter out metadata (ATAC-metadata.tsv) to obtain bigBed files:
mmatitos@Magdalena:~/epigenomics_uvic/ATAC-seq$ grep -F "ATAC-seq" ATACmetadata.tsv | \
grep -F "bigBed" | \
grep -F "narrowPeak" | \
grep -F "pseudoreplicated" | \
grep -F "GRCh38" | \
awk 'BEGIN{FS=OFS="\t"}{print $1, $11}' | \
sort -k2,2 -k1,1r | \
sort -k2,2 -u > analyses/bigBed.peaks.ids.txt
# let's check the content of the file:
mmatitos@Magdalena:~/epigenomics_uvic/ATAC-seq$ cd analyses/
mmatitos@Magdalena:~/epigenomics_uvic/ATAC-seq/analyses$ ls
bigBed.peaks.ids.txt peaks.analysis
mmatitos@Magdalena:~/epigenomics_uvic/ATAC-seq/analyses$ head bigBed.peaks.ids.txt
ENCFF287UHP sigmoid colon
ENCFF762IFP stomach
Now, let's download the bigBed peaks files:
# first, make sure I'm in the correct directory:
cd ~/epigenomics_uvic/ATAC-seq
cut -f1 analyses/bigBed.peaks.ids.txt | \
while read filename; do
wget -P data/bigBed.files "https://www.encodeproject.org/files/$filename/@@download/$filename.bigBed"
done
This loop will read each filename from bigBed.peaks.ids.txt and use wget to download the corresponding bigBed files
ENCFF762IFP.bigBed 100%[===================================================================>] 6.09M 71.5KB/s in 55s
2024-06-19 10:42:26 (114 KB/s) - ‘data/bigBed.files/ENCFF762IFP.bigBed’ saved [6390157/6390157]
Check md5 files:
#!/bin/bash
for file_type in bigBed; do
# Retrieve original MD5 hashes from the metadata
../bin/selectRows.sh <(cut -f1 analyses/"$file_type".*.ids.txt) ATACmetadata.tsv | cut -f1,46 > data/"$file_type".files/md5sum.txt
# Compute MD5 hashes on the downloaded files and compare with original hashes
while read -r filename original_md5sum; do
computed_md5sum=$(md5sum data/"$file_type".files/"$filename"."$file_type" | awk '{print $1}')
echo -e "${filename}\t${original_md5sum}\t${computed_md5sum}"
done < data/"$file_type".files/md5sum.txt > tmp
# Overwrite the original md5sum file with the new one containing computed hashes
mv tmp data/"$file_type".files/md5sum.txt
# Check for any mismatches between original and computed MD5 hashes
awk '$2 != $3' data/"$file_type".files/md5sum.txt
done
mmatitos@Magdalena:~/epigenomics_uvic/ATAC-seq$ cd data/
mmatitos@Magdalena:~/epigenomics_uvic/ATAC-seq/data$ ls
bigBed.files
mmatitos@Magdalena:~/epigenomics_uvic/ATAC-seq/data$ cd bigBed.files/
mmatitos@Magdalena:~/epigenomics_uvic/ATAC-seq/data/bigBed.files$ ls
ENCFF287UHP.bigBed ENCFF762IFP.bigBed md5sum.txt
mmatitos@Magdalena:~/epigenomics_uvic/ATAC-seq/data/bigBed.files$ head -2 md5sum.txt # to show only the first 2 lines
ENCFF762IFP f6a97407b6ba4697108e74451fb3eaf4 f6a97407b6ba4697108e74451fb3eaf4
ENCFF287UHP 46f2ae76779da5be7de09b63d5c2ceb9 46f2ae76779da5be7de09b63d5c2ceb9
3) For each tissue, run an intersection analysis using BEDTools: report 1) the number of peaks that intersect promoter regions, 2) the number of peaks that fall outside gene coordinates (whole gene body, not just the promoter regions).
1) number of peaks that intersect promoter regions
Change directory:
mmatitos@Magdalena:~/epigenomics_uvic/ATAC-seq$ cd analyses/
mmatitos@Magdalena:~/epigenomics_uvic/ATAC-seq/analyses$ ls
bigBed.peaks.ids.txt peaks.analysis
Download gene assembly:
wget -P annotation "https://www.encodeproject.org/files/gencode.v24.primary_assembly.annotation/@@download/gencode.v24.primary_assembly.annotation.gtf.gz"
gencode.v24.primary_assembly.annota 100%[===================================================================>] 36.93M 270KB/s in 5m 15s
2024-06-19 11:00:26 (120 KB/s) - ‘annotation/gencode.v24.primary_assembly.annotation.gtf.gz’ saved [38723767/38723767]
gunzip annotation/gencode.v24.primary_assembly.annotation.gtf.gz
Get the number of peaks:
awk '$3=="gene"' annotation/gencode.v24.primary_assembly.annotation.gtf |\
cut -d ";" -f1 |\
awk 'BEGIN{OFS="\t"}{print $1, $4, $5, $10, 0, $7, $10}' |\
sed 's/\"//g' |\
awk 'BEGIN{FS=OFS="\t"}$1!="chrM"{$2=($2-1); print $0}' > annotation/gencode.v24.gene.body.bed
mmatitos@Magdalena:~/epigenomics_uvic/ATAC-seq/analyses/annotation$ ls
gencode.v24.gene.body.bed gencode.v24.primary_assembly.annotation.gtf
Retrieve the bed files from the bigBed files:
# create directory:
mkdir -p ~/epigenomics_uvic/ATAC-seq/data/bed.files
# proceed:
# Convert bigBed to bed Format: Now, we'll use bigBedToBed to convert each bigBed file referenced in bigBed.peaks.ids.txt to bed format and store them in ~/epigenomics_uvic/ATAC-seq/data/bed.files/.
cut -f1 analyses/bigBed.peaks.ids.txt |\
while read filename; do
bigBedToBed data/bigBed.files/"$filename".bigBed data/bed.files/"$filename".bed
done
Now, intersect ATAC peaks with TSS:
mmatitos@Magdalena:~/epigenomics_uvic/ATAC-seq/analyses$ ls
annotation bigBed.peaks.ids.txt peaks.analysis
cut -f-2 analyses/bigBed.peaks.ids.txt |\
while read filename tissue; do
bedtools intersect -a annotation/gencode.v24.protein.coding.non.redundant.TSS.bed -b data/bed.files/"$filename".bed -wb |\
sort -u > analyses/peaks.analysis/ATAC.peaks."$tissue".txt
done
mmatitos@Magdalena:~/epigenomics_uvic/ATAC-seq/analyses/peaks.analysis$ ls
ATAC.peaks.sigmoidcolon.txt ATAC.peaks.stomach.txt
# obtain the number of peaks
mmatitos@Magdalena:~/epigenomics_uvic/ATAC-seq$ wc -l analyses/peaks.analysis/ATAC.peaks.stomach.txt
99380 analyses/peaks.analysis/ATAC.peaks.stomach.txt
mmatitos@Magdalena:~/epigenomics_uvic/ATAC-seq$ wc -l analyses/peaks.analysis/ATAC.peaks.sigmoidcolon.txt
107242 analyses/peaks.analysis/ATAC.peaks.sigmoidcolon.txt
2) the number of peaks that fall outside gene coordinates (whole gene body, not just the promoter regions).
Generate Gene Body BED File:
# Extracting gene body coordinates from the GTF file and creating a BED file (annotation/gencode.v24.gene.body.bed) that represents the entire gene body.
cd ~/epigenomics_uvic/ATAC-seq/analyses/
awk '$3=="gene"' ../annotation/gencode.v24.primary_assembly.annotation.gtf |\
cut -d ";" -f1 |\
awk 'BEGIN{OFS="\t"}{print $1, $4, $5, $10, 0, $7, $10}' |\
sed 's/\"//g' |\
awk 'BEGIN{FS=OFS="\t"}$1!="chrM"{$2=($2-1); print $0}' > gencode.v24.gene.body.bed
# Verify the File Creation:
mmatitos@Magdalena:~/epigenomics_uvic/ATAC-seq/analyses/annotation$ ls gencode.v24.gene.body.bed
gencode.v24.gene.body.bed
# in the same folder (cd ~/epigenomics_uvic/ATAC-seq/analyses/)
cut -f-2 bigBed.peaks.ids.txt |\
while read filename tissue; do
bedtools intersect -a annotation/gencode.v24.gene.body.bed -b ../data/bed.files/"$filename".bed -v -wb |\
sort -u > peaks.analysis/peaks.outside.gene."$tissue".txt
done
# Verify Output:
mmatitos@Magdalena:~/epigenomics_uvic/ATAC-seq/analyses/peaks.analysis$ ls
ATAC.peaks.sigmoidcolon.txt ATAC.peaks.stomach.txt peaks.outside.gene.sigmoidcolon.txt peaks.outside.gene.stomach.txt
# now create peaks output files
mmatitos@Magdalena:~/epigenomics_uvic/ATAC-seq$
bedtools intersect -a data/bed.files/ENCFF762IFP.bed -b analyses/annotation/gencode.v24.gene.body.bed -v | sort -u > analyses/peaks.analysis/peaks.outside.gene.stomach.txt
bedtools intersect -a data/bed.files/ENCFF287UHP.bed -b analyses/annotation/gencode.v24.gene.body.bed -v | sort -u > analyses/peaks.analysis/peaks.outside.gene.sigmoidcolon.txt
# check files
mmatitos@Magdalena:~/epigenomics_uvic/ATAC-seq/analyses/peaks.analysis$ ls
ATAC.peaks.sigmoidcolon.txt ATAC.peaks.stomach.txt peaks.outside.gene.sigmoidcolon.txt peaks.outside.gene.stomach.txt
# check number of peaks
wc -l analyses/peaks.analysis/peaks.outside.gene.stomach.txt
25489 analyses/peaks.analysis/peaks.outside.gene.stomach.txt
wc -l analyses/peaks.analysis/peaks.outside.gene.sigmoidcolon.txt
27089 analyses/peaks.analysis/peaks.outside.gene.sigmoidcolon.txt