-
Notifications
You must be signed in to change notification settings - Fork 4
Inconsistencies between GRCh38 BedGraph download from HoffmanLab and bedgraph created from UCSC bigWig file #21
Description
My end goal is to annotate a VCF file with mappability scores, but in spot checking annotations for some mutation sites with the UCSC Genome Browser, I noticed some inconsistencies. Specifically, I annotated variants using bcftools annotate with a slightly modified (header removed) BedGraph file https://bismap.hoffmanlab.org/raw/hg38/k100.umap.bedgraph.gz. The variant that I first noticed was at chr1:1641723, for which multi-read k100 UMAP value as 1.0, but UCSC's value was 0.18.
So, to check this further, I downloaded the following, and subsetted data for chr1.
wget https://bismap.hoffmanlab.org/raw/hg38/k100.umap.bedgraph.gz
wget http://hgdownload.soe.ucsc.edu/admin/exe/linux.x86_64/bigWigToBedGraph
wget https://hgdownload.soe.ucsc.edu/gbdb/hg38/hoffmanMappability/k100.Umap.MultiTrackMappability.bw
bigWigToBedGraph -chrom=chr1 k100.Umap.MultiTrackMappability.bw k100.Umap.MultiTrackMappability.chr1.bedgraph
I imported the BW->BG file k100.Umap.MultiTrackMappability.chr1.bedgraph and the k100.umap.bedgraph.gz into R and merged the tables on CHROM, FROM, and TO. Note, to line up the tables, the FROM and TO values in the raw BedGraph file were one base higher than the matching rows in the file output by bigWigToBedGraph, so I subtracted 1 from FROM and TO for merging tables. That seems a little odd to me, as it seems like both should have consistent start and end positions.
I thought originally that each row should match one in the other file, but only 57.9% matched exactly:
The plots below compare the values between the two files around the above mutation site. I added status columns to show the sources of data for each row and the length of the regions encoded by a particular row (end with ".l").
Notably, from row 18, the raw BedGraph has a value of 1.0 for the 86 basepair region chr1:1641640-1641726 that includes chr1:1641723, but the BW->BG file has various values for those positions that continue down past 1641726 to position 1641739.
Also, if we line up the BW->BG values down from row 19 and the raw BG file's values from row 104 (from position 1641726), the values are equal until position 1642025, suggesting that something occurred in creating the raw BedGraph to shift the original values down.

