[ADAM-909] Refactoring variation RDDs.#1015
|
Test FAILed. Build result: FAILURE
[...truncated 24 lines...]
Triggering ADAM-prb » 2.3.0,2.10,1.6.0,centos
Triggering ADAM-prb » 2.3.0,2.11,1.4.1,centos
Triggering ADAM-prb » 2.6.0,2.11,1.4.1,centos
Triggering ADAM-prb » 2.3.0,2.10,1.4.1,centos
Triggering ADAM-prb » 2.6.0,2.11,1.3.1,centos
Triggering ADAM-prb » 2.3.0,2.10,1.3.1,centos
Triggering ADAM-prb » 2.3.0,2.11,1.5.2,centos
Triggering ADAM-prb » 2.6.0,2.10,1.3.1,centos
Triggering ADAM-prb » 2.3.0,2.10,1.5.2,centos
ADAM-prb » 2.3.0,2.11,1.6.0,centos completed with result SUCCESS
ADAM-prb » 2.6.0,2.10,1.4.1,centos completed with result SUCCESS
ADAM-prb » 2.3.0,2.11,1.3.1,centos completed with result SUCCESS
ADAM-prb » 2.6.0,2.11,1.6.0,centos completed with result SUCCESS
ADAM-prb » 2.6.0,2.10,1.6.0,centos completed with result SUCCESS
ADAM-prb » 2.3.0,2.10,1.6.0,centos completed with result SUCCESS
ADAM-prb » 2.3.0,2.11,1.4.1,centos completed with result SUCCESS
ADAM-prb » 2.6.0,2.11,1.4.1,centos completed with result SUCCESS
ADAM-prb » 2.3.0,2.10,1.4.1,centos completed with result FAILURE
ADAM-prb » 2.6.0,2.11,1.3.1,centos completed with result SUCCESS
ADAM-prb » 2.3.0,2.10,1.3.1,centos completed with result SUCCESS
ADAM-prb » 2.3.0,2.11,1.5.2,centos completed with result SUCCESS
ADAM-prb » 2.6.0,2.10,1.3.1,centos completed with result SUCCESS
ADAM-prb » 2.3.0,2.10,1.5.2,centos completed with result SUCCESS
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.
|
Jenkins, retest this please. |
|
Test PASSed. |
  val path = resourcePath("small.vcf")

- val vcs = sc.loadGenotypes(path).toVariantContext.collect.sortBy(_.position)
+ val vcs = sc.loadGenotypes(path).toVariantContextRDD.rdd.collect.sortBy(_.position)
Since we have this
object ADAMContext {
  implicit def genomicRDDToRDD[T](gRdd: GenomicRDD[T]): RDD[T] = gRdd.rdd
why is the .rdd here and elsewhere necessary? Implicits work fine except for when they don't?
Oh man. I zoned on the implicit. That's my bad. I'll fix this.
Sorry, what I meant was that sometimes the genomicRDDToRDD implicit works for me and sometimes it doesn't. I assumed that because you are using .rdd here, it is necessary. The Scala compiler and I are still getting to know each other.
Nah I was using them because I forgot that we had the implicit.
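For readers following this exchange, here is a minimal, Spark-free sketch of when an implicit view shaped like genomicRDDToRDD gets applied. `Wrapper` and `wrapperToSeq` are hypothetical stand-ins (a `Seq` plays the role of the RDD), so this illustrates the language mechanism only and is not ADAM's code. One case where the conversion is silently skipped is when the wrapper already has a member with the requested name.

```scala
import scala.language.implicitConversions

object ImplicitViewSketch {
  // Hypothetical stand-in for GenomicRDD[T]; a Seq plays the role of the RDD.
  class Wrapper[T](val rdd: Seq[T])

  // Same shape as the genomicRDDToRDD conversion quoted above.
  implicit def wrapperToSeq[T](w: Wrapper[T]): Seq[T] = w.rdd

  val w = new Wrapper(Seq(3, 1, 2))

  // Wrapper has no `sorted` member, so the compiler inserts wrapperToSeq(w).
  val viaImplicit: Seq[Int] = w.sorted

  // Explicit access gives the same result without relying on the implicit.
  val viaField: Seq[Int] = w.rdd.sorted

  // Every class already has toString, so no conversion is attempted here:
  // this is the wrapper's toString, not the Seq's.
  val wrapperString: String = w.toString
}
```

The conversion also has to be in implicit scope at the call site; in this sketch it is, because everything lives in one object.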
|
Test FAILed. Build result: FAILURE
GitHub pull request #1015 of commit 0ecfae9 automatically merged.
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
[EnvInject] - Loading node environment variables.
Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb
 > /home/jenkins/git2/bin/git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git --version # timeout=10
 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15
 > /home/jenkins/git2/bin/git rev-parse origin/pr/1015/merge^{commit} # timeout=10
 > /home/jenkins/git2/bin/git branch -a --contains 8a08bb5f7a926e05f9627de97e383f281c935d51 # timeout=10
 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1015/merge^{commit} # timeout=10
Checking out Revision 8a08bb5f7a926e05f9627de97e383f281c935d51 (origin/pr/1015/merge)
 > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10
 > /home/jenkins/git2/bin/git checkout -f 8a08bb5f7a926e05f9627de97e383f281c935d51
First time build. Skipping changelog.
Triggering ADAM-prb » 2.6.0,2.10,1.5.2,centos
Triggering ADAM-prb » 2.6.0,2.11,1.5.2,centos
Touchstone configurations resulted in FAILURE, so aborting...
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.
|
Just pushed another commit to address the failing unit tests. Please review @heuermh @erictu. @heuermh you may be interested to find that GZIPed VCF loading doesn't work. Alas, the VCFHeaderReader code from Hadoop-BAM seems to fail when trying to load GZIPed VCF. I think there's a fall-through case that wasn't considered properly. I'm going to open an issue here to test more, and then if I can isolate/test/resolve it, I'll go upstream to Hadoop-BAM. |
It was working for me, and there is a unit test to verify it. |
Oh, yeah; didn't mean to imply that it was currently misimplemented. The issue comes up because we now rely on reading the VCF header separately to pull out sequence/sample metadata: https://github.com/bigdatagenomics/adam/pull/1015/files#diff-d36ea7d0742decd0b040a73a96af06e9R136 The |
|
Test PASSed. |
|
Yep, in agreement. There's some duplicated code in Hadoop-BAM that should be shared, and things probably diverged between the VCF input format and VCF header reader in this case. |
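To make "reading the VCF header separately" concrete, here is a rough sketch that pulls the leading `#` lines out of gzip-compressed VCF text using only JDK classes. It does not use or reproduce Hadoop-BAM's VCFHeaderReader, and it does not demonstrate the failure discussed above; `gzip`, `headerLines`, and the tiny in-memory VCF are made up for illustration.

```scala
import java.io.{BufferedReader, ByteArrayInputStream, ByteArrayOutputStream, InputStreamReader}
import java.nio.charset.StandardCharsets
import java.util.zip.{GZIPInputStream, GZIPOutputStream}

object GzipHeaderSketch {
  // Compress a string with gzip, standing in for a .vcf.gz file on disk.
  def gzip(text: String): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new GZIPOutputStream(bytes)
    out.write(text.getBytes(StandardCharsets.UTF_8))
    out.close()
    bytes.toByteArray
  }

  // Read only the leading '#' lines, stopping at the first record line.
  def headerLines(compressed: Array[Byte]): List[String] = {
    val reader = new BufferedReader(new InputStreamReader(
      new GZIPInputStream(new ByteArrayInputStream(compressed)),
      StandardCharsets.UTF_8))
    try {
      Iterator.continually(reader.readLine())
        .takeWhile(line => line != null && line.startsWith("#"))
        .toList
    } finally {
      reader.close()
    }
  }

  // Illustrative data only: two meta lines, the column line, one record.
  val vcf: String =
    "##fileformat=VCFv4.1\n" +
    "##contig=<ID=1,length=249250621>\n" +
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n" +
    "1\t14397\t.\tCTGT\tC\t139.12\tPASS\t.\n"

  val header: List[String] = headerLines(gzip(vcf))
}
```

Sequence and sample metadata would then be parsed out of the `##contig` lines and the `#CHROM` column line, which is the part the header reader is relied on for here.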
val variantContextRdd = sc.loadVcf(args.vcfPath, sdOpt = dictionary)
var variantContextsToSave = if (args.coalesce > 0) {
  if (args.coalesce > variantContextRdd.partitions.size || args.forceShuffle) {
    variantContextRdd.transform(_.coalesce(args.coalesce, shuffle = true))
could you explain what the .transform(_. part is doing here? is it not possible to call variantContextRdd.coalesce directly?
.transform(_.) transforms the RDD that underlies the VariantContextRDD and emits a new VariantContextRDD.
If you did variantContextRdd.coalesce, that would emit an RDD[VariantContext] and you'd lose the .saveAsVcf/etc. functions, which we use later.
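The difference between the two calls can be sketched without Spark. In the snippet below, `Dataset`, `describe`, and the sample data are hypothetical stand-ins: a `Seq` plays the role of the underlying RDD and `describe` plays the role of wrapper-only methods such as saveAsVcf. It is a sketch of the pattern as described in this thread, not ADAM's implementation.

```scala
object TransformSketch {
  // Hypothetical stand-in for VariantContextRDD: data plus metadata.
  case class Dataset[T](rdd: Seq[T], samples: Seq[String]) {
    // Apply a function to the underlying collection and rewrap the result,
    // so the metadata and the wrapper-only methods survive.
    def transform(f: Seq[T] => Seq[T]): Dataset[T] = copy(rdd = f(rdd))

    // Stand-in for wrapper-only methods such as saveAsVcf.
    def describe: String = {
      val names = samples.mkString(",")
      s"${rdd.size} records, samples=$names"
    }
  }

  val ds = Dataset(Seq("chr1:100", "chr1:200", "chr2:50"), Seq("NA12878"))

  // Still a Dataset, so describe remains available afterwards.
  val kept: Dataset[String] = ds.transform(_.filter(_.startsWith("chr1")))

  // Operating on the underlying collection directly yields a bare Seq,
  // which has no describe method and no sample metadata.
  val bare: Seq[String] = ds.rdd.filter(_.startsWith("chr1"))
}
```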
I'm liking this pattern less as time goes on. Perhaps we should drop RDD from the VariantContextRDD class names in favor of something more generic and remove the implicit conversion. Or at least remove the implicit conversion. What the user gets back from an ADAMContext load method should have an RDD (has-a), not pretend to be an RDD (is-a).
> Perhaps we should drop RDD from the VariantContextRDD

I agree there may be some nomenclature concern that our many *RDD container objects for rdd+dicts are not RDDs in the same true sense as IntervalRDD and IndexedRDD, which actually extend RDD. If I am reading correctly, there is not an implicit conversion that would even allow these to be treated as an RDD directly. The meaning seems more like HolderOfRDDOfVariantContextWithDicts, though I don't advocate that name. What do you think, @fnothaft? At this point, though, we have already gone down this path a ways using the suffix "RDD" in a broad sense, so a renaming should perhaps come in the future, after this PR.
There's this implicit conversion from GenomicRDD (which VariantContextRDD and others extend from) to RDD
https://github.com/fnothaft/adam/blob/822c64e3530d7ed8bb86daeb7b900bd7bf2f7854/adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMContext.scala#L116
Created a new issue #1040, I'm ok with this for now.
|
Created HadoopGenomics/Hadoop-BAM#96 to document missing support for gzipped and BGZF VCF formats. |
    .map { case (v: RichVariant, g) => new VariantContext(ReferencePosition(v), v, g, None) }
}

def filterByOverlappingRegion(query: ReferenceRegion): RDD[Genotype] = {
Can we get this operation supported over the GenotypeRDD class?
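As a rough idea of what that could look like, the sketch below puts a region filter on a wrapper class so the result keeps its metadata instead of degrading to a bare collection. `Region`, `Call`, and `Calls` are hypothetical stand-ins for ReferenceRegion, Genotype, and GenotypeRDD, a `Seq` stands in for the RDD, and the half-open `[start, end)` interval convention is an assumption of this sketch.

```scala
object OverlapSketch {
  // Hypothetical region on a named contig; assumed half-open [start, end).
  case class Region(contig: String, start: Long, end: Long) {
    // Two regions overlap if they share a contig and their intervals intersect.
    def overlaps(other: Region): Boolean =
      contig == other.contig && start < other.end && other.start < end
  }

  // Hypothetical stand-in for a Genotype record.
  case class Call(sample: String, region: Region)

  // Hypothetical stand-in for GenotypeRDD: records plus sample metadata.
  case class Calls(rdd: Seq[Call], samples: Seq[String]) {
    // Returns the wrapper type rather than a bare collection.
    def filterByOverlappingRegion(query: Region): Calls =
      copy(rdd = rdd.filter(_.region.overlaps(query)))
  }

  val calls = Calls(
    Seq(
      Call("NA12878", Region("1", 100L, 101L)),
      Call("NA12878", Region("1", 200L, 201L)),
      Call("NA12878", Region("2", 100L, 101L))),
    Seq("NA12878"))

  // Only the first call lies on contig 1 within [50, 150).
  val hits: Calls = calls.filterByOverlappingRegion(Region("1", 50L, 150L))
}
```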
|
Test PASSed. |
Which BCF issue? BCF files probably won't work in ADAM due to issue(s) upstream in HTSJDK, where they don't follow the latest BCF specification. |
Ah, OK! That is the exact issue I was running into. I won't hassle myself trying to fix that then. (I tried regenerating BCF using Picard, which amusingly enough, doesn't seem to work) |
|
Test PASSed. |
|
Just pushed a commit that (I think) addresses the remaining review comments. Can I get a final pass on this? I will then squash this down to two commits. |
|
Test PASSed. |
|
+1, thanks! |
|
+1 - I have no outstanding concerns |
|
+1, seems good to me! |
|
Squashed down into three commits and cleaned up the history. |
|
Hold on the merge; I see a unit test failure in one of the intermediate commits when building locally. |
|
Test PASSed. |
|
Fixed and pushed. |
|
Test FAILed. Build result: FAILURE
[...truncated 24 lines...]
Triggering ADAM-prb » 2.6.0,2.10,1.6.1,centos
Triggering ADAM-prb » 2.3.0,2.10,1.6.1,centos
Triggering ADAM-prb » 2.3.0,2.10,1.5.2,centos
Triggering ADAM-prb » 2.6.0,2.10,1.4.1,centos
Triggering ADAM-prb » 2.3.0,2.11,1.5.2,centos
Triggering ADAM-prb » 2.6.0,2.11,1.4.1,centos
Triggering ADAM-prb » 2.3.0,2.11,1.6.1,centos
Triggering ADAM-prb » 2.3.0,2.11,1.4.1,centos
Triggering ADAM-prb » 2.3.0,2.10,1.4.1,centos
ADAM-prb » 2.6.0,2.11,1.6.1,centos completed with result SUCCESS
ADAM-prb » 2.6.0,2.11,1.3.1,centos completed with result SUCCESS
ADAM-prb » 2.3.0,2.10,1.3.1,centos completed with result SUCCESS
ADAM-prb » 2.6.0,2.10,1.3.1,centos completed with result SUCCESS
ADAM-prb » 2.3.0,2.11,1.3.1,centos completed with result SUCCESS
ADAM-prb » 2.6.0,2.10,1.6.1,centos completed with result SUCCESS
ADAM-prb » 2.3.0,2.10,1.6.1,centos completed with result SUCCESS
ADAM-prb » 2.3.0,2.10,1.5.2,centos completed with result SUCCESS
ADAM-prb » 2.6.0,2.10,1.4.1,centos completed with result FAILURE
ADAM-prb » 2.3.0,2.11,1.5.2,centos completed with result SUCCESS
ADAM-prb » 2.6.0,2.11,1.4.1,centos completed with result SUCCESS
ADAM-prb » 2.3.0,2.11,1.6.1,centos completed with result SUCCESS
ADAM-prb » 2.3.0,2.11,1.4.1,centos completed with result SUCCESS
ADAM-prb » 2.3.0,2.10,1.4.1,centos completed with result SUCCESS
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.
|
Jenkins, retest this please. |
|
Do you think our logging related changes might have anything to do with that failure? We removed some of the logging configuration initialization bits in #1028 |
|
Test PASSed. |
|
Can we add in the corresponding changes of schema to |
|
@fnothaft do you want to push another commit for |
|
|
||
// due to a bug upstream in Hadoop-BAM, the VCFHeaderReader class errors when reading
// headers from .vcf.gz files
//
Resolves bigdatagenomics#909:
* Refactors `org.bdgenomics.adam.rdd.variation` to add `GenomicRDD`s for `Genotype`, `Variant`, and `VariantContext`. These classes write sequence and sample metadata to disk.
* Refactors `ADAMRDDFunctions` to an abstract class in preparation for further refactoring in bigdatagenomics#1011.
* Added `AvroGenomicRDD` trait which consolidates Parquet + Avro metadata writing code across all Avro data models.
|
Test PASSed. |
|
+1 |
|
+1 |
Resolves #909:
* Refactors `org.bdgenomics.adam.rdd.variation` to add `GenomicRDD`s for `Genotype`, `Variant`, and `VariantContext`. These classes write sequence and sample metadata to disk.
* Refactors `ADAMRDDFunctions` to an abstract class in preparation for further refactoring in #1011 (Factor out *RDDFunctions classes).
* Added `AvroGenomicRDD` trait which consolidates Parquet + Avro metadata writing code across all Avro data models.