
[ADAM-909] Refactoring variation RDDs. #1015

Merged
heuermh merged 3 commits into bigdatagenomics:master from fnothaft:genotypes-rdd
Jun 3, 2016

[ADAM-909] Refactoring variation RDDs.#1015
heuermh merged 3 commits intobigdatagenomics:masterfrom
fnothaft:genotypes-rdd

Conversation

@fnothaft
Member

Resolves #909:

  • Refactors org.bdgenomics.adam.rdd.variation to add GenomicRDDs for Genotype, Variant, and VariantContext. These classes write sequence and sample metadata to disk.
  • Refactors ADAMRDDFunctions to an abstract class in preparation for further refactoring in Factor out *RDDFunctions classes #1011.
  • Added AvroGenomicRDD trait which consolidates Parquet + Avro metadata writing code across all Avro data models.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1176/

Build result: FAILURE

[...truncated 24 lines...]
Triggering ADAM-prb ? 2.3.0,2.10,1.6.0,centos
Triggering ADAM-prb ? 2.3.0,2.11,1.4.1,centos
Triggering ADAM-prb ? 2.6.0,2.11,1.4.1,centos
Triggering ADAM-prb ? 2.3.0,2.10,1.4.1,centos
Triggering ADAM-prb ? 2.6.0,2.11,1.3.1,centos
Triggering ADAM-prb ? 2.3.0,2.10,1.3.1,centos
Triggering ADAM-prb ? 2.3.0,2.11,1.5.2,centos
Triggering ADAM-prb ? 2.6.0,2.10,1.3.1,centos
Triggering ADAM-prb ? 2.3.0,2.10,1.5.2,centos
ADAM-prb ? 2.3.0,2.11,1.6.0,centos completed with result SUCCESS
ADAM-prb ? 2.6.0,2.10,1.4.1,centos completed with result SUCCESS
ADAM-prb ? 2.3.0,2.11,1.3.1,centos completed with result SUCCESS
ADAM-prb ? 2.6.0,2.11,1.6.0,centos completed with result SUCCESS
ADAM-prb ? 2.6.0,2.10,1.6.0,centos completed with result SUCCESS
ADAM-prb ? 2.3.0,2.10,1.6.0,centos completed with result SUCCESS
ADAM-prb ? 2.3.0,2.11,1.4.1,centos completed with result SUCCESS
ADAM-prb ? 2.6.0,2.11,1.4.1,centos completed with result SUCCESS
ADAM-prb ? 2.3.0,2.10,1.4.1,centos completed with result FAILURE
ADAM-prb ? 2.6.0,2.11,1.3.1,centos completed with result SUCCESS
ADAM-prb ? 2.3.0,2.10,1.3.1,centos completed with result SUCCESS
ADAM-prb ? 2.3.0,2.11,1.5.2,centos completed with result SUCCESS
ADAM-prb ? 2.6.0,2.10,1.3.1,centos completed with result SUCCESS
ADAM-prb ? 2.3.0,2.10,1.5.2,centos completed with result SUCCESS
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@fnothaft
Member Author

Jenkins, retest this please.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1177/
Test PASSed.

val path = resourcePath("small.vcf")

- val vcs = sc.loadGenotypes(path).toVariantContext.collect.sortBy(_.position)
+ val vcs = sc.loadGenotypes(path).toVariantContextRDD.rdd.collect.sortBy(_.position)
Member

Since we have this

object ADAMContext {
  implicit def genomicRDDToRDD[T](gRdd: GenomicRDD[T]): RDD[T] = gRdd.rdd
  // ...
}

https://github.com/bigdatagenomics/adam/blob/master/adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMContext.scala#L114

why is the .rdd here and elsewhere necessary? Implicits work fine except for when they don't?

Member Author

Oh man. I zoned on the implicit. That's my bad. I'll fix this.

Member

Sorry, what I meant was that sometimes the genomicRDDToRDD implicit works for me and sometimes it doesn't. Since you are using .rdd here, I assumed the explicit calls were necessary. The Scala compiler and I are still getting to know each other.

Member Author

Nah I was using them because I forgot that we had the implicit.
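To make the implicit concrete for future readers, here is a minimal sketch of why the explicit .rdd calls are redundant. It assumes the genomicRDDToRDD implicit linked above and the toVariantContextRDD method from this PR's diff; this is illustration, not code from the PR:

```scala
import org.bdgenomics.adam.rdd.ADAMContext._ // brings genomicRDDToRDD into scope

// With the implicit in scope, RDD methods can be called on the wrapper
// directly; the compiler rewrites gRdd.collect as gRdd.rdd.collect:
val vcs = sc.loadGenotypes(path)
  .toVariantContextRDD // still a VariantContextRDD here
  .collect             // implicit conversion to RDD[VariantContext] kicks in
  .sortBy(_.position)

// The explicit form compiles to the same thing, just more verbosely:
val explicit = sc.loadGenotypes(path).toVariantContextRDD.rdd.collect.sortBy(_.position)
```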

@fnothaft
Member Author

Just pushed an update with upgrades to bdg-formats 0.8.0 and cleanup of the .rdd bits. This fails unit tests right now, which I'll patch up this AM. CC @erictu @heuermh

The 0.8.0 upgrade is split out into a separate commit. This will need to be squashed down before merge.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1239/

Build result: FAILURE

GitHub pull request #1015 of commit 0ecfae9 automatically merged.
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
[EnvInject] - Loading node environment variables.
Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb
 > /home/jenkins/git2/bin/git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git --version # timeout=10
 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15
 > /home/jenkins/git2/bin/git rev-parse origin/pr/1015/merge^{commit} # timeout=10
 > /home/jenkins/git2/bin/git branch -a --contains 8a08bb5f7a926e05f9627de97e383f281c935d51 # timeout=10
 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1015/merge^{commit} # timeout=10
Checking out Revision 8a08bb5f7a926e05f9627de97e383f281c935d51 (origin/pr/1015/merge)
 > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10
 > /home/jenkins/git2/bin/git checkout -f 8a08bb5f7a926e05f9627de97e383f281c935d51
First time build. Skipping changelog.
Triggering ADAM-prb ? 2.6.0,2.10,1.5.2,centos
Triggering ADAM-prb ? 2.6.0,2.11,1.5.2,centos
Touchstone configurations resulted in FAILURE, so aborting...
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@fnothaft
Member Author

Just pushed another commit to address the failing unit tests. Please review @heuermh @erictu.

@heuermh you may be interested to find that GZIPed VCF loading doesn't work. Alas, the VCFHeaderReader code from Hadoop-BAM seems to fail when trying to load GZIPed VCF. I think there's a fall-through case that wasn't considered properly. I'm going to open an issue here to test more, and then if I can isolate/test/resolve it, I'll go upstream to Hadoop-BAM.

@heuermh
Member

heuermh commented May 20, 2016

you may be interested to find that GZIPed VCF loading doesn't work.

It was working for me, and there is a unit test to verify it.
https://github.com/bigdatagenomics/adam/blame/master/adam-core/src/test/scala/org/bdgenomics/adam/rdd/ADAMContextSuite.scala#L284

@fnothaft
Member Author

fnothaft commented May 20, 2016

you may be interested to find that GZIPed VCF loading doesn't work.

It was working for me, and there is a unit test to verify it.
https://github.com/bigdatagenomics/adam/blame/master/adam-core/src/test/scala/org/bdgenomics/adam/rdd/ADAMContextSuite.scala#L284

Oh, yeah; didn't mean to imply that it was currently misimplemented. The issue comes up because we now rely on reading the VCF header separately to pull out sequence/sample metadata:

https://github.com/bigdatagenomics/adam/pull/1015/files#diff-d36ea7d0742decd0b040a73a96af06e9R136

The VCFHeaderReader in Hadoop-BAM appears to mishandle GZIP/BGZIPed VCFs. As far as I can tell, it falls through and tries to treat them as BCF files.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1240/
Test PASSed.

@heuermh
Member

heuermh commented May 20, 2016

Yep, in agreement. There's some duplicated code in Hadoop-BAM that should be shared, and things probably diverged between the VCF input format and VCF header reader in this case.

val variantContextRdd = sc.loadVcf(args.vcfPath, sdOpt = dictionary)
var variantContextsToSave = if (args.coalesce > 0) {
if (args.coalesce > variantContextRdd.partitions.size || args.forceShuffle) {
variantContextRdd.transform(_.coalesce(args.coalesce, shuffle = true))
Member

could you explain what the .transform(_. part is doing here? is it not possible to call variantContextRdd.coalesce directly?

Member Author

.transform(_.) transforms the RDD that underlies the VariantContextRDD and emits a new VariantContextRDD.

Member Author

If you called variantContextRdd.coalesce directly, it would emit an RDD[VariantContext], and you'd lose the .saveAsVcf/etc. functions, which we use later.
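A sketch of the distinction (simplified; the exact saveAsVcf signature and args fields here are illustrative, not the PR's code):

```scala
// transform applies RDD[VariantContext] => RDD[VariantContext] to the
// wrapped RDD and rebuilds the VariantContextRDD, so the sequence/sample
// metadata and methods like saveAsVcf survive:
val coalesced: VariantContextRDD =
  variantContextRdd.transform(_.coalesce(args.coalesce, shuffle = true))

// Going through the implicit conversion instead unwraps to a bare RDD;
// the metadata and the save methods are gone:
val bare: RDD[VariantContext] =
  variantContextRdd.coalesce(args.coalesce, shuffle = true)
```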

Member

I'm liking this pattern less as time goes on. Perhaps we should drop RDD from the VariantContextRDD class names for something more generic and remove the implicit conversion. Or at least remove the implicit conversion. What the user gets back from an ADAMContext load method should has_a RDD and not pretend to is_a RDD.

Member

Perhaps we should drop RDD from the VariantContextRDD

I agree there may be some nomenclature concern: our many *RDD container objects for rdd+dicts are not RDDs in the same true sense as IntervalRDD and IndexedRDD, which actually extend RDD. If I am reading correctly, there is not an implicit conversion that would even allow these to be treated as an RDD directly. The meaning seems more like HolderOfRDDOfVariantContextWithDicts, though I don't advocate that name. What do you think @fnothaft? At this point, though, we have already gone down this path a ways using the suffix "RDD" in a broad sense, so a renaming should perhaps come in the future, after this PR.

Member

@heuermh heuermh May 26, 2016

There's this implicit conversion from GenomicRDD (which VariantContextRDD and others extend from) to RDD
https://github.com/fnothaft/adam/blob/822c64e3530d7ed8bb86daeb7b900bd7bf2f7854/adam-core/src/main/scala/org/bdgenomics/adam/rdd/ADAMContext.scala#L116

Member

Created a new issue #1040, I'm ok with this for now.
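For the record, the has_a shape being discussed would look roughly like this (hypothetical names and fields, not code from this PR):

```scala
// A container that owns an RDD plus its metadata and exposes it
// explicitly, rather than pretending to be an RDD via an implicit:
case class VariantContexts(
    rdd: RDD[VariantContext],
    sequences: SequenceDictionary,
    samples: Seq[Sample]) {

  // Transformations are wrapped so the metadata travels with the data.
  def transform(f: RDD[VariantContext] => RDD[VariantContext]): VariantContexts =
    copy(rdd = f(rdd))
}
```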

@heuermh
Member

heuermh commented May 24, 2016

Created HadoopGenomics/Hadoop-BAM#96 to document missing support for gzipped and BGZF VCF formats.

.map { case (v: RichVariant, g) => new VariantContext(ReferencePosition(v), v, g, None) }
}

def filterByOverlappingRegion(query: ReferenceRegion): RDD[Genotype] = {
Member

Can we get this operation supported over the GenotypeRDD class?

Member Author

Added.
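The overlap filter lifted onto the wrapper can be sketched as follows (the genotype field accessors and ReferenceRegion construction here are assumptions; the actual schema accessors may differ):

```scala
// Keep only genotypes whose region overlaps the query, staying inside
// GenotypeRDD so the sequence/sample metadata is preserved:
def filterByOverlappingRegion(query: ReferenceRegion): GenotypeRDD =
  transform(_.filter(g =>
    query.overlaps(ReferenceRegion(g.getContigName, g.getStart, g.getEnd))))
```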

@fnothaft
Member Author

Just pushed 0f90302 to address the VCF.GZ issues. I will merge @jpdna's PR as well. @heuermh I'm still working on the BCF issue. There's something more going on here. I will ping back with more of an error trace.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1243/
Test PASSed.

@heuermh
Member

heuermh commented May 26, 2016

I'm still working on the BCF issue.

Which BCF issue? BCF files probably won't work in ADAM due to issue(s) upstream in HTSJDK, where they don't follow the latest BCF specification.

@fnothaft
Member Author

I'm still working on the BCF issue.

Which BCF issue? BCF files probably won't work in ADAM due to issue(s) upstream in HTSJDK, where they don't follow the latest BCF specification.

Ah, OK! That is the exact issue I was running into. I won't hassle myself trying to fix that then.

(I tried regenerating the BCF using Picard, which, amusingly enough, doesn't seem to work.)

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1244/
Test PASSed.

@fnothaft
Member Author

Just pushed a commit that (I think) addresses the remaining review comments. Can I get a final pass on this? I will then squash this down to two commits.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1245/
Test PASSed.

@heuermh
Member

heuermh commented May 26, 2016

+1, thanks!

@jpdna
Member

jpdna commented May 26, 2016

+1 - I have no outstanding concerns

@erictu
Member

erictu commented May 26, 2016

+1, seems good to me!

@fnothaft
Member Author

Squashed down into three commits and cleaned up the history.

@fnothaft
Member Author

Hold on the merge; I see a unit test failure in one of the intermediate commits when building locally.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1246/
Test PASSed.

@fnothaft
Member Author

Fixed and pushed.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1247/

Build result: FAILURE

[...truncated 24 lines...]
Triggering ADAM-prb ? 2.6.0,2.10,1.6.1,centos
Triggering ADAM-prb ? 2.3.0,2.10,1.6.1,centos
Triggering ADAM-prb ? 2.3.0,2.10,1.5.2,centos
Triggering ADAM-prb ? 2.6.0,2.10,1.4.1,centos
Triggering ADAM-prb ? 2.3.0,2.11,1.5.2,centos
Triggering ADAM-prb ? 2.6.0,2.11,1.4.1,centos
Triggering ADAM-prb ? 2.3.0,2.11,1.6.1,centos
Triggering ADAM-prb ? 2.3.0,2.11,1.4.1,centos
Triggering ADAM-prb ? 2.3.0,2.10,1.4.1,centos
ADAM-prb ? 2.6.0,2.11,1.6.1,centos completed with result SUCCESS
ADAM-prb ? 2.6.0,2.11,1.3.1,centos completed with result SUCCESS
ADAM-prb ? 2.3.0,2.10,1.3.1,centos completed with result SUCCESS
ADAM-prb ? 2.6.0,2.10,1.3.1,centos completed with result SUCCESS
ADAM-prb ? 2.3.0,2.11,1.3.1,centos completed with result SUCCESS
ADAM-prb ? 2.6.0,2.10,1.6.1,centos completed with result SUCCESS
ADAM-prb ? 2.3.0,2.10,1.6.1,centos completed with result SUCCESS
ADAM-prb ? 2.3.0,2.10,1.5.2,centos completed with result SUCCESS
ADAM-prb ? 2.6.0,2.10,1.4.1,centos completed with result FAILURE
ADAM-prb ? 2.3.0,2.11,1.5.2,centos completed with result SUCCESS
ADAM-prb ? 2.6.0,2.11,1.4.1,centos completed with result SUCCESS
ADAM-prb ? 2.3.0,2.11,1.6.1,centos completed with result SUCCESS
ADAM-prb ? 2.3.0,2.11,1.4.1,centos completed with result SUCCESS
ADAM-prb ? 2.3.0,2.10,1.4.1,centos completed with result SUCCESS
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@fnothaft
Member Author

Jenkins, retest this please.

@heuermh
Member

heuermh commented May 26, 2016

Do you think our logging-related changes might have anything to do with that failure? We removed some of the logging configuration initialization bits in #1028.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1248/
Test PASSed.

@erictu
Member

erictu commented May 27, 2016

Can we add the corresponding schema changes to org.bdgenomics.adam.projections.GenotypeField?

@heuermh
Member

heuermh commented Jun 1, 2016

@fnothaft do you want to push another commit for GenotypeField and then re-squash? I could open a PR against this one, but it probably isn't worth it for such a small change.


// due to a bug upstream in Hadoop-BAM, the VCFHeaderReader class errors when reading
// headers from .vcf.gz files
//
Member

fnothaft added 3 commits June 1, 2016 14:24
Resolves bigdatagenomics#909:

* Refactors `org.bdgenomics.adam.rdd.variation` to add `GenomicRDD`s for
  `Genotype`, `Variant`, and `VariantContext`. These classes write
  sequence and sample metadata to disk.
* Refactors `ADAMRDDFunctions` to an abstract class in preparation for
  further refactoring in bigdatagenomics#1011.
* Added `AvroGenomicRDD` trait which consolidates Parquet + Avro metadata
  writing code across all Avro data models.
@fnothaft
Member Author

fnothaft commented Jun 1, 2016

@fnothaft do you want to push another commit for GenotypeField and then re-squash? I could pr against this one but it probably isn't worth it for such a small change.

Sorry about the delay on this @heuermh @erictu! Just pushed the change.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1249/
Test PASSed.

@heuermh
Member

heuermh commented Jun 1, 2016

+1

1 similar comment
@erictu
Member

erictu commented Jun 2, 2016

+1

@heuermh heuermh merged commit 525fda8 into bigdatagenomics:master Jun 3, 2016
@heuermh
Member

heuermh commented Jun 3, 2016

Merged manually as 449a882, 2c07382 and 525fda8. Thanks!
