Skip to content

WIP ADAM queries.#6

Open
fnothaft wants to merge 1 commit intoSteven-N-Hart:masterfrom
fnothaft:adam
Open

WIP ADAM queries.#6
fnothaft wants to merge 1 commit intoSteven-N-Hart:masterfrom
fnothaft:adam

Conversation

@fnothaft
Copy link

@fnothaft fnothaft commented Jan 5, 2017

Don't have these quite done, but they're 99% there. I've been running these in the Spark shell. Query 2 (gVCF merge) is done and tested. Query 1 (integrate across/within populations) is done, but I'm debugging a serialization error that the Spark shell gives me. The scripts are pretty self contained, but I will be adding a README with some minor implementation details. I'm also going to be going through and improving the time measurement code. @Steven-N-Hart let me know what information you would need from me to make maximum use of this.

BTW, this depends on a few features that we are triaging into the next release of ADAM, which should be out very soon. They are:

@fnothaft
Copy link
Author

fnothaft commented Jan 5, 2017

CC @heuermh @jpdna

@drachimera
Copy link
Collaborator

drachimera commented Jan 5, 2017 via email

Copy link

@heuermh heuermh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, added a few comments for discussion.


// reload datasets and cache
val na12878Gts = sc.loadGenotypes("%s/NA12878.gt.adam".format(dataDir)).transform(_.cache())
val otherGts = sc.loadGenotypes("%s/NA1289*.gt.adam/*".format(dataDir)).transform(_.cache())
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also asked this elsewhere, why the /* in the glob pattern?

// and dumping a bad GQ
//
// since these lines are obviously strong HomRef calls, let's recover them by parsing PL's
if (gt.getVariant.getAlternateAllele == null) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In converting from VCF, we go from <NON_REF> symbolic allele to null variant alternateAllele? For going back to VCF, is there any other reason why variant alternateAllele might be null? Do we write "<NON_REF>" when variant alternateAllele is null or do we rely on htsjdk to do that for us?

bcastSampleToPopulationMap.value
.get(gt.getSampleId)
.map(population => (population, gt))
}).filter(kv => {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if there would be utility in filterBy method(s) on VariantRDD/GenotypeRDD to make this easier to read? The filter would need to happen before the flatMap to (population, gt) then. Is there much performance difference between a compound filter like this with lots of expressions and a series of filter calls with smaller expressions?

/**
*
*/
val dataDir = "/data/variant_db/2"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add some metrics to this script?

@Steven-N-Hart
Copy link
Owner

Thanks @fnothaft and @heuermh ! Should I accept now or wait until you've completed? I will at least need a README (since I'm not a Scala person).

@fnothaft
Copy link
Author

Hi @Steven-N-Hart! I think things are complete from my side, but I was considering going through and doing a bit of code cleanup, possible repackaging. Let me square with @heuermh this morning and we'll figure out next steps. Beyond the README that we owe you, what would be helpful for you?

@Steven-N-Hart
Copy link
Owner

@fnothaft ,
The README should be sufficient. If I can't complete the analysis or I find a missing dependancy, etc., I'll ping you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants