Conversation
heuermh left a comment
LGTM, added a few comments for discussion.
// reload datasets and cache
val na12878Gts = sc.loadGenotypes("%s/NA12878.gt.adam".format(dataDir)).transform(_.cache())
val otherGts = sc.loadGenotypes("%s/NA1289*.gt.adam/*".format(dataDir)).transform(_.cache())
Also asked this elsewhere, why the /* in the glob pattern?
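Partly answering my own question: ADAM writes each .gt.adam dataset as a directory of Parquet part files, so I assume NA1289*.gt.adam matches the directories and the trailing /* descends into each one to pick up the part files, e.g.:

NA12891.gt.adam/
  part-r-00000.gz.parquet
  part-r-00001.gz.parquet
NA12892.gt.adam/
  part-r-00000.gz.parquet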
// and dumping a bad GQ
//
// since these lines are obviously strong HomRef calls, let's recover them by parsing PL's
if (gt.getVariant.getAlternateAllele == null) {
In converting from VCF, we go from <NON_REF> symbolic allele to null variant alternateAllele? For going back to VCF, is there any other reason why variant alternateAllele might be null? Do we write "<NON_REF>" when variant alternateAllele is null or do we rely on htsjdk to do that for us?
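If it helps the discussion, here is a minimal sketch of restoring the symbolic allele explicitly on the way back out, rather than relying on htsjdk; the helper is hypothetical, not what this PR does:

import htsjdk.variant.variantcontext.Allele

// Hypothetical helper: when converting back to VCF, map a null
// alternateAllele back to the <NON_REF> symbolic allele explicitly.
def toVcfAlleles(ref: String, alt: String): java.util.List[Allele] = {
  val refAllele = Allele.create(ref, true)
  val altAllele = if (alt == null) {
    Allele.create("<NON_REF>", false) // symbolic non-reference allele
  } else {
    Allele.create(alt, false)
  }
  java.util.Arrays.asList(refAllele, altAllele)
}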
bcastSampleToPopulationMap.value
  .get(gt.getSampleId)
  .map(population => (population, gt))
}).filter(kv => {
I wonder if there would be utility in filterBy method(s) on VariantRDD/GenotypeRDD to make this easier to read? The filter would need to happen before the flatMap to (population, gt) then. Is there much performance difference between a compound filter like this with lots of expressions and a series of filter calls with smaller expressions?
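On the performance question, my understanding is that consecutive filter calls are narrow transformations that Spark pipelines into a single stage, so one compound predicate and a chain of smaller ones should do essentially the same work; the difference is readability. A sketch of the two shapes (the GQ predicate is illustrative, not taken from this PR):

import org.apache.spark.rdd.RDD
import org.bdgenomics.formats.avro.Genotype

// one compound predicate...
def compoundFilter(gts: RDD[Genotype]): RDD[Genotype] =
  gts.filter(gt => gt.getGenotypeQuality != null && gt.getGenotypeQuality >= 30)

// ...versus a chain of smaller ones; both run in the same stage
def chainedFilter(gts: RDD[Genotype]): RDD[Genotype] =
  gts.filter(gt => gt.getGenotypeQuality != null)
     .filter(gt => gt.getGenotypeQuality >= 30)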
/**
 *
 */
val dataDir = "/data/variant_db/2"
Could you add some metrics to this script?
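Even something as simple as wall-clock timing around each query stage would help; a minimal sketch (the label and usage names are made up):

// simple wall-clock timing wrapper for coarse per-stage metrics
def time[T](label: String)(body: => T): T = {
  val start = System.nanoTime()
  val result = body
  println("%s took %.3f s".format(label, (System.nanoTime() - start) / 1e9))
  result
}

// usage, with a hypothetical query entry point:
// val merged = time("gVCF merge") { runQuery2(sc, dataDir) }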
Hi @Steven-N-Hart! I think things are complete from my side, but I was considering going through and doing a bit of code cleanup and possibly some repackaging. Let me square with @heuermh this morning and we'll figure out next steps. Beyond the README that we owe you, what would be helpful for you?
@fnothaft ,
Don't have these quite done, but they're 99% there. I've been running these in the Spark shell. Query 2 (gVCF merge) is done and tested. Query 1 (integrate across/within populations) is done, but I'm debugging a serialization error that the Spark shell gives me. The scripts are pretty self-contained, but I will be adding a README with some minor implementation details. I'm also going to go through and improve the time measurement code. @Steven-N-Hart let me know what information you would need from me to make maximum use of this.
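On the serialization error: in the Spark shell that is often a closure capturing the enclosing REPL object rather than anything in the script itself. A sketch of the usual workaround, with names assumed from the snippet above rather than verified against the script:

// copy the handle into a local val so the closure doesn't drag in the
// REPL's enclosing object; the broadcast itself is already serializable
val localMap = bcastSampleToPopulationMap
// genotypesRdd: RDD[Genotype], a stand-in for the script's dataset
val byPopulation = genotypesRdd.flatMap(gt => {
  localMap.value
    .get(gt.getSampleId)
    .map(population => (population, gt))
})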
BTW, this depends on a few features that we are triaging into the next release of ADAM, which should be out very soon. They are: