
added contigName Hive style partitioning to AlignmentRecordRDD#17

Open
jpdna wants to merge 1 commit into fnothaft:issues/1018-dataset-api from jpdna:jp_local_frank_1018

Conversation


@jpdna jpdna commented Jul 10, 2017

Will move this PR to bdgenomics once 1018 is merged.
Given this PR, the Parquet directory is laid out with one directory per chromosome (contigName), like:

_SUCCESS
_common_metadata
_metadata
_rgdict.avro
_seqdict.avro
contigName=1
    -> part-r-00000-f872ea82-3036-455a-a35d-d043ec386db4.gz.parquet
    -> (in the future there will be another layer above the Parquet files:
        posBin=10000, posBin=20000, ...)

Later, we will either add a posBin column to the Avro schema, or figure out how to allow that column to exist in the Parquet/Dataset while dropping it from Avro. This will add another layer of directory hierarchy under the 'contigName=N' dirs, binning alignment start positions into 10000 bp bins (or some other optimal size).
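A minimal sketch of the binning idea above, assuming a fixed 10000 bp bin width; `posBin` is a hypothetical helper for illustration, not part of this PR:

```scala
object PosBinning {
  // Assumed bin width in base pairs (the 10000 bp discussed above).
  val BinWidth = 10000L

  // Map an alignment start position to the lower bound of its bin.
  // The result would become the Hive-style partition value,
  // e.g. a read starting at 25432 lands in the posBin=20000 directory.
  def posBin(start: Long, binWidth: Long = BinWidth): Long =
    (start / binWidth) * binWidth
}
```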

As per the discussion in bigdatagenomics#651, such binning should allow more efficient predicate pushdown of range queries than we currently get from Parquet.
I'm hoping this strategy is compatible with, and complementary to, the sorted partition mapping system.
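To illustrate why the Hive-style layout helps pushdown: a reader with an equality predicate on contigName only needs to open the matching partition directories, skipping all others. A toy sketch (directory names are hypothetical, following the layout above):

```scala
object PartitionPruning {
  // Extract the partition value from a Hive-style directory name,
  // e.g. "contigName=1" -> "1".
  def partitionValue(dir: String): String =
    dir.split("=", 2)(1)

  // Given the partition directories under the Parquet root, keep only
  // those a query filtering on contigName actually has to read.
  def prune(dirs: Seq[String], wantedContig: String): Seq[String] =
    dirs.filter(d => partitionValue(d) == wantedContig)
}
```

This is the same pruning Spark's partition discovery performs automatically when a filter matches a partition column.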

The code here can be tested in the shell with:

import org.bdgenomics.adam.rdd.ADAMContext._

val rdd = sc.loadAlignments("../adam/adam-core/src/test/resources/small.sam")

// Filter the Dataset down to contigName "1" before writing.
// (AlignmentRecordProduct must be in scope.)
val x = rdd.transformDataset(ds => {
  import ds.sqlContext.implicits._
  val df = ds.toDF()
  df.where(df("contigName") === "1")
    .as[AlignmentRecordProduct]
})

x.saveAsParquet("test_chr_partitioned_parquet")
