Open
Description
@endrebak asked for suggestions on a possible API to make this a more general purpose read-my-bam tool.
My thought is a `read_bam` function that returns a `pandas.DataFrame` (similar to what is already here) but that:
- Reads all alignments and all fields by default (including unmapped reads).
- Supports subselecting the fields (columns) being read, for efficiency, using a parameter, say `fields`. For example, `fields=["Chromosome", "Start", "End", "Strand"]` would only read in the specified columns and return a DataFrame with only those columns, similar to `usecols` in `pandas.read_csv`.
- Supports subselecting the alignments (rows) being read to specified regions (and uses the BAM index for doing this). E.g. `regions=[("chr1", 100, 10000)]` would subselect to `chr1:100-10000`.
- Supports subselecting the alignments (rows) being read according to the BAM record flags. I think adding particular parameters for each of these would be the most user friendly. E.g. `only_mapped=True` would be the equivalent of passing `-F 4` to samtools. I think it is really helpful to use named parameters here rather than making the user do bit arithmetic with binary flag codes. Basically, implement this as named arguments.
- Has a `max_alignments` argument so the user can read just the first 10 records by passing `max_alignments=10`.
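To make the row and column selection concrete, here is a minimal sketch in plain Python. It is not the proposed implementation: it works on in-memory dictionaries standing in for decoded BAM records, and it returns a list of dicts where the real function would return a DataFrame. The helper names `exclude_mask` and `select_alignments`, the `Flag` column, and the `only_primary` and `skip_duplicates` arguments are hypothetical and only illustrate the named-argument idea. Only `fields`, `only_mapped`, and `max_alignments` come from the proposal above. `regions` is left out because it depends on the BAM index.

```python
from itertools import islice

# Flag bits as defined by the SAM specification.
FLAG_UNMAPPED = 0x4
FLAG_SECONDARY = 0x100
FLAG_DUPLICATE = 0x400


def exclude_mask(only_mapped=False, only_primary=False, skip_duplicates=False):
    """Translate named arguments into a samtools-style -F exclusion mask.

    Only `only_mapped` is from the proposal; the other two are hypothetical.
    """
    mask = 0
    if only_mapped:
        mask |= FLAG_UNMAPPED
    if only_primary:
        mask |= FLAG_SECONDARY
    if skip_duplicates:
        mask |= FLAG_DUPLICATE
    return mask


def select_alignments(records, fields=None, max_alignments=None, **flag_filters):
    """Drop rows by flag, cap the row count, then keep the requested columns."""
    mask = exclude_mask(**flag_filters)
    kept = (r for r in records if not (r["Flag"] & mask))
    rows = islice(kept, max_alignments)  # a stop of None means "no limit"
    if fields is None:
        return list(rows)
    return [{name: r[name] for name in fields} for r in rows]


# Toy records standing in for decoded BAM alignments (assumed layout).
records = [
    {"Chromosome": "chr1", "Start": 100, "End": 150, "Strand": "+", "Flag": 0},
    {"Chromosome": "*", "Start": -1, "End": -1, "Strand": "+", "Flag": 4},
    {"Chromosome": "chr1", "Start": 200, "End": 250, "Strand": "-", "Flag": 16},
]

subset = select_alignments(
    records,
    fields=["Chromosome", "Start"],
    only_mapped=True,
    max_alignments=10,
)
```

The point of the sketch is that the caller never writes a flag number: `only_mapped=True` becomes the mask `4` internally, which is the same filter as `-F 4` in samtools.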
I think one function that implements this would handle the majority of my use cases for reading BAMs in Python, and would provide a much simpler API to get started with than pysam.