Skip to content

Upgrade to Pysam 0.14.1#6

Open
Munchic wants to merge 357 commits intocryptofrom
update/pysam-0.14.1
Open

Upgrade to Pysam 0.14.1#6
Munchic wants to merge 357 commits intocryptofrom
update/pysam-0.14.1

Conversation

@Munchic
Copy link
Owner

@Munchic Munchic commented Jun 26, 2018

Notices:

  • bgzf.c and hfile_libcurl.c remain as originally implemented in pd3/htslib/crypto

AndreasHeger and others added 30 commits July 3, 2017 15:50
…x-instantiate-records

Ah faidx instantiate records
…bix_add_line_skip

{AH} add tabix line_skip, fixes pysam-developers#487. Remove pileup preset.
…am-developers#513)

## Background

`bcf_translate` works correctly on newly created or loaded BCF records, but it does not properly account for records which have been modified by having one or more INFO fields deleted.

## Nature of fix

Prior to attempt to update the info value (either in place or after resizing), update the entry's key id (to keep all ids valid) and then check whether `line->d.info[i].vptr` is `NULL`.   This indicates that the key and value have been deleted.  Otherwise, `vptr` is incorrectly assumed to point to a valid value and a segmentation fault results.

## Other issues not addressed in this PR

`bcf_translate` uses a data structure stored within the source BCF header data structure.  This is a poor design decision, because it presumes that any header will be used to translate to no more than one destination header.  Worse, there is no check to verify that the same destination header is used or that the source and destination have not changed since the translation table was initialized.  Both requirements are unchecked and can cause crashes or write corrupted data.  There is no good reason that the translation data structure is part of the header, except that it would require breaking ABI compatibility.  This suggests adding a `bcf_translate2` function that takes an explicit translation table as a parameter with clear documentation on the assumptions and lifetime of the translation struct.

## Notes

This fix was also submitted upstream to htslib.  See [PR pysam-developers#568](samtools/htslib#568)
## Scope

Running tests with nose would use the `tests` directory as CWD, so test file paths were all relative to the `tests` directory.  In contrast, pytest runs from the `pysam` base directory as CWD, so all test file file paths must be updated.  Prior changes were sufficient to ensure that existing test files were found in the correct locations, but many temporary files were created in CWD, rather than the `tests` directory.  

## Limitations

- There are still a few temp files not created in `tests`, including the 1kGenomes index file.
- In future, all temporary files should probably be written to a dedicated directory that can be removed after the test suite is run.
Doing so exposed a bifurcation in terminology.  AlignmentFile uses
`reference` and `end`, where as VariantFile is careful to always use
`contig` and `stop`.  Although there is more code in the wild using
AlignmentFile, I *strongly* prefer the latter nomenclature and updated all
of the code to use that.  However, extra keyword arguments were added to
enable backward compatible use of `reference` and `end`.  I understand this
change may be a bit controversial and I am happy to discuss before merging
this PR.
…r-parse-region

## Scope

* Update glossary entry for CIGAR
* Correct error return in pysam_bam_update
* Remove incorrect use of inline declarations in external Cython definitions
* Refactor region_parsing and related code to HTSFile base class.

    Doing so exposed a bifurcation in terminology.  `AlignmentFile` uses
    `reference` and `end`, where as `VariantFile` is careful to always use
    `contig` and `stop`.  Although there is more code in the wild using
    `AlignmentFile`, I *strongly* prefer the latter nomenclature and updated all
    of the code to use that.  However, extra keyword arguments were added to
    enable backward compatible use of `reference` and `end`.
wckdouglas and others added 28 commits March 1, 2018 01:38
Adding a warning to count_coverage when an alignment has an empty QUAL field
* Allow setting number of threads for htslib

Internally this uses hts_set_threads. There is only a minor
speedup for reading files, but writing files can benefit well
from setting n_threads to more than 1:

```
In [1]: def benchmark_write(threads=1):
   ...:     source = pysam.AlignmentFile('/Users/mvandenb/src/readtagger/tests/h4.bam', threads=threads)
   ...:     out = pysam.AlignmentFile('out.bam', mode='wb', template=source, threads=threads)
   ...:     for _ in range(1000000):
   ...:         out.write(next(source))
   ...:

In [2]: %timeit benchmark_write(threads=1)
1 loop, best of 3: 9.21 s per loop

In [3]: %timeit benchmark_write(threads=2)
1 loop, best of 3: 4.68 s per loop

In [4]: %timeit benchmark_write(threads=3)
1 loop, best of 3: 3.17 s per loop

In [5]: %timeit benchmark_write(threads=4)
1 loop, best of 3: 2.66 s per loop
```
This had been discussed in pysam-developers#579 (comment).

* Avoid combining n_threads > 1 w/ ignore_truncation

Apparently htslib won't raise a separate error if
hitting a truncated read. Also subtracts 1 from `n_threads`,
since `hts_set_threads` will add additional threads.
This should be more intuitive to users.
Also moves `hts_set_threads` as the last action when cerating
new AlignmentFile instances.

* Rename `n_thread` to `thread`

* Move hts_set_threads to HTSFile class

* Implement threads for pysam.VariantFile

* Implement threads for pysam.TabixFile
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.