@daviesrob
Collaborator

  • Add a way to store private data in the BGZF struct without changing the API or ABI
  • Add reference counting to the bcf_hdr_t struct and add it as BGZF private data
  • Get bcf_read() and bcf_readrec() to use the stored header to call updatephasing()
  • Speed up updatephasing()

The motivation for this is to allow a pointer to a bcf_hdr_t
structure to be passed to bcf_readrec(), which currently does
not get one.  It does always get a pointer to the BGZF handle,
so a header struct could be passed in via that if it can be
stored somewhere.

To enable this while not changing the bgzf API or ABI, extra
fields are added to the opaque bgzf_cache_t field.  The BGZF_CACHE
macro that could be used to disable the cache feature is
removed, as it was always turned on anyway.  The cache struct now
has to be created for files open for write, although the cache
part is not used.  The hash type used by the cache is renamed from
"cache" to "bgzf_cache" to improve its name-spacing.

The interfaces to add, get, and remove private data are put in
a new bgzf_internal.h header.  The bgzf_cache_t struct definition
is also moved there so that the get function can be inlined for
faster access to the private data field.
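
The private-data mechanism described above might look roughly like the following. This is a minimal sketch, not the HTSlib code: every name beginning with `demo_` is hypothetical, the structs are stand-ins for BGZF and bgzf_cache_t, and the cleanup callback is an assumption about how ownership could be handled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the opaque cache struct, with extra slots. */
typedef struct demo_cache_t {
    void *hash;                            /* block cache; unused when writing */
    void *private_data;                    /* caller-owned pointer */
    void (*private_data_cleanup)(void *);  /* optional destructor (assumed) */
} demo_cache_t;

/* Hypothetical stand-in for the public handle; its layout stays unchanged. */
typedef struct demo_bgzf {
    demo_cache_t *cache;
} demo_bgzf;

/* Inlined getter: possible only because the struct definition is visible. */
static inline void *demo_get_private_data(const demo_bgzf *fp) {
    return (fp && fp->cache) ? fp->cache->private_data : NULL;
}

static void demo_set_private_data(demo_bgzf *fp, void *data,
                                  void (*cleanup)(void *)) {
    assert(fp && fp->cache);  /* cache must exist, even for write handles */
    fp->cache->private_data = data;
    fp->cache->private_data_cleanup = cleanup;
}

static void demo_clear_private_data(demo_bgzf *fp) {
    assert(fp && fp->cache);
    if (fp->cache->private_data && fp->cache->private_data_cleanup)
        fp->cache->private_data_cleanup(fp->cache->private_data);
    fp->cache->private_data = NULL;
    fp->cache->private_data_cleanup = NULL;
}
```

Because the new fields hang off a struct that callers only ever see as a pointer, the size and layout of the public handle do not change.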

The bgzf_cache_t definition is rewritten slightly so that it's
not necessary to invoke KHASH_MAP_INIT_INT64() before it in the
header file, as doing that would require struct cache_t to be
moved from bgzf.c to the new header as well.  Instead, typedef
kh_bgzf_cache_t is used in place of khash(bgzf_cache), and
unsigned int instead of khint_t.

For BCF files, the header pointer hasn't always been passed
into bcf_read(), especially when using iterators.  As having
it available would be useful for VCF 4.4+ support, this works
around its absence by attaching a pointer to the header in
BGZF private data, which was previously unused for vcf/bcf.
It also adds reference counting to the header so that it can
be cleaned up safely irrespective of whether hts_close() or
bcf_hdr_destroy() was called first.  To avoid ABI breakage,
the reference count is stored in the bcf_hdr_aux_t struct.
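
To illustrate why reference counting makes the close order irrelevant, here is a minimal sketch. The `demo_` names are hypothetical, the initial count of 1 is an assumption, and the counter is not thread-safe; the real header and aux structs hold far more state.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical side struct holding the count, so the main struct's
   layout is untouched. */
typedef struct demo_hdr_aux { int ref_count; } demo_hdr_aux;
typedef struct demo_hdr { demo_hdr_aux *aux; } demo_hdr;

static int demo_live_headers = 0;  /* bookkeeping for the demonstration only */

static demo_hdr *demo_hdr_init(void) {
    demo_hdr *h = calloc(1, sizeof(*h));
    if (!h) return NULL;
    h->aux = calloc(1, sizeof(*h->aux));
    if (!h->aux) { free(h); return NULL; }
    h->aux->ref_count = 1;         /* the creator's reference (assumed) */
    demo_live_headers++;
    return h;
}

/* Called when a file handle stores the header as private data. */
static void demo_hdr_incr_ref(demo_hdr *h) {
    assert(h && h->aux);
    h->aux->ref_count++;
}

/* Drop one reference; free only when the last owner lets go. */
static void demo_hdr_release(demo_hdr *h) {
    if (!h) return;
    assert(h->aux->ref_count > 0);
    if (--h->aux->ref_count > 0) return;
    free(h->aux);
    free(h);
    demo_live_headers--;
}
```

Whichever of the two owners releases last performs the actual free, so the caller can destroy the header before or after closing the file.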

BCF saved by versions of HTSlib before 1.22 will always store the
first phasing bit as 0.  For consistency with the VCF reader,
update this bit when reading BCF so that it is set if all other
phasing bits are also set.
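
The rule can be sketched for a single sample as below. This is an illustration only: `demo_fix_first_phase` is a hypothetical name, values are assumed to be decoded BCF genotype integers of the form `(allele + 1) << 1 | phased`, missing and end-of-vector values are not handled, and treating a haploid call as phased is an assumption drawn from the "all other bits" wording.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Set the first value's phasing bit when every later value is phased. */
static void demo_fix_first_phase(int32_t *gt, size_t ploidy) {
    assert(gt != NULL);
    if (ploidy == 0) return;
    for (size_t i = 1; i < ploidy; i++)
        if (!(gt[i] & 1)) return;  /* an unphased later allele: leave as is */
    gt[0] |= 1;
}
```

For example, the pair `{2, 5}` (alleles 0 and 1, second one phased) becomes `{3, 5}`, while `{2, 4}` is left unchanged.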

Phasing should now be fixed up in bcf_read()/vcf_read(), so
there's no need to try again in bcf_get_format_values().

By noting that we're only interested in the least-significant
bit of each GT value, it's possible to reduce the number of
branches in updatephasing() by doing bit manipulations on the
first byte of each stored value.  The common haploid and diploid
cases are also specialised so the inner loop on ploidy can
be avoided for those cases.
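
A sketch of that idea, under stated assumptions, is shown below. `demo_update_phasing` is a hypothetical function, not the HTSlib one; it assumes values are stored little-endian with a width of 1, 2 or 4 bytes (so the phasing bit is bit 0 of the first byte), and it ignores the missing and end-of-vector sentinels that real data requires handling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* gt holds n_samples * ploidy values, each `width` bytes, little-endian. */
static void demo_update_phasing(uint8_t *gt, size_t n_samples,
                                size_t ploidy, size_t width) {
    assert(gt && (width == 1 || width == 2 || width == 4));
    if (ploidy == 0) return;
    size_t stride = ploidy * width;
    uint8_t *p = gt, *end = gt + n_samples * stride;

    switch (ploidy) {
    case 1:   /* haploid: no other bits to check (assumed to mean phased) */
        for (; p < end; p += stride) p[0] |= 1;
        break;
    case 2:   /* diploid: copy the second value's phasing bit */
        for (; p < end; p += stride) p[0] |= p[width] & 1;
        break;
    default:  /* general case: AND together the low bits of values 2..n */
        for (; p < end; p += stride) {
            uint8_t all = 1;
            for (size_t j = 1; j < ploidy; j++) all &= p[j * width];
            p[0] |= all;  /* all is 0 or 1 because it started at 1 */
        }
    }
}
```

The per-sample work contains no data-dependent branches; the only branch on ploidy is taken once, outside the sample loop.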
@vasudeva8 vasudeva8 merged commit 140319a into vasudeva8:phase44update1 Sep 4, 2025
9 of 10 checks passed
@daviesrob daviesrob deleted the pr1938e2 branch October 23, 2025 08:39