Skip to content

Conversation

@lkirk
Copy link
Owner

@lkirk lkirk commented Sep 12, 2025

As discussed in tskit-dev#2834, this patch renames tsk_bit_array_t to tsk_bitset_t. Philosophically, we treat these as sets and not arrays, performing intersections, unions, and membership tests. Therefore, it makes sense to alter the API to use set theoretic vocabulary, describing the intent more precisely. Fundamentally, the bitset structure is a list of N independent bitsets. Each operation on two sets must select the row on which to operate. The tsk_bitset_t originally tracked len only, which was N, the number of sets. For convenience, we also track the row_len, which is the number of unsigned integers per row. If we multiply row_len by TSK_BITSET_BITS, we get the number of bits that each set (or row) in the list of bitsets will hold.

We had also discussed each set theoretic operation accepting a row index instead of a pointer to a row within the bitset object. Now, each operation accepts a row index for each bitset structure passed into the function. This simplifies the consumption of this API considerably, removing the need of storing and tracking many intermediate temporary array pointers. We also see some performance improvements from this cleanup. For DRY purposes, I've created a private macro, BITSET_DATA_ROW, which abstracts away the pointer arithmetic for selecting a row out of the list of sets. Because of these changes, tsk_bit_array_get_row is no longer needed and has been removed from the API.

This change does not change the size of the "chunk", which is the unsigned integer storing bits. It remains a 32 bit unsigned integer, which is most performant for bit counting (popcount). I've streamlined the macros used to determine which integer in the row will be used to store a particular bit. Everything now revolves around the TSK_BITSET_BITS macro, which is simply 32 and bitshift operations have been converted to unsigned integer division.

Testing has been refactored to reflect these changes, removing tests that operate on a specific rows. Tests in c and python are passing and valgrind shows no errors.

As discussed in tskit-dev#2834, this patch renames tsk_bit_array_t to
tsk_bitset_t. Philosophically, we treat these as sets and not arrays,
performing intersections, unions, and membership tests. Therefore, it
makes sense to alter the API to use set theoretic vocabulary, describing
the intent more precisely. Fundamentally, the bitset structure is a list
of N independent bitsets. Each operation on two sets must select the row
on which to operate. The tsk_bitset_t originally tracked `len` only,
which was N, the number of sets. For convenience, we also track the
`row_len`, which is the number of unsigned integers per row. If we
multiply `row_len` by `TSK_BITSET_BITS`, we get the number of bits that
each set (or row) in the list of bitsets will hold.

We had also discussed each set theoretic operation accepting a row index
instead of a pointer to a row within the bitset object. Now, each
operation accepts a row index for each bitset structure passed into the
function. This simplifies the consumption of this API considerably,
removing the need of storing and tracking many intermediate temporary
array pointers. We also see some performance improvements from this
cleanup. For DRY purposes, I've created a private macro,
`BITSET_DATA_ROW`, which abstracts away the pointer arithmetic for
selecting a row out of the list of sets. Because of these changes,
`tsk_bit_array_get_row` is no longer needed and has been removed from
the API.

This change does not change the size of the "chunk", which is the
unsigned integer storing bits. It remains a 32 bit unsigned integer,
which is most performant for bit counting (popcount). I've streamlined
the macros used to determine which integer in the row will be used to
store a particular bit. Everything now revolves around the
TSK_BITSET_BITS macro, which is simply 32 and bitshift operations have
been converted to unsigned integer division.

Testing has been refactored to reflect these changes, removing tests
that operate on a specific rows. Tests in c and python are passing and
valgrind shows no errors.
@lkirk lkirk force-pushed the bit-array-refactor branch from 94a72a9 to 8971aa9 Compare September 14, 2025 21:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants