Refactor bit array api, rename to bitset. #13

lkirk · 2025-09-12T20:58:48Z

As discussed in tskit-dev#2834, this patch renames tsk_bit_array_t to tsk_bitset_t. Philosophically, we treat these as sets and not arrays, performing intersections, unions, and membership tests. Therefore, it makes sense to alter the API to use set theoretic vocabulary, describing the intent more precisely. Fundamentally, the bitset structure is a list of N independent bitsets. Each operation on two sets must select the row on which to operate. The tsk_bitset_t originally tracked len only, which was N, the number of sets. For convenience, we also track the row_len, which is the number of unsigned integers per row. If we multiply row_len by TSK_BITSET_BITS, we get the number of bits that each set (or row) in the list of bitsets will hold.

We had also discussed each set theoretic operation accepting a row index instead of a pointer to a row within the bitset object. Now, each operation accepts a row index for each bitset structure passed into the function. This simplifies the consumption of this API considerably, removing the need of storing and tracking many intermediate temporary array pointers. We also see some performance improvements from this cleanup. For DRY purposes, I've created a private macro, BITSET_DATA_ROW, which abstracts away the pointer arithmetic for selecting a row out of the list of sets. Because of these changes, tsk_bit_array_get_row is no longer needed and has been removed from the API.

This change does not change the size of the "chunk", which is the unsigned integer storing bits. It remains a 32 bit unsigned integer, which is most performant for bit counting (popcount). I've streamlined the macros used to determine which integer in the row will be used to store a particular bit. Everything now revolves around the TSK_BITSET_BITS macro, which is simply 32 and bitshift operations have been converted to unsigned integer division.

Testing has been refactored to reflect these changes, removing tests that operate on a specific rows. Tests in c and python are passing and valgrind shows no errors.

As discussed in tskit-dev#2834, this patch renames tsk_bit_array_t to tsk_bitset_t. Philosophically, we treat these as sets and not arrays, performing intersections, unions, and membership tests. Therefore, it makes sense to alter the API to use set theoretic vocabulary, describing the intent more precisely. Fundamentally, the bitset structure is a list of N independent bitsets. Each operation on two sets must select the row on which to operate. The tsk_bitset_t originally tracked `len` only, which was N, the number of sets. For convenience, we also track the `row_len`, which is the number of unsigned integers per row. If we multiply `row_len` by `TSK_BITSET_BITS`, we get the number of bits that each set (or row) in the list of bitsets will hold. We had also discussed each set theoretic operation accepting a row index instead of a pointer to a row within the bitset object. Now, each operation accepts a row index for each bitset structure passed into the function. This simplifies the consumption of this API considerably, removing the need of storing and tracking many intermediate temporary array pointers. We also see some performance improvements from this cleanup. For DRY purposes, I've created a private macro, `BITSET_DATA_ROW`, which abstracts away the pointer arithmetic for selecting a row out of the list of sets. Because of these changes, `tsk_bit_array_get_row` is no longer needed and has been removed from the API. This change does not change the size of the "chunk", which is the unsigned integer storing bits. It remains a 32 bit unsigned integer, which is most performant for bit counting (popcount). I've streamlined the macros used to determine which integer in the row will be used to store a particular bit. Everything now revolves around the TSK_BITSET_BITS macro, which is simply 32 and bitshift operations have been converted to unsigned integer division. Testing has been refactored to reflect these changes, removing tests that operate on a specific rows. Tests in c and python are passing and valgrind shows no errors.

lkirk force-pushed the bit-array-refactor branch from 94a72a9 to 8971aa9 Compare September 14, 2025 21:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Refactor bit array api, rename to bitset. #13

Refactor bit array api, rename to bitset. #13

Uh oh!

lkirk commented Sep 12, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Refactor bit array api, rename to bitset. #13

Are you sure you want to change the base?

Refactor bit array api, rename to bitset. #13

Uh oh!

Conversation

lkirk commented Sep 12, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants