
Cyan4973 commented Jan 29, 2026

Summary:
Introduces a new core bitSplit transform that splits a numeric stream into multiple output streams according to a configurable set of bit widths.

As a core transform, bitSplit can be instantiated to produce multiple nodes that all share a common, compatible decoder.

On the encode path, the transform extracts consecutive bit ranges from each element (from LSB to MSB) into separate streams. On the decode path, it reconstructs the original values by reassembling these ranges. Partial coverage is supported (sum(widths) < element_width); in that case, the remaining high bits must be zero.
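The following is a minimal generic sketch in C that makes the encode/decode description concrete. It is not the shipped implementation. The function names `bitsplit_encode`, `bitsplit_decode` and `low_mask` are hypothetical. For simplicity, every input element is widened to `uint64_t` and every output stream is also stored as `uint64_t`, whereas the real transform presumably emits narrower typed streams.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch, not the OpenZL code: elements and stream values are
 * all held in uint64_t for simplicity. */

/* Mask with the low `width` bits set; width must be in 1..64. */
static uint64_t low_mask(unsigned width)
{
    return width >= 64 ? UINT64_MAX : ((UINT64_C(1) << width) - 1);
}

/* Encode: write bit range r of element i into dst[r][i], LSB range first.
 * Returns 0 on success, or -1 if the widths are invalid or an element has
 * non-zero bits above sum(widths) (partial coverage requires them to be 0). */
static int bitsplit_encode(const uint64_t* src, size_t n,
                           const unsigned* widths, size_t nb_ranges,
                           uint64_t* const* dst)
{
    unsigned total = 0;
    for (size_t r = 0; r < nb_ranges; r++) {
        if (widths[r] == 0 || widths[r] > 64) return -1;
        total += widths[r];
    }
    if (total > 64) return -1;

    for (size_t i = 0; i < n; i++) {
        uint64_t v = src[i];
        /* uncovered high bits must be zero */
        if (total < 64 && (v >> total) != 0) return -1;
        unsigned shift = 0; /* always < 64 here because every width is >= 1 */
        for (size_t r = 0; r < nb_ranges; r++) {
            dst[r][i] = (v >> shift) & low_mask(widths[r]);
            shift += widths[r];
        }
    }
    return 0;
}

/* Decode: OR each range back into place, in the same LSB-to-MSB order. */
static void bitsplit_decode(uint64_t* const* src, size_t n,
                            const unsigned* widths, size_t nb_ranges,
                            uint64_t* dst)
{
    for (size_t i = 0; i < n; i++) {
        uint64_t v = 0;
        unsigned shift = 0;
        for (size_t r = 0; r < nb_ranges; r++) {
            v |= src[r][i] << shift;
            shift += widths[r];
        }
        dst[i] = v;
    }
}
```

For example, with widths {4, 8, 4} the 16-bit value 0xABCD splits into 0xD, 0xBC and 0xA, and decoding reassembles 0xABCD. With widths {4, 4} (partial coverage of a 16-bit element), 0x00FF is accepted, while 0x01FF is rejected because bit 8 is set.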

Performance depends on the number of ranges and the total covered bit width. On my devserver, both encoding and decoding run at roughly 1 GB/s.

bitSplit also enables specialized nodes, e.g. floating-point component separation (sign, exponent, mantissa) across formats and sizes (FP32, FP16, BF16, etc.). Internally, it supports specialized instances to improve performance for common scenarios.
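The array below illustrates how floating-point component separation maps onto bitSplit widths, listed in the LSB-to-MSB order described above (mantissa, exponent, sign). It is a hypothetical sketch: `fp_layout`, `FP_LAYOUTS` and `fp32_components` are illustrative names, not part of the patch, and `fp32_components` assumes `float` is IEEE 754 binary32.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical width tables, LSB to MSB: mantissa, exponent, sign. */
typedef struct {
    const char* name;
    unsigned element_bits;
    unsigned widths[3];
} fp_layout;

static const fp_layout FP_LAYOUTS[] = {
    { "FP64", 64, { 52, 11, 1 } },
    { "FP32", 32, { 23, 8, 1 } },
    { "FP16", 16, { 10, 5, 1 } },
    { "BF16", 16, { 7, 8, 1 } },
};
#define NB_FP_LAYOUTS (sizeof(FP_LAYOUTS) / sizeof(FP_LAYOUTS[0]))

/* Splits a float's bit pattern using the FP32 widths {23, 8, 1}.
 * Assumes float is IEEE 754 binary32. */
static void fp32_components(float f, uint32_t out[3])
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits); /* read the bit pattern without aliasing issues */
    out[0] = bits & 0x7FFFFFu;      /* 23-bit mantissa */
    out[1] = (bits >> 23) & 0xFFu;  /* 8-bit exponent */
    out[2] = bits >> 31;            /* 1-bit sign */
}
```

Every layout here covers its element fully (the widths sum to `element_bits`), so no high-bit constraint applies.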

This patch ships a BF16-specialized instance. On my devserver it reaches ~30 GB/s on decode and ~40 GB/s on encode using plain C; such performance requires vectorization, and the compiler auto-vectorizes the hot loop with AVX2.
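The sketch below shows the general shape of a BF16-specialized loop. It is a hypothetical illustration, not the shipped code: `bf16_split` and `bf16_join` are made-up names, and storing each component in a `uint8_t` stream is an assumption. The widths are fixed at {7, 8, 1}, the body has no data-dependent branches, and the pointers are `restrict`. These are the conditions under which GCC or Clang commonly auto-vectorize (for example with `-O3 -mavx2`), although whether a given compiler actually does so depends on the compiler and its flags.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical BF16-specialized split: widths fixed at {7, 8, 1}, LSB first. */
static void bf16_split(const uint16_t* restrict src, size_t n,
                       uint8_t* restrict mantissa,
                       uint8_t* restrict exponent,
                       uint8_t* restrict sign)
{
    for (size_t i = 0; i < n; i++) {
        uint16_t v = src[i];
        mantissa[i] = (uint8_t)(v & 0x7F);        /* bits 0..6 */
        exponent[i] = (uint8_t)((v >> 7) & 0xFF); /* bits 7..14 */
        sign[i]     = (uint8_t)(v >> 15);         /* bit 15 */
    }
}

/* Matching decoder: reassemble the three components into BF16 values. */
static void bf16_join(const uint8_t* restrict mantissa,
                      const uint8_t* restrict exponent,
                      const uint8_t* restrict sign,
                      size_t n, uint16_t* restrict dst)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = (uint16_t)(mantissa[i] | (exponent[i] << 7) | (sign[i] << 15));
    }
}
```

For instance, the BF16 encoding of 1.0 (0x3F80) splits into mantissa 0, exponent 0x7F and sign 0.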

The initial motivation is to replace several corner-case graphs where zigzag is effectively used as a “poor man’s bitshift” in combination with transpose. Recent sample analysis shows this pattern is common in compressor graphs produced by ACE. These multi-stage constructs can be replaced with a single bitSplit node, improving throughput, slightly reducing output size, and providing a clearer decision signal than the current indirect emulation.

Test Plan:

make gtests && ./gtests --gtest_filter="*BitSplit*"

All 28 tests pass, covering:

  • Full coverage splits (8/16/32/64-bit inputs)
  • Partial coverage cases
  • Single stream passthrough
  • Edge cases (empty input, single element, all zeros/ones)
  • Asymmetric and many-small-width splits

Cyan4973 self-assigned this Jan 29, 2026
meta-cla bot added the cla signed label Jan 29, 2026
meta-codesync bot commented Jan 29, 2026

@Cyan4973 has imported this pull request. If you are a Meta employee, you can view this in D91787385.

Summary:
Introduces a new core bitSplit transform that splits numeric input elements into multiple output streams based on configurable bit widths.

This is a core transform that will be used to generate multiple nodes, all compatible with the same decoder transform.

The encoder extracts consecutive bit ranges (LSB to MSB order) from each input element into separate typed streams. The decoder reconstructs the original values by combining the bit ranges. Partial coverage, where sum(widths) < element_width, is supported; the top bits are then required to be zero.

Specific nodes that can be created from bitSplit include, for example, floating-point component separation (sign, exponent, mantissa) for all sizes and formats (FP32, FP16, BF16, etc.).

But the main use case I want this transform for is replacing a bunch of weird corner cases where the `zigzag` transform is employed as a sort of "poor man's bitshift" to produce a certain effect _in combination with_ `transpose`. According to an analysis I recently did on a bunch of samples, this is very common in compressor graphs created by ACE. All of this could be replaced by a single invocation of a `bitSplit` derivative node, which is not only faster and a little more compact but, more importantly, expresses a clear decision, as opposed to a muddy one emulated through multiple abused stages.


Test Plan:
make gtests && ./gtests --gtest_filter="*BitSplit*"

All 28 tests pass, covering:
- Full coverage splits (8/16/32/64-bit inputs)
- Partial coverage cases
- Single stream passthrough
- Edge cases (empty input, single element, all zeros/ones)
- Asymmetric and many-small-width splits