Many of the operations I'm encountering add a small-bit-width number to a much larger-bit-width number (e.g. accumulating a new 8-bit value into a 20-bit value each clock cycle).
In the 8-bit + 20-bit example, bits 9 through 20 of the "8-bit" operand are all zeroes once it is zero-extended to 20 bits, so it seems we might be able to further optimize trees for this type of scenario.
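To make the idea concrete, here is a rough bit-level sketch in Python (not a hardware description). It assumes the narrow operand is unsigned, so its upper bits really are zero, and it uses a plain ripple-carry structure purely for illustration. The names `full_adder`, `half_adder`, and `accumulate` are made up for this example. The point it shows is that only the low 8 bit positions need a full adder, while the upper 12 positions only have to propagate a carry, which a half adder (an incrementer stage) can do.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def half_adder(a, cin):
    """One-bit half adder: enough when the other operand bit is known to be 0."""
    return a ^ cin, a & cin


def accumulate(acc, x, acc_width=20, in_width=8):
    """Add an unsigned in_width-bit x into acc, wrapping at acc_width bits.

    Assumption: x is unsigned, so bits in_width..acc_width-1 of x are zero.
    """
    assert 0 <= x < (1 << in_width)
    assert 0 <= acc < (1 << acc_width)
    carry = 0
    result = 0
    for i in range(acc_width):
        a_bit = (acc >> i) & 1
        if i < in_width:
            # Low positions: both operands contribute a bit.
            s, carry = full_adder(a_bit, (x >> i) & 1, carry)
        else:
            # Upper positions: x contributes a constant 0, so only the
            # carry from below has to be absorbed.
            s, carry = half_adder(a_bit, carry)
        result |= s << i
    return result
```

With the default widths this uses 8 full adders and 12 half adders instead of 20 full adders. If the narrow operand were signed, the upper bits would be sign-extension bits rather than zeroes, and this simplification would not apply as written.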
I'm hoping to start a discussion: what would you suggest as a good starting point for this exploration?