A Go application that converts a number to a floating-point format and prints details such as the hex and bit representations of the converted number and the conversion error. Supported formats:
- IEEE-754 Float32
- BFloat16
- Float16

    float-conv --num=<number> [--format=<format>] [--round-mode=<rounding mode>] [--overflow-mode=<overflow mode>] [--underflow-mode=<underflow-mode>]

- The `--num` option provides the input number. The number can be given in decimal, scientific, or hexfloat notation.
- The `--format` option specifies which floating-point format to convert to. Valid values are `float32` [Default], `bfloat16`, `float16`.
- The `--round-mode` option specifies which rounding mode to use if the number cannot be represented exactly in the desired format. Supported options are
  - `rtz`: Round Towards Zero
  - `rtposinf`: Round Towards Positive Infinity
  - `rtneginf`: Round Towards Negative Infinity
  - `rthalfzero`: Round to the closest number, break ties by rounding towards zero
  - `rthalfposinf`: Round to the closest number, break ties by rounding towards positive infinity
  - `rthalfneginf`: Round to the closest number, break ties by rounding towards negative infinity
  - `rne`: Round to the closest number, break ties by rounding to the even neighbor (LSB is 0) [Default]
  - `rno`: Round to the closest number, break ties by rounding to the odd neighbor (LSB is 1)
- The `--overflow-mode` option specifies the response if the magnitude of the number is larger than the largest magnitude representable in the target format. Supported options are
  - `satinf`: Saturate the number to infinity with the same sign as the input
  - `satmax`: Saturate the number to the maximum possible in the format, with the same sign as the input [Default]
- The `--underflow-mode` option specifies the response if the magnitude of the number is smaller than the smallest magnitude representable in the target format. Supported options are
  - `flushzero`: Flush the number to 0. If the target format supports signed zeros, the sign is the same as that of the input
  - `satmin`: Saturate the number to the minimum representable, with the same sign as the input [Default]
- The `--precision` flag can be used to raise the precision used when parsing the input. The default is 53.

    $ float-conv --num=0.125 --format=float32
    Float32
    |Sign|Exponent|               Mantissa|
    |   0|00111110|00000000000000000000000|
    Decimal: 1.25e-01
    Hexfloat: 0x1p-03
    Conversion Error: 0e+00 (Exact)
    Binary: 0b00111110000000000000000000000000
    Hexadecimal: 0x3e000000

    $ float-conv --num=1e-256 --format=bfloat16 --underflow-mode=flushzero
    BFloat16
    |Sign|Exponent|Mantissa|
    |   0|00000000| 0000000|
    Decimal: 0e+00
    Hexfloat: 0x0p+00
    Conversion Error: -1e-256 (Below)
    Binary: 0b0000000000000000
    Hexadecimal: 0x0000
    UNDERFLOW

- Add support for OCP Microscaling 8-bit Formats