Description
MAX_FLOAT, the largest finite float32 value, is exactly representable in both fp32 and fp64. Its floathex value is 0x1.fffffep+127, and its raw bit patterns are 0x7f7fffff as fp32 and 0x47efffffe0000000 as fp64. BiNums correctly converts the floathex form to base 10 as 3.40282346638528859811704e+38.
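The bit patterns and decimal expansion quoted above can be checked independently of BiNums. This is a minimal sketch using only Python's standard library; the helper names `bits32` and `bits64` are illustrative and are not part of BiNums.

```python
import struct

def bits32(x: float) -> int:
    """Return the IEEE 754 binary32 bit pattern of x as an integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def bits64(x: float) -> int:
    """Return the IEEE 754 binary64 bit pattern of x as an integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

# Parse the floathex literal; Python floats are binary64.
max_float = float.fromhex("0x1.fffffep+127")

print(f"fp32 bits: 0x{bits32(max_float):08x}")
print(f"fp64 bits: 0x{bits64(max_float):016x}")
print(f"decimal:   {max_float:.23e}")
```

Because the value fits in 24 significant bits, the fp32 and fp64 encodings describe exactly the same number, so both should print the same decimal.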
With the input command binums.exe 0x1.fffffep127, binums outputs the correct raw-bits representations for fp64, fp32, fp16, etc. However, the "As number" representation appears to be correct only for fp64. Having floathex input imply fp64 seems fine, but it makes it difficult to see a floathex value represented as float32. (It takes two steps: first floathex --> hex_fp32, then hex_fp32 --> fp32.)
Should binums use each type's own raw bits as the input for its "As number" output?
(bfloat16 also saturates differently from float16 in the example below. It may be best to keep the subnormal discussion separate from the case where the value is representable in the different types.)
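The bfloat16 difference is consistent with truncation versus round-to-nearest-even when narrowing from float32, although whether BiNums actually truncates is not confirmed here. The sketch below assumes bfloat16 is the upper 16 bits of a float32 pattern; the function names are hypothetical, and NaN inputs are deliberately not handled.

```python
import struct

def f32_bits(x: float) -> int:
    """Return the binary32 bit pattern of x as an integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def bf16_truncate(x: float) -> int:
    """Narrow to bfloat16 by dropping the low 16 bits (round toward zero)."""
    return f32_bits(x) >> 16

def bf16_round_nearest_even(x: float) -> int:
    """Narrow to bfloat16 with round-to-nearest, ties-to-even (no NaN handling)."""
    b = f32_bits(x)
    # Add just under half a unit in the last kept place, plus the lowest
    # kept bit, so exact ties round to the even neighbour.
    b += 0x7FFF + ((b >> 16) & 1)
    return (b >> 16) & 0xFFFF

max_float = float.fromhex("0x1.fffffep+127")
print(f"truncate:      0x{bf16_truncate(max_float):04X}")
print(f"nearest-even:  0x{bf16_round_nearest_even(max_float):04X}")
```

With truncation the result stays at the largest finite bfloat16, while rounding to nearest carries into the exponent and yields the infinity pattern, which matches how float16 saturates.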
Current output (trimmed):
> Release\bin\binums.exe 0x1.fffffep127
Representations:
type float64
decimal 3.40282346638528859811704e+38
floathex 0x1.fffffe0000000p+127
raw hex 0x47EFFFFFE0000000
raw oct 0o
raw bin 0b0100011111101111111111111111111111100000000000000000000000000000
fields bin frac:0b1111111111111111111111100000000000000000000000000000 exp:0b10001111110 sign:0b0
As raw bits:
float16 0x7C00
bfloat16 0x7F7F
float32 0x7F7FFFFF
-> float64 0x47EFFFFFE0000000
As number:
float16 0
bfloat16 0
float32 -36893488147419103232
-> float64 3.40282346638528859811704e+38
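One possible explanation for the narrower "As number" values, offered as a guess rather than a confirmed diagnosis of the BiNums source: they match what results from reinterpreting the low-order bits of the float64 pattern in the narrower format. The sketch below checks that arithmetic with Python's standard library only.

```python
import struct

F64_BITS = 0x47EFFFFFE0000000  # float64 pattern from the output above

low32 = F64_BITS & 0xFFFFFFFF  # low 32 bits: 0xE0000000
low16 = F64_BITS & 0xFFFF      # low 16 bits: 0x0000

# Reinterpret the low bits as float32 and float16.
as_f32 = struct.unpack(">f", struct.pack(">I", low32))[0]
as_f16 = struct.unpack(">e", struct.pack(">H", low16))[0]
# Assumption: bfloat16 is the upper half of a float32, so widen by shifting.
as_bf16 = struct.unpack(">f", struct.pack(">I", low16 << 16))[0]

print(f"float16  {as_f16}")
print(f"bfloat16 {as_bf16}")
print(f"float32  {as_f32:.0f}")
```

The pattern 0xE0000000 decodes as sign 1, biased exponent 192, and zero fraction, which is -2^65. That equals the float32 value printed above, and the zero low 16 bits would account for the float16 and bfloat16 zeros.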
Proposed output (trimmed):
> Release\bin\binums.exe 0x1.fffffep127
Representations:
type float64
decimal 3.40282346638528859811704e+38
floathex 0x1.fffffe0000000p+127
raw hex 0x47EFFFFFE0000000
raw oct 0o
raw bin 0b0100011111101111111111111111111111100000000000000000000000000000
fields bin frac:0b1111111111111111111111100000000000000000000000000000 exp:0b10001111110 sign:0b0
As raw bits:
float16 0x7C00
bfloat16 0x7F80
float32 0x7F7FFFFF
-> float64 0x47EFFFFFE0000000
As number:
float16 inf
bfloat16 inf
float32 3.40282346638528859811704e+38
-> float64 3.40282346638528859811704e+38
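The proposed "As number" values are what one gets by decoding each raw-bits pattern in its own format. This is a sketch of that idea in Python, not the BiNums implementation; the `decode_*` helpers are hypothetical names, and bfloat16 is assumed to be the upper 16 bits of a float32.

```python
import math
import struct

def decode_f16(bits: int) -> float:
    return struct.unpack(">e", struct.pack(">H", bits))[0]

def decode_bf16(bits: int) -> float:
    # Widen to float32 by appending 16 zero bits, then decode.
    return struct.unpack(">f", struct.pack(">I", bits << 16))[0]

def decode_f32(bits: int) -> float:
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def decode_f64(bits: int) -> float:
    return struct.unpack(">d", struct.pack(">Q", bits))[0]

# Raw-bits patterns from the proposed output.
print("float16 ", decode_f16(0x7C00))
print("bfloat16", decode_bf16(0x7F80))
print("float32 ", f"{decode_f32(0x7F7FFFFF):.23e}")
print("float64 ", f"{decode_f64(0x47EFFFFFE0000000):.23e}")
```

Decoding the current bfloat16 pattern 0x7F7F the same way gives the finite value 0x1.fep+127 rather than infinity, so the "As number" line for bfloat16 depends on which rounding the raw-bits conversion uses.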