
Conversation

@pabvald
Collaborator

@pabvald pabvald commented Oct 6, 2024

@milankl
I have the basic implementation:

# minimum and maximum representable value of type T
Tmin, Tmax = Float64(typemin(T)), Float64(typemax(T))

# LinQuantization
Δ⁻¹ = (Tmax-Tmin)/(Amax-Amin) # (Tmax - Tmin) == 2^(8*sizeof(T)) - 1, but imo Tmax - Tmin is more informative and the values are used again afterwards anyway

Q[i] = round(T, clamp((A[i]-Amin)*Δ⁻¹ + Tmin, Tmin, Tmax)) 
...

# Base.Array
Δ = (Qmax-Qmin)/(Tmax-Tmin)
A[i] = Qmin + (Q[i] - Tmin)*Δ
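Putting the two pieces together, a self-contained round-trip sketch could look like this (the function names `quantize`/`dequantize` are hypothetical, not the package's API):

```julia
# minimal sketch of signed-integer linear quantization, following the formulas above
function quantize(::Type{T}, A::AbstractArray) where {T<:Integer}
    Amin, Amax = Float64(minimum(A)), Float64(maximum(A))
    Tmin, Tmax = Float64(typemin(T)), Float64(typemax(T))
    Δ⁻¹ = (Tmax - Tmin) / (Amax - Amin)               # inverse quant spacing
    Q = [round(T, clamp((a - Amin)*Δ⁻¹ + Tmin, Tmin, Tmax)) for a in A]
    return Q, Amin, Amax                              # keep the extrema for dequantization
end

function dequantize(Q::AbstractArray{T}, Qmin, Qmax) where {T<:Integer}
    Tmin, Tmax = Float64(typemin(T)), Float64(typemax(T))
    Δ = (Qmax - Qmin) / (Tmax - Tmin)                 # quant spacing
    return [Qmin + (Float64(q) - Tmin)*Δ for q in Q]
end
```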

And I have two questions:

  1. Could I remove the n::Integer argument from Base.Array ? In the end, the range of values can be calculated as typemax(eltype(Q)) - typemin(eltype(Q)) instead of 2^n - 1.
  2. Does it make sense to create aliases for LinQuantization and LinQuantArray, since the alias is equally long as the original call? e.g.
LinQuantInt8Array(A::AbstractArray{T,N},dim::Int,e::Option{Tuple}) where {T,N} = LinQuantArray(Int8,A,dim,e)

Also I am facing some issues. The back-and-forth conversion with signed integers is not equal for some values, with a difference of 1.0 or 2.0. For Float16 and Int16 some values of A2 end up being -∞ or . I don't know exactly what the problem is.

for T in [Float64, Float32, Float16]
    for s in [(100,), (10,20), (13,14,15), (23,17,12,5)]
        A = rand(T, s...)

        for U in [Int8, Int16, Int32]
            A2 = Array{T}(LinQuantization(U, A))
            B = Array{T}(LinQuantization(U, A2))

            if A2 != B
                println("T = ", T, " U = ", U)
            end
            for i in eachindex(A2)
                if A2[i] != B[i]
                    println(" A2[i] = ", A2[i], " B[i] = ", B[i])
                end
            end
        end
    end
end

# T = Float32 U = Int32
#  A2[i] = 2.0507566e7 B[i] = 2.0507564e7
# T = Float32 U = Int32
#  A2[i] = 8.266649e6 B[i] = 8.266648e6
# T = Float32 U = Int32
#  A2[i] = 6.41308e6 B[i] = 6.413079e6
#  A2[i] = 848596.0 B[i] = 848595.0
#  A2[i] = 3.2166338e7 B[i] = 3.2166336e7
#  A2[i] = 1.717558e6 B[i] = 1.717557e6
#  A2[i] = 1.5504138e7 B[i] = 1.5504137e7
#  A2[i] = 3.065833e6 B[i] = 3.065832e6
#  A2[i] = 6.550137e6 B[i] = 6.550136e6
#  A2[i] = 9.974494e6 B[i] = 9.974493e6
# T = Float32 U = Int32
#  A2[i] = 1.3689715e7 B[i] = 1.3689714e7
#  A2[i] = 7.192029e6 B[i] = 7.192028e6
#  A2[i] = 8.769405e6 B[i] = 8.769404e6
#  A2[i] = 6.903743e6 B[i] = 6.903742e6
#  A2[i] = 2.544131e6 B[i] = 2.54413e6
#  A2[i] = 1.0867795e7 B[i] = 1.0867794e7
# ...

@pabvald
Collaborator Author

pabvald commented Oct 6, 2024

TODOs:

  • Fix issues
  • Extend tests
  • Extend README

@pabvald pabvald marked this pull request as draft October 6, 2024 20:05
@milankl
Owner

milankl commented Oct 7, 2024

Could I remove the n::Integer argument from Base.Array ? In the end, the range of values can be calculated as typemax(eltype(Q)) - typemin(eltype(Q)) instead of 2^n - 1.

Yes, I agree with that. See comments.

Does it make sense to create aliases for LinQuantization and LinQuantArray, since the alias is equally long as the original call? e.g.

LinQuantInt8Array(A::AbstractArray{T,N},dim::Int,e::Option{Tuple}) where {T,N} = LinQuantArray(Int8,A,dim,e)

Yeah no, I'd prefer not to export 8 new function names; keeping `LinQuantArray(Int8, ...)` sounds better to me.

Also I am facing some issues. The back and forth conversion with signed integers is not equal for some values

That's expected given the lossy compression. The information loss happens in the round function; it should be round-to-nearest, ties-to-even, but I'm not sure we actually test that. Only for larger integer types does that error become smaller, but it also depends on the range of values. We could also formulate an idempotency test, because that loss should only happen on the first round trip.
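The idempotency test suggested here could be sketched like this (assuming `LinQuantization` round trips as in the snippet earlier in the thread):

```julia
using Test

A  = rand(Float32, 100)
A2 = Array{Float32}(LinQuantization(Int16, A))   # first round trip: lossy
A3 = Array{Float32}(LinQuantization(Int16, A2))  # second round trip: should be lossless
@test A2 == A3   # A ≈ A2 only up to Δ/2, but A2 == A3 exactly if quantization is idempotent
```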

with a difference of 1.0 or 2.0.

Linear quantization introduces an absolute error that is bounded by $\Delta/2$ with $\Delta$ being the spacing between quants that's also computed for the (de-)quantization.
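For concreteness, a back-of-the-envelope check of that bound:

```julia
# values in [0, 1] quantized to Int8: 256 quants, so the spacing is
Δ = (1.0 - 0.0) / (Float64(typemax(Int8)) - Float64(typemin(Int8)))  # = 1/255 ≈ 0.0039
# and any single value can be off by at most Δ/2 ≈ 0.002 after a round trip
```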

For Float16 and Int16 some values of A2 end up being -∞ or . I don't exactly know what the problem is.

Okay yeah, that shouldn't happen: if the inputs are finite, the quantization should create finite values too. But if you start in integer quantization space, you can create numbers that are perfectly representable as quants but aren't representable in Float16.
E.g. you can represent -32768:32767 with Int16 (the typemin-typemax range), but for a spacing of $\Delta = 2$ the largest number you can represent becomes 65534, which is just larger than floatmax(Float16) = 65504.

julia> Float16(typemax(Int16)*2)
Inf16

So finite floats should be quantized into finite (signed) integers, but vice versa can trigger overflow for floats!
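One possible guard against that float overflow (a sketch; `safe_dequant` is not part of the package) is to saturate at floatmax before converting:

```julia
# saturate instead of overflowing to ±Inf when converting back to a narrow float
safe_dequant(x, ::Type{T}) where {T<:AbstractFloat} = T(clamp(x, -floatmax(T), floatmax(T)))

safe_dequant(Float64(typemax(Int16)) * 2, Float16)  # 65534 clamps to floatmax(Float16) = 65504
```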

@milankl milankl added the enhancement New feature or request label Oct 7, 2024
@milankl milankl linked an issue Oct 7, 2024 that may be closed by this pull request
@pabvald
Collaborator Author

pabvald commented Oct 7, 2024

Okay yeah that shouldn't happen, if inputs are finite, the quantization should create finite values too. But if you start in integer-quantization-space you can create numbers that are quantized perfectly representable but aren't in Float16.

I have implemented your feedback but I still have to check the issues with the overflowing of Float16.

Edit: Everything seems to be working now 😃

@milankl
Owner

milankl commented Oct 9, 2024

Fails because the tuples aren't equal: the elements print the same but are of different float types, so they are `==`-unequal (not merely not `===`-identical) due to rounding.

   Expression: Base.extrema(A2) == ext
   Evaluated: (0.22f0, 0.75f0) == (0.22, 0.75)
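The failure boils down to exact `==` across float types; a short sketch of the issue and one possible tolerance-based fix:

```julia
0.22f0 == 0.22   # false: Float32(0.22) and Float64(0.22) round to different values

all(map(isapprox, (0.22f0, 0.75f0), (0.22, 0.75)))  # true: elementwise approximate comparison passes
```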

@pabvald pabvald marked this pull request as ready for review October 23, 2024 07:57
@pabvald
Collaborator Author

pabvald commented Oct 23, 2024

@milankl I have updated the README, although the benchmarking is missing. Do you happen to still have the code for benchmarking?

@milankl
Owner

milankl commented Oct 24, 2024

I don't have the benchmarking anymore, I believe I did something like

julia> A = rand(10_000_000);   # 80MB vector
julia> sizeof(A)/1000^2
80.0

julia> using BenchmarkTools

julia> @btime LinQuant16Array($A);
  37.425 ms (7 allocations: 20.27 MiB)

julia> 80/40e-3    # Float64 -> UInt16 at 2000MB/s
2000.0

and coming from single precision

julia> A = rand(Float32, 10_000_000);

julia> @btime LinQuant16Array($A);
  42.537 ms (7 allocations: 20.27 MiB)

julia> 40/40e-3    # Float32 -> UInt16 at 1000MB/s
1000.0

@pabvald
Collaborator Author

pabvald commented Oct 24, 2024

It has complete backwards compatibility. I would like to extend the README with a use case on using signed-integer quantization to quantize document embeddings for a RAG application, but I didn't have the time. When dealing with document/word embeddings it makes sense to keep negative values, and therefore to use signed-integer quantization.
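A minimal sketch of that use case (the shapes and the `dims` keyword semantics are assumptions on my part):

```julia
# hypothetical: Int8-quantize a matrix of document embeddings
E = randn(Float32, 384, 1000)          # 1000 embeddings of dimension 384, centred on 0
Q = LinQuantArray(Int8, E; dims=2)     # signed quants keep the negative components
E2 = Array{Float32}(Q)                 # dequantize before cosine-similarity search
```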

@pabvald
Collaborator Author

pabvald commented Nov 4, 2024

@milankl When do you think you will be able to review it?

@milankl
Owner

milankl commented Nov 6, 2024

This looks fantastic, thanks so much for all your work on this pull request. I'm on it, I'll do a review today/tomorrow. Found a few typos otherwise it looks like it's ready to go, but I'll review it more thoroughly!

@pabvald
Collaborator Author

pabvald commented Nov 15, 2024

@milankl I have refactored the code to use a keyword argument dims instead of dim. Check it out and let me know if there's something that needs fixing. Otherwise, looking forward to merging this!

P.S. I have modified the version to 1.0.0 because it now includes breaking changes

@pabvald
Collaborator Author

pabvald commented Nov 26, 2024

@milankl Could you please run the test pipeline and merge it if everything is green 🙏🏻 ?

@milankl
Owner

milankl commented Nov 26, 2024

Running now 🥳

@pabvald
Collaborator Author

pabvald commented Dec 2, 2024

@milankl There was a typo using the old notation without the keyword argument dims. Should work now.

@milankl
Owner

milankl commented Dec 3, 2024

Shoot yeah, I forgot the renaming from dim to dims. I did put in the dim functionality but didn't use it much, so I intuitively didn't consider it to be public API. But yes, in that case can we go to v0.3 first? And then if you argue for going straight to v1, I'm happy to consider that too.

@pabvald pabvald force-pushed the linear-quantization-for-signed-integers branch from 5a05c88 to 4bf816a December 3, 2024 12:37
@pabvald
Collaborator Author

pabvald commented Dec 3, 2024

Version 0.3.0 is ready

@pabvald
Collaborator Author

pabvald commented Dec 4, 2024

@milankl The tests for logarithmic quantization were using signed integers, I guess I fixed it in a later commit. It should pass now.

I have also added benchmarking for Base.extrema.

@milankl
Owner

milankl commented Dec 4, 2024

Awesome, so we merge this; dim is dim, not dims yet. I would then actually just merge #15 too, and tag both as v0.3, denoting nothing breaking plus new features. Deciding on v1.0 we can then still do?

@milankl milankl merged commit de6436e into milankl:main Dec 4, 2024
1 check passed
@pabvald pabvald deleted the linear-quantization-for-signed-integers branch December 5, 2024 07:25
Successfully merging this pull request may close these issues: Extending Linear Quantization for Signed Integers.