Linear quantization for signed integers #13
|
TODOs:
|
Yes, I agree with that. See comments.
Yeah, no, I prefer not to export 8 new function names, the latter sounds better to me.
That's expected given the lossy compression. That information loss happens in the
Linear quantization introduces an absolute error that is bounded by
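The bound itself is cut off in the comment above. As a minimal sketch, assuming a plain round-to-nearest scheme, one would expect the round-trip error to stay within half a quantization step. The helper names `quantize_linear` and `dequantize_linear` are hypothetical and are not the package's API; the sketch also assumes the input is not constant, so the step is non-zero.

```julia
# Hypothetical helpers for illustration only, not the package's functions.
# Assumes A is not constant (otherwise the step Δ would be zero).
function quantize_linear(::Type{T}, A::AbstractArray{<:AbstractFloat}) where {T<:Integer}
    lo, hi = Float64.(extrema(A))
    # Number of steps, 2^n - 1, computed in Float64 so it cannot overflow T
    nsteps = Float64(typemax(T)) - Float64(typemin(T))
    Δ = (hi - lo) / nsteps
    # Map [lo, hi] onto [typemin(T), typemax(T)] with round-to-nearest
    Q = [T(round((Float64(a) - lo) / Δ) + Float64(typemin(T))) for a in A]
    return Q, lo, Δ
end

# Inverse mapping back to Float64
function dequantize_linear(Q::AbstractArray{T}, lo, Δ) where {T<:Integer}
    return [lo + (Float64(q) - Float64(typemin(T))) * Δ for q in Q]
end

A = collect(range(-1.0, 1.0; length=1001))
Q, lo, Δ = quantize_linear(Int8, A)
A2 = dequantize_linear(Q, lo, Δ)

maxerr = maximum(abs.(A2 .- A))
maxerr <= Δ / 2 + 1e-12   # small tolerance for floating-point rounding
```

Under these assumptions the round trip is lossy but the loss is controlled by the step size `Δ`, which shrinks as the integer type gets wider.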
Okay, yeah, that shouldn't happen: if the inputs are finite, the quantization should create finite values too. But if you start in integer-quantization space you can create numbers that are perfectly representable when quantized but aren't in the float format's range:
julia> Float16(typemax(Int16)*2)
Inf16
So finite floats should be quantized into finite (signed) integers, but vice versa can trigger overflow for floats! |
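A small sketch of the overflow described above, using only Base Julia. It shows why arithmetic done in integer space can leave the `Float16` range, and that converting to a wider float first avoids it. Whether the package widens internally is not stated in this thread, so treat the last line as one possible workaround, not as what the code does.

```julia
# Largest finite Float16
floatmax(Float16)               # 65504.0

# The literal 2 is an Int, so Int16 is promoted and the product does not wrap
x = typemax(Int16) * 2          # 65534

# 65534 is above the Float16 range, so the conversion overflows
isinf(Float16(x))               # true

# The same value is finite in a wider float type
isfinite(Float32(x))            # true
```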
I have implemented your feedback but I still have to check the issues with the overflowing of Float16. Edit: Everything seems to be working now 😃 |
|
Fails because tuples aren't equal with different but otherwise equal
Expression: Base.extrema(A2) == ext
Evaluated: (0.22f0, 0.75f0) == (0.22, 0.75) |
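The failure above can be reproduced in isolation with the two tuples from the test output. The sketch below compares them exactly and then approximately; using `isapprox` in the test is only one possible fix and is an assumption here, not necessarily what the PR ended up doing.

```julia
ext   = (0.22, 0.75)        # Float64 reference values
ext32 = (0.22f0, 0.75f0)    # values as returned for a Float32 array

# Exact comparison promotes to Float64; 0.22f0 is not the same number as 0.22
ext32 == ext                # false

# 0.75 is exactly representable in both formats, so this element matches
ext32[2] == ext[2]          # true

# Element-wise approximate comparison passes
all(isapprox.(ext32, ext))  # true
```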
|
@milankl I have updated the README, although the benchmarking is missing. Do you happen to still have the code for benchmarking? |
|
I don't have the benchmarking anymore, I believe I did something like
julia> A = rand(10_000_000); # 80MB vector
julia> sizeof(A)/1000^2
80.0
julia> using BenchmarkTools
julia> @btime LinQuant16Array($A);
37.425 ms (7 allocations: 20.27 MiB)
julia> 80/40e-3 # Float64 -> UInt16 at 2000MB/s
2000.0
and coming from single precision
julia> A = rand(Float32, 10_000_000);
julia> @btime LinQuant16Array($A);
42.537 ms (7 allocations: 20.27 MiB)
julia> 40/40e-3 # Float32 -> UInt16 at 1000MB/s
1000.0 |
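The throughput arithmetic in the session above (input size in MB divided by run time in seconds) could be wrapped in a small helper. This is a rough sketch that uses Base's `@elapsed` instead of BenchmarkTools so it has no dependencies. `throughput_MBps` is a hypothetical name, and `to_uint16` is a stand-in workload, not `LinQuant16Array`.

```julia
# Hypothetical helper: throughput in MB/s, with 1 MB = 1000^2 bytes as above
function throughput_MBps(f, A; repeats=3)
    f(A)  # warm-up call so compilation time is not measured
    t = minimum(@elapsed(f(A)) for _ in 1:repeats)
    return sizeof(A) / 1000^2 / t
end

# Stand-in workload (NOT LinQuant16Array): scale [0, 1) to UInt16 and round
to_uint16(A) = round.(UInt16, A .* 65535)

A = rand(1_000_000)                 # 8 MB of Float64
mbps = throughput_MBps(to_uint16, A)
```

The number will vary by machine; for results comparable to the ones quoted, `@btime` from BenchmarkTools as in the original session is the better tool.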
|
It is fully backwards compatible. I would like to extend the README with a use case showing signed integer quantization applied to document embeddings for a RAG application, but I didn't have the time. When dealing with document/word embeddings it makes sense to keep negative values and therefore to use signed integer quantization. |
|
@milankl When do you think you will be able to review it? |
|
This looks fantastic, thanks so much for all your work on this pull request. I'm on it, I'll do a review today/tomorrow. I found a few typos; otherwise it looks like it's ready to go, but I'll review it more thoroughly! |
|
@milankl I have refactored the code to use a keyword argument P.S. I have modified the version to |
|
@milankl Could you please run the test pipeline and merge it if everything is green 🙏🏻 ? |
|
Running now 🥳 |
|
@milankl There was a typo using the old notation without the keyword argument |
|
Shoot yeah, I forgot the renaming from |
5a05c88 to 4bf816a
|
Version 0.3.0 is ready |
|
@milankl The tests for logarithmic quantization were using signed integers, I guess I fixed it in a later commit. It should pass now. I have also added benchmarking for |
|
Awesome, so we merge this, |
@milankl
I have the basic implementation:
And I have two questions:
n::Integer argument from Base.Array? In the end, the range of values can be calculated as typemax(eltype(Q)) - typemin(eltype(Q)) instead of 2^n - 1.
LinQuantization and LinQuantArray, since the alias is as long as the original call? e.g.
Also, I am facing some issues. The back-and-forth conversion with signed integers is not equal for some values, with a difference of 1.0 or 2.0. For Float16 and Int16, some values of A2 end up being -∞ or ∞. I don't exactly know what the problem is.
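One thing worth checking with the `typemax(eltype(Q)) - typemin(eltype(Q))` formula: if the subtraction is carried out in the integer type itself, it wraps around. The sketch below shows the wrap for `Int16` and a widened alternative. `nsteps` is a hypothetical helper name, and this thread does not confirm that the wrap is what caused the issues described above.

```julia
# Subtraction stays in Int16 and wraps around: 32767 - (-32768) overflows
typemax(Int16) - typemin(Int16)      # -1

# Hypothetical helper: widen before subtracting so the result fits
nsteps(::Type{T}) where {T<:Integer} = widen(typemax(T)) - widen(typemin(T))

nsteps(Int16)                        # 65535, i.e. 2^16 - 1
nsteps(UInt8)                        # 255,   i.e. 2^8 - 1
nsteps(Int8)                         # 255
```

With widening, signed and unsigned types of the same width give the same number of steps, which matches the `2^n - 1` expression it is meant to replace.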