Mirror of GEMM MR from source repo to produce smaller diff #1
Open
kwyss-nvidia wants to merge 31 commits into kwyss/subchannel_quantize_dequantize from
Conversation
Make sure that weight matrix has required usages for dgrad GEMM
Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Blockwise float8 quantizer and quantized tensor class. The classes are configurable for 128x128 and 1x128 block sizes by setting block_scaling_dim to 2 or 1, respectively. Scale tensors are stored in a format amenable to matrix multiplication; integration with matmul is deferred to a separate story. Fusions of quantization with DBIAS or activation functions are not yet implemented, and dequantization is currently implemented in torch. Quantization tests are included at the C++ and PyTorch layers, with exact comparison against reference quantizer behavior as well as coverage of interesting API branches such as tensor creation in PyTorch and C++ and dequantization of rowwise and columnwise usages. Two CUDA quantization kernels are included; they are direct ports of their equivalents in the kitchen repository, where a subchannel recipe has been used for end-to-end training.
* Apply linting changes.
* Alignment for 1D scaling for GEMM edge case.
* MR feedback.
* Change API name.
* Fix merge conflict with name change.
* Use common tensor map API.
* Change API to use two scaling mode enums.
* Fix typo.
* Update some call sites.
* Tests for torch tensor API surface. Since the quantized tensor is a tensor subclass, these tests exercise torch hooks.
* Reuse scale calculation between quantizer refs.
* Save memory by dropping references to saved tensors. Issues previously observed are solved.
* Remove constexpr parameters from kernel. Code size is reduced with fewer constexpr params.
* Merge conflict from rebase.
* Add shape implementations for block scaling. nvte_shape was added upstream; logic added for block-scaled fp8.
* Move benchmark to te_playground.
* Remove amax_epsilon and pow_2_scales from tensor. Hardcodes the default values.
* Lint changes.
* Fixup MR changes that broke.
* Safer ifdef in kernel.
* Documentation prose.
* Reuse compute_scale function from Current Scaling.
* Bugfix on inf_value scale refactor.
* Remove qopt calls from test.
* Update pytest list.
* Add copyright to reference scale calc.
* Use ptx.cuh functions instead of cde.
* Update shape logic with allocation and reuse shape.
* Usage defaults MR feedback.
* Copyright and header guard.
* Updating torch dispatch code.
* Fix exception type.
* Use TypeInfo.
* MR feedback.
* Update CS scale update test to use updated ref impl.
* Update JAX scaling mode enum.
* Skip tests on Lovelace.
---------
Signed-off-by: Keith Wyss <kwyss@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
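The 1x128 blockwise scheme described above (one scale per 128-element block, scale derived from the block amax) can be sketched as a minimal numpy reference. This is an illustration only, not the TE kernel or its API: fp8 casting is simulated by clipping, `FP8_E4M3_MAX` and the helper names are assumptions, and padding of ragged shapes is ignored.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # max representable magnitude in float8 e4m3

def quantize_1x128(x: np.ndarray, block: int = 128):
    """Reference 1x128 blockwise quantization: one scale per contiguous
    128-element block along the last dimension."""
    rows, cols = x.shape
    assert cols % block == 0, "illustration assumes no padding needed"
    blocks = x.reshape(rows, cols // block, block)
    amax = np.abs(blocks).max(axis=-1, keepdims=True)
    scale = np.where(amax > 0, FP8_E4M3_MAX / amax, 1.0)
    # Simulate the fp8 cast with a clip; a real kernel casts to e4m3.
    q = np.clip(blocks * scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q.reshape(rows, cols), scale.squeeze(-1)

def dequantize_1x128(q: np.ndarray, scale: np.ndarray, block: int = 128):
    """Invert the blockwise scaling (the commit does this step in torch)."""
    rows, cols = q.shape
    blocks = q.reshape(rows, cols // block, block)
    return (blocks / scale[..., None]).reshape(rows, cols)

x = np.random.default_rng(0).standard_normal((4, 256)).astype(np.float32)
q, s = quantize_1x128(x)          # s has one scale per 1x128 block
x_rt = dequantize_1x128(q, s)     # round trip is exact up to clipping
```

The 128x128 (block_scaling_dim == 2) case is analogous, with the amax reduction taken over a two-dimensional tile instead of a row segment.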
MXFP8 flax layer tests Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
GEMM test cases included in pytorch integration. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
Signed-off-by: Keith Wyss <kwyss@nvidia.com>
(NVIDIA#1644)
* Rename QuantizeAxis to QuantizeLayout, get_layout to get_data_layout, q_axis to q_layout
* Add flatten_axis option
* Added gated act to test encoder
* Sharding constraint fixes
* Fix padding when the first dim needs to be padded while flattening
* Update test sizes so that padding is tested
* Remove output sharding as it can be done in the flax module
* Shard scale_inv for mxfp8
---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
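The padding fix above concerns flattening leading dimensions before blockwise quantization: after flattening, the row count may not divide the block size, so it must be padded up to a multiple. A minimal numpy sketch of that concern, with a hypothetical helper name (not the TE/JAX API):

```python
import numpy as np

def pad_and_flatten(x: np.ndarray, block: int = 128):
    """Flatten all leading dims into one, then zero-pad that dim up to a
    multiple of `block` so blockwise scales tile evenly. Hypothetical
    helper illustrating the padding concern, not the actual TE/JAX code."""
    flat = x.reshape(-1, x.shape[-1])
    rows = flat.shape[0]
    padded_rows = -(-rows // block) * block  # ceil(rows / block) * block
    pad = padded_rows - rows
    if pad:
        flat = np.pad(flat, ((0, pad), (0, 0)))
    # Keep the original row count so the padding can be stripped later.
    return flat, rows

x = np.ones((3, 5, 64), dtype=np.float32)  # 15 rows after flattening
flat, orig_rows = pad_and_flatten(x, block=8)
```

Here 15 rows round up to 16 with block size 8, and the extra row is zero-filled so it contributes nothing to block amax values.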
Signed-off-by: Keith Wyss <kwyss@nvidia.com>
Configure the A and B matrices separately, with a separate code path for each scaling mode.
Signed-off-by: Tim Moon <tmoon@nvidia.com>
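The idea of configuring each GEMM operand independently and dispatching by scaling mode can be sketched as follows. The enum members, dataclass, and dispatch key are illustrative assumptions, not the actual Transformer Engine types:

```python
from dataclasses import dataclass
from enum import Enum, auto

class ScalingMode(Enum):
    # Illustrative names only; the real TE enums differ.
    TENSOR = auto()     # one scale for the whole tensor
    BLOCK_1D = auto()   # 1x128 blocks
    BLOCK_2D = auto()   # 128x128 blocks

@dataclass
class OperandConfig:
    """Per-operand GEMM configuration: A and B are set up separately."""
    scaling_mode: ScalingMode
    columnwise: bool = False

def gemm_dispatch(a_cfg: OperandConfig, b_cfg: OperandConfig) -> str:
    # Because each operand carries its own config, A and B may use
    # different scaling modes and hit different code paths.
    return f"A:{a_cfg.scaling_mode.name}/B:{b_cfg.scaling_mode.name}"

key = gemm_dispatch(OperandConfig(ScalingMode.BLOCK_1D),
                    OperandConfig(ScalingMode.BLOCK_2D))
```

The point of the per-operand split is that a 1x128-scaled activation can be multiplied against a 128x128-scaled weight without forcing both onto one path.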
* Refactor to add CP support for sbhd/bshd
* Support interleaved
* Format
* Add interleaved to RotaryPositionEmbedding in test
* Update
* Merge sbhd/bshd and thd functions
---------
Signed-off-by: Xin Yao <xiny@nvidia.com>
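The interleaved rotary embedding mentioned above rotates adjacent element pairs (x[2i], x[2i+1]) by position-dependent angles, rather than splitting the vector in half. A minimal single-token numpy reference, not the fused TE kernel:

```python
import numpy as np

def rope_interleaved(x: np.ndarray, pos: int, base: float = 10000.0):
    """Interleaved rotary position embedding on one token vector:
    each adjacent pair (x[2i], x[2i+1]) is rotated by pos * base^(-2i/d).
    Simplified reference implementation."""
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)
    theta = pos * inv_freq
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin   # 2x2 rotation per pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

v = np.random.default_rng(1).standard_normal(8)
r = rope_interleaved(v, pos=3)
```

Since each pair undergoes a pure rotation, the vector norm is preserved, and position 0 is the identity; those two properties make a convenient sanity check for either layout.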
* Fix C++ warning
* More fixes
---------
Signed-off-by: Xin Yao <xiny@nvidia.com>
Support fp8 primary weight in FSDP training
Signed-off-by: jianbinc <shjwudp@gmail.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Remove "no scaling" enum
* Update JAX enum
---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Keith Wyss <kwyss@nvidia.com>
This PR targets a different branch than NVIDIA#1545 to make the diff easier to review.