
Mirror of GEMM MR from source repo to produce smaller diff#1

Open
kwyss-nvidia wants to merge 31 commits into kwyss/subchannel_quantize_dequantize from kwyss/cublas_gemm_github_mr

Conversation

@kwyss-nvidia
Owner

This is a different branch target from NVIDIA#1545 to make the diff easier to review.

kwyss-nvidia force-pushed the kwyss/cublas_gemm_github_mr branch 3 times, most recently from 08ae7e8 to eee37bf on March 15, 2025 00:25
kwyss-nvidia force-pushed the kwyss/subchannel_quantize_dequantize branch 2 times, most recently from 5eb3a82 to 91a6721 on March 17, 2025 17:22
kwyss-nvidia force-pushed the kwyss/cublas_gemm_github_mr branch from eee37bf to ce4ca80 on March 17, 2025 17:24
kwyss-nvidia force-pushed the kwyss/subchannel_quantize_dequantize branch from 91a6721 to 909dac7 on March 19, 2025 22:41
kwyss-nvidia force-pushed the kwyss/cublas_gemm_github_mr branch from ce4ca80 to 5ebc93a on March 19, 2025 22:42
kwyss-nvidia force-pushed the kwyss/cublas_gemm_github_mr branch from 5aa279e to 8466c36 on April 1, 2025 19:45
kwyss-nvidia force-pushed the kwyss/subchannel_quantize_dequantize branch from 909dac7 to 0ee51eb on April 1, 2025 19:46
kwyss-nvidia force-pushed the kwyss/cublas_gemm_github_mr branch 4 times, most recently from fa019d5 to cd3e414 on April 2, 2025 19:20
kwyss-nvidia force-pushed the kwyss/subchannel_quantize_dequantize branch from 27c9188 to 18f19bb on April 3, 2025 20:38
kwyss-nvidia force-pushed the kwyss/cublas_gemm_github_mr branch from cd3e414 to f1e9e62 on April 4, 2025 01:17
timmoon10 and others added 13 commits April 3, 2025 22:01
Make sure that weight matrix has required usages for dgrad GEMM

Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Blockwise float8 quantizer and quantized tensor class.

The classes are configurable for 128x128 blocksize
and 1x128 blocksize via setting block_scaling_dim == 2,1 respectively.

Scale tensors are stored in a format amenable to matrix multiplication,
but the matmul integration itself is deferred to a separate story.

Fusions of quantization and DBIAS or activation functions are not yet
implemented, and the dequantization is currently implemented in torch.

Tests for quantization are included at the C++ and PyTorch layers, with
exact comparison against reference quantizer behavior; they also aim to
hit interesting branches through the API, such as tensor creation in
PyTorch and C++ and dequantization of rowwise and columnwise usages.

Two CUDA kernels for quantization are included; they are direct ports
of equivalents in the kitchen repository, where a subchannel recipe
has been used for end-to-end training.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>
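The 1x128 and 128x128 block scaling described above can be pictured with a NumPy reference quantizer. This is an illustrative sketch only: the function names, the simulated e4m3 clamp, and the scale layout are assumptions, not the repository's CUDA kernels or tensor classes.

```python
import numpy as np

E4M3_MAX = 448.0  # largest representable float8 e4m3 magnitude

def quantize_blockwise(x, block_rows=1, block_cols=128):
    """Reference blockwise quantizer: one scale per (block_rows x block_cols) tile.

    block_rows=1   -> 1x128 sub-channel scaling (block_scaling_dim == 1)
    block_rows=128 -> 128x128 block scaling     (block_scaling_dim == 2)
    """
    rows, cols = x.shape
    assert rows % block_rows == 0 and cols % block_cols == 0
    # View as (n_row_blocks, block_rows, n_col_blocks, block_cols) tiles.
    tiles = x.reshape(rows // block_rows, block_rows,
                      cols // block_cols, block_cols)
    amax = np.abs(tiles).max(axis=(1, 3), keepdims=True)
    scale = E4M3_MAX / np.maximum(amax, 1e-12)        # map each tile into fp8 range
    q = np.clip(tiles * scale, -E4M3_MAX, E4M3_MAX)   # stand-in for the fp8 cast
    scale_inv = (1.0 / scale).squeeze(axis=(1, 3))    # stored for dequantization
    return q.reshape(rows, cols), scale_inv

def dequantize_blockwise(q, scale_inv, block_rows=1, block_cols=128):
    """Undo quantize_blockwise by re-applying each tile's inverse scale."""
    rows, cols = q.shape
    tiles = q.reshape(rows // block_rows, block_rows,
                      cols // block_cols, block_cols)
    out = tiles * scale_inv[:, None, :, None]
    return out.reshape(rows, cols)
```

Because the fp8 cast is only simulated here, the quantize/dequantize round trip is lossless up to floating-point roundoff; the real kernels additionally truncate each element to e4m3 precision.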

* Apply linting changes.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Alignment for 1D scaling for GEMM edge case.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* MR feedback.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Change API name.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Fix merge conflict with name change.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Use common tensor map API.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Change API to use two scaling mode enums.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Fix typo.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update some call sites.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Tests for torch tensor API surface.

Since the quantized tensor is a tensor
subclass, these tests exercise torch hooks.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Reuse scale calculation between quantizer refs.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Save memory by dropping references to saved tensors.

The memory issues observed previously are resolved.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Remove constexpr parameters from kernel.

Code size is reduced with fewer constexpr params.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Merge conflict from rebase.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Add shape implementations for block scaling.

nvte_shape was added upstream; logic is added here
for block-scaled fp8.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>
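The shape logic for a block-scaled tensor's scale storage can be illustrated with a small helper (a hypothetical name, not the repository's API): one scale entry per block, rounding partial blocks up.

```python
import math

def scale_shape(rows, cols, block_rows, block_cols):
    """Shape of the scale-inverse tensor for a block-scaled fp8 tensor:
    one entry per (block_rows x block_cols) tile, partial tiles rounded up."""
    return (math.ceil(rows / block_rows), math.ceil(cols / block_cols))
```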

* Move benchmark to te_playground

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Remove amax_epsilon and pow_2_scales from tensor.

Hardcodes the default values.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Lint changes.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Fixup MR changes that broke.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Safer ifdef in kernel.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Documentation prose.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Reuse compute_scale function from Current Scaling.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Bugfix on inf_value scale refactor.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>
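A guarded scale computation in the spirit of these two commits might look like the following. This is a sketch under stated assumptions: `compute_scale` and its guards are illustrative, not the library's shared function.

```python
import math

def compute_scale(amax, fp8_max, eps=0.0):
    """Compute scale = fp8_max / amax, guarded so that a zero or
    infinite amax never produces an inf/NaN scale."""
    amax = max(amax, eps)
    if amax == 0.0 or math.isinf(amax):
        return 1.0              # nothing to scale, or amax already overflowed
    scale = fp8_max / amax
    if math.isinf(scale):
        return 1.0              # guard: never emit an infinite scale
    return scale
```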

* Remove qopt calls from test.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update pytest list.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Add copyright to reference scale calc.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Use ptx.cuh functions instead of cde.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update shape logic with allocation and reuse shape.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Usage defaults MR feedback.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Copyright and header guard.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Updating torch dispatch code.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Fix exception type.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Use TypeInfo

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* MR feedback.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>

* Update CS scale update test to use updated ref impl

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Update JAX scaling mode enum

Signed-off-by: Tim Moon <tmoon@nvidia.com>

* Skip tests on Lovelace

Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------

Signed-off-by: Keith Wyss <kwyss@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
MXFP8 flax layer tests

Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
GEMM test cases included in pytorch integration.

Signed-off-by: Keith Wyss <kwyss@nvidia.com>
kwyss-nvidia force-pushed the kwyss/cublas_gemm_github_mr branch from f1e9e62 to 758dc4a on April 4, 2025 16:14
kwyss-nvidia and others added 5 commits April 4, 2025 10:39
Signed-off-by: Keith Wyss <kwyss@nvidia.com>
…VIDIA#1644)

* rename QuantizeAxis to QuantizeLayout, get_layout to get_data_layout, q_axis to q_layout

* add flatten_axis option

* added gated act to test encoder

* sharding constraint fixes

* fix padding when flattening first dim needs to be padded

* update test sizes so that padding is tested

* rm output sharding as it can be done in the flax module

* sharding scale_inv for mxfp8

---------

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Keith Wyss <kwyss@nvidia.com>
Signed-off-by: Keith Wyss <kwyss@nvidia.com>
Signed-off-by: Keith Wyss <kwyss@nvidia.com>
kwyss-nvidia force-pushed the kwyss/cublas_gemm_github_mr branch from 32799ab to 861c870 on April 5, 2025 00:59
kwyss-nvidia and others added 10 commits April 4, 2025 18:28
Signed-off-by: Keith Wyss <kwyss@nvidia.com>
Configure A and B matrices separately. Have separate code path for each scaling mode.

Signed-off-by: Tim Moon <tmoon@nvidia.com>
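Configuring the A and B operands separately can be pictured as a reference scaled GEMM in which each operand carries its own scale-inverse layout. This is a hedged NumPy sketch, not the cuBLAS integration code; the function name and layouts are assumptions.

```python
import numpy as np

def scaled_gemm(a_q, a_scale_inv, b_q, b_scale_inv):
    """Reference scaled GEMM: each operand is dequantized with its own
    scale-inverse, which may be a scalar (per-tensor) or an array whose
    shape broadcasts against the operand (e.g. per-row / per-block).
    The two operands are configured independently, mirroring the
    per-operand code paths."""
    a = a_q * a_scale_inv   # broadcasting handles either scale layout
    b = b_q * b_scale_inv
    return a @ b
```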
* refactor to add cp support for sbhd/bshd

Signed-off-by: Xin Yao <xiny@nvidia.com>

* support interleaved

Signed-off-by: Xin Yao <xiny@nvidia.com>

* format

Signed-off-by: Xin Yao <xiny@nvidia.com>

* add interleaved to RotaryPositionEmbedding in test

Signed-off-by: Xin Yao <xiny@nvidia.com>

* update

Signed-off-by: Xin Yao <xiny@nvidia.com>

* merge sbhd/bshd and thd functions

Signed-off-by: Xin Yao <xiny@nvidia.com>

---------

Signed-off-by: Xin Yao <xiny@nvidia.com>
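The interleaved versus non-interleaved rotary-embedding layouts mentioned above can be sketched in reference NumPy (an illustration only; base 10000 and the pairing conventions are assumptions, not the repository's fused kernels).

```python
import numpy as np

def rope(x, interleaved=False):
    """Apply rotary position embedding to x of shape (seq, dim).

    interleaved=False rotates the pairs (x[:, :d/2], x[:, d/2:]);
    interleaved=True  rotates adjacent pairs (x[:, 0::2], x[:, 1::2]).
    """
    seq, dim = x.shape
    half = dim // 2
    inv_freq = 1.0 / (10000.0 ** (np.arange(half) / half))
    angles = np.outer(np.arange(seq), inv_freq)   # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    if interleaved:
        x1, x2 = x[:, 0::2], x[:, 1::2]
    else:
        x1, x2 = x[:, :half], x[:, half:]
    r1 = x1 * cos - x2 * sin                      # 2D rotation per pair
    r2 = x1 * sin + x2 * cos
    out = np.empty_like(x)
    if interleaved:
        out[:, 0::2], out[:, 1::2] = r1, r2
    else:
        out[:, :half], out[:, half:] = r1, r2
    return out
```

Either layout applies the same per-pair rotation; only which dimensions are paired differs, which is why the two variants can share a merged code path.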
* fix cpp warning

Signed-off-by: Xin Yao <xiny@nvidia.com>

* more fix

Signed-off-by: Xin Yao <xiny@nvidia.com>

---------

Signed-off-by: Xin Yao <xiny@nvidia.com>
Support fp8 primary weight in fsdp training

Signed-off-by: jianbinc <shjwudp@gmail.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* rm no scaling enum

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

* update jax enum

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>

---------

Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Signed-off-by: Keith Wyss <kwyss@nvidia.com>
6 participants