…A#790)
* Fix MHA kernel
* Extend DualGemm to support batched mode (NVIDIA#5): following the GemmUniversalMode::kBatched implementation, batched mode is added to the DualGemm (under examples/45_dual_gemm). DualGemmMode::kBatched and SplitKSerial are not compatible: Status::kErrorInvalidProblem is returned if both are set.
* Decouple LayoutB0 and LayoutB1 in DualGemm: the DualGemm template assumed the same layout, LayoutB, for both right-operand matrices B0 and B1. This is problematic when the two layouts differ. In particular, one matrix may be row-major while the other is a column vector that must be broadcast in column-major with zero stride (e.g., as {B1.device_data(), 0}) so that the DualGemm implementation can process B0 and B1 simultaneously. LayoutB0 and LayoutB1 are now decoupled throughout the DualGemm code (device, kernel, and mma), and the batch strides of B0 and B1 are decoupled as well to accommodate the column-vector B1 case.
* Remove a comment that is no longer relevant
* Revert "Fix MHA kernel"
---------
Co-authored-by: mikeiovine <mikeiovine@fb.com>
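The zero-stride broadcast used for the column-vector B1 operand can be illustrated outside CUTLASS with a minimal sketch. Python is used purely for illustration, and the helper name `col_major_at` is hypothetical; the point is only that a leading dimension of 0 makes every column alias the same data, which is how a `{ptr, 0}` column-major layout behaves.

```python
def col_major_at(data, ld, row, col):
    """Read element (row, col) of a column-major matrix stored in `data`
    with leading dimension `ld`. With ld == 0, every column aliases the
    same storage, so a column vector is broadcast across all columns."""
    return data[row + col * ld]

# A 3x2 column-major matrix B0 with leading dimension 3.
b0 = [1, 2, 3, 4, 5, 6]
# A length-3 column vector B1 stored once, broadcast with zero stride.
b1 = [10, 20, 30]

assert col_major_at(b0, 3, 1, 1) == 5   # regular column-major indexing
assert col_major_at(b1, 0, 2, 0) == 30  # column 0 of the broadcast vector
assert col_major_at(b1, 0, 2, 7) == 30  # any other column: same data
```

This is why the batch strides of B0 and B1 also had to be decoupled: a broadcast B1 advances by 0 between batches while B0 advances by a full matrix.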
Fix the copyright of a new file
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
* Fix a typo
* Fix a dead link to code
copyright banner
* Changes to iterators to support s8 gemm with f16 outputs
* Should work
---------
Co-authored-by: Sujan Gonugondla <gsujan@amaon.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
* fMHA: Add support for bias+dropout in the forward pass
* Remove 'getMaximumSharedMemoryPerBlockKb'
* Fix comments
---------
Co-authored-by: danthe3rd <danthe3rd>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
* ex24[gemm_grouped]: Allow changing layout/dtype
* Address suggestion from @jackkosaian
---------
Co-authored-by: danthe3rd <danthe3rd>
Co-authored-by: Aniket Shivam <ashivam@nvidia.com>
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
* Hide streams and typeinfo from NVRTC
* Use __CUDACC_RTC__ instead of CUDA_ARCH for the guard
* Expose the StoreT parameter for potential speedups
* Add StoreT to more elementwise operations
---------
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
`std::vector<T>::size_type` is an unsigned type, so iterate over an unsigned type as well. Discovered while trying to enable building PyTorch without the `-Wno-sign-compare` warning suppression; see https://github.com/pytorch/pytorch/actions/runs/4418987999/jobs/7746850762#step:10:10532
Microsoft MoE paper
* Add ByteTransformer
* Update arXiv link
* Re-order
* Add guards for sm >= 70
* Drop the guard to 530
* [layout] Fix AffineRank2ColumnMajor::packed()
* Correct affine2row::packed()
---------
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
* Fix README
* Improve README
---------
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
* Include of regular_tile_iterator.h fixed for NVRTC
* More includes fixed for NVRTC
…s/gemm/device/gemm_universal.h" (NVIDIA#1569)
Fix compilation with `cmake .. -DCUTLASS_ENABLE_TESTS=ON -DCUTLASS_TEST_LEVEL=2`
…A#1894) Co-authored-by: Siyuan Fu <siyuanf@nvidia.com>
…_Traits support (NVIDIA#1856)
* Fix wrong A/BLayout in MMA_Traits<SM80_16x8x256_S32U1U1S32_TN_XORPOPC> and add support for m8n8k128, m16n8k128 mma.and.popc in the MMA_Traits instantiation
* Add a "print" template for subbyte_reference<T>
…rs (NVIDIA#1931)
* Move two warpgroup_wait
* Merge main
---------
Co-authored-by: Siyuan Fu <siyuanf@nvidia.com>
* Fix the `cutlass` Python library with CUDA `12.6.2.post1`
Previously we had this error:
```
File "/storage/home/cutlass/python/cutlass/backend/operation.py", line 39, in <listcomp>
_version_splits = [int(x) for x in __version__.split("rc")[0].split(".")]
^^^^^^
ValueError: invalid literal for int() with base 10: 'post1'
```
* Update sm90_utils.py
* Update generator.py
* Update python/cutlass_library/generator.py
Co-authored-by: Jack Kosaian <jackkosaian@gmail.com>
* Update python/cutlass_library/sm90_utils.py
Co-authored-by: Jack Kosaian <jackkosaian@gmail.com>
---------
Co-authored-by: Jack Kosaian <jackkosaian@gmail.com>
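The traceback above shows the root cause: `int(x)` fails on the `post1` component of `12.6.2.post1` because the version string was only split on `"rc"` before parsing. A tolerant parser keeps just the leading digits of each dotted component and stops at the first non-numeric one. This is a sketch of that idea; `parse_version` is a hypothetical helper, not the actual fix that landed in `cutlass/backend/operation.py`.

```python
import re

def parse_version(version):
    """Split a version like '12.6.2.post1' or '3.4.0rc2' into integer
    parts, keeping only the leading digits of each dotted component and
    dropping suffixes such as 'rc2' or 'post1'."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # purely non-numeric component such as 'post1': stop
        parts.append(int(match.group()))
    return parts

assert parse_version("12.6.2.post1") == [12, 6, 2]
assert parse_version("3.4.0rc2") == [3, 4, 0]
```

Using `packaging.version.Version` would handle all PEP 440 suffixes uniformly, at the cost of an extra dependency.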
* Update
* Fix a typo