…A#790)
* Fix MHA kernel
* Extend DualGemm to support batched mode (NVIDIA#5): following the GemmUniversalMode::kBatched implementation, batched mode is added to the DualGemm (under examples/45_dual_gemm). DualGemmMode::kBatched and SplitKSerial are not compatible: Status::kErrorInvalidProblem is returned if both are set.
* Decouple LayoutB0 and LayoutB1 in DualGemm: the DualGemm template assumed the same layout, LayoutB, for both right-operand matrices B0 and B1. This is problematic when the two layouts differ. In particular, one matrix may be row-major while the other is a column vector that must be broadcast in column-major with zero stride (e.g., as {B1.device_data(), 0}) so that the DualGemm implementation can process B0 and B1 simultaneously. LayoutB0 and LayoutB1 are now decoupled throughout the DualGemm code (device, kernel, and mma), and the batch strides of B0 and B1 are decoupled as well to accommodate the column-vector B1 case.
* Remove a comment that is no longer relevant
* Revert "Fix MHA kernel"
---------
Co-authored-by: mikeiovine <mikeiovine@fb.com>
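The zero-stride broadcast used for the column-vector B1 operand can be illustrated outside CUTLASS with a minimal sketch. Python is used purely for illustration, and the helper name `col_major_at` is hypothetical; the point is only that a leading dimension of 0 makes every column alias the same data, which is how a `{ptr, 0}` column-major layout behaves.

```python
def col_major_at(data, ld, row, col):
    """Read element (row, col) of a column-major matrix stored in `data`
    with leading dimension `ld`. With ld == 0, every column aliases the
    same storage, so a column vector is broadcast across all columns."""
    return data[row + col * ld]

# A 3x2 column-major matrix B0 with leading dimension 3.
b0 = [1, 2, 3, 4, 5, 6]
# A length-3 column vector B1 stored once, broadcast with zero stride.
b1 = [10, 20, 30]

assert col_major_at(b0, 3, 1, 1) == 5   # regular column-major indexing
assert col_major_at(b1, 0, 2, 0) == 30  # column 0 of the broadcast vector
assert col_major_at(b1, 0, 2, 7) == 30  # any other column: same data
```

This is why the batch strides of B0 and B1 also had to be decoupled: a broadcast B1 advances by 0 between batches while B0 advances by a full matrix.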
Fix the copyright of a new file
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
* Fix a typo
* Fix a dead link to code
copyright banner
* Changes to iterators to support s8 gemm with f16 outputs
* Should work
---------
Co-authored-by: Sujan Gonugondla <gsujan@amaon.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
* fMHA: Add support for bias+dropout in the forward pass
* Remove 'getMaximumSharedMemoryPerBlockKb'
* Fix comments
---------
Co-authored-by: danthe3rd <danthe3rd>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
* ex24[gemm_grouped]: Allow changing layout/dtype
* Address suggestion from @jackkosaian
---------
Co-authored-by: danthe3rd <danthe3rd>
Co-authored-by: Aniket Shivam <ashivam@nvidia.com>
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
* Hide streams and typeinfo from NVRTC
* Use __CUDACC_RTC__ instead of CUDA_ARCH for the guard
* Expose the StoreT parameter for potential speedups
* Add StoreT to more elementwise operations
---------
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
`std::vector<T>::size_type` is an unsigned type, so iterate over an unsigned type as well. Discovered while trying to enable building PyTorch without the `-Wno-sign-compare` warning suppression; see https://github.com/pytorch/pytorch/actions/runs/4418987999/jobs/7746850762#step:10:10532
Microsoft MoE paper
* Add ByteTransformer
* Update arXiv link
* Re-order
* Add guards for sm >= 70
* Drop the guard to 530
* [layout] Fix AffineRank2ColumnMajor::packed()
* Correct affine2row::packed()
---------
Co-authored-by: Haicheng Wu <haichengw@nvidia.com>
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
* Fix README
* Improve README
---------
Co-authored-by: Haicheng Wu <57973641+hwu36@users.noreply.github.com>
* Include of regular_tile_iterator.h fixed for NVRTC
* More includes fixed for NVRTC
…s/gemm/device/gemm_universal.h" (NVIDIA#1569)
Fix compilation with `cmake .. -DCUTLASS_ENABLE_TESTS=ON -DCUTLASS_TEST_LEVEL=2`
…A#1894) Co-authored-by: Siyuan Fu <siyuanf@nvidia.com>
…_Traits support (NVIDIA#1856)
* Fix wrong A/BLayout in MMA_Traits<SM80_16x8x256_S32U1U1S32_TN_XORPOPC> and add support for m8n8k128, m16n8k128 mma.and.popc in the MMA_Traits instantiation
* Add a "print" template for subbyte_reference<T>
…rs (NVIDIA#1931)
* Move two warpgroup_wait
* Merge main
---------
Co-authored-by: Siyuan Fu <siyuanf@nvidia.com>
* Fix the `cutlass` Python library with CUDA `12.6.2.post1`
Previously we had this error:
```
File "/storage/home/cutlass/python/cutlass/backend/operation.py", line 39, in <listcomp>
_version_splits = [int(x) for x in __version__.split("rc")[0].split(".")]
^^^^^^
ValueError: invalid literal for int() with base 10: 'post1'
```
* Update sm90_utils.py
* Update generator.py
* Update python/cutlass_library/generator.py
Co-authored-by: Jack Kosaian <jackkosaian@gmail.com>
* Update python/cutlass_library/sm90_utils.py
Co-authored-by: Jack Kosaian <jackkosaian@gmail.com>
---------
Co-authored-by: Jack Kosaian <jackkosaian@gmail.com>
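The traceback above shows the root cause: `int(x)` fails on the `post1` component of `12.6.2.post1` because the version string was only split on `"rc"` before parsing. A tolerant parser keeps just the leading digits of each dotted component and stops at the first non-numeric one. This is a sketch of that idea; `parse_version` is a hypothetical helper, not the actual fix that landed in `cutlass/backend/operation.py`.

```python
import re

def parse_version(version):
    """Split a version like '12.6.2.post1' or '3.4.0rc2' into integer
    parts, keeping only the leading digits of each dotted component and
    dropping suffixes such as 'rc2' or 'post1'."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # purely non-numeric component such as 'post1': stop
        parts.append(int(match.group()))
    return parts

assert parse_version("12.6.2.post1") == [12, 6, 2]
assert parse_version("3.4.0rc2") == [3, 4, 0]
```

Using `packaging.version.Version` would handle all PEP 440 suffixes uniformly, at the cost of an extra dependency.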
* Update
* Fix a typo