
Squash to main #46

Merged
tastelikefeet merged 829 commits into main from dev
Feb 13, 2026

Conversation

@tastelikefeet
Collaborator

No description provided.

tastelikefeet and others added 30 commits February 5, 2026 15:46
Move sequence-parallel strategy construction to a lazy method `_ensure_sp_strategy` to reduce side effects during model initialization. The strategy is now created only when needed, after the underlying Hugging Face model is fully initialized and before wrapping. This improves initialization performance and avoids unnecessary overhead when sequence parallelism is not enabled.
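The lazy-construction pattern this commit describes can be sketched as follows. The class bodies here are illustrative stand-ins (only the method name `_ensure_sp_strategy` and the `ulysses_size` parameter come from the commit message; everything else is an assumption about the surrounding code):

```python
class SequenceParallelStrategy:
    """Stand-in for the real strategy class; illustrative only."""

    def __init__(self, ulysses_size):
        self.ulysses_size = ulysses_size


class TransformersModel:
    """Sketch of the lazy-construction pattern (class shape is assumed)."""

    def __init__(self, ulysses_size=1):
        self.ulysses_size = ulysses_size
        self._sp_strategy = None  # deliberately NOT built during __init__

    def _ensure_sp_strategy(self):
        # Construct on first use, after the underlying HF model is fully
        # initialized and before wrapping; skip entirely when SP is off.
        if self._sp_strategy is None and self.ulysses_size > 1:
            self._sp_strategy = SequenceParallelStrategy(self.ulysses_size)
        return self._sp_strategy
```

Because the strategy is cached on first call, repeated calls return the same object, and models with `ulysses_size=1` never pay the construction cost.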
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
tastelikefeet and others added 28 commits February 13, 2026 14:25
* feat(sequence_parallel): refactor loss reduction using custom autograd functions

Replace manual gradient handling with `torch.autograd.Function` subclasses `_ReduceSequenceParallelLoss` and `_ReduceSequenceParallelSum` to compute global loss via autograd-aware all-reduce. This simplifies the logic for both sum and mean reductions, improves gradient correctness, and removes the need for separate metric scaling when `world_size > 1`.

* feat(sequence_parallel): compensate gradient scaling for FSDP averaging

Add `compensate_fsdp_avg` config flag to adjust loss reduction when sequence parallel (SP) is combined with FSDP or accelerate DDP/FSDP. This prevents gradient magnitude from being incorrectly scaled down by an extra factor of SP world size during data-parallel averaging.

- In `GatherLoss` backward, scale gradients by SP world size before splitting, so downstream FSDP averaging does not shrink this path.
- In `SequenceParallelStrategy.reduce_loss`, apply a compensation factor (ulysses_size) when `compensate_fsdp_avg` is enabled.
- Automatically set `compensate_fsdp_avg=True` in `TransformersModel` when using NativeFSDPStrategy or AccelerateStrategy with both SP and data parallelism active.
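The arithmetic behind the compensation can be checked with a back-of-envelope sketch (not library code): FSDP and accelerate DDP average gradients over all `dp * sp` ranks, but each data-parallel group's full-sequence gradient is already split across its `sp` sequence shards, so the average is shrunk by an extra factor of `sp` unless it is scaled back up by `ulysses_size`:

```python
def fsdp_averaged_grad(shard_grads, dp, sp, compensate=False):
    # FSDP / accelerate DDP average gradients over ALL dp * sp ranks.
    scale = sp if compensate else 1
    return scale * sum(shard_grads) / (dp * sp)


# dp=2 data-parallel groups, sp=4 sequence shards. Each dp group's
# full-sequence gradient G_d = 4.0 is split across its 4 SP ranks
# (1.0 per shard); the intended result is the dp-mean, i.e. 4.0.
shards = [1.0] * (2 * 4)
assert fsdp_averaged_grad(shards, dp=2, sp=4) == 1.0         # shrunk by sp
assert fsdp_averaged_grad(shards, dp=2, sp=4, compensate=True) == 4.0
```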

* delete unused unit test

* fix lint

* feat: add kernels optional dependency and refactor CI installation

- Add 'kernels' as an optional dependency group in pyproject.toml
- Refactor CI container test script to use a reusable installation function
- Install twinkle with kernels in both debug and release modes for consistency
- Improve maintainability by centralizing the installation command

* feat(kernel): add backward compatibility for kernels API changes

Update `_load_from_hub` function to handle API changes in `select_revision_or_version` and `get_kernel` calls. The changes introduce try-except blocks to catch `TypeError` exceptions, allowing the function to work with both modern keyword-based APIs and older positional argument variants. This ensures compatibility across different versions of the kernels module without breaking existing functionality.
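The fallback pattern can be illustrated with stand-ins (the two `get_kernel` variants below are hypothetical; they only model the keyword-vs-positional signature drift the commit describes, not the real kernels API):

```python
def load_from_hub(get_kernel, repo_id, revision):
    try:
        # Modern API: revision passed as a keyword argument.
        return get_kernel(repo_id, revision=revision)
    except TypeError:
        # Older API: revision was accepted positionally; retry that way.
        return get_kernel(repo_id, revision)


def modern_get_kernel(repo_id, *, revision=None):
    return (repo_id, revision, "modern")


def legacy_get_kernel(repo_id, revision, /):
    # Positional-only parameters reject the keyword spelling above,
    # raising the TypeError that triggers the fallback.
    return (repo_id, revision, "legacy")
```

The same wrapper then works against either vintage of the API without version sniffing.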
* feat(tests): replace manual sp_group retrieval with module attribute

Replace calls to `_get_sp_group_from_device_mesh` with direct access to `sequence_parallel._sp_group` in sequence parallel attention tests. This simplifies the test setup by using the already initialized group stored in the module, improving code clarity and reducing redundancy.

* feat(tests): improve kernel availability check in test_function_kernel

Add additional imports and a try-except block to verify that the 'kernels-test/flattened-build' kernel can be successfully loaded in the current environment before proceeding with the test. This prevents test failures due to environment-specific loading issues and provides a more informative skip message.
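The probe-then-skip shape described above can be sketched like this. The tests themselves use a pytest skip; this stdlib variant raises `unittest.SkipTest`, which pytest also honors, and `load_kernel` is a hypothetical loader argument standing in for the real entry point:

```python
import unittest


def require_kernel(load_kernel, name):
    # Try loading once up front; if the environment cannot provide the
    # kernel, skip with an informative message instead of failing.
    try:
        return load_kernel(name)
    except Exception as exc:  # environment-specific load failures
        raise unittest.SkipTest(
            f"kernel {name!r} could not be loaded in this environment: {exc}")
```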

* wip

* wip

* remove debug info

* feat: add ep/sp FSDP MoE finetuning entry and update script

- Add new entry for ep/sp FSDP MoE finetuning in README table
- Update ep_fsdp_qwen3_moe.py script to include ulysses_size parameter for enhanced parallelism configuration
@tastelikefeet merged commit 9aa5579 into main on Feb 13, 2026
3 of 4 checks passed
@tastelikefeet deleted the dev branch on February 13, 2026 at 09:40