
sequence parallel fix bug #47

Merged
meichangsu1 merged 14 commits into dev from sp_ljl_dev
Feb 9, 2026

Conversation

@meichangsu1
Collaborator

No description provided.

@gemini-code-assist
Contributor

Summary of Changes

Hello @meichangsu1, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.

This pull request focuses on enhancing the robustness and correctness of sequence parallelism, particularly for scenarios involving 'packed batches' where multiple sequences are concatenated. Key changes include refined handling of attention masks and position IDs to ensure proper attention computation and prevent incorrect label supervision across sequence boundaries. The PR also improves the loss reduction mechanism for distributed training, corrects a gradient scaling error, and introduces a new example for packed datasets. These updates collectively aim to provide a more stable and accurate sequence parallel training experience.
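The loss-reduction fix matters because naively averaging per-rank mean losses is wrong whenever ranks hold different numbers of supervised tokens; a correct global mean divides the summed loss by the summed token count. A minimal single-process sketch of that arithmetic (hypothetical names; the PR's actual implementation operates on tensors with torch.distributed):

```python
def reduce_loss(per_rank_loss_sums, per_rank_token_counts, reduction="mean"):
    """Combine per-rank token-loss sums into one global loss.
    In a real distributed run these sums would be all-reduced across ranks;
    here the ranks are modeled as plain lists for illustration."""
    total_loss = sum(per_rank_loss_sums)
    if reduction == "sum":
        return total_loss
    if reduction == "mean":
        return total_loss / sum(per_rank_token_counts)
    raise ValueError(f"unknown reduction: {reduction}")

# Rank 0: 10 tokens, summed loss 20.0. Rank 1: 30 tokens, summed loss 30.0.
# Averaging the per-rank means (2.0 and 1.0) would give 1.5, but the true
# global mean is 50.0 / 40 = 1.25.
print(reduce_loss([20.0, 30.0], [10, 30]))  # 1.25
```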

Highlights

  • Sequence Parallelism for Packed Batches: Introduced explicit handling for 'packed batches' (from PackingDataset with padding-free collate) in sequence parallelism. This includes deriving cu_seqlens for FlashAttention2 and preventing cross-sequence supervision in labels by masking out boundary tokens.
  • Loss Reduction Refinement: Enhanced the reduce_loss mechanism in sequence parallelism to correctly support both 'sum' and 'mean' reductions, ensuring accurate global loss calculation across distributed ranks.
  • Gradient Scaling Correction: Removed an incorrect world_size scaling factor from the backward pass of sequence parallel loss, improving the accuracy and correctness of gradient computations.
  • New Packed Dataset Example: Added a new cookbook example (single_controller_sp_packing.py) demonstrating how to effectively use sequence parallelism with packed datasets for improved training efficiency.
  • Attention Backend Compatibility: Implemented a runtime check to prevent the use of the SDPA backend with packed batches, enforcing the use of FlashAttention2 for such cases due to its specialized varlen/packed semantics.
  • Improved LLM Model Extraction: Introduced a more robust get_llm_model utility to consistently extract the base language model from various wrapped model types (e.g., by Accelerate, PEFT, Swift).
  • Test Coverage: Added comprehensive unit tests for sequence parallel attention, including scenarios with padding and integration with FSDP, to validate the correctness and stability of the implemented fixes.
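The cu_seqlens derivation mentioned in the first highlight can be sketched in plain Python: in a packed batch, position IDs restart at 0 at every sequence boundary, so the cumulative sequence lengths FlashAttention2's varlen kernels expect are exactly the indices where the position ID is 0, plus the total length. (Illustrative sketch only; the PR's implementation works on position_ids tensors.)

```python
def cu_seqlens_from_position_ids(position_ids):
    """Derive FlashAttention2-style cumulative sequence lengths from the
    flattened position IDs of a packed batch. Each packed sequence starts
    at position 0, so every 0 marks a sequence boundary."""
    starts = [i for i, p in enumerate(position_ids) if p == 0]
    return starts + [len(position_ids)]

# Three packed sequences of lengths 3, 2, and 4:
packed = [0, 1, 2, 0, 1, 0, 1, 2, 3]
print(cu_seqlens_from_position_ids(packed))  # [0, 3, 5, 9]
```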


Changelog
  • cookbook/sft/single_controller_sp.py
    • Increased max_length in create_dataset from 64 to 256.
    • Removed the eval function and refactored the training loop to calculate metrics per step and save a final checkpoint.
  • cookbook/sft/single_controller_sp_packing.py
    • Added a new training script for sequence parallelism with packed datasets, utilizing PackingDataset and padding_free input processing.
  • src/twinkle/model/transformers/strategy/sequence_parallel.py
    • Moved get_llm_model to twinkle.utils.transformers_utils and replaced HfConfigFactory with a standalone get_config_attr function.
    • Removed world_size scaling from _SequenceParallelLoss.backward.
    • Implemented logic to derive cu_seqlens for FlashAttention2 from position_ids in packed batches and added a runtime check to disallow SDPA backend for such cases.
    • Removed padding_free parameter from _SequenceParallel initialization and pad_and_split_mm_tokens method.
    • Introduced _is_packed_position_ids to detect packed batches and modified pad_and_split_inputs to track this status, generate attention masks for SP-alignment padding, and adjust labels for packed batches.
    • Added loss_reduction to SequenceParallelConfig and updated reduce_loss to support 'sum' and 'mean' reductions.
    • Refactored postprocess_outputs for consistent handling and trimming of gathered logits.
  • src/twinkle/model/transformers/transformers.py
    • Ensured loss_reduction is propagated to sp_strategy.sp_config before calling reduce_loss.
  • src/twinkle/utils/transformers_utils.py
    • Added a comprehensive get_llm_model utility for robust LLM module extraction from wrapped models.
  • tests/transformers/test_sequence_parallel_single_attention.py
    • Added a new test file with distributed tests for sequence parallel attention, covering scenarios with and without padding, and FSDP integration.
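The get_llm_model utility in the changelog addresses a common nuisance: Accelerate wraps the model behind a .module attribute, PEFT behind .base_model, and so on, so code that needs the bare language model must unwrap layer by layer. A minimal unwrapping sketch with hypothetical wrapper classes (the real utility in twinkle.utils.transformers_utils handles more wrapper types):

```python
def get_llm_model(model, wrapper_attrs=("module", "base_model")):
    """Repeatedly unwrap known wrapper attributes (e.g. Accelerate's
    .module, PEFT's .base_model) until the bare LLM is reached."""
    unwrapped = True
    while unwrapped:
        unwrapped = False
        for attr in wrapper_attrs:
            inner = getattr(model, attr, None)
            if inner is not None and inner is not model:
                model = inner
                unwrapped = True
                break
    return model

# Hypothetical nested wrappers, mimicking Accelerate around PEFT:
class Llm: ...
class PeftWrapper:
    def __init__(self, m): self.base_model = m
class AccelerateWrapper:
    def __init__(self, m): self.module = m

base = Llm()
assert get_llm_model(AccelerateWrapper(PeftWrapper(base))) is base
```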
Activity
  • The author meichangsu1 opened this pull request to fix a bug in the sequence parallelism implementation.
  • The changes modify the core sequence parallelism logic, update training scripts, and extend utility functions.
  • New unit tests validate the sequence parallel attention mechanism across padding strategies and FSDP integration.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces significant fixes and improvements for sequence parallelism, with a focus on supporting packed datasets. Key changes include refactoring sequence parallel logic for clarity, fixing critical bugs in gradient calculation and label handling, and adding a new test file for sequence parallel attention. While the core changes are solid, I've identified a couple of areas for improvement in an example script and a minor code quality issue.

Comment on lines +79 to +88
if step % 1 == 0:
    metric = model.calculate_metric(is_training=True, adapter_name='default')
    _metrics = {}
    for key, value in metric.items():
        try:
            value = float(value)
            _metrics[key] = value
        except:
            pass
    logger.info(f'Current is step {step} of {len(dataloader)}, metric: {metric}')
Severity: high

This block has a few issues:

  1. Inefficiency: Calculating metrics on every step (step % 1 == 0) can significantly slow down training. It's better to do this periodically, for example, every 50 or 100 steps, as was done previously (if step % 50 == 0 and step > 0:).
  2. Redundant condition: The if step % 1 == 0: condition is always true and can be removed.
  3. Dead code: The _metrics dictionary is created and populated but never used. The log message uses the original metric dictionary. This block of code can be removed.
  4. Bare except: The except: on line 86 is a bare except, which is bad practice as it can hide unexpected errors. It should catch specific exceptions like ValueError or TypeError.

Consider refactoring this to calculate metrics periodically and removing the unused code for better performance and readability.

Suggested change
- if step % 1 == 0:
-     metric = model.calculate_metric(is_training=True, adapter_name='default')
-     _metrics = {}
-     for key, value in metric.items():
-         try:
-             value = float(value)
-             _metrics[key] = value
-         except:
-             pass
-     logger.info(f'Current is step {step} of {len(dataloader)}, metric: {metric}')
+ if step > 0 and step % 50 == 0:
+     metric = model.calculate_metric(is_training=True, adapter_name='default')
+     logger.info(f'Current is step {step} of {len(dataloader)}, metric: {metric}')

@@ -1,5 +1,6 @@
# Copyright (c) ModelScope Contributors. All rights reserved.
import contextlib
import os

Severity: medium

This import os is redundant as the os module is already imported on line 5. Please remove this duplicate import to keep the code clean.

Add a new test file `test_sequence_parallel_single_attention.py` to verify the correctness of the sequence parallel attention implementation. The test includes a distributed setup using torch.distributed and compares outputs between sequence parallel and local attention modes. Also adds an empty `__init__.py` to the transformers test directory for proper module imports.
- Add `_enable_strict_determinism` helper to disable TF32 and enable deterministic algorithms
- Add `_to_local` helper to unwrap DTensors for gradient comparison
- Update test to use full world size for sequence parallel group and increase head count
- Switch to float32 dtype for stricter numerical alignment
- Improve gradient comparison by cloning and unwrapping tensors
…free logic

- Replace HfConfigFactory utility with direct get_config_attr function
- Move get_llm_model to shared transformers utilities
- Remove padding_free parameter and related conditional logic
- Simplify attention mask construction for padded tokens
- Update SequenceParallelConfig to drop padding_free field
- Add detection of packed batches via `_is_packed_position_ids` heuristic
- Raise RuntimeError when SDPA backend is used with packed batches, as SDPA lacks native packed/varlen support
- Build 2D attention_mask for padded sequences to ensure correct FlashAttention2 unpad behavior
- Avoid unnecessary 4D causal mask generation for packed/padding-free batches
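The packed-batch detection described in this commit can be illustrated with a simple heuristic: within one row of a packed batch, position IDs are not monotonically increasing, because they reset to 0 at each sequence boundary. A hedged sketch in plain Python (the actual _is_packed_position_ids may use a different, tensor-based check):

```python
def is_packed_position_ids(position_ids):
    """Heuristic: a row of position IDs belongs to a packed batch if the
    IDs ever decrease, i.e. a new sequence restarted at position 0."""
    return any(b < a for a, b in zip(position_ids, position_ids[1:]))

print(is_packed_position_ids([0, 1, 2, 0, 1]))  # True  (two packed sequences)
print(is_packed_position_ids([0, 1, 2, 3]))     # False (single sequence)
```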
Introduce a new cookbook script demonstrating supervised fine-tuning with a single controller using sequence parallelism (SP) and FSDP across 4 GPUs. The example includes:
- Device mesh configuration with dp=2 and fsdp=2 dimensions
- PackingDataset setup with self-cognition data and left truncation
- Training loop with LoRA adapter, AdamW optimizer, and periodic evaluation
- Checkpoint saving based on loss improvement
- Validation of FSDP + SP input slicing across multiple GPUs
…formers cookbook

- Add new single_controller_sp.py example demonstrating FSDP + SP validation over 4 GPUs
- Move legacy single_controller_sp.py to transformers/sp_fsdp_dense.py
- Add shell script sp_fsdp_dense.sh for running the example
- Update imports and structure to use twinkle framework components
…irectory

Relocate test_sequence_parallel_single_attention.py from tests/transformers/ to tests/sequence_parallel/ to better organize test files by feature area. This improves maintainability and aligns with the project's test structure conventions.
- Add bash script header and comments to `sp_fsdp_dense.sh` explaining how to enable sequence parallelism with ulysses_size
- Remove duplicate `import os` statement in transformers.py for cleaner code
- Fix minor formatting by removing extra blank line in transformers_utils.py
- Switch from `ray` to `local` mode for twinkle initialization
- Add evaluation function with separate dataset slice
- Increase dataset size from 100 to 500 samples
- Add cosine warmup learning rate scheduler
- Remove unused torch import and remote_group parameters
- Adjust batch size from 4 to 8 and logging frequency to every 20 steps
- Improve logging with train configs and total steps information
Removed unnecessary imports (`math`, `os`, `SimpleNamespace`) from the sequence_parallel strategy file to clean up the codebase and improve maintainability.
@meichangsu1 meichangsu1 merged commit 2dc5ff8 into dev Feb 9, 2026
0 of 4 checks passed
