
TypeError: flex_flash_attn_func() got an unexpected keyword argument 'max_seqlen_q' #193

@hengrui0516

Description


Hi, I just came here from the Magi-1 repo. When I run a sample test, it fails:

(magi) kanghengrui@x86_64-conda-linux-gnu [/mnt/shared-storage-gpfs2/kanghengrui-gpfs02/manivid/vid_models/MAGI-1] git:(main) ✗ ➜ bash example/4.5B/test_sample.sh [20:50:34]
/home/kanghengrui/miniconda3/envs/magi/lib/python3.10/site-packages/magi_attention/__init__.py:26: UserWarning: You are using magi_attention without installing it. This may cause some unexpected errors.
warnings.warn(
/home/kanghengrui/miniconda3/envs/magi/lib/python3.10/site-packages/transformers/utils/hub.py:127: FutureWarning: Using TRANSFORMERS_CACHE is deprecated and will be removed in v5 of Transformers. Use HF_HOME instead.
warnings.warn(
/home/kanghengrui/miniconda3/envs/magi/lib/python3.10/site-packages/timm/models/layers/__init__.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers
warnings.warn(f"Importing from {name} is deprecated, please import via timm.layers", FutureWarning)
[W1216 20:50:57.975591094 CUDAAllocatorConfig.h:28] Warning: expandable_segments not supported on this platform (function operator())
[2025-12-16 20:50:57,458 - INFO] Initialize torch distribution and model parallel successfully
[2025-12-16 20:50:57,458 - INFO] MagiConfig(model_config=ModelConfig(model_name='videodit_ardf', num_layers=34, hidden_size=3072, ffn_hidden_size=12288, num_attention_heads=24, num_query_groups=8, kv_channels=128, layernorm_epsilon=1e-06, apply_layernorm_1p=True, x_rescale_factor=1, half_channel_vae=Fa
lse, params_dtype=torch.bfloat16, patch_size=2, t_patch_size=1, in_channels=16, out_channels=16, cond_hidden_ratio=0.25, caption_channels=4096, caption_max_length=800, xattn_cond_hidden_ratio=1.0, cond_gating_ratio=1.0, gated_linear_unit=False), runtime_config=RuntimeConfig(cfg_number=3, cfg_t_range=[0
.0, 0.0217, 0.1, 0.3, 0.999], prev_chunk_scales=[1.5, 1.5, 1.5, 1.0, 1.0], text_scales=[7.5, 7.5, 7.5, 0.0, 0.0], noise2clean_kvrange=[5, 4, 3, 2], clean_chunk_kvrange=1, clean_t=0.9999, seed=1234, num_frames=96, video_size_h=720, video_size_w=720, num_steps=64, window_size=4, fps=24, chunk_width=6, t5
_pretrained='/mnt/shared-storage-gpfs2/gpfs2-shared-public/huggingface/zskj-hub/models--sand-ai--MAGI-1/ckpt/t5', t5_device='cpu', vae_pretrained='/mnt/shared-storage-gpfs2/gpfs2-shared-public/huggingface/zskj-hub/models--sand-ai--MAGI-1/ckpt/vae', scale_factor=0.18215, temporal_downsample_factor=4, lo
ad='/mnt/shared-storage-gpfs2/gpfs2-shared-public/huggingface/zskj-hub/models--sand-ai--MAGI-1/ckpt/magi/4.5B_base'), engine_config=EngineConfig(distributed_backend='nccl', distributed_timeout_minutes=15, pp_size=1, cp_size=1, cp_strategy='cp_ulysses', ulysses_overlap_degree=1, fp8_quant=False, distill
_nearly_clean_chunk_threshold=0.3, shortcut_mode='8,16,16', distill=False, kv_offload=True, enable_cuda_graph=False))
[2025-12-16 20:50:57,458 - INFO] Precompute validation prompt embeddings
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be
set if you understand what it means, and thoroughly read the reason why this was added as explained in huggingface/transformers#24565
Loading checkpoint shards: 100%|██████████| 2/2 [00:15<00:00, 7.84s/it]
[2025-12-16 20:51:22,098 - INFO] VideoDiTModel(
(x_embedder): Conv3d(16, 3072, kernel_size=(1, 2, 2), stride=(1, 2, 2), bias=False)
(t_embedder): TimestepEmbedder(
(mlp): Sequential(
(0): Linear(in_features=256, out_features=768, bias=True)
(1): SiLU()
(2): Linear(in_features=768, out_features=768, bias=True)
)
)
(y_embedder): CaptionEmbedder(
(y_proj_xattn): Sequential(
(0): Linear(in_features=4096, out_features=3072, bias=True)
(1): SiLU()
)
(y_proj_adaln): Sequential(
(0): Linear(in_features=4096, out_features=768, bias=True)
)
)
(rope): LearnableRotaryEmbeddingCat()
(videodit_blocks): TransformerBlock(
(layers): ModuleList(
(0-33): 34 x TransformerLayer(
(ada_modulate_layer): AdaModulateLayer(
(act): SiLU()
(proj): Sequential(
(0): Linear(in_features=768, out_features=6144, bias=True)
)
)
(self_attention): FullyParallelAttention(
(linear_qkv): CustomLayerNormLinear(
(layer_norm): LayerNorm((3072,), eps=1e-06, elementwise_affine=True)
(q): Linear(in_features=3072, out_features=3072, bias=False)
(qx): Linear(in_features=3072, out_features=3072, bias=False)
(k): Linear(in_features=3072, out_features=1024, bias=False)
(v): Linear(in_features=3072, out_features=1024, bias=False)
)
(linear_kv_xattn): Linear(in_features=3072, out_features=2048, bias=False)
(linear_proj): Linear(in_features=6144, out_features=3072, bias=False)
(q_layernorm): FusedLayerNorm()
(q_layernorm_xattn): FusedLayerNorm()
(k_layernorm): FusedLayerNorm()
(k_layernorm_xattn): FusedLayerNorm()
)
(self_attn_post_norm): FusedLayerNorm()
(mlp): CustomMLP(
(layer_norm): LayerNorm((3072,), eps=1e-06, elementwise_affine=True)
(linear_fc1): Linear(in_features=3072, out_features=12288, bias=False)
(linear_fc2): Linear(in_features=12288, out_features=3072, bias=False)
)
(mlp_post_norm): FusedLayerNorm()
)
)
(final_layernorm): FusedLayerNorm()
)
(final_linear): FinalLinear(
(linear): Linear(in_features=3072, out_features=64, bias=False)
)
)
[2025-12-16 20:51:22,101 - INFO] (cp, pp) rank (0, 0): param count 4459898128, model size 8.34 GB
[2025-12-16 20:51:22,101 - INFO] Build DiTModel successfully
[2025-12-16 20:51:22,102 - INFO] After build_dit_model, memory allocated: 0.01 GB, memory reserved: 0.02 GB
[2025-12-16 20:51:22,102 - INFO] load inference_weight weight from /mnt/shared-storage-gpfs2/gpfs2-shared-public/huggingface/zskj-hub/models--sand-ai--MAGI-1/ckpt/magi/4.5B_base/inference_weight
Loading shards: 100%|██████████| 2/2 [00:00<00:00, 7.70it/s]
[2025-12-16 20:51:30,853 - INFO] Load Weight Missing Keys: []
[2025-12-16 20:51:30,854 - INFO] Load Weight Unexpected Keys: []
[2025-12-16 20:51:31,010 - INFO] After load_checkpoint, memory allocated: 8.36 GB, memory reserved: 8.37 GB
[2025-12-16 20:51:31,013 - INFO] After high_precision_promoter, memory allocated: 8.36 GB, memory reserved: 8.37 GB
[2025-12-16 20:51:31,164 - INFO] Load checkpoint successfully
[2025-12-16 20:51:31,164 - INFO] special_token = ['HQ_TOKEN', 'DURATION_TOKEN']
InferBatch 0: 0%| | 0/4 [00:00<?, ?it/s][2025-12-16 20:51:31,198 - INFO] transport_inputs len: 1
[rank0]: Traceback (most recent call last):
[rank0]: File "/mnt/shared-storage-gpfs2/kanghengrui-gpfs02/manivid/vid_models/MAGI-1/inference/pipeline/entry.py", line 54, in
[rank0]: main()
[rank0]: File "/mnt/shared-storage-gpfs2/kanghengrui-gpfs02/manivid/vid_models/MAGI-1/inference/pipeline/entry.py", line 40, in main
[rank0]: pipeline.run_text_to_video(prompt=args.prompt, output_path=args.output_path)
[rank0]: File "/mnt/shared-storage-gpfs2/kanghengrui-gpfs02/manivid/vid_models/MAGI-1/inference/pipeline/pipeline.py", line 35, in run_text_to_video
[rank0]: self._run(prompt, None, output_path)
[rank0]: File "/mnt/shared-storage-gpfs2/kanghengrui-gpfs02/manivid/vid_models/MAGI-1/inference/pipeline/pipeline.py", line 49, in _run
[rank0]: [
[rank0]: File "/mnt/shared-storage-gpfs2/kanghengrui-gpfs02/manivid/vid_models/MAGI-1/inference/pipeline/pipeline.py", line 49, in
[rank0]: [
[rank0]: File "/mnt/shared-storage-gpfs2/kanghengrui-gpfs02/manivid/vid_models/MAGI-1/inference/pipeline/video_generate.py", line 763, in generate_per_chunk
[rank0]: for _, _, chunk in sample_transport.walk():
[rank0]: File "/mnt/shared-storage-gpfs2/kanghengrui-gpfs02/manivid/vid_models/MAGI-1/inference/pipeline/video_generate.py", line 725, in walk
[rank0]: velocity = self.forward_velocity(infer_idx, 0)
[rank0]: File "/mnt/shared-storage-gpfs2/kanghengrui-gpfs02/manivid/vid_models/MAGI-1/inference/pipeline/video_generate.py", line 657, in forward_velocity
[rank0]: velocity = forward_fn(
[rank0]: File "/mnt/shared-storage-gpfs2/kanghengrui-gpfs02/manivid/vid_models/MAGI-1/inference/model/dit/dit_model.py", line 503, in forward_dispatcher
[rank0]: (out_cond_pre_and_text, out_cond_pre, out_uncond, denoise_width) = self.forward_3cfg(
[rank0]: File "/mnt/shared-storage-gpfs2/kanghengrui-gpfs02/manivid/vid_models/MAGI-1/inference/model/dit/dit_model.py", line 414, in forward_3cfg
[rank0]: out_cond_pre_and_text = self.forward(
[rank0]: File "/home/kanghengrui/miniconda3/envs/magi/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/mnt/shared-storage-gpfs2/kanghengrui-gpfs02/manivid/vid_models/MAGI-1/inference/model/dit/dit_model.py", line 385, in forward
[rank0]: x = self.videodit_blocks.forward(
[rank0]: File "/home/kanghengrui/miniconda3/envs/magi/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/mnt/shared-storage-gpfs2/kanghengrui-gpfs02/manivid/vid_models/MAGI-1/inference/model/dit/dit_module.py", line 1427, in forward
[rank0]: hidden_states = layer(
[rank0]: File "/home/kanghengrui/miniconda3/envs/magi/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/home/kanghengrui/miniconda3/envs/magi/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/mnt/shared-storage-gpfs2/kanghengrui-gpfs02/manivid/vid_models/MAGI-1/inference/model/dit/dit_module.py", line 1308, in forward
[rank0]: core_attn_out, cross_attn_out = self.self_attention(
[rank0]: File "/home/kanghengrui/miniconda3/envs/magi/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/home/kanghengrui/miniconda3/envs/magi/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: File "/mnt/shared-storage-gpfs2/kanghengrui-gpfs02/manivid/vid_models/MAGI-1/inference/model/dit/dit_module.py", line 1195, in forward
[rank0]: core_attn_out, xattn_out = UlyssesScheduler.get_attn_and_xattn_with_fused_kv_comm(
[rank0]: File "/mnt/shared-storage-gpfs2/kanghengrui-gpfs02/manivid/vid_models/MAGI-1/inference/infra/parallelism/context_parallel.py", line 525, in get_attn_and_xattn_with_fused_kv_comm
[rank0]: return UlyssesScheduler.get_attn_and_xattn_base(
[rank0]: File "/mnt/shared-storage-gpfs2/kanghengrui-gpfs02/manivid/vid_models/MAGI-1/inference/infra/parallelism/context_parallel.py", line 585, in get_attn_and_xattn_base
[rank0]: core_attn_out_new = core_attn_func(query[i], key, value)
[rank0]: File "/mnt/shared-storage-gpfs2/kanghengrui-gpfs02/manivid/vid_models/MAGI-1/inference/model/dit/dit_module.py", line 1032, in core_attention
[rank0]: core_attn_out, _ = flex_attention(
[rank0]: TypeError: flex_flash_attn_func() got an unexpected keyword argument 'max_seqlen_q'
InferBatch 0: 0%| | 0/4 [00:00<?, ?it/s]

My environment:
cuda=12.4
torch=2.4.0+cu124
python=3.10.12
magi_attention=0.0.0
flash_attn=2.7.0.post1
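For anyone reproducing this, a quick stdlib-only sketch to confirm which of the packages above are actually installed in the active environment (the package names are the ones listed above; this does not import them, so it works even if a package is broken):

```python
from importlib import metadata

def report_versions(pkgs):
    """Return the installed version of each package, or 'not installed'."""
    out = {}
    for name in pkgs:
        try:
            out[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            out[name] = "not installed"
    return out

# Packages from the environment listed above.
print(report_versions(["torch", "flash_attn", "magi_attention"]))
```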

To get the build to finish, I reduced FFA_JOBS from 160 to 10 and then successfully installed magi_attention=0.0.0.

I found that if I don't install magi_attention=0.0.0, this error doesn't occur. Furthermore, at the beginning the log warns:

UserWarning: You are using magi_attention without installing it. This may cause some unexpected errors.

So I suspect this is related to magi_attention?
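For context on the failure mode: `TypeError: ... got an unexpected keyword argument` means the caller in `dit_module.py` passes `max_seqlen_q`, but whichever `flex_flash_attn_func` actually gets imported does not accept that keyword, i.e. the installed magi_attention build exposes a different signature than the one MAGI-1 expects. A minimal sketch of how to check for this mismatch (the stand-in function below is hypothetical; on a real install you would inspect magi_attention's own `flex_flash_attn_func`):

```python
import inspect

# Hypothetical stand-in for the installed flex_flash_attn_func:
# an older signature without the max_seqlen_q keyword.
def flex_flash_attn_func(q, k, v, softmax_scale=None):
    return q

# Listing the accepted parameters shows whether the keyword exists.
params = list(inspect.signature(flex_flash_attn_func).parameters)
print("max_seqlen_q" in params)  # False -> the call below must fail

# Passing the unsupported keyword reproduces the TypeError from the log.
try:
    flex_flash_attn_func(None, None, None, max_seqlen_q=128)
except TypeError as e:
    print(e)
```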
