Conversation

@sudhakarsingh27
Collaborator

Description

FusedAttention has supported sliding window attention with window_size = (left, 0) for some time now. This PR adds support for SWA (left, right) with the FusedAttention backend in TE.
(changes cherry-picked from original PR: #1369)
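A minimal usage sketch (not part of this PR's diff), assuming the public DotProductAttention API from recent TE releases; with this change an arbitrary (left, right) window like the one below can be served by the FusedAttention backend rather than only (left, 0) windows:

```python
import torch
import transformer_engine.pytorch as te

# Bidirectional sliding window: each query attends to 128 past and 64 future positions.
attn = te.DotProductAttention(
    num_attention_heads=16,
    kv_channels=64,
    attn_mask_type="no_mask",
    window_size=(128, 64),
)

# Default qkv_format is "sbhd": [seq, batch, heads, head_dim].
q = torch.randn(512, 2, 16, 64, dtype=torch.bfloat16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)
out = attn(q, k, v)
```

Whether FusedAttention is actually selected still depends on the backend filters (cuDNN version, layout, etc.) updated in this PR.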

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

transformer_engine

  • common

    • fused_attn
      • fused_attn.cpp
        • add bottom_right_diagonal parameter to the API
        • Edit the backend-selection filters so that sliding window configs can pick the arbitrary-seqlen fused attention backend
      • fused_attn_f16_arbitrary_seqlen.cu: add bottom_right_diagonal parameter to the API
      • fused_attn_fp8.cu: add bottom_right_diagonal parameter to the FADescriptor_v1 API
      • utils.h: add bottom_right_diagonal parameter to FADescriptor_v1 API
  • pytorch

    • transformer.py
      • plumb bottom_right_diagonal through the call stack: TransformerLayer --> SelfAttention/CrossAttention
    • attention
      • dot_product_attention
        • backends.py:
          • UnfusedDotProductAttention
            • add bottom_right_diagonal parameter to the forward API
              • why is it not used in the forward?
                • bottom_right_alignment is being used in the ALiBi call; perhaps this should be corrected
          • FusedAttn custom module
            • add bottom_right_diagonal parameter to the forward API
          • FusedAttention module
            • plumb bottom_right_diagonal through the call stack
        • dot_product_attention.py
          • DotProductAttention
            • Plumb bottom_right_diagonal through the call stack
            • Add calculation of bottom_right_diagonal if it's None
        • utils.py
          • AttentionParams
            • [x]
          • get_attention_backend
            • update sliding window filter section
            • update attention bias filter section
      • multi_head_attention.py
        • Add bottom_right_diagonal to forward API and call
        • Add calculation of bottom_right_diagonal if it's None
    • cpp_extensions
      • fused_attn.py
        • plumb bottom_right_diagonal in fused_attn_fwd/fused_attn_bwd
    • csrc
      • extension
        • attention.cpp
          • plumb bottom_right_diagonal through the call stack: fused_attn_fwd --> nvte_fused_attn_fwd
          • same as above for bwd
      • extensions.h
        • add bottom_right_diagonal to fused_attn_fwd and fused_attn_bwd API definitions

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

…IA#1369

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
@sudhakarsingh27
Collaborator Author

/te-ci pytorch L0

@greptile-apps
Contributor

greptile-apps bot commented Dec 4, 2025

Greptile Overview

Greptile Summary

This PR extends FusedAttention's sliding window attention (SWA) support from causal-style (left, 0) windows to arbitrary bidirectional window configurations by adding a bottom_right_diagonal parameter throughout the entire call stack. Previously, FusedAttention only supported configurations like window_size = (left, 0); this change enables arbitrary configurations like window_size = (left, right).

The parameter controls diagonal alignment in the attention matrix - when False, the sliding window and ALiBi diagonal align to the top-left corner (typical for causal masks); when True, they align to the bottom-right corner (for other mask types). The change leverages cuDNN 9.6+ features that added support for arbitrary sliding window configurations.

The implementation threads the parameter through all layers: from high-level PyTorch APIs (TransformerLayer, MultiheadAttention, DotProductAttention) down through C++ extensions to the underlying CUDA kernels. The change includes automatic default value logic based on mask types to maintain backward compatibility while enabling the new functionality.
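As a toy illustration of the alignment semantics described above (not code from this PR), the reference diagonal shifts by seqlen_kv - seqlen_q when bottom_right_diagonal is True:

```python
import torch

def swa_mask(sq, skv, left, right, bottom_right_diagonal):
    """Boolean mask of which keys each query may attend to (True = allowed).
    Infinite windows (-1) are not handled in this sketch.
    """
    offset = (skv - sq) if bottom_right_diagonal else 0  # 0 = top-left diagonal
    q = torch.arange(sq).unsqueeze(1)   # [sq, 1] query positions
    k = torch.arange(skv).unsqueeze(0)  # [1, skv] key positions
    return (k >= q + offset - left) & (k <= q + offset + right)

print(swa_mask(4, 6, left=1, right=1, bottom_right_diagonal=False).int())
print(swa_mask(4, 6, left=1, right=1, bottom_right_diagonal=True).int())
```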

Important Files Changed

  • transformer_engine/pytorch/attention/dot_product_attention/backends.py (2/5): Adds parameter support but has a critical bug: UnfusedDotProductAttention doesn't use the passed parameter and still uses hardcoded logic
  • transformer_engine/pytorch/transformer.py (2/5): Adds complex conditional logic for setting defaults that appears overly complicated and potentially error-prone
  • transformer_engine/common/fused_attn/fused_attn.cpp (4/5): Core backend selection with complex cuDNN version-dependent logic that needs careful validation
  • transformer_engine/pytorch/attention/dot_product_attention/dot_product_attention.py (4/5): Main attention class changes with comprehensive parameter threading and automatic defaults
  • transformer_engine/pytorch/attention/dot_product_attention/utils.py (4/5): Backend selection logic updates with important filtering condition changes
  • transformer_engine/common/fused_attn/fused_attn_f16_arbitrary_seqlen.cu (5/5): Clean CUDA kernel implementation with proper parameter threading
  • transformer_engine/common/fused_attn/fused_attn_fp8.cu (5/5): FP8 implementation with intentionally hardcoded values for an incomplete feature
  • transformer_engine/common/include/transformer_engine/fused_attn.h (5/5): Comprehensive API extension with proper documentation across all function signatures
  • transformer_engine/pytorch/csrc/extensions/attention.cpp (5/5): Clean C++ extension implementation properly threading the parameter through the call stack
  • transformer_engine/pytorch/attention/multi_head_attention.py (5/5): Well-implemented parameter addition with sensible defaults based on mask types
  • transformer_engine/pytorch/cpp_extensions/fused_attn.py (5/5): Clean Python wrapper implementation with proper documentation and parameter threading
  • tests/pytorch/attention/test_attention.py (5/5): Simple test coverage extension for multiple QKV layouts
  • transformer_engine/common/fused_attn/utils.h (5/5): Straightforward struct extension with correct comparison operator updates
  • transformer_engine/pytorch/csrc/extensions.h (5/5): Clean header file extension maintaining API consistency
  • transformer_engine/common/fused_attn/fused_attn_f16_arbitrary_seqlen.h (5/5): Simple header extension with consistent parameter addition

Confidence score: 3/5

  • This PR adds significant functionality but has critical implementation issues that need to be addressed before merging
  • Score lowered due to unused parameter in UnfusedDotProductAttention (critical bug), overly complex conditional logic in transformer.py, and complex version-dependent backend selection logic that increases maintenance burden
  • Pay close attention to transformer_engine/pytorch/attention/dot_product_attention/backends.py where the parameter is added but not actually used in the ALiBi computation

Contributor

@greptile-apps bot left a comment

Additional Comments (2)

  1. transformer_engine/pytorch/attention/dot_product_attention/dot_product_attention.py, line 1281 (link)

    logic: Trailing comma creates a single-element tuple instead of a boolean (see the short demonstration after this list). Should this be just bottom_right_alignment = attn_mask_type not in ["causal", "padding_causal"]?

  2. transformer_engine/pytorch/attention/dot_product_attention/dot_product_attention.py, line 1482 (link)

    style: Uses hardcoded mask type check instead of the new bottom_right_diagonal parameter for ALiBi alignment. Should this use bottom_right_diagonal parameter for consistency?

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
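The demonstration referenced in comment 1 above (illustration only, not code from the PR):

```python
attn_mask_type = "causal"

# With the trailing comma the right-hand side is a one-element tuple, which is always
# truthy, so the intended False is silently lost.
bottom_right_alignment = attn_mask_type not in ["causal", "padding_causal"],  # (False,)
print(bool(bottom_right_alignment))  # True

# Without the trailing comma the expression evaluates to the intended boolean.
bottom_right_alignment = attn_mask_type not in ["causal", "padding_causal"]   # False
print(bool(bottom_right_alignment))  # False
```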

15 files reviewed, 8 comments


Comment on lines +453 to 455
# (This should be replaced with `bottom_right_diagonal` which is passed from the arguments)
bottom_right_alignment=attn_mask_type not in ["causal", "padding_causal"],
)
Contributor

logic: The TODO comment indicates bottom_right_diagonal should replace bottom_right_alignment but the replacement wasn't implemented. This could cause incorrect behavior in alibi attention. Should bottom_right_alignment be replaced with bottom_right_diagonal parameter as the comment suggests?

Comment on lines +886 to +888
# (cyang: Why is window_size is being modified but then its value ignored
# in the following else block?)
# else:
Contributor

style: Commented-out code with developer question should be removed

Suggested change
- # (cyang: Why is window_size is being modified but then its value ignored
- # in the following else block?)
- # else:
+ if window_size is None:
+     window_size = check_set_window_size(attn_mask_type, window_size)

Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

"Disabling FlashAttention as it only supports sliding window with bottom right"
" diagonal alignment for cross-attention"
)
use_flash_attention = False
Contributor

logic: Variable name mismatch: use_flash_attention is being set but should be use_flash_attention_2

Suggested change
- use_flash_attention = False
+ use_flash_attention_2 = False

"Disabling FlashAttention as it only supports ALiBi with bottom right diagonal"
" alignment for cross-attention"
)
use_flash_attention = False
Contributor

logic: Variable name mismatch: use_flash_attention is being set but should be use_flash_attention_2

Suggested change
- use_flash_attention = False
+ use_flash_attention_2 = False

Comment on lines +777 to +783
if self_attn_mask_type in {"causal", "padding_causal"}:
bottom_right_diagonal = False
if bottom_right_diagonal is None or self_attn_mask_type in {
"causal_bottom_right",
"padding_causal_bottom_right",
}:
bottom_right_diagonal = True
Contributor

logic: Logic overrides the instance variable even when explicitly set in forward call - should preserve user's explicit choice. Should the mask type check override an explicitly passed bottom_right_diagonal parameter, or only apply when it's None?
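A sketch of the restructuring the comment asks about, reusing the names from the quoted snippet (example inputs added so the snippet is self-contained):

```python
self_attn_mask_type = "causal"   # e.g. taken from the forward() arguments
bottom_right_diagonal = None     # None means the caller did not set it explicitly

# Apply the mask-type default only when no explicit value was passed.
if bottom_right_diagonal is None:
    # causal/padding_causal imply top-left alignment; other mask types (including the
    # *_bottom_right variants) default to bottom-right alignment.
    bottom_right_diagonal = self_attn_mask_type not in {"causal", "padding_causal"}
```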

Comment on lines +787 to +793
if enc_dec_attn_mask_type in {"causal", "padding_causal"}:
enc_dec_bottom_right_diagonal = False
if enc_dec_bottom_right_diagonal is None or enc_dec_attn_mask_type in {
"causal_bottom_right",
"padding_causal_bottom_right",
}:
enc_dec_bottom_right_diagonal = True
Contributor

logic: Same logic issue as above - mask type check overrides explicit parameter values

qkv_format == NVTE_QKV_Format::NVTE_SBHD)))) ||
// 9.6: SWA (left, 0) + top-left/bottom-right diagonal + {bshd, sbhd, thd}
(cudnn_runtime_version >= 90600 &&
((window_size_left == -1 && (window_size_right == -1 || window_size_right == 0)) ||
Collaborator

We can probably remove this line? Because it's covered by the next line?

Collaborator

@cyanguwa Dec 5, 2025

Could you add a couple of SWA tests to the CP tests as well? I think it's just a matter of replacing (left, 0) with (left, right) and testing them out. Thanks!
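A hypothetical sketch of such an extension (the real CP tests' parametrization and helpers differ; run_cp_attention below is a stand-in):

```python
import pytest

def run_cp_attention(window_size):
    """Placeholder for the real context-parallel attention check."""
    ...

# Exercise bidirectional (left, right) windows alongside the existing (left, 0) cases.
@pytest.mark.parametrize("window_size", [(-1, 0), (512, 0), (512, 256)])
def test_cp_with_sliding_window(window_size):
    run_cp_attention(window_size=window_size)
```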


// NVTE fused attention FWD with packed QKV
// DEPRECATED: This API is deprecated.
// DEPRECATED: This API is deprecated. (Should there be a version by which this is going to be removed? @cyang)
Collaborator

I made some changes in #2272, but will see if I can make the 2.11 deadline.

sdpa_options.set_diagonal_band_left_bound(window_size_left + 1);
}
if (cudnn_runtime_version >= 90600 && window_size_right != -1) {
// (remove comment when reviewed) Should it be `window_size_right + 1` instead?
Collaborator

SWA(left, right) should describe an attention window of left + 1 + right elements, but cuDNN understands it as left - 1 + 1 + right elements, so we need to add the 1 here to left to make all three backends (FlashAttention, FusedAttention, UnfusedDPA) equivalent in terms of the SWA operation.
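A worked example of the off-by-one described above (arithmetic illustration only):

```python
left, right = 2, 1
te_window_elems = left + 1 + right                    # 4: keys q-2 .. q+1
# If `left` were passed directly as cuDNN's diagonal band left bound, the window
# would only cover (left - 1) + 1 + right elements:
elems_without_adjustment = (left - 1) + 1 + right     # 3
# Passing left + 1 restores the intended width:
elems_with_adjustment = ((left + 1) - 1) + 1 + right  # 4 == te_window_elems
print(te_window_elems, elems_without_adjustment, elems_with_adjustment)
```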

sdpa_backward_options.set_diagonal_band_left_bound(window_size_left + 1);
}
if (cudnn_runtime_version >= 90600 && window_size_right != -1) {
// (remove comment when reviewed) Should it be `window_size_right + 1` instead?
Collaborator

Same as above.

actual_seqlens_q=actual_seqlens_q if "padding" in attn_mask_type else None,
actual_seqlens_kv=actual_seqlens_kv if "padding" in attn_mask_type else None,
alibi_slopes=alibi_slopes,
# (This should be replaced with `bottom_right_diagonal` which is passed from the arguments)
Collaborator

@cyanguwa Dec 5, 2025

I wonder if the logic should be something like this: bottom_right_alignment=attn_mask_type not in ["causal", "padding_causal"] if bottom_right_alignment is None else bottom_right_alignment.
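Formatted as code, that suggestion would read roughly as follows (names taken from the comment; example inputs added so the sketch stands alone):

```python
attn_mask_type = "causal_bottom_right"  # example mask type
bottom_right_alignment = None           # value reaching the ALiBi call; may be None

# Fall back to the mask-type heuristic only when no explicit value was provided.
bottom_right_alignment = (
    attn_mask_type not in ["causal", "padding_causal"]
    if bottom_right_alignment is None
    else bottom_right_alignment
)
```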

`causal_bottom_right`, `padding_causal_bottom_right`, `arbitrary`}
window_size : Tuple[int, int], default = None
Sliding window attention size.
bottom_right_diagonal: bool, default = `True`
Collaborator

Is the default True here?

use_flash_attention_2 = False
# (cyang: Why is window_size is being modified but then its value ignored
# in the following else block?)
# else:
Collaborator

You are right - the code under else should come out of else because we need to factor in the changes in check_set_window_size. Thanks - well spotted!

Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Contributor

@greptile-apps bot left a comment

15 files reviewed, no comments


Contributor

@greptile-apps bot left a comment

Additional Comments (4)

  1. transformer_engine/pytorch/transformer.py, line 777-783 (link)

    logic: Conditional logic overwrites explicitly passed parameter. When self_attn_mask_type in {"causal", "padding_causal"} at line 777, this unconditionally sets bottom_right_diagonal = False, even if the user explicitly passed a different value in the forward call. The check at line 779 (if bottom_right_diagonal is None) becomes unreachable in those cases.

    Consider restructuring to only apply defaults when bottom_right_diagonal is None.

  2. transformer_engine/pytorch/transformer.py, line 787-793 (link)

    logic: Same conditional logic issue for encoder-decoder attention. The mask type check at line 787 unconditionally overwrites enc_dec_bottom_right_diagonal, ignoring explicitly passed values.

  3. transformer_engine/pytorch/attention/dot_product_attention/utils.py, line 920 (link)

    logic: Variable name mismatch: setting use_flash_attention instead of use_flash_attention_2

  4. transformer_engine/pytorch/attention/dot_product_attention/utils.py, line 947 (link)

    logic: Variable name mismatch: setting use_flash_attention instead of use_flash_attention_2

15 files reviewed, 4 comments


@sudhakarsingh27
Collaborator Author

/te-ci pytorch L0

@sudhakarsingh27
Collaborator Author

/te-ci pytorch L0

Contributor

@greptile-apps bot left a comment

Additional Comments (1)

  1. transformer_engine/pytorch/attention/dot_product_attention/dot_product_attention.py, line 1283 (link)

    syntax: Tuple has trailing comma causing it to be a single-element tuple instead of boolean

15 files reviewed, 1 comment


Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
@sudhakarsingh27
Collaborator Author

/te-ci pytorch L0

Contributor

@greptile-apps bot left a comment

15 files reviewed, no comments


@sudhakarsingh27
Collaborator Author

/te-ci pytorch L0
