[WIP] Add sliding-window attention support to the varlen kernel

**Status: WIP — authored without a Metal build/test environment. Do not merge before building and running the parity tests (see checklist).**

### Why
Sliding-window attention (SWA) models (Mistral/Ministral, Gemma 2/3, Qwen sliding layers, the gpt-oss family) pass `window_size` into `flash_attn_varlen_func`. Today the wrapper raises `NotImplementedError("Window attention is not supported")`, so on MPS Transformers falls back to **eager** attention, which materializes the full `[B, H, n, n]` score matrix.

Measured on an M-series GPU (fp16, `openai/privacy-filter`, eager): driver memory grows 3.1 → 4.2 → 18.5 → 35.9 GiB across 512 → 2048 → 8192 → 16384 tokens, then crashes with `integer out of range` because the score tensor exceeds int32 indexing at 16k. Accepting the window lets these models use the flash path, which never materializes the `n × n` score matrix, so its attention memory does not grow quadratically with sequence length.
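
The int32 failure is consistent with simple element counting. The sketch below is only that arithmetic; the batch size and head count are illustrative assumptions, not values read from `openai/privacy-filter`, and the helper names are hypothetical.

```python
INT32_MAX = 2**31 - 1


def score_elements(batch: int, heads: int, n: int) -> int:
    """Number of elements in an eager [B, H, n, n] attention score tensor."""
    return batch * heads * n * n


def score_gib(batch: int, heads: int, n: int, bytes_per_elem: int = 2) -> float:
    """Size of one score tensor in GiB (fp16 = 2 bytes per element)."""
    return score_elements(batch, heads, n) * bytes_per_elem / 2**30


if __name__ == "__main__":
    # batch=1 and heads=8 are assumed values, chosen only for illustration.
    for n in (512, 2048, 8192, 16384):
        elems = score_elements(1, 8, n)
        over = elems > INT32_MAX
        print(f"n={n:6d} elements={elems:11d} exceeds_int32={over} "
              f"size={score_gib(1, 8, n):.3f} GiB")
```

Under these assumed shapes the element count crosses the int32 limit between 8192 and 16384 tokens, while a flash-style kernel only ever holds one score tile at a time.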

### What
- Plumb `window_left` / `window_right` through `flash_attn_varlen_func` → `flash_attention_varlen` → the `flash_attention_varlen` op → host launcher → `AttnParams` → the shader. Two `int32` fields are appended to both the host and device `AttnParams` so the two layouts stay ABI-matched.
- New function constant `has_window` (302) gates a mask block that is a direct analog of the existing `do_causal` block. With `d = row_pos - col` (the signed distance from the key to the query row, positive for past keys), a key is kept iff `d <= window_left` and `-d <= window_right`; `-1` means unbounded on that side. This follows the flash-attn `window_size=(left, right)` semantics.
- `flash_attn_varlen_func` now forwards `window_size` instead of raising.
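
The band rule above can be written down as a small pure-Python reference. This is a sketch that mirrors the `past_ok` / `future_ok` test and the `k_seq_len - q_seq_len` row offset from the shader diff; the function names `in_window`, `band_mask` and `render` are hypothetical and not part of this PR.

```python
def in_window(row_pos: int, col: int, window_left: int, window_right: int) -> bool:
    """Return True if key `col` is kept for the query at `row_pos`."""
    past_ok = window_left < 0 or (row_pos - col) <= window_left
    future_ok = window_right < 0 or (col - row_pos) <= window_right
    return past_ok and future_ok


def band_mask(q_len: int, k_len: int, window_left: int, window_right: int):
    """Boolean keep-mask of shape [q_len][k_len].

    Query rows are shifted by k_len - q_len when q_len < k_len,
    matching the row-position adjustment in the shader block.
    """
    offset = k_len - q_len if q_len < k_len else 0
    return [
        [in_window(i + offset, j, window_left, window_right) for j in range(k_len)]
        for i in range(q_len)
    ]


def render(mask):
    """Render a mask as strings: 'x' = kept, '.' = masked."""
    return ["".join("x" if keep else "." for keep in row) for row in mask]


if __name__ == "__main__":
    for line in render(band_mask(4, 4, window_left=1, window_right=0)):
        print(line)
```

Note that a window alone does not make the mask causal: `window_right=-1` leaves the future side open unless `causal` is also set.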

### Scope / limitations
- This PR covers correctness and the memory win only. The K-block loop still visits every block: the band is applied as a mask, not by skipping out-of-band blocks. Block skipping for a compute speedup is a natural follow-up.
- **Attention sinks (gpt-oss `s_aux`) are NOT in this PR.** They are a denominator-only term in the online softmax: before the final `Otile.row_bin_op<DivOp>(sum_score)`, add `exp2(sink_h * log2(e) - max_score[i])` to `sum_score[i]` (rescaling the running max to include the sink for numerical safety). Happy to send that as a second PR.
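
To make the sink term concrete, here is a minimal sketch in plain Python rather than Metal. It assumes the base-2 online softmax described above, returns normalized probabilities instead of an output tile, and uses hypothetical function names; it is not the proposed kernel code.

```python
import math

LOG2E = math.log2(math.e)


def softmax_with_sink_reference(scores, sink):
    """Direct form: the sink adds exp(sink) to the denominator only."""
    denom = sum(math.exp(s) for s in scores) + math.exp(sink)
    return [math.exp(s) / denom for s in scores]


def softmax_with_sink_online(scores, sink, block=2):
    """Blockwise base-2 online softmax; the sink is folded in once at the end."""
    max_score = -math.inf
    sum_score = 0.0
    weights = []  # unnormalized exp2(s - running max), rescaled as the max moves
    for start in range(0, len(scores), block):
        chunk = [s * LOG2E for s in scores[start:start + block]]
        new_max = max(max_score, max(chunk))
        factor = 2.0 ** (max_score - new_max)  # 0.0 on the first block
        weights = [w * factor for w in weights]
        sum_score *= factor
        probs = [2.0 ** (c - new_max) for c in chunk]
        weights.extend(probs)
        sum_score += sum(probs)
        max_score = new_max
    # Denominator-only sink term, applied before the final division.
    sum_score += 2.0 ** (sink * LOG2E - max_score)
    return [w / sum_score for w in weights]
```

The probabilities sum to less than one because the sink absorbs the remainder. If the sink logit can be much larger than every score, the last `2.0 ** (...)` can overflow, which is the case the "rescale the running max to include the sink" remark addresses.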

### Validate before merge
- [ ] Build with kernel-builder for the `torchNN-metal-aarch64-darwin` targets.
- [ ] Run `tests/test_flash_attention.py` plus a new windowed case checked against a reference banded-mask SDPA.
- [ ] Numeric parity for a bidirectional window (both sides bounded) and for a causal window (`causal=True` with a bounded left side, as in Mistral-style decoders).
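
As a rough idea of what the windowed parity case could check, the sketch below compares a dense banded softmax against a blockwise online softmax that masks out-of-band scores instead of skipping blocks. It is pure Python over scalar q/k/v with equal query and key lengths, all names are hypothetical, and it does not exercise the Metal kernel.

```python
import math


def keep(row, col, left, right, causal):
    """Band test plus optional causal test for one (query, key) pair."""
    in_band = (left < 0 or row - col <= left) and (right < 0 or col - row <= right)
    return in_band and (not causal or col <= row)


def dense_reference(q, k, v, left, right, causal=False):
    """Banded softmax attention computed row by row over the kept keys."""
    out = []
    for i, qi in enumerate(q):
        cols = [j for j in range(len(k)) if keep(i, j, left, right, causal)]
        m = max(qi * k[j] for j in cols)
        w = [math.exp(qi * k[j] - m) for j in cols]
        out.append(sum(wj * v[j] for wj, j in zip(w, cols)) / sum(w))
    return out


def blockwise_masked(q, k, v, left, right, causal=False, bk=2):
    """Visits every K block and sets out-of-band scores to -inf."""
    out = []
    for i, qi in enumerate(q):
        m, denom, acc = -math.inf, 0.0, 0.0
        for start in range(0, len(k), bk):
            cols = range(start, min(start + bk, len(k)))
            s = [qi * k[j] if keep(i, j, left, right, causal) else -math.inf
                 for j in cols]
            new_m = max(m, max(s))
            if new_m == -math.inf:
                continue  # nothing valid seen yet; avoids exp(-inf - (-inf)) = nan
            scale = math.exp(m - new_m)
            p = [math.exp(x - new_m) for x in s]
            denom = denom * scale + sum(p)
            acc = acc * scale + sum(pj * v[j] for pj, j in zip(p, cols))
            m = new_m
        out.append(acc / denom)
    return out
```

The `continue` guard covers rows whose leading K blocks are entirely out of band. Whether the kernel needs an equivalent guard is something only the real parity run can show.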

Files changed (5)

sdpa-metal/scaled_dot_product_attention.metal +37 -0
sdpa-metal/scaled_dot_product_attention.mm +18 -6
torch-ext/metal_flash_sdpa/_custom_ops.py +12 -5
torch-ext/torch_binding.cpp +1 -1
torch-ext/torch_binding.h +3 -1

sdpa-metal/scaled_dot_product_attention.metal

@@ -1506,6 +1506,10 @@ struct AttnParams {
   int total_k_tokens; ///< Total number of key/value tokens
   int max_seqlen_q; ///< Maximum query sequence length
   int max_seqlen_k; ///< Maximum key/value sequence length
+  // Sliding-window attention support (-1 on a side = unbounded on that side)
+  int window_left;  ///< Max distance into the past a query may attend
+  int window_right; ///< Max distance into the future a query may attend
 };
 struct AttnMaskParams {
@@ -1521,6 +1525,7 @@ constant bool align_K [[function_constant(201)]];
 constant bool has_mask [[function_constant(300)]];
 constant bool do_causal [[function_constant(301)]];
+constant bool has_window [[function_constant(302)]];
 template <typename T>
 struct TransformScale {
@@ -1894,6 +1899,38 @@ template <
       }
     }
+    // Mask out keys outside the sliding window band [row - window_left, row + window_right]
+    if (has_window) {
+      using stile_t = decltype(Stile);
+      using selem_t = typename stile_t::elem_type;
+      constexpr auto neg_inf = -metal::numeric_limits<selem_t>::infinity();
+      const int wl = params->window_left;   // -1 => unbounded into the past
+      const int wr = params->window_right;  // -1 => unbounded into the future
+      STEEL_PRAGMA_UNROLL
+      for (short i = 0; i < stile_t::kTileRows; i++) {
+        // Same row-position machinery as the causal block above.
+        int row_pos = block_idx * BQ + tm + sm + (i * stile_t::kFragRows);
+        if (q_seq_len < k_seq_len) {
+          row_pos += (k_seq_len - q_seq_len);
+        }
+        STEEL_PRAGMA_UNROLL
+        for (short j = 0; j < stile_t::kTileCols; j++) {
+          const int col_pos_in_seq = kb * BK + sn + (j * stile_t::kFragCols);
+          STEEL_PRAGMA_UNROLL
+          for (short jj = 0; jj < stile_t::MMAFrag_t::kElemCols; jj++) {
+            const int col = col_pos_in_seq + jj;
+            const bool past_ok = (wl < 0) || ((row_pos - col) <= wl);
+            const bool future_ok = (wr < 0) || ((col - row_pos) <= wr);
+            if (!(past_ok && future_ok)) {
+              Stile.frag_at(i, j)[jj] = neg_inf;
+            }
+          }
+        }
+      }
+    }
     // Other masking as needed
     if (has_mask) {
       using stile_t = decltype(Stile);

sdpa-metal/scaled_dot_product_attention.mm

@@ -69,6 +69,8 @@ struct AttnParams {
   int32_t total_k_tokens; // Total number of key/value tokens
   int32_t max_seqlen_q;   // Maximum query sequence length
   int32_t max_seqlen_k;   // Maximum key/value sequence length
+  int32_t window_left;    // Sliding window: max distance into the past (-1 = unbounded)
+  int32_t window_right;   // Sliding window: max distance into the future (-1 = unbounded)
 };
 // Forward declarations for kernel implementations
@@ -86,7 +88,9 @@ void call_flash_attention_varlen(
     int64_t max_seqlen_k,
     bool do_causal,
     double scale,
-    double softcapping);
+    double softcapping,
+    int64_t window_left,
+    int64_t window_right);
 void flash_attention_varlen(
@@ -100,7 +104,9 @@ void flash_attention_varlen(
     int64_t max_seqlen_k,         // Maximum key sequence length
     bool do_causal,               // Whether to use causal mask
     double scale,                 // Attention scale
-    double softcapping) {         // Softcapping value
+    double softcapping,           // Softcapping value
+    int64_t window_left,          // Sliding window past extent (-1 = unbounded)
+    int64_t window_right) {       // Sliding window future extent (-1 = unbounded)
   try {
     // Get device and stream
@@ -142,9 +148,9 @@ void flash_attention_varlen(
   // For variable-length Flash Attention, always use the full attention kernel
   // Call the Flash Attention kernel
-  call_flash_attention_varlen(device, cmdBuf, lib, out, query, key, value,
+  call_flash_attention_varlen(device, cmdBuf, lib, out, query, key, value,
                               cu_seqlens_q, cu_seqlens_k, max_seqlen_q, max_seqlen_k,
-                              do_causal, scale, softcapping);
+                              do_causal, scale, softcapping, window_left, window_right);
   } catch (const std::exception& e) {
     throw;
   } catch (...) {
@@ -167,7 +173,9 @@ void call_flash_attention_varlen(
     int64_t max_seqlen_k,
     bool do_causal,
     double scale,
-    double softcapping) {
+    double softcapping,
+    int64_t window_left,
+    int64_t window_right) {
   // Get dimensions
   int64_t total_q_tokens = query.size(0);
@@ -197,7 +205,9 @@ void call_flash_attention_varlen(
   params.total_k_tokens = key.size(0);
   params.max_seqlen_q = max_seqlen_q;
   params.max_seqlen_k = max_seqlen_k;
+  params.window_left = static_cast<int32_t>(window_left);
+  params.window_right = static_cast<int32_t>(window_right);
   // Initialize fields that might be checked but aren't used in Flash Attention
   params.qL = 0;  // Not used in variable-length attention
   params.kL = 0;  // Not used in variable-length attention
@@ -227,11 +237,13 @@ void call_flash_attention_varlen(
   // The kernel will handle the cu_seqlens internally
   bool has_mask = false;  // Masks are not supported in Flash Attention
+  bool has_window = (window_left >= 0) || (window_right >= 0);
   // Setup function constants
   MTLFunctionConstantValues *constants = [MTLFunctionConstantValues new];
   [constants setConstantValue:&has_mask type:MTLDataTypeBool atIndex:300];
   [constants setConstantValue:&do_causal type:MTLDataTypeBool atIndex:301];
+  [constants setConstantValue:&has_window type:MTLDataTypeBool atIndex:302];
   // Construct kernel name based on data type and head dimension
   std::string kernel_name = "steel_attention_";

torch-ext/metal_flash_sdpa/_custom_ops.py

@@ -17,6 +17,8 @@ def flash_attention_varlen(
     do_causal: bool = False,
     scale: Optional[float] = None,
     softcapping: float = 1.0,
+    window_left: int = -1,
+    window_right: int = -1,
 ) -> None:
     """
     Flash Attention with variable-length sequences.
@@ -38,10 +40,11 @@ def flash_attention_varlen(
         - cu_seqlens_q and cu_seqlens_k must have dtype torch.int32 for Metal compatibility
         - Supported head dimensions: 32, 64, 72, 80, 96, 128
         - Masks are not supported
+        - window_left / window_right bound a sliding-window band (-1 = unbounded)
     """
     if scale is None:
         scale = query.shape[-1] ** -0.5
     ops.flash_attention_varlen(
         out,
         query,
@@ -54,6 +57,8 @@ def flash_attention_varlen(
         do_causal,
         scale,
         softcapping,
+        window_left,
+        window_right,
     )
 def flash_attn_varlen_func(
@@ -77,14 +82,14 @@ def flash_attn_varlen_func(
     Note: This implementation does not support:
     - dropout
-    - window attention
     - alibi slopes
     - returning attention probabilities
+    `window_size = (left, right)` follows the flash-attn convention: a token attends to
+    keys in [pos - left, pos + right]; -1 means unbounded on that side.
     """
     if dropout_p > 0:
         raise NotImplementedError("Dropout is not supported in this implementation")
-    if window_size != (-1, -1):
-        raise NotImplementedError("Window attention is not supported")
     if alibi_slopes is not None:
         raise NotImplementedError("ALiBi is not supported")
     if return_attn_probs:
@@ -106,8 +111,10 @@ def flash_attn_varlen_func(
         do_causal=causal,
         scale=softmax_scale,
         softcapping=1.0,
+        window_left=window_size[0],
+        window_right=window_size[1],
     )
     return out

torch-ext/torch_binding.cpp

@@ -4,7 +4,7 @@
 #include "torch_binding.h"
 TORCH_LIBRARY_EXPAND(TORCH_EXTENSION_NAME, ops) {
-  ops.def("flash_attention_varlen(Tensor! out, Tensor query, Tensor key, Tensor value, Tensor cu_seqlens_q, Tensor cu_seqlens_k, int max_seqlen_q, int max_seqlen_k, bool do_causal, float scale, float softcapping) -> ()");
+  ops.def("flash_attention_varlen(Tensor! out, Tensor query, Tensor key, Tensor value, Tensor cu_seqlens_q, Tensor cu_seqlens_k, int max_seqlen_q, int max_seqlen_k, bool do_causal, float scale, float softcapping, int window_left, int window_right) -> ()");
   ops.impl("flash_attention_varlen", torch::kMPS, flash_attention_varlen);
 }

torch-ext/torch_binding.h

@@ -13,4 +13,6 @@ void flash_attention_varlen(
     int64_t max_seqlen_k,
     bool do_causal,
     double scale,
-    double softcapping);
+    double softcapping,
+    int64_t window_left,
+    int64_t window_right);