As Large Language Models (LLMs) evolve to handle million-token contexts, the quadratic cost of self-attention has become the primary bottleneck. Prism offers a training-free, spectral-aware solution that breaks this efficiency barrier without sacrificing accuracy.

Block Importance Estimation for Block-Sparse Attention

A typical approach for efficient estimation (block-level ops only):

Coarse-grained Attention + TopP Selection

$$ \bar{\mathbf{q}}_u = \frac{1}{B} \sum_{i \in \mathcal{I}_u} \mathbf{q}_i, \quad \bar{\mathbf{k}}_v = \frac{1}{B} \sum_{j \in \mathcal{I}_v} \mathbf{k}_j \tag{1} $$

$$ \bar{\mathbf{S}} = \text{softmax} \left( \frac{\bar{\mathbf{Q}}\bar{\mathbf{K}}^\top}{\sqrt{d}} \right) \tag{2} $$
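Equations (1) and (2) amount to mean-pooling queries and keys within each block and running attention over the block centroids; a top-p rule then picks, per query block, the smallest set of key blocks covering most of the coarse attention mass. A minimal NumPy sketch (function name, shapes, and the top-p threshold are illustrative, not from the paper):

```python
import numpy as np

def block_importance(Q, K, block_size, top_p=0.9):
    """Coarse-grained block importance estimation (Eqs. 1-2) + TopP selection.

    Q, K: (seq_len, d) query/key matrices; seq_len is assumed to be a
    multiple of block_size. All names here are illustrative.
    """
    n, d = Q.shape
    nb = n // block_size
    # Eq. (1): mean-pool queries and keys within each block of size B
    Qb = Q.reshape(nb, block_size, d).mean(axis=1)   # (nb, d)
    Kb = K.reshape(nb, block_size, d).mean(axis=1)   # (nb, d)
    # Eq. (2): softmax attention over block centroids
    logits = Qb @ Kb.T / np.sqrt(d)                  # (nb, nb)
    logits -= logits.max(axis=-1, keepdims=True)     # numerical stability
    S = np.exp(logits)
    S /= S.sum(axis=-1, keepdims=True)
    # TopP selection: per query block, keep the smallest set of key
    # blocks whose coarse scores sum to at least top_p
    order = np.argsort(-S, axis=-1)
    sorted_S = np.take_along_axis(S, order, axis=-1)
    csum = np.cumsum(sorted_S, axis=-1)
    keep_sorted = (csum - sorted_S) < top_p          # include the block crossing top_p
    mask = np.zeros_like(S, dtype=bool)
    np.put_along_axis(mask, order, keep_sorted, axis=-1)
    return S, mask
```

The returned boolean `mask` would then gate which (query block, key block) tiles the sparse attention kernel actually computes.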


However, naively adopting this approach usually underperforms: prior works often have to fall back on token-level operations for precise pattern detection and search.

A quick recap:

Why Does Coarse-Grained Attention Fall Short?

The Culprit: RoPE and the "Blind Spot"

To understand why coarse-grained attention fails, we have to look at how modern LLMs handle positions. Most state-of-the-art models (like Llama 3 or Qwen) use Rotary Positional Embeddings (RoPE).

RoPE encodes position by rotating feature pairs in the complex plane at different frequencies $\theta_j$:

$$ x_n^{(j)} = x_{\text{nope}}^{(j)} \cdot e^{in\theta_j}, \quad \text{where} \quad \theta_j = b^{-2j/d} \tag{3} $$
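Concretely, Eq. (3) treats each pair of features as a complex number and rotates it by an angle proportional to the token's position, with a per-pair frequency $\theta_j = b^{-2j/d}$. A small NumPy sketch of this rotation (the function name and `base` default are illustrative; `base` plays the role of $b$):

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply RoPE (Eq. 3) to x of shape (seq_len, d), with d even.

    Feature pairs (x[2j], x[2j+1]) are viewed as complex numbers and
    rotated by n * theta_j at position n, where theta_j = base**(-2j/d).
    """
    n, d = x.shape
    j = np.arange(d // 2)
    theta = base ** (-2.0 * j / d)                    # per-pair frequencies
    angles = np.arange(n)[:, None] * theta[None, :]   # (n, d/2) rotation angles
    xc = x[:, 0::2] + 1j * x[:, 1::2]                 # pair features as complex numbers
    xr = xc * np.exp(1j * angles)                     # x_nope * e^{i n theta_j}
    out = np.empty_like(x)
    out[:, 0::2] = xr.real
    out[:, 1::2] = xr.imag
    return out
```

Because the rotation is unitary, norms are preserved, and the dot product between a rotated query and key depends only on their relative offset $m - n$; this relative-position property is what attention inherits from RoPE.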