As Large Language Models (LLMs) evolve to handle million-token contexts, the quadratic cost of self-attention has become the primary bottleneck. Prism offers a training-free, spectral-aware solution that breaks this efficiency barrier without sacrificing accuracy.

Block Importance Estimation for Block-Sparse Attention

A typical approach for efficient estimation (block-level ops only):

Coarse-grained Attention + TopP Selection

$$ \bar{\mathbf{q}}_u = \frac{1}{B} \sum_{i \in \mathcal{I}_u} \mathbf{q}_i, \quad \bar{\mathbf{k}}_v = \frac{1}{B} \sum_{j \in \mathcal{I}_v} \mathbf{k}_j \tag{1} $$

$$ \bar{\mathbf{S}} = \text{softmax} \left( \frac{\bar{\mathbf{Q}}\bar{\mathbf{K}}^\top}{\sqrt{d}} \right) \tag{2} $$
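Equations (1) and (2) amount to mean-pooling queries and keys within each block and running attention over the block centroids; a top-p rule then picks, per query block, the smallest set of key blocks covering most of the coarse attention mass. A minimal NumPy sketch (function name, shapes, and the top-p threshold are illustrative, not from the paper):

```python
import numpy as np

def block_importance(Q, K, block_size, top_p=0.9):
    """Coarse-grained block importance estimation (Eqs. 1-2) + TopP selection.

    Q, K: (seq_len, d) query/key matrices; seq_len is assumed to be a
    multiple of block_size. All names here are illustrative.
    """
    n, d = Q.shape
    nb = n // block_size
    # Eq. (1): mean-pool queries and keys within each block of size B
    Qb = Q.reshape(nb, block_size, d).mean(axis=1)   # (nb, d)
    Kb = K.reshape(nb, block_size, d).mean(axis=1)   # (nb, d)
    # Eq. (2): softmax attention over block centroids
    logits = Qb @ Kb.T / np.sqrt(d)                  # (nb, nb)
    logits -= logits.max(axis=-1, keepdims=True)     # numerical stability
    S = np.exp(logits)
    S /= S.sum(axis=-1, keepdims=True)
    # TopP selection: per query block, keep the smallest set of key
    # blocks whose coarse scores sum to at least top_p
    order = np.argsort(-S, axis=-1)
    sorted_S = np.take_along_axis(S, order, axis=-1)
    csum = np.cumsum(sorted_S, axis=-1)
    keep_sorted = (csum - sorted_S) < top_p          # include the block crossing top_p
    mask = np.zeros_like(S, dtype=bool)
    np.put_along_axis(mask, order, keep_sorted, axis=-1)
    return S, mask
```

The returned boolean `mask` would then gate which (query block, key block) tiles the sparse attention kernel actually computes.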


However, naively adopting this approach usually underperforms: prior works often have to fall back on token-level operations for precise pattern detection and search.

A quick recap:

Why Does Coarse-Grained Attention Fall Short?

The Culprit: RoPE and the "Blind Spot"

To understand why coarse-grained attention fails, we have to look at how modern LLMs handle positions. Most state-of-the-art models (like Llama 3 or Qwen) use Rotary Positional Embeddings (RoPE).

RoPE encodes position by rotating feature pairs in the complex plane at different frequencies $\theta_j$:

$$ x_n^{(j)} = x_{\text{nope}}^{(j)} \cdot e^{in\theta_j}, \quad \text{where} \quad \theta_j = b^{-2j/d} \tag{3} $$
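Concretely, Eq. (3) treats each pair of features as a complex number and rotates it by an angle proportional to the token's position, with a per-pair frequency $\theta_j = b^{-2j/d}$. A small NumPy sketch of this rotation (the function name and `base` default are illustrative; `base` plays the role of $b$):

```python
import numpy as np

def rope(x, base=10000.0):
    """Apply RoPE (Eq. 3) to x of shape (seq_len, d), with d even.

    Feature pairs (x[2j], x[2j+1]) are viewed as complex numbers and
    rotated by n * theta_j at position n, where theta_j = base**(-2j/d).
    """
    n, d = x.shape
    j = np.arange(d // 2)
    theta = base ** (-2.0 * j / d)                    # per-pair frequencies
    angles = np.arange(n)[:, None] * theta[None, :]   # (n, d/2) rotation angles
    xc = x[:, 0::2] + 1j * x[:, 1::2]                 # pair features as complex numbers
    xr = xc * np.exp(1j * angles)                     # x_nope * e^{i n theta_j}
    out = np.empty_like(x)
    out[:, 0::2] = xr.real
    out[:, 1::2] = xr.imag
    return out
```

Because the rotation is unitary, norms are preserved, and the dot product between a rotated query and key depends only on their relative offset $m - n$; this relative-position property is what attention inherits from RoPE.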