As Large Language Models (LLMs) evolve to handle million-token contexts, the quadratic cost of self-attention has become the primary bottleneck. Prism offers a training-free, spectral-aware solution that breaks this efficiency barrier without sacrificing accuracy.
A typical approach for efficient estimation, using block-level operations only, is coarse-grained attention followed by TopP selection:
$$ \bar{\mathbf{q}}_u = \frac{1}{B} \sum_{i \in \mathcal{I}_u} \mathbf{q}_i, \quad \bar{\mathbf{k}}_v = \frac{1}{B} \sum_{j \in \mathcal{I}_v} \mathbf{k}_j \tag{1} $$
$$ \bar{\mathbf{S}} = \text{softmax} \left( \frac{\bar{\mathbf{Q}}\bar{\mathbf{K}}^\top}{\sqrt{d}} \right) \tag{2} $$
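Equations (1)–(2) can be sketched in a few lines of NumPy. The function name, the `top_p` default, and the per-query-block selection rule below are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

def block_attention_estimate(Q, K, B, top_p=0.9):
    """Sketch of Eqs. (1)-(2): mean-pool queries/keys into blocks of size B,
    softmax the block-level scores, then keep the smallest set of key blocks
    whose cumulative attention mass reaches top_p (illustrative TopP rule)."""
    n, d = Q.shape
    nb = n // B                                        # number of full blocks
    Qb = Q[: nb * B].reshape(nb, B, d).mean(axis=1)    # \bar{q}_u, Eq. (1)
    Kb = K[: nb * B].reshape(nb, B, d).mean(axis=1)    # \bar{k}_v, Eq. (1)
    logits = Qb @ Kb.T / np.sqrt(d)
    logits -= logits.max(axis=-1, keepdims=True)       # numerical stability
    S = np.exp(logits)
    S /= S.sum(axis=-1, keepdims=True)                 # \bar{S}, Eq. (2)
    # TopP selection: for each query block, keep key blocks covering top_p mass
    order = np.argsort(-S, axis=-1)
    keep = []
    for u in range(nb):
        cum = np.cumsum(S[u, order[u]])
        k = int(np.searchsorted(cum, top_p)) + 1       # smallest covering set
        keep.append(set(order[u, :k].tolist()))
    return S, keep
```

Only the `nb × nb` block-score matrix is ever materialized, which is the whole point: the cost is quadratic in the number of blocks, not in the number of tokens.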

However, simply adopting this approach usually underperforms, so prior works often have to fall back on token-level operations for precise pattern detection and searching.
A quick recap:
To understand why coarse-grained attention fails, we have to look at how modern LLMs handle positions. Most state-of-the-art models (like Llama 3 or Qwen) use Rotary Positional Embeddings (RoPE).
RoPE encodes position by rotating feature pairs in the complex plane at different frequencies $\theta_j$:
$$ x_n^{(j)} = x_{\mathrm{nope}}^{(j)} \cdot e^{in\theta_j}, \quad \text{where} \quad \theta_j = b^{-2j/d} \tag{3} $$
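Equation (3) is easy to implement directly with complex arithmetic: pair up adjacent features, treat each pair as a complex number, and rotate pair $j$ at position $n$ by $n\theta_j$. This is a minimal sketch; the base $b = 10000$ is the value commonly used in Llama-style models, not something stated above:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply RoPE (Eq. 3) to a feature vector x at integer position pos:
    rotate feature pair j in the complex plane by angle pos * theta_j,
    with theta_j = base^{-2j/d}."""
    d = x.shape[-1]
    assert d % 2 == 0, "RoPE pairs features, so d must be even"
    j = np.arange(d // 2)
    theta = base ** (-2.0 * j / d)            # theta_j = b^{-2j/d}
    z = x[..., 0::2] + 1j * x[..., 1::2]      # feature pairs as complex numbers
    z = z * np.exp(1j * pos * theta)          # rotate: multiply by e^{i n theta_j}
    out = np.empty_like(x)
    out[..., 0::2] = z.real                   # interleave back to real layout
    out[..., 1::2] = z.imag
    return out
```

Because each pair is only rotated, the norm of the vector is unchanged; position information lives entirely in the phase, which is exactly what makes the block-mean in Eq. (1) problematic.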