AI · 1h ago
IndexCache Cuts DeepSeek Sparse Attention Bottleneck by Sharing Token Selections Across Layers
IndexCache, a new method from Tsinghua and Z.ai, reduces the O(NL²) indexer cost in DeepSeek's sparse attention (N layers, context length L) by running the indexer in only some layers and letting the remaining layers reuse those token selections. Reuse is possible because the token sets selected by adjacent layers overlap by 70–100%. The method is reported to speed up inference while maintaining quality, addressing a key scaling bottleneck for long-context models.
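The sharing idea can be illustrated with a small NumPy sketch. It is not the paper's implementation: the dot-product scorer stands in for the learned indexer, the fixed stride `share_every` is an assumed way of choosing which layers run the indexer (the summary does not say how they are chosen), and all function names are hypothetical.

```python
import numpy as np

def indexer_topk(q, keys, k):
    """Toy indexer: score every token against the query, keep the top-k positions."""
    scores = keys @ q                       # one score per token, shape (L,)
    k = min(k, scores.shape[0])
    idx = np.argpartition(scores, -k)[-k:]  # positions of the k highest scores
    return np.sort(idx)

def select_tokens_per_layer(queries, keys_per_layer, k, share_every):
    """Run the indexer only on every `share_every`-th layer (assumed policy);
    the layers in between reuse the most recent selection."""
    selections, indexer_calls, cached = [], 0, None
    for layer, (q, keys) in enumerate(zip(queries, keys_per_layer)):
        if layer % share_every == 0:
            cached = indexer_topk(q, keys, k)
            indexer_calls += 1
        selections.append(cached)
    return selections, indexer_calls

def overlap(a, b):
    """Fraction of the tokens in selection `a` that also appear in selection `b`."""
    return len(np.intersect1d(a, b)) / len(a)
```

With 8 layers and `share_every=4`, the indexer runs twice instead of eight times. `overlap` is the kind of measurement behind the 70–100% figure: applied to the selections of two adjacent layers that each ran their own indexer, it gives the share of tokens they agree on.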
Meridian48 take
The paper smartly exploits redundancy across layers, but the real test is whether shared token sets degrade quality on complex reasoning tasks.
sparse-attention deepseek