Paper Detail
Liubomyr Horbatko
Modern sequence models are dominated by Transformers, where self-attention mixes information from the visible context in an input-dependent way. However, when retrieval is not sharp and attention remains diffuse over an effective support $S_{\mathrm{eff}}(t)$, the influence of any individual token is diluted, typically scaling as $O(1/S_{\mathrm{eff}}(t))$ and reaching $O(1/\ell)$ for old tokens in full-prefix settings. Structured state-space models process sequences recurrently through an explicit feedback path; selective variants such as Mamba make this feedback input-dependent, yet when freeze time cannot be sustained over long intervals, their long-range sensitivity decays exponentially with lag. Existing architectures therefore either retrieve from the past in a single read or propagate information through a single feedback chain. We introduce Sessa, a decoder that places attention inside a feedback path, enabling recurrent many-path aggregation within a layer. Under stated assumptions, Sessa admits regimes with a power-law memory tail in lag $\ell$ of order $O(\ell^{-\beta})$ for $0<\beta<1$, which is asymptotically slower than $1/\ell$; moreover, this rate is tight in an explicit diffuse uniform-routing setting where the influence is $\Theta(\ell^{-\beta})$. Under the same conditions, only Sessa among the compared model classes realizes flexible selective retrieval, including non-decaying profiles. Empirically, under matched architectures and training budgets, Sessa achieves the strongest performance on our long-context benchmarks while remaining competitive with Transformer and Mamba style baselines on short-context language modeling.
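The three decay regimes named in the abstract can be compared numerically. The sketch below is only an illustration of the asymptotic rates, not an implementation of Sessa, a Transformer, or Mamba: the function names, the exponential rate `rate = 0.05`, and the exponent `beta = 0.5` are assumptions chosen for the example, and all constant factors are set to 1.

```python
import math


def uniform_attention_influence(lag: int) -> float:
    """Diffuse attention spread evenly over `lag` tokens: influence ~ 1/lag."""
    return 1.0 / lag


def exponential_influence(lag: int, rate: float = 0.05) -> float:
    """Single feedback chain with a fixed contraction: influence ~ exp(-rate * lag).

    `rate` is an assumed, illustrative value.
    """
    return math.exp(-rate * lag)


def power_law_influence(lag: int, beta: float = 0.5) -> float:
    """Power-law memory tail: influence ~ lag**(-beta), with 0 < beta < 1 (assumed 0.5)."""
    return lag ** (-beta)


def decay_table(lags, beta: float = 0.5, rate: float = 0.05):
    """Return (lag, 1/lag, exp(-rate*lag), lag**-beta) for each lag."""
    return [
        (
            lag,
            uniform_attention_influence(lag),
            exponential_influence(lag, rate),
            power_law_influence(lag, beta),
        )
        for lag in lags
    ]


if __name__ == "__main__":
    print(f"{'lag':>6} {'1/lag':>12} {'exp':>12} {'lag^-beta':>12}")
    for lag, uniform, expo, power in decay_table([10, 100, 1000, 10000]):
        print(f"{lag:>6} {uniform:>12.3e} {expo:>12.3e} {power:>12.3e}")
```

With these assumed constants, the exponential profile is the largest at short lags but falls below both alternatives as the lag grows, while the ratio of the power-law profile to the $1/\ell$ profile grows as $\ell^{1-\beta}$.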
@article{horbatko2026sessa,
title = {Sessa: Selective State Space Attention},
author = {Liubomyr Horbatko},
year = {2026},
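abstract = {Modern sequence models are dominated by Transformers, where self-attention mixes information from the visible context in an input-dependent way. However, when retrieval is not sharp and attention remains diffuse over an effective support $S_{\mathrm{eff}}(t)$, the influence of any individual token is diluted, typically scaling as $O(1/S_{\mathrm{eff}}(t))$ and reaching $O(1/\ell)$ for old tokens in full-prefix settings. Structured state-space models process sequences recurrently through an explicit feedback path; selective variants such as Mamba make this feedback input-dependent, yet when freeze time cannot be sustained over long intervals, their long-range sensitivity decays exponentially with lag. Existing architectures therefore either retrieve from the past in a single read or propagate information through a single feedback chain. We introduce Sessa, a decoder that places attention inside a feedback path, enabling recurrent many-path aggregation within a layer. Under stated assumptions, Sessa admits regimes with a power-law memory tail in lag $\ell$ of order $O(\ell^{-\beta})$ for $0<\beta<1$, which is asymptotically slower than $1/\ell$; moreover, this rate is tight in an explicit diffuse uniform-routing setting where the influence is $\Theta(\ell^{-\beta})$. Under the same conditions, only Sessa among the compared model classes realizes flexible selective retrieval, including non-decaying profiles. Empirically, under matched architectures and training budgets, Sessa achieves the strongest performance on our long-context benchmarks while remaining competitive with Transformer and Mamba style baselines on short-context language modeling.},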
url = {https://arxiv.org/abs/2604.18580},
keywords = {cs.LG, cs.AI, cs.CL, Transformers, self-attention, state-space models, recurrent state, long-context modeling, attention-based paths, power-law memory tails, selective retrieval, Sessa, recurrent feedback path, code available, huggingface daily},
eprint = {2604.18580},
archiveprefix = {arXiv},
}