Paper Detail

TIDE: Every Layer Knows the Token Beneath the Context

Ajay Jaiswal, Lauren Hannah, Han-Byul Kim, Duc Hoang, Mehrdad Farajtabar, Minsik Cho

huggingface Score 9.0

Published 2026-05-07 · First seen 2026-05-09

General AI

Abstract

We revisit a universally accepted but under-examined design choice in every modern LLM: a token index is looked up once at the input embedding layer and then permanently discarded. This single-injection assumption induces two structural failures: (i) the Rare Token Problem, where a Zipf-type vocabulary distribution leaves rare-token embeddings chronically under-trained because they receive only a fraction of the cumulative gradient signal that common tokens receive; and (ii) the Contextual Collapse Problem, where limited-parameter models map distributionally similar tokens to indistinguishable hidden states. To address both, we propose TIDE, which augments the standard transformer with EmbeddingMemory: an ensemble of K independent MemoryBlocks that map token indices to context-free semantic vectors, computed once and injected into every layer through a depth-conditioned softmax router with a learnable null bank. We theoretically and empirically establish the benefits of TIDE in addressing the issues associated with single-token identity injection, and show improved performance across multiple language modeling and downstream tasks.
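
The injection mechanism described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class name `EmbeddingMemorySketch`, the plain lookup-table form of each MemoryBlock, the zero-initialised null bank, and the per-layer router logits are all assumptions made for illustration, since the abstract does not specify these details.

```python
import math
import random


def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class EmbeddingMemorySketch:
    """Toy stand-in for the EmbeddingMemory idea (hypothetical, illustrative only)."""

    def __init__(self, vocab_size, dim, num_blocks, num_layers, seed=0):
        rng = random.Random(seed)
        # K independent lookup tables: an assumed form of a "MemoryBlock".
        self.blocks = [
            [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(vocab_size)]
            for _ in range(num_blocks)
        ]
        # Assumed null bank: a single vector, initialised to zero, so routing
        # weight placed on it injects nothing at initialisation.
        self.null_bank = [0.0] * dim
        # Assumed router: one logit per (layer, bank), over K blocks + 1 null bank.
        self.router_logits = [
            [rng.gauss(0.0, 1.0) for _ in range(num_blocks + 1)]
            for _ in range(num_layers)
        ]

    def lookup(self, token_id):
        # Computed once per token: one context-free vector per block, plus the null bank.
        return [block[token_id] for block in self.blocks] + [self.null_bank]

    def inject(self, hidden, banks, layer):
        # Depth-conditioned mixing: in this sketch the weights depend only on the layer index.
        weights = softmax(self.router_logits[layer])
        mixed = [
            sum(w * bank[i] for w, bank in zip(weights, banks))
            for i in range(len(hidden))
        ]
        return [h + m for h, m in zip(hidden, mixed)]


memory = EmbeddingMemorySketch(vocab_size=10, dim=4, num_blocks=3, num_layers=2)
banks = memory.lookup(token_id=7)          # looked up once, reused at every layer
hidden = [0.0] * 4
for layer in range(2):
    # A real transformer layer would also apply attention and an MLP here.
    hidden = memory.inject(hidden, banks, layer)
```

In this sketch, a layer whose router puts all of its weight on the null bank leaves the hidden state unchanged, which is one plausible reading of why a null option would be useful; the paper itself should be consulted for the actual design.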

BibTeX

@misc{jaiswal2026tide,
  title = {TIDE: Every Layer Knows the Token Beneath the Context},
  author = {Ajay Jaiswal and Lauren Hannah and Han-Byul Kim and Duc Hoang and Mehrdad Farajtabar and Minsik Cho},
  year = {2026},
  abstract = {We revisit a universally accepted but under-examined design choice in every modern LLM: a token index is looked up once at the input embedding layer and then permanently discarded. This single-injection assumption induces two structural failures: (i) the Rare Token Problem, where a Zipf-type vocabulary distribution leaves rare-token embeddings chronically under-trained because they receive only a fraction of the cumulative gradient signal that common tokens receive; and (ii) the Contextual Collapse},
  url = {https://huggingface.co/papers/2605.06216},
  keywords = {token index, input embedding layer, diffusion models, Zipf-type distribution, cumulative gradient signal, transformer, MemoryBlocks, context-free semantic vectors, depth-conditioned softmax router, learnable null bank, language modeling, downstream tasks, huggingface daily},
  eprint = {2605.06216},
  archiveprefix = {arXiv},
}
