Paper Detail

DREAM: Dense Retrieval Embeddings via Autoregressive Modeling

Yixuan Tang, Yi Yang

huggingface Score 14.4

Published 2026-06-23 · First seen 2026-06-24

General AI

Abstract

Dense retrieval embedding models are a fundamental component of modern retrieval-based AI systems. Most dense retrievers are trained with contrastive objectives, which require labeled positive and negative document pairs that are often costly and difficult to obtain. In this work, we investigate whether the autoregressive next-token prediction objective of a large language model (LLM) can provide supervision for dense retrieval. The intuition is simple: if a document contains information relevant to a query, conditioning on that document should make the target output easier for the LLM to predict. A key challenge is that the next-token prediction loss is computed inside the LLM, while the retriever is a separate embedding model. To address this challenge, we propose DREAM (Dense Retrieval Embeddings via Autoregressive Modeling), which injects retriever-generated query-document similarity scores into selected attention heads of a frozen LLM. During training, these scores determine how much attention each candidate document receives while the LLM predicts the target output. The resulting prediction loss provides gradients for retriever training through the attention mechanism. We evaluate DREAM on retrieval benchmarks BEIR and RTEB using embedding backbones ranging from 0.5B to 3B parameters. DREAM consistently outperforms existing baselines across different model scales. These results demonstrate that DREAM provides a promising approach for training dense retrievers through autoregressive modeling.
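The gradient path described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the frozen LLM is replaced by a fixed list of hypothetical probabilities (`target_prob_given_doc`), and "injecting scores into attention heads" is reduced to a single softmax over candidate documents. All function names, embeddings, and numbers are assumptions made for illustration. It shows only that a prediction loss computed downstream of the similarity scores yields a gradient that raises the score of the document that makes the target easier to predict.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def prediction_loss(query_emb, doc_embs, target_prob_given_doc):
    """Toy stand-in for the frozen LLM's next-token loss.

    target_prob_given_doc[i] is a HYPOTHETICAL probability that the frozen
    model assigns to the target when it attends only to document i.
    The retriever scores decide how those documents are weighted.
    """
    scores = [dot(query_emb, d) for d in doc_embs]
    weights = softmax(scores)  # attention mass per candidate document
    p_target = sum(w * q for w, q in zip(weights, target_prob_given_doc))
    return -math.log(p_target), weights


def loss_grad_wrt_scores(weights, target_prob_given_doc):
    """Analytic gradient of -log(sum_i w_i q_i) w.r.t. each score s_j.

    With w = softmax(s): d(-log p)/d s_j = -w_j * (q_j - p) / p.
    """
    p_target = sum(w * q for w, q in zip(weights, target_prob_given_doc))
    return [-w * (q - p_target) / p_target
            for w, q in zip(weights, target_prob_given_doc)]


def train_step(query_emb, doc_embs, target_prob_given_doc, lr=0.5):
    """One gradient-descent step on the query embedding only."""
    _, weights = prediction_loss(query_emb, doc_embs, target_prob_given_doc)
    grad_scores = loss_grad_wrt_scores(weights, target_prob_given_doc)
    # Chain rule: s_j = query . doc_j, so d s_j / d query = doc_j.
    grad_query = [sum(g * d[k] for g, d in zip(grad_scores, doc_embs))
                  for k in range(len(query_emb))]
    return [x - lr * g for x, g in zip(query_emb, grad_query)]


# Made-up example: document 0 is the one that helps predict the target.
query = [0.1, 0.2]
docs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
helpfulness = [0.9, 0.1, 0.2]
for _ in range(20):
    query = train_step(query, docs, helpfulness)
```

In this sketch no relevance labels are used: the only supervision is how well the target is predicted under each weighting of the candidates, which is the intuition the abstract states. A real setup would backpropagate through the LLM's attention computation instead of using the closed-form gradient above.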

BibTeX

@misc{tang2026dream,
  title = {DREAM: Dense Retrieval Embeddings via Autoregressive Modeling},
  author = {Yixuan Tang and Yi Yang},
  year = {2026},
  abstract = {Dense retrieval embedding models are a fundamental component of modern retrieval-based AI systems. Most dense retrievers are trained with contrastive objectives, which require labeled positive and negative document pairs that are often costly and difficult to obtain. In this work, we investigate whether the autoregressive next-token prediction objective of a large language model (LLM) can provide supervision for dense retrieval. The intuition is simple: if a document contains information relevan},
  url = {https://huggingface.co/papers/2606.24667},
  keywords = {dense retrieval embedding models, contrastive objectives, large language models, next-token prediction, autoregressive modeling, attention heads, frozen LLM, retrieval benchmarks, BEIR, RTEB, embedding backbones, query-document similarity, attention mechanism, code available, huggingface daily},
  eprint = {2606.24667},
  archiveprefix = {arXiv},
}
