Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction

Zhuofeng Li, Haoxiang Zhang, Cong Wei, Pan Lu, Ping Nie, Yi Lu, Yuyang Bai, Shangbin Feng, Hangxiao Zhu, Ming Zhong, Yuyu Zhang, Jianwen Xie, Yejin Choi, James Zou, Jiawei Han, Wenhu Chen, Jimmy Lin, Dongfu Jiang, Yu Zhang

huggingface Score 17.0

Published 2026-05-03 · First seen 2026-05-09

General AI

Abstract

Modern retrieval systems, whether lexical or semantic, expose a corpus through a fixed similarity interface that compresses access into a single top-k retrieval step before reasoning. This abstraction is efficient, but it becomes a bottleneck for agentic search: exact lexical constraints, sparse clue conjunctions, local context checks, and multi-step hypothesis refinement are difficult to implement by calling a conventional off-the-shelf retriever, and evidence filtered out early cannot be recovered by stronger downstream reasoning. Agentic tasks exacerbate this limitation because they require agents to orchestrate multiple steps, including discovering intermediate entities, combining weak clues, and revising the plan after observing partial evidence. To address this limitation, we study direct corpus interaction (DCI), in which an agent searches the raw corpus directly with general-purpose terminal tools (e.g., grep, file reads, shell commands, lightweight scripts), without any embedding model, vector index, or retrieval API. This approach requires no offline indexing and adapts naturally to evolving local corpora. Across IR benchmarks and end-to-end agentic search tasks, this simple setup substantially outperforms strong sparse, dense, and reranking baselines on several BRIGHT and BEIR datasets, and attains strong accuracy on BrowseComp-Plus and multi-hop QA without relying on any conventional semantic retriever. Our results indicate that as language agents become stronger, retrieval quality depends not only on reasoning ability but also on the resolution of the interface through which the model interacts with the corpus; DCI thereby opens a broader interface-design space for agentic search.
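To make the idea of direct corpus interaction more concrete, here is a minimal Python sketch of three of the operations the abstract names: an exact lexical match, a conjunction of sparse clues, and a local context check. It is an illustration only, not the authors' implementation. The function names (`grep`, `files_with_all_clues`, `context`), the assumption that the corpus is a directory of `.txt` files, and the sample clues are all hypothetical.

```python
import re
from pathlib import Path


def grep(root, pattern, ignore_case=True):
    """Yield (path, line_number, line) for every corpus line matching the regex.

    Assumption: the corpus is a directory tree of UTF-8 .txt files.
    """
    rx = re.compile(pattern, re.IGNORECASE if ignore_case else 0)
    for path in sorted(Path(root).rglob("*.txt")):
        with path.open(encoding="utf-8", errors="replace") as fh:
            for lineno, line in enumerate(fh, start=1):
                if rx.search(line):
                    yield path, lineno, line.rstrip("\n")


def files_with_all_clues(root, clues):
    """Return the files that contain every clue somewhere (file-level AND)."""
    surviving = None
    for clue in clues:
        # re.escape treats each clue as a literal string, not a regex.
        hits = {path for path, _, _ in grep(root, re.escape(clue))}
        surviving = hits if surviving is None else surviving & hits
    return sorted(surviving or [])


def context(path, pattern, window=1):
    """Return the lines around the first match, for a local context check."""
    lines = Path(path).read_text(encoding="utf-8", errors="replace").splitlines()
    rx = re.compile(pattern, re.IGNORECASE)
    for i, line in enumerate(lines):
        if rx.search(line):
            return lines[max(0, i - window): i + window + 1]
    return []
```

An agent loop could call `files_with_all_clues(corpus_dir, ["London", "1815"])` to narrow the candidates, then `context(candidate, "1815")` to inspect the surrounding lines before revising its hypothesis. Nothing is indexed ahead of time, so files added to the directory are visible on the next call.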

BibTeX

@misc{li2026beyond,
  title = {Beyond Semantic Similarity: Rethinking Retrieval for Agentic Search via Direct Corpus Interaction},
  author = {Zhuofeng Li and Haoxiang Zhang and Cong Wei and Pan Lu and Ping Nie and Yi Lu and Yuyang Bai and Shangbin Feng and Hangxiao Zhu and Ming Zhong and Yuyu Zhang and Jianwen Xie and Yejin Choi and James Zou and Jiawei Han and Wenhu Chen and Jimmy Lin and Dongfu Jiang and Yu Zhang},
  year = {2026},
  abstract = {Modern retrieval systems, whether lexical or semantic, expose a corpus through a fixed similarity interface that compresses access into a single top-k retrieval step before reasoning. This abstraction is efficient, but for agentic search, it becomes a bottleneck: exact lexical constraints, sparse clue conjunctions, local context checks, and multi-step hypothesis refinement are difficult to implement by calling a conventional off-the-shelf retriever, and evidence filtered out early cannot be reco},
  url = {https://huggingface.co/papers/2605.05242},
  keywords = {retrieval systems, lexical retrieval, semantic retrieval, corpus interaction, agentic search, direct corpus interaction, terminal tools, sparse retrieval, dense retrieval, reranking, IR benchmarks, BEIR datasets, BrowseComp-Plus, multi-hop QA, code available, huggingface daily},
  eprint = {2605.05242},
  archiveprefix = {arXiv},
}