DR^{3}-Eval: Towards Realistic and Reproducible Deep Research Evaluation

Qianqian Xie, Qingheng Xiong, He Zhu, Tiantian Xia, Xueming Han, Fanyu Meng, Jiakai Wang, Zhiqi Bai, Chengkang Jiang, Zhaohui Wang, Yubin Guo, Yuqing Wen, Jiayang Mao, Zijie Zhang, Shihao Li, Yanghai Wang, Yuxiang Ren, Junlan Feng, Jiaheng Liu

huggingface Score 19.5

Published 2026-04-16 · First seen 2026-04-17

General AI

Abstract

Deep Research Agents (DRAs) aim to solve complex, long-horizon research tasks involving planning, retrieval, multimodal understanding, and report generation, yet their evaluation remains challenging due to dynamic web environments and ambiguous task definitions. We propose DR^{3}-Eval, a realistic and reproducible benchmark for evaluating deep research agents on multimodal, multi-file report generation. DR^{3}-Eval is constructed from authentic user-provided materials and paired with a per-task static research sandbox corpus that simulates open-web complexity while remaining fully verifiable, containing supportive documents, distractors, and noise. Moreover, we introduce a multi-dimensional evaluation framework measuring Information Recall, Factual Accuracy, Citation Coverage, Instruction Following, and Depth Quality, and validate its alignment with human judgments. Experiments with our developed multi-agent system DR^{3}-Agent based on multiple state-of-the-art language models demonstrate that DR^{3}-Eval is highly challenging and reveals critical failure modes in retrieval robustness and hallucination control. Our code and data are publicly available.

BibTeX

@misc{xie2026dr,
  title = {DR$^{3}$-Eval: Towards Realistic and Reproducible Deep Research Evaluation},
  author = {Qianqian Xie and Qingheng Xiong and He Zhu and Tiantian Xia and Xueming Han and Fanyu Meng and Jiakai Wang and Zhiqi Bai and Chengkang Jiang and Zhaohui Wang and Yubin Guo and Yuqing Wen and Jiayang Mao and Zijie Zhang and Shihao Li and Yanghai Wang and Yuxiang Ren and Junlan Feng and Jiaheng Liu},
  year = {2026},
  abstract = {Deep Research Agents (DRAs) aim to solve complex, long-horizon research tasks involving planning, retrieval, multimodal understanding, and report generation, yet their evaluation remains challenging due to dynamic web environments and ambiguous task definitions. We propose DR$^{3}$-Eval, a realistic and reproducible benchmark for evaluating deep research agents on multimodal, multi-file report generation. DR$^{3}$-Eval is constructed from authentic user-provided materials and paired with a per-task },
  url = {https://huggingface.co/papers/2604.14683},
  keywords = {deep research agents, multimodal understanding, report generation, research sandbox corpus, multi-dimensional evaluation framework, information recall, factual accuracy, citation coverage, instruction following, depth quality, hallucination control, multi-agent system, state-of-the-art language models, code available, huggingface daily},
  eprint = {2604.14683},
  archiveprefix = {arXiv},
}
