Paper Detail

Cited but Not Verified: Parsing and Evaluating Source Attribution in LLM Deep Research Agents

Hailey Onweller, Elias Lumer, Austin Huber, Pia Ramchandani, Vamse Kumar Subbiah, Corey Feld

Browse

Workflow Queues

arxiv Score 16.8

Published 2026-05-07 · First seen 2026-05-09

General AI

Open paper source

Abstract

Large language models (LLMs) power deep research agents that synthesize information from hundreds of web sources into cited reports, yet these citations cannot be reliably verified. Current approaches either trust models to self-cite accurately, risking bias, or employ retrieval-augmented generation (RAG) that does not validate source accessibility, relevance, or factual consistency. We introduce the first source attribution evaluation framework that uses a reproducible AST parser to extract and evaluate inline citations from LLM-generated Markdown reports at scale. Unlike methods that verify claims in isolation, our framework closes the loop by retrieving the actual cited content, enabling human or model evaluators to judge each citation against its source. Citations are evaluated along three dimensions. (1) Link Works verifies URL accessibility, (2) Relevant Content measures topical alignment, and (3) Fact Check validates factual accuracy against source content. We benchmark 14 closed-source and open-source LLMs across three evaluation dimensions using rubric-based LLM-as-a-judge evaluators calibrated through human review. Our results reveal that even the strongest frontier models maintain link validity above 94% and relevance above 80%, yet achieve only 39-77% factual accuracy, while fewer than half of open-source models successfully generate cited reports in a one-shot setting. Ablation studies on research depth show that Fact Check accuracy drops by approximately 42% on average across two frontier models as tool calls scale from 2 to 150, demonstrating that more retrieval does not produce more accurate citations. These findings reveal a critical disconnect between surface-level citation quality and factual reliability, and our framework provides the evaluation infrastructure to assess the disconnect.

Workflow Status

Review status: pending
Role: unreviewed
Read priority: now
Vote: Not set.
Saved: no
Collections: Not filed yet.
Next action: Not filled yet.

Reading Brief

No structured notes yet. Add `summary_sections`, `why_relevant`, `claim_impact`, or `next_action` in `papers.jsonl` to enrich this view.

Why It Surfaced

No ranking explanation is available yet.

BibTeX

@article{onweller2026cited,
  title = {Cited but Not Verified: Parsing and Evaluating Source Attribution in LLM Deep Research Agents},
  author = {Hailey Onweller and Elias Lumer and Austin Huber and Pia Ramchandani and Vamse Kumar Subbiah and Corey Feld},
  year = {2026},
  abstract = {Large language models (LLMs) power deep research agents that synthesize information from hundreds of web sources into cited reports, yet these citations cannot be reliably verified. Current approaches either trust models to self-cite accurately, risking bias, or employ retrieval-augmented generation (RAG) that does not validate source accessibility, relevance, or factual consistency. We introduce the first source attribution evaluation framework that uses a reproducible AST parser to extract and},
  url = {https://arxiv.org/abs/2605.06635},
  keywords = {cs.CL},
  eprint = {2605.06635},
  archiveprefix = {arXiv},
}

Metadata

{}