AEGIS: A Holistic Benchmark for Evaluating Forensic Analysis of AI-Generated Academic Images

Bo Zhang, Tzu-Yen Ma, Zichen Tang, Junpeng Ding, Zirui Wang, Yizhuo Zhao, Peilin Gao, Zijie Xi, Zixin Ding, Haiyang Sun, Haocheng Gao, Yuan Liu, Liangjia Wang, Yiling Huang, Yujie Wang, Yuyue Zhang, Ronghui Xi, Yuanze Li, Jiacheng Liu, Zhongjun Yang, Haihong E

arxiv Score 17.3

Published 2026-04-30 · First seen 2026-05-01

General AI

Abstract

We introduce AEGIS, A holistic benchmark for Evaluating forensic analysis of AI-Generated academic ImageS. Compared to existing benchmarks, AEGIS features three key advances: (1) Domain-Specific Complexity: covering seven academic categories with 39 fine-grained subtypes, exposing intrinsic forensic difficulty, where even GPT-5.1 reaches 48.80% overall performance and expert models achieve only limited localization accuracy (IoU 30.09%); (2) Diverse Forgery Simulations: modeling four prevalent academic forgery strategies across 25 generative models, with 11 yielding average forensic accuracy below 50%, showing that forensics lag behind generative advances; and (3) Multi-Dimensional Forensic Evaluation: jointly assessing detection, reasoning, and localization, revealing complementary strengths between model families, with multimodal large language models (MLLMs) at 84.74% accuracy in textual artifact recognition and expert detectors peaking at 79.54% accuracy in binary authenticity detection. By evaluating 25 leading MLLMs, nine expert models, and one unified multimodal understanding and generation model, AEGIS serves as a diagnostic testbed exposing fundamental limitations in academic image forensics.
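The abstract reports localization accuracy as IoU (intersection over union). As a point of reference, here is a minimal sketch of how IoU is commonly computed between a predicted forgery mask and a ground-truth mask. The function name `mask_iou`, the nested-list mask representation, and the empty-mask convention are illustrative assumptions, not details taken from the AEGIS paper, whose exact evaluation protocol is not described on this page.

```python
from typing import Sequence

# A binary mask as rows of 0/1 values (hypothetical representation for illustration).
Mask = Sequence[Sequence[int]]


def mask_iou(pred: Mask, truth: Mask) -> float:
    """Intersection over union of two same-shaped binary masks."""
    if len(pred) != len(truth) or any(
        len(p_row) != len(t_row) for p_row, t_row in zip(pred, truth)
    ):
        raise ValueError("masks must have the same shape")

    intersection = 0
    union = 0
    for p_row, t_row in zip(pred, truth):
        for p, t in zip(p_row, t_row):
            p_on, t_on = bool(p), bool(t)
            intersection += p_on and t_on  # pixel flagged in both masks
            union += p_on or t_on          # pixel flagged in at least one mask

    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union


# Toy example: 2 overlapping pixels out of 4 flagged in total.
pred = [[1, 1, 0],
        [0, 1, 0]]
truth = [[0, 1, 1],
         [0, 1, 0]]
print(mask_iou(pred, truth))  # 0.5
```

Under this definition, an IoU near 30% means that the predicted and true forged regions share less than a third of their combined area.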

BibTeX

@article{zhang2026aegis,
  title = {AEGIS: A Holistic Benchmark for Evaluating Forensic Analysis of AI-Generated Academic Images},
  author = {Bo Zhang and Tzu-Yen Ma and Zichen Tang and Junpeng Ding and Zirui Wang and Yizhuo Zhao and Peilin Gao and Zijie Xi and Zixin Ding and Haiyang Sun and Haocheng Gao and Yuan Liu and Liangjia Wang and Yiling Huang and Yujie Wang and Yuyue Zhang and Ronghui Xi and Yuanze Li and Jiacheng Liu and Zhongjun Yang and Haihong E},
  year = {2026},
  abstract = {We introduce AEGIS, A holistic benchmark for Evaluating forensic analysis of AI-Generated academic ImageS. Compared to existing benchmarks, AEGIS features three key advances: (1) Domain-Specific Complexity: covering seven academic categories with 39 fine-grained subtypes, exposing intrinsic forensic difficulty, where even GPT-5.1 reaches 48.80\% overall performance and expert models achieve only limited localization accuracy (IoU 30.09\%); (2) Diverse Forgery Simulations: modeling four prevalent a},
  url = {https://arxiv.org/abs/2604.28177},
  keywords = {cs.CV, cs.CY},
  eprint = {2604.28177},
  archiveprefix = {arXiv},
}
