Paper Detail
Cao-Tri Nguyen, Nguyen-Khoa Luong, Vinh-Tiep Nguyen, Minh-Triet Tran
Photographs frequently contain \emph{visual distractors} besides foregrounds and backgrounds of the intended subject, competing for attention and weakening composition. While modern editing tools streamline object removal, identifying which objects to remove remains a mostly manual process. Existing saliency models and open-vocabulary detectors operate without subject awareness, failing to adapt to shifting user intent. Furthermore, context-agnostic removal may disrupt the scene's semantic coherence (e.g., keep the person but remove the chair they are sitting on). To address these limitations, we formalize the task of subject-aware distractor localization, which identifies distractors while retaining compositionally essential objects. This paper introduces \textsc{SADL}, the first real-world benchmark for this task, comprising 1,800 subject-aware cases across 1,000 photographs to enable systematic evaluation and facilitate future research. In total, there are 14,617 annotated candidates, including a robust set of 1,938 hard negatives to stress-test exclusion calibration. We evaluate seven proprietary and open-weight Vision-Language Models (VLMs) on a sequential pipeline of distractor classification followed by exclusion filtering, structured around five inclusion factors and three contextual exclusion rules. Our analysis reveals that VLMs are highly capable of identifying distractors, but then over-apply exclusion, which systematically suppresses true distractors at scale. By exposing this critical bottleneck, \textsc{SADL} provides a foundational diagnostic tool to advance subject-conditioned reasoning in multimodal systems.
No structured notes yet. Add `summary_sections`, `why_relevant`, `claim_impact`, or `next_action` in `papers.jsonl` to enrich this view.
No ranking explanation is available yet.
No tags.
@article{nguyen2026sadl,
title = {SADL: What to Ignore? A Benchmark for Subject-Aware Distractor Localization},
author = {Cao-Tri Nguyen and Nguyen-Khoa Luong and Vinh-Tiep Nguyen and Minh-Triet Tran},
year = {2026},
abstract = {Photographs frequently contain \textbackslash{}emph\{visual distractors\} besides foregrounds and backgrounds of the intended subject, competing for attention and weakening composition. While modern editing tools streamline object removal, identifying which objects to remove remains a mostly manual process. Existing saliency models and open-vocabulary detectors operate without subject awareness, failing to adapt to shifting user intent. Furthermore, context-agnostic removal may disrupt the scene's semantic coher},
url = {https://arxiv.org/abs/2606.30393},
keywords = {cs.CV},
eprint = {2606.30393},
archiveprefix = {arXiv},
}
{}