Paper Detail
Song Tang, Shuming Hu, Xincheng Shuai, Henghui Ding, Yu-Gang Jiang
Existing referring segmentation models passively process static images captured from fixed perspectives, limiting their applicability in Embodied AI, where agents must perform active perception in the continuous 360$^\circ$ environments. To bridge this gap, we introduce a novel task: Active Panoramic Referring Segmentation (APRS). In this setting, an agent is required to adjust its viewing direction ($Δθ, Δφ$) to explore the 360$^\circ$ environment, seeking the object specified by a user instruction for segmentation. To tackle this challenging task, we propose PanoSeeker, a memory-augmented agent for efficient APRS. Rather than relying on heuristic scanning, PanoSeeker integrates a Vision-Language Model (VLM) with EgoSphere, an explicit spatial visual memory. By progressively integrating sequential local observations into a unified 360$^\circ$ representation, EgoSphere enables the agent to plan efficient and non-redundant search trajectories. Once the target is found, the agent performs active viewpoint alignment and outputs the segmentation mask. Furthermore, we curate an expert-annotated search trajectory dataset with memory timelines for Supervised Fine-Tuning, followed by Reinforcement Learning post-training to explicitly optimize PanoSeeker's exploration efficiency. Extensive experiments on our newly established APRS benchmark demonstrate that PanoSeeker achieves superior search efficiency and segmentation accuracy, significantly outperforming adapted state-of-the-art baselines.
No structured notes yet. Add `summary_sections`, `why_relevant`, `claim_impact`, or `next_action` in `papers.jsonl` to enrich this view.
No ranking explanation is available yet.
No tags.
@article{tang2026seek,
title = {Seek to Segment: Active Perception for Panoramic Referring Segmentation},
author = {Song Tang and Shuming Hu and Xincheng Shuai and Henghui Ding and Yu-Gang Jiang},
year = {2026},
abstract = {Existing referring segmentation models passively process static images captured from fixed perspectives, limiting their applicability in Embodied AI, where agents must perform active perception in the continuous 360\$\textasciicircum{}\textbackslash{}circ\$ environments. To bridge this gap, we introduce a novel task: Active Panoramic Referring Segmentation (APRS). In this setting, an agent is required to adjust its viewing direction (\$Δθ, Δφ\$) to explore the 360\$\textasciicircum{}\textbackslash{}circ\$ environment, seeking the object specified by a user instruc},
url = {https://arxiv.org/abs/2607.02497},
keywords = {cs.CV},
eprint = {2607.02497},
archiveprefix = {arXiv},
}
{}