Paper Detail
Seokmin Lee, Yunghee Lee, Byeonghyun Pak, Byeongju Woo
For robotic agents operating in dynamic environments, learning visual state representations from streaming video observations is essential for sequential decision making. Recent self-supervised learning methods have shown strong transferability across vision tasks, but they do not explicitly address what a good visual state should encode. We argue that effective visual states must capture what-is-where by jointly encoding the semantic identities of scene elements and their spatial locations, enabling reliable detection of subtle dynamics across observations. To this end, we propose CroBo, a visual state representation learning framework based on a global-to-local reconstruction objective. Given a reference observation compressed into a compact bottleneck token, CroBo learns to reconstruct heavily masked patches in a local target crop from sparse visible cues, using the global bottleneck token as context. This learning objective encourages the bottleneck token to encode a fine-grained representation of scene-wide semantic entities, including their identities, spatial locations, and configurations. As a result, the learned visual states reveal how scene elements move and interact over time, supporting sequential decision making. We evaluate CroBo on diverse vision-based robot policy learning benchmarks, where it achieves state-of-the-art performance. Reconstruction analyses and perceptual straightness experiments further show that the learned representations preserve pixel-level scene composition and encode what-moves-where across observations. Project page available at: https://seokminlee-chris.github.io/CroBo-ProjectPage.
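The global-to-local objective described above can be illustrated with a minimal NumPy sketch. All shapes, the mean-pool compression, and the linear decoder are placeholder assumptions for illustration only, not CroBo's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: a reference frame tokenized into N patch embeddings,
# and a local target crop tokenized into M patches, each of dimension D.
N, M, D = 196, 64, 32
ref_patches = rng.normal(size=(N, D))      # reference observation patches
crop_patches = rng.normal(size=(M, D))     # local target crop patches

# 1) Compress the reference into one bottleneck token (mean-pooling as a
#    stand-in for the learned compression the paper describes).
bottleneck = ref_patches.mean(axis=0)      # shape (D,)

# 2) Heavily mask the crop: keep only a sparse set of visible patches.
mask_ratio = 0.9
num_visible = max(1, int(M * (1 - mask_ratio)))
perm = rng.permutation(M)
visible_idx, masked_idx = perm[:num_visible], perm[num_visible:]

# 3) Predict the masked patches from the sparse visible cues plus the
#    global bottleneck context (an untrained linear head as a placeholder
#    decoder; CroBo would use a learned decoder here).
context = np.concatenate([crop_patches[visible_idx].mean(axis=0), bottleneck])
W = rng.normal(scale=0.01, size=(2 * D, D))  # placeholder decoder weights
pred = context @ W                           # shared prediction, shape (D,)

# 4) Reconstruction loss over the masked patches (MSE): the training signal
#    that pushes the bottleneck token to encode what-is-where.
loss = np.mean((crop_patches[masked_idx] - pred) ** 2)
```

In training, minimizing this masked-patch reconstruction loss is what forces the single bottleneck token to carry scene-wide identity and location information, since the sparse visible cues alone are insufficient.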
@misc{lee2026pixel,
title = {Pixel-level Scene Understanding in One Token: Visual States Need What-is-Where Composition},
author = {Seokmin Lee and Yunghee Lee and Byeonghyun Pak and Byeongju Woo},
year = {2026},
abstract = {For robotic agents operating in dynamic environments, learning visual state representations from streaming video observations is essential for sequential decision making. Recent self-supervised learning methods have shown strong transferability across vision tasks, but they do not explicitly address what a good visual state should encode. We argue that effective visual states must capture what-is-where by jointly encoding the semantic identities of scene elements and their spatial locations, enabling reliable detection of subtle dynamics across observations. To this end, we propose CroBo, a visual state representation learning framework based on a global-to-local reconstruction objective. Given a reference observation compressed into a compact bottleneck token, CroBo learns to reconstruct heavily masked patches in a local target crop from sparse visible cues, using the global bottleneck token as context. This learning objective encourages the bottleneck token to encode a fine-grained representation of scene-wide semantic entities, including their identities, spatial locations, and configurations. As a result, the learned visual states reveal how scene elements move and interact over time, supporting sequential decision making. We evaluate CroBo on diverse vision-based robot policy learning benchmarks, where it achieves state-of-the-art performance. Reconstruction analyses and perceptual straightness experiments further show that the learned representations preserve pixel-level scene composition and encode what-moves-where across observations.},
url = {https://huggingface.co/papers/2603.13904},
keywords = {visual state representation learning, self-supervised learning, global-to-local reconstruction, bottleneck token, masked patches, sparse visible cues, scene-wide semantic entities, sequential decision making, robotic agents, dynamic environments, vision-based robot policy learning},
eprint = {2603.13904},
archiveprefix = {arXiv},
}