Paper Detail
Ning Gao, Jinliang Zheng, Xing Gao, Haoxiang Ma, Hanqing Wang, Yukai Wang, Jiantong Chen, Zanxin Chen, Shujie Zhang, Mingda Jia, Xuekun Jiang, Zihou Zhu, Xinyu Li, Shuai Wang, Hao Li, Wenzhe Cai, Yuqiang Yang, Xudong Xu, Zhaoyang Lyu, Yao Mu, Tai Wang, Jiangmiao Pang, Jia Zeng, Weinan Zhang, Chunhua Shen
We present EBench, a simulation benchmark that diagnoses generalist mobile manipulation policies beyond a single success-rate scalar. EBench comprises 26 diverse and challenging manipulation tasks annotated along 5 capability dimensions and 4 generalization dimensions. We evaluate state-of-the-art generalist manipulation models including $π_0$, $π_{0.5}$, XVLA, and InternVLA-A1, and reveal that models with similar overall success rates exhibit strikingly different capability profiles: $π_{0.5}$ achieves the highest test success rate and the best train--test retention, whereas InternVLA-A1 dominates mobile manipulation but collapses on dexterous tasks, and XVLA exhibits strengths on a set of atomic skills disjoint from those of the other policies. Beyond capability profiling, EBench analyzes generalization from 4 representative perspectives, identifying the impact of different distribution shift factors. The results reveal the strengths and weaknesses that an overall score hides. We hope this benchmark offers a broad set of diagnostic signals to guide iteration on generalist manipulation models.
@article{gao2026ebench,
title = {EBench: Elemental Diagnosis of Generalist Mobile Manipulation Policies},
author = {Ning Gao and Jinliang Zheng and Xing Gao and Haoxiang Ma and Hanqing Wang and Yukai Wang and Jiantong Chen and Zanxin Chen and Shujie Zhang and Mingda Jia and Xuekun Jiang and Zihou Zhu and Xinyu Li and Shuai Wang and Hao Li and Wenzhe Cai and Yuqiang Yang and Xudong Xu and Zhaoyang Lyu and Yao Mu and Tai Wang and Jiangmiao Pang and Jia Zeng and Weinan Zhang and Chunhua Shen},
year = {2026},
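abstract = {We present EBench, a simulation benchmark that diagnoses generalist mobile manipulation policies beyond a single success-rate scalar. EBench comprises 26 diverse and challenging manipulation tasks annotated along 5 capability dimensions and 4 generalization dimensions. We evaluate state-of-the-art generalist manipulation models including $\pi_0$, $\pi_{0.5}$, XVLA, and InternVLA-A1, and reveal that models with near success rates exhibit strikingly different capability profiles: $\pi_{0.5}$ achieves the highest test success rate and the best train--test retention, whereas InternVLA-A1 dominates mobile manipulation but collapses on dexterous tasks, and XVLA exhibits strengths on a disjoint set of atomic skills compared to other policies. Beyond capability profiling, EBench analyzes the generalization ability from 4 representative perspectives, identifying the impact of different distribution shift factors. The results reveal strengths and weaknesses of models behind an overall score. We hope this benchmark offers a broad set of diagnostic signals to guide iteration on generalist manipulation models.},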
url = {https://arxiv.org/abs/2606.18239},
keywords = {cs.RO, generalist manipulation policies, simulation benchmark, capability dimensions, generalization dimensions, success-rate scalar, manipulation tasks, policy evaluation, distribution shift factors, code available, huggingface daily},
eprint = {2606.18239},
archiveprefix = {arXiv},
}