AcademiClaw: When Students Set Challenges for AI Agents
Junjie Yu, Pengrui Lu, Weiye Si, Hongliang Lu, Jiabao Wu, Kaiwen Tao, Kun Wang, Lingyu Yang, Qiran Zhang, Xiuting Guo, Xuanyu Wang, Yang Wang, Yanjie Wang, Yi Yang, Zijian Hu, Ziyi Yang, Zonghan Zhou, Binghao Qiang, Borui Zhang, Chenning Li, Enchang Zhang, Feifan Chen, Feng Jian, Fengyin Sun, Hao Qiu, Hao Zheng, Haoran Zhu, Hongyu Liu, Jianbin Deng, Jiaxin Song, Jiaying Chi, Jiayou Shi, Jie Fang, Jinghui Zhong, Jingyu Zhou, Jinze Li, Junfeng Yi, Junyan Yu, Junzhi Xue, Ni Song, Pengyi Chen, Qi Chen, Quansheng Li, Rui Tao, Shenghai Gong, Shenhang Lu, Tianqi Shen, Tianxiang Zhu, Tiehan Kang, Tingyu Li, Wendi Wu, Xiao Shen, Xiao Zhou, Xiaotao Zhang, Xinrong Li, Xuankun Yang, Xun Zhang, Yan Li, Ye Lu, Yi Wang, Yibo Zhou, Yichi Zhang, Yihao Sun, Yijun Huang, Yixin Zhu, Yixuan Wu, Yuchen Sun, Yue Wu, Yuheng Sun, Yukun Li, Yutian Tu, Yuxuan Qin, Yuzhuo Wu, Zeyu Li, Zhengyu Lou, Zhenning Ran, Zizhu He, Pengfei Liu
Benchmarks within the OpenClaw ecosystem have thus far evaluated exclusively assistant-level tasks, leaving the academic-level capabilities of OpenClaw largely unexamined. We introduce AcademiClaw, a bilingual benchmark of 80 complex, long-horizon tasks sourced directly from university students' real academic workflows -- homework, research projects, competitions, and personal projects -- that they found current AI agents unable to solve effectively. Curated from 230 student-submitted candidates through rigorous expert review, the final task set spans 25+ professional domains, ranging from olympiad-level mathematics and linguistics problems to GPU-intensive reinforcement learning and full-stack system debugging, with 16 tasks requiring CUDA GPU execution. Each task executes in an isolated Docker sandbox and is scored on task completion by multi-dimensional rubrics combining six complementary techniques, with an independent five-category safety audit providing additional behavioral analysis. Experiments on six frontier models show that even the best achieves only a 55% pass rate. Further analysis uncovers sharp capability boundaries across task domains, divergent behavioral strategies among models, and a disconnect between token consumption and output quality, providing fine-grained diagnostic signals beyond what aggregate metrics reveal. We hope that AcademiClaw and its open-sourced data and code can serve as a useful resource for the OpenClaw community, driving progress toward agents that are more capable and versatile across the full breadth of real-world academic demands. All data and code are available at https://github.com/GAIR-NLP/AcademiClaw.
@misc{yu2026academiclaw,
title = {AcademiClaw: When Students Set Challenges for AI Agents},
author = {Junjie Yu and Pengrui Lu and Weiye Si and Hongliang Lu and Jiabao Wu and Kaiwen Tao and Kun Wang and Lingyu Yang and Qiran Zhang and Xiuting Guo and Xuanyu Wang and Yang Wang and Yanjie Wang and Yi Yang and Zijian Hu and Ziyi Yang and Zonghan Zhou and Binghao Qiang and Borui Zhang and Chenning Li and Enchang Zhang and Feifan Chen and Feng Jian and Fengyin Sun and Hao Qiu and Hao Zheng and Haoran Zhu and Hongyu Liu and Jianbin Deng and Jiaxin Song and Jiaying Chi and Jiayou Shi and Jie Fang and Jinghui Zhong and Jingyu Zhou and Jinze Li and Junfeng Yi and Junyan Yu and Junzhi Xue and Ni Song and Pengyi Chen and Qi Chen and Quansheng Li and Rui Tao and Shenghai Gong and Shenhang Lu and Tianqi Shen and Tianxiang Zhu and Tiehan Kang and Tingyu Li and Wendi Wu and Xiao Shen and Xiao Zhou and Xiaotao Zhang and Xinrong Li and Xuankun Yang and Xun Zhang and Yan Li and Ye Lu and Yi Wang and Yibo Zhou and Yichi Zhang and Yihao Sun and Yijun Huang and Yixin Zhu and Yixuan Wu and Yuchen Sun and Yue Wu and Yuheng Sun and Yukun Li and Yutian Tu and Yuxuan Qin and Yuzhuo Wu and Zeyu Li and Zhengyu Lou and Zhenning Ran and Zizhu He and Pengfei Liu},
year = {2026},
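abstract = {Benchmarks within the OpenClaw ecosystem have thus far evaluated exclusively assistant-level tasks, leaving the academic-level capabilities of OpenClaw largely unexamined. We introduce AcademiClaw, a bilingual benchmark of 80 complex, long-horizon tasks sourced directly from university students' real academic workflows -- homework, research projects, competitions, and personal projects -- that they found current AI agents unable to solve effectively. Curated from 230 student-submitted candidates through rigorous expert review, the final task set spans 25+ professional domains, ranging from olympiad-level mathematics and linguistics problems to GPU-intensive reinforcement learning and full-stack system debugging, with 16 tasks requiring CUDA GPU execution. Each task executes in an isolated Docker sandbox and is scored on task completion by multi-dimensional rubrics combining six complementary techniques, with an independent five-category safety audit providing additional behavioral analysis. Experiments on six frontier models show that even the best achieves only a 55\% pass rate. Further analysis uncovers sharp capability boundaries across task domains, divergent behavioral strategies among models, and a disconnect between token consumption and output quality, providing fine-grained diagnostic signals beyond what aggregate metrics reveal. We hope that AcademiClaw and its open-sourced data and code can serve as a useful resource for the OpenClaw community, driving progress toward agents that are more capable and versatile across the full breadth of real-world academic demands. All data and code are available at https://github.com/GAIR-NLP/AcademiClaw.},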
url = {https://huggingface.co/papers/2605.02661},
keywords = {OpenClaw, academic workflows, task completion, multi-dimensional rubrics, safety audit, frontier models, token consumption, output quality, code available, huggingface daily},
eprint = {2605.02661},
archiveprefix = {arXiv},
}