Paper Detail

SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents

Ziao Zhang, Kou Shi, Shiting Huang, Avery Nie, Yu Zeng, Yiming Zhao, Zhen Fang, Qishen Su, Haibo Qiu, Wei Yang, Qingnan Ren, Shun Zou, Wenxuan Huang, Lin Chen, Zehui Chen, Feng Zhao

arXiv Score 10.3

Published 2026-04-19 · First seen 2026-04-21

Research Track A · General AI

Abstract

As the capability frontier of autonomous agents continues to expand, they are increasingly able to complete specialized tasks through plug-and-play external skills. Yet current benchmarks mostly test whether models can use provided skills, leaving open whether they can discover skills from experience, repair them after failure, and maintain a coherent library over time. We introduce SkillFlow, a benchmark of 166 tasks across 20 families in which task construction within each family follows a Domain-Agnostic Execution Flow (DAEF) that defines an agent workflow framework, allowing these tasks to share a consistent workflow. Agents are evaluated under an Agentic Lifelong Learning protocol in which they begin without skills, solve tasks sequentially within each family, externalize lessons through trajectory- and rubric-driven skill patches, and carry the updated library forward. Experiments reveal a substantial capability gap. For Claude Opus 4.6, lifelong skill evolution improves task success from 62.65% to 71.08% (+8.43 points). However, high skill usage does not necessarily imply high utility: Kimi K2.5 gains only +0.60 points despite 66.87% skill usage, while Qwen-Coder-Next reaches only a 44.58% task completion rate and still regresses relative to the vanilla setting. SkillFlow contributes a structured testbed for this direction and an in-depth empirical analysis of skill discovery, patching, transfer, and their failure modes under lifelong evaluation.
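
The percentages in the abstract can be sanity-checked against the stated benchmark size. The sketch below assumes every reported rate is computed over all 166 tasks, which the abstract does not state explicitly. The helper names `implied_count` and `as_percent` are illustrative and do not come from the paper.

```python
# Assumption: each reported percentage is a fraction of the full 166-task benchmark.
TOTAL_TASKS = 166  # benchmark size stated in the abstract


def implied_count(percent: float, total: int = TOTAL_TASKS) -> int:
    """Return the whole number of tasks closest to `percent` % of `total`."""
    return round(percent / 100 * total)


def as_percent(count: int, total: int = TOTAL_TASKS) -> float:
    """Convert a task count back to a percentage rounded to two decimals."""
    return round(count / total * 100, 2)


# Figures quoted in the abstract (labels are shorthand, not the paper's terms).
reported = {
    "Claude Opus 4.6, vanilla success": 62.65,
    "Claude Opus 4.6, lifelong success": 71.08,
    "Kimi K2.5, skill usage": 66.87,
    "Qwen-Coder-Next, completion": 44.58,
}

for label, percent in reported.items():
    count = implied_count(percent)
    # A round trip that reproduces the reported value is consistent with
    # the percentage being a whole number of tasks out of 166.
    print(f"{label}: {percent}% ~ {count}/{TOTAL_TASKS} (round trip {as_percent(count)}%)")

# Gain for Claude Opus 4.6 expressed in tasks and in percentage points.
gain_tasks = implied_count(71.08) - implied_count(62.65)
gain_points = round(as_percent(implied_count(71.08)) - as_percent(implied_count(62.65)), 2)
print(f"Implied gain: {gain_tasks} tasks, {gain_points} points")
```

Under that assumption, each reported figure round-trips to a whole number of tasks, and the +0.60-point gain reported for Kimi K2.5 would correspond to a single additional task out of 166.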

Workflow Status

Review status: pending
Role: unreviewed
Read priority: now
Vote: Not set.
Saved: no
Collections: Not filed yet.
Next action: Not filled yet.

BibTeX

@article{zhang2026skillflow,
  title = {SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents},
  author = {Ziao Zhang and Kou Shi and Shiting Huang and Avery Nie and Yu Zeng and Yiming Zhao and Zhen Fang and Qishen Su and Haibo Qiu and Wei Yang and Qingnan Ren and Shun Zou and Wenxuan Huang and Lin Chen and Zehui Chen and Feng Zhao},
  year = {2026},
  abstract = {As the capability frontier of autonomous agents continues to expand, they are increasingly able to complete specialized tasks through plug-and-play external skills. Yet current benchmarks mostly test whether models can use provided skills, leaving open whether they can discover skills from experience, repair them after failure, and maintain a coherent library over time. We introduce SkillFlow, a benchmark of 166 tasks across 20 families in which task construction within each family follows a Dom},
  url = {https://arxiv.org/abs/2604.17308},
  keywords = {cs.AI, autonomous agents, plug-and-play external skills, Domain-Agnostic Execution Flow, Agentic Lifelong Learning, skill discovery, skill patching, skill transfer, lifelong learning protocol, code available, huggingface daily},
  eprint = {2604.17308},
  archiveprefix = {arXiv},
}
