PANDO: Efficient Multimodal AI Agents via Online Skill Distillation

Yubo Li, Yidi Miao, Yuntian Shen, Yuxin Liu

arxiv Score 18.5

Published 2026-05-24 · First seen 2026-05-31

Research Track B · General AI

Abstract

Recent advances in multimodal web agents often rely on increased inference-time computation, including rollout search, verifier passes, offline skill discovery, and specialist model stacks. This raises a central question: can a web agent become more efficient as it accumulates experience, rather than more expensive? We first analyze trajectories from VisualWebArena and identify three recurring sources of inefficiency: repeat-action loops, hidden discovery costs, and low prompt-cache reuse. We then introduce PANDO, a single-rollout online skill-distillation framework that maintains a structured Skill Library and combines progress reflection, confidence-based skill demotion, hierarchical routing, visual compression, and cache-aware prompting. On the full set of 910 VisualWebArena tasks, PANDO achieves a 58.3% success rate, outperforming SGV (54.0%) and our WALT reproduction (45.2%), while using 58% fewer tokens than SGV and 61% fewer tokens than WALT, without any pre-evaluation discovery budget. A 300-task ablation further shows that rules and routines provide most of the success gains, while routing, compression, and cache-aware prompting convert the larger skill library into lower marginal token cost. Finally, we introduce three trajectory-level efficiency metrics -- Action Repetition Rate, Step Overhead Ratio, and Prompt Cache Utilization -- to make efficiency visible beyond terminal success.
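
The abstract names the three trajectory-level metrics but does not define them. As a rough illustration only, the sketch below computes them under assumed definitions: Action Repetition Rate as the fraction of steps that repeat the immediately preceding action, Step Overhead Ratio as steps taken divided by a reference step count, and Prompt Cache Utilization as cached prompt tokens over total prompt tokens. The `Step` record, its field names, and the `reference_steps` parameter are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One agent step in a trajectory (hypothetical record, not from the paper)."""
    action: str         # serialized action, e.g. "click [42]"
    prompt_tokens: int  # total prompt tokens sent for this step
    cached_tokens: int  # prompt tokens served from the prompt cache


def action_repetition_rate(steps: List[Step]) -> float:
    """Assumed definition: share of steps identical to the preceding action."""
    if len(steps) < 2:
        return 0.0
    repeats = sum(1 for prev, cur in zip(steps, steps[1:])
                  if prev.action == cur.action)
    return repeats / (len(steps) - 1)


def step_overhead_ratio(steps: List[Step], reference_steps: int) -> float:
    """Assumed definition: steps taken relative to a reference step count."""
    if reference_steps <= 0:
        raise ValueError("reference_steps must be positive")
    return len(steps) / reference_steps


def prompt_cache_utilization(steps: List[Step]) -> float:
    """Assumed definition: cached prompt tokens over all prompt tokens."""
    total = sum(s.prompt_tokens for s in steps)
    if total == 0:
        return 0.0
    return sum(s.cached_tokens for s in steps) / total


# Toy trajectory with made-up numbers, for illustration only.
trajectory = [
    Step("click [1]", 100, 0),
    Step("click [1]", 100, 50),      # repeats the previous action
    Step("type [2] foo", 200, 150),
    Step("stop", 100, 100),
]
```

On this toy trajectory, one of the three step transitions is a repeat, the agent takes four steps against an assumed two-step reference, and 300 of 500 prompt tokens are cache hits. The paper's own definitions may normalize differently, for example by counting loops longer than one action.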

BibTeX

@article{li2026pando,
  title = {PANDO: Efficient Multimodal AI Agents via Online Skill Distillation},
  author = {Yubo Li and Yidi Miao and Yuntian Shen and Yuxin Liu},
  year = {2026},
  abstract = {Recent advances in multimodal web agents often rely on increased inference-time computation, including rollout search, verifier passes, offline skill discovery, and specialist model stacks. This raises a central question: can a web agent become more efficient as it accumulates experience, rather than more expensive? We first analyze trajectories from VisualWebArena and identify three recurring sources of inefficiency: repeat-action loops, hidden discovery costs, and low prompt-cache reuse. We th},
  url = {https://arxiv.org/abs/2605.24785},
  keywords = {cs.AI, multimodal web agents, rollout search, verifier passes, offline skill discovery, specialist model stacks, skill-distillation framework, Skill Library, progress reflection, confidence-based skill demotion, hierarchical routing, visual compression, cache-aware prompting, VisualWebArena, token usage, action repetition rate, step overhead ratio, prompt cache utilization, huggingface daily},
  eprint = {2605.24785},
  archiveprefix = {arXiv},
}
