Paper Detail

Are Online Skill and Memory Modules Always Worth Their Tokens? A Budget-Constrained Study of Web Agents

Sina Hajimiri, Masih Aminbeidokhti, Jose Dolz, Ismail Ben Ayed, Issam H. Laradji, Spandana Gella, Nicolas Gontier

Browse

Workflow Queues

arxiv Score 26.0

Published 2026-06-12 · First seen 2026-06-16

Research Track B · General AI

Open paper source

Abstract

Online web agents often augment a base actor with memory, workflow, or skill modules. These modules can improve performance, but they also consume test-time tokens, a cost rarely reported alongside the actor's inference cost. We study online augmentation, where this overhead is paid on every task, and re-evaluate its benefits under a fixed total inference budget. We compare AWM, ASI, and ReasoningBank with a token-matched vanilla baseline that uses the same budget for additional actor steps. Across three WebArena domains and three models, Gemini 3 Flash, GPT-5.4-mini, and Qwen 3.6-27B, the vanilla baseline matches or surpasses all three augmentation methods in aggregate success rate while often using fewer total tokens. We observe a similar trend on WorkArena-L1 with Qwen 3.6-27B, indicating that the effect extends to enterprise knowledge-work tasks. Our results suggest that skills and workflow memory can be useful in specific domains, but their apparent gains often vanish against a budget-matched actor. We further show that run-to-run variance materially affects outcomes and should be reported as a core evaluation criterion for online web agents.

Workflow Status

Review status: pending
Role: unreviewed
Read priority: now
Vote: Not set.
Saved: no
Collections: Not filed yet.
Next action: Not filled yet.

Reading Brief

No structured notes yet. Add `summary_sections`, `why_relevant`, `claim_impact`, or `next_action` in `papers.jsonl` to enrich this view.

Why It Surfaced

No ranking explanation is available yet.

BibTeX

@article{hajimiri2026are,
  title = {Are Online Skill and Memory Modules Always Worth Their Tokens? A Budget-Constrained Study of Web Agents},
  author = {Sina Hajimiri and Masih Aminbeidokhti and Jose Dolz and Ismail Ben Ayed and Issam H. Laradji and Spandana Gella and Nicolas Gontier},
  year = {2026},
  abstract = {Online web agents often augment a base actor with memory, workflow, or skill modules. These modules can improve performance, but they also consume test-time tokens, a cost rarely reported alongside the actor's inference cost. We study online augmentation, where this overhead is paid on every task, and re-evaluate its benefits under a fixed total inference budget. We compare AWM, ASI, and ReasoningBank with a token-matched vanilla baseline that uses the same budget for additional actor steps. Acr},
  url = {https://arxiv.org/abs/2606.15017},
  keywords = {cs.CL},
  eprint = {2606.15017},
  archiveprefix = {arXiv},
}

Metadata

{}