Paper Detail

EEVEE: Towards Test-time Prompt Learning in the Real World for Self-Improving Agents

Weixian Xu, Shilong Liu, Mengdi Wang

arxiv Score 9.3

Published 2026-06-09 · First seen 2026-06-10

General AI

Abstract

In this paper, we propose EEVEE, the first multi-dataset test-time prompt learning framework for LLM agents, designed to operate under real-world task streams. Existing methods are largely designed for single-dataset settings, which limits their practical applicability: real-world applications require models to handle heterogeneous input streams drawn from multiple datasets, domains, and task distributions. To mitigate cross-dataset interference, EEVEE introduces a router that partitions incoming inputs into task clusters and assigns them to suitable prompt configurations. The router and prompts are optimized via a router-prompt co-evolution strategy, which interleaves router and prompt learning phases to address their mutual dependency. Experiments across multiple datasets demonstrate that the framework improves robustness under heterogeneous data streams while maintaining single-benchmark learning capability and efficiency. Specifically, EEVEE improves average multi-benchmark scores by 10.38 and 24.32 points over Qwen3-4B-Instruct and DeepSeek-V3.2, respectively, and surpasses the SOTA methods GEPA and ACE by up to 37.2% and 48.2%.
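
The routing and interleaving idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not say how the router clusters inputs or how prompts are updated, so everything below is an assumption made for illustration. That covers the `Router` class, the token-overlap similarity, the `co_evolve` function, the `phase_len` parameter, and the choice to "learn" a prompt by appending feedback strings. The sketch only shows the control flow: route each input to a task cluster, then update either the router or that cluster's prompt depending on the current phase.

```python
from collections import Counter, defaultdict


def tokens(text):
    """Bag-of-words representation (hypothetical stand-in for a real encoder)."""
    return Counter(text.lower().split())


def overlap(a, b):
    """Size of the multiset intersection of two token bags."""
    return sum((a & b).values())


class Router:
    """Toy router: one token profile per task cluster."""

    def __init__(self, min_overlap=1):
        self.profiles = []
        self.min_overlap = min_overlap

    def route(self, text, allow_new=True):
        t = tokens(text)
        scores = [overlap(t, p) for p in self.profiles]
        if scores and max(scores) >= self.min_overlap:
            return scores.index(max(scores))
        if allow_new or not self.profiles:
            # No cluster is close enough: open a new one.
            self.profiles.append(Counter())
            return len(self.profiles) - 1
        # Router is frozen: fall back to the closest existing cluster.
        return scores.index(max(scores))

    def update(self, cluster, text):
        self.profiles[cluster] += tokens(text)


def co_evolve(stream, router, prompts, phase_len=2):
    """Alternate router-learning and prompt-learning phases over a stream."""
    log = []
    for step, (text, feedback) in enumerate(stream):
        router_phase = (step // phase_len) % 2 == 0
        cluster = router.route(text, allow_new=router_phase)
        if router_phase:
            router.update(cluster, text)       # prompts stay frozen
        else:
            prompts[cluster].append(feedback)  # router stays frozen
        log.append((cluster, "router" if router_phase else "prompt"))
    return log


# Hypothetical mixed stream of (input, feedback) pairs from two task types.
stream = [
    ("solve the equation x + 2 = 5", "n/a"),
    ("translate this sentence to french", "n/a"),
    ("solve the equation 3 y = 9", "show each algebra step"),
    ("translate this paragraph to german", "keep the tone formal"),
]
router = Router()
prompts = defaultdict(list)
log = co_evolve(stream, router, prompts)
```

In this toy run, the first two inputs are seen during a router phase and open two separate clusters; the next two fall in a prompt phase, so each piece of feedback lands only in the prompt notes for its own cluster. A single shared prompt would instead accumulate both, which is the kind of cross-dataset interference the abstract describes.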

BibTeX

@article{xu2026eevee,
  title = {EEVEE: Towards Test-time Prompt Learning in the Real World for Self-Improving Agents},
  author = {Weixian Xu and Shilong Liu and Mengdi Wang},
  year = {2026},
  abstract = {In this paper, we propose EEVEE, the first multi-dataset test-time prompt learning framework for LLM agents, enabling test-time prompt learning under real-world task streams. Existing methods are largely designed for single-dataset settings, while real-world applications require models to handle heterogeneous input streams drawn from multiple datasets, domains, and task distributions, limiting their practical applicability. To mitigate cross-dataset interference, EEVEE introduces a router that p},
  url = {https://arxiv.org/abs/2606.11182},
  keywords = {cs.LG, cs.AI, test-time prompt learning, LLM agents, multi-dataset, cross-dataset interference, router, prompt configurations, router-prompt co-evolution, task clusters, heterogeneous data streams, code available, huggingface daily},
  eprint = {2606.11182},
  archiveprefix = {arXiv},
}