Paper Detail

Guava: An Effective and Universal Harness for Embodied Manipulation

Haowen Liu, Xirui Li, Shaoxiong Yao, Peng Shi, Tianyi Zhou, Jia-Bin Huang, Furong Huang, Jiayuan Mao

huggingface Score 19.5

Published 2026-06-16 · First seen 2026-06-18

General AI

Abstract

Language models trained on large-scale vision-language data have demonstrated strong potential for embodied agents. Harnessing models through embodied tool use offers a promising alternative to end-to-end vision-language-action systems by combining high-level reasoning with external modules for perception, planning, and control. However, it remains unclear what makes an effective harness for embodied manipulation, and to what extent such a harness can unlock embodied capabilities in a wide range of reasoning models. In this work, we present Guava, a harness framework for embodied tool use developed through systematic exploration of the design space of agent workflows, action spaces, and observation spaces. Our study identifies three key ingredients for effective embodied agents: iterative perception-reasoning-action loops, semantic action abstractions, and multimodal observations. To understand whether these design principles are universal even to small models, we develop an end-to-end training pipeline that distills embodied manipulation capabilities into a 4B open-source model using fewer than 2K trajectories collected entirely in simulation. Experimental results in both simulation and real-world environments show performance comparable to frontier proprietary models while exhibiting strong generalization to unseen objects, novel instructions, and long-horizon tasks. These results suggest that a well-designed harness can serve as a scalable, model-agnostic interface for embodied manipulation, enabling strong emergent embodied capabilities in compact open-source models with minimal training data.
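
The three ingredients named in the abstract can be illustrated with a toy sketch. This is not Guava's implementation or API: `ToyEnv`, `Observation`, `toy_policy`, and `run_loop` are hypothetical names invented here, the rule-based policy merely stands in for a reasoning model, and the "image" is a placeholder string. It only shows the general shape of an iterative perception-reasoning-action loop over semantic actions with a two-part (image plus text) observation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# A semantic action is a (name, argument) pair such as ("pick", "cup"),
# as opposed to low-level joint commands. Hypothetical format for illustration.
Action = Tuple[str, str]


@dataclass
class Observation:
    image: str  # placeholder for a camera frame
    text: str   # textual scene description


class ToyEnv:
    """Hypothetical stand-in for a simulator; not part of any real library."""

    def __init__(self) -> None:
        self.holding: Optional[str] = None
        self.placed: Optional[Tuple[str, str]] = None

    def observe(self) -> Observation:
        if self.placed is not None:
            text = f"{self.placed[0]} is on {self.placed[1]}"
        elif self.holding is not None:
            text = f"gripper holds {self.holding}"
        else:
            text = "gripper is empty"
        return Observation(image="<frame>", text=text)

    def step(self, action: Action) -> None:
        name, arg = action
        if name == "pick":
            self.holding = arg
        elif name == "place" and self.holding is not None:
            self.placed = (self.holding, arg)
            self.holding = None


def toy_policy(obs: Observation) -> Optional[Action]:
    """Rule-based stand-in for the reasoning model; returns None when done."""
    if "is on" in obs.text:
        return None
    if "holds" in obs.text:
        return ("place", "shelf")
    return ("pick", "cup")


def run_loop(
    env: ToyEnv,
    policy: Callable[[Observation], Optional[Action]],
    max_steps: int = 5,
) -> List[Action]:
    """Observe, decide, act, and repeat until the policy stops or steps run out."""
    history: List[Action] = []
    for _ in range(max_steps):
        obs = env.observe()      # perception
        action = policy(obs)     # reasoning
        if action is None:
            break
        env.step(action)         # action
        history.append(action)
    return history
```

Calling `run_loop(ToyEnv(), toy_policy)` re-observes the scene after every action, so the policy reacts to the updated state rather than committing to a full plan up front.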

BibTeX

@misc{liu2026guava,
  title = {Guava: An Effective and Universal Harness for Embodied Manipulation},
  author = {Haowen Liu and Xirui Li and Shaoxiong Yao and Peng Shi and Tianyi Zhou and Jia-Bin Huang and Furong Huang and Jiayuan Mao},
  year = {2026},
  abstract = {Language models trained on large-scale vision-language data have demonstrated strong potential for embodied agents. Harnessing models through embodied tool use offers a promising alternative to end-to-end vision-language-action systems by combining high-level reasoning with external modules for perception, planning, and control. However, it remains unclear what makes an effective harness for embodied manipulation, and to what extent such a harness can unlock embodied capabilities in a wide rang},
  url = {https://huggingface.co/papers/2606.18363},
  keywords = {embodied agents, vision-language models, embodied manipulation, agent workflows, action spaces, observation spaces, iterative perception-reasoning-action loops, semantic action abstractions, multimodal observations, end-to-end training, model distillation, simulation training, huggingface daily},
  eprint = {2606.18363},
  archiveprefix = {arXiv},
}
