OneHOI: Unifying Human-Object Interaction Generation and Editing

Jiun Tian Hoe, Weipeng Hu, Xudong Jiang, Yap-Peng Tan, Chee Seng Chan

arxiv Score 3.3

Published 2026-04-15 · First seen 2026-04-17

General AI

Abstract

Human-Object Interaction (HOI) modelling captures how humans act upon and relate to objects, typically expressed as <person, action, object> triplets. Existing approaches split into two disjoint families: HOI generation synthesises scenes from structured triplets and layout, but fails to integrate mixed conditions like HOI and object-only entities; and HOI editing modifies interactions via text, yet struggles to decouple pose from physical contact and scale to multiple interactions. We introduce OneHOI, a unified diffusion transformer framework that consolidates HOI generation and editing into a single conditional denoising process driven by shared structured interaction representations. At its core, the Relational Diffusion Transformer (R-DiT) models verb-mediated relations through role- and instance-aware HOI tokens, layout-based spatial Action Grounding, a Structured HOI Attention to enforce interaction topology, and HOI RoPE to disentangle multi-HOI scenes. Trained jointly with modality dropout on our HOI-Edit-44K, along with HOI and object-centric datasets, OneHOI supports layout-guided, layout-free, arbitrary-mask, and mixed-condition control, achieving state-of-the-art results across both HOI generation and editing. Code is available at https://jiuntian.github.io/OneHOI/.
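As a rough illustration of the structured conditions the abstract describes, the sketch below represents a <person, action, object> triplet with optional layout boxes and applies a simple form of modality dropout that removes the layout at random. It is not taken from the OneHOI code: the names `HOITriplet` and `drop_layout`, the box convention, and the choice to drop only the layout are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass, replace
from typing import Optional, Tuple

# Assumed convention: (x0, y0, x1, y1) in normalised image coordinates.
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class HOITriplet:
    """Hypothetical container for one <person, action, object> condition."""
    person: str
    action: str
    obj: str
    person_box: Optional[Box] = None   # None means layout-free
    object_box: Optional[Box] = None


def drop_layout(triplet: HOITriplet, p_drop: float, rng: random.Random) -> HOITriplet:
    """With probability p_drop, return a copy of the triplet without its boxes.

    This is an assumed, simplified reading of "modality dropout": the same
    triplet is sometimes seen with layout and sometimes without it.
    """
    if rng.random() < p_drop:
        return replace(triplet, person_box=None, object_box=None)
    return triplet


rng = random.Random(0)
guided = HOITriplet("person", "ride", "bicycle",
                    person_box=(0.2, 0.1, 0.6, 0.9),
                    object_box=(0.1, 0.5, 0.8, 1.0))
layout_free = drop_layout(guided, p_drop=1.0, rng=rng)   # always drops the boxes
unchanged = drop_layout(guided, p_drop=0.0, rng=rng)     # never drops the boxes
```

Under these assumptions, a single condition format covers both the layout-guided and the layout-free case, which is one plausible way a jointly trained model could be exposed to both kinds of control.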

BibTeX

@article{hoe2026onehoi,
  title = {OneHOI: Unifying Human-Object Interaction Generation and Editing},
  author = {Jiun Tian Hoe and Weipeng Hu and Xudong Jiang and Yap-Peng Tan and Chee Seng Chan},
  year = {2026},
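  abstract = {Human-Object Interaction (HOI) modelling captures how humans act upon and relate to objects, typically expressed as <person, action, object> triplets. Existing approaches split into two disjoint families: HOI generation synthesises scenes from structured triplets and layout, but fails to integrate mixed conditions like HOI and object-only entities; and HOI editing modifies interactions via text, yet struggles to decouple pose from physical contact and scale to multiple interactions. We introduce OneHOI, a unified diffusion transformer framework that consolidates HOI generation and editing into a single conditional denoising process driven by shared structured interaction representations. At its core, the Relational Diffusion Transformer (R-DiT) models verb-mediated relations through role- and instance-aware HOI tokens, layout-based spatial Action Grounding, a Structured HOI Attention to enforce interaction topology, and HOI RoPE to disentangle multi-HOI scenes. Trained jointly with modality dropout on our HOI-Edit-44K, along with HOI and object-centric datasets, OneHOI supports layout-guided, layout-free, arbitrary-mask, and mixed-condition control, achieving state-of-the-art results across both HOI generation and editing. Code is available at https://jiuntian.github.io/OneHOI/.},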
  url = {https://arxiv.org/abs/2604.14062},
  keywords = {cs.CV, cs.MM, diffusion transformer, relational diffusion transformer, R-DiT, HOI tokens, layout-based spatial Action Grounding, Structured HOI Attention, HOI RoPE, modality dropout, HOI-Edit-44K, code available, huggingface daily},
  eprint = {2604.14062},
  archiveprefix = {arXiv},
}
