Paper Detail

OneHOI: Unifying Human-Object Interaction Generation and Editing

Jiun Tian Hoe, Weipeng Hu, Xudong Jiang, Yap-Peng Tan, Chee Seng Chan

arxiv Score 4.3

Published 2026-04-15 · First seen 2026-04-17

General AI

Abstract

Human-Object Interaction (HOI) modelling captures how humans act upon and relate to objects, typically expressed as <person, action, object> triplets. Existing approaches split into two disjoint families: HOI generation synthesises scenes from structured triplets and layout, but fails to integrate mixed conditions like HOI and object-only entities; and HOI editing modifies interactions via text, yet struggles to decouple pose from physical contact and scale to multiple interactions. We introduce OneHOI, a unified diffusion transformer framework that consolidates HOI generation and editing into a single conditional denoising process driven by shared structured interaction representations. At its core, the Relational Diffusion Transformer (R-DiT) models verb-mediated relations through role- and instance-aware HOI tokens, layout-based spatial Action Grounding, a Structured HOI Attention to enforce interaction topology, and HOI RoPE to disentangle multi-HOI scenes. Trained jointly with modality dropout on our HOI-Edit-44K, along with HOI and object-centric datasets, OneHOI supports layout-guided, layout-free, arbitrary-mask, and mixed-condition control, achieving state-of-the-art results across both HOI generation and editing. Code is available at https://jiuntian.github.io/OneHOI/.

Workflow Status

Review status
pending
Role
unreviewed
Read priority
later
Vote
Not set.
Saved
no
Collections
Not filed yet.
Next action
Not filled yet.

Reading Brief

No structured notes yet. Add `summary_sections`, `why_relevant`, `claim_impact`, or `next_action` in `papers.jsonl` to enrich this view.

Why It Surfaced

No ranking explanation is available yet.

Tags

No tags.

BibTeX

@article{hoe2026onehoi,
  title = {OneHOI: Unifying Human-Object Interaction Generation and Editing},
  author = {Jiun Tian Hoe and Weipeng Hu and Xudong Jiang and Yap-Peng Tan and Chee Seng Chan},
  year = {2026},
  abstract = {Human-Object Interaction (HOI) modelling captures how humans act upon and relate to objects, typically expressed as <person, action, object> triplets. Existing approaches split into two disjoint families: HOI generation synthesises scenes from structured triplets and layout, but fails to integrate mixed conditions like HOI and object-only entities; and HOI editing modifies interactions via text, yet struggles to decouple pose from physical contact and scale to multiple interactions. We introduce},
  url = {https://arxiv.org/abs/2604.14062},
  keywords = {cs.CV, cs.MM, diffusion transformer, relational diffusion transformer, R-DiT, HOI tokens, layout-based spatial Action Grounding, Structured HOI Attention, HOI RoPE, modality dropout, HOI-Edit-44K, code available, huggingface daily},
  eprint = {2604.14062},
  archiveprefix = {arXiv},
}

Metadata

{}