Paper Detail

Learning from Noisy Preferences: A Semi-Supervised Learning Approach to Direct Preference Optimization

Xinxin Liu, Ming Li, Zonglin Lyu, Yuzhang Shang, Chen Chen

huggingface Score 5.0

Published 2026-04-27 · First seen 2026-05-02

General AI

Abstract

Human visual preferences are inherently multi-dimensional, encompassing aesthetics, detail fidelity, and semantic alignment. However, existing datasets provide only single, holistic annotations, resulting in severe label noise: images that excel in some dimensions but are deficient in others are simply marked as winner or loser. We theoretically demonstrate that compressing multi-dimensional preferences into binary labels generates conflicting gradient signals that misguide Diffusion Direct Preference Optimization (DPO). To address this, we propose Semi-DPO, a semi-supervised approach that treats consistent pairs as clean labeled data and conflicting ones as noisy unlabeled data. Our method starts by training on a consensus-filtered clean subset, then uses this model as an implicit classifier to generate pseudo-labels for the noisy set for iterative refinement. Experimental results demonstrate that Semi-DPO achieves state-of-the-art performance and significantly improves alignment with complex human preferences, without requiring additional human annotation or explicit reward models during training. We will release our code and models at: https://github.com/L-CodingSpace/semi-dpo

Workflow Status

Review status
pending
Role
unreviewed
Read priority
later
Vote
Not set.
Saved
no
Collections
Not filed yet.
Next action
Not filled yet.

Reading Brief

No structured notes yet. Add `summary_sections`, `why_relevant`, `claim_impact`, or `next_action` in `papers.jsonl` to enrich this view.

Why It Surfaced

No ranking explanation is available yet.

Tags

No tags.

BibTeX

@misc{liu2026learning,
  title = {Learning from Noisy Preferences: A Semi-Supervised Learning Approach to Direct Preference Optimization},
  author = {Xinxin Liu and Ming Li and Zonglin Lyu and Yuzhang Shang and Chen Chen},
  year = {2026},
  abstract = {Human visual preferences are inherently multi-dimensional, encompassing aesthetics, detail fidelity, and semantic alignment. However, existing datasets provide only single, holistic annotations, resulting in severe label noise: images that excel in some dimensions but are deficient in others are simply marked as winner or loser. We theoretically demonstrate that compressing multi-dimensional preferences into binary labels generates conflicting gradient signals that misguide Diffusion Direct Pref},
  url = {https://huggingface.co/papers/2604.24952},
  keywords = {Diffusion Direct Preference Optimization, DPO, semi-supervised learning, consensus-filtered clean subset, pseudo-labels, iterative refinement, label noise, multi-dimensional preferences, visual preference learning, huggingface daily},
  eprint = {2604.24952},
  archiveprefix = {arXiv},
}

Metadata

{}