Paper Detail

Why Fine-Tuning Encourages Hallucinations and How to Fix It

Guy Kaplan, Zorik Gekhman, Zhen Zhu, Lotem Rozner, Yuval Reif, Swabha Swayamdipta, Derek Hoiem, Roy Schwartz

arxiv Score 12.0

Published 2026-04-16 · First seen 2026-04-20

Research Track A · General AI

Abstract

Large language models are prone to hallucinating factually incorrect statements. A key source of these errors is exposure to new factual information through supervised fine-tuning (SFT), which can increase hallucinations w.r.t. knowledge acquired during pre-training. In this work, we explore whether SFT-induced hallucinations can be mitigated using established tools from the continual learning literature, since they arise as a by-product of knowledge degradation during training. We propose a self-distillation-based SFT method that facilitates effective factual learning while minimizing hallucinations w.r.t. pre-existing knowledge by regularizing output-distribution drift. We also show that, in settings where new knowledge acquisition is unnecessary, suppressing factual plasticity by freezing parameter groups can preserve task performance while reducing hallucinations. Lastly, we investigate the mechanism behind SFT-induced hallucinations through three hypotheses: capacity limitations, behavior cloning, and localized interference. Our experiments show that a main driver is interference among overlapping semantic representations, and that self-distillation succeeds by mitigating this interference.
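
The abstract does not give the training objective, so the following is only a minimal sketch of what "regularizing output-distribution drift" could look like, not the authors' method. It assumes the teacher is a frozen copy of the model from before fine-tuning, that drift is measured as KL(teacher || student) over next-token distributions, and that a hypothetical `drift_weight` balances the two terms. All function and parameter names are illustrative.

```python
import math


def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_index):
    """Standard SFT loss for one token: negative log-probability of the target."""
    return -math.log(probs[target_index])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def regularized_sft_loss(student_logits, teacher_logits, target_index,
                         drift_weight=1.0):
    """Task loss plus a penalty on drift away from the teacher's outputs.

    Assumption: teacher_logits come from a frozen pre-fine-tuning copy of
    the model; drift_weight is a hypothetical hyperparameter.
    Returns (total, task_term, drift_term).
    """
    student = softmax(student_logits)
    teacher = softmax(teacher_logits)
    task = cross_entropy(student, target_index)
    drift = kl_divergence(teacher, student)
    return task + drift_weight * drift, task, drift


if __name__ == "__main__":
    # Identical student and teacher: the drift term vanishes.
    total, task, drift = regularized_sft_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], 2)
    print(total, task, drift)
```

When the student's distribution matches the teacher's, the penalty is zero and the objective reduces to plain SFT; the further fine-tuning pushes the outputs from the pre-fine-tuning distribution, the larger the penalty grows.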

BibTeX

@article{kaplan2026why,
  title = {Why Fine-Tuning Encourages Hallucinations and How to Fix It},
  author = {Guy Kaplan and Zorik Gekhman and Zhen Zhu and Lotem Rozner and Yuval Reif and Swabha Swayamdipta and Derek Hoiem and Roy Schwartz},
  year = {2026},
  abstract = {Large language models are prone to hallucinating factually incorrect statements. A key source of these errors is exposure to new factual information through supervised fine-tuning (SFT), which can increase hallucinations w.r.t. knowledge acquired during pre-training. In this work, we explore whether SFT-induced hallucinations can be mitigated using established tools from the continual learning literature, since they arise as a by-product of knowledge degradation during training. We propose a sel},
  url = {https://arxiv.org/abs/2604.15574},
  keywords = {cs.CL, cs.AI, cs.LG, cs.NE},
  eprint = {2604.15574},
  archiveprefix = {arXiv},
}