Paper Detail

Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less

Yuxing Liu, Jianyu Wang, Tong Zhang

arxiv Score 14.0

Published 2026-05-07 · First seen 2026-05-09

Research Track A · General AI

Abstract

Optimizers play an important role in both pretraining and finetuning stages when training large language models (LLMs). In this paper, we present an observation that full finetuning with the same optimizer as in pretraining achieves a better learning-forgetting tradeoff, i.e., forgetting less while achieving the same or better performance on the new task, than other optimizers and, possibly surprisingly, LoRA, during the supervised finetuning (SFT) stage. We term this phenomenon optimizer-model consistency. To better understand it, through controlled experiments and theoretical analysis, we show that: 1) optimizers can shape the models by having regularization effects on the activations, leading to different landscapes around the pretrained checkpoints; 2) in response to this regularization effect, the weight update in SFT should follow some specific structures to lower forgetting of the knowledge learned in pretraining, which can be obtained by using the same optimizer. Moreover, we specifically compare Muon and AdamW when they are employed throughout the pretraining and SFT stages and find that Muon performs worse when finetuned for reasoning tasks. With a synthetic language modeling experiment, we demonstrate that this can come from Muon's strong tendency towards rote memorization, which may hurt pattern acquisition with a small amount of data, as for SFT.

Workflow Status

Review status
pending
Role
unreviewed
Read priority
now
Vote
Not set.
Saved
no
Collections
Not filed yet.
Next action
Not filled yet.

Reading Brief

No structured notes yet. Add `summary_sections`, `why_relevant`, `claim_impact`, or `next_action` in `papers.jsonl` to enrich this view.

Why It Surfaced

No ranking explanation is available yet.

Tags

No tags.

BibTeX

@article{liu2026optimizer,
  title = {Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less},
  author = {Yuxing Liu and Jianyu Wang and Tong Zhang},
  year = {2026},
  abstract = {Optimizers play an important role in both pretraining and finetuning stages when training large language models (LLMs). In this paper, we present an observation that full finetuning with the same optimizer as in pretraining achieves a better learning-forgetting tradeoff, i.e., forgetting less while achieving the same or better performance on the new task, than other optimizers and, possibly surprisingly, LoRA, during the supervised finetuning (SFT) stage. We term this phenomenon optimizer-model },
  url = {https://arxiv.org/abs/2605.06654},
  keywords = {cs.LG, cs.AI, math.OC, Computer science, Forgetting, Regularization (linguistics), Artificial intelligence, Language model},
  eprint = {2605.06654},
  archiveprefix = {arXiv},
}

Metadata

{}