Paper Detail

RepSelect: Robust LLM Unlearning via Representation Selectivity

Filip Sondej, Yushi Yang, Adam Mahdi

huggingface Score 7.5

Published 2026-06-15 · First seen 2026-06-17

General AI

Abstract

Making large language models (LLMs) deeply forget specific knowledge and values without sacrificing general capabilities remains a central challenge in unlearning. However, current methods are easily reversed by fine-tuning or few-shot prompting, suggesting their forgetting is only shallow. We identify the root cause: existing methods target representations shared with both the retain set and the subspace recovered by a fine-tuning attacker, making unlearning both disruptive to general capabilities and easy to reverse. We propose RepSelect (Representation Selectivity), which isolates forget-set-specific representations by collapsing the top principal components of weight gradients before each update, leaving general capabilities intact while limiting what fine-tuning can recover. We evaluate across two forget categories, biohazardous knowledge and abusive tendencies, and four model families spanning dense and Mixture-of-Experts architectures (Llama 3, Qwen 3.5, Gemma 4 E4B, DeepSeek V2 Lite). Compared to five popular baselines (GradDiff, NPO, SimNPO, RMU, UNDIAL), RepSelect achieves a 4-50x larger reduction in post-relearning answer accuracy than the strongest baseline, and is near-perfectly robust to few-shot prompting attacks. Targeting selective representations is thus an important step towards deep and robust LLM forgetting.
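
The abstract describes the core step only in one phrase ("collapsing top principal components of weight gradients before each update"). The snippet below is a rough illustration of what such a step could look like, not the authors' implementation. It assumes that "principal components" can be approximated by the leading singular directions of a single 2-D gradient matrix and that "collapsing" means zeroing them; the function name `collapse_top_components`, the parameter `k`, and the synthetic data are all hypothetical.

```python
import numpy as np


def collapse_top_components(grad, k):
    """Zero out the k largest singular directions of a 2-D gradient matrix.

    Assumption: this stands in for "collapsing top principal components";
    the paper may compute the components differently.
    """
    u, s, vt = np.linalg.svd(grad, full_matrices=False)
    s_collapsed = s.copy()
    s_collapsed[:k] = 0.0  # drop the dominant directions
    # Rebuild the matrix from the remaining directions only.
    return (u * s_collapsed) @ vt


rng = np.random.default_rng(0)

# Synthetic gradient: one strong rank-1 component (standing in for a
# direction shared with general capabilities) plus a small residual
# (standing in for forget-set-specific signal).
shared = 10.0 * np.outer(rng.standard_normal(8), rng.standard_normal(6))
specific = 0.1 * rng.standard_normal((8, 6))
grad = shared + specific

filtered = collapse_top_components(grad, k=1)

# The filtered gradient keeps the tail of the spectrum and loses the head.
orig_sv = np.linalg.svd(grad, compute_uv=False)
filt_sv = np.linalg.svd(filtered, compute_uv=False)
```

In this toy setup the filtered update has a much smaller norm than the raw gradient, because the dominant direction has been removed and only the weaker residual directions remain. How many components to collapse, and over which set of gradients they are estimated, are choices the abstract does not specify.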

BibTeX

@misc{sondej2026repselect,
  title = {RepSelect: Robust LLM Unlearning via Representation Selectivity},
  author = {Filip Sondej and Yushi Yang and Adam Mahdi},
  year = {2026},
  abstract = {Making large language models (LLMs) deeply forget specific knowledge and values without sacrificing general capabilities remains a central challenge in unlearning. However, current methods are easily reversed by fine-tuning or few-shot prompting, suggesting their forgetting is only shallow. We identify the root cause. Existing methods target representations shared with both the retain set and the subspace recovered by a fine-tuning attacker, making unlearning both disruptive to general capabilit},
  url = {https://huggingface.co/papers/2606.17168},
  keywords = {large language models, unlearning, deep forgetting, representation selectivity, principal components, weight gradients, fine-tuning, few-shot prompting, biohazardous knowledge, abusive tendencies, Mixture-of-Experts, code available, huggingface daily},
  eprint = {2606.17168},
  archiveprefix = {arXiv},
}
