Paper Detail

Recover-LoRA for Aggressive Quantization: Reclaiming Accuracy in 2-Bit Language Models via Low-Rank Adaptation with Knowledge Distillation on Synthetic Data

Devleena Das, Rajeev Patwari, Elliott Delaye, Ashish Sirasao

arXiv Score 14.3

Published 2026-06-02 · First seen 2026-06-05

General AI

Abstract

Aggressive weight quantization to 2-bit precision offers substantial throughput and memory gains for large language model (LLM) inference, but typically incurs severe accuracy degradation. These gains are particularly relevant for edge and on-device deployment, where memory capacity and bandwidth are primary constraints. In this work, we extend Recover-LoRA, a lightweight, data-free accuracy recovery method originally developed for general model weight corruption, to the setting of ultra-low-bit quantization. We propose a selective mixed-precision strategy in which only the gate and up projection layers of the MLP are quantized to 2-bit (W2), while all other linear layers remain at higher precision, yielding a mixed-precision GateUp configuration. We demonstrate via roofline analysis across three model families (4B-20B) and two hardware platforms that a W4/W2-GateUp deployment (4-bit base with 2-bit gate/up) delivers a 7.5-23.3% TPS improvement over uniform W4, depending on model and context length, while confining quantization error to a predictable subset of layers. We then apply Recover-LoRA, which trains low-rank adapters on the quantized layers via logit distillation with synthetic data, to recover accuracy lost from 2-bit quantization of the gate and up layers. In a case study on Qwen3-4B, Recover-LoRA achieves 80-95% accuracy recovery on 9 of 12 benchmarks, using only 10k synthetic training samples and no labeled data. We further demonstrate that synthetic data performs comparably to curated labeled data for distillation-based recovery, and that recovery generalizes to out-of-distribution evaluation tasks. Our results present Recover-LoRA as a practical post-quantization accuracy recovery tool for aggressive weight compression in deployment settings.
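
To give a feel for why moving only the gate and up projections to 2-bit can raise decode throughput, the following is a minimal back-of-the-envelope sketch, not the paper's roofline analysis. It assumes decoding is purely memory-bandwidth-bound, so tokens per second scale inversely with the bytes read per token. The layer shapes (four `hidden x hidden` attention projections, a gated MLP), the function names, and the example dimensions are all hypothetical and are not taken from the paper; quantization scales, embeddings, and activations are ignored.

```python
def weight_bytes_per_layer(hidden, intermediate, base_bits, gate_up_bits):
    """Bytes of linear-layer weights in one decoder layer.

    Assumption: four hidden x hidden attention projections (no grouped-query
    attention) and a gated MLP with gate, up, and down projections.
    Quantization scales and zero-points are ignored.
    """
    attn_params = 4 * hidden * hidden
    down_params = hidden * intermediate
    gate_up_params = 2 * hidden * intermediate
    total_bits = (base_bits * (attn_params + down_params)
                  + gate_up_bits * gate_up_params)
    return total_bits / 8


def decode_speedup(hidden, intermediate, kv_bytes_per_layer=0.0):
    """Ratio of bytes read per token: uniform W4 over W4/W2-GateUp.

    Under the memory-bound assumption this ratio approximates the TPS gain.
    kv_bytes_per_layer models KV-cache traffic, which grows with context
    length and is unaffected by weight quantization.
    """
    w4 = weight_bytes_per_layer(hidden, intermediate, 4, 4) + kv_bytes_per_layer
    mixed = weight_bytes_per_layer(hidden, intermediate, 4, 2) + kv_bytes_per_layer
    return w4 / mixed


# Hypothetical dimensions chosen for illustration only.
print(decode_speedup(4096, 11008))                           # weights only
print(decode_speedup(4096, 11008, kv_bytes_per_layer=50e6))  # with KV traffic
```

With these illustrative dimensions the weights-only ratio is about 1.29, and it shrinks once KV-cache traffic is added, which is in line with the abstract's remark that the gain depends on model and context length. A real roofline analysis would also account for compute limits and hardware-specific bandwidth.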

BibTeX

@article{das2026recover,
  title = {Recover-LoRA for Aggressive Quantization: Reclaiming Accuracy in 2-Bit Language Models via Low-Rank Adaptation with Knowledge Distillation on Synthetic Data},
  author = {Devleena Das and Rajeev Patwari and Elliott Delaye and Ashish Sirasao},
  year = {2026},
  abstract = {Aggressive weight quantization to 2-bit precision offers substantial throughput and memory gains for large language model (LLM) inference, but typically incurs severe accuracy degradation. These gains are particularly relevant for edge and on-device deployment, where memory capacity and bandwidth are primary constraints. In this work, we extend Recover-LoRA -- a lightweight, data-free accuracy recovery method originally developed for general model weight corruption -- to the setting of ultra-low},
  url = {https://arxiv.org/abs/2606.04238},
  keywords = {cs.LG, cs.AI},
  eprint = {2606.04238},
  archiveprefix = {arXiv},
}
