Variable-Width Transformers

Zhaofeng Wu, Oliver Sieberling, Shawn Tan, Rameswar Panda, Yury Polyanskiy, Yoon Kim

arxiv Score 4.3

Published 2026-06-16 · First seen 2026-06-17

General AI

Abstract

Scaling model size, specifically depth and width, has driven significant progress in transformer-based language models. However, most architectures maintain a constant width across all layers, allocating a fixed parameter and computation budget evenly despite different layers potentially playing distinct computational roles. In this work, we empirically investigate nonuniform capacity allocation across network depth by proposing a $\times$-shaped > <former architecture. This design maintains wider early and late layers while narrowing the middle layers, utilizing a parameter-free residual resizing mechanism. Across decoder-only language models ranging from 200M to 2B parameters (dense) and 3B parameters (MoE), our > <former consistently outperforms parameter-matched uniform baselines on language modeling loss. By reducing the average layer width, this architecture also requires fewer overall FLOPs (22% reduction under fitted loss-matched scaling curves) and smaller KV cache memory and I/O cost (15% reduction). In analysis, we show that this bottleneck structure results in qualitatively different representations in residual streams. Overall, our results demonstrate that nonuniform width allocation can result in more resource-optimal scaling of language models.
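The abstract describes the shape (wide early and late layers, narrow middle layers) and says the residual resizing is parameter-free, but it does not give the width schedule or the resizing rule. The sketch below is only an illustration under two assumptions that are not from the paper: a linear taper toward the middle layer (`x_shaped_widths`, a hypothetical helper) and a resize that truncates when narrowing and zero-pads when widening (`resize_residual`, also hypothetical).

```python
def x_shaped_widths(num_layers, d_wide, d_narrow):
    """Per-layer widths that taper linearly from d_wide at both ends
    to d_narrow at the middle layer. The linear taper is an assumption;
    the abstract only says the middle layers are narrower."""
    if num_layers < 2:
        return [d_wide] * num_layers
    mid = (num_layers - 1) / 2
    widths = []
    for i in range(num_layers):
        frac = abs(i - mid) / mid  # 1.0 at the first/last layer, 0.0 at the center
        widths.append(round(d_narrow + frac * (d_wide - d_narrow)))
    return widths


def resize_residual(x, new_width):
    """Resize a residual vector without learned parameters.
    Assumed rule: truncate when narrowing, zero-pad when widening."""
    if new_width <= len(x):
        return x[:new_width]
    return x + [0.0] * (new_width - len(x))


widths = x_shaped_widths(num_layers=5, d_wide=8, d_narrow=4)
print(widths)  # [8, 6, 4, 6, 8]

# Average width relative to a uniform model that uses d_wide everywhere.
print(sum(widths) / (len(widths) * 8))  # 0.8

# Pass a residual vector through the width schedule.
x = [1.0] * widths[0]
for w in widths[1:]:
    x = resize_residual(x, w)
print(len(x))  # 8
```

In this toy schedule the average layer width is 80% of the uniform width, which is the kind of reduction the abstract links to lower FLOPs and a smaller KV cache. The ratio here comes from the made-up numbers above, not from the paper's reported 22% and 15% figures.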

BibTeX

@article{wu2026variable,
  title = {Variable-Width Transformers},
  author = {Zhaofeng Wu and Oliver Sieberling and Shawn Tan and Rameswar Panda and Yury Polyanskiy and Yoon Kim},
  year = {2026},
  abstract = {Scaling model size, specifically depth and width, has driven significant progress in transformer-based language models. However, most architectures maintain a constant width across all layers, allocating a fixed parameter and computation budget evenly despite different layers potentially playing distinct computational roles. In this work, we empirically investigate nonuniform capacity allocation across network depth by proposing a $\times$-shaped > <former architecture. This design maintains wid},
  url = {https://arxiv.org/abs/2606.18246},
  keywords = {cs.CL, transformer-based language models, model size scaling, depth, width, parameter-efficient fine-tuning, $\times$-shaped architecture, residual resizing mechanism, decoder-only language models, MoE, FLOPs, KV cache memory, code available, huggingface daily},
  eprint = {2606.18246},
  archiveprefix = {arXiv},
}
