Variable-Width Transformers

Zhaofeng Wu, Oliver Sieberling, Shawn Tan, Rameswar Panda, Yury Polyanskiy, Yoon Kim

arxiv Score 4.3

Published 2026-06-16 · First seen 2026-06-17

General AI

Abstract

Scaling model size, specifically depth and width, has driven significant progress in transformer-based language models. However, most architectures maintain a constant width across all layers, allocating a fixed parameter and computation budget evenly despite different layers potentially playing distinct computational roles. In this work, we empirically investigate nonuniform capacity allocation across network depth by proposing a $\times$-shaped > <former architecture. This design maintains wider early and late layers while narrowing the middle layers, utilizing a parameter-free residual resizing mechanism. Across decoder-only language models ranging from 200M to 2B parameters (dense) and 3B parameters (MoE), our > <former consistently outperforms parameter-matched uniform baselines on language modeling loss. By reducing the average layer width, this architecture also requires fewer overall FLOPs (22% reduction under fitted loss-matched scaling curves) and smaller KV cache memory and I/O cost (15% reduction). In analysis, we show that this bottleneck structure results in qualitatively different representations in residual streams. Overall, our results demonstrate that nonuniform width allocation can result in more resource-optimal scaling of language models.
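To make the shape described in the abstract concrete, here is a minimal sketch of a per-layer width schedule that is wide at both ends and narrow in the middle, together with one possible parameter-free way to resize a residual vector between layers. The abstract does not specify either detail, so the linear schedule, the truncate/zero-pad resizing rule, and the names `x_shaped_widths` and `resize_residual` are all assumptions made for illustration, not the authors' method.

```python
def x_shaped_widths(num_layers, outer_width, middle_width):
    """Hypothetical schedule: widths shrink linearly toward the middle
    layer and grow back symmetrically toward the last layer."""
    if num_layers < 1:
        raise ValueError("num_layers must be at least 1")
    if num_layers == 1:
        return [outer_width]
    center = (num_layers - 1) / 2
    widths = []
    for layer in range(num_layers):
        # distance is 1.0 at the first and last layer, 0.0 at the exact center
        distance = abs(layer - center) / center
        widths.append(round(middle_width + distance * (outer_width - middle_width)))
    return widths


def resize_residual(hidden, new_width):
    """Assumed parameter-free resize: truncate when narrowing,
    zero-pad when widening. No learned weights are involved."""
    if new_width <= len(hidden):
        return hidden[:new_width]
    return hidden + [0.0] * (new_width - len(hidden))


widths = x_shaped_widths(5, 1024, 512)
print(widths)                     # [1024, 768, 512, 768, 1024]
print(sum(widths) / len(widths))  # average width, below the outer width of 1024
```

Under these assumptions the average layer width falls below the outer width, which is the property the abstract connects to lower FLOPs and a smaller KV cache. The specific percentages reported there come from the paper's experiments and are not reproduced by this sketch.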

BibTeX

@article{wu2026variable,
  title = {Variable-Width Transformers},
  author = {Zhaofeng Wu and Oliver Sieberling and Shawn Tan and Rameswar Panda and Yury Polyanskiy and Yoon Kim},
  year = {2026},
  abstract = {Scaling model size, specifically depth and width, has driven significant progress in transformer-based language models. However, most architectures maintain a constant width across all layers, allocating a fixed parameter and computation budget evenly despite different layers potentially playing distinct computational roles. In this work, we empirically investigate nonuniform capacity allocation across network depth by proposing a $\times$-shaped > <former architecture. This design maintains wid},
  url = {https://arxiv.org/abs/2606.18246},
  keywords = {cs.CL, transformer-based language models, model size scaling, depth, width, parameter-efficient fine-tuning, $\times$-shaped architecture, residual resizing mechanism, decoder-only language models, MoE, FLOPs, KV cache memory, code available, huggingface daily},
  eprint = {2606.18246},
  archiveprefix = {arXiv},
}
