Paper Detail

Temporally Extended Mixture-of-Experts Models

Zeyu Shen, Peter Henderson

arXiv Score 15.0

Published 2026-04-22 · First seen 2026-04-23

Research Track A · General AI

Abstract

Mixture-of-Experts models, now popular for scaling capacity at fixed inference speed, switch experts at nearly every token. Once a model outgrows available GPU memory, this churn can render optimizations like offloading and pre-fetching ineffective. We make the case that the options framework in reinforcement learning is a perfect match to tackle this problem, and argue for temporally extended mixture-of-experts layers. Building on the option-critic framework with deliberation costs, we add a controller to each layer that learns when to switch expert sets and which to load. By applying this to gpt-oss-20b with low-rank adapters and a self-distillation reward, our method reduces switch rates from over 50% to below 5% while retaining up to 90% of base-model accuracy on MATH, MMLU, and MMMLU. This shows that even existing pre-trained models can be converted to temporally extended MoEs with lightweight training, with the deliberation cost allowing model trainers to trade off switching rates against capability. We hope this opens a principled path, grounded in the options framework, for memory-efficient serving and continual learning in ever-growing MoE models.
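The switch-rate metric and the role of the deliberation cost can be illustrated with a toy sketch. This is not the authors' method: the paper trains a per-layer controller with option-critic, whereas the rule below is a hand-written threshold. The names `switch_rate` and `sticky_routing`, the per-token `gains`, and all numeric values are hypothetical and chosen only for illustration.

```python
from typing import FrozenSet, List, Sequence

ExpertSet = FrozenSet[int]


def switch_rate(expert_sets: Sequence[ExpertSet]) -> float:
    """Fraction of token-to-token transitions where the active expert set changes."""
    if len(expert_sets) < 2:
        return 0.0
    switches = sum(
        1 for prev, cur in zip(expert_sets, expert_sets[1:]) if prev != cur
    )
    return switches / (len(expert_sets) - 1)


def sticky_routing(
    preferred: Sequence[ExpertSet],
    gains: Sequence[float],
    deliberation_cost: float,
) -> List[ExpertSet]:
    """Keep the current expert set unless the estimated gain from switching
    exceeds the deliberation cost.

    ASSUMPTION: `gains[t]` stands in for whatever value estimate a learned
    controller would produce; here it is simply supplied by the caller.
    """
    active = preferred[0]
    routed = [active]
    for want, gain in zip(preferred[1:], gains[1:]):
        if want != active and gain > deliberation_cost:
            active = want  # pay the cost and load the new expert set
        routed.append(active)
    return routed


# Hypothetical per-token router preferences over three expert sets.
A, B, C = frozenset({0, 1}), frozenset({2, 3}), frozenset({4, 5})
preferred = [A, B, A, C, C, B]
gains = [0.0, 0.1, 0.2, 0.9, 0.0, 0.3]

per_token = switch_rate(preferred)
extended = switch_rate(sticky_routing(preferred, gains, deliberation_cost=0.5))
print(f"per-token switch rate: {per_token:.2f}")
print(f"temporally extended switch rate: {extended:.2f}")
```

Raising `deliberation_cost` in this toy makes the routing stickier and lowers the switch rate, at the price of ignoring more of the router's per-token preferences. That mirrors, in a simplified form, the trade-off between switching rate and capability that the abstract describes.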

BibTeX

@article{shen2026temporally,
  title = {Temporally Extended Mixture-of-Experts Models},
  author = {Zeyu Shen and Peter Henderson},
  year = {2026},
  abstract = {Mixture-of-Experts models, now popular for scaling capacity at fixed inference speed, switch experts at nearly every token. Once a model outgrows available GPU memory, this churn can render optimizations like offloading and pre-fetching ineffective. We make the case that the options framework in reinforcement learning is a perfect match to tackle this problem, and argue for temporally extended mixture-of-experts layers. Building on the option-critic framework with deliberation costs, we add a co},
  url = {https://arxiv.org/abs/2604.20156},
  keywords = {cs.LG, mixture-of-experts, reinforcement learning, options framework, option-critic framework, deliberation costs, self-distillation, low-rank adapters, gpt-oss-20b, code available, huggingface daily},
  eprint = {2604.20156},
  archiveprefix = {arXiv},
}
