MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU

Abstract

We present MegaTrain, a memory-centric system that efficiently trains 100B+ parameter large language models at full precision on a single GPU. Unlike traditional GPU-centric systems, MegaTrain stores parameters and optimizer states in host (CPU) memory and treats GPUs as transient compute engines. For each layer, we stream parameters in and compute gradients out, minimizing persistent device state. To mitigate the CPU-GPU bandwidth bottleneck, we adopt two key optimizations. 1) We introduce a pipelined, double-buffered execution engine that overlaps parameter prefetching, computation, and gradient offloading across multiple CUDA streams, enabling continuous GPU execution. 2) We replace persistent autograd graphs with stateless layer templates, binding weights dynamically as they stream in, which eliminates persistent graph metadata while providing flexibility in scheduling. On a single H200 GPU with 1.5TB of host memory, MegaTrain reliably trains models of up to 120B parameters. When training 14B models, it achieves 1.84× the training throughput of DeepSpeed ZeRO-3 with CPU offloading. MegaTrain also enables 7B model training with a 512k-token context on a single GH200.
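
The benefit of overlapping prefetching, computation, and offloading can be illustrated with a small timing model. The sketch below is not MegaTrain's implementation: the function names, the per-layer timings, and the simplification that only weight buffers are limited (gradient buffers are treated as always available) are assumptions made for illustration. It compares a fully sequential schedule with a schedule that runs host-to-device copies, compute, and device-to-host copies on three independent streams while cycling through a fixed number of device weight buffers.

```python
def sequential_time(n_layers, t_load, t_compute, t_offload):
    """Total time when every layer is loaded, computed, and offloaded back to back."""
    return n_layers * (t_load + t_compute + t_offload)


def pipelined_time(n_layers, t_load, t_compute, t_offload, n_buffers=2):
    """Total time with three overlapping streams and n_buffers weight buffers.

    Hypothetical model: each stream handles one layer at a time, and the
    buffer for layer i is reused only after layer i - n_buffers has finished
    computing. Gradient buffers are assumed to be always available.
    """
    load_done, compute_done, offload_done = [], [], []
    for i in range(n_layers):
        # The copy-in stream is busy until the previous load finishes.
        load_start = load_done[i - 1] if i > 0 else 0.0
        # The target buffer is free only after its previous occupant was computed.
        if i >= n_buffers:
            load_start = max(load_start, compute_done[i - n_buffers])
        load_done.append(load_start + t_load)

        # Compute needs the weights on the device and an idle compute stream.
        prev_compute = compute_done[i - 1] if i > 0 else 0.0
        compute_done.append(max(load_done[i], prev_compute) + t_compute)

        # Offload needs the gradients and an idle copy-out stream.
        prev_offload = offload_done[i - 1] if i > 0 else 0.0
        offload_done.append(max(compute_done[i], prev_offload) + t_offload)
    return offload_done[-1] if n_layers else 0.0


if __name__ == "__main__":
    # Compute-bound case: transfers hide behind computation.
    print(sequential_time(4, 1.0, 2.0, 1.0))  # 16.0
    print(pipelined_time(4, 1.0, 2.0, 1.0))   # 10.0
    # Transfer-bound case: the copy-in stream sets the pace.
    print(pipelined_time(3, 3.0, 1.0, 1.0))   # 11.0
```

In this model, double buffering hides transfers completely when per-layer compute time is at least the per-layer transfer time. When transfers dominate, the total time is bounded by the copy-in stream instead, which is why CPU-GPU bandwidth remains the key constraint for this style of training.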

BibTeX

@misc{yuan2026megatrain,
  title = {MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU},
  author = {Zhengqing Yuan and Hanchi Sun and Lichao Sun and Yanfang Ye},
  year = {2026},
  abstract = {We present MegaTrain, a memory-centric system that efficiently trains 100B+ parameter large language models at full precision on a single GPU. Unlike traditional GPU-centric systems, MegaTrain stores parameters and optimizer states in host memory (CPU memory) and treats GPUs as transient compute engines. For each layer, we stream parameters in and compute gradients out, minimizing persistent device state. To battle the CPU-GPU bandwidth bottleneck, we adopt two key optimizations. 1) We introduce},
  url = {https://huggingface.co/papers/2604.05091},
  keywords = {large language models, full precision, host memory, optimizer states, transient compute engines, parameter streaming, gradient offloading, pipelined double-buffered execution engine, CUDA streams, stateless layer templates, persistent autograd graphs, DeepSpeed ZeRO-3, CPU-GPU bandwidth bottleneck, code available, huggingface daily},
  eprint = {2604.05091},
  archiveprefix = {arXiv},
}