Paper Detail

Flash-GMM: A Memory-Efficient Kernel for Scalable Soft Clustering

Gal Bloch, Ariel Gera, Matan Orbach, Ohad Eytan, Assaf Toledo

huggingface Score 4.5

Published 2026-06-09 · First seen 2026-06-13

General AI

Abstract

We present Flash-GMM, a fused Triton kernel for efficient computation of Gaussian Mixture Models (GMMs) over large-scale data in a single GPU pass. By eliminating the need to materialize the full responsibility matrix in GPU memory, Flash-GMM achieves a 20× speedup over existing implementations and enables training on datasets more than 100× larger than previously feasible on one device. To demonstrate its impact, we integrate Flash-GMM into the IVF coarse quantizer for approximate nearest-neighbor (ANN) search. We show that soft GMM clustering is now a viable drop-in replacement for k-means, and that GMM responsibilities can be leveraged to assign border vectors to multiple clusters. Our approach reaches fixed recall targets with up to 1.7× fewer distance computations, or equivalently, yields +2–12 recall@10 at matched computational cost. We release the kernel as an open-source project.
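
The memory-saving idea in the abstract can be illustrated without a GPU. The sketch below is not the paper's Triton kernel. It is a plain-Python illustration, written under the assumption of isotropic covariances, of one EM iteration that folds each point's responsibilities into running sufficient statistics and then discards them, so the N × K responsibility matrix is never stored. The function names and the `threshold` rule for multi-cluster assignment are hypothetical and chosen only for this example.

```python
import math

def log_gauss_iso(x, mean, var):
    """Log density of an isotropic Gaussian N(mean, var * I) at x."""
    d = len(x)
    sq = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return -0.5 * (d * math.log(2 * math.pi * var) + sq / var)

def responsibilities(x, weights, means, variances):
    """Posterior p(component k | x), via a numerically stable log-sum-exp."""
    logs = [math.log(w) + log_gauss_iso(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    top = max(logs)
    exps = [math.exp(l - top) for l in logs]
    total = sum(exps)
    return [e / total for e in exps]

def em_step_streaming(data, weights, means, variances):
    """One EM iteration that keeps only O(K * d) running statistics.

    Each point's K responsibilities are computed, accumulated, and
    dropped, so `data` may be any iterable (including a generator).
    Assumes every component receives nonzero total responsibility.
    """
    k, d = len(means), len(means[0])
    s0 = [0.0] * k                       # sum of responsibilities
    s1 = [[0.0] * d for _ in range(k)]   # responsibility-weighted sum of x
    s2 = [0.0] * k                       # responsibility-weighted sum of ||x||^2
    n = 0
    for x in data:
        r = responsibilities(x, weights, means, variances)
        xx = sum(xi * xi for xi in x)
        for j in range(k):
            s0[j] += r[j]
            s2[j] += r[j] * xx
            for t in range(d):
                s1[j][t] += r[j] * x[t]
        n += 1
    new_weights = [s / n for s in s0]
    new_means = [[s1[j][t] / s0[j] for t in range(d)] for j in range(k)]
    new_vars = []
    for j in range(k):
        mm = sum(m * m for m in new_means[j])
        # Isotropic variance: (E[||x||^2] - ||mean||^2) / d, floored for stability.
        new_vars.append(max((s2[j] / s0[j] - mm) / d, 1e-6))
    return new_weights, new_means, new_vars

def soft_assign(x, weights, means, variances, threshold=0.2):
    """Hypothetical rule: keep every component with responsibility >= threshold."""
    r = responsibilities(x, weights, means, variances)
    picked = [j for j, rj in enumerate(r) if rj >= threshold]
    # Fall back to the single most responsible component if none pass.
    return picked or [max(range(len(r)), key=r.__getitem__)]
```

With this rule, a vector lying midway between two components is assigned to both, while a vector close to one center is assigned to that component only. How the paper selects border vectors is not stated in the abstract, so the threshold here is a placeholder for whatever criterion the authors use.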

BibTeX

@misc{bloch2026flash,
  title = {Flash-GMM: A Memory-Efficient Kernel for Scalable Soft Clustering},
  author = {Gal Bloch and Ariel Gera and Matan Orbach and Ohad Eytan and Assaf Toledo},
  year = {2026},
  abstract = {We present Flash-GMM, a fused Triton kernel for efficient computation of Gaussian Mixture Models (GMMs) over large-scale data in a single GPU pass. By eliminating the need to materialize the full responsibility matrix in GPU memory, Flash-GMM achieves a 20$\times$ speedup over existing implementations and enables training on datasets more than 100$\times$ larger than previously feasible on one device. To demonstrate its impact, we integrate Flash-GMM into the IVF coarse quantizer for approximate neare},
  url = {https://huggingface.co/papers/2606.10896},
  keywords = {Gaussian Mixture Models, Triton kernel, responsibility matrix, approximate nearest-neighbor search, k-means, soft clustering, IVF coarse quantizer, distance computations, recall@10, code available, huggingface daily},
  eprint = {2606.10896},
  archiveprefix = {arXiv},
}