Paper Detail

Hölder++: Improving the Quality-Coherence Trade-off in Multimodal VAEs

Huyen Vo, María Martínez-García, Isabel Valera

Browse

Workflow Queues

arxiv Score 5.3

Published 2026-06-11 · First seen 2026-06-13

General AI

Open paper source

Abstract

Existing approaches for multimodal variational autoencoders (VAEs) face a trade-off between generative quality and coherence-i.e., they struggle to generate realistic and diverse samples that, at the same time, are semantically consistent across modalities. A recent work shows that using a simple approximation to Hölder pooling as an aggregation method improves coherence over the SOTA MMVAE+, despite assuming a single shared representation across all modalities. Yet, it slightly compromises sample diversity. Inspired by this insight, we propose Hölder++, a novel multimodal VAE that improves the generative quality-coherence trade-off through: (i) the first implementation of Hölder pooling without any approximation for multimodal VAEs; (ii) an extended architecture that models distinct shared and private (i.e., modality-specific) representations (Hölder+); and (iii) hierarchical inference that further enhances the disentanglement between the shared and private representations (Hölder++). Our experiments corroborate that Hölder++ consistently improves the generative quality-coherence trade-off, yields more structured latent spaces, and learns shared representations that are informative for downstream tasks.

Workflow Status

Review status: pending
Role: unreviewed
Read priority: later
Vote: Not set.
Saved: no
Collections: Not filed yet.
Next action: Not filled yet.

Reading Brief

No structured notes yet. Add `summary_sections`, `why_relevant`, `claim_impact`, or `next_action` in `papers.jsonl` to enrich this view.

Why It Surfaced

No ranking explanation is available yet.

BibTeX

@article{vo2026h,
  title = {Hölder++: Improving the Quality-Coherence Trade-off in Multimodal VAEs},
  author = {Huyen Vo and María Martínez-García and Isabel Valera},
  year = {2026},
  abstract = {Existing approaches for multimodal variational autoencoders (VAEs) face a trade-off between generative quality and coherence-i.e., they struggle to generate realistic and diverse samples that, at the same time, are semantically consistent across modalities. A recent work shows that using a simple approximation to Hölder pooling as an aggregation method improves coherence over the SOTA MMVAE+, despite assuming a single shared representation across all modalities. Yet, it slightly compromises samp},
  url = {https://arxiv.org/abs/2606.13381},
  keywords = {cs.LG},
  eprint = {2606.13381},
  archiveprefix = {arXiv},
}

Metadata

{}