Paper Detail

Listen, Look, and Learn: Learning Without Forgetting through SAM-Audio

Avi Gupta, Nilotpal Sinha, Vishnu Raj, Sambuddha Saha, Pratik Joshi, Koteswar Rao Jerripothula, Tammam Tillo

Browse

Workflow Queues

arxiv Score 10.3

Published 2026-06-09 · First seen 2026-06-10

General AI

Open paper source

Abstract

Class-Incremental Learning (CIL) aims to continuously learn new classes without forgetting previously acquired knowledge. While recent CIL advances have spurred significant interest across various modalities, the audio-visual setting remains underexplored. Furthermore, although foundational multimodal models like SAM-Audio encapsulate rich static priors, our empirical analysis reveals that these representations struggle in incremental settings. This work bridges this gap by integrating SAM-Audio's audio-visual priors into the CIL setting. Specifically, we leverage its dense audio and visual representations and employ a novel guided attention strategy where the audio features contextually guide the visual representations. To further mitigate catastrophic forgetting, we introduce dual-level distillation objectives at both the feature and logit levels. Extensive evaluations on audio-visual CIL benchmarks demonstrate that our approach consistently outperforms state-of-the-art methods.

Workflow Status

Review status: pending
Role: unreviewed
Read priority: now
Vote: Not set.
Saved: no
Collections: Not filed yet.
Next action: Not filled yet.

Reading Brief

No structured notes yet. Add `summary_sections`, `why_relevant`, `claim_impact`, or `next_action` in `papers.jsonl` to enrich this view.

Why It Surfaced

No ranking explanation is available yet.

BibTeX

@article{gupta2026listen,
  title = {Listen, Look, and Learn: Learning Without Forgetting through SAM-Audio},
  author = {Avi Gupta and Nilotpal Sinha and Vishnu Raj and Sambuddha Saha and Pratik Joshi and Koteswar Rao Jerripothula and Tammam Tillo},
  year = {2026},
  abstract = {Class-Incremental Learning (CIL) aims to continuously learn new classes without forgetting previously acquired knowledge. While recent CIL advances have spurred significant interest across various modalities, the audio-visual setting remains underexplored. Furthermore, although foundational multimodal models like SAM-Audio encapsulate rich static priors, our empirical analysis reveals that these representations struggle in incremental settings. This work bridges this gap by integrating SAM-Audio},
  url = {https://arxiv.org/abs/2606.10887},
  keywords = {cs.CV},
  eprint = {2606.10887},
  archiveprefix = {arXiv},
}

Metadata

{}