Multimodal Music Recommendation System using LLMs

Srikar Prabhas Kandagatla, Sreehitha R. Narayana, Chandana Magapu, Swetha Mohan, Shamanth Kuthpadi, Hongjie Chen, Ryan A. Rossi, Franck Dernoncourt, Nesreen Ahmed

huggingface Score 12.5

Published 2026-05-28 · First seen 2026-06-05

General AI

Abstract

Music recommendation systems typically treat songs as opaque tokens and rely on collaborative interaction histories, overlooking semantic or acoustic content. Prior work has explored LLM-augmented, multimodal, and text-enhanced approaches to sequential recommendation. While some methods partially combine semantic, acoustic, or engagement signals, none jointly models all three within a unified LLM-based sequential reasoning framework that grounds recommendations in actual song content. In this work, we propose a multimodal framework for session-based music recommendation that enriches the LastFM-1K dataset with three complementary signals: (1) audio and lyric embeddings extracted using pretrained music and text representation models, (2) LLM-generated semantic metadata following the MGPHot annotation schema, and (3) listening completion ratios. We adopt the E4SRec framework and extend it with multimodal features and several item ID encoder backbones, including SASRec, BERT4Rec, and GRU4Rec. We further extend the LLM backbone options with LLaMa-2-13B, Qwen2.5-7B-Instruct, and LLaMa-3-70B in both zero-shot and fine-tuned settings. Our experiments show that integrating content-based features improves over ID-only baselines by up to 95% in Recall and 79% in NDCG. They also show that naive multimodal fusion does not always yield additive improvements, highlighting challenges in cross-modal integration. We release a large-scale multimodal benchmark for music recommendation.
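
The abstract reports gains in Recall and NDCG without spelling out how they are computed. The following is a minimal sketch of the standard top-K versions of both metrics for next-item prediction, assuming one held-out target song per session. The function names, the toy song IDs, and the reading of "up to 95%" as a relative improvement over the baseline are assumptions for illustration, not details taken from the paper.

```python
import math
from typing import Sequence


def recall_at_k(ranked: Sequence[str], target: str, k: int) -> float:
    """Return 1.0 if the held-out item is in the top-k of the ranking, else 0.0."""
    return 1.0 if target in ranked[:k] else 0.0


def ndcg_at_k(ranked: Sequence[str], target: str, k: int) -> float:
    """NDCG@k with a single relevant item.

    The ideal DCG is 1 (target at rank 1), so NDCG reduces to
    1 / log2(rank + 1), where rank is the 1-based position of the target.
    """
    top_k = list(ranked[:k])
    if target not in top_k:
        return 0.0
    rank = top_k.index(target) + 1
    return 1.0 / math.log2(rank + 1)


def relative_gain_percent(baseline: float, improved: float) -> float:
    """Relative improvement over a baseline, in percent (assumed reading of the abstract)."""
    return (improved - baseline) / baseline * 100.0


# Toy ranking with hypothetical song IDs; these are not values from the paper.
ranking = ["song_a", "song_b", "song_c", "song_d"]
hit = recall_at_k(ranking, "song_c", k=3)
gain = ndcg_at_k(ranking, "song_c", k=3)
```

In practice both metrics are averaged over all test sessions. Under the relative-improvement reading, a baseline Recall of 0.20 rising to 0.30 would count as a 50% gain.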

BibTeX

@misc{kandagatla2026multimodal,
  title = {Multimodal Music Recommendation System using LLMs},
  author = {Srikar Prabhas Kandagatla and Sreehitha R. Narayana and Chandana Magapu and Swetha Mohan and Shamanth Kuthpadi and Hongjie Chen and Ryan A. Rossi and Franck Dernoncourt and Nesreen Ahmed},
  year = {2026},
  abstract = {Music recommendation systems typically treat songs as opaque tokens and rely on collaborative interaction histories, overlooking semantic or acoustic content. Prior work has explored LLM-augmented, multimodal, and text-enhanced approaches to sequential recommendation. While some methods partially combine semantic, acoustic, or engagement signals, none jointly models all three within a unified LLM-based sequential reasoning framework that grounds recommendations in actual song content. In t},
  url = {https://huggingface.co/papers/2606.00125},
  keywords = {multimodal framework, LastFM-1K dataset, audio embeddings, lyric embeddings, pretrained music models, text representation models, LLM-generated semantic metadata, MGPHot annotation schema, listening completion ratios, E4SRec framework, item ID encoder backbones, SASRec, BERT4Rec, GRU4Rec, LLaMa-2-13B, Qwen2.5-7B-Instruct, LLaMa-3-70B, zero-shot learning, fine-tuned settings, Recall, NDCG, naive multimodal fusion, cross-modal integration, huggingface daily},
  eprint = {2606.00125},
  archiveprefix = {arXiv},
}