Paper Detail
Ke-Han Lu, Szu-Wei Fu, Chao-Han Huck Yang, Zhehuai Chen, Sung-Feng Huang, Chih-Kai Yang, Yi-Cheng Lin, Chi-Yuan Hsiao, Wenze Ren, En-Pei Hu, Yu-Han Huang, An-Yu Cheng, Cheng-Han Chiang, Yu Tsao, Yu-Chiang Frank Wang, Hung-yi Lee
Large language models (LLMs) have been widely used as knowledge backbones of Large Audio Language Models (LALMs), yet how much auditory knowledge they encode through text-only pre-training and how this affects downstream performance remain unclear. We study this gap by comparing different LLMs under two text-only settings and one audio-grounded setting: (1) direct probing on AKB-2000, a curated benchmark testing the breadth and depth of auditory knowledge; (2) cascade evaluation, where LLMs reason over text descriptions from an audio captioner; and (3) audio-grounded evaluation, where each LLM is fine-tuned into an LALM with an audio encoder. Our findings reveal that auditory knowledge varies substantially across LLM families, and that text-only results are strongly correlated with audio performance. Our work provides empirical grounding for a comprehensive understanding of LLMs in audio research.
@misc{lu2026how,
title = {How Auditory Knowledge in LLM Backbones Shapes Audio Language Models: A Holistic Evaluation},
author = {Ke-Han Lu and Szu-Wei Fu and Chao-Han Huck Yang and Zhehuai Chen and Sung-Feng Huang and Chih-Kai Yang and Yi-Cheng Lin and Chi-Yuan Hsiao and Wenze Ren and En-Pei Hu and Yu-Han Huang and An-Yu Cheng and Cheng-Han Chiang and Yu Tsao and Yu-Chiang Frank Wang and Hung-yi Lee},
year = {2026},
abstract = {Large language models (LLMs) have been widely used as knowledge backbones of Large Audio Language Models (LALMs), yet how much auditory knowledge they encode through text-only pre-training and how this affects downstream performance remain unclear. We study this gap by comparing different LLMs under two text-only settings and one audio-grounded setting: (1) direct probing on AKB-2000, a curated benchmark testing the breadth and depth of auditory knowledge; (2) cascade evaluation, where LLMs reason over },
url = {https://huggingface.co/papers/2603.19195},
keywords = {code available, huggingface daily},
eprint = {2603.19195},
archiveprefix = {arXiv},
}