Paper Detail
Bin Lin, Bo Zhao, Boyong Wu, Chao Yan, Chen Wu, Cheng Yi, Chengyuan Yao, Daijiao Liu, Fei Tian, Feng Tian, Haiyang Sun, Haoyang Zhang, Jiangjie Zhen, Jinglan Gong, Jun Chen, Li Xie, Peilin Li, Peng Yang, Pengfei Tan, Qingjian Lin, Runze Li, Shenghua Hu, Siyi Zhou, Wenwen Qu, Xiangyu Li, Xiangyu Tony Zhang, Xuerui Yang, Yang Yang, Yechang Huang, Yu Fu, Yuchu Luo, Yuxin Li, Yuxin Zhang, Zhengyan Sheng, Brian Li, Chang Zeng, Changlin Zhang, Chen Geng, Chenghao Dong, Chengli Feng, Dan Zhou, Danni Wan, Di Chen, Die Zhang, Dongqing Pang, Guanglong Yang, Guoqiang Hu, Huangxi Zhu, Jianzheng Gao, Jinghua Liang, Jinmei Wan, Junjie Yuan, Kang An, Lei Lei, Limin Zhong, Lun Cai, Mengqiang Ren, Min Xu, Mingliang Li, Mingxiao Li, Na Wang, Qiang Tong, Qiaoling Huang, Qingfu Du, Rui Wang, Shengchen Zhou, Shi Qiu, Shihao Peng, Shiliang Yang, Siqi Tu, Tianjiao Deng, Ting Xu, Tong Wang, WeiMing Niu, Wuxun Xie, Xianwei Zhang, Xianyu Feng, Xiaojia Liu, Xing Chen, Xiongbin Wu, Yan Wu, Yang Li, Yi Liu, Yifan Zhang, Yile Liu, Yongshen Long, Yu Luo, Yuanhao Ding, Yuhao Wang, Yuhe Yin, Yunfang Xu, Yuxiang Yang, Zhiguo Huang, Zhiyue Wu, Zichao Li, Zichao Zhou, Daxin Jiang, Future Li, Gang Yu, Xiangyu Zhang, Yibo Zhu
Unified audio-language modeling has emerged as a prominent trend in modern speech systems, promising to bring the reasoning capabilities of large language models to auditory tasks. However, existing unified foundations often struggle to match the depth of specialized systems across automatic speech recognition (ASR), text-to-speech synthesis (TTS), and realtime spoken interaction. Bridging this gap remains an open challenge. This report presents StepAudio 2.5, a unified audio-language foundation model that matches or exceeds specialized systems across all three capabilities. Rather than treating these tasks as architecturally distinct, we operate on the premise that once text and audio share a multimodal representational space, task specialization becomes a matter of operational regimes: data construction, optimization targets, and decoding constraints. Guided by this insight, we advance the post-training paradigm from standard supervised learning to task-tailored Reinforcement Learning from Human Feedback (RLHF), using it as the primary mechanism to define complex optimization targets. We leverage this RLHF-centric alignment, alongside specialized decoding, to shape a shared backbone into three distinct operational modes. Concretely, the ASR branch advances transcription efficiency via verifiable multi-token decoding; the TTS branch achieves controllable, expressive synthesis through preference-based RLHF and context-rich supervision; and the Realtime branch realizes low-latency, persona-consistent dialogue via generative reward modeling within an RLHF framework. On standard benchmarks, StepAudio 2.5 achieves state-of-the-art results across ASR, TTS, and Realtime, demonstrating that a singular audio-language foundation can successfully internalize the distinct deployment objectives of speech understanding, generation, and live interaction.
@misc{lin2026stepaudio,
title = {{StepAudio} 2.5 Technical Report},
author = {Bin Lin and Bo Zhao and Boyong Wu and Chao Yan and Chen Wu and Cheng Yi and Chengyuan Yao and Daijiao Liu and Fei Tian and Feng Tian and Haiyang Sun and Haoyang Zhang and Jiangjie Zhen and Jinglan Gong and Jun Chen and Li Xie and Peilin Li and Peng Yang and Pengfei Tan and Qingjian Lin and Runze Li and Shenghua Hu and Siyi Zhou and Wenwen Qu and Xiangyu Li and Xiangyu Tony Zhang and Xuerui Yang and Yang Yang and Yechang Huang and Yu Fu and Yuchu Luo and Yuxin Li and Yuxin Zhang and Zhengyan Sheng and Brian Li and Chang Zeng and Changlin Zhang and Chen Geng and Chenghao Dong and Chengli Feng and Dan Zhou and Danni Wan and Di Chen and Die Zhang and Dongqing Pang and Guanglong Yang and Guoqiang Hu and Huangxi Zhu and Jianzheng Gao and Jinghua Liang and Jinmei Wan and Junjie Yuan and Kang An and Lei Lei and Limin Zhong and Lun Cai and Mengqiang Ren and Min Xu and Mingliang Li and Mingxiao Li and Na Wang and Qiang Tong and Qiaoling Huang and Qingfu Du and Rui Wang and Shengchen Zhou and Shi Qiu and Shihao Peng and Shiliang Yang and Siqi Tu and Tianjiao Deng and Ting Xu and Tong Wang and WeiMing Niu and Wuxun Xie and Xianwei Zhang and Xianyu Feng and Xiaojia Liu and Xing Chen and Xiongbin Wu and Yan Wu and Yang Li and Yi Liu and Yifan Zhang and Yile Liu and Yongshen Long and Yu Luo and Yuanhao Ding and Yuhao Wang and Yuhe Yin and Yunfang Xu and Yuxiang Yang and Zhiguo Huang and Zhiyue Wu and Zichao Li and Zichao Zhou and Daxin Jiang and Future Li and Gang Yu and Xiangyu Zhang and Yibo Zhu},
year = {2026},
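abstract = {Unified audio-language modeling has emerged as a prominent trend in modern speech systems, promising to bring the reasoning capabilities of large language models to auditory tasks. However, existing unified foundations often struggle to match the depth of specialized systems across automatic speech recognition (ASR), text-to-speech synthesis (TTS), and realtime spoken interaction. Bridging this gap remains an open challenge. This report presents StepAudio 2.5, a unified audio-language foundation model that matches or exceeds specialized systems across all three capabilities. Rather than treating these tasks as architecturally distinct, we operate on the premise that once text and audio share a multimodal representational space, task specialization becomes a matter of operational regimes: data construction, optimization targets, and decoding constraints. Guided by this insight, we advance the post-training paradigm from standard supervised learning to task-tailored Reinforcement Learning from Human Feedback (RLHF), using it as the primary mechanism to define complex optimization targets. We leverage this RLHF-centric alignment, alongside specialized decoding, to shape a shared backbone into three distinct operational modes. Concretely, the ASR branch advances transcription efficiency via verifiable multi-token decoding; the TTS branch achieves controllable, expressive synthesis through preference-based RLHF and context-rich supervision; and the Realtime branch realizes low-latency, persona-consistent dialogue via generative reward modeling within an RLHF framework. On standard benchmarks, StepAudio 2.5 achieves state-of-the-art results across ASR, TTS, and Realtime, demonstrating that a singular audio-language foundation can successfully internalize the distinct deployment objectives of speech understanding, generation, and live interaction.},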
url = {https://huggingface.co/papers/2605.23463},
keywords = {unified audio-language modeling, automatic speech recognition, text-to-speech synthesis, real-time spoken interaction, post-training paradigm, Reinforcement Learning from Human Feedback, RLHF, multimodal representational space, task-tailored optimization, verifiable multi-token decoding, preference-based RLHF, generative reward modeling, huggingface daily},
eprint = {2605.23463},
archiveprefix = {arXiv},
}