Paper Detail
Zhen Ye, Xu Tan, Aoxiong Yin, Hongzhan Lin, Guangyan Zhang, Peiwen Sun, Yiming Li, Chi-Min Chan, Wei Ye, Shikun Zhang, Wei Xue
Joint audio-video generation models have shown that unified generation yields stronger cross-modal coherence than cascaded approaches. However, existing models couple modalities throughout denoising via pervasive attention, treating high-level semantics and low-level details in a fully entangled manner. This is suboptimal for talking head synthesis: while audio and facial motion are semantically correlated, their low-level realizations (acoustic signals and visual textures) follow distinct rendering processes. Enforcing joint modeling across all levels causes unnecessary entanglement and reduces efficiency. We propose Talker-T2AV, an autoregressive diffusion framework where high-level cross-modal modeling occurs in a shared backbone, while low-level refinement uses modality-specific decoders. A shared autoregressive language model jointly reasons over audio and video in a unified patch-level token space. Two lightweight diffusion transformer heads decode the hidden states into frame-level audio and video latents. Experiments on talking portrait benchmarks show Talker-T2AV outperforms dual-branch baselines in lip-sync accuracy, video quality, and audio quality, achieving stronger cross-modal consistency than cascaded pipelines.
No structured notes yet. Add `summary_sections`, `why_relevant`, `claim_impact`, or `next_action` in `papers.jsonl` to enrich this view.
No ranking explanation is available yet.
No tags.
@misc{ye2026talker,
title = {Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling},
author = {Zhen Ye and Xu Tan and Aoxiong Yin and Hongzhan Lin and Guangyan Zhang and Peiwen Sun and Yiming Li and Chi-Min Chan and Wei Ye and Shikun Zhang and Wei Xue},
year = {2026},
abstract = {Joint audio-video generation models have shown that unified generation yields stronger cross-modal coherence than cascaded approaches. However, existing models couple modalities throughout denoising via pervasive attention, treating high-level semantics and low-level details in a fully entangled manner. This is suboptimal for talking head synthesis: while audio and facial motion are semantically correlated, their low-level realizations (acoustic signals and visual textures) follow distinct rende},
url = {https://huggingface.co/papers/2604.23586},
keywords = {autoregressive diffusion framework, cross-modal modeling, modality-specific decoders, autoregressive language model, patch-level token space, diffusion transformer heads, lip-sync accuracy, cross-modal consistency, huggingface daily},
eprint = {2604.23586},
archiveprefix = {arXiv},
}
{}