Paper Detail

When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models

Zhengyang Sun, Yu Chen, Xin Zhou, Xiaofan Li, Xiwu Chen, Dingkang Liang, Xiang Bai

arxiv Score 3.8

Published 2026-04-09 · First seen 2026-04-10

General AI

Abstract

Text-to-video diffusion models have enabled open-ended video synthesis, but often struggle with generating the correct number of objects specified in a prompt. We introduce NUMINA, a training-free identify-then-guide framework for improved numerical alignment. NUMINA identifies prompt-layout inconsistencies by selecting discriminative self- and cross-attention heads to derive a countable latent layout. It then refines this layout conservatively and modulates cross-attention to guide regeneration. On the introduced CountBench, NUMINA improves counting accuracy by up to 7.4% on Wan2.1-1.3B, and by 4.9% and 5.5% on 5B and 14B models, respectively. Furthermore, CLIP alignment is improved while maintaining temporal consistency. These results demonstrate that structural guidance complements seed search and prompt enhancement, offering a practical path toward count-accurate text-to-video diffusion. The code is available at https://github.com/H-EmbodVis/NUMINA.
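
As a rough illustration of what modulating cross-attention with a layout can mean in general, the sketch below adds a bias to the attention logit of one text token, raising it at spatial positions inside a binary layout mask and lowering it elsewhere, before the softmax over tokens. This is not taken from the NUMINA code: the function name `modulate_cross_attention`, the `strength` parameter, and the additive-bias rule are all assumptions made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def modulate_cross_attention(logits, layout_mask, token_index, strength=2.0):
    """Bias one token's attention logit using a layout mask (hypothetical helper).

    logits:       one row per spatial position, one score per text token.
    layout_mask:  1 where the object should appear, 0 elsewhere.
    token_index:  column of the text token being guided (e.g. the counted noun).
    strength:     assumed bias magnitude; not a value from the paper.
    """
    if len(logits) != len(layout_mask):
        raise ValueError("logits and layout_mask must cover the same positions")
    weights = []
    for row, inside in zip(logits, layout_mask):
        biased = list(row)
        # Raise the token's score inside the layout, lower it outside.
        biased[token_index] += strength if inside else -strength
        weights.append(softmax(biased))
    return weights


# Toy example: 4 spatial positions, 3 text tokens, uniform starting scores.
logits = [[0.0, 0.0, 0.0] for _ in range(4)]
mask = [1, 0, 0, 1]
weights = modulate_cross_attention(logits, mask, token_index=1)
```

In this toy setup the guided token receives more than a uniform share of attention at masked positions and less than a uniform share elsewhere, while each row still sums to one.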

BibTeX

@article{sun2026when,
  title = {When Numbers Speak: Aligning Textual Numerals and Visual Instances in Text-to-Video Diffusion Models},
  author = {Zhengyang Sun and Yu Chen and Xin Zhou and Xiaofan Li and Xiwu Chen and Dingkang Liang and Xiang Bai},
  year = {2026},
  abstract = {Text-to-video diffusion models have enabled open-ended video synthesis, but often struggle with generating the correct number of objects specified in a prompt. We introduce NUMINA, a training-free identify-then-guide framework for improved numerical alignment. NUMINA identifies prompt-layout inconsistencies by selecting discriminative self- and cross-attention heads to derive a countable latent layout. It then refines this layout conservatively and modulates cross-attention to guide regeneration.},
  url = {https://arxiv.org/abs/2604.08546},
  keywords = {cs.CV, text-to-video diffusion models, numerical alignment, prompt-layout inconsistencies, self-attention heads, cross-attention heads, latent layout, attention modulation, CountBench, CLIP alignment, temporal consistency, code available, huggingface daily},
  eprint = {2604.08546},
  archiveprefix = {arXiv},
}
