Paper Detail

SignVerse-2M: A Two-Million-Clip Pose-Native Universe of 25+ Sign Languages

Sen Fang, Hongbin Zhong, Yanxin Zhang, Dimitris N. Metaxas

arxiv Score 6.2

Published 2026-05-03 · First seen 2026-05-05

General AI

Abstract

Existing large-scale sign language resources typically provide supervision only at the level of raw video-text alignment and are often produced in laboratory settings. While such resources are important for semantic understanding, they do not directly provide a unified interface for open-world recognition and translation, or for modern pose-driven sign language video generation frameworks: 1. RGB-based pretrained recognition models depend heavily on fixed backgrounds or clothing conditions during recording, and are less robust in open-world settings than style-agnostic pose-processing models. 2. Recent pose-guided image/video generation models mostly use a unified keypoint representation such as DWPose as their control interface. At present, the sign language field still lacks a data resource that can directly interface with this modern pose-native paradigm while also targeting real-world open scenarios. We present SignVerse-2M, a large-scale multilingual pose-native dataset for sign language pose modeling and evaluation. Built from publicly available multilingual sign language video resources, it applies DWPose in a unified preprocessing pipeline to convert raw videos into 2D pose sequences that can be used directly for modeling, resulting in a consolidated corpus of about two million clips covering more than 25 sign languages. Unlike many laboratory datasets, this resource preserves the recording conditions and speaker diversity of real-world videos while reducing appearance variation through a unified pose representation. Toward this goal, we further provide the data construction pipeline, task definitions, and a simple SignDW Transformer baseline, demonstrating the feasibility of this resource for multilingual pose-space modeling and its compatibility with modern pose-driven pipelines, while discussing the evaluation claims it can support as well as its current limitations.

Workflow Status

Review status
pending
Role
unreviewed
Read priority
later
Vote
Not set.
Saved
no
Collections
Not filed yet.
Next action
Not filled yet.

Reading Brief

No structured notes yet. Add `summary_sections`, `why_relevant`, `claim_impact`, or `next_action` in `papers.jsonl` to enrich this view.

Why It Surfaced

No ranking explanation is available yet.

Tags

No tags.

BibTeX

@article{fang2026signverse,
  title = {SignVerse-2M: A Two-Million-Clip Pose-Native Universe of 25+ Sign Languages},
  author = {Sen Fang and Hongbin Zhong and Yanxin Zhang and Dimitris N. Metaxas},
  year = {2026},
  abstract = {Existing large-scale sign language resources typically provide supervision only at the level of raw video-text alignment and are often produced in laboratory settings. While such resources are important for semantic understanding, they do not directly provide a unified interface for open-world recognition and translation, or for modern pose-driven sign language video generation frameworks: 1. RGB-based pretrained recognition models depend heavily on fixed backgrounds or clothing conditions durin},
  url = {https://arxiv.org/abs/2605.01720},
  keywords = {cs.CV, cs.AI, cs.CL},
  eprint = {2605.01720},
  archiveprefix = {arXiv},
}

Metadata

{}