Paper Detail

BhashaSutra: A Task-Centric Unified Survey of Indian NLP Datasets, Corpora, and Resources

Raghvendra Kumar, Devankar Raj, Sriparna Saha

arxiv Score 10.3

Published 2026-04-20 · First seen 2026-04-21

General AI

Abstract

India's linguistic landscape, spanning 22 scheduled languages and hundreds of marginalized dialects, has driven rapid growth in NLP datasets, benchmarks, and pretrained models. However, no dedicated survey consolidates resources developed specifically for Indian languages. Existing reviews either focus on a few high-resource languages or subsume Indian languages within broader multilingual settings, limiting coverage of low-resource and culturally diverse varieties. To address this gap, we present the first unified survey of Indian NLP resources, covering 200+ datasets, 50+ benchmarks, and 100+ models, tools, and systems across text, speech, multimodal, and culturally grounded tasks. We organize resources by linguistic phenomena, domains, and modalities; analyze trends in annotation, evaluation, and model design; and identify persistent challenges such as data sparsity, uneven language coverage, script diversity, and limited cultural and domain generalization. This survey offers a consolidated foundation for equitable, culturally grounded, and scalable NLP research in the Indian linguistic ecosystem.

Workflow Status

Review status
pending
Role
unreviewed
Read priority
now
Vote
Not set.
Saved
no
Collections
Not filed yet.
Next action
Not filled yet.

Reading Brief

No structured notes yet. Add `summary_sections`, `why_relevant`, `claim_impact`, or `next_action` in `papers.jsonl` to enrich this view.

Why It Surfaced

No ranking explanation is available yet.

Tags

No tags.

BibTeX

@article{kumar2026bhashasutra,
  title = {BhashaSutra: A Task-Centric Unified Survey of Indian NLP Datasets, Corpora, and Resources},
  author = {Raghvendra Kumar and Devankar Raj and Sriparna Saha},
  year = {2026},
  abstract = {India's linguistic landscape, spanning 22 scheduled languages and hundreds of marginalized dialects, has driven rapid growth in NLP datasets, benchmarks, and pretrained models. However, no dedicated survey consolidates resources developed specifically for Indian languages. Existing reviews either focus on a few high-resource languages or subsume Indian languages within broader multilingual settings, limiting coverage of low-resource and culturally diverse varieties. To address this gap, we prese},
  url = {https://arxiv.org/abs/2604.18423},
  keywords = {cs.CL},
  eprint = {2604.18423},
  archiveprefix = {arXiv},
}

Metadata

{}