Paper Detail

OceanPile: A Large-Scale Multimodal Ocean Corpus for Foundation Models

Yida Xue, Ningyu Zhang, Tingwei Wu, Zhe Ma, Daxiong Ji, Zhao Wang, Guozhou Zheng, Huajun Chen

huggingface Score 15.0

Published 2026-04-25 · First seen 2026-05-05

General AI

Abstract

The vast and underexplored ocean plays a critical role in regulating global climate and supporting marine biodiversity, yet artificial intelligence has so far delivered limited impact in this domain due to a fundamental data bottleneck. Specifically, ocean data are highly fragmented across disparate sources and inherently exhibit multi-modal, high-noise, and weakly labeled characteristics, lacking unified schemas and semantic alignment. Although Multimodal Large Language Models (MLLMs) have achieved remarkable success in general domains, their application to ocean science remains severely constrained by the absence of large-scale, well-aligned multimodal datasets tailored to marine environments. To bridge this gap, we introduce OceanPile, a large-scale multimodal corpus designed for ocean foundation models. It comprises three key components: OceanCorpus, a unified collection integrating sonar data, underwater imagery, marine science visuals, and scientific text from diverse authoritative sources; OceanInstruction, a high-quality instruction dataset synthesized via a novel pipeline guided by a hierarchical Ocean Concept Knowledge Graph; and OceanBenchmark, a manually curated evaluation benchmark for rigorous assessment. We establish a multi-stage quality control process to ensure scientific validity and alignment across modalities. Experimental validation demonstrates significant performance improvements for models trained on our data. All datasets are publicly released to advance the field of marine artificial intelligence and empower domain-specific MLLMs.

Workflow Status

Review status
pending
Role
unreviewed
Read priority
now
Vote
Not set.
Saved
no
Collections
Not filed yet.
Next action
Not filled yet.

Reading Brief

No structured notes yet. Add `summary_sections`, `why_relevant`, `claim_impact`, or `next_action` in `papers.jsonl` to enrich this view.

Why It Surfaced

No ranking explanation is available yet.

Tags

No tags.

BibTeX

@misc{xue2026oceanpile,
  title = {OceanPile: A Large-Scale Multimodal Ocean Corpus for Foundation Models},
  author = {Yida Xue and Ningyu Zhang and Tingwei Wu and Zhe Ma and Daxiong Ji and Zhao Wang and Guozhou Zheng and Huajun Chen},
  year = {2026},
  abstract = {The vast and underexplored ocean plays a critical role in regulating global climate and supporting marine biodiversity, yet artificial intelligence has so far delivered limited impact in this domain due to a fundamental data bottleneck. Specifically, ocean data are highly fragmented across disparate sources and inherently exhibit multi-modal, high-noise, and weakly labeled characteristics, lacking unified schemas and semantic alignment. Although Multimodal Large Language Models (MLLMs) have achi},
  url = {https://huggingface.co/papers/2605.00877},
  keywords = {Multimodal Large Language Models, OceanCorpus, OceanInstruction, OceanBenchmark, Ocean Concept Knowledge Graph, multimodal datasets, scientific text, underwater imagery, sonar data, marine science visuals, instruction dataset, knowledge graph-guided pipeline, multi-stage quality control, foundation models, code available, huggingface daily},
  eprint = {2605.00877},
  archiveprefix = {arXiv},
}

Metadata

{}