Paper Detail

Hierarchical Synthetic Tabular Data Generation: A Hybrid Top-Down and Bottom-Up Framework

Junfeng Nie, Alvin Jin, Xiaohui Chen

arxiv Score 9.8

Published 2026-05-27 · First seen 2026-05-31

General AI

Abstract

Existing approaches for synthetic tabular data generation are based on either purely generative models or LLMs, both of which struggle with data heterogeneity, logical consistency, rare-event coverage, and robustness in low-data regimes. In this paper, we propose a hierarchical hybrid top-down and bottom-up (H-TDBU) framework that decouples semantic structures from stochastic texture. In the top-down path, structure-driven logical constraints and cross-modal alignment rules are constructed, while in the bottom-up path, lightweight tabular generators are used to learn local statistical patterns from real data. The two paths are consolidated in a unified synthesis engine with an iterative feedback loop. We evaluate the framework on weak multimodal financial benchmarks combining tabular and sentiment-text data. Experimental results show that our H-TDBU approach improves train-synthetic-test-real performance over neural baseline methods while preserving semantic consistency. Our results suggest that hierarchical rule-guided synthesis provides an effective mechanism for combining controllability, semantic coherence, and statistical fidelity in synthetic data generation.
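
The two-path design described in the abstract can be illustrated with a toy sketch. This is not the authors' H-TDBU implementation: the abstract does not specify the generators, the rule language, or the feedback mechanism. Here per-column Gaussian marginals stand in for the lightweight bottom-up tabular generators, plain Python predicates stand in for the top-down logical constraints, and rejection with resampling stands in for the iterative feedback loop. The data, the rule thresholds, and every function name are hypothetical.

```python
import random
import statistics

# Hypothetical toy "real" data: (price, volume) rows.
REAL_ROWS = [(100.0, 10.0), (102.0, 12.0), (98.0, 9.0), (101.0, 11.0)]

# Top-down path (assumed form): logical constraints as predicates on a row.
RULES = [
    lambda row: 95.0 <= row[0] <= 105.0,  # price stays in a plausible band
    lambda row: row[1] >= 0.0,            # volume is never negative
]


def fit_marginals(rows):
    """Bottom-up path (stand-in): learn per-column mean and std from real rows."""
    columns = list(zip(*rows))
    return [(statistics.mean(col), statistics.stdev(col)) for col in columns]


def sample_row(marginals, rng):
    """Draw one candidate row from the independent Gaussian marginals."""
    return tuple(rng.gauss(mu, sigma) for mu, sigma in marginals)


def synthesize(real_rows, rules, n, seed=0, max_tries=10_000):
    """Combine both paths: sample candidates, keep only rule-consistent rows.

    Rejected candidates are resampled, a crude stand-in for a feedback loop.
    Returns fewer than n rows if max_tries is exhausted.
    """
    rng = random.Random(seed)
    marginals = fit_marginals(real_rows)
    accepted = []
    tries = 0
    while len(accepted) < n and tries < max_tries:
        tries += 1
        candidate = sample_row(marginals, rng)
        if all(rule(candidate) for rule in rules):
            accepted.append(candidate)
    return accepted


synthetic_rows = synthesize(REAL_ROWS, RULES, n=5)
```

Keeping the rules separate from the sampler is the point of the sketch: the constraints can be edited without refitting the statistical model, which is one way to read the abstract's claim about decoupling semantic structure from stochastic texture.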

Workflow Status

Review status: pending
Role: unreviewed
Read priority: soon
Vote: Not set.
Saved: no
Collections: Not filed yet.
Next action: Not filled yet.

BibTeX

@article{nie2026hierarchical,
  title = {Hierarchical Synthetic Tabular Data Generation: A Hybrid Top-Down and Bottom-Up Framework},
  author = {Junfeng Nie and Alvin Jin and Xiaohui Chen},
  year = {2026},
  abstract = {Existing approaches for synthetic tabular data generation are based on either purely generative models or LLMs, both of which struggle with data heterogeneity, logical consistency, rare-event coverage, and robustness in low-data regimes. In this paper, we propose a hierarchical hybrid top-down and bottom-up (H-TDBU) framework that decouples semantic structures from stochastic texture. In the top-down path, structure-driven logical constraints and cross-modal alignment rules are constructed, whil},
  url = {https://arxiv.org/abs/2605.28198},
  keywords = {cs.LG},
  eprint = {2605.28198},
  archiveprefix = {arXiv},
}
