MathDuels: Evaluating LLMs as Problem Posers and Solvers

Zhiqiu Xu, Shibo Jin, Shreya Arya, Mayur Naik

arxiv Score 6.3

Published 2026-04-23 · First seen 2026-04-24

General AI

Abstract

As frontier language models attain near-ceiling performance on static mathematical benchmarks, existing evaluations are increasingly unable to differentiate model capabilities, largely because they cast models solely as solvers of fixed problem sets. We introduce MathDuels, a self-play benchmark in which models occupy dual roles: each authors math problems under adversarial prompting and solves problems authored by every other participant. Problems are produced through a three-stage generation pipeline (meta-prompting, problem generation, and difficulty amplification), and validated by an independent verifier that excludes ill-posed questions. A Rasch model (Rasch, 1993) jointly estimates solver abilities and problem difficulties; author quality is derived from the difficulties of each model's authored problems. Experiments across 19 frontier models show that authoring and solving capabilities are partially decoupled, and that dual-role evaluation exposes capability separations invisible in single-role benchmarks. As newer models enter the arena, they produce problems that defeat previously dominant solvers, so the benchmark's difficulty co-evolves with participant strength rather than saturating at a fixed ceiling. We host a public leaderboard that updates as new models are released.
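
To make the scoring step concrete, here is a minimal sketch of a Rasch fit on toy data. It is not the authors' implementation. The standard Rasch model sets the probability that solver i answers problem j correctly to sigmoid(theta_i - b_j), where theta_i is solver ability and b_j is problem difficulty. Everything else is an assumption made for illustration: the function names `fit_rasch` and `author_quality`, the plain gradient-ascent optimizer, the small L2 penalty (added so that perfect scores do not push estimates to infinity), the centering of difficulties at zero, and the choice of mean difficulty as the author-quality score. The abstract says only that author quality is "derived from" the difficulties.

```python
import math

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def fit_rasch(responses, n_solvers, n_problems, lr=0.1, steps=2000, l2=0.05):
    """Fit abilities (theta) and difficulties (b) by penalized maximum likelihood.

    responses: list of (solver_idx, problem_idx, correct), correct in {0, 1}.
    lr, steps, l2 are illustrative hyperparameters, not values from the paper.
    """
    theta = [0.0] * n_solvers
    b = [0.0] * n_problems
    for _ in range(steps):
        # Gradient of the L2 penalty term.
        g_theta = [-l2 * t for t in theta]
        g_b = [-l2 * d for d in b]
        for s, p, y in responses:
            # d/d(theta_s) of the log-likelihood is (y - P(correct));
            # d/d(b_p) is the negative of that.
            resid = y - sigmoid(theta[s] - b[p])
            g_theta[s] += resid
            g_b[p] -= resid
        theta = [t + lr * g for t, g in zip(theta, g_theta)]
        b = [d + lr * g for d, g in zip(b, g_b)]
    # Shift both scales by the same constant so difficulties average to zero;
    # theta_i - b_j, and hence every predicted probability, is unchanged.
    mean_b = sum(b) / len(b)
    return [t - mean_b for t in theta], [d - mean_b for d in b]

def author_quality(b, problem_author):
    """Assumed aggregation: mean difficulty of each author's problems."""
    totals, counts = {}, {}
    for p, a in enumerate(problem_author):
        totals[a] = totals.get(a, 0.0) + b[p]
        counts[a] = counts.get(a, 0) + 1
    return {a: totals[a] / counts[a] for a in totals}

# Toy arena: 3 models, 4 problems; each model attempts only problems
# authored by the other models. All outcomes are made up.
problem_author = [0, 0, 1, 2]
responses = [
    (1, 0, 1), (2, 0, 1),   # problem 0 (by model 0)
    (1, 1, 1), (2, 1, 0),   # problem 1 (by model 0)
    (0, 2, 1), (2, 2, 0),   # problem 2 (by model 1)
    (0, 3, 1), (1, 3, 1),   # problem 3 (by model 2)
]
theta, b = fit_rasch(responses, n_solvers=3, n_problems=4)
quality = author_quality(b, problem_author)
print("abilities:", [round(t, 2) for t in theta])
print("difficulties:", [round(d, 2) for d in b])
print("author quality:", {a: round(q, 2) for a, q in quality.items()})
```

In this toy arena, model 2 solves the fewest problems and authors the one problem that every opponent solves, so it receives the lowest ability estimate and a lower author score than model 1. Because solver and author scores are computed separately, a model can rank differently in the two roles.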

BibTeX

@article{xu2026mathduels,
  title = {MathDuels: Evaluating LLMs as Problem Posers and Solvers},
  author = {Zhiqiu Xu and Shibo Jin and Shreya Arya and Mayur Naik},
  year = {2026},
  abstract = {As frontier language models attain near-ceiling performance on static mathematical benchmarks, existing evaluations are increasingly unable to differentiate model capabilities, largely because they cast models solely as solvers of fixed problem sets. We introduce MathDuels, a self-play benchmark in which models occupy dual roles: each authors math problems under adversarial prompting and solves problems authored by every other participant. Problems are produced through a three-stage generation p},
  url = {https://arxiv.org/abs/2604.21916},
  keywords = {cs.CL, cs.SE, Computer science, Benchmark (surveying), Pipeline (software), Solver, Problem solver},
  eprint = {2604.21916},
  archiveprefix = {arXiv},
}