Paper Detail

QCalEval: Benchmarking Vision-Language Models for Quantum Calibration Plot Understanding

Shuxiang Cao, Zijian Zhang, Abhishek Agarwal, Grace Bratrud, Niyaz R. Beysengulov, Daniel C. Cole, Alejandro Gómez Frieiro, Elena O. Glen, Hao Hsu, Gang Huang, Raymond Jow, Greshma Shaji, Tom Lubowe, Ligeng Zhu, Luis Mantilla Calderón, Nicola Pancotti, Joel Pendleton, Brandon Severin, Charles Etienne Staub, Sara Sussman, Antti Vepsäläinen, Neel Rajeshbhai Vora, Yilun Xu, Varinia Bernales, Daniel Bowring, Elica Kyoseva, Ivan Rungger, Giulia Semeghini, Sam Stanwyck, Timothy Costa, Alán Aspuru-Guzik, Krysta Svore

arXiv Score 7.8

Published 2026-04-28 · First seen 2026-04-29

Research Track A · General AI

Abstract

Quantum computing calibration depends on interpreting experimental data, and calibration plots provide the most universal human-readable representation for this task, yet no systematic evaluation exists of how well vision-language models (VLMs) interpret them. We introduce QCalEval, the first VLM benchmark for quantum calibration plots: 243 samples across 87 scenario types from 22 experiment families, spanning superconducting qubits and neutral atoms, evaluated on six question types in both zero-shot and in-context learning settings. The best general-purpose zero-shot model reaches a mean score of 72.3, and many open-weight models degrade under multi-image in-context learning, whereas frontier closed models improve substantially. A supervised fine-tuning ablation at the 9-billion-parameter scale shows that SFT improves zero-shot performance but cannot close the multimodal in-context learning gap. As a reference case study, we release NVIDIA Ising Calibration 1, an open-weight model based on Qwen3.5-35B-A3B that reaches 74.7 zero-shot average score.

BibTeX

@article{cao2026qcaleval,
  title = {{QCalEval}: Benchmarking Vision-Language Models for Quantum Calibration Plot Understanding},
  author = {Shuxiang Cao and Zijian Zhang and Abhishek Agarwal and Grace Bratrud and Niyaz R. Beysengulov and Daniel C. Cole and Alejandro Gómez Frieiro and Elena O. Glen and Hao Hsu and Gang Huang and Raymond Jow and Greshma Shaji and Tom Lubowe and Ligeng Zhu and Luis Mantilla Calderón and Nicola Pancotti and Joel Pendleton and Brandon Severin and Charles Etienne Staub and Sara Sussman and Antti Vepsäläinen and Neel Rajeshbhai Vora and Yilun Xu and Varinia Bernales and Daniel Bowring and Elica Kyoseva and Ivan Rungger and Giulia Semeghini and Sam Stanwyck and Timothy Costa and Alán Aspuru-Guzik and Krysta Svore},
  year = {2026},
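  abstract = {Quantum computing calibration depends on interpreting experimental data, and calibration plots provide the most universal human-readable representation for this task, yet no systematic evaluation exists of how well vision-language models (VLMs) interpret them. We introduce QCalEval, the first VLM benchmark for quantum calibration plots: 243 samples across 87 scenario types from 22 experiment families, spanning superconducting qubits and neutral atoms, evaluated on six question types in both zero-shot and in-context learning settings. The best general-purpose zero-shot model reaches a mean score of 72.3, and many open-weight models degrade under multi-image in-context learning, whereas frontier closed models improve substantially. A supervised fine-tuning ablation at the 9-billion-parameter scale shows that SFT improves zero-shot performance but cannot close the multimodal in-context learning gap. As a reference case study, we release NVIDIA Ising Calibration 1, an open-weight model based on Qwen3.5-35B-A3B that reaches 74.7 zero-shot average score.},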
  url = {https://arxiv.org/abs/2604.25884},
  keywords = {quant-ph, cs.CV},
  eprint = {2604.25884},
  archiveprefix = {arXiv},
}
