Paper Detail

FinMCP-Bench: Benchmarking LLM Agents for Real-World Financial Tool Use under the Model Context Protocol

Jie Zhu, Yimin Tian, Boyang Li, Kehao Wu, Zhongzhi Liang, Junhui Li, Xianyin Zhang, Lifan Guo, Feng Chen, Yong Liu, Chi Zhang

Hugging Face score: 23.8

Published 2026-03-26 · First seen 2026-03-27

General AI

Abstract

This paper introduces FinMCP-Bench, a novel benchmark for evaluating large language models (LLMs) on solving real-world financial problems through tool invocation over financial Model Context Protocols (MCPs). FinMCP-Bench contains 613 samples spanning 10 main scenarios and 33 sub-scenarios, featuring both real and synthetic user queries to ensure diversity and authenticity. It incorporates 65 real financial MCPs and three sample types (single-tool, multi-tool, and multi-turn), allowing evaluation of models across different levels of task complexity. Using this benchmark, we systematically assess a range of mainstream LLMs and propose metrics that explicitly measure tool invocation accuracy and reasoning capabilities. FinMCP-Bench provides a standardized, practical, and challenging testbed for advancing research on financial LLM agents.

Workflow Status

Review status
pending
Role
unreviewed
Read priority
now
Vote
Not set.
Saved
no
Collections
Not filed yet.
Next action
Not filled yet.

Reading Brief

Key Findings

This paper introduces FinMCP-Bench, a new benchmark for evaluating how large language models (LLMs) solve financial problems by using external tools. The benchmark features 613 diverse samples across 10 main scenarios and 33 sub-scenarios, incorporating 65 real financial Model Context Protocols (MCPs) and tasks of varying complexity (single-tool, multi-tool, multi-turn). The authors used it to systematically assess mainstream LLMs and proposed new metrics for tool invocation accuracy and reasoning.

Limitations

The abstract does not state any explicit limitations; it positions the benchmark as a foundational tool, implying that future work will focus on using it to advance research on financial LLM agents.

Methodology

The researchers constructed the benchmark by curating real and synthetic financial queries and pairing them with real-world financial Model Context Protocols (MCPs). This dataset was then used to test a range of LLMs on their ability to perform single-step, multi-step, and conversational financial tasks.
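To make the setup concrete, the sketch below shows what a single-tool benchmark sample and a simple invocation-accuracy check might look like. The field names, the tool name, and the metric are hypothetical illustrations, not the paper's actual schema or metrics.

```python
# Hypothetical sketch of a FinMCP-Bench-style single-tool sample.
# All field names and the tool name are invented for illustration.
single_tool_sample = {
    "scenario": "stock_quote",  # stand-in for one of the 10 main scenarios
    "query": "What is AAPL's latest closing price?",
    "expected_calls": [
        {"tool": "get_stock_price",          # hypothetical MCP tool
         "arguments": {"symbol": "AAPL"}},
    ],
}

def invocation_accuracy(predicted, expected):
    """Fraction of expected tool calls that appear verbatim in the model's
    predicted calls; a simplified stand-in for a tool-invocation metric."""
    if not expected:
        return 1.0
    matched = sum(1 for call in expected if call in predicted)
    return matched / len(expected)
```

Multi-tool and multi-turn samples would extend this shape with several expected calls or several query/response rounds, respectively.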

Significance

This work provides a standardized, practical, and challenging testbed for systematically evaluating and advancing the development of LLM agents for real-world financial applications.

Why It Surfaced

No ranking explanation is available yet.

Tags

No tags.

BibTeX

@misc{zhu2026finmcp,
  title = {FinMCP-Bench: Benchmarking LLM Agents for Real-World Financial Tool Use under the Model Context Protocol},
  author = {Jie Zhu and Yimin Tian and Boyang Li and Kehao Wu and Zhongzhi Liang and Junhui Li and Xianyin Zhang and Lifan Guo and Feng Chen and Yong Liu and Chi Zhang},
  year = {2026},
  abstract = {This paper introduces FinMCP-Bench, a novel benchmark for evaluating large language models (LLMs) on solving real-world financial problems through tool invocation over financial Model Context Protocols (MCPs). FinMCP-Bench contains 613 samples spanning 10 main scenarios and 33 sub-scenarios, featuring both real and synthetic user queries to ensure diversity and authenticity. It incorporates 65 real financial MCPs and three sample types (single-tool, multi-tool, and multi-turn), allowing evaluation of models across different levels of task complexity. Using this benchmark, we systematically assess a range of mainstream LLMs and propose metrics that explicitly measure tool invocation accuracy and reasoning capabilities. FinMCP-Bench provides a standardized, practical, and challenging testbed for advancing research on financial LLM agents.},
  url = {https://huggingface.co/papers/2603.24943},
  keywords = {large language models, tool invocation, financial model context protocols, benchmark, financial LLM agents, huggingface daily},
  eprint = {2603.24943},
  archiveprefix = {arXiv},
}

Metadata

{}