Paper Detail
Jie Zhu, Yimin Tian, Boyang Li, Kehao Wu, Zhongzhi Liang, Junhui Li, Xianyin Zhang, Lifan Guo, Feng Chen, Yong Liu, Chi Zhang
This paper introduces FinMCP-Bench, a novel benchmark for evaluating large language models (LLMs) on solving real-world financial problems through tool invocation over financial Model Context Protocols (MCPs). FinMCP-Bench contains 613 samples spanning 10 main scenarios and 33 sub-scenarios, featuring both real and synthetic user queries to ensure diversity and authenticity. It incorporates 65 real financial MCPs and three sample types (single-tool, multi-tool, and multi-turn), allowing evaluation of models across different levels of task complexity. Using this benchmark, we systematically assess a range of mainstream LLMs and propose metrics that explicitly measure tool invocation accuracy and reasoning capabilities. FinMCP-Bench provides a standardized, practical, and challenging testbed for advancing research on financial LLM agents.
This paper introduces FinMCP-Bench, a new benchmark for evaluating how Large Language Models (LLMs) solve financial problems by using external tools. The benchmark features 613 diverse samples across 10 scenarios, incorporating 65 real financial protocols and tasks of varying complexity (single-tool, multi-tool, multi-turn). The authors used it to systematically assess mainstream LLMs and proposed new metrics for tool invocation accuracy and reasoning.
The abstract positions the benchmark as a foundational tool, implying that future work will focus on using it to advance research on financial LLM agents.
The researchers constructed the benchmark by curating real and synthetic financial queries and pairing them with real-world financial Model Context Protocols (MCPs). This dataset was then used to test a range of LLMs on their ability to perform single-tool, multi-tool, and multi-turn conversational financial tasks.
This work provides a standardized, practical, and challenging testbed for systematically evaluating and advancing the development of LLM agents for real-world financial applications.
@misc{zhu2026finmcp,
title = {FinMCP-Bench: Benchmarking LLM Agents for Real-World Financial Tool Use under the Model Context Protocol},
author = {Jie Zhu and Yimin Tian and Boyang Li and Kehao Wu and Zhongzhi Liang and Junhui Li and Xianyin Zhang and Lifan Guo and Feng Chen and Yong Liu and Chi Zhang},
year = {2026},
abstract = {This paper introduces FinMCP-Bench, a novel benchmark for evaluating large language models (LLMs) in solving real-world financial problems through tool invocation of financial model context protocols. FinMCP-Bench contains 613 samples spanning 10 main scenarios and 33 sub-scenarios, featuring both real and synthetic user queries to ensure diversity and authenticity. It incorporates 65 real financial MCPs and three types of samples, single tool, multi-tool, and multi-turn, allowing evaluation of models across different levels of task complexity. Using this benchmark, we systematically assess a range of mainstream LLMs and propose metrics that explicitly measure tool invocation accuracy and reasoning capabilities. FinMCP-Bench provides a standardized, practical, and challenging testbed for advancing research on financial LLM agents.},
url = {https://huggingface.co/papers/2603.24943},
keywords = {large language models, tool invocation, financial model context protocols, benchmark, financial LLM agents, huggingface daily},
eprint = {2603.24943},
archiveprefix = {arXiv},
}