Paper Detail

ToolSense: A Diagnostic Framework for Auditing Parametric Tool Knowledge in LLMs

Ashutosh Hathidara, Sai Shruthi Sistla, Sebastian Schreiber, Sahil Bansal

huggingface Score 17.5

Published 2026-06-04 · First seen 2026-06-13

General AI

Abstract

Large language models deployed as agents over large tool catalogs face a critical tool-retrieval bottleneck. As embedding-based retrieval approaches rely on compact encoders that may under-capture specialized tool semantics, parametric tool retrieval addresses this by encoding each tool as a virtual token appended to the LLM vocabulary, fine-tuned in two stages (memorization then retrieval SFT) to use the LLM as a retriever, achieving strong performance on standard ToolBench retrieval benchmarks. Yet these benchmarks use verbose, fully-specified queries, and their evaluation applies constrained decoding that restricts outputs to valid token paths, neither reveals whether the model actually understands its tools. We introduce ToolSense, an open-source LLM-powered diagnostic framework that takes any tool catalog as input and automatically generates three benchmarks: a Realistic Retrieval Benchmark (RRB) with queries at three ambiguity tiers, an MCQ probing benchmark, and a QA probing benchmark. Applying ToolSense to ToolBench (~47k tools) and evaluating five parametric model training configurations reveals a knowledge-retrieval dissociation: on RRB queries, several configurations collapse by ~50-64 percentage points compared to fully-specified ToolBench benchmarks, falling below the embedding-model baseline. Additionally, despite strong retrieval performance, some models score near-random on factual probes, suggesting a knowledge-retrieval dissociation. We open-source the ToolSense framework and the ToolBench diagnostic benchmarks at https://github.com/SAP/toolsense.

Workflow Status

Review status
pending
Role
unreviewed
Read priority
now
Vote
Not set.
Saved
no
Collections
Not filed yet.
Next action
Not filled yet.

Reading Brief

No structured notes yet. Add `summary_sections`, `why_relevant`, `claim_impact`, or `next_action` in `papers.jsonl` to enrich this view.

Why It Surfaced

No ranking explanation is available yet.

Tags

No tags.

BibTeX

@misc{hathidara2026toolsense,
  title = {ToolSense: A Diagnostic Framework for Auditing Parametric Tool Knowledge in LLMs},
  author = {Ashutosh Hathidara and Sai Shruthi Sistla and Sebastian Schreiber and Sahil Bansal},
  year = {2026},
  abstract = {Large language models deployed as agents over large tool catalogs face a critical tool-retrieval bottleneck. As embedding-based retrieval approaches rely on compact encoders that may under-capture specialized tool semantics, parametric tool retrieval addresses this by encoding each tool as a virtual token appended to the LLM vocabulary, fine-tuned in two stages (memorization then retrieval SFT) to use the LLM as a retriever, achieving strong performance on standard ToolBench retrieval benchmarks},
  url = {https://huggingface.co/papers/2606.12451},
  keywords = {large language models, tool retrieval, embedding-based retrieval, parametric tool retrieval, ToolBench, constrained decoding, diagnostic framework, retrieval benchmarks, knowledge-retrieval dissociation, code available, huggingface daily},
  eprint = {2606.12451},
  archiveprefix = {arXiv},
}

Metadata

{}