MSC-Bench: A Rigorous Benchmark for Multi-Server Tool Orchestration (Dong et al.)

November 8, 2025

- with tool calls, agents have gone from text generators to high-level workflow orchestrators

Current benchmarks for tool-augmented agents suffer from three fundamental gaps:

1. architectural mismatch
   1. current benchmarks treat tools as a flat, unstructured list
      1. in real distributed systems, tools are usually organized by server/context/namespace, each with its own boundaries/constraints, and some workflows span servers
   2. consequence: in current benchmarks, agents never have to reason about orchestration across servers/contexts
2. functional overlap
   1. multiple tools/tool sequences can achieve the same outcome
   2. current benchmarks rely on LLM-as-a-judge for evaluation, which is costly/inconsistent
3. fragmented and incomplete
   1. modern tool-calling systems comprise a retriever and an LLM reasoner, but benchmarks test these in isolation
      1. retriever - gathers + filters relevant tools
      2. LLM reasoner - decides how to use the retrieved tools
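
To make the first and third gaps concrete, here is a minimal sketch of a server-namespaced tool registry feeding a two-stage retriever + reasoner pipeline. Everything in it is an assumption for illustration, not taken from the paper: the `Tool` fields, the `calendar`/`mail` servers, the word-overlap retriever, and the `reason` function, which only stands in for an LLM call.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass(frozen=True)
class Tool:
    server: str       # namespace that owns the tool (hypothetical field)
    name: str
    description: str

    @property
    def qualified_name(self) -> str:
        return f"{self.server}.{self.name}"


# Hypothetical two-server registry. A flat benchmark would drop `server`.
REGISTRY = [
    Tool("calendar", "create_event", "create a calendar event at a given time"),
    Tool("calendar", "list_events", "list calendar events for a day"),
    Tool("mail", "send_message", "send an email message to a recipient"),
    Tool("mail", "search_messages", "search email messages by keyword"),
]


def retrieve(query: str, tools: List[Tool], k: int = 2) -> List[Tool]:
    """Toy retriever: rank tools by word overlap between query and description."""
    words = set(query.lower().split())
    scored = [(len(words & set(t.description.lower().split())), t) for t in tools]
    scored = [(s, t) for s, t in scored if s > 0]       # filter irrelevant tools
    scored.sort(key=lambda pair: pair[0], reverse=True)  # best match first
    return [t for _, t in scored[:k]]


def reason(query: str, candidates: List[Tool]) -> Optional[str]:
    """Stand-in for the LLM reasoner: take the top candidate, or None if empty."""
    return candidates[0].qualified_name if candidates else None


def servers_touched(plan: List[str]) -> Set[str]:
    """Servers a plan of qualified tool names crosses."""
    return {step.split(".", 1)[0] for step in plan}


if __name__ == "__main__":
    query = "send an email to Bob"
    print(reason(query, retrieve(query, REGISTRY)))
    print(servers_touched(["calendar.list_events", "mail.send_message"]))
```

An end-to-end evaluation would score the output of `reason(query, retrieve(query, REGISTRY))` as a whole, so a retriever miss and a reasoner mistake both show up in the same number instead of being measured separately.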

Evaluation Curriculum

1. Foundational Single-Tool Tasks
   1. establishes baseline competence through direct tool invocation
2. Context-Aware Tool Retrieval
   1. tests disambiguation when multiple tools could fulfill the user's intent
3. Intra-Server Sequential Chaining
   1. tests whether the LLM understands tool dependencies, data flow, and valid chains of tool calls within a single server
4. Cross-Server Compositional Chaining
   1. tests the coherence of agent-generated cross-server task flows
5. Robustness via Capability Gap Identification
   1. tests whether agents recognize when a request exceeds their capabilities rather than attempting an impossible task
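
One way to picture how the five levels differ is a per-level pass/fail check. The sketch below is hypothetical and is not the paper's scoring method: the `Level` names, the `passes` rule, and the idea of listing several acceptable chains per task (to account for functional overlap without an LLM judge) are all assumptions made for illustration. Tool names are written as `server.tool`.

```python
from enum import IntEnum
from typing import List, Optional, Set


class Level(IntEnum):
    SINGLE_TOOL = 1
    CONTEXT_AWARE_RETRIEVAL = 2
    INTRA_SERVER_CHAIN = 3
    CROSS_SERVER_CHAIN = 4
    CAPABILITY_GAP = 5


def servers(chain: List[str]) -> Set[str]:
    """Servers referenced by a chain of qualified tool names."""
    return {step.split(".", 1)[0] for step in chain}


def passes(level: Level,
           predicted: Optional[List[str]],
           acceptable: List[List[str]]) -> bool:
    """Hypothetical rule; `predicted=None` means the agent declined the request."""
    if level is Level.CAPABILITY_GAP:
        return predicted is None             # the only correct answer is to decline
    if predicted is None:
        return False                         # declined a solvable task
    if level is Level.INTRA_SERVER_CHAIN and len(servers(predicted)) != 1:
        return False                         # chain must stay within one server
    if level is Level.CROSS_SERVER_CHAIN and len(servers(predicted)) < 2:
        return False                         # chain must span servers
    return predicted in acceptable           # any listed equivalent chain counts


if __name__ == "__main__":
    chain = ["calendar.list_events", "mail.send_message"]
    print(passes(Level.CROSS_SERVER_CHAIN, chain, [chain]))
    print(passes(Level.CAPABILITY_GAP, chain, []))
```

Listing equivalent chains by hand only works for small task sets, which is the trade-off against LLM-as-a-judge: cheaper and deterministic, but only as complete as the list of accepted answers.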