MSC-Bench: A Rigorous Benchmark for Multi-Server Tool Orchestration (Dong et al.)

November 8, 2025

- with tool calls, agents have gone from text generators to high-level workflow orchestrators

Current benchmarks for tool-augmented agents suffer from three fundamental gaps:

1. architectural mismatch

   1. current benchmarks treat tools as a flat, unstructured list

      1. in real distributed systems, tools are organized by server/context/namespace, each with its own boundaries and constraints, and some workflows span multiple servers

   2. as a result, agents in current benchmarks never have to reason about orchestration across servers/contexts

2. functional overlap

   1. multiple tools or tool sequences can achieve the same outcome

   2. current benchmarks therefore rely on LLM-as-a-judge evaluation, which is costly and inconsistent

3. fragmented and incomplete evaluation

   1. modern tool-calling systems comprise a retriever and an LLM reasoner, but benchmarks test these in isolation

      1. retriever: gathers and filters relevant tools

      2. LLM reasoner: decides how to use the retrieved tools
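
A minimal sketch of the first and third gaps, assuming a hypothetical in-memory registry: tools are grouped under servers rather than held in one flat list, and a toy keyword retriever feeds a stub reasoner so that both stages run end to end. All names here (`Tool`, `Server`, `retrieve`, `reason`, and the example servers) are illustrative assumptions and do not come from MSC-Bench.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Tool:
    name: str
    description: str


@dataclass
class Server:
    """A namespace that owns a set of tools (hypothetical structure)."""
    name: str
    tools: list[Tool] = field(default_factory=list)


def retrieve(servers: list[Server], query: str, k: int = 3) -> list[tuple[str, Tool]]:
    """Toy retriever: rank (server, tool) pairs by keyword overlap with the query."""
    words = set(query.lower().split())
    scored = []
    for server in servers:
        for tool in server.tools:
            overlap = len(words & set(tool.description.lower().split()))
            if overlap > 0:
                scored.append((overlap, server.name, tool))
    # Highest overlap first; names break ties so the order is deterministic.
    scored.sort(key=lambda item: (-item[0], item[1], item[2].name))
    return [(server_name, tool) for _, server_name, tool in scored[:k]]


def reason(candidates: list[tuple[str, Tool]]) -> Optional[str]:
    """Stub reasoner: pick the top candidate, or abstain if nothing was retrieved."""
    if not candidates:
        return None
    server_name, tool = candidates[0]
    return f"{server_name}.{tool.name}"


servers = [
    Server("calendar", [Tool("create_event", "create a calendar event at a given time")]),
    Server("mail", [Tool("send_message", "send an email message to a recipient")]),
]

choice = reason(retrieve(servers, "send an email to Alice"))
print(choice)
```

Returning a server-qualified name such as `mail.send_message` keeps the server boundary visible to whatever scores the output, which a flat tool list would hide.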

Evaluation Curriculum

1. Foundational Single-Tool Tasks

   1. tests baseline competence through direct tool invocation

2. Context-Aware Tool Retrieval

   1. tests disambiguation when multiple tools could fulfill the user's intent

3. Intra-Server Sequential Chaining

   1. tests whether the LLM sufficiently understands tool dependencies, data flow, and valid chains of tool calls within an individual server

4. Cross-Server Compositional Chaining

   1. tests the coherence of agent-generated cross-server task flows

5. Robustness via Capability Gap Identification

   1. tests whether agents recognize when a request exceeds their capabilities rather than attempting an impossible task
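
The five levels can be sketched as a small data model with an exact-match scorer. This is an illustrative assumption, not the benchmark's actual harness: the `Level` enum, the `Task` fields, the `score` function, and the example tool names are all hypothetical. Storing a set of acceptable tool sequences per task is one possible way to handle functional overlap without an LLM judge.

```python
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    SINGLE_TOOL = 1
    CONTEXT_AWARE_RETRIEVAL = 2
    INTRA_SERVER_CHAIN = 3
    CROSS_SERVER_CHAIN = 4
    CAPABILITY_GAP = 5


@dataclass(frozen=True)
class Task:
    level: Level
    query: str
    # Every tool sequence counted as correct; empty for capability-gap tasks.
    acceptable: frozenset[tuple[str, ...]]


def score(task: Task, predicted: tuple[str, ...]) -> bool:
    """Return True when the prediction is correct for this task."""
    if task.level == Level.CAPABILITY_GAP:
        # The only correct behavior is to abstain, i.e. call no tools.
        return predicted == ()
    return predicted in task.acceptable


cross_server = Task(
    level=Level.CROSS_SERVER_CHAIN,
    query="Find the latest invoice and email it to finance",
    acceptable=frozenset({("files.search", "mail.send_message")}),
)
gap = Task(
    level=Level.CAPABILITY_GAP,
    query="Launch a rocket",
    acceptable=frozenset(),
)

print(score(cross_server, ("files.search", "mail.send_message")))
print(score(gap, ("mail.send_message",)))
```

Treating abstention as an empty tuple lets the level-5 check use the same scoring path as the other levels instead of needing a separate refusal classifier.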