LLMs之ToolUse:《ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration》翻译与解读
导读:ToolOrchestra 提出并验证了“用小型、训练良好的 Orchestrator 去编排多样化工具(包括更强的模型)”这一范式——通过联合优化正确性、成本与用户偏好,作者展示了在困难推理任务上既能超越或匹配更大模型的表现,又能大幅降低实际运行成本与延迟,从而为可扩展、可控且经济的工具增强智能体部署提供了一条切实可行的路线。
>> 背景痛点:
● 智能体复杂任务的计算代价:大语言模型(LLM)在解决复杂、多步骤推理任务(如 Humanity’s Last Exam,HLE)时准确率受限且计算成本高昂。
● 单一大模型局限:以单个强模型配合若干工具的常规方法不能充分利用“多样化工具 + 不同能力模型”组合的潜力,并且会出现“自我偏好/自我增强”与“默认调用最强模型”的问题,导致过度使用高成本模型。
● 可控性与效率权衡缺失:现有工具使用代理通常只关注正确率,缺乏在“结果正确性、计算/延迟成本、用户工具偏好”之间的联合优化机制。
>> 具体的解决方案:
● 方案总览:ToolOrchestra — 训练一个小型“Orchestrator”语言模型(论文中为 8B 参数),让其在多回合推理中以强化学习策略决定何时、以何种顺序调用各种工具(包括查询、解释器、专用 LLM 以及更大的通用模型)。目标是以更低成本和更好的用户偏好对齐,取得与大型模型相当或更优的性能。
● 统一工具接口:将所有工具(API、专用 LLM、通用 LLM、函数等)通过统一 JSON 描述暴露给 Orchestrator,包含名称、说明和参数 schema,以便 Orchestrator 可以标准化地选择与调用工具。
● 三重奖励设计(训练目标):采用强化学习时,设计 outcome(答案正确性)、efficiency(按货币化计算的算力/延迟惩罚)与 preference(对用户偏好工具的对齐)三类奖励,联合优化决策策略。
● 数据合成与 ToolScale:为 RL 提供可验证的多回合工具使用训练样本,作者建立了自动化的数据合成管线并发布 ToolScale 数据集,用以覆盖 10 个领域的复杂环境与任务。
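上述统一工具接口可以用一个极简的 Python 草图来示意(字段名与工具描述均为示意性假设,并非论文公开的官方 schema):

```python
import json

# 统一 JSON 工具描述的示意:每个工具暴露名称、说明与参数 schema,
# 无论它是普通函数、搜索 API 还是一个更强的 LLM。字段名为假设。
web_search = {
    "name": "web_search",
    "description": "通过搜索引擎检索网页,适合事实性查询。",
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string", "description": "搜索关键词"}},
        "required": ["query"],
    },
}

gpt5_tool = {
    "name": "gpt-5",
    "description": "通用大模型,擅长困难推理;调用成本高,仅在必要时使用。",
    "parameters": {
        "type": "object",
        "properties": {"prompt": {"type": "string", "description": "委托给大模型的子问题"}},
        "required": ["prompt"],
    },
}

# Orchestrator 在每一步看到的工具清单,即这些描述的序列化结果
tool_manifest = json.dumps([web_search, gpt5_tool], ensure_ascii=False, indent=2)
print(tool_manifest)
```

这样,“更强的模型”与普通函数在接口层面完全同构,Orchestrator 只需按同一种格式做选择与传参。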
>> 核心思路步骤:
● 问题建模:把多回合工具使用任务建模为马尔可夫决策过程(MDP),在每一步 Orchestrator 根据历史上下文决策(reasoning → action(调用某工具)→ observation(工具返回)),直到终止或达到回合上限。
● 统一描述工具并采集工具能力:为每个候选工具(包括其它 LLM)生成简洁描述(通过采样任务、获得工具执行轨迹,然后让 LLM 汇总生成描述),使 Orchestrator 在调用前能“理解”工具擅长的任务类型。
● 奖励构成与计算:
● ● 成果奖励:以二元或评分方式判定最终回答是否正确(论文中用 GPT-5 作为评判者来处理多样化的输出形式);
● ● 成本/延时惩罚:将输入/输出 token 与第三方 API 定价换算为货币成本,并加入延时惩罚以鼓励低成本快速解法;
● ● 用户偏好奖励:统计某次轨迹中各工具被调用次数,按用户偏好向量进行归一化奖励/惩罚,鼓励遵守用户指定的工具偏好。
● 训练流程与技巧:先用合成数据做行为克隆或监督预训练,再用策略梯度类的 RL(结合上述奖励设计)进行端到端微调;并使用多种技巧稳定训练(论文附录列出细节)。
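上面的三重奖励可以用如下草图示意(权重数值与具体组合公式均为本文的示意性假设,论文中的精确定义以原文为准):

```python
# 三重奖励的极简示意:总奖励 = 成果奖励 - 成本/延时惩罚 + 偏好对齐奖励。
# w_cost / w_latency / w_pref 为假设的权重,实际需按业务侧重调节。
def total_reward(correct: bool,
                 cost_usd: float,
                 latency_s: float,
                 tool_counts: dict,
                 preferred_tools: set,
                 w_cost: float = 0.5,
                 w_latency: float = 0.01,
                 w_pref: float = 0.2) -> float:
    outcome = 1.0 if correct else 0.0                          # 成果奖励(二元)
    efficiency = -(w_cost * cost_usd + w_latency * latency_s)  # 货币成本 + 延时惩罚
    total_calls = sum(tool_counts.values()) or 1
    pref_ratio = sum(n for t, n in tool_counts.items() if t in preferred_tools) / total_calls
    preference = w_pref * pref_ratio                           # 偏好工具的调用占比
    return outcome + efficiency + preference

# 一条轨迹:答案正确,花费 0.02 美元、延时 8 秒,3 次调用中 2 次命中用户偏好工具
r = total_reward(True, 0.02, 8.0, {"web_search": 2, "gpt-5": 1}, {"web_search"})
```

可以看到,正确但昂贵的轨迹会被成本项压低奖励,从而促使策略只在必要时调用昂贵模型。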
>> 优势(相对于现有方法的改进点):
● 成本效率显著:Orchestrator(8B)在 HLE 上取得 37.1% 的得分,超过 GPT-5 的 35.1%,同时效率约为 GPT-5 的 2.5 倍;在其他基准上亦以更低成本获得更好或相近的性能。
● 更细粒度的工具选择与组合能力:通过统一接口与 RL 策略,Orchestrator 能在多回合中选择更合适的廉价工具或专用模型,仅在必要时调用昂贵的大模型,从而取得性能—成本的最佳折中。
● 用户可控性:引入用户偏好作为奖励分量,使得系统在遵循用户想用(或不想用)特定工具时有明确优化目标,提升可控性与信任度。
● 泛化性强:在 HLE、τ2-Bench(函数调用基准)、FRAMES(事实性推理)等不同任务上都有良好表现,且在面对未见工具或任务时仍表现鲁棒,说明学到的是策略性调度能力而非对单一任务的记忆。
>> 后续落地与结论观点(经验与建议):
● 小型 Orchestrator + 多样工具的体系在工业落地上更可行 — 可以显著降低总体 API 成本与延迟,同时保有或提升最终任务效果;运营团队应优先考虑把“昂贵模型”作为按需资源,而非默认常开调用。
● 奖励设计很关键 — 成果/成本/偏好三者的权重需要根据业务侧重(例如以用户体验或以成本节约为主)调节,文中展示了如何把 token 使用量映射成货币成本来实现统一衡量。
● 工具描述与能力估计要到位 — 统一 JSON 描述、并对模型工具的能力做示例驱动的“描述化”,有助于 Orchestrator 做出更合理的选择;在工程上应维护工具能力元数据并定期校准。
● 风险与注意:需要谨慎设计评判器(论文用 GPT-5 做 judge)以避免引入评判偏差;同时数据合成要尽量覆盖边界情形,避免 Orchestrator 学到对某类任务的错误捷径。
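把 token 用量统一映射为货币成本的做法可以这样示意(价格表为假设数值,并非任何真实 API 的报价):

```python
# 将一条轨迹中各模型的输入/输出 token 折算为美元总成本。
# PRICE_PER_1M 为假设的价目表:美元 / 百万 token,(输入价, 输出价)。
PRICE_PER_1M = {
    "gpt-5":     (1.25, 10.00),
    "small-llm": (0.05, 0.20),
}

def trajectory_cost(usage: list) -> float:
    """usage: [(model, in_tokens, out_tokens), ...] → 总美元成本"""
    total = 0.0
    for model, tin, tout in usage:
        pin, pout = PRICE_PER_1M[model]
        total += tin / 1e6 * pin + tout / 1e6 * pout
    return total

# 小模型承担大部分 token,大模型只处理少量关键子问题
cost = trajectory_cost([("small-llm", 20_000, 4_000), ("gpt-5", 3_000, 1_500)])
```

有了统一的货币度量,正确率、成本与延时才能在同一个目标函数里折算、比较和调权。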
《ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration》翻译与解读
地址 | https://arxiv.org/abs/2511.21689 |
时间 | 2025年11月26日 |
作者 | NVIDIA,香港大学 |
Abstract
Large language models are powerful generalists, yet solving deep and complex problems such as those of the Humanity's Last Exam (HLE) remains both conceptually challenging and computationally expensive. We show that small orchestrators managing other models and a variety of tools can both push the upper bound of intelligence and improve efficiency in solving difficult agentic tasks. We introduce ToolOrchestra, a method for training small orchestrators that coordinate intelligent tools. ToolOrchestra explicitly uses reinforcement learning with outcome-, efficiency-, and user-preference-aware rewards. Using ToolOrchestra, we produce Orchestrator, an 8B model that achieves higher accuracy at lower cost than previous tool-use agents while aligning with user preferences on which tools are to be used for a given query. On HLE, Orchestrator achieves a score of 37.1%, outperforming GPT-5 (35.1%) while being 2.5x more efficient. On tau2-Bench and FRAMES, Orchestrator surpasses GPT-5 by a wide margin while using only about 30% of the cost. Extensive analysis shows that Orchestrator achieves the best trade-off between performance and cost under multiple metrics, and generalizes robustly to unseen tools. These results demonstrate that composing diverse tools with a lightweight orchestration model is both more efficient and more effective than existing methods, paving the way for practical and scalable tool-augmented reasoning systems. 
| 大型语言模型是强大的通用型工具,但解决诸如“人类最后考试”(HLE)这类深度复杂的问题,在概念上具有挑战性,且计算成本高昂。我们表明,由小型协调器管理其他模型和各种工具,既能提升智能上限,又能提高解决困难代理任务的效率。我们推出了ToolOrchestra,这是一种用于训练小型协调器的方法,这些协调器能够协调智能工具。ToolOrchestra 明确使用了强化学习,其奖励机制考虑了结果、效率和用户偏好。借助 ToolOrchestra,我们开发出了 Orchestrator,这是一个 80 亿参数的模型,在成本更低的情况下比之前的工具使用代理实现了更高的准确率,同时还能根据用户偏好选择用于特定查询的工具。在 HLE 中,Orchestrator 的得分达到了 37.1%,超过了 GPT-5(35.1%),且效率高出 2.5 倍。在 tau2-Bench 和 FRAMES 上,Orchestrator 的表现大幅超越 GPT-5,而成本仅为其约 30%。大量分析表明,Orchestrator 在多个指标下实现了性能与成本的最佳平衡,并且能稳健地推广到未见过的工具上。这些结果表明,使用轻量级的编排模型组合多样化的工具,比现有方法更高效、更有效,为实用且可扩展的工具增强推理系统铺平了道路。 |
Figure 1:ToolOrchestra shows consistently strong performance on HLE, FRAMES, and τ2-Bench with superior cost efficiency.图 1:ToolOrchestra 在 HLE、FRAMES 和 τ2-Bench 上始终表现出色,且成本效益更优。
1、Introduction
Large language models (LLMs) have been reported to have made remarkable strides towards superhuman intelligence but remain of limited utility in complex agentic tasks such as those posed by the Humanity’s Last Exam (HLE) [1]. Tool use is a promising avenue for the extension of their capabilities beyond what can be learned from the training data. By calling on external resources through search engines and code interpreters, tool use has been shown to enhance accuracy and reduce hallucinations [2, 3, 4, 5, 6, 7, 8, 9, 10]. Prior research on tool-use agents has primarily focused on equipping a single powerful model with utility tools such as web search or calculators. While effective in many scenarios, this approach underutilizes the potential of tools: humans, when reasoning, routinely extend themselves by calling upon resources of greater-than-human intelligence, from domain experts to sophisticated processes and software systems. Motivated by this observation, we propose the orchestration paradigm. Under this paradigm, intelligence emerges not from a monolith but from a composite system. At the center of the system lies an orchestrator model, whose responsibility is to invoke the right tools for the given task, and to do so in the right order to accomplish the task. The crucial difference to the standard monolithic setup featuring a single powerful model is that in addition to deterministic utilities such as web search functions and code interpreters, models of various capabilities are made available to the orchestrator as intelligent tools. The use of tools of different levels of intelligence comes at varying costs, and the challenge for the orchestrator is then to dynamically decide on which tools to invoke in order to solve the task while respecting user preferences for various tools and minimizing the cost. 
By delegating narrowed-down sub-problems of a larger effort requiring intelligence to intelligent tools instead of handling the entire effort by a single generalist, orchestration teems with the promise of exhibiting higher intelligence than any of the system’s tools and leading monolithic solutions alike. | 大型语言模型(LLMs)已被报道在迈向超人类智能方面取得了显著进展,但在诸如“人类最后考试”(HLE)[1] 所提出的复杂代理任务方面仍存在局限性。工具使用是扩展其能力的一个有前景的途径,超越了从训练数据中所能学到的内容。通过借助搜索引擎和代码解释器等外部资源,工具使用已被证明能够提高准确性并减少幻觉[2, 3, 4, 5, 6, 7, 8, 9, 10]。 先前关于工具使用代理的研究主要集中在为单个强大的模型配备实用工具,如网络搜索或计算器。尽管在许多场景中有效,但这种方法未能充分利用工具的潜力:人类在推理时,通常会通过调用超出人类智能的资源来扩展自身能力,从领域专家到复杂的流程和软件系统。受此观察的启发,我们提出了“编排”范式。在这种范式下,智能并非源自单一整体,而是来自复合系统。该系统的核心是一个协调器模型,其职责是针对给定任务调用合适的工具,并按照正确的顺序调用这些工具以完成任务。与标准的单体式设置(其中只有一个强大的模型)相比,关键的区别在于,除了诸如网络搜索功能和代码解释器等确定性工具之外,各种能力的模型也被作为智能工具提供给协调器。使用不同智能水平的工具会产生不同的成本,协调器面临的挑战在于动态决定调用哪些工具来解决任务,同时尊重用户对各种工具的偏好并尽量降低成本。通过将需要智能的大任务分解为更小的子任务,并将这些子任务委托给智能工具处理,而不是由一个全能的模型来处理整个任务,这种协调方式有望展现出比系统中的任何工具和传统的单体式解决方案更高的智能水平。 |
One approach to implementing the orchestrator paradigm is to employ a language model as the orchestrator and allow it to invoke stronger models only when it deems it necessary. This can be done naively by prompting an off-the-shelf language model or by training a general-purpose orchestrator. For the former, we find that relying on straightforward model prompting is brittle and introduces systemic biases. As shown in Figure 3 (left and middle), GPT-5 disproportionately delegates tasks to GPT-5-mini, while Qwen3-8B defers to GPT-5 at a markedly higher rate. This illustrates two present issues of prompting in the context of complex tool orchestration: (i) the overuse of developmentally-related variants of oneself, i.e., self-enhancement bias [11], and (ii) defaulting to the strongest available tool regardless of the cost or relative utility (see Appendix A for more details and §4 for a thorough comparison to baselines). As such, we conclude that the scenarios in which an orchestrating model may call on models and tools of capabilities both inferior and superior to its own are idiosyncratic in the context of model tool calling and warrant their own approach to training. In addition, controllability in tool-use agents remains underexplored along two axes: cost–efficiency and user preferences (cf. §7). | 实现编排器范式的一种方法是采用语言模型作为编排器,并仅在其认为必要时调用更强大的模型。这可以通过提示现成的语言模型来实现,也可以通过训练通用编排器来实现。对于前者,我们发现单纯依靠直接提示模型的方法很脆弱,并且会引入系统性偏差。如图 3(左图和中图)所示,GPT-5 过度将任务委托给 GPT-5-mini,而 Qwen3-8B 则明显更倾向于调用 GPT-5。这说明了在复杂工具编排的背景下提示存在的两个当前问题:(i)过度使用自身发展相关的变体,即自我增强偏差[11],以及(ii)无论成本或相对效用如何,都默认调用可用的最强工具(更多细节见附录 A,与基线的全面比较见第 4 节)。因此,我们得出结论,在模型工具调用的背景下,编排模型可能调用能力低于或高于自身的模型和工具的场景是独特的,需要专门的方法来进行训练。此外,在工具使用代理的可控性方面,沿成本效益和用户偏好这两个轴线的研究仍处于探索阶段(参见第 7 节)。 |
We address these shortcomings by proposing ToolOrchestra (shown in Figure 2), a novel method for training a small language model to act as the orchestrator – the “brain” of a heterogeneous tool-use agent. Using ToolOrchestra, we produce the Orchestrator, an 8B-parameter model trained end-to-end with reinforcement learning (RL) to decide when and how to invoke more intelligent language models and various tools such as web search or code interpreters, and how to combine them in multi-turn reasoning. Our reward design balances three objectives – correctness of the final outcome, efficiency in resource usage, and alignment with user preferences – to yield a cost-effective and user-controllable tool-use policy. To aid RL training, we build an automatic data synthesis pipeline that generates thousands of verifiable multi-turn tool-use training examples with complex environments across 10 domains. We will make the resulting dataset, ToolScale, publicly available to facilitate further research on tool-use agent training. In our experiments, we rigorously evaluate the merits of our approach on three challenging tasks. On HLE [1], a benchmark consisting of difficult questions across many disciplines, we find that Orchestrator substantially outperforms prior methods with far lower computational cost. We also test on τ2-Bench [12], a function-calling benchmark, where Orchestrator demonstrates the ability to schedule a variety of tools effectively, calling a large model (GPT-5) in only ∼40% of the steps and utilizing cheaper models or tools for the rest, yet still exceeding the performance of an agent that uses the large model for every step. Finally, additional evaluations on FRAMES [13], a factuality reasoning benchmark, provide further evidence of the versatility and robustness of our approach.
We observe that even though the training and testing tasks differ markedly, the RL-trained Orchestrator adapts its tool-use policy to new challenges, indicating a high degree of general reasoning ability. Our contributions can be summarized as follows: (1) We introduce ToolOrchestra, a method for training a small language model to serve as the orchestrator of a diverse toolkit, including classical tools and more intelligent models. This dovetails with recent developments in the field testifying that small language models are often sufficiently powerful and far more economical in agentic systems [14, 15]. (2) We develop a novel reward training design that goes beyond accuracy. The resulting Orchestrator is trained end-to-end to balance task outcome correctness, efficiency in cost and latency, and alignment with user cost and tool preferences. (3) We demonstrate that Orchestrator trained by ToolOrchestra achieves state-of-the-art performance on challenging reasoning benchmarks, surpassing frontier models while using only a fraction of their compute and wall-clock time, and that it generalizes robustly to unseen tasks and tools. 
| 为了解决这些不足,我们提出了 ToolOrchestra(如图 2 所示),这是一种新颖的方法,用于训练一个小语言模型充当编排器——异构工具使用代理的“大脑”。通过 ToolOrchestra,我们生成了 Orchestrator,这是一个 80 亿参数的模型,通过强化学习(RL)进行端到端训练,以决定何时以及如何调用更智能的语言模型和诸如网络搜索或代码解释器等各种工具,并如何在多轮推理中将它们结合起来。我们的奖励设计平衡了三个目标——最终结果的正确性、资源使用的效率以及与用户偏好的一致性,从而产生了一种成本效益高且用户可控的工具使用策略。为了辅助强化学习训练,我们构建了一个自动数据合成管道,生成了跨越 10 个领域的数千个可验证的多轮工具使用训练示例,这些示例具有复杂的环境。我们将公开发布由此产生的数据集 ToolScale,以促进对工具使用智能体训练的进一步研究。 在我们的实验中,我们在三个具有挑战性的任务上严格评估了我们方法的优点。在 HLE [1] 上,这是一个包含多个学科难题的基准测试,我们发现 Orchestrator 显著优于先前的方法,且计算成本低得多。我们还在 τ 2-Bench [12] 上进行了测试,这是一个函数调用基准测试,在这里 Orchestrator 展示了有效调度各种工具的能力,在大约 40% 的步骤中调用大型模型(GPT-5),而在其余步骤中使用更便宜的模型或工具,但仍超过了每一步都使用大型模型的智能体的性能。最后,在 FRAMES [13] 上的额外评估,这是一个事实推理基准测试,进一步证明了我们方法的通用性和鲁棒性。我们观察到,尽管训练任务和测试任务差异显著,但通过强化学习训练的协调器能够适应新的挑战,调整其工具使用策略,这表明其具备高度的通用推理能力。 我们的贡献可总结如下:(1)我们引入了 ToolOrchestra 方法,该方法用于训练一个小语言模型来充当包含传统工具和更智能模型的多样化工具包的协调器。这与该领域近期的发展相契合,证明了在代理系统中,小语言模型通常足够强大且经济得多[14, 15]。(2)我们开发了一种新颖的奖励训练设计,超越了单纯准确性。由此训练出的协调器能够端到端地平衡任务结果的正确性、成本和延迟效率以及与用户成本和工具偏好的一致性。(3)我们证明,通过 ToolOrchestra 训练的协调器在具有挑战性的推理基准测试中达到了最先进的性能,超越了前沿模型,同时仅使用了它们一小部分的计算资源和运行时间,并且能够稳健地泛化到未见过的任务和工具。 |
Figure 2:Overview of Orchestrator. Given a task, Orchestrator alternates between reasoning and tool calling in multiple turns to solve it. Orchestrator interacts with a diverse tool set, including basic tools (web search, functions such as get_flight_status, etc.), specialized LLMs (coding models, math models, etc.) and generalist LLMs (GPT-5, Claude Opus 4.1, etc.). In training under ToolOrchestra, Orchestrator is jointly optimized by outcome, efficiency and preference rewards via reinforcement learning.图 2:编排器概述。给定一项任务,编排器通过多次交替进行推理和调用工具来解决它。编排器与多样化的工具集进行交互,包括基础工具(网络搜索、获取航班状态等函数)、专业 LLM(编程模型、数学模型等)和通用 LLM(GPT-5、Claude Opus 4.1 等)。在 ToolOrchestra 的训练中,编排器通过强化学习,依据结果、效率和偏好奖励共同进行优化。
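图 2 描述的“推理 → 调用工具 → 观察”多回合循环,可用下面的极简 Python 草图勾勒(其中 policy 与工具实现均为占位假设;真实系统中策略由 RL 训练的 Orchestrator 模型给出):

```python
# 多回合编排循环的占位草图:policy() 在每一步决定调用哪个工具,
# 或在认为已能作答时终止并输出最终答案。
def policy(history):
    """占位策略:首轮发起一次检索,次轮直接终止并给出答案。"""
    if not any(step[0] == "action" for step in history):
        return {"tool": "web_search", "args": {"query": history[0][1]}}
    return {"tool": None, "answer": "42"}

# 工具注册表:真实系统中这里是搜索 API、解释器乃至其它 LLM
TOOLS = {"web_search": lambda query: f"search results for: {query}"}

def orchestrate(task: str, max_turns: int = 5) -> str:
    history = [("task", task)]
    for _ in range(max_turns):
        decision = policy(history)                  # reasoning → action
        if decision["tool"] is None:                # 终止:输出最终答案
            return decision["answer"]
        obs = TOOLS[decision["tool"]](**decision["args"])  # observation
        history.append(("action", decision))
        history.append(("observation", obs))
    return "no answer within turn limit"

answer = orchestrate("What is 6 x 7?")
```

回合上限(max_turns)对应前文 MDP 建模中的终止条件之一;把占位的 policy 换成可训练模型,即得到论文所述结构的骨架。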
Figure 3:Tool-calling preferences exhibited by a prompted off-the-shelf or RL-trained model. GPT-5 tends to call GPT-5-mini most of the time, while Qwen3-8B relies heavily on GPT-5.图 3:提示式现成模型或通过强化学习训练的模型所表现出的工具调用偏好。GPT-5 大部分时候倾向于调用 GPT-5-mini,而 Qwen3-8B 则严重依赖 GPT-5。
8、Conclusion
In this work, we presented ToolOrchestra, a method for training a small orchestration model to unify diverse tools and specialized models. By training Orchestrator end-to-end with reinforcement learning, we showed that it can learn to plan adaptive tool-use strategies guided by outcome quality, efficiency, and human preference rewards. This enables the agent to dynamically balance performance and cost, rather than relying on static heuristics or purely supervised approaches. To aid reinforcement learning, we also contribute a complex user-agent-tool synthetic dataset ToolScale. Our experiments on challenging benchmarks demonstrate that our Orchestrator-8B attains state-of-the-art performance while operating at significantly lower cost compared to larger models. Looking ahead, we envision more sophisticated recursive orchestrator systems to push the upper bound of intelligence but also to further enhance efficiency in solving increasingly complex agentic tasks. | 在本研究中,我们提出了 ToolOrchestra 方法,这是一种通过训练小型编排模型来统一各种工具和专业模型的方法。通过使用强化学习对 Orchestrator 进行端到端的训练,我们证明了它能够学习到以结果质量、效率和人类偏好奖励为指导的自适应工具使用策略。这使得智能体能够动态地平衡性能和成本,而不是依赖于静态启发式方法或纯监督方法。为了辅助强化学习,我们还贡献了一个复杂的用户-智能体-工具合成数据集 ToolScale。我们在具有挑战性的基准测试上的实验表明,我们的 Orchestrator-8B 达到了最先进的性能,同时与更大的模型相比,其运行成本显著降低。展望未来,我们设想更复杂的递归编排系统,以提升智能的上限,并进一步提高解决日益复杂的代理任务的效率。 |