PhysTool-Bench is constructed through a semi-automated pipeline with three stages: (1) tool bank initialization and extension (2,678 tools from 57 UNSPSC segments); (2) query generation with target tool combinations (1-3 tools per scene), step labeling, natural language instructions, and distractor injection, followed by realistic rendering using Nano Banana Pro; and (3) multi-stage quality assurance (Gemini-3.1 auditing, programmatic description-image alignment, and human review).
Evaluation is zero-shot. Task I measures raw visual enumeration (precision, recall, F1). Task II evaluates planning with metrics including Exact Match (EM), Task-Completable Rate (TCR), Success@k, and order-agnostic F1, plus root-cause error classification.
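As a rough illustration of the enumeration and planning metrics named above, the sketch below computes set-based precision, recall, and F1 (order-agnostic) and a strict Exact Match over ordered tool sequences. The function names `set_prf` and `exact_match` are hypothetical, and the definitions are assumptions based on common usage; the benchmark's official scoring code may normalize tool names or handle duplicates differently. TCR and Success@k are not sketched here because this section does not define them.

```python
from typing import Sequence, Tuple


def set_prf(predicted: Sequence[str], gold: Sequence[str]) -> Tuple[float, float, float]:
    """Order-agnostic precision, recall, and F1 over unique tool names.

    Assumption: tools are compared as exact strings and duplicates are ignored.
    """
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if true_positives == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def exact_match(predicted: Sequence[str], gold: Sequence[str]) -> bool:
    """Strict match: same tools in the same order (assumed definition of EM)."""
    return list(predicted) == list(gold)


if __name__ == "__main__":
    # Hypothetical scene: the model substitutes a regular wrench for a torque wrench.
    gold_plan = ["torque wrench", "socket", "extension bar"]
    model_plan = ["wrench", "socket", "extension bar"]
    print(set_prf(model_plan, gold_plan))
    print(exact_match(model_plan, gold_plan))
```

Under these assumed definitions, a single functional substitution fails Exact Match outright while still earning partial credit on F1, which is one reason the two metrics can diverge.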
Key findings: Even the best models (GPT-4o, Gemini-1.5-Pro) achieve low Exact Match on Task II (approximately 15-25%). Task-Completable Rate is higher (approximately 40-55%), but models frequently substitute functionally similar tools or reorder steps incorrectly. Open-weight models like MiniCPM-V and mPLUG-Owl3 lag behind by 10-20% on F1. Error analysis shows that functional substitution (e.g., replacing a torque wrench with a regular wrench) is the dominant failure mode, indicating that MLLMs lack robust physical commonsense beyond visual recognition.
@article{PhysTool-Bench2026,
title = {Beyond APIs: Probing the Limits of MLLMs in Physical Tool Use},
author = {Zhixin Ma and Yutong Zhou and Yongqi Li and Chong-Wah Ngo and Wenjie Li},
year = {2026},
eprint = {2606.10803},
archivePrefix = {arXiv},
primaryClass = {cs.CL}
}
We would like to thank the contributors, open-source projects, and research communities whose work made PhysTool-Bench possible.
This project is licensed under the MIT License. Please refer to the LICENSE file for full details.