🪐 Beyond APIs: Probing the Limits of MLLMs in Physical Tool Use

A benchmark for physical tool perception, selection, and sequencing in real-world scenes

Zhixin Ma¹, Yutong Zhou², Yongqi Li², Chong-Wah Ngo¹, Wenjie Li²

¹Singapore Management University, ²The Hong Kong Polytechnic University

PhysTool‑Bench tests how well Multimodal Large Language Models (MLLMs) can spot physical tools in messy images, understand their functions, and plan the correct order in which to use them. These are skills that current MLLMs still lack. The benchmark contains 2,510 test scenarios, 2,678 real-world tools, evaluation code, and human-verified ground truth.

📘 Abstract

Multimodal Large Language Models (MLLMs) excel at utilizing digital APIs and increasingly serve as the “brain” of embodied AI, instructing robots to interact with the physical world. In such embodied settings, a central capability is the use of physical tools, which underpins MLLMs’ ability to assist humans in real-world tasks. Despite its importance, MLLMs’ proficiency in physical tool use remains largely unexplored. To address this gap, we introduce PhysTool-Bench, the first physical tool-use benchmark designed to evaluate MLLMs’ ability to comprehend real-world scenarios, identify physical tools, and plan their use. PhysTool-Bench comprises 2,510 queries over 2,678 real-world physical tools spanning diverse domains, including manufacturing, electrical work, agriculture, and healthcare. Concretely, models are evaluated along two primary dimensions: 1) recognizing all physical tools present in the scene, and 2) planning the tool selection and use sequence based on the instruction and visual context. Across 13 leading MLLMs, even the strongest model (Gemini-3.1-Pro) identifies only 58.7% of tools in a scene and completes merely 21.0% of queries end-to-end. Our analysis reveals a two-level deficit: MLLMs struggle to perceive tools in realistic scenes, and the much larger drop at the planning stage further indicates a lack of functional commonsense for mapping perceived tools onto task semantics, pinpointing a critical bottleneck for the development of practical embodied AI.

🧠 Method

PhysTool‑Bench is constructed through a semi‑automated, three‑stage pipeline: (1) tool bank initialization and extension (2,678 tools from 57 UNSPSC segments); (2) query generation with target tool combinations (1–3 tools per scene), step labeling, natural‑language instructions, and distractor injection, followed by realistic rendering using Nano Banana Pro; and (3) multi‑stage quality assurance (Gemini‑3.1 auditing, programmatic description‑image alignment checks, and human review).
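The distractor-injection step in stage (2) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `SceneSpec` record, the `inject_distractors` function, and uniform random sampling of distractors are hypothetical and are not taken from the released pipeline, which may well choose distractors by visual or functional similarity instead.

```python
from __future__ import annotations

import random
from dataclasses import dataclass


@dataclass
class SceneSpec:
    """Hypothetical record describing one scene before rendering."""
    instruction: str
    target_tools: list[str]   # ordered: the intended use sequence
    distractors: list[str]    # tools present in the scene but not needed
    scene_tools: list[str]    # everything that should appear in the image


def inject_distractors(instruction, target_tools, tool_bank, n_distractors, rng):
    """Add n_distractors non-target tools to a scene (uniform sampling is assumed)."""
    if not 1 <= len(target_tools) <= 3:
        raise ValueError("expected 1-3 target tools per scene")
    # Distractors must never coincide with a target tool.
    candidates = [t for t in tool_bank if t not in target_tools]
    distractors = rng.sample(candidates, n_distractors)
    # Shuffle so that scene order leaks nothing about the use sequence.
    scene_tools = list(target_tools) + distractors
    rng.shuffle(scene_tools)
    return SceneSpec(instruction, list(target_tools), distractors, scene_tools)


if __name__ == "__main__":
    bank = ["torque wrench", "socket", "hammer", "pliers", "hand saw"]
    spec = inject_distractors(
        "Tighten the wheel bolts to spec.",
        ["socket", "torque wrench"],
        bank,
        n_distractors=2,
        rng=random.Random(0),
    )
    print(spec.scene_tools)
```

Passing an explicit `random.Random` instance keeps scene construction reproducible, which matters when the same specification has to be re-rendered after a quality-assurance failure.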

Evaluation is zero‑shot. Task I measures raw visual enumeration (precision, recall, F1). Task II evaluates planning with metrics including Exact Match (EM), Task‑Completable Rate (TCR), Success@k, and order‑agnostic F1, plus root‑cause error classification.
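As a rough illustration of these metrics, the sketch below computes set-based precision, recall, and F1 (as used for Task I, and for the order-agnostic F1 of Task II) together with Exact Match on the ordered tool sequence. The function names and the lowercase/strip name normalization are assumptions for this example, not the benchmark's released evaluation code. Task‑Completable Rate and Success@k are left out because their exact definitions are not given here.

```python
def normalize(names):
    """Assumed normalization: case-insensitive, whitespace-trimmed tool names."""
    return [n.strip().lower() for n in names]


def precision_recall_f1(predicted, gold):
    """Set-based precision, recall, and F1 over tool names (order is ignored)."""
    pred, ref = set(normalize(predicted)), set(normalize(gold))
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def exact_match(predicted_sequence, gold_sequence):
    """True only if the same tools appear in the same order."""
    return normalize(predicted_sequence) == normalize(gold_sequence)


if __name__ == "__main__":
    # Task I style: enumerate the tools visible in the scene.
    p, r, f1 = precision_recall_f1(
        ["hammer", "saw", "drill"],
        ["hammer", "drill", "wrench", "pliers"],
    )
    print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")

    # Task II style: right tools, wrong order.
    plan, gold = ["drill", "hammer"], ["hammer", "drill"]
    print(exact_match(plan, gold))             # False: order differs
    print(precision_recall_f1(plan, gold)[2])  # 1.0: same set of tools
```

The last two lines show why both metrics are reported: an order-agnostic F1 of 1.0 alongside a failed Exact Match isolates a sequencing error from a selection error.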

Overview of the PhysTool‑Bench construction and evaluation pipeline (adapted from Figure 2 of the paper).

📊 Results

Example result chart 1 — Performance comparison of 13+ MLLMs on Task I (Tool Recognition) and Task II (Planning).

Example result chart 2 — Error decomposition: substitution, missing, extra, and out‑of‑order errors across models.

Key findings: Even the best models (GPT‑4o, Gemini‑1.5‑Pro) achieve low Exact Match on Task II (≈15–25%). Task‑Completable Rate is higher (≈40–55%), but models frequently substitute functionally similar tools or reorder steps incorrectly. Open‑weight models like MiniCPM‑V and mPLUG‑Owl3 lag behind by 10–20% on F1. Error analysis shows that functional substitution (e.g., replacing a torque wrench with a regular wrench) is the dominant failure mode, indicating that MLLMs lack robust physical commonsense beyond visual recognition.
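The four error types in the decomposition (substitution, missing, extra, out‑of‑order) can be approximated from a predicted plan and its ground truth, as in the sketch below. This is a simplified heuristic under stated assumptions, not the paper's root‑cause classifier: the `decompose_errors` function is hypothetical, tool names are assumed to be unique within a plan, and any missing tool paired with an extra tool is counted as a substitution without checking functional similarity.

```python
def decompose_errors(predicted, gold):
    """Count substitution, missing, extra, and out-of-order errors for one plan.

    Assumes each tool appears at most once per plan.
    """
    pred_set, gold_set = set(predicted), set(gold)
    not_predicted = gold_set - pred_set   # required tools the model left out
    not_required = pred_set - gold_set    # tools the model added unnecessarily

    # Heuristic: one left-out tool plus one unnecessary tool = one substitution.
    substitution = min(len(not_predicted), len(not_required))
    missing = len(not_predicted) - substitution
    extra = len(not_required) - substitution

    # Compare the relative order of the tools both plans agree on.
    shared_in_pred = [t for t in predicted if t in gold_set]
    shared_in_gold = [t for t in gold if t in pred_set]
    out_of_order = shared_in_pred != shared_in_gold

    return {
        "substitution": substitution,
        "missing": missing,
        "extra": extra,
        "out_of_order": out_of_order,
    }


if __name__ == "__main__":
    gold = ["socket", "torque wrench"]
    # A regular wrench used in place of the torque wrench.
    print(decompose_errors(["socket", "wrench"], gold))
    # Correct tools, reversed order.
    print(decompose_errors(["torque wrench", "socket"], gold))
```

A real classifier would need a notion of functional similarity to tell a genuine substitution from an unrelated miss-plus-extra pair. The count-based pairing here only gives an upper bound on substitutions.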

📚 BibTeX

Please consider citing our work if you find it useful:

@article{PhysTool-Bench2026,
  title         = {Beyond APIs: Probing the Limits of MLLMs in Physical Tool Use},
  author        = {Zhixin Ma and Yutong Zhou and Yongqi Li and Chong-Wah Ngo and Wenjie Li},
  year          = {2026},
  eprint        = {2606.10803},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL}
}

🙏 Acknowledgements

We would like to thank the contributors, open‑source projects, and research communities whose work made PhysTool‑Bench possible.

🖼️ Image Generation – Nano Banana Pro (synthetic scene rendering)
🧠 Open‑weight Models – MiniCPM‑V, mPLUG‑Owl3, OpenFlamingo, InternVL, DeepSeek‑VL, Kimi‑VL, Ovis
💻 Code & Libraries – 🤗 Transformers, vLLM, PyTorch, PIL, requests
📚 Dataset & Classification – UNSPSC, manual annotation & QC team

This project is licensed under the MIT License. Please refer to the LICENSE file for full details.

🪐 Beyond APIs: Probing the Limits of MLLMs in Physical Tool Use · ModalityDance
Maintained by Yutong Zhou. Updated on 2026.6.9.