Prompt to Protection: A Comparative Study of Multimodal LLMs in Construction Hazard Recognition
Journal:
arXiv
Published Date:
Jun 9, 2025
Abstract
The recent emergence of multimodal large language models (LLMs) has
introduced new opportunities for improving visual hazard recognition on
construction sites. Unlike traditional computer vision models that rely on
domain-specific training and extensive datasets, modern LLMs can interpret and
describe complex visual scenes using simple natural language prompts. However,
despite growing interest in their applications, there has been limited
investigation into how different LLMs perform in safety-critical visual tasks
within the construction domain. To address this gap, this study conducts a
comparative evaluation of five state-of-the-art LLMs: Claude-3 Opus, GPT-4.5,
GPT-4o, GPT-o3, and Gemini 2.0 Pro, to assess their ability to identify
potential hazards from real-world construction images. Each model was tested
under three prompting strategies: zero-shot, few-shot, and chain-of-thought
(CoT). Zero-shot prompting involved minimal instruction, few-shot incorporated
basic safety context and a hazard source mnemonic, and CoT provided
step-by-step reasoning examples to scaffold model thinking. Quantitative
analysis was performed using precision, recall, and F1-score metrics across all
conditions. Results reveal that prompting strategy significantly influenced
performance, with CoT prompting consistently producing higher accuracy across
models. Additionally, LLM performance varied under different conditions, with
GPT-4.5 and GPT-o3 outperforming others in most settings. The findings also
demonstrate the critical role of prompt design in enhancing the accuracy and
consistency of multimodal LLMs for construction safety applications. This study
offers actionable insights into the integration of prompt engineering and LLMs
for practical hazard recognition, contributing to the development of more
reliable AI-assisted safety systems.