Using Vision Language Models for Safety Hazard Identification in Construction
Journal: arXiv
Published Date: Apr 12, 2025
Abstract
Safety hazard identification and prevention are the key elements of proactive
safety management. Previous research has extensively explored the applications
of computer vision to automatically identify hazards from image clips collected
from construction sites. However, these methods struggle to identify
context-specific hazards, as they focus on detecting predefined individual
entities without understanding their spatial relationships and interactions.
Furthermore, their limited adaptability to varying construction site guidelines
and conditions hinders their generalization across different projects. These
limitations reduce their ability to assess hazards in complex construction
environments and to adapt to unseen risks, leading to potential safety
gaps. To address these challenges, we proposed and experimentally validated a
Vision Language Model (VLM)-based framework for the identification of
construction hazards. The framework incorporates a prompt engineering module
that structures safety guidelines into contextual queries, allowing the VLM to
process visual information and generate hazard assessments aligned with the
regulation guide. Within this framework, we evaluated state-of-the-art VLMs,
including GPT-4o, Gemini, Llama 3.2, and InternVL2, using a custom dataset of
1100 construction site images. Experimental results show that GPT-4o and Gemini
1.5 Pro outperformed the alternatives, achieving promising BERTScores of 0.906
and 0.888, respectively, highlighting their ability to identify both general and
context-specific hazards. However, processing times remain a significant
challenge, impacting real-time feasibility. These findings offer insights into
the practical deployment of VLMs for construction site hazard detection,
thereby contributing to the enhancement of proactive safety management.
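
As an illustration of how a prompt engineering module might structure safety guidelines into a contextual query, the following is a minimal sketch only. The abstract does not specify the prompt format, so the `Guideline` class, the `build_hazard_prompt` function, the prompt wording, and the example rules are all hypothetical and are not taken from the paper. The sketch builds the text query only; sending it to a VLM together with an image is left out, since that depends on the model's API.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Guideline:
    """One site safety rule (hypothetical structure, not from the paper)."""
    rule_id: str
    text: str


def build_hazard_prompt(guidelines: List[Guideline], site_context: str) -> str:
    """Turn a list of safety guidelines into one contextual text query.

    The wording below is an assumption made for illustration; the actual
    prompts used in the paper are not given in the abstract.
    """
    if not guidelines:
        raise ValueError("at least one guideline is required")
    # One bullet per rule, tagged with its ID so the model can cite it.
    rules = "\n".join(f"- [{g.rule_id}] {g.text}" for g in guidelines)
    return (
        "You are a construction safety inspector.\n"
        f"Site context: {site_context}\n"
        "Applicable guidelines:\n"
        f"{rules}\n"
        "For the attached image, list each visible hazard, cite the ID of "
        "the guideline it violates, and describe the spatial relationship "
        "between the entities involved. If no hazard is visible, answer "
        "'No hazard identified'."
    )


# Hypothetical example rules, invented for this sketch.
EXAMPLE_GUIDELINES = [
    Guideline("G1", "Workers at height must wear a fall-arrest harness."),
    Guideline("G2", "No worker may stand under a suspended load."),
]
EXAMPLE_PROMPT = build_hazard_prompt(EXAMPLE_GUIDELINES, "high-rise steel erection")
```

Tagging each rule with an ID is one possible design choice: it lets the generated assessment be traced back to a specific guideline, and swapping the guideline list is enough to adapt the query to a different site.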