Arch-Eval benchmark for assessing Chinese architectural domain knowledge in large language models.
Journal:
Scientific Reports
PMID:
40251269
Abstract
The burgeoning application of Large Language Models (LLMs) in Natural Language Processing (NLP) has prompted scrutiny of how well they handle domain-specific knowledge, especially in the construction industry. Despite high demand, evaluative studies of LLMs in this area remain scarce. This paper introduces the "Arch-Eval" framework, a comprehensive tool for assessing LLMs across the architectural domain, encompassing design, engineering, and planning knowledge. It employs a standardized dataset of at least 875 questions, each tested over seven iterations to ensure reliable assessment outcomes. Through experiments applying the "Arch-Eval" framework to 14 different LLMs, we evaluated two key metrics: Stability (the consistency of LLM responses under random variations) and Accuracy (the correctness of the information provided by LLMs). The results reveal significant differences in how these models perform on architectural knowledge question answering. Our findings show that the average accuracy difference between Chain-of-Thought (COT) evaluation and Answer-Only (AO) evaluation is less than 3%, but the response time for COT is significantly longer, roughly 26 times that of AO (62.23 seconds per question vs. 2.38 seconds per question). Advancing LLM utility in construction will require future research on domain customization, reasoning enhancement, and multimodal interaction.
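As a rough illustration of the two metrics named in the abstract, the sketch below computes accuracy and a per-question agreement score across repeated runs. The function names and the exact stability definition (the share of iterations that return each question's modal answer) are assumptions for illustration only, not the paper's published formulas.

```python
from collections import Counter


def accuracy(runs, gold):
    """Mean fraction of correct answers across all runs.

    runs: list of runs, each a list of model answers (one per question).
    gold: list of reference answers, same length as each run.
    """
    per_run = [
        sum(a == g for a, g in zip(run, gold)) / len(gold)
        for run in runs
    ]
    return sum(per_run) / len(per_run)


def stability(runs):
    """Mean per-question agreement: the share of runs giving the modal
    answer for each question (1.0 means identical answers in every run)."""
    n_runs = len(runs)
    n_questions = len(runs[0])
    scores = []
    for q in range(n_questions):
        answers = [run[q] for run in runs]
        modal_count = Counter(answers).most_common(1)[0][1]
        scores.append(modal_count / n_runs)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy example: 3 questions answered over 7 iterations
    # (the benchmark itself uses at least 875 questions).
    gold = ["A", "C", "B"]
    runs = [["A", "C", "B"]] * 5 + [["A", "C", "D"], ["B", "C", "B"]]
    print(f"Accuracy:  {accuracy(runs, gold):.3f}")
    print(f"Stability: {stability(runs):.3f}")
```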