UrbanSense: A Framework for Quantitative Analysis of Urban Streetscapes Leveraging Vision Large Language Models
Journal: arXiv
Published Date: Jun 12, 2025
Abstract
Urban cultures and architectural styles vary significantly across cities due
to geographical, chronological, historical, and socio-political factors.
Understanding these differences is essential for anticipating how cities may
evolve in the future. As representative cases of historical continuity and
modern innovation in China, Beijing and Shenzhen offer valuable perspectives
for exploring the transformation of urban streetscapes. However, conventional
approaches to urban cultural studies often rely on expert interpretation and
historical documentation, which are difficult to standardize across different
contexts. To address this, we propose a multimodal research framework based on
vision-language models, enabling automated and scalable analysis of urban
streetscape style differences. This approach makes urban form research more
objective and data-driven. The contributions of this study are
as follows: First, we construct UrbanDiffBench, a curated dataset of urban
streetscapes containing architectural images from different periods and
regions. Second, we develop UrbanSense, the first vision-language-model-based
framework for urban streetscape analysis, enabling the quantitative generation
and comparison of urban style representations. Third, experimental results show
that over 80% of generated descriptions pass the t-test (p < 0.05).
High Phi scores (0.912 for cities, 0.833 for periods) from subjective
evaluations confirm the method's ability to capture subtle stylistic
differences. These results highlight the method's potential to quantify and
interpret urban style evolution, offering a scientifically grounded lens for
future design.
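
The abstract does not state which t-test variant was used or how the Phi scores were derived from the subjective evaluations. As a minimal sketch only, assuming a Welch two-sample t-test and the standard phi coefficient of a 2x2 agreement table, the two statistics could be computed as follows. The function names and all sample numbers are hypothetical illustrations, not the authors' data or implementation.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom (assumed test variant)."""
    # Per-sample variance of the mean (sample variance divided by sample size).
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def phi_coefficient(n11, n10, n01, n00):
    """Phi coefficient of a 2x2 contingency table with cell counts n11, n10, n01, n00."""
    num = n11 * n00 - n10 * n01
    den = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    # Return 0.0 when a row or column is empty and phi is undefined.
    return num / den if den else 0.0


# Hypothetical feature scores for two groups of streetscape descriptions.
t_stat, dof = welch_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])

# Hypothetical agreement table: rows are the true group, columns the judged group.
phi = phi_coefficient(45, 5, 5, 45)
```

Turning the t statistic into a p-value requires the t distribution's cumulative distribution function, which the Python standard library does not provide; a statistics package would be needed for the p < 0.05 check.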