Remote Sensing Semantic Segmentation Quality Assessment based on Vision Language Model
Journal: arXiv
Published Date: Feb 19, 2025
Abstract
The complexity of scenes and variations in image quality result in
significant variability in the performance of semantic segmentation methods for
remote sensing imagery (RSI) in supervised real-world scenarios, which makes
evaluating segmentation quality in such scenarios an open problem. Most existing
evaluation metrics rely on expert-labeled object-level annotations and are
therefore not applicable in such scenarios. To address this issue, we propose
RS-SQA, an unsupervised quality
assessment model for RSI semantic segmentation based on a vision language model
(VLM). This framework leverages a pre-trained RS VLM for semantic understanding
and utilizes intermediate features from segmentation methods to extract
implicit information about segmentation quality. Specifically, we introduce
CLIP-RS, a large-scale pre-trained VLM trained with purified text to reduce
textual noise and capture robust semantic information in the RS domain. Feature
visualizations confirm that CLIP-RS can effectively differentiate between
various levels of segmentation quality. Semantic features and low-level
segmentation features are effectively integrated through a semantic-guided
approach to enhance evaluation accuracy. To further support the development of
RS semantic segmentation quality assessment, we present RS-SQED, a dedicated
dataset sampled from four major RS semantic segmentation datasets and annotated
with segmentation accuracy derived from the inference results of 8
representative segmentation methods. Experimental results on the established
dataset demonstrate that RS-SQA significantly outperforms state-of-the-art
quality assessment models. This provides essential support for predicting
segmentation accuracy and high-quality semantic segmentation interpretation,
offering substantial practical value.
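
The abstract states that RS-SQED is annotated with segmentation accuracy derived from the inference results of segmentation methods, but it does not say which accuracy measure is used. As a minimal sketch of how such a per-image label could be computed, the following assumes two common choices, pixel accuracy and mean intersection over union (mIoU). The function names, the `Mask` type, and the toy masks are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence

# Assumption: a mask is an H x W grid of integer class indices.
Mask = Sequence[Sequence[int]]


def _flatten(mask: Mask) -> List[int]:
    """Flatten a 2-D mask into a 1-D list of class indices."""
    return [label for row in mask for label in row]


def _check(pred: List[int], truth: List[int]) -> None:
    if not truth or len(pred) != len(truth):
        raise ValueError("masks must be non-empty and the same size")


def pixel_accuracy(pred: Mask, truth: Mask) -> float:
    """Fraction of pixels whose predicted class matches the reference."""
    p, t = _flatten(pred), _flatten(truth)
    _check(p, t)
    return sum(a == b for a, b in zip(p, t)) / len(t)


def mean_iou(pred: Mask, truth: Mask, num_classes: int) -> float:
    """Mean IoU over classes that appear in the prediction or the reference."""
    p, t = _flatten(pred), _flatten(truth)
    _check(p, t)
    ious = []
    for c in range(num_classes):
        inter = sum(a == c and b == c for a, b in zip(p, t))
        union = sum(a == c or b == c for a, b in zip(p, t))
        if union:  # skip classes absent from both masks
            ious.append(inter / union)
    if not ious:
        raise ValueError("no class in range(num_classes) occurs in the masks")
    return sum(ious) / len(ious)


# Toy example: a 2 x 2 reference mask and one method's prediction.
truth = [[0, 0], [1, 1]]
pred = [[0, 1], [1, 1]]
label = {
    "pixel_accuracy": pixel_accuracy(pred, truth),
    "mean_iou": mean_iou(pred, truth, num_classes=2),
}
```

Under these assumptions, each (image, segmentation method) pair would receive one such score as its quality label, which a model like RS-SQA would then try to predict without access to the reference mask.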