Leveraging potential of limpid attention transformer with dynamic tokenization for hyperspectral image classification.
Journal:
PloS one
Published Date:
Aug 4, 2025
Abstract
Hyperspectral data consists of continuous narrow spectral bands. Due to this, it has less spatial and high spectral information. Convolutional neural networks (CNNs) emerge as a highly contextual information model for remote sensing applications. Unfortunately, CNNs have constraints in their underlying network architecture in regards to the global correlation of spatial and spectral features, making them less reliable for mining and representing the sequential properties of spectral signatures. In this article, limpid size attention network (LSANet) is proposed, which contains 3D and 2D convolution blocks for enhancement of spatial-spectral features of the hyperspectral image (HSI). In addition, limpid attention block (LAB) is designed to provide a global correlation of the spectral and spatial features through LS attention. Furthermore, the computational costs of LS-attention are less compared to the multi-head self-attention (MHSA) of the classical vision transformer (ViT). In the ViT encoder a conditional position encoding (CPE) module is utilized that dynamically generates tokens from the feature maps to capture a richer contextual representation. The LSANet obtained overall accuracy (OA) of 98.78%, 98.67%, 97.52% and 89.45%, respectively, on the Indian Pines (IP), Pavia University (PU), Salina Valley (SV) and Botswana datasets. Our model's quantitative and qualitative results are considerably better than the classical CNN and transformer-based methods.