Details
RemoteCLIP: A Vision Language Foundation Model for Remote Sensing (EI-indexed) — Cited by: 111
Document type: Journal article
Title (English): RemoteCLIP: A Vision Language Foundation Model for Remote Sensing
Authors: Liu, Fan[1]; Chen, Delong[2]; Guan, Zhangqingyun[1]; Zhou, Xiaocong[1]; Zhu, Jiale[1]; Ye, Qiaolin[3]; Fu, Liyong[4]; Zhou, Jun[5]
First author: Liu, Fan
Affiliations: [1] College of Computer and Information, Hohai University, Nanjing, 210098, China; [2] Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Hong Kong; [3] College of Information Science and Technology, Nanjing Forestry University, Nanjing, 210037, China; [4] Institute of Forest Resource Information Techniques, Chinese Academy of Forestry, Beijing, 100091, China; [5] School of Information and Communication Technology, Griffith University, Nathan, QLD, 4111, Australia
Year: 2023
Source: arXiv
Indexed in: EI (Accession no.: 20230227360)
Language: English
Keywords: Classification (of information); Foundations; Image classification; Remote sensing; Text processing; Visual languages; Zero-shot learning
Abstract: General-purpose foundation models have led to recent breakthroughs in artificial intelligence. In remote sensing, self-supervised learning (SSL) and Masked Image Modeling (MIM) have been adopted to build foundation models. However, these models primarily learn low-level features and require annotated data for fine-tuning. Moreover, they are inapplicable for retrieval and zero-shot applications due to the lack of language understanding. To address these limitations, we propose RemoteCLIP, the first vision-language foundation model for remote sensing that aims to learn robust visual features with rich semantics and aligned text embeddings for seamless downstream application. To address the scarcity of pre-training data, we leverage data scaling, which converts heterogeneous annotations into a unified image-caption data format based on Box-to-Caption (B2C) and Mask-to-Box (M2B) conversion. By further incorporating UAV imagery, we produce a 12× larger pretraining dataset than the combination of all available datasets. RemoteCLIP can be applied to a variety of downstream tasks, including zero-shot image classification, linear probing, k-NN classification, few-shot classification, image-text retrieval, and object counting in remote sensing images. Evaluation on 16 datasets, including a newly introduced RemoteCount benchmark to test the object counting ability, shows that RemoteCLIP consistently outperforms baseline foundation models across different model scales. Impressively, RemoteCLIP beats the state-of-the-art method by 9.14% mean recall on the RSITMD dataset and 8.92% on the RSICD dataset. For zero-shot classification, our RemoteCLIP outperforms the CLIP baseline by up to 6.39% average accuracy on 12 downstream datasets. © 2023, CC BY.
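The zero-shot classification the abstract reports follows the standard CLIP recipe: embed the image and one text prompt per class, then pick the class whose prompt embedding is most similar to the image embedding. A minimal sketch of that matching step (the toy vectors and prompt labels below are illustrative placeholders, not actual RemoteCLIP encoder outputs):

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs):
    """CLIP-style zero-shot matching: cosine similarity + softmax over class prompts."""
    # L2-normalize both sides, as CLIP-style models do before comparison
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    # Cosine similarity between the image and each class prompt embedding
    sims = txt @ img
    # Softmax over classes yields per-class probabilities
    e = np.exp(sims - sims.max())
    probs = e / e.sum()
    return int(np.argmax(sims)), probs

# Toy embeddings standing in for real encoder outputs (hypothetical values)
image_emb = np.array([0.9, 0.1, 0.0])
text_embs = np.array([
    [1.0, 0.0, 0.0],  # e.g. "an aerial photo of an airport"
    [0.0, 1.0, 0.0],  # e.g. "an aerial photo of a forest"
])
idx, probs = zero_shot_classify(image_emb, text_embs)
print(idx)  # 0: the image is closest to the first class prompt
```

In practice the embeddings would come from the paired image and text encoders; the ranking step itself is the simple normalized dot product shown here.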
