
Record Details

RemoteCLIP: A Vision Language Foundation Model for Remote Sensing  (Indexed in SCI-EXPANDED and EI)   Cited by: 58

Document Type: Journal Article

Title: RemoteCLIP: A Vision Language Foundation Model for Remote Sensing

Authors: Liu, Fan[1,2]; Chen, Delong[3]; Guan, Zhangqingyun[1]; Zhou, Xiaocong[1]; Zhu, Jiale[1]; Ye, Qiaolin[4]; Fu, Liyong[5]; Zhou, Jun[6]

First Author: Liu, Fan

Corresponding Authors: Liu, F[1]; Liu, F[2]; Chen, DL[3]

Affiliations: [1] Hohai Univ, Coll Comp Sci & Software Engn, Nanjing 210098, Peoples R China; [2] Minist Water Resources, Key Lab Water Big Data Technol, Nanjing 211100, Peoples R China; [3] Hong Kong Univ Sci & Technol, Dept Elect & Comp Engn, Hong Kong, Peoples R China; [4] Nanjing Forestry Univ, Coll Informat Sci & Technol, Nanjing 210037, Peoples R China; [5] Chinese Acad Forestry, Inst Forest Resource Informat Tech, Beijing 100091, Peoples R China; [6] Griffith Univ, Sch Informat & Commun Technol, Nathan, Qld 4111, Australia

Year: 2024

Volume: 62

Pages: 1-1

Journal: IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING

Indexed In: EI (Accession No. 20241715959760); Scopus (Accession No. 2-s2.0-85190796785); Web of Science SCI-EXPANDED (Accession No. WOS:001219598000024)

Funding: No Statement Available

Language: English

Keywords: Contrastive language image pretraining (CLIP); foundation model; multimodality; remote sensing; vision-language

Abstract: General-purpose foundation models have led to recent breakthroughs in artificial intelligence (AI). In remote sensing, self-supervised learning (SSL) and masked image modeling (MIM) have been adopted to build foundation models. However, these models primarily learn low-level features and require annotated data for fine-tuning. Moreover, they are inapplicable for retrieval and zero-shot applications due to the lack of language understanding. To address these limitations, we propose RemoteCLIP, the first vision-language foundation model for remote sensing that aims to learn robust visual features with rich semantics and aligned text embeddings for seamless downstream application. To address the scarcity of pretraining data, we leverage data scaling, which converts heterogeneous annotations into a unified image-caption data format based on box-to-caption (B2C) and mask-to-box (M2B) conversion. By further incorporating unmanned aerial vehicle (UAV) imagery, we produce a pretraining dataset 12x larger than the combination of all available datasets. RemoteCLIP can be applied to a variety of downstream tasks, including zero-shot image classification, linear probing, k-NN classification, few-shot classification, image-text retrieval, and object counting in remote sensing images. Evaluation on 16 datasets, including a newly introduced RemoteCount benchmark to test the object counting ability, shows that RemoteCLIP consistently outperforms baseline foundation models across different model scales. Impressively, RemoteCLIP beats the state-of-the-art (SOTA) method by 9.14% mean recall on the RSITMD dataset and 8.92% on the RSICD dataset. For zero-shot classification, our RemoteCLIP outperforms the contrastive language image pretraining (CLIP) baseline by up to 6.39% average accuracy on 12 downstream datasets.
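The abstract describes CLIP-style contrastive pretraining, in which zero-shot classification is performed by comparing an image embedding against embeddings of text prompts built from candidate class names. The snippet below is a minimal sketch of that inference pattern using the open_clip library; the backbone tag, pretrained weights, prompt template, class names, and image path are illustrative placeholders and do not refer to the official RemoteCLIP release.

```python
# Minimal sketch of CLIP-style zero-shot scene classification (placeholder names).
import torch
import open_clip
from PIL import Image

# Load an OpenCLIP-compatible vision-language model (illustrative weights, not RemoteCLIP's).
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

# Candidate remote-sensing class names, turned into text prompts.
class_names = ["airport", "forest", "harbor", "residential area"]
prompts = [f"a satellite image of a {name}" for name in class_names]

image = preprocess(Image.open("scene.jpg")).unsqueeze(0)  # hypothetical input image
text = tokenizer(prompts)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Cosine similarity between the image embedding and each prompt embedding,
    # converted to a probability distribution over the candidate classes.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(class_names, probs.squeeze(0).tolist())))
```

The same image and text encoders also support the retrieval setting mentioned in the abstract: ranking captions by cosine similarity to an image embedding (or vice versa) instead of softmaxing over a fixed class list.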

