Details
RemoteCLIP: A Vision Language Foundation Model for Remote Sensing (EI-indexed) — Cited by: 111
Document type: Journal article
Title (English): RemoteCLIP: A Vision Language Foundation Model for Remote Sensing
Authors: Liu, Fan[1]; Chen, Delong[2]; Guan, Zhangqingyun[1]; Zhou, Xiaocong[1]; Zhu, Jiale[1]; Ye, Qiaolin[3]; Fu, Liyong[4]; Zhou, Jun[5]
First author: Liu, Fan
Affiliations: [1] College of Computer and Information, Hohai University, Nanjing, 210098, China; [2] Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Hong Kong; [3] College of Information Science and Technology, Nanjing Forestry University, Nanjing, 210037, China; [4] Institute of Forest Resource Information Techniques, Chinese Academy of Forestry, Beijing, 100091, China; [5] School of Information and Communication Technology, Griffith University, Nathan, QLD, 4111, Australia
Year: 2023
Source: arXiv
Indexed in: EI (Accession no.: 20230227360)
Language: English
Keywords: Classification (of information); Foundations; Image classification; Remote sensing; Text processing; Visual languages; Zero-shot learning
Abstract: General-purpose foundation models have led to recent breakthroughs in artificial intelligence. In remote sensing, self-supervised learning (SSL) and Masked Image Modeling (MIM) have been adopted to build foundation models. However, these models primarily learn low-level features and require annotated data for fine-tuning. Moreover, they are inapplicable for retrieval and zero-shot applications due to the lack of language understanding. To address these limitations, we propose RemoteCLIP, the first vision-language foundation model for remote sensing that aims to learn robust visual features with rich semantics and aligned text embeddings for seamless downstream application. To address the scarcity of pre-training data, we leverage data scaling, which converts heterogeneous annotations into a unified image-caption data format based on Box-to-Caption (B2C) and Mask-to-Box (M2B) conversion. By further incorporating UAV imagery, we produce a 12× larger pretraining dataset than the combination of all available datasets. RemoteCLIP can be applied to a variety of downstream tasks, including zero-shot image classification, linear probing, k-NN classification, few-shot classification, image-text retrieval, and object counting in remote sensing images. Evaluation on 16 datasets, including a newly introduced RemoteCount benchmark to test the object counting ability, shows that RemoteCLIP consistently outperforms baseline foundation models across different model scales. Impressively, RemoteCLIP beats the state-of-the-art method by 9.14% mean recall on the RSITMD dataset and 8.92% on the RSICD dataset. For zero-shot classification, our RemoteCLIP outperforms the CLIP baseline by up to 6.39% average accuracy on 12 downstream datasets. © 2023, CC BY.
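The zero-shot classification the abstract reports follows the standard CLIP recipe: embed the image and one text prompt per class, then pick the class whose prompt embedding is most similar to the image embedding. A minimal sketch of that matching step (the toy vectors and prompt labels below are illustrative placeholders, not actual RemoteCLIP encoder outputs):

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs):
    """CLIP-style zero-shot matching: cosine similarity + softmax over class prompts."""
    # L2-normalize both sides, as CLIP-style models do before comparison
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    # Cosine similarity between the image and each class prompt embedding
    sims = txt @ img
    # Softmax over classes yields per-class probabilities
    e = np.exp(sims - sims.max())
    probs = e / e.sum()
    return int(np.argmax(sims)), probs

# Toy embeddings standing in for real encoder outputs (hypothetical values)
image_emb = np.array([0.9, 0.1, 0.0])
text_embs = np.array([
    [1.0, 0.0, 0.0],  # e.g. "an aerial photo of an airport"
    [0.0, 1.0, 0.0],  # e.g. "an aerial photo of a forest"
])
idx, probs = zero_shot_classify(image_emb, text_embs)
print(idx)  # 0: the image is closest to the first class prompt
```

In practice the embeddings would come from the paired image and text encoders; the ranking step itself is the simple normalized dot product shown here.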
