Details
ForestryBERT: A pre-trained language model with continual learning adapted to changing forestry text (indexed in SCI-EXPANDED and EI)
Document Type: Journal Article
English Title: ForestryBERT: A pre-trained language model with continual learning adapted to changing forestry text
Authors: Tan, Jingwei[1,2]; Zhang, Huaiqing[1,2]; Yang, Jie[1,2,3]; Liu, Yang[1,2]; Zheng, Dongping[4]; Liu, Xiqin[5]
First Author: Tan, Jingwei
Corresponding Author: Zhang, Huaiqing[1,2]
Affiliations: [1] Chinese Acad Forestry, Inst Forest Resource Informat Tech, Beijing 100091, Peoples R China; [2] Natl Forestry & Grassland Sci Data Ctr NFGSDC, Beijing 100091, Peoples R China; [3] Beijing Forestry Univ, Coll Forestry, Beijing 100083, Peoples R China; [4] Univ Hawaii Manoa, Dept Second Language Studies, 1890 East-West Rd, Honolulu, HI 96822 USA; [5] South China Univ Technol, Sch Foreign Languages, Guangzhou 510641, Peoples R China
Year: 2025
Volume: 320
Journal: KNOWLEDGE-BASED SYSTEMS
Indexed in: EI (Accession No. 20252018413992); Scopus (Accession No. 2-s2.0-105004810427); Web of Science SCI-EXPANDED (Accession No. WOS:001501994600008)
Funding: This work was supported by the National Key Research and Development Program of China [grant number 2022YFE0128100] and the National Natural Science Foundation of China [grant number 32271877].
Language: English
Keywords: Pre-trained language model; Domain-specific pre-training; Continual learning; Forestry text processing; Text classification; Extractive question answering; Natural language processing
Abstract: Efficient utilization and enhancement of the growing volume of forestry-related textual data is crucial for advancing smart forestry. Pre-trained language models (PLMs) have demonstrated strong capabilities in processing large amounts of unlabeled text. To adapt a general PLM to a specific domain, existing studies typically employ a single target corpus for one-time pre-training to incorporate domain-specific knowledge. However, this approach fails to align with the dynamic processes of continuous adaptation and knowledge accumulation that are essential in real-world scenarios. This study proposes ForestryBERT, a BERT model continually pre-trained on three Chinese forestry corpora comprising 204,636 texts (19.66 million words) using a continual learning method called DAS. We evaluate the model on both text classification and extractive question answering tasks, using five datasets for each task. Experimental results show that ForestryBERT outperforms five general-domain PLMs and further pre-trained PLMs (without DAS) across eight custom-built forestry datasets. Moreover, PLMs using DAS exhibit a forgetting rate of 0.65, which is 1.41 lower than that of PLMs without DAS, and demonstrate superior performance on both new and old tasks. These findings indicate that ForestryBERT, based on continual learning, effectively mitigates catastrophic forgetting and facilitates the continuous acquisition of new knowledge. It expands its forestry knowledge by continually absorbing new unlabeled forestry corpora, showcasing its potential for sustainability and scalability. Our study provides a strategy for handling the growing volume of forestry text during PLM construction, a strategy that is also applicable to other domains.
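Illustrative note: as a rough sketch of the continual domain-adaptive pre-training workflow described in the abstract (not the paper's DAS implementation, which additionally mitigates catastrophic forgetting), a general BERT checkpoint can be further pre-trained with masked language modeling on successive unlabeled forestry corpora. The starting checkpoint, corpus file names, and hyperparameters below are assumptions for illustration only; the code assumes the Hugging Face transformers and datasets libraries.

# Minimal sketch of sequential (continual) domain-adaptive pre-training.
# NOT the paper's DAS method: plain MLM pre-training on one corpus after another.
from transformers import (
    BertTokenizerFast,
    BertForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

model_name = "bert-base-chinese"  # assumed general-domain starting checkpoint
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)

# Standard masked language modeling collator (15% masking).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

# Hypothetical file names standing in for the three Chinese forestry corpora.
corpora = ["forestry_corpus_1.txt", "forestry_corpus_2.txt", "forestry_corpus_3.txt"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

for stage, path in enumerate(corpora, start=1):
    dataset = load_dataset("text", data_files=path, split="train")
    tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

    args = TrainingArguments(
        output_dir=f"forestrybert_stage{stage}",
        per_device_train_batch_size=16,
        num_train_epochs=1,
        learning_rate=5e-5,
        save_strategy="no",
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=tokenized,
        data_collator=collator,
    )
    trainer.train()  # weights carry over to the next corpus, accumulating domain knowledge

Without a continual learning method such as DAS, this sequential scheme is prone to forgetting earlier (general-domain and earlier-corpus) knowledge, which is the gap the paper's approach addresses.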