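The basic idea of the random forest algorithm and its application in ecology: a case study of simulating the distribution of Yunnan pine (cited by: 131)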
The basic principle of random forest and its applications in ecology: a case study of Pinus yunnanensis
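Authors: Zhang Lei[1], Wang Linlin[2], Zhang Xudong[1], Liu Shirong[3], Sun Pengsen[3], Wang Tongli[4]
Affiliations: [1] Research Institute of Forestry, Chinese Academy of Forestry; Key Laboratory of Tree Breeding and Cultivation, State Forestry Administration, Beijing 100091; [2] College of Forestry, Beijing Forestry University, Beijing 100083; [3] Institute of Forest Ecology, Environment and Protection, Chinese Academy of Forestry; Key Laboratory of Forest Ecology and Environment, State Forestry Administration, Beijing 100091; [4] Department of Forest Sciences, University of British Columbia, 3041-2424 Main Mall, Vancouver, B.C., Canada V6T 1Z4
Journal: Acta Ecologica Sinica
Keywords: random forest; classification and regression tree; variable importance; multi-dimensional scaling; species distribution modelling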
Ecological data are often complex. The explanatory and the response variables may be categorical or numerical. The ecological relationships that need to be defined are often nonlinear and involve high-order interactions between explanatory variables. Missing values for both response and predictor variables are very common, and outliers almost always exist. Random forest (RF), a novel machine learning technique, is ideally suited to the analysis of complex ecological data. RF is an ensemble-learning approach based on regression or classification trees. Instead of building one classification tree (classifier), the RF algorithm builds multiple classifiers using randomly selected subsets of the observations and random subsets of the predictor variables. The predictions from the ensemble of trees are then averaged in the case of regression trees, or tallied using a voting system for classification trees. RF supports flexible modelling strategies and can detect and make use of complex relationships among the variables. RF is among the most accurate of current algorithms and is resistant to overfitting. It also generates an internal unbiased estimate of the generalization error as the forest is built. Potential applications of RF in ecology include classification and regression analysis, survival analysis, variable importance estimation, and data proximities. Proximities can be used for clustering, detecting outliers, multi-dimensional scaling, and unsupervised classification. RF can impute missing values and maintain high accuracy even when a large proportion of the data are missing. RF can handle thousands of input variables without variable exclusion, and it runs efficiently on large databases. RF can also handle a spectrum of response types, including categorical, numeric, ratings, and survival data.
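The ensemble procedure described above (bootstrap samples of the observations, a random subset of predictors at each split, majority voting, and an out-of-bag error estimate) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the depth limit, and the synthetic presence/absence data with hypothetical temperature, precipitation and noise predictors are all assumptions made for illustration.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y, features):
    """Return the (feature, threshold) that most reduces impurity, or None."""
    best, best_score = None, gini(y)
    for f in features:
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score:
                best, best_score = (f, t), score
    return best

def grow_tree(X, y, mtry, rng, depth=0, max_depth=5):
    """Grow one classification tree; each node sees only `mtry` random predictors."""
    majority = Counter(y).most_common(1)[0][0]
    if depth == max_depth or len(set(y)) == 1:
        return majority
    split = best_split(X, y, rng.sample(range(len(X[0])), mtry))
    if split is None:
        return majority
    f, t = split
    left = [i for i, row in enumerate(X) if row[f] <= t]
    right = [i for i, row in enumerate(X) if row[f] > t]
    return (f, t,
            grow_tree([X[i] for i in left], [y[i] for i in left], mtry, rng, depth + 1, max_depth),
            grow_tree([X[i] for i in right], [y[i] for i in right], mtry, rng, depth + 1, max_depth))

def predict_tree(tree, row):
    """Walk from the root to a leaf; internal nodes are tuples, leaves are labels."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

def fit_forest(X, y, n_trees=50, mtry=1, seed=0):
    """Fit `n_trees` trees, each on a bootstrap sample of the observations."""
    rng = random.Random(seed)
    n = len(X)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        tree = grow_tree([X[i] for i in idx], [y[i] for i in idx], mtry, rng)
        forest.append((tree, set(range(n)) - set(idx)))  # remember out-of-bag rows
    return forest

def predict(forest, row):
    """Majority vote over all trees."""
    return Counter(predict_tree(t, row) for t, _ in forest).most_common(1)[0][0]

def oob_error(forest, X, y):
    """Error rate when each row is predicted only by trees that never saw it."""
    wrong = total = 0
    for i, (row, label) in enumerate(zip(X, y)):
        votes = Counter(predict_tree(t, row) for t, oob in forest if i in oob)
        if votes:
            total += 1
            wrong += votes.most_common(1)[0][0] != label
    return wrong / total

# Synthetic presence/absence data (hypothetical): temperature, precipitation, noise.
data_rng = random.Random(42)
X = [[data_rng.uniform(0, 30), data_rng.uniform(200, 2000), data_rng.random()]
     for _ in range(150)]
y = [int(10 < temp < 20 and precip > 800) for temp, precip, _ in X]

forest = fit_forest(X, y, n_trees=25, mtry=2)
err = oob_error(forest, X, y)
print(f"OOB error estimate: {err:.3f}")
print("Predicted class at 15 degrees, 1500 mm:", predict(forest, [15.0, 1500.0, 0.5]))
```

In this sketch `n_trees` and `mtry` play the roles of the number of trees and the number of randomly selected predictors per split; for regression, the vote in `predict` would be replaced by an average of the tree outputs.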
Another advantage of RF is that it requires only two user-defined parameters: the number of trees and the number of randomly selected predictor variables used to split each node. These two parameters should be optimized to improve predictive accuracy. In recent years, RF has been widely used by ecologists to model complex ecological relationships because it is easy to implement and easy to interpret. To understand and use RF, further information about how it is computed is useful. Here, we summarize the basic principle of RF and show how RF handles complex data by modelling the geographical distribution of Yunnan pine (Pinus yunnanensis) in China. RF is a robust and widely used technique in the field of species distribution modelling (SDM), since it meets the basic needs of SDM: simulating species distribution and identifying the main drivers of species distribution. In this work, RF showed high predictive performance in simulating the distribution of Yunnan pine, which was consistent with the multi-dimensional scaling plot showing that the presences could be separated from the absences. We also estimated the relative importance of the predictor variables and produced partial dependence plots for selected predictor variables for the random forest predictions of the presence of Yunnan pine. The main aim of the article is to familiarize the reader with the general concepts, terminology and basic principle behind RF. We believe RF will see further application and development in ecology.
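The relative importance of predictor variables is commonly estimated by permutation: the values of one predictor are shuffled, and the resulting drop in accuracy is taken as that predictor's importance. The sketch below is a simplified illustration under stated assumptions, not the procedure used in the paper. It scores a generic `predict` function on the full data set, whereas RF implementations typically permute within each tree's out-of-bag sample. `toy_model` is a hypothetical stand-in for a fitted forest, and the data are synthetic.

```python
import random

def accuracy(predict, X, y):
    """Fraction of rows for which predict(row) equals the true label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy after shuffling each predictor column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)  # break the link between predictor j and the response
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

def toy_model(row):
    """Hypothetical stand-in for a fitted forest: ignores the third predictor."""
    temp, precip, _noise = row
    return int(10 < temp < 20 and precip > 800)

# Synthetic data (hypothetical): temperature, precipitation, pure noise.
data_rng = random.Random(1)
X = [[data_rng.uniform(0, 30), data_rng.uniform(200, 2000), data_rng.random()]
     for _ in range(300)]
y = [toy_model(row) for row in X]

imp = permutation_importance(toy_model, X, y)
for name, value in zip(["temperature", "precipitation", "noise"], imp):
    print(f"{name}: {value:.3f}")
```

A predictor that the model never uses, such as the noise column here, receives an importance of zero because shuffling it cannot change any prediction.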