4.6 Article

Nonparametric Variable Selection: The EARTH Algorithm

Journal

JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION
Volume 103, Issue 484, Pages 1609-1620

Publisher

TAYLOR & FRANCIS INC
DOI: 10.1198/016214508000000878

Keywords

Adaptive bandwidth selection; Co-plots; Efficacy; GUIDE; Importance scores; Local polynomial regression; MARS; Prediction; Random forest

Funding

  1. National Science Foundation [DMS-0505651, DMS-0604931]

We consider regression experiments involving a response variable Y and a large number of predictor variables X_1, ..., X_d, many of which may be irrelevant for the prediction of Y and thus must be removed before Y can be predicted from the X's. We consider two procedures that select variables by using importance scores that measure the strength of the relationship between predictor variables and a response and keep those variables whose importance scores exceed a threshold. In the first of these procedures, scores are obtained by randomly drawing subregions (tubes) of the predictor space that constrain all but one predictor and, in each subregion, computing a signal-to-noise ratio (efficacy) based on a nonparametric univariate regression of Y on the unconstrained variable. The subregions are adapted to boost weak variables iteratively by searching (hunting) for the subregions in which the efficacy is maximized. The efficacy can be viewed as an approximation to a one-to-one function of the probability of identifying features. By using importance scores based on averages of maximized efficacies, we develop a variable selection algorithm called EARTH (efficacy adaptive regression tube hunting) based on examining the conditional expectation of the response given all but one of the predictor variables for a collection of randomly, adaptively, and iteratively selected regions. The second importance score method (RFVS) is based on using random forest importance values to select variables. Computer simulations show that EARTH and RFVS are successful variable selection methods compared with other procedures in nonparametric situations with a large number of irrelevant predictor variables, and that when each is combined with the model selection and prediction procedure MARS, the tree-based prediction procedure GUIDE, and the random forest method, the combinations lead to improved prediction accuracy for certain models with many irrelevant variables. We give conditions under which a version of the EARTH algorithm selects the correct model with probability tending to 1 as the sample size n tends to infinity, even if d -> infinity as n -> infinity. We include the analysis of a real data set in which we show how a training set can be used to find a threshold for the EARTH importance scores.
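The tube-based importance score described in the abstract can be illustrated with a rough sketch. This is not the authors' EARTH implementation: it omits the adaptive, iterative hunting step, and it swaps the local polynomial regression for a simple binned-means smoother. The function names (`binned_snr`, `tube_importance`, `select_variables`, `make_demo_data`), the tube size, the number of bins, and the threshold value are all assumptions made for illustration only.

```python
import math
import random
import statistics


def binned_snr(x, y, n_bins=4):
    """Efficacy stand-in (an assumption, not the paper's definition):
    variance of bin means divided by the estimated variance of one bin mean."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    size = len(order) // n_bins
    bins = [[y[i] for i in order[b * size:(b + 1) * size]] for b in range(n_bins)]
    means = [statistics.fmean(b) for b in bins]
    noise = statistics.fmean([statistics.pvariance(b) for b in bins])
    return statistics.pvariance(means) / (noise / size + 1e-12)


def tube_importance(X, y, j, n_tubes=20, tube_size=80, rng=None):
    """Average the efficacy stand-in over random tubes that constrain
    every predictor except x_j (no adaptive hunting in this sketch)."""
    rng = rng or random.Random(0)
    others = [k for k in range(len(X[0])) if k != j]
    scores = []
    for _ in range(n_tubes):
        center = X[rng.randrange(len(X))]
        # Max-norm distance to the tube center over the constrained coordinates.
        def dist(row):
            return max((abs(row[k] - center[k]) for k in others), default=0.0)
        idx = sorted(range(len(X)), key=lambda i: dist(X[i]))[:tube_size]
        scores.append(binned_snr([X[i][j] for i in idx], [y[i] for i in idx]))
    return statistics.fmean(scores)


def select_variables(X, y, threshold):
    """Keep the variables whose importance score exceeds the threshold."""
    scores = [tube_importance(X, y, j) for j in range(len(X[0]))]
    return [j for j, s in enumerate(scores) if s > threshold], scores


def make_demo_data(n=400, d=5, seed=1):
    """Hypothetical test data: only x_0 is relevant to the response."""
    rng = random.Random(seed)
    X = [[rng.random() for _ in range(d)] for _ in range(n)]
    y = [3.0 * math.sin(2.0 * math.pi * row[0]) + rng.gauss(0.0, 1.0) for row in X]
    return X, y


if __name__ == "__main__":
    X, y = make_demo_data()
    selected, scores = select_variables(X, y, threshold=5.0)
    print("selected:", selected)
    print("scores:", [round(s, 2) for s in scores])
```

In this sketch the threshold is a fixed constant chosen by hand; the abstract instead describes using a training set to find a threshold for the importance scores.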
