4.6 Article

"Preconditioning" for feature selection and regression in high-dimensional problems

Journal

ANNALS OF STATISTICS
Volume 36, Issue 4, Pages 1595-1618

Publisher

INST MATHEMATICAL STATISTICS
DOI: 10.1214/009053607000000578

Keywords

model selection; prediction error; lasso

We consider regression problems where the number of predictors greatly exceeds the number of observations. We propose a method for variable selection that first estimates the regression function, yielding a preconditioned response variable. The primary method used for this initial regression is supervised principal components. Then we apply a standard procedure such as forward stepwise selection or the LASSO to the preconditioned response variable. In a number of simulated and real data examples, this two-step procedure outperforms forward stepwise selection or the usual LASSO (applied directly to the raw outcome). We also show that under a certain Gaussian latent variable model, application of the LASSO to the preconditioned response variable is consistent as the number of predictors and observations increases. Moreover, when the observational noise is rather large, the suggested procedure can give a more accurate estimate than LASSO. We illustrate our method on some real problems, including survival analysis with microarray data.
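The two-step procedure described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the supervised principal components step is simplified to a fixed number of screened features (`n_keep`, a hypothetical parameter; a screening threshold would normally be tuned, for example by cross-validation), only the first principal component is used, and a small hand-written coordinate-descent lasso stands in for any standard lasso solver. All function names and the penalty value `lam` are illustrative.

```python
import numpy as np

def supervised_pc_fit(X, y, n_keep=20):
    """Step 1 (simplified supervised principal components, an assumption):
    keep the n_keep features most correlated with y, then regress y on the
    first principal component of the screened matrix."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Univariate scores: absolute correlation of each feature with y.
    denom = np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc) + 1e-12
    scores = np.abs(Xc.T @ yc) / denom
    keep = np.argsort(scores)[-n_keep:]
    # First left singular vector = unit-norm first principal component scores.
    U, _, _ = np.linalg.svd(Xc[:, keep], full_matrices=False)
    u1 = U[:, 0]
    # Least-squares fit of y on u1 gives the preconditioned response.
    return y.mean() + u1 * (u1 @ yc)

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=100):
    """Lasso by cyclic coordinate descent on centered data, minimizing
    (1/(2n)) * ||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    resid = y - y.mean()
    col_sq = (Xc ** 2).sum(axis=0) / n
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0:
                continue
            # Correlation of feature j with the partial residual.
            rho = Xc[:, j] @ resid / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            if new != beta[j]:
                resid -= Xc[:, j] * (new - beta[j])
                beta[j] = new
    return beta

def preconditioned_lasso(X, y, lam, n_keep=20):
    """Step 2: run the lasso on the preconditioned response, not on raw y."""
    y_tilde = supervised_pc_fit(X, y, n_keep)
    return lasso_cd(X, y_tilde, lam)

# Synthetic example (hypothetical data): a latent factor drives the first
# 10 of 200 predictors and the noisy outcome, with only 50 observations.
rng = np.random.default_rng(0)
n, p = 50, 200
latent = rng.normal(size=n)
X = rng.normal(size=(n, p))
X[:, :10] += latent[:, None]
y = latent + rng.normal(scale=2.0, size=n)

beta = preconditioned_lasso(X, y, lam=0.05)
print("selected features:", np.flatnonzero(beta))
```

The design point is that the lasso sees a denoised target: the preconditioned response depends on the outcome only through the fitted latent component, which is the sense in which the abstract expects the procedure to help when observational noise is large.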

Recommended

Correction Mathematical & Computational Biology

Fast Lasso method for large-scale and ultrahigh-dimensional Cox model with applications to UK Biobank (Sept, 10.1093/biostatistics/kxaa038, 2020)

Ruilin Li, Christopher Chang, Johanne M. Justesen, Yosuke Tanigawa, Junyang Qian, Trevor Hastie, Manuel A. Rivas, Robert Tibshirani

BIOSTATISTICS (2022)

Article Statistics & Probability

BACKFITTING FOR LARGE SCALE CROSSED RANDOM EFFECTS REGRESSIONS

Swarnadip Ghosh, Trevor Hastie, Art B. Owen

Summary: This paper presents a computationally efficient algorithm for regression models with crossed random effect errors. The proposed algorithm has lower cost and more flexible conditions compared to other methods, and it is validated through empirical analysis.

ANNALS OF STATISTICS (2022)

Article Statistics & Probability

Canonical correlation analysis in high dimensions with structured regularization

Elena Tuzhilina, Leonardo Tozzi, Trevor Hastie

Summary: Canonical correlation analysis (CCA) is a technique for measuring the association between two multivariate data matrices. Regularized modification of CCA (RCCA) is widely used for high-dimensional data but may disregard data structure. This article introduces several approaches to regularizing CCA that consider the underlying data structure and demonstrates strategies for avoiding excessive computations in high dimensions.

STATISTICAL MODELLING (2023)

Article Statistics & Probability

SURPRISES IN HIGH-DIMENSIONAL RIDGELESS LEAST SQUARES INTERPOLATION

Trevor Hastie, Andrea Montanari, Saharon Rosset, Ryan J. Tibshirani

Summary: This paper studies minimum ℓ2-norm interpolating least squares regression in the high-dimensional regime, focusing on linear and nonlinear models. The study identifies double descent behavior in the prediction risk and potential benefits of overparametrization.

ANNALS OF STATISTICS (2022)

Article Computer Science, Interdisciplinary Applications

Multiclass-penalized logistic regression

Didier Nibbering, Trevor J. Hastie

Summary: This study introduces a multinomial logistic regression model that penalizes the number of class-specific parameters, showing improved performance in both in-sample and out-of-sample situations compared to a standard model. The model clusters parameters by penalizing differences between class-specific parameter vectors, providing interpretable parameter estimates.

COMPUTATIONAL STATISTICS & DATA ANALYSIS (2022)

Article Genetics & Heredity

Significant sparse polygenic risk scores across 813 traits in UK Biobank

Yosuke Tanigawa, Junyang Qian, Guhan Venkataraman, Johanne Marie Justesen, Ruilin Li, Robert Tibshirani, Trevor Hastie, Manuel A. Rivas

Summary: We conducted a systematic assessment of polygenic risk score (PRS) prediction for over 1,500 traits using genetic and phenotype data from the UK Biobank. We found that sparse PRS models showed significant incremental predictive performance and that the number of genetic variants selected in the model correlated with predictive performance. However, the transferability of sparse PRS models trained on European individuals to non-European individuals in the UK Biobank was limited.

PLOS GENETICS (2022)

Article Statistics & Probability

FEATURE-WEIGHTED ELASTIC NET: USING FEATURES OF FEATURES FOR BETTER PREDICTION

J. Kenneth Tay, Nima Aghaeepour, Trevor Hastie, Robert Tibshirani

Summary: In some supervised learning settings, practitioners may have additional information on prediction features. Our proposed method, called the feature-weighted elastic net (fwelnet), uses this information to improve prediction by adjusting penalties on feature coefficients in the elastic net penalty. In simulations, fwelnet outperforms the lasso in terms of test mean squared error and often improves true positive or false positive rates for feature selection. Comparison with other methods reveals fwelnet's superiority, and its application to early prediction of preeclampsia shows improved performance compared to the lasso.

STATISTICA SINICA (2023)

Article Multidisciplinary Sciences

A tissue atlas of ulcerative colitis revealing evidence of sex-dependent differences in disease-driving inflammatory cell types and resistance to TNF inhibitor therapy

Aaron T. Mayer, Derek R. Holman, Anav Sood, Utkarsh Tandon, Salil S. Bhate, Sunil Bodapati, Graham L. Barlow, Jeff Chang, Sarah Black, Erica C. Crenshaw, Alexander N. Koron, Sarah E. Streett, Sanjiv S. Gambhir, William J. Sandborn, Brigid S. Boland, Trevor Hastie, Robert Tibshirani, John T. Chang, Garry P. Nolan, Christian M. Schuerch, Stephan Rogalla

Summary: This study uses CODEX technology to create a tissue atlas of inflammation in UC patients and healthy individuals. The analysis reveals the association between cellular functional states and cellular neighborhoods, as well as the presence of resistant niches in UC patients with TNFi treatment. Additionally, the study explores the use of CNNs in predicting patient clinical variables and provides guidelines for reporting predictions in similar datasets.

SCIENCE ADVANCES (2023)

Article Environmental Sciences

Comparing spatial patterns of marine vessels between vessel-tracking data and satellite imagery

Shinnosuke Nakayama, WenXin Dong, Richard G. Correro, Elizabeth R. Selig, Colette C. C. Wabnitz, Trevor J. Hastie, Jim Leape, Serena Yeung, Fiorenza Micheli

Summary: Monitoring marine vessel activities is crucial but challenging, especially with limited capacity and resources. Satellite imagery offers a promising way to observe vessel activities not captured by publicly available tracking data. However, a lack of understanding of how it complements existing data hampers its broader use.

FRONTIERS IN MARINE SCIENCE (2023)

Article Statistics & Probability

Cross-Validation: What Does It Estimate and How Well Does It Do It?

Stephen Bates, Trevor Hastie, Robert Tibshirani

Summary: Cross-validation is a widely used technique for estimating prediction error, but its behavior is not fully understood. It does not estimate the prediction error of the model trained on the data used for cross-validation, but rather the average prediction error of models trained on unseen data from the same population. The standard confidence intervals derived from cross-validation may have lower coverage than desired, due to correlations among the measured accuracies within each fold. A nested cross-validation scheme is introduced to estimate variance more accurately and improve coverage of confidence intervals.

JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION (2023)

Article Mathematics, Interdisciplinary Applications

Reorienting Latent Variable Modeling for Supervised Learning

Booil Jo, Trevor J. Hastie, Zetan Li, Eric A. Youngstrom, Robert L. Findling, Sarah McCue Horwitz

Summary: This study proposes a method for integrating latent variable (LV) modeling into supervised learning. By combining the traditions of LV modeling, psychometrics, and supervised learning, practical prediction targets can be generated and systematically validated based on clinical validators. The feasibility of this integrated approach is demonstrated using data from the LAMS Study.

MULTIVARIATE BEHAVIORAL RESEARCH (2023)

Article Statistics & Probability

Modeling Longitudinal Data Using Matrix Completion

Lukasz Kidzinski, Trevor Hastie

Summary: In clinical practice and biomedical research, it is common to collect sparse and irregularly sampled time-series data, because denser collection can be costly and inconvenient. Traditional analysis methods, such as mixed-effect models, Gaussian processes, and functional data analysis, rely on probabilistic assumptions, require careful implementation, and tend to be slow. In this study, we propose a novel framework based on matrix completion for analyzing longitudinal data. By iteratively applying singular value decomposition, our method can estimate progression curves efficiently and easily, and it can be extended to other settings. We applied this method to study the progression of motor impairment in children with cerebral palsy, and achieved good approximations of individual progression curves and the ability to identify different progression trends in subtypes of cerebral palsy.

JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS (2023)

Article Medicine, General & Internal

Defining Usual Oral Temperature Ranges in Outpatients Using an Unsupervised Learning Algorithm

Catherine Ley, Frederik Heath, Trevor Hastie, Zijun Gao, Myroslava Protsiv, Julie Parsonnet

Summary: This cross-sectional study determines the normal oral temperature ranges based on age, sex, height, weight, and time of day by analyzing a large number of clinical visit records. The findings have important implications for temperature assessment and disease diagnosis in clinical medicine.

JAMA INTERNAL MEDICINE (2023)

Article Computer Science, Interdisciplinary Applications

Elastic Net Regularization Paths for All Generalized Linear Models

J. Kenneth Tay, Balasubramanian Narasimhan, Trevor Hastie

Summary: The lasso and elastic net are popular regularized regression models for supervised learning. Friedman, Hastie, and Tibshirani (2010) introduced a computationally efficient algorithm for computing the elastic net regularization path for various regression models, while Simon, Friedman, Hastie, and Tibshirani (2011) extended this work to Cox models. In this paper, the authors further extend the reach of the elastic net-regularized regression to all generalized linear model families, Cox models with right-censored data, and a simplified version of the relaxed lasso, and also discuss convenient utility functions for measuring the performance of these fitted models.

JOURNAL OF STATISTICAL SOFTWARE (2023)

Article Automation & Control Systems

LinCDE: Conditional Density Estimation via Lindsey's Method

Zijun Gao, Trevor Hastie

Summary: In this paper, we propose a conditional density estimator (LinCDE) based on gradient boosting and Lindsey's method. LinCDE allows flexible modeling of density family and captures distributional characteristics. It produces smooth and non-negative density estimates.

JOURNAL OF MACHINE LEARNING RESEARCH (2022)
