Article

"Preconditioning" for feature selection and regression in high-dimensional problems

ANNALS OF STATISTICS (2008)

Journal

ANNALS OF STATISTICS

Volume 36, Issue 4, Pages 1595-1618

Publisher

INST MATHEMATICAL STATISTICS

DOI: 10.1214/009053607000000578

Keywords

model selection; prediction error; lasso

Categories

Statistics & Probability

Abstract

We consider regression problems where the number of predictors greatly exceeds the number of observations. We propose a method for variable selection that first estimates the regression function, yielding a preconditioned response variable. The primary method used for this initial regression is supervised principal components. Then we apply a standard procedure such as forward stepwise selection or the LASSO to the preconditioned response variable. In a number of simulated and real data examples, this two-step procedure outperforms forward stepwise selection or the usual LASSO (applied directly to the raw outcome). We also show that under a certain Gaussian latent variable model, application of the LASSO to the preconditioned response variable is consistent as the number of predictors and observations increases. Moreover, when the observational noise is rather large, the suggested procedure can give a more accurate estimate than LASSO. We illustrate our method on some real problems, including survival analysis with microarray data.
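The two-step procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `supervised_pc_fit` and `lasso_cd`, the fixed screening size `n_keep` (used here in place of a tuned threshold), the hand-written coordinate-descent lasso, the penalty value, and the simulated latent-variable data are all hypothetical choices made for the example.

```python
import numpy as np

def supervised_pc_fit(X, y, n_keep=20, n_comp=1):
    """Step 1 (preconditioning): fitted values from supervised principal components.

    Assumes no column of X is constant and n_keep <= number of columns.
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Univariate screening score for each feature (scaled inner product with y).
    scores = np.abs(Xc.T @ yc) / np.linalg.norm(Xc, axis=0)
    keep = np.argsort(scores)[-n_keep:]          # keep the top-scoring features
    # Leading principal components (left singular vectors) of the screened matrix.
    U, _, _ = np.linalg.svd(Xc[:, keep], full_matrices=False)
    U = U[:, :n_comp]
    # Least squares fit of y on the components; U has orthonormal columns.
    return y.mean() + U @ (U.T @ yc)

def lasso_cd(X, y, lam, n_iter=100):
    """Step 2: coordinate-descent lasso for (1/2n)||y - Xb||^2 + lam * ||b||_1.

    X and y are centered internally, so no intercept is returned.
    """
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    r = y - y.mean()                             # residual for beta = 0
    beta = np.zeros(p)
    col_sq = (Xc ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0:
                continue
            rho = Xc[:, j] @ r / n + col_sq[j] * beta[j]
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]  # soft threshold
            if new != beta[j]:
                r -= Xc[:, j] * (new - beta[j])
                beta[j] = new
    return beta

# Simulated data (assumption): a single latent factor drives y and the first 10 features.
rng = np.random.default_rng(0)
n, p = 60, 200
latent = rng.normal(size=n)
X = rng.normal(size=(n, p))
X[:, :10] += 2.0 * latent[:, None]
y = 3.0 * latent + 3.0 * rng.normal(size=n)      # fairly noisy outcome

y_hat = supervised_pc_fit(X, y)                  # preconditioned response
beta_pre = lasso_cd(X, y_hat, lam=0.2)           # lasso on the preconditioned response
beta_raw = lasso_cd(X, y, lam=0.2)               # usual lasso on the raw outcome
print("selected (preconditioned):", np.flatnonzero(beta_pre))
print("selected (raw outcome):   ", np.flatnonzero(beta_raw))
```

Comparing the two printed index sets gives a rough feel for how replacing the noisy outcome with a denoised fit changes which variables the lasso selects. In practice the screening size, the number of components, and the penalty would be chosen by cross-validation rather than fixed as they are here.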

Recommended

Article Biochemical Research Methods

XGBLC: an improved survival prediction model based on XGBoost

Baoshan Ma, Ge Yan, Bingjie Chai, Xiaoyu Hou

Summary: This study proposed an improved survival prediction model XGBLC based on the XGBoost framework, using Lasso-Cox to enhance the ability to analyze high-dimensional genomic data. Tested on 20 cancer datasets, XGBLC outperforms five state-of-the-art survival methods in terms of C-index, Brier score, and AUC.

BIOINFORMATICS (2022)

Article Plant Sciences

NeuralLasso: Neural Networks Meet Lasso in Genomic Prediction

Boby Mathew, Andreas Hauptmann, Jens Leon, Mikko J. Sillanpää

Summary: Prediction of complex traits based on genome-wide marker information is crucial for animal and plant breeding. Many models have been proposed and efforts are being made to improve their accuracy, considering factors such as additive, dominance, and epistasis effects. In this study, a new algorithm that combines neural networks with LASSO is proposed, which accounts for local epistasis in the prediction. The new method was compared with commonly used prediction methods and showed superior accuracy.

FRONTIERS IN PLANT SCIENCE (2022)

Article Multidisciplinary Sciences

Predicting peritoneal recurrence in gastric cancer with serosal invasion using a pathomics nomogram

Dexin Chen, Jianbo Lai, Jiaxin Cheng, Meiting Fu, Liyan Lin, Feng Chen, Rong Huang, Jun Chen, Jianping Lu, Yuning Chen, Guangyao Huang, Miaojia Yan, Xiaodan Ma, Guoxin Li, Gang Chen, Jun Yan

Summary: Peritoneal recurrence is the most common and lethal type of recurrence in gastric cancer with serosal invasion after surgery. Current evaluation methods are not sufficient for predicting peritoneal recurrence in this type of gastric cancer. Pathomics analyses, consisting of multiple pathomics features extracted from stained images, have shown potential for risk stratification and outcome prediction. A pathomics signature was found to be significantly associated with peritoneal recurrence, and a pathomics nomogram was developed for more accurate prediction.

ISCIENCE (2023)

Article Energy & Fuels

Attributing agnostically detected large reductions in road CO2 emissions to policy mixes

Nicolas Koch, Lennard Naumann, Felix Pretis, Nolan Ritter, Moritz Schwarz

Summary: This study examines the effectiveness of decarbonization policies in the European transport sector by detecting structural breaks in CO2 emissions. The findings suggest that a combination of carbon or fuel taxes with green vehicle incentives is the most successful policy mix, capable of achieving emission reductions that align with the EU zero emission targets.

NATURE ENERGY (2022)

Article Computer Science, Interdisciplinary Applications

Lasso Kriging for efficiently selecting a global trend model

Inseok Park

Summary: Kriging is widely used in engineering fields, with Penalized Blind Kriging (PBK) improving predictive performance by systematically selecting models and penalizing likelihood functions.

STRUCTURAL AND MULTIDISCIPLINARY OPTIMIZATION (2021)

Article Multidisciplinary Sciences

Comparing methods for statistical inference with model uncertainty

Anupreet Porwal, Adrian E. Raftery

Summary: Probability models are widely used in statistical tasks and it is important to choose an appropriate model and consider the uncertainty associated with this choice. This study focuses on variable selection in linear regression models and compares 21 popular methods through simulation studies. The results show that three adaptive Bayesian model averaging (BMA) methods perform the best across all statistical tasks.

PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA (2022)

Article Computer Science, Hardware & Architecture

An Improved Genetic-XGBoost Classifier for Customer Consumption Behavior Prediction

Yue Li, Jianfang Qi, Haibin Jin, Dong Tian, Weisong Mu, Jianying Feng

Summary: In this study, a new classifier for predicting customer consumption behavior is proposed. The classifier utilizes a feature selection method based on Lasso and PCA to efficiently select relevant features and eliminate correlations between variables. An improved genetic-XGBoost algorithm is also used to optimize the prediction accuracy by adjusting XGBoost parameters and preventing the model from falling into local extremum. Experimental results demonstrate the superiority of the proposed methods over existing ones, providing a decision-making basis for enterprises to formulate better marketing strategies.

COMPUTER JOURNAL (2023)

Article Thermodynamics

Identification method of market power abuse of generators based on lasso-logit model in spot market

Bo Sun, Ruilin Deng, Bin Ren, Minmin Teng, Siyuan Cheng, Fan Wang

Summary: The study introduces the Lasso algorithm to improve model performance, successfully achieving accurate identification of market power abuse in the electricity spot market through the construction of indicator systems and model identification methods.

ENERGY (2022)

Article Computer Science, Artificial Intelligence

Prediction of stock price direction using the LASSO-LSTM model combines technical indicators and financial sentiment analysis

Junwen Yang, Yunmin Wang, Xiang Li

Summary: This article proposes a methodology that combines technical analysis and sentiment analysis to predict stock movement. By crawling financial textual content and stock historical transaction data and utilizing transfer learning and the TTR package, emotions are recognized and technical indicators are calculated. The improved LASSO-LSTM model is used for variable selection, and the LASSO-LSTM model shows a significant improvement in accuracy compared to the baseline LSTM model.

PEERJ COMPUTER SCIENCE (2022)

Article Mathematics

Iterative Variable Selection for High-Dimensional Data: Prediction of Pathological Response in Triple-Negative Breast Cancer

Juan C. Laria, M. Carmen Aguilera-Morillo, Enrique Alvarez, Rosa E. Lillo, Sara Lopez-Taruella, Maria del Monte-Millan, Antonio C. Picornell, Miguel Martin, Juan Romo

Summary: This paper introduces a methodology to deal with variable selection and model estimation problems in a high-dimensional set-up, which can be particularly useful in the whole genome context.

MATHEMATICS (2021)

Article Computer Science, Interdisciplinary Applications

An efficient correlation based adaptive LASSO regression method for air quality index prediction

Jasleen Kaur Sethi, Mamta Mittal

Summary: This research investigates the effectiveness of a feature selection method based on LASSO for predicting air quality in Delhi and surrounding cities, identifying meteorological factors and pollutant concentrations as the most important influencing factors, and suggesting preventive measures to improve air quality.

EARTH SCIENCE INFORMATICS (2021)

Article Biochemical Research Methods

Fast and interpretable genomic data analysis using multiple approximate kernel learning

Ayyuce Begum Bektas, Cigdem Ak, Mehmet Gonen

Summary: With the increasing sizes of computational biology datasets, previous kernel-based machine learning algorithms have failed to provide satisfactory interpretability. To address this issue, we propose a fast and efficient multiple kernel learning algorithm that can extract significant information from genomic data. Our experiments demonstrate that the algorithm outperforms baseline methods while using only a small fraction of input features, and it has the potential to discover new biomarkers and therapeutic guidelines.

BIOINFORMATICS (2022)

Article Business, Finance

A study of cross-industry return predictability in the Chinese stock market

Michael Ellington, Michalis P. Stamatogiannis, Yawen Zheng

Summary: This study investigates the predictability of cross-industry returns for the Shanghai and Shenzhen stock exchanges by constructing portfolios from different industries. The research findings show that the returns of the Oil, Telecommunications, and Finance industries are significant predictors for other industries. The machine learning methods used in the study outperform various benchmarks in the out-of-sample forecasting exercise, with an average annual excess return of 13%.

INTERNATIONAL REVIEW OF FINANCIAL ANALYSIS (2022)

Article Mathematical & Computational Biology

Accurate Prediction of Children's ADHD Severity Using Family Burden Information: A Neural Lasso Approach

Juan C. Laria, David Delgado-Gomez, Inmaculada Penuelas-Calvo, Enrique Baca-Garcia, Rosa E. Lillo

Summary: The deep lasso algorithm, dlasso, is a neural version of the statistical linear lasso algorithm that combines feature selection and automatic parameter optimization, showing superior performance in small sample feature selection. It outperforms the traditional lasso in predictive error and variable selection. With dlasso, it is possible to predict the severity of symptoms in children with ADHD based on scales measuring family burden, family functioning, parental satisfaction, and parental mental health.

FRONTIERS IN COMPUTATIONAL NEUROSCIENCE (2021)

Article Mathematics

Group Feature Screening Based on Information Gain Ratio for Ultrahigh-Dimensional Data

Zhongzheng Wang, Guangming Deng, Jianqi Yu

Summary: The proposed group screening procedure based on the information gain ratio for a classification model is shown to have better screening performance and classification accuracy.

JOURNAL OF MATHEMATICS (2022)

Correction Mathematical & Computational Biology

Fast Lasso method for large-scale and ultrahigh-dimensional Cox model with applications to UK Biobank (Sept, 10.1093/biostatistics/kxaa038, 2020)

Ruilin Li, Christopher Chang, Johanne M. Justesen, Yosuke Tanigawa, Junyang Qian, Trevor Hastie, Manuel A. Rivas, Robert Tibshirani

BIOSTATISTICS (2022)

Article Statistics & Probability

BACKFITTING FOR LARGE SCALE CROSSED RANDOM EFFECTS REGRESSIONS

Swarnadip Ghosh, Trevor Hastie, Art B. Owen

Summary: This paper presents a computationally efficient algorithm for regression models with crossed random effect errors. The proposed algorithm has lower cost and more flexible conditions compared to other methods, and it is validated through empirical analysis.

ANNALS OF STATISTICS (2022)

Article Statistics & Probability

Canonical correlation analysis in high dimensions with structured regularization

Elena Tuzhilina, Leonardo Tozzi, Trevor Hastie

Summary: Canonical correlation analysis (CCA) is a technique for measuring the association between two multivariate data matrices. Regularized modification of CCA (RCCA) is widely used for high-dimensional data but may disregard data structure. This article introduces several approaches to regularizing CCA that consider the underlying data structure and demonstrates strategies for avoiding excessive computations in high dimensions.

STATISTICAL MODELLING (2023)

Article Statistics & Probability

SURPRISES IN HIGH-DIMENSIONAL RIDGELESS LEAST SQUARES INTERPOLATION

Trevor Hastie, Andrea Montanari, Saharon Rosset, Ryan J. Tibshirani

Summary: This paper studies minimum ℓ2-norm interpolation least squares regression in the high-dimensional regime, focusing on linear and nonlinear models. The study discovers the phenomena of double descent behavior in prediction risk and potential benefits of overparametrization.

ANNALS OF STATISTICS (2022)

Article Computer Science, Interdisciplinary Applications

Multiclass-penalized logistic regression

Didier Nibbering, Trevor J. Hastie

Summary: This study introduces a multinomial logistic regression model that penalizes the number of class-specific parameters, showing improved performance in both in-sample and out-of-sample situations compared to a standard model. The model clusters parameters by penalizing differences between class-specific parameter vectors, providing interpretable parameter estimates.

COMPUTATIONAL STATISTICS & DATA ANALYSIS (2022)

Article Genetics & Heredity

Significant sparse polygenic risk scores across 813 traits in UK Biobank

Yosuke Tanigawa, Junyang Qian, Guhan Venkataraman, Johanne Marie Justesen, Ruilin Li, Robert Tibshirani, Trevor Hastie, Manuel A. Rivas

Summary: We conducted a systematic assessment of polygenic risk score (PRS) prediction for over 1,500 traits using genetic and phenotype data from the UK Biobank. We found that sparse PRS models showed significant incremental predictive performance and that the number of genetic variants selected in the model correlated with predictive performance. However, the transferability of sparse PRS models trained on European individuals to non-European individuals in the UK Biobank was limited.

PLOS GENETICS (2022)

Article Statistics & Probability

FEATURE-WEIGHTED ELASTIC NET: USING FEATURES OF FEATURES FOR BETTER PREDICTION

J. Kenneth Tay, Nima Aghaeepour, Trevor Hastie, Robert Tibshirani

Summary: In some supervised learning settings, practitioners may have additional information on prediction features. Our proposed method, called the feature-weighted elastic net (fwelnet), uses this information to improve prediction by adjusting penalties on feature coefficients in the elastic net penalty. In simulations, fwelnet outperforms the lasso in terms of test mean squared error and often improves true positive or false positive rates for feature selection. Comparison with other methods reveals fwelnet's superiority, and its application to early prediction of preeclampsia shows improved performance compared to the lasso.

STATISTICA SINICA (2023)

Article Multidisciplinary Sciences

A tissue atlas of ulcerative colitis revealing evidence of sex-dependent differences in disease-driving inflammatory cell types and resistance to TNF inhibitor therapy

Aaron T. Mayer, Derek R. Holman, Anav Sood, Utkarsh Tandon, Salil S. Bhate, Sunil Bodapati, Graham L. Barlow, Jeff Chang, Sarah Black, Erica C. Crenshaw, Alexander N. Koron, Sarah E. Streett, Sanjiv S. Gambhir, William J. Sandborn, Brigid S. Boland, Trevor Hastie, Robert Tibshirani, John T. Chang, Garry P. Nolan, Christian M. Schuerch, Stephan Rogalla

Summary: This study uses CODEX technology to create a tissue atlas of inflammation in UC patients and healthy individuals. The analysis reveals the association between cellular functional states and cellular neighborhoods, as well as the presence of resistant niches in UC patients with TNFi treatment. Additionally, the study explores the use of CNNs in predicting patient clinical variables and provides guidelines for reporting predictions in similar datasets.

SCIENCE ADVANCES (2023)

Article Environmental Sciences

Comparing spatial patterns of marine vessels between vessel-tracking data and satellite imagery

Shinnosuke Nakayama, WenXin Dong, Richard G. Correro, Elizabeth R. Selig, Colette C. Wabnitz, Trevor J. Hastie, Jim Leape, Serena Yeung, Fiorenza Micheli

Summary: Monitoring marine vessel activities is crucial but challenging, especially with limited capacity and resources. Satellite imagery offers a promising solution to observe vessel activities not captured by publicly available tracking data. However, the lack of understanding on its complementarity with existing data hampers its broader use.

FRONTIERS IN MARINE SCIENCE (2023)

Article Statistics & Probability

Cross-Validation: What Does It Estimate and How Well Does It Do It?

Stephen Bates, Trevor Hastie, Robert Tibshirani

Summary: Cross-validation is a widely used technique for estimating prediction error, but its behavior is not fully understood. It does not estimate the prediction error of the model trained on the data used for cross-validation, but rather the average prediction error of models trained on unseen data from the same population. The standard confidence intervals derived from cross-validation may have lower coverage than desired, due to correlations among the measured accuracies within each fold. A nested cross-validation scheme is introduced to estimate variance more accurately and improve coverage of confidence intervals.

JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION (2023)

Article Mathematics, Interdisciplinary Applications

Reorienting Latent Variable Modeling for Supervised Learning

Booil Jo, Trevor J. Hastie, Zetan Li, Eric A. Youngstrom, Robert L. Findling, Sarah McCue Horwitz

Summary: This study proposes a method for integrating latent variable (LV) modeling into supervised learning. By combining the traditions of LV modeling, psychometrics, and supervised learning, practical prediction targets can be generated and systematically validated based on clinical validators. The feasibility of this integrated approach is demonstrated using data from the LAMS Study.

MULTIVARIATE BEHAVIORAL RESEARCH (2023)

Article Statistics & Probability

Modeling Longitudinal Data Using Matrix Completion

Lukasz Kidzinski, Trevor Hastie

Summary: In clinical practice and biomedical research, it is common to collect sparse and irregularly sampled time-series data, because frequent measurement can be costly and inconvenient. Traditional analysis methods, such as mixed-effect models, Gaussian processes, and functional data analysis, rely on probabilistic assumptions, require careful implementation, and tend to be slow. In this study, we propose a novel framework based on matrix completion for analyzing longitudinal data. By iteratively applying Singular Value Decomposition, our method can estimate progression curves efficiently and easily, and it can be extended to other settings. We applied this method to study the motor impairment progression in children with Cerebral Palsy, and achieved good approximations of individual progression curves and the ability to identify different progression trends in subtypes of Cerebral Palsy.

JOURNAL OF COMPUTATIONAL AND GRAPHICAL STATISTICS (2023)

Article Medicine, General & Internal

Defining Usual Oral Temperature Ranges in Outpatients Using an Unsupervised Learning Algorithm

Catherine Ley, Frederik Heath, Trevor Hastie, Zijun Gao, Myroslava Protsiv, Julie Parsonnet

Summary: This cross-sectional study determines the normal oral temperature ranges based on age, sex, height, weight, and time of day by analyzing a large number of clinical visit records. The findings have important implications for temperature assessment and disease diagnosis in clinical medicine.

JAMA INTERNAL MEDICINE (2023)

Article Computer Science, Interdisciplinary Applications

Elastic Net Regularization Paths for All Generalized Linear Models

J. Kenneth Tay, Balasubramanian Narasimhan, Trevor Hastie

Summary: The lasso and elastic net are popular regularized regression models for supervised learning. Friedman, Hastie, and Tibshirani (2010) introduced a computationally efficient algorithm for computing the elastic net regularization path for various regression models, while Simon, Friedman, Hastie, and Tibshirani (2011) extended this work to Cox models. In this paper, the authors further extend the reach of the elastic net-regularized regression to all generalized linear model families, Cox models with right-censored data, and a simplified version of the relaxed lasso, and also discuss convenient utility functions for measuring the performance of these fitted models.

JOURNAL OF STATISTICAL SOFTWARE (2023)

Article Automation & Control Systems

LinCDE: Conditional Density Estimation via Lindsey's Method

Zijun Gao, Trevor Hastie

Summary: In this paper, we propose a conditional density estimator (LinCDE) based on gradient boosting and Lindsey's method. LinCDE allows flexible modeling of density family and captures distributional characteristics. It produces smooth and non-negative density estimates.

JOURNAL OF MACHINE LEARNING RESEARCH (2022)
