Correcting the Optimal Resampling-Based Error Rate by Estimating the Error Rate of Wrapper Algorithms

BIOMETRICS (2013)

Journal

BIOMETRICS

Volume 69, Issue 3, Pages 693-702

Publisher

WILEY

DOI: 10.1111/biom.12041

Keywords

Classification; High-dimensional data; Method selection bias; Repeated subsampling; Tuning bias

Categories

Biology; Mathematical & Computational Biology; Statistics & Probability

Funding

LMU-innovativ Project BioMed-S
German Research Foundation (DFG) [BO3139/2-1, BO3139/2-2]

Abstract

High-dimensional binary classification tasks, for example the classification of microarray samples into normal and cancer tissues, usually involve a tuning parameter. Reporting only the performance of the best tuning parameter value yields over-optimistic prediction error estimates. To correct this tuning bias, we develop a new method based on a decomposition of the unconditional error rate that involves the tuning procedure; that is, we estimate the error rate of wrapper algorithms as introduced in the context of internal cross-validation (ICV) by Varma and Simon (2006, BMC Bioinformatics 7, 91). Our subsampling-based estimator can be written as a weighted mean of the errors obtained using the different tuning parameter values, and can thus be interpreted as a smooth version of ICV, which is the standard approach for avoiding tuning bias. In contrast to ICV, our method guarantees intuitive bounds for the corrected error. Additionally, we suggest using bias correction methods to address the conceptually similar method selection bias that results from the optimal choice of the classification method itself when several methods are evaluated successively. We demonstrate the performance of our method on microarray and simulated data and compare it to ICV. This study suggests that our approach yields competitive estimates at a much lower computational price.
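
The weighted-mean idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' estimator: the abstract does not say how the weights are defined, so the sketch assumes, purely for illustration, that each tuning value is weighted by how often it attains the lowest error across subsampling iterations. The function names `naive_min_error` and `weighted_mean_error` and the simulated error matrix are hypothetical.

```python
import numpy as np


def naive_min_error(errors):
    """Over-optimistic estimate: mean error of the best tuning value only."""
    return float(errors.mean(axis=0).min())


def weighted_mean_error(errors):
    """Weighted mean of the per-value mean errors.

    errors has shape (n_iterations, n_tuning_values); entry [b, k] is the
    test error of tuning value k in subsampling iteration b.
    ASSUMPTION: the weight of value k is the share of iterations in which
    k has the lowest error. The paper's actual weights may differ.
    """
    n_iter, n_values = errors.shape
    winners = errors.argmin(axis=1)  # best tuning value per iteration
    weights = np.bincount(winners, minlength=n_values) / n_iter
    mean_errors = errors.mean(axis=0)
    return float(weights @ mean_errors), weights


# Simulated error rates: 50 subsampling iterations, 5 tuning values.
rng = np.random.default_rng(0)
errors = rng.uniform(0.1, 0.4, size=(50, 5))

naive = naive_min_error(errors)
corrected, weights = weighted_mean_error(errors)
print(f"naive minimum: {naive:.3f}, weighted mean: {corrected:.3f}")
```

Because the weights are non-negative and sum to one, the result of this sketch cannot fall below the smallest or rise above the largest per-value mean error, which is one way to read the "intuitive bounds" mentioned in the abstract.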

Recommended

Article Computer Science, Artificial Intelligence

An efficient multivariate feature ranking method for gene selection in high-dimensional microarray data

Junghye Lee, In Young Choi, Chi-Hyuck Jun

Summary: Classification of microarray data is crucial for cancer diagnosis and prediction, but the high dimensionality could pose challenges.

EXPERT SYSTEMS WITH APPLICATIONS (2021)

Article Computer Science, Artificial Intelligence

CSCIM_FS: Cosine similarity coefficient and information measurement criterion-based feature selection method for high-dimensional data

Gaoteng Yuan, Yi Zhai, Jiansong Tang, Xiaofeng Zhou

Summary: This paper proposes a feature selection algorithm based on cosine similarity coefficient and information measurement criterion (CSCIM_FS). The algorithm calculates the mutual information (MI) between features and tags, and sorts the features according to the calculated MI. It constructs a feature matrix to transform the one-dimensional feature sequence into a two-dimensional square matrix. The experimental results show that the CSCIM_FS algorithm selected a feature subset with high accuracy and outperforms other algorithms.

NEUROCOMPUTING (2023)

Article Engineering, Electrical & Electronic

A classification method for high-dimensional imbalanced multi-classification data

Mengmeng Li, Qibin Zheng, Yi Liu, Gengsong Li, Wei Qin, Xiaoguang Ren

Summary: This paper proposes an evolutionary algorithm-based classification method, HIMALO, for high-dimensional imbalanced multi-classification problems. It introduces a new individual initialization strategy and a multi-classification strategy, and experiments demonstrate its superior classification performance and stability.

ELECTRONICS LETTERS (2023)

Article Automation & Control Systems

A hybrid feature selection scheme for high-dimensional data

Mohammad Ahmadi Ganjei, Reza Boostani

Summary: In this paper, a new hybrid feature selection approach that combines filter and wrapper methods is proposed. By ranking, clustering, and searching the features, this method achieves better performance on high-dimensional datasets.

ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE (2022)

Article Engineering, Electrical & Electronic

An evolutionary computation classification method for high-dimensional mixed missing variables data

Mengmeng Li, Yi Liu, Qibin Zheng, Gengsong Li, Wei Qin

Summary: This paper introduces a novel data imputation algorithm, PSOHM, which utilizes particle swarm optimization to impute both continuous and discrete features in high-dimensional mixed missing variables data. The algorithm outperforms traditional methods in terms of classification performance on various datasets.

ELECTRONICS LETTERS (2023)

Article Physics, Multidisciplinary

Far from Asymptopia: Unbiased High-Dimensional Inference Cannot Assume Unlimited Data

Michael C. Abbott, Benjamin B. Machta

Summary: Inference from limited data requires a measure on parameter space, which is most explicit in the Bayesian framework as a prior distribution. However, the well-known Jeffreys prior leads to significant bias in high-dimensional models because the effective dimensionality of models in science is usually smaller than the number of microscopic parameters. A principled choice of measure that focuses on relevant parameters can avoid this issue and lead to unbiased posteriors. This optimal prior depends on the quantity of data and approaches Jeffreys prior in the asymptotic limit, but justifying this limit requires an impractically large increase in data quantity for typical models.

ENTROPY (2023)

Article Automation & Control Systems

A generalized stability estimator based on inter-intrastability of subsets for high-dimensional feature selection

Abdul Wahid, Dost Muhammad Khan, Nadeem Iqbal, Hammad Tariq Janjuhah, Sajjad Ahmad Khan

Summary: Feature selection is crucial in high-dimensional regression and classification problems. This paper introduces a novel stability estimator to measure the internal and external stability of feature subsets chosen by different methods. Experimental results validate the usefulness of the proposed stability estimator.

CHEMOMETRICS AND INTELLIGENT LABORATORY SYSTEMS (2022)

Article Computer Science, Artificial Intelligence

A Sequential Addressing Subsampling Method for Massive Data Analysis Under Memory Constraint

Rui Pan, Yingqiu Zhu, Baishan Guo, Xuening Zhu, Hansheng Wang

Summary: The emergence of massive data brings challenges to statistical inference. New sampling techniques are needed to sample data from a hard drive. In this paper, a sequential addressing subsampling (SAS) method is proposed that samples data directly from the hard drive. The SAS method is time saving compared to the random addressing subsampling (RAS) method, and its estimators are studied and tested through simulation studies and comparison with RAS method.

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING (2023)

Article Computer Science, Artificial Intelligence

An efficient feature selection method based on improved elephant herding optimization to classify high-dimensional biomedical data

Harpreet Singh, Birmohan Singh, Manpreet Kaur

Summary: This study proposes an efficient feature selection and parameter optimization method for classifying high-dimensional biomedical datasets. By introducing the improved elephant herding optimization algorithm and data normalization techniques, the impact of noisy features can be reduced, and the optimal feature set can be obtained, thereby improving classification accuracy.

EXPERT SYSTEMS (2022)

Article Genetics & Heredity

Interep: An R Package for High-Dimensional Interaction Analysis of the Repeated Measurement Data

Fei Zhou, Jie Ren, Yuwen Liu, Xiaoxi Li, Weiqun Wang, Cen Wu

Summary: We introduce interep, an R package for analyzing repeated measurement data with high-dimensional main and interaction effects. The package implements penalization methods based on generalized estimating equation (GEE), and provides alternative methods as well. This software article presents the statistical methodology, core and supporting functions usage, and a simulation example with R codes. The interep package is available at The Comprehensive R Archive Network (CRAN).

GENES (2022)

Article Computer Science, Artificial Intelligence

Hybrid binary COOT algorithm with simulated annealing for feature selection in high-dimensional microarray data

Elnaz Pashaei, Elham Pashaei

Summary: Microarray analysis of gene expression is helpful for disease and cancer diagnosis and prognosis. This paper proposes a new gene selection strategy based on the binary COOT optimization algorithm, and compares it to other techniques. The experimental results show that the BCOOT-CSA approach outperforms other methods in terms of prediction accuracy and selected gene number.

NEURAL COMPUTING & APPLICATIONS (2023)

Article Computer Science, Artificial Intelligence

High-Dimensional Multi-Label Data Stream Classification With Concept Drifting Detection

Peipei Li, Haixiang Zhang, Xuegang Hu, Xindong Wu

Summary: Multi-label data streams, characterized by multiple labels, high dimensionality, high volume, high velocity, and concept drifts, have been popular on the Web. However, research attention to the challenging task of multi-label data stream classification with high-dimensional attributes and concept drifts has been limited. In this study, we propose an algorithm adaptation approach that integrates max-relevance and min-redundancy to effectively classify multi-label data streams. We refine the feature selection criteria and introduce a concept drifting detection approach, resulting in an incremental ensemble classification model with superior performance.

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING (2023)

Article Computer Science, Artificial Intelligence

Feature selection in high dimensional data: A specific preordonnances-based memetic algorithm

Hasna Chamlal, Tayeb Ouaderhman, Basma El Mourtji

Summary: This study presents an algorithm for heterogeneous variable selection in discrimination problems. The algorithm utilizes both filter and wrapper approaches, and introduces a new feature discrimination power measure. Experimental results demonstrate the superiority of this algorithm over other methods.

KNOWLEDGE-BASED SYSTEMS (2023)

Article Genetics & Heredity

Identification of Prognostic Biomarker Candidates Associated With Melanoma Using High-Dimensional Genomic Data

Brody Kutt, Rachel Burdorf, Travaughn Bain, Nicardo Cameron, Alexia Pearah, Ersoy Subasi, David J. Carroll, Lisa K. Moore, Munevver Mine Subasi

Summary: This study utilized data from CCLE to analyze gene expression and copy number features in melanoma cell lines, identifying specific genes and combinations that can distinguish between cell lines. A feature selection approach for high-dimensional datasets was designed to identify a small subset of genes that can accurately classify melanoma cell lines, potentially leading to personalized treatment approaches.

FRONTIERS IN GENETICS (2021)

Article Mathematical & Computational Biology

Sampling uncertainty versus method uncertainty: A general framework with applications to omics biomarker selection

Simon Klau, Marie-Laure Martin-Magniette, Anne-Laure Boulesteix, Sabine Hoffmann

BIOMETRICAL JOURNAL (2020)

Review Genetics & Heredity

Statistical learning approaches in the genetic epidemiology of complex diseases

Anne-Laure Boulesteix, Marvin N. Wright, Sabine Hoffmann, Inke R. Koenig

HUMAN GENETICS (2020)

Article Health Care Sciences & Services

A plea for taking all available clinical information into account when assessing the predictive value of omics data

Alexander Volkmann, Riccardo De Bin, Willi Sauerbrei, Anne-Laure Boulesteix

BMC MEDICAL RESEARCH METHODOLOGY (2019)

Review Biochemical Research Methods

Combining clinical and molecular data in regression prediction models: insights from a simulation study

Riccardo De Bin, Anne-Laure Boulesteix, Axel Benner, Natalia Becker, Willi Sauerbrei

BRIEFINGS IN BIOINFORMATICS (2020)

Article Obstetrics & Gynecology

Delivery room desaturations and bradycardia in the early postnatal period of healthy term neonates - a prospective observational study

D.-M. Burgmann, K. Foerster, M. Klemme, M. Delius, C. Huebener, R. Wiskott, A. L. Boulesteix, A. W. Flemmer

Summary: This study evaluated the frequency, duration, and severity of desaturations and bradycardia in the first hours of life in term neonates. The results showed that approximately 30% of infants experienced desaturations, with 25% of them being prolonged desaturations. Infants born by planned Cesarean section had a significantly higher occurrence of desaturations compared to other modes of delivery.

JOURNAL OF MATERNAL-FETAL & NEONATAL MEDICINE (2022)

Article Oncology

Single-center versus multi-center data sets for molecular prognostic modeling: a simulation study

Daniel Samaga, Roman Hornung, Herbert Braselmann, Julia Hess, Horst Zitzelsberger, Claus Belka, Anne-Laure Boulesteix, Kristian Unger

RADIATION ONCOLOGY (2020)

Article Mathematics, Interdisciplinary Applications

Improved Outcome Prediction Across Data Sources Through Robust Parameter Tuning

Nicole Ellenbach, Anne-Laure Boulesteix, Bernd Bischl, Kristian Unger, Roman Hornung

Summary: The paper discusses the reasons why prediction rules trained on high-dimensional data do not generalize well across different sources, introduces a new method for tuning parameter selection, and concludes through a large-scale comparison study that tuning on external data and robust tuning with a tuned robustness parameter lead to better generalizing prediction rules.

JOURNAL OF CLASSIFICATION (2021)

Article Statistics & Probability

On the asymptotic behaviour of the variance estimator of a U-statistic

Mathias Fuchs, Roman Hornung, Anne-Laure Boulesteix, Riccardo De Bin

JOURNAL OF STATISTICAL PLANNING AND INFERENCE (2020)

Article Statistics & Probability

Adapted single-cell consensus clustering (adaSC3)

Cornelia Fuetterer, Thomas Augustin, Christiane Fuchs

ADVANCES IN DATA ANALYSIS AND CLASSIFICATION (2020)

Article Medicine, General & Internal

Introduction to statistical simulations in health research

Anne-Laure Boulesteix, Rolf H. H. Groenwold, Michal Abrahamowicz, Harald Binder, Matthias Briel, Roman Hornung, Tim P. Morris, Jorg Rahnenfuhrer, Willi Sauerbrei

BMJ OPEN (2020)

Article Multidisciplinary Sciences

The multiplicity of analysis strategies jeopardizes replicability: lessons learned across disciplines

Sabine Hoffmann, Felix Schoenbrodt, Ralf Elsas, Rory Wilson, Ulrich Strasser, Anne-Laure Boulesteix

Summary: This paper presents a general framework on sources of uncertainty in computational analyses that lead to multiplicity of analysis strategies, and applies it to various approaches proposed in different disciplines to address this issue.

ROYAL SOCIETY OPEN SCIENCE (2021)

Article Computer Science, Artificial Intelligence

Information efficient learning of complexly structured preferences: Elicitation procedures and their application to decision making under uncertainty

C. Jansen, H. Blocher, T. Augustin, G. Schollmeyer

Summary: This paper proposes efficient methods for eliciting complex preferences and applies them to decision making problems. The methods enable decision makers to reveal their preference system through as few ranking questions as possible. The study presents two approaches, one utilizing ranking data to obtain ordinal preferences and the other explicitly eliciting an approximate version of the cardinal preferences. Conditions for obtaining the decision maker's true preference system and improving efficiency are discussed.

INTERNATIONAL JOURNAL OF APPROXIMATE REASONING (2022)

Article Biotechnology & Applied Microbiology

On the optimistic performance evaluation of newly introduced bioinformatic methods

Stefan Buchka, Alexander Hapfelmeier, Paul P. Gardner, Rory Wilson, Anne-Laure Boulesteix

Summary: Many research articles claim that new data analysis methods outperform existing ones, but the veracity of such claims is questionable. This manuscript discusses the consequences of optimistic bias in evaluating novel data analysis methods, and quantitatively investigates this bias using an example from epigenetic analysis.

GENOME BIOLOGY (2021)

Article Public, Environmental & Occupational Health

Examining the robustness of observational associations to model, measurement and sampling uncertainty with the vibration of effects framework

Simon Klau, Sabine Hoffmann, Chirag J. Patel, John P. A. Ioannidis, Anne-Laure Boulesteix

Summary: The study highlights the significant impact of sampling, model, and measurement uncertainty on the stability of observational associations, potentially leading to large variability in results. Measurement error in observational studies can attenuate the true effect in most cases, but may also occasionally result in overestimation.

INTERNATIONAL JOURNAL OF EPIDEMIOLOGY (2021)

Article Cardiac & Cardiovascular Systems

Outcome of patients treated with extracorporeal life support in cardiogenic shock complicating acute myocardial infarction: 1-year result from the ECLS-Shock study

Korbinian Lackermair, Stefan Brunner, Mathias Orban, Sven Peterss, Martin Orban, Hans D. Theiss, Bruno C. Huber, Gerd Juchem, Frank Born, Anne-Laure Boulesteix, Axel Bauer, Maximilian Pichlmaier, Joerg Hausleiter, Steffen Massberg, Christian Hagl, Sabina P. W. Guenther

Summary: This pilot study showed that randomized studies with ECLS in CS patients are feasible and safe. Small numbers of included patients impede meaningful conclusions about mortality and neurological outcome. Our findings of numerical differences in mortality and survival with severe neurological impairment give an urgent call for larger multi-centric randomized trials assessing the endpoint of all-cause mortality but also considering the effects on neurological outcome measures.

CLINICAL RESEARCH IN CARDIOLOGY (2021)
