Article

AN EMPIRICAL COMPARISON OF TECHNIQUES FOR HANDLING INCOMPLETE DATA USING DECISION TREES

Journal

APPLIED ARTIFICIAL INTELLIGENCE
Volume 23, Issue 5, Pages 373-405

Publisher

TAYLOR & FRANCIS INC
DOI: 10.1080/08839510902872223

Funding

  1. Engineering and Physical Sciences Research Council [GR/S55347/01] Funding Source: researchfish

Growing awareness of how incomplete data affects learning and classification accuracy has led to an increasing number of missing data techniques. This article investigates the robustness and accuracy of seven popular techniques for tolerating incomplete training and test data, under different proportions, patterns, and mechanisms of missing data, and their effect on the resulting tree-based models. The seven missing data techniques were compared by artificially simulating different proportions, patterns, and mechanisms of missing data using 21 complete datasets (i.e., with no missing values) obtained from the University of California, Irvine repository of machine-learning databases (Blake and Merz, 1998). A four-way repeated measures design was employed to analyze the data. The simulation results suggest important differences. All methods have their strengths and weaknesses. However, listwise deletion is substantially inferior to the other six techniques, while multiple imputation, which utilizes the expectation maximization algorithm, represents a superior approach to handling incomplete data. Decision tree single imputation and surrogate variables splitting are more severely impacted by missing values distributed among all attributes than by missing values confined to a single attribute. Otherwise, the imputation-based versus model-based procedures gave reasonably good results, although some discrepancies remained. Different techniques for addressing missing values when using decision trees can give substantially diverse results, and must be carefully considered to protect against biases and spurious findings. Multiple imputation should always be used, especially if the data contain many missing values. If few values are missing, any of the missing data techniques might be considered. The choice of technique should be guided by the proportion, pattern, and mechanisms of missing data, especially the latter two. However, the use of older techniques like listwise deletion and mean or mode single imputation is no longer justifiable given the accessibility and ease of use of more advanced techniques, such as multiple imputation and supervised learning imputation.
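
To make the simulation idea concrete, the following is a minimal sketch of how missing values might be injected into a complete dataset and then handled by two of the older techniques named above, listwise deletion and mean single imputation. It is not the authors' experimental setup: the synthetic data, the function names (`inject_mcar`, `listwise_delete`, `mean_impute`), the 15% missingness rate, and the choice of a completely-at-random mechanism applied to every attribute are all assumptions made for illustration, and no decision tree is trained here.

```python
import random
from statistics import mean


def inject_mcar(rows, proportion, rng):
    """Return a copy of `rows` in which each cell is independently
    replaced by None with probability `proportion` (an assumed
    missing-completely-at-random mechanism over all attributes)."""
    return [
        [None if rng.random() < proportion else value for value in row]
        for row in rows
    ]


def listwise_delete(rows):
    """Keep only the rows that have no missing cell."""
    return [row for row in rows if all(value is not None for value in row)]


def mean_impute(rows):
    """Replace each missing cell with the mean of the observed values
    in its column (falls back to 0.0 if a column is entirely missing)."""
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        col_means.append(mean(observed) if observed else 0.0)
    return [
        [col_means[j] if value is None else value for j, value in enumerate(row)]
        for row in rows
    ]


# Hypothetical complete dataset: 200 instances, 4 numeric attributes.
rng = random.Random(0)
complete = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(200)]

incomplete = inject_mcar(complete, 0.15, rng)
deleted = listwise_delete(incomplete)
imputed = mean_impute(incomplete)

print("instances before:", len(complete))
print("after listwise deletion:", len(deleted))
print("after mean imputation:", len(imputed))
```

Under this assumed mechanism, a row survives listwise deletion only if all of its cells are observed, so with missingness probability p on each of k attributes the expected share of surviving rows is (1 - p)^k, roughly 52% for p = 0.15 and k = 4. Mean imputation keeps every row but replaces each gap with a single column-wide value, which is one reason the abstract treats both as older techniques.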
