4.7 Article

Variable ranking based on the estimated degree of separation for two distributions of data by the length of the receiver operating characteristic curve

期刊

ANALYTICA CHIMICA ACTA
卷 876, 期 -, 页码 39-48

出版社

ELSEVIER
DOI: 10.1016/j.aca.2015.03.024

关键词

Degree of separation; Variable ranking; Probability density function; Length of receiver operating characteristic curve; Area between receiver operating; characteristic curve and diagonal

资金

  1. Edgewood Chemical Biological Center (ECBC)

向作者/读者索取更多资源

Variable responses are fundamental for all experiments, and they can consist of information-rich, redundant, and low signal intensities. A dataset can consist of a collection of variable responses over multiple classes or groups. Usually some of the variables are removed in a dataset that contain very little information. Sometimes all the variables are used in the data analysis phase. It is common practice to discriminate between two distributions of data; however, there is no formal algorithm to arrive at a degree of separation (DS) between two distributions of data. The DS is defined herein as the average of the sum of the areas from the probability density functions (PDFs) of A and B that contain a >= percentage of A and/or B. Thus, DS90 is the average of the sum of the PDF areas of A and B that contain >= 90% of A and/or B. To arrive at a DS value, two synthesized PDFs or very large experimental datasets are required. Experimentally it is common practice to generate relatively small datasets. Therefore, the challenge was to find a statistical parameter that can be used on small datasets to estimate and highly correlate with the DS90 parameter. Established statistical methods include the overlap area of the two data distribution profiles, Welch's t-test, Kolmogorov-Smirnov (K-S) test, Mann-Whitney-Wilcoxon test, and the area under the receiver operating characteristics (ROC) curve (AUC). The area between the ROC curve and diagonal (ACD) and the length of the ROC curve (LROC) are introduced. The established, ACD, and LROC methods were correlated to the DS90 when applied on many pairs of synthesized PDFs. The LROC method provided the best linear correlation with, and estimation of, the DS90. The estimated DS90 from the LROC (DS90-LROC) is applied to a database, as an example, of three Italian wines consisting of thirteen variable responses for variable ranking consideration. An important highlight of the DS90-LROC method is utilizing the LROC curve methodology to test all variables one-at-a-time with all pairs of classes in a dataset. (C) 2015 Elsevier B.V. All rights reserved.

作者

我是这篇论文的作者
点击您的名字以认领此论文并将其添加到您的个人资料中。

评论

主要评分

4.7
评分不足

次要评分

新颖性
-
重要性
-
科学严谨性
-
评价这篇论文

推荐

暂无数据
暂无数据