Article

Prediction of chemical carcinogenicity by machine learning approaches

Journal

SAR AND QSAR IN ENVIRONMENTAL RESEARCH
Volume 20, Issue 1-2, Pages 27-75

Publisher

TAYLOR & FRANCIS LTD
DOI: 10.1080/10629360902724085

Keywords

support vector machine; carcinogenicity; feature selection; Monte Carlo simulated annealing

Funding

  1. National Natural Science Foundation of China [20572073]

In this paper we report a successful application of machine learning approaches to the prediction of chemical carcinogenicity. Two approaches, a support vector machine (SVM) and an artificial neural network (ANN), were evaluated for predicting chemical carcinogenicity from molecular structure descriptors. A diverse set of 844 compounds, comprising 600 carcinogenic (CG+) and 244 noncarcinogenic (CG-) molecules, was used to estimate the accuracies of these approaches. The database was divided into two sets: the model construction set and the independent test set. Relevant molecular descriptors were selected from a wide set of molecular descriptors, including physicochemical properties and constitutional, topological, and geometrical descriptors, by a hybrid feature selection method combining Fisher's score and Monte Carlo simulated annealing. The first model validation method was five-fold cross-validation, in which the model construction set is split into five subsets. The five-fold cross-validation was used to select descriptors and to optimise the model parameters by maximising the averaged overall accuracy. The final SVM model gave an averaged prediction accuracy of 90.7% for CG+ compounds, 81.6% for CG- compounds and 88.1% overall, while the corresponding ANN model gave an averaged prediction accuracy of 86.1% for CG+ compounds, 79.3% for CG- compounds and 84.2% overall. These results indicate that the hybrid feature selection method is efficient and that the selected descriptors are relevant to the carcinogenicity of compounds. The second model validation method was a hold-out method: using the selected descriptors and the optimised model parameters, the classification model was built on the whole model construction set, and the independent test set was used to test its predictive ability.
On the independent test set, the SVM model gave a prediction accuracy of 87.6% for CG+ compounds, 79.1% for CG- compounds and 85.0% overall. The ANN model gave a prediction accuracy of 85.6% for CG+ compounds, 79.1% for CG- compounds and 83.6% overall. The results indicate that the models are potentially useful for predicting the carcinogenicity of untested compounds.
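
The abstract names Fisher's score as one half of the feature selection step and reports accuracy separately for CG+, CG- and overall. The following is a minimal sketch of both ideas, not the authors' implementation: it assumes the common two-class form of the Fisher score, (difference of class means)² divided by the sum of class variances, which the abstract does not spell out. The function names `fisher_score` and `class_accuracies` and all data values are hypothetical and invented for illustration.

```python
from statistics import mean, pvariance


def fisher_score(values, labels):
    """Two-class Fisher score of one descriptor.

    Assumed form: (mean_pos - mean_neg)^2 / (var_pos + var_neg).
    """
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    gap = (mean(pos) - mean(neg)) ** 2
    spread = pvariance(pos) + pvariance(neg)
    if spread == 0:
        # Both classes are constant: perfectly separating or uninformative.
        return float("inf") if gap > 0 else 0.0
    return gap / spread


def class_accuracies(y_true, y_pred):
    """Return (CG+ accuracy, CG- accuracy, overall accuracy) as fractions."""
    pos_hits = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    neg_hits = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    overall = (pos_hits + neg_hits) / len(y_true)
    return pos_hits / n_pos, neg_hits / n_neg, overall


# Toy data, invented for illustration only (1 = CG+, 0 = CG-).
labels = [1, 1, 1, 0, 0, 0]
descriptor_a = [5.0, 6.0, 7.0, 1.0, 2.0, 3.0]  # class means far apart
descriptor_b = [1.0, 5.0, 3.0, 2.0, 4.0, 3.0]  # same mean in both classes

score_a = fisher_score(descriptor_a, labels)
score_b = fisher_score(descriptor_b, labels)

# Hypothetical classifier output for the same six compounds.
predicted = [1, 1, 0, 0, 0, 1]
acc_pos, acc_neg, acc_all = class_accuracies(labels, predicted)
```

Here `descriptor_a` receives a higher score than `descriptor_b`, so a score-based filter would rank it first. The overall accuracy is the class-size-weighted mean of the two per-class accuracies, which is why the overall figures in the abstract fall between the CG+ and CG- figures and sit closer to the CG+ value, the larger class.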
